Why is writeVcf() slow?

0

Entering edit mode

Peter Hickey ▴ 750

@petehaitch

Last seen 5 weeks ago

WEHI, Melbourne, Australia

I am using the writeVcf() function in the VariantAnnotation package, which I understand to be 'under construction' from the documentation. I am wondering why writeVcf() is taking so long to do it's job and if there is anything I can do to speed this up? For example, I have a VCF object containing some 36,000 variants that I want to write to file and this takes nearly 3 hours. It also produces a warning that I am unsure how to interpret, i.e. should I be worried by it? The VCF written to disk appears to be valid. Many thanks for your help. ## Code > library(VariantAnnotation) > vcf <- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', genome = 'mm9') > vcf class: VCF dim: 292155 7 genome: mm9 exptData(1): header fixed(4): REF ALT QUAL FILTER info(21): AC AF ... SB STR geno(5): AD DP GQ GT PL rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 chr19:61340708 rowData values names(1): paramRangeID colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 colData names(1): Samples ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained > vcf.2aff.constrained class: VCF dim: 35514 7 genome: mm9 exptData(1): header fixed(4): REF ALT QUAL FILTER info(21): AC AF ... SB STR geno(5): AD DP GQ GT PL rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 chr19:61340453 rowData values names(1): paramRangeID colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 colData names(1): Samples > system.time(writeVcf(obj = vcf.2aff.constrained, filename = 'Unified Genotyper.autosomal.snps.indels.constrained.genotypes.2aff.vcf')) user system elapsed 10653.768 0.244 10659.163 Warning message: In .Method(..., deparse.level = deparse.level) : number of rows of result is not a multiple of vector length (arg 1) > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}

VariantAnnotation VariantAnnotation VariantAnnotation VariantAnnotation • 1.7k views

ADD COMMENT • link updated 12.4 years ago by Valerie Obenchain ★ 6.8k • written 12.4 years ago by Peter Hickey ▴ 750

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.1 years ago

United States

Hi Peter, I'd say yes, the warning is a cause for concern. Is the vcf you're working with publicly available? If not, can you send a smaller subset that still produces the error? Valerie On 10/23/12 17:29, hickey at wehi.EDU.AU wrote: > I am using the writeVcf() function in the VariantAnnotation package, which I understand to be 'under construction' from the documentation. I am wondering why writeVcf() is taking so long to do it's job and if there is anything I can do to speed this up? For example, I have a VCF object containing some 36,000 variants that I want to write to file and this takes nearly 3 hours. It also produces a warning that I am unsure how to interpret, i.e. should I be worried by it? The VCF written to disk appears to be valid. > > Many thanks for your help. > > ## Code >> library(VariantAnnotation) >> vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', genome = 'mm9') >> vcf > class: VCF > dim: 292155 7 > genome: mm9 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(21): AC AF ... SB STR > geno(5): AD DP GQ GT PL > rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 > chr19:61340708 > rowData values names(1): paramRangeID > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > colData names(1): Samples > > ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained > >> vcf.2aff.constrained > class: VCF > dim: 35514 7 > genome: mm9 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(21): AC AF ... SB STR > geno(5): AD DP GQ GT PL > rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 > chr19:61340453 > rowData values names(1): paramRangeID > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > colData names(1): Samples > >> system.time(writeVcf(obj = vcf.2aff.constrained, filename = 'Unifie dGenotyper.autosomal.snps.indels.constrained.genotypes.2aff.vcf')) > user system elapsed > 10653.768 0.244 10659.163 > Warning message: > In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 1) > >> sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 > [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 > [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 > [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 > [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 > [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:8}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.4 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

is part of this an issue with the VCF format itself? http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ or is that a separate concern? thanks, --t On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Hi Peter, > > I'd say yes, the warning is a cause for concern. Is the vcf you're working > with publicly available? If not, can you send a smaller subset that still > produces the error? > > Valerie > > > > On 10/23/12 17:29, hickey@wehi.EDU.AU wrote: > >> I am using the writeVcf() function in the VariantAnnotation package, >> which I understand to be 'under construction' from the documentation. I am >> wondering why writeVcf() is taking so long to do it's job and if there is >> anything I can do to speed this up? For example, I have a VCF object >> containing some 36,000 variants that I want to write to file and this takes >> nearly 3 hours. It also produces a warning that I am unsure how to >> interpret, i.e. should I be worried by it? The VCF written to disk appears >> to be valid. >> >> Many thanks for your help. >> >> ## Code >> >>> library(VariantAnnotation) >>> vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.**snps.indels.vcf.gz', >>> genome = 'mm9') >>> vcf >>> >> class: VCF >> dim: 292155 7 >> genome: mm9 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(21): AC AF ... SB STR >> geno(5): AD DP GQ GT PL >> rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 >> chr19:61340708 >> rowData values names(1): paramRangeID >> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 >> colData names(1): Samples >> >> ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained >> >> vcf.2aff.constrained >>> >> class: VCF >> dim: 35514 7 >> genome: mm9 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(21): AC AF ... SB STR >> geno(5): AD DP GQ GT PL >> rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 >> chr19:61340453 >> rowData values names(1): paramRangeID >> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 >> colData names(1): Samples >> >> system.time(writeVcf(obj = vcf.2aff.constrained, filename = >>> 'UnifiedGenotyper.autosomal.**snps.indels.constrained.** >>> genotypes.2aff.vcf')) >>> >> user system elapsed >> 10653.768 0.244 10659.163 >> Warning message: >> In .Method(..., deparse.level = deparse.level) : >> number of rows of result is not a multiple of vector length (arg 1) >> >> sessionInfo() >>> >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 >> [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 >> [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 >> [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 >> [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 >> [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 >> >> ------------------------------**-- >> Peter Hickey, >> PhD Student/Research Assistant, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Ph: +613 9345 2324 >> >> hickey@wehi.edu.au >> http://www.wehi.edu.au >> >> >> ______________________________**______________________________** >> __________ >> The information in this email is confidential and intend...{{dropped:8}} >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.4 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Valerie - I'll send you a subset of the VCF off list. Tim - This isn't something I'd considered. I'd always thought of VCF as just another delimited format but that post highlights the "intricacies" of VCF that I hadn't considered. In saying that, the files I'm dealing with here are pretty small (300K variants on 7 samples, 21Mb bgzipped), especially compared to the 1KG VCFs which are multi-Gb in size and contain thousands of samples. Pete On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote: > is part of this an issue with the VCF format itself? > > http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ > > or is that a separate concern? > > thanks, > > --t > > > On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha@fhcrc.org> wrote: > Hi Peter, > > I'd say yes, the warning is a cause for concern. Is the vcf you're working with publicly available? If not, can you send a smaller subset that still produces the error? > > Valerie > > > > On 10/23/12 17:29, hickey@wehi.EDU.AU wrote: > I am using the writeVcf() function in the VariantAnnotation package, which I understand to be 'under construction' from the documentation. I am wondering why writeVcf() is taking so long to do it's job and if there is anything I can do to speed this up? For example, I have a VCF object containing some 36,000 variants that I want to write to file and this takes nearly 3 hours. It also produces a warning that I am unsure how to interpret, i.e. should I be worried by it? The VCF written to disk appears to be valid. > > Many thanks for your help. > > ## Code > library(VariantAnnotation) > vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', genome = 'mm9') > vcf > class: VCF > dim: 292155 7 > genome: mm9 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(21): AC AF ... SB STR > geno(5): AD DP GQ GT PL > rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 > chr19:61340708 > rowData values names(1): paramRangeID > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > colData names(1): Samples > > ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained > > vcf.2aff.constrained > class: VCF > dim: 35514 7 > genome: mm9 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(21): AC AF ... SB STR > geno(5): AD DP GQ GT PL > rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 > chr19:61340453 > rowData values names(1): paramRangeID > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > colData names(1): Samples > > system.time(writeVcf(obj = vcf.2aff.constrained, filename = 'Unified Genotyper.autosomal.snps.indels.constrained.genotypes.2aff.vcf')) > user system elapsed > 10653.768 0.244 10659.163 > Warning message: > In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 1) > > sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 > [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 > [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 > [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 > [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 > [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey@wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:8}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > A model is a lie that helps you see the truth. > > Howard Skipper > -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________ [[alternative HTML version deleted]]

ADD REPLY • link 12.4 years ago Peter Hickey ▴ 750

0

Entering edit mode

Just a note: this sounds like a pathological case. writeVcf is pretty efficient these days. I regularly output millions of variants in some small number of minutes. Michael On Tue, Oct 23, 2012 at 9:08 PM, <hickey@wehi.edu.au> wrote: > Valerie - I'll send you a subset of the VCF off list. > > Tim - This isn't something I'd considered. I'd always thought of VCF as > just another delimited format but that post highlights the "intricacies" of > VCF that I hadn't considered. In saying that, the files I'm dealing with > here are pretty small (300K variants on 7 samples, 21Mb bgzipped), > especially compared to the 1KG VCFs which are multi-Gb in size and contain > thousands of samples. > > Pete > > > On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote: > > > is part of this an issue with the VCF format itself? > > > > > http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ > > > > or is that a separate concern? > > > > thanks, > > > > --t > > > > > > On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha@fhcrc.org> > wrote: > > Hi Peter, > > > > I'd say yes, the warning is a cause for concern. Is the vcf you're > working with publicly available? If not, can you send a smaller subset > that still produces the error? > > > > Valerie > > > > > > > > On 10/23/12 17:29, hickey@wehi.EDU.AU wrote: > > I am using the writeVcf() function in the VariantAnnotation package, > which I understand to be 'under construction' from the documentation. I am > wondering why writeVcf() is taking so long to do it's job and if there is > anything I can do to speed this up? For example, I have a VCF object > containing some 36,000 variants that I want to write to file and this takes > nearly 3 hours. It also produces a warning that I am unsure how to > interpret, i.e. should I be worried by it? The VCF written to disk appears > to be valid. > > > > Many thanks for your help. > > > > ## Code > > library(VariantAnnotation) > > vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', > genome = 'mm9') > > vcf > > class: VCF > > dim: 292155 7 > > genome: mm9 > > exptData(1): header > > fixed(4): REF ALT QUAL FILTER > > info(21): AC AF ... SB STR > > geno(5): AD DP GQ GT PL > > rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 > > chr19:61340708 > > rowData values names(1): paramRangeID > > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > > colData names(1): Samples > > > > ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained > > > > vcf.2aff.constrained > > class: VCF > > dim: 35514 7 > > genome: mm9 > > exptData(1): header > > fixed(4): REF ALT QUAL FILTER > > info(21): AC AF ... SB STR > > geno(5): AD DP GQ GT PL > > rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 > > chr19:61340453 > > rowData values names(1): paramRangeID > > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > > colData names(1): Samples > > > > system.time(writeVcf(obj = vcf.2aff.constrained, filename = > 'UnifiedGenotyper.autosomal.snps.indels.constrained.genotypes.2aff.v cf')) > > user system elapsed > > 10653.768 0.244 10659.163 > > Warning message: > > In .Method(..., deparse.level = deparse.level) : > > number of rows of result is not a multiple of vector length (arg 1) > > > > sessionInfo() > > R version 2.15.1 (2012-06-22) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=C LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 > > [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 > > > > loaded via a namespace (and not attached): > > [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 > > [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 > > [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 > > [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 > > [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 > > > > -------------------------------- > > Peter Hickey, > > PhD Student/Research Assistant, > > Bioinformatics Division, > > Walter and Eliza Hall Institute of Medical Research, > > 1G Royal Parade, Parkville, Vic 3052, Australia. > > Ph: +613 9345 2324 > > > > hickey@wehi.edu.au > > http://www.wehi.edu.au > > > > > > ______________________________________________________________________ > > The information in this email is confidential and intend...{{dropped:8}} > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > -- > > A model is a lie that helps you see the truth. > > > > Howard Skipper > > > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey@wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________________________________________________ > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 12.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Pete, Thanks for sending the data. I confirmed your numbers for writing out the sample file. > load('~/Downloads/slow_writeVcf_example_data.RData') > vcf <- vcf.4aff.constrained > vcf class: VCF dim: 2668 7 genome: mm9 ... > system.time(writeVcf(vcf, "mm9.vcf")) user system elapsed 164.362 0.548 165.378 Warning message: In .Method(..., deparse.level = deparse.level) : number of rows of result is not a multiple of vector length (arg 1) Sample data from the package with ~10K variants was also slow. > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") > vcf class: VCF dim: 10376 5 genome: hg19 ... > system.time(writeVcf(vcf, "hg19.vcf")) user system elapsed 1113.089 2.052 1117.831 Warning message: In .Method(..., deparse.level = deparse.level) : number of rows of result is not a multiple of vector length (arg 1) The warning came from a problem when creating the FORMAT lines for the geno fields when we both list and non-list matrices are present. The speed issues I identified were parsing bottlenecks when preparing the info and geno data for writing. With the modifications these two examples are showing a >95% speed increase. ## Note that the VCF class structure has changed in VariantAnnotation ## in devel. VCF is now a VIRTUAL class with subclasses of CollapsedVCF ## and ExpandedVCF. Old VCF instances must be updated to CollapsedVCF ## objects. See devel ?VCF for details. > load("~/Downloads/slow_writeVcf_example_data.RData") > vcf <- updateObject(vcf.4aff.constrained) > vcf class: CollapsedVCF dim: 2668 7 genome: mm9 ... > system.time(writeVcf(vcf, "mm9.vcf")) user system elapsed 2.276 0.060 2.339 Again the sample file from the package, > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") > vcf class: CollapsedVCF dim: 10376 5 genome: hg19 ... > system.time(writeVcf(vcf, "hg19.vcf")) user system elapsed 4.717 0.012 4.738 Fixes are in VariantAnnotation 1.5.7 in devel and 1.4.2 in release. Thanks for reporting this. Valerie On 10/23/2012 09:08 PM, hickey at wehi.EDU.AU wrote: > Valerie - I'll send you a subset of the VCF off list. > > Tim - This isn't something I'd considered. I'd always thought of VCF as > just another delimited format but that post highlights the "intricacies" > of VCF that I hadn't considered. In saying that, the files I'm dealing > with here are pretty small (300K variants on 7 samples, 21Mb bgzipped), > especially compared to the 1KG VCFs which are multi-Gb in size and > contain thousands of samples. > > Pete > > > On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote: > >> is part of this an issue with the VCF format itself? >> >> http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ >> >> or is that a separate concern? >> >> thanks, >> >> --t >> >> >> On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha at="" fhcrc.org="">> <mailto:vobencha at="" fhcrc.org="">> wrote: >> >> Hi Peter, >> >> I'd say yes, the warning is a cause for concern. Is the vcf you're >> working with publicly available? If not, can you send a smaller >> subset that still produces the error? >> >> Valerie >> >> >> >> On 10/23/12 17:29, hickey at wehi.EDU.AU <mailto:hickey at="" wehi.edu.au=""> >> wrote: >> >> I am using the writeVcf() function in the VariantAnnotation >> package, which I understand to be 'under construction' from >> the documentation. I am wondering why writeVcf() is taking so >> long to do it's job and if there is anything I can do to speed >> this up? For example, I have a VCF object containing some >> 36,000 variants that I want to write to file and this takes >> nearly 3 hours. It also produces a warning that I am unsure >> how to interpret, i.e. should I be worried by it? The VCF >> written to disk appears to be valid. >> >> Many thanks for your help. >> >> ## Code >> >> library(VariantAnnotation) >> vcf<- readVcf(file = >> 'UnifiedGenotyper.autosomal.__snps.indels.vcf.gz', genome >> = 'mm9') >> vcf >> >> class: VCF >> dim: 292155 7 >> genome: mm9 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(21): AC AF ... SB STR >> geno(5): AD DP GQ GT PL >> rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 >> chr19:61340708 >> rowData values names(1): paramRangeID >> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 >> 506/Bcl2#409 >> colData names(1): Samples >> >> ## Some subsetting of the 'vcf' object to create >> vcf.2aff.constrained >> >> vcf.2aff.constrained >> >> class: VCF >> dim: 35514 7 >> genome: mm9 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(21): AC AF ... SB STR >> geno(5): AD DP GQ GT PL >> rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 >> chr19:61340453 >> rowData values names(1): paramRangeID >> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 >> 506/Bcl2#409 >> colData names(1): Samples >> >> system.time(writeVcf(obj = vcf.2aff.constrained, filename >> = >> 'UnifiedGenotyper.autosomal.__snps.indels.constrained._ _genotypes.2aff.vcf')) >> >> user system elapsed >> 10653.768 0.244 10659.163 >> Warning message: >> In .Method(..., deparse.level = deparse.level) : >> number of rows of result is not a multiple of vector length >> (arg 1) >> >> sessionInfo() >> >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 >> Biostrings_2.26.2 >> [4] GenomicRanges_1.10.2 IRanges_1.16.2 >> BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 >> [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 >> [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 >> [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 >> [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 >> >> ------------------------------__-- >> Peter Hickey, >> PhD Student/Research Assistant, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Ph: +613 9345 2324 <tel:%2b613%209345%202324> >> >> hickey at wehi.edu.au <mailto:hickey at="" wehi.edu.au=""> >> http://www.wehi.edu.au <http: www.wehi.edu.au=""/> >> >> >> ___________________________________________________________ _______________ >> The information in this email is confidential and >> intend...{{dropped:8}} >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au <mailto:hickey at="" wehi.edu.au=""> > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}

ADD REPLY • link 12.4 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks Valerie! Works a charm. Pete On 27/10/2012, at 9:14 AM, Valerie Obenchain wrote: > Hi Pete, > > Thanks for sending the data. I confirmed your numbers for writing out the sample file. > > > load('~/Downloads/slow_writeVcf_example_data.RData') > > vcf <- vcf.4aff.constrained > > vcf > class: VCF > dim: 2668 7 > genome: mm9 > ... > > system.time(writeVcf(vcf, "mm9.vcf")) > user system elapsed > 164.362 0.548 165.378 > Warning message: > In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 1) > > Sample data from the package with ~10K variants was also slow. > > > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") > > vcf <- readVcf(fl, "hg19") > > vcf > class: VCF > dim: 10376 5 > genome: hg19 > ... > > system.time(writeVcf(vcf, "hg19.vcf")) > user system elapsed > 1113.089 2.052 1117.831 > Warning message: > In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 1) > > The warning came from a problem when creating the FORMAT lines for the geno fields when we both list and non-list matrices are present. The speed issues I identified were parsing bottlenecks when preparing the info and geno data for writing. > > With the modifications these two examples are showing a >95% speed increase. > > ## Note that the VCF class structure has changed in VariantAnnotation ## in devel. VCF is now a VIRTUAL class with subclasses of CollapsedVCF ## and ExpandedVCF. Old VCF instances must be updated to CollapsedVCF ## objects. See devel ?VCF for details. > > load("~/Downloads/slow_writeVcf_example_data.RData") > > vcf <- updateObject(vcf.4aff.constrained) > > vcf > class: CollapsedVCF > dim: 2668 7 > genome: mm9 > ... > > system.time(writeVcf(vcf, "mm9.vcf")) > user system elapsed > 2.276 0.060 2.339 > > Again the sample file from the package, > > > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") > > vcf <- readVcf(fl, "hg19") > > vcf > class: CollapsedVCF > dim: 10376 5 > genome: hg19 > ... > > system.time(writeVcf(vcf, "hg19.vcf")) > user system elapsed > 4.717 0.012 4.738 > > > Fixes are in VariantAnnotation 1.5.7 in devel and 1.4.2 in release. Thanks for reporting this. > > Valerie > > > > On 10/23/2012 09:08 PM, hickey@wehi.EDU.AU wrote: >> Valerie - I'll send you a subset of the VCF off list. >> >> Tim - This isn't something I'd considered. I'd always thought of VCF as >> just another delimited format but that post highlights the "intricacies" >> of VCF that I hadn't considered. In saying that, the files I'm dealing >> with here are pretty small (300K variants on 7 samples, 21Mb bgzipped), >> especially compared to the 1KG VCFs which are multi-Gb in size and >> contain thousands of samples. >> >> Pete >> >> >> On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote: >> >>> is part of this an issue with the VCF format itself? >>> >>> http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ >>> >>> or is that a separate concern? >>> >>> thanks, >>> >>> --t >>> >>> >>> On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha@fhcrc.org>>> <mailto:vobencha@fhcrc.org>> wrote: >>> >>> Hi Peter, >>> >>> I'd say yes, the warning is a cause for concern. Is the vcf you're >>> working with publicly available? If not, can you send a smaller >>> subset that still produces the error? >>> >>> Valerie >>> >>> >>> >>> On 10/23/12 17:29, hickey@wehi.EDU.AU <mailto:hickey@wehi.edu.au> >>> wrote: >>> >>> I am using the writeVcf() function in the VariantAnnotation >>> package, which I understand to be 'under construction' from >>> the documentation. I am wondering why writeVcf() is taking so >>> long to do it's job and if there is anything I can do to speed >>> this up? For example, I have a VCF object containing some >>> 36,000 variants that I want to write to file and this takes >>> nearly 3 hours. It also produces a warning that I am unsure >>> how to interpret, i.e. should I be worried by it? The VCF >>> written to disk appears to be valid. >>> >>> Many thanks for your help. >>> >>> ## Code >>> >>> library(VariantAnnotation) >>> vcf<- readVcf(file = >>> 'UnifiedGenotyper.autosomal.__snps.indels.vcf.gz', genome >>> = 'mm9') >>> vcf >>> >>> class: VCF >>> dim: 292155 7 >>> genome: mm9 >>> exptData(1): header >>> fixed(4): REF ALT QUAL FILTER >>> info(21): AC AF ... SB STR >>> geno(5): AD DP GQ GT PL >>> rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 >>> chr19:61340708 >>> rowData values names(1): paramRangeID >>> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 >>> 506/Bcl2#409 >>> colData names(1): Samples >>> >>> ## Some subsetting of the 'vcf' object to create >>> vcf.2aff.constrained >>> >>> vcf.2aff.constrained >>> >>> class: VCF >>> dim: 35514 7 >>> genome: mm9 >>> exptData(1): header >>> fixed(4): REF ALT QUAL FILTER >>> info(21): AC AF ... SB STR >>> geno(5): AD DP GQ GT PL >>> rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 >>> chr19:61340453 >>> rowData values names(1): paramRangeID >>> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 >>> 506/Bcl2#409 >>> colData names(1): Samples >>> >>> system.time(writeVcf(obj = vcf.2aff.constrained, filename >>> = >>> 'UnifiedGenotyper.autosomal.__snps.indels.constrained._ _genotypes.2aff.vcf')) >>> >>> user system elapsed >>> 10653.768 0.244 10659.163 >>> Warning message: >>> In .Method(..., deparse.level = deparse.level) : >>> number of rows of result is not a multiple of vector length >>> (arg 1) >>> >>> sessionInfo() >>> >>> R version 2.15.1 (2012-06-22) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 >>> Biostrings_2.26.2 >>> [4] GenomicRanges_1.10.2 IRanges_1.16.2 >>> BiocGenerics_0.4.0 >>> >>> loaded via a namespace (and not attached): >>> [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 >>> [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 >>> [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 >>> [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 >>> [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 >>> >>> ------------------------------__-- >>> Peter Hickey, >>> PhD Student/Research Assistant, >>> Bioinformatics Division, >>> Walter and Eliza Hall Institute of Medical Research, >>> 1G Royal Parade, Parkville, Vic 3052, Australia. >>> Ph: +613 9345 2324 <tel:%2b613%209345%202324> >>> >>> hickey@wehi.edu.au <mailto:hickey@wehi.edu.au> >>> http://www.wehi.edu.au <http: www.wehi.edu.au=""/> >>> >>> >>> ___________________________________________________________ _______________ >>> The information in this email is confidential and >>> intend...{{dropped:8}} >>> >>> _________________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >>> https://stat.ethz.ch/mailman/__listinfo/bioconductor >>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >>> _________________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >>> https://stat.ethz.ch/mailman/__listinfo/bioconductor >>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >>> >>> >>> -- >>> /A model is a lie that helps you see the truth./ >>> / >>> / >>> Howard Skipper >>> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> >> >> -------------------------------- >> Peter Hickey, >> PhD Student/Research Assistant, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Ph: +613 9345 2324 >> >> hickey@wehi.edu.au <mailto:hickey@wehi.edu.au> >> http://www.wehi.edu.au >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for >> the addressee. >> You must not disclose, forward, print or use it without the permission >> of the sender. >> ______________________________________________________________________ > -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}

ADD REPLY • link 12.3 years ago Peter Hickey ▴ 750

Login before adding your answer.