Why is writeVcf() slow?
1
0
Entering edit mode
Peter Hickey ▴ 740
@petehaitch
Last seen 10 weeks ago
WEHI, Melbourne, Australia
I am using the writeVcf() function in the VariantAnnotation package, which I understand to be 'under construction' from the documentation. I am wondering why writeVcf() is taking so long to do it's job and if there is anything I can do to speed this up? For example, I have a VCF object containing some 36,000 variants that I want to write to file and this takes nearly 3 hours. It also produces a warning that I am unsure how to interpret, i.e. should I be worried by it? The VCF written to disk appears to be valid. Many thanks for your help. ## Code > library(VariantAnnotation) > vcf <- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', genome = 'mm9') > vcf class: VCF dim: 292155 7 genome: mm9 exptData(1): header fixed(4): REF ALT QUAL FILTER info(21): AC AF ... SB STR geno(5): AD DP GQ GT PL rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 chr19:61340708 rowData values names(1): paramRangeID colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 colData names(1): Samples ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained > vcf.2aff.constrained class: VCF dim: 35514 7 genome: mm9 exptData(1): header fixed(4): REF ALT QUAL FILTER info(21): AC AF ... SB STR geno(5): AD DP GQ GT PL rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 chr19:61340453 rowData values names(1): paramRangeID colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 colData names(1): Samples > system.time(writeVcf(obj = vcf.2aff.constrained, filename = 'Unified Genotyper.autosomal.snps.indels.constrained.genotypes.2aff.vcf')) user system elapsed 10653.768 0.244 10659.163 Warning message: In .Method(..., deparse.level = deparse.level) : number of rows of result is not a multiple of vector length (arg 1) > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}
VariantAnnotation VariantAnnotation VariantAnnotation VariantAnnotation • 1.5k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.7 years ago
United States
Hi Peter, I'd say yes, the warning is a cause for concern. Is the vcf you're working with publicly available? If not, can you send a smaller subset that still produces the error? Valerie On 10/23/12 17:29, hickey at wehi.EDU.AU wrote: > I am using the writeVcf() function in the VariantAnnotation package, which I understand to be 'under construction' from the documentation. I am wondering why writeVcf() is taking so long to do it's job and if there is anything I can do to speed this up? For example, I have a VCF object containing some 36,000 variants that I want to write to file and this takes nearly 3 hours. It also produces a warning that I am unsure how to interpret, i.e. should I be worried by it? The VCF written to disk appears to be valid. > > Many thanks for your help. > > ## Code >> library(VariantAnnotation) >> vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', genome = 'mm9') >> vcf > class: VCF > dim: 292155 7 > genome: mm9 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(21): AC AF ... SB STR > geno(5): AD DP GQ GT PL > rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 > chr19:61340708 > rowData values names(1): paramRangeID > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > colData names(1): Samples > > ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained > >> vcf.2aff.constrained > class: VCF > dim: 35514 7 > genome: mm9 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(21): AC AF ... SB STR > geno(5): AD DP GQ GT PL > rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 > chr19:61340453 > rowData values names(1): paramRangeID > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > colData names(1): Samples > >> system.time(writeVcf(obj = vcf.2aff.constrained, filename = 'Unifie dGenotyper.autosomal.snps.indels.constrained.genotypes.2aff.vcf')) > user system elapsed > 10653.768 0.244 10659.163 > Warning message: > In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 1) > >> sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 > [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 > [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 > [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 > [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 > [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:8}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
is part of this an issue with the VCF format itself? http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ or is that a separate concern? thanks, --t On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Hi Peter, > > I'd say yes, the warning is a cause for concern. Is the vcf you're working > with publicly available? If not, can you send a smaller subset that still > produces the error? > > Valerie > > > > On 10/23/12 17:29, hickey@wehi.EDU.AU wrote: > >> I am using the writeVcf() function in the VariantAnnotation package, >> which I understand to be 'under construction' from the documentation. I am >> wondering why writeVcf() is taking so long to do it's job and if there is >> anything I can do to speed this up? For example, I have a VCF object >> containing some 36,000 variants that I want to write to file and this takes >> nearly 3 hours. It also produces a warning that I am unsure how to >> interpret, i.e. should I be worried by it? The VCF written to disk appears >> to be valid. >> >> Many thanks for your help. >> >> ## Code >> >>> library(VariantAnnotation) >>> vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.**snps.indels.vcf.gz', >>> genome = 'mm9') >>> vcf >>> >> class: VCF >> dim: 292155 7 >> genome: mm9 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(21): AC AF ... SB STR >> geno(5): AD DP GQ GT PL >> rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 >> chr19:61340708 >> rowData values names(1): paramRangeID >> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 >> colData names(1): Samples >> >> ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained >> >> vcf.2aff.constrained >>> >> class: VCF >> dim: 35514 7 >> genome: mm9 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(21): AC AF ... SB STR >> geno(5): AD DP GQ GT PL >> rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 >> chr19:61340453 >> rowData values names(1): paramRangeID >> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 >> colData names(1): Samples >> >> system.time(writeVcf(obj = vcf.2aff.constrained, filename = >>> 'UnifiedGenotyper.autosomal.**snps.indels.constrained.** >>> genotypes.2aff.vcf')) >>> >> user system elapsed >> 10653.768 0.244 10659.163 >> Warning message: >> In .Method(..., deparse.level = deparse.level) : >> number of rows of result is not a multiple of vector length (arg 1) >> >> sessionInfo() >>> >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 >> [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 >> [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 >> [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 >> [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 >> [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 >> >> ------------------------------**-- >> Peter Hickey, >> PhD Student/Research Assistant, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Ph: +613 9345 2324 >> >> hickey@wehi.edu.au >> http://www.wehi.edu.au >> >> >> ______________________________**______________________________** >> __________ >> The information in this email is confidential and intend...{{dropped:8}} >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Valerie - I'll send you a subset of the VCF off list. Tim - This isn't something I'd considered. I'd always thought of VCF as just another delimited format but that post highlights the "intricacies" of VCF that I hadn't considered. In saying that, the files I'm dealing with here are pretty small (300K variants on 7 samples, 21Mb bgzipped), especially compared to the 1KG VCFs which are multi-Gb in size and contain thousands of samples. Pete On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote: > is part of this an issue with the VCF format itself? > > http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ > > or is that a separate concern? > > thanks, > > --t > > > On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha@fhcrc.org> wrote: > Hi Peter, > > I'd say yes, the warning is a cause for concern. Is the vcf you're working with publicly available? If not, can you send a smaller subset that still produces the error? > > Valerie > > > > On 10/23/12 17:29, hickey@wehi.EDU.AU wrote: > I am using the writeVcf() function in the VariantAnnotation package, which I understand to be 'under construction' from the documentation. I am wondering why writeVcf() is taking so long to do it's job and if there is anything I can do to speed this up? For example, I have a VCF object containing some 36,000 variants that I want to write to file and this takes nearly 3 hours. It also produces a warning that I am unsure how to interpret, i.e. should I be worried by it? The VCF written to disk appears to be valid. > > Many thanks for your help. > > ## Code > library(VariantAnnotation) > vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', genome = 'mm9') > vcf > class: VCF > dim: 292155 7 > genome: mm9 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(21): AC AF ... SB STR > geno(5): AD DP GQ GT PL > rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 > chr19:61340708 > rowData values names(1): paramRangeID > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > colData names(1): Samples > > ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained > > vcf.2aff.constrained > class: VCF > dim: 35514 7 > genome: mm9 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(21): AC AF ... SB STR > geno(5): AD DP GQ GT PL > rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 > chr19:61340453 > rowData values names(1): paramRangeID > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > colData names(1): Samples > > system.time(writeVcf(obj = vcf.2aff.constrained, filename = 'Unified Genotyper.autosomal.snps.indels.constrained.genotypes.2aff.vcf')) > user system elapsed > 10653.768 0.244 10659.163 > Warning message: > In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 1) > > sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 > [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 > [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 > [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 > [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 > [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey@wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:8}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > A model is a lie that helps you see the truth. > > Howard Skipper > -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________ [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Just a note: this sounds like a pathological case. writeVcf is pretty efficient these days. I regularly output millions of variants in some small number of minutes. Michael On Tue, Oct 23, 2012 at 9:08 PM, <hickey@wehi.edu.au> wrote: > Valerie - I'll send you a subset of the VCF off list. > > Tim - This isn't something I'd considered. I'd always thought of VCF as > just another delimited format but that post highlights the "intricacies" of > VCF that I hadn't considered. In saying that, the files I'm dealing with > here are pretty small (300K variants on 7 samples, 21Mb bgzipped), > especially compared to the 1KG VCFs which are multi-Gb in size and contain > thousands of samples. > > Pete > > > On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote: > > > is part of this an issue with the VCF format itself? > > > > > http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ > > > > or is that a separate concern? > > > > thanks, > > > > --t > > > > > > On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha@fhcrc.org> > wrote: > > Hi Peter, > > > > I'd say yes, the warning is a cause for concern. Is the vcf you're > working with publicly available? If not, can you send a smaller subset > that still produces the error? > > > > Valerie > > > > > > > > On 10/23/12 17:29, hickey@wehi.EDU.AU wrote: > > I am using the writeVcf() function in the VariantAnnotation package, > which I understand to be 'under construction' from the documentation. I am > wondering why writeVcf() is taking so long to do it's job and if there is > anything I can do to speed this up? For example, I have a VCF object > containing some 36,000 variants that I want to write to file and this takes > nearly 3 hours. It also produces a warning that I am unsure how to > interpret, i.e. should I be worried by it? The VCF written to disk appears > to be valid. > > > > Many thanks for your help. > > > > ## Code > > library(VariantAnnotation) > > vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', > genome = 'mm9') > > vcf > > class: VCF > > dim: 292155 7 > > genome: mm9 > > exptData(1): header > > fixed(4): REF ALT QUAL FILTER > > info(21): AC AF ... SB STR > > geno(5): AD DP GQ GT PL > > rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 > > chr19:61340708 > > rowData values names(1): paramRangeID > > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > > colData names(1): Samples > > > > ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained > > > > vcf.2aff.constrained > > class: VCF > > dim: 35514 7 > > genome: mm9 > > exptData(1): header > > fixed(4): REF ALT QUAL FILTER > > info(21): AC AF ... SB STR > > geno(5): AD DP GQ GT PL > > rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 > > chr19:61340453 > > rowData values names(1): paramRangeID > > colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409 > > colData names(1): Samples > > > > system.time(writeVcf(obj = vcf.2aff.constrained, filename = > 'UnifiedGenotyper.autosomal.snps.indels.constrained.genotypes.2aff.v cf')) > > user system elapsed > > 10653.768 0.244 10659.163 > > Warning message: > > In .Method(..., deparse.level = deparse.level) : > > number of rows of result is not a multiple of vector length (arg 1) > > > > sessionInfo() > > R version 2.15.1 (2012-06-22) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=C LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2 > > [4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0 > > > > loaded via a namespace (and not attached): > > [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 > > [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 > > [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 > > [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 > > [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 > > > > -------------------------------- > > Peter Hickey, > > PhD Student/Research Assistant, > > Bioinformatics Division, > > Walter and Eliza Hall Institute of Medical Research, > > 1G Royal Parade, Parkville, Vic 3052, Australia. > > Ph: +613 9345 2324 > > > > hickey@wehi.edu.au > > http://www.wehi.edu.au > > > > > > ______________________________________________________________________ > > The information in this email is confidential and intend...{{dropped:8}} > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > -- > > A model is a lie that helps you see the truth. > > > > Howard Skipper > > > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey@wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________________________________________________ > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Pete, Thanks for sending the data. I confirmed your numbers for writing out the sample file. > load('~/Downloads/slow_writeVcf_example_data.RData') > vcf <- vcf.4aff.constrained > vcf class: VCF dim: 2668 7 genome: mm9 ... > system.time(writeVcf(vcf, "mm9.vcf")) user system elapsed 164.362 0.548 165.378 Warning message: In .Method(..., deparse.level = deparse.level) : number of rows of result is not a multiple of vector length (arg 1) Sample data from the package with ~10K variants was also slow. > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") > vcf class: VCF dim: 10376 5 genome: hg19 ... > system.time(writeVcf(vcf, "hg19.vcf")) user system elapsed 1113.089 2.052 1117.831 Warning message: In .Method(..., deparse.level = deparse.level) : number of rows of result is not a multiple of vector length (arg 1) The warning came from a problem when creating the FORMAT lines for the geno fields when we both list and non-list matrices are present. The speed issues I identified were parsing bottlenecks when preparing the info and geno data for writing. With the modifications these two examples are showing a >95% speed increase. ## Note that the VCF class structure has changed in VariantAnnotation ## in devel. VCF is now a VIRTUAL class with subclasses of CollapsedVCF ## and ExpandedVCF. Old VCF instances must be updated to CollapsedVCF ## objects. See devel ?VCF for details. > load("~/Downloads/slow_writeVcf_example_data.RData") > vcf <- updateObject(vcf.4aff.constrained) > vcf class: CollapsedVCF dim: 2668 7 genome: mm9 ... > system.time(writeVcf(vcf, "mm9.vcf")) user system elapsed 2.276 0.060 2.339 Again the sample file from the package, > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") > vcf class: CollapsedVCF dim: 10376 5 genome: hg19 ... > system.time(writeVcf(vcf, "hg19.vcf")) user system elapsed 4.717 0.012 4.738 Fixes are in VariantAnnotation 1.5.7 in devel and 1.4.2 in release. Thanks for reporting this. Valerie On 10/23/2012 09:08 PM, hickey at wehi.EDU.AU wrote: > Valerie - I'll send you a subset of the VCF off list. > > Tim - This isn't something I'd considered. I'd always thought of VCF as > just another delimited format but that post highlights the "intricacies" > of VCF that I hadn't considered. In saying that, the files I'm dealing > with here are pretty small (300K variants on 7 samples, 21Mb bgzipped), > especially compared to the 1KG VCFs which are multi-Gb in size and > contain thousands of samples. > > Pete > > > On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote: > >> is part of this an issue with the VCF format itself? >> >> http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ >> >> or is that a separate concern? >> >> thanks, >> >> --t >> >> >> On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha at="" fhcrc.org="">> <mailto:vobencha at="" fhcrc.org="">> wrote: >> >> Hi Peter, >> >> I'd say yes, the warning is a cause for concern. Is the vcf you're >> working with publicly available? If not, can you send a smaller >> subset that still produces the error? >> >> Valerie >> >> >> >> On 10/23/12 17:29, hickey at wehi.EDU.AU <mailto:hickey at="" wehi.edu.au=""> >> wrote: >> >> I am using the writeVcf() function in the VariantAnnotation >> package, which I understand to be 'under construction' from >> the documentation. I am wondering why writeVcf() is taking so >> long to do it's job and if there is anything I can do to speed >> this up? For example, I have a VCF object containing some >> 36,000 variants that I want to write to file and this takes >> nearly 3 hours. It also produces a warning that I am unsure >> how to interpret, i.e. should I be worried by it? The VCF >> written to disk appears to be valid. >> >> Many thanks for your help. >> >> ## Code >> >> library(VariantAnnotation) >> vcf<- readVcf(file = >> 'UnifiedGenotyper.autosomal.__snps.indels.vcf.gz', genome >> = 'mm9') >> vcf >> >> class: VCF >> dim: 292155 7 >> genome: mm9 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(21): AC AF ... SB STR >> geno(5): AD DP GQ GT PL >> rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 >> chr19:61340708 >> rowData values names(1): paramRangeID >> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 >> 506/Bcl2#409 >> colData names(1): Samples >> >> ## Some subsetting of the 'vcf' object to create >> vcf.2aff.constrained >> >> vcf.2aff.constrained >> >> class: VCF >> dim: 35514 7 >> genome: mm9 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(21): AC AF ... SB STR >> geno(5): AD DP GQ GT PL >> rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 >> chr19:61340453 >> rowData values names(1): paramRangeID >> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 >> 506/Bcl2#409 >> colData names(1): Samples >> >> system.time(writeVcf(obj = vcf.2aff.constrained, filename >> = >> 'UnifiedGenotyper.autosomal.__snps.indels.constrained._ _genotypes.2aff.vcf')) >> >> user system elapsed >> 10653.768 0.244 10659.163 >> Warning message: >> In .Method(..., deparse.level = deparse.level) : >> number of rows of result is not a multiple of vector length >> (arg 1) >> >> sessionInfo() >> >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 >> Biostrings_2.26.2 >> [4] GenomicRanges_1.10.2 IRanges_1.16.2 >> BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 >> [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 >> [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 >> [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 >> [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 >> >> ------------------------------__-- >> Peter Hickey, >> PhD Student/Research Assistant, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Ph: +613 9345 2324 <tel:%2b613%209345%202324> >> >> hickey at wehi.edu.au <mailto:hickey at="" wehi.edu.au=""> >> http://www.wehi.edu.au <http: www.wehi.edu.au=""/> >> >> >> ___________________________________________________________ _______________ >> The information in this email is confidential and >> intend...{{dropped:8}} >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> >> >> >> >> -- >> /A model is a lie that helps you see the truth./ >> / >> / >> Howard Skipper >> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au <mailto:hickey at="" wehi.edu.au=""> > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}
ADD REPLY
0
Entering edit mode
Thanks Valerie! Works a charm. Pete On 27/10/2012, at 9:14 AM, Valerie Obenchain wrote: > Hi Pete, > > Thanks for sending the data. I confirmed your numbers for writing out the sample file. > > > load('~/Downloads/slow_writeVcf_example_data.RData') > > vcf <- vcf.4aff.constrained > > vcf > class: VCF > dim: 2668 7 > genome: mm9 > ... > > system.time(writeVcf(vcf, "mm9.vcf")) > user system elapsed > 164.362 0.548 165.378 > Warning message: > In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 1) > > Sample data from the package with ~10K variants was also slow. > > > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") > > vcf <- readVcf(fl, "hg19") > > vcf > class: VCF > dim: 10376 5 > genome: hg19 > ... > > system.time(writeVcf(vcf, "hg19.vcf")) > user system elapsed > 1113.089 2.052 1117.831 > Warning message: > In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 1) > > The warning came from a problem when creating the FORMAT lines for the geno fields when we both list and non-list matrices are present. The speed issues I identified were parsing bottlenecks when preparing the info and geno data for writing. > > With the modifications these two examples are showing a >95% speed increase. > > ## Note that the VCF class structure has changed in VariantAnnotation ## in devel. VCF is now a VIRTUAL class with subclasses of CollapsedVCF ## and ExpandedVCF. Old VCF instances must be updated to CollapsedVCF ## objects. See devel ?VCF for details. > > load("~/Downloads/slow_writeVcf_example_data.RData") > > vcf <- updateObject(vcf.4aff.constrained) > > vcf > class: CollapsedVCF > dim: 2668 7 > genome: mm9 > ... > > system.time(writeVcf(vcf, "mm9.vcf")) > user system elapsed > 2.276 0.060 2.339 > > Again the sample file from the package, > > > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") > > vcf <- readVcf(fl, "hg19") > > vcf > class: CollapsedVCF > dim: 10376 5 > genome: hg19 > ... > > system.time(writeVcf(vcf, "hg19.vcf")) > user system elapsed > 4.717 0.012 4.738 > > > Fixes are in VariantAnnotation 1.5.7 in devel and 1.4.2 in release. Thanks for reporting this. > > Valerie > > > > On 10/23/2012 09:08 PM, hickey@wehi.EDU.AU wrote: >> Valerie - I'll send you a subset of the VCF off list. >> >> Tim - This isn't something I'd considered. I'd always thought of VCF as >> just another delimited format but that post highlights the "intricacies" >> of VCF that I hadn't considered. In saying that, the files I'm dealing >> with here are pretty small (300K variants on 7 samples, 21Mb bgzipped), >> especially compared to the 1KG VCFs which are multi-Gb in size and >> contain thousands of samples. >> >> Pete >> >> >> On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote: >> >>> is part of this an issue with the VCF format itself? >>> >>> http://campagnelab.org/evaluating-goby-against-the-1000-genome- genotype-calls-and-why-is-vcf-so-inefficient/ >>> >>> or is that a separate concern? >>> >>> thanks, >>> >>> --t >>> >>> >>> On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha@fhcrc.org>>> <mailto:vobencha@fhcrc.org>> wrote: >>> >>> Hi Peter, >>> >>> I'd say yes, the warning is a cause for concern. Is the vcf you're >>> working with publicly available? If not, can you send a smaller >>> subset that still produces the error? >>> >>> Valerie >>> >>> >>> >>> On 10/23/12 17:29, hickey@wehi.EDU.AU <mailto:hickey@wehi.edu.au> >>> wrote: >>> >>> I am using the writeVcf() function in the VariantAnnotation >>> package, which I understand to be 'under construction' from >>> the documentation. I am wondering why writeVcf() is taking so >>> long to do it's job and if there is anything I can do to speed >>> this up? For example, I have a VCF object containing some >>> 36,000 variants that I want to write to file and this takes >>> nearly 3 hours. It also produces a warning that I am unsure >>> how to interpret, i.e. should I be worried by it? The VCF >>> written to disk appears to be valid. >>> >>> Many thanks for your help. >>> >>> ## Code >>> >>> library(VariantAnnotation) >>> vcf<- readVcf(file = >>> 'UnifiedGenotyper.autosomal.__snps.indels.vcf.gz', genome >>> = 'mm9') >>> vcf >>> >>> class: VCF >>> dim: 292155 7 >>> genome: mm9 >>> exptData(1): header >>> fixed(4): REF ALT QUAL FILTER >>> info(21): AC AF ... SB STR >>> geno(5): AD DP GQ GT PL >>> rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696 >>> chr19:61340708 >>> rowData values names(1): paramRangeID >>> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 >>> 506/Bcl2#409 >>> colData names(1): Samples >>> >>> ## Some subsetting of the 'vcf' object to create >>> vcf.2aff.constrained >>> >>> vcf.2aff.constrained >>> >>> class: VCF >>> dim: 35514 7 >>> genome: mm9 >>> exptData(1): header >>> fixed(4): REF ALT QUAL FILTER >>> info(21): AC AF ... SB STR >>> geno(5): AD DP GQ GT PL >>> rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449 >>> chr19:61340453 >>> rowData values names(1): paramRangeID >>> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 >>> 506/Bcl2#409 >>> colData names(1): Samples >>> >>> system.time(writeVcf(obj = vcf.2aff.constrained, filename >>> = >>> 'UnifiedGenotyper.autosomal.__snps.indels.constrained._ _genotypes.2aff.vcf')) >>> >>> user system elapsed >>> 10653.768 0.244 10659.163 >>> Warning message: >>> In .Method(..., deparse.level = deparse.level) : >>> number of rows of result is not a multiple of vector length >>> (arg 1) >>> >>> sessionInfo() >>> >>> R version 2.15.1 (2012-06-22) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 >>> Biostrings_2.26.2 >>> [4] GenomicRanges_1.10.2 IRanges_1.16.2 >>> BiocGenerics_0.4.0 >>> >>> loaded via a namespace (and not attached): >>> [1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0 >>> [4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5 >>> [7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1 >>> [10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1 >>> [13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0 >>> >>> ------------------------------__-- >>> Peter Hickey, >>> PhD Student/Research Assistant, >>> Bioinformatics Division, >>> Walter and Eliza Hall Institute of Medical Research, >>> 1G Royal Parade, Parkville, Vic 3052, Australia. >>> Ph: +613 9345 2324 <tel:%2b613%209345%202324> >>> >>> hickey@wehi.edu.au <mailto:hickey@wehi.edu.au> >>> http://www.wehi.edu.au <http: www.wehi.edu.au=""/> >>> >>> >>> ___________________________________________________________ _______________ >>> The information in this email is confidential and >>> intend...{{dropped:8}} >>> >>> _________________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >>> https://stat.ethz.ch/mailman/__listinfo/bioconductor >>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >>> _________________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >>> https://stat.ethz.ch/mailman/__listinfo/bioconductor >>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >>> >>> >>> -- >>> /A model is a lie that helps you see the truth./ >>> / >>> / >>> Howard Skipper >>> <http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> >> >> -------------------------------- >> Peter Hickey, >> PhD Student/Research Assistant, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Ph: +613 9345 2324 >> >> hickey@wehi.edu.au <mailto:hickey@wehi.edu.au> >> http://www.wehi.edu.au >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for >> the addressee. >> You must not disclose, forward, print or use it without the permission >> of the sender. >> ______________________________________________________________________ > -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}
ADD REPLY

Login before adding your answer.

Traffic: 382 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6