Entering edit mode
I am using the writeVcf() function in the VariantAnnotation package,
which I understand to be 'under construction' from the documentation.
I am wondering why writeVcf() is taking so long to do it's job and if
there is anything I can do to speed this up? For example, I have a VCF
object containing some 36,000 variants that I want to write to file
and this takes nearly 3 hours. It also produces a warning that I am
unsure how to interpret, i.e. should I be worried by it? The VCF
written to disk appears to be valid.
Many thanks for your help.
## Code
> library(VariantAnnotation)
> vcf <- readVcf(file =
'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', genome = 'mm9')
> vcf
class: VCF
dim: 292155 7
genome: mm9
exptData(1): header
fixed(4): REF ALT QUAL FILTER
info(21): AC AF ... SB STR
geno(5): AD DP GQ GT PL
rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696
chr19:61340708
rowData values names(1): paramRangeID
colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409
colData names(1): Samples
## Some subsetting of the 'vcf' object to create vcf.2aff.constrained
> vcf.2aff.constrained
class: VCF
dim: 35514 7
genome: mm9
exptData(1): header
fixed(4): REF ALT QUAL FILTER
info(21): AC AF ... SB STR
geno(5): AD DP GQ GT PL
rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449
chr19:61340453
rowData values names(1): paramRangeID
colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409
colData names(1): Samples
> system.time(writeVcf(obj = vcf.2aff.constrained, filename = 'Unified
Genotyper.autosomal.snps.indels.constrained.genotypes.2aff.vcf'))
user system elapsed
10653.768 0.244 10659.163
Warning message:
In .Method(..., deparse.level = deparse.level) :
number of rows of result is not a multiple of vector length (arg 1)
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.4.1 Rsamtools_1.10.1 Biostrings_2.26.2
[4] GenomicRanges_1.10.2 IRanges_1.16.2 BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.20.1 Biobase_2.18.0 biomaRt_2.14.0
[4] bitops_1.0-4.1 BSgenome_1.26.1 DBI_0.2-5
[7] GenomicFeatures_1.10.0 parallel_2.15.1 RCurl_1.95-1.1
[10] RSQLite_0.11.2 rtracklayer_1.18.0 stats4_2.15.1
[13] tools_2.15.1 XML_3.95-0.1 zlibbioc_1.4.0
--------------------------------
Peter Hickey,
PhD Student/Research Assistant,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Ph: +613 9345 2324
hickey@wehi.edu.au
http://www.wehi.edu.au
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:8}}