Entering edit mode
Hi Valerie,
readVcf is reordering the fields in the FORMAT field of my VCF file
which is causing me problems downstream. I am using VariantAnnotation
to subset my VCF file and then write the subset to file using
writeVcf. However, the reordering of the FORMAT fields means by new
VCF does not comply with VCF spec, specifically the GT-field is not
the first sub-field of the FORMAT field. A reproducible example
follows.
Thanks,
Pete
> sessionInfo()
R version 2.15.2 Patched (2013-02-10 r61900)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3
GenomicRanges_1.10.7
[5] IRanges_1.16.6 BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0
bitops_1.0-4.2
[5] BSgenome_1.26.1 DBI_0.2-5
GenomicFeatures_1.10.2 parallel_2.15.2
[9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2
stats4_2.15.2
[13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0
#### Bug report: readVcf() reordering FORMAT column of VCF file in
violation of VCF spec ####
library(VariantAnnotation)
## Download my example VCF
download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf.gz
', destfile = '/tmp/readVcf_problem.vcf.gz')
## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL
system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to
alter this command depending on system; works on my linux machine but
need 'gzcat' on my mac
## Read in the VCF using readVcf()
problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9')
## Note the order of geno(), which stores each field in the FORMAT
column as a list element, is "AD DP GQ GT PL"
geno(problem_vcf)
## Write the file back to disk using writeVcf()
writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE)
## Cat the newly written version of the VCF: Note that the FORMAT
field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says
of the FORMAT field: "The first sub-field must always be the genotype
(GT) if it is present" (see
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-
variant-call-format-version-41)
system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter
this command depending on system; works on my linux machine but need
'gzcat' on my mac
## No such problem with the example VCF
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
good_vcf <- readVcf(fl, genome = 'hg19')
## Sub-fields of FORMAT field are correctly ordered
geno(good_vcf)
--------------------------------
Peter Hickey,
PhD Student/Research Assistant,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Ph: +613 9345 2324
hickey@wehi.edu.au
http://www.wehi.edu.au
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:8}}