readVCFAsVRanges in VariantAnnotation broken by an FTZ format?
2
0
Entering edit mode
Andy Lynch ▴ 120
@andy-lynch-6934
Last seen 10 months ago
United Kingdom

I've noticed that the inclusion of a "##FORMAT=<ID=FTZ,Nu..." line in a vcf file seems to break the readVCFAsVRanges() function (although not the readVCF() function). CaVEMan, for example, includes such a line in its vcf files. Replacing all instances of FTZ with, e.g., FUZ in the vcf file avoids the problem, but is not a desirable solution.
 

This can be illustrated with two toy vcf files, the only difference of which is the FTZ/FUZ switch. [Variant location and nature have been changed to ensure that no patient data are presented.]

FTZtest.vcf:

##fileformat=VCFv4.1
##fileDate=20160629
##reference=/datastore/reference_files/genome.fa
##vcfProcessLog=<InputVCF=<.>,InputVCFSource=<CaVEMan>,InputVCFParam=<NORMAL_CONTAMINATION=0.1,PRIOR_MUT_RATE=6e-06,REF_BIAS=0.95,SNP_CUTOFF=0.95,MUT_CUTOFF=0.8,PRIOR_SNP_RATE=0.0001>>
##cavemanVersion=1.9.2
##contig=<ID=1,assembly=GRCh37,length=249250621,species=HUMAN>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=FTZ,Number=1,Type=Integer,Description="Reads presenting a T for this position, forward strand">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NORMAL    TUMOUR
1    16074    ID    C    A    .    PASS    DP=61    GT:FTZ    0|0:20    0|1:15

FUZtest.vcf:

##fileformat=VCFv4.1
##fileDate=20160629
##reference=/datastore/reference_files/genome.fa
##vcfProcessLog=<InputVCF=<.>,InputVCFSource=<CaVEMan>,InputVCFParam=<NORMAL_CONTAMINATION=0.1,PRIOR_MUT_RATE=6e-06,REF_BIAS=0.95,SNP_CUTOFF=0.95,MUT_CUTOFF=0.8,PRIOR_SNP_RATE=0.0001>>
##cavemanVersion=1.9.2
##contig=<ID=1,assembly=GRCh37,length=249250621,species=HUMAN>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=FUZ,Number=1,Type=Integer,Description="Reads presenting a T for this position, forward strand">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NORMAL    TUMOUR
1    16074    ID    C    A    .    PASS    DP=61    GT:FUZ    0|0:20    0|1:15

 

Running the command readVcfAsVRanges("FUZtest.vcf","hg19") works as expected, but readVcfAsVRanges("FTZtest.vcf","hg19") returns an error.

> vcf1<-readVcfAsVRanges("FTZtest.vcf","hg19")
Error in strsplit(x, ";", fixed = TRUE) : non-character argument
> traceback()
10: strsplit(x, ";", fixed = TRUE)
9: parseFilterStrings(as.vector(geno(from)$FT))
8: eval(expr, envir, enclos)
7: eval(quote(list(...)), env)
6: eval(quote(list(...)), env)
5: standardGeneric("cbind")
4: cbind(filter, parseFilterStrings(as.vector(geno(from)$FT)))
3: asMethod(object)
2: as(readVcf(x, genome, param = param, row.names = use.names, ...), 
       "VRanges")
1: readVcfAsVRanges("FTZtest.vcf", 
       "hg19")

 

Anyway. I can't work out why this should be a problem, so thought I would share.  Apologies if I have missed previous references to this.

Andy Lynch (CRUK Cambridge Institute)

 

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.18.5   BiocInstaller_1.22.3       QDNAseq_1.8.0              mclust_5.2                 knitr_1.13                
 [6] GenomicAlignments_1.8.4    Rsamtools_1.24.0           Biostrings_2.40.2          XVector_0.12.0             SummarizedExperiment_1.2.3
[11] Biobase_2.32.0             GenomicRanges_1.24.2       GenomeInfoDb_1.8.1         IRanges_2.6.1              S4Vectors_0.10.2          
[16] BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] formatR_1.4            highr_0.6              R.methodsS3_1.7.1      GenomicFeatures_1.24.3 bitops_1.0-6           R.utils_2.3.0         
 [7] tools_3.3.1            zlibbioc_1.18.0        biomaRt_2.28.0         BSgenome_1.40.1        evaluate_0.9           RSQLite_1.0.0         
[13] DBI_0.4-1              rtracklayer_1.32.1     stringr_1.0.0          CGHbase_1.32.0         impute_1.46.0          marray_1.50.0         
[19] DNAcopy_1.46.0         AnnotationDbi_1.34.4   XML_3.98-1.4           CGHcall_2.34.0         BiocParallel_1.6.2     limma_3.28.14         
[25] magrittr_1.5           matrixStats_0.50.2     BiocStyle_2.0.2        stringi_1.1.1          RCurl_1.95-4.8         R.oo_1.20.0       

 

variantannotation readvcfasvranges vcf • 1.5k views
ADD COMMENT
1
Entering edit mode
Mike Smith ★ 6.6k
@mike-smith
Last seen 1 day ago
EMBL Heidelberg

Hi Andy,

This looks like a partial matching problem to me.  The code explicitly looks for the genotype filter FT field, which isn't present in your data, so it then matches to your FTZ field and falls over when this isn't in the format it's expecting.

It should be possible to prevent this by using the [['FT']] method of accessing list entries rather than $FT

If you'd like to temporarily patch this yourself I think editing lines 252 & 253 of methods-VRanges-class.R should be sufficient.

ADD COMMENT
1
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States

Should be fixed in 1.18.6.

ADD COMMENT

Login before adding your answer.

Traffic: 747 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6