Entering edit mode
I am having a problem reading the info field on vcf files generated by
the NextGene software. Here is an example:
vcfF is the vcf report as exported from NextGene
>hdr=scanVcfHeader(vcfF)
> rownames(info(hdr))
[1] "NS" "DP" "AF" "SGACOV"
[5] "SGASCORE" "SGGENEDIR" "SGANNOT" "SGREFAA"
[9] "SGAAC" "SGMRNAID" "SGCDSID" "SGTI"
[13] "SGPI" "SGGI" "SGSEGDES" "SGSEGPOS"
[17] "SGAMBGAINPEN" "SGAMBLOSSPEN" "SG_Cosmic(Cosmic_68)_68_COSMIC_ID" "SG_Cosmic(Cosmic_68)_68_COSMIC_Count"
[21] "SG_dbNSFP(dbNSFP_2.0)_2_SIFT_score" "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_score" "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_pred" "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HVAR_score"
[25] "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HVAR_pred"
"SG_dbNSFP(dbNSFP_2.0)_2_LRT_score" "SG_dbNSFP(dbNSFP_2.0)_2_LRT_pred" "SG_dbNSFP(dbNSFP_2.0)_2_Mutation_Taster_score"
[29] "SG_dbNSFP(dbNSFP_2.0)_2_Mutation_Taster_pred" "SG_dbNSFP(dbNSFP_2.0)_2_GERP_NR" "SG_dbNSFP(dbNSFP_2.0)_2_GERP_RS" "SG_dbNSFP(dbNSFP_2.0)_2_PhyloP"
[33] "SG_dbNSFP(dbNSFP_2.0)_2_29way_logOdds" "SG_dbNSFP(dbNSFP_2.0)_2_LRT_Omega" "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AF" "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AFR_AF"
[37] "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_EUR_AF" "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AMR_AF" "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_ASN_AF" "SG_dbNSFP(dbNSFP_2.0)_2_ESP5400_AA_AF"
[41] "SG_dbNSFP(dbNSFP_2.0)_2_ESP5400_EA_AF
Note that fields [19:41] are in small caps, which match the header of the vcfF. However, the field descriptors in the info section of the vcfF are all in upper case, e.g.:
NS=1;DP=1196;AF=0.048;SGACOV=58;SGASCORE=0.000;SGGENEDIR=-;SGANNOT=CDS;SGREFAA=G;SGAAC=E;SGMRNAID=8;SGCDSID=7;SGTI=NM_005235.2;SGPI=NP_005226.1;SGGI=ERBB4;SGSEGDES=NT_005403__1..84213159__Genome_Sequence__Chr2;SGSEGPOS=62796652;SGAMBGAINPEN=0.000;SGAMBLOSSPEN=0.000;SG_DBNSFP(DBNSFP_2.0)_2_SIFT_SCORE=0.0300;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HDIV_SCORE=1.0000;0.9810;0.9990;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HDIV_PRED=D;.;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HVAR_SCORE=1.0000;0.7430;0.8600;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HVAR_PRED=D;P;.;SG_DBNSFP(DBNSFP_2.0)_2_LRT_SCORE=0.0000;NA;SG_DBNSFP(DBNSFP_2.0)_2_LRT_PRED=D;.;SG_DBNSFP(DBNSFP_2.0)_2_MUTATION_TASTER_SCORE=NA;SG_DBNSFP(DBNSFP_2.0)_2_MUTATION_TASTER_PRED=.;SG_DBNSFP(DBNSFP_2.0)_2_GERP_NR=5.8300;SG_DBNSFP(DBNSFP_2.0)_2_GERP_RS=5.8300;SG_DBNSFP(DBNSFP_2.0)_2_PHYLOP=2.7610;SG_DBNSFP(DBNSFP_2.0)_2_29WAY_LOGODDS=20.1150;SG_DBNSFP(DBNSFP_2.0)_2_LRT_OMEGA=0.0000;NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AFR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_EUR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AMR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_ASN_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_ESP5400_AA_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_ESP5400_EA_AF=NA
Not surprisingly, when the fields don't match exactly, we get all NA:
> info(readVcf(vcfF,'hg19'))[55,c(1:4,20:23)]
DataFrame with 1 row and 8 columns
NS DP AF SGACOV SG_Cosmic(Cosmic_68)_68_COSMIC_Count SG_dbNSFP(dbNSFP_2.0)_2_SIFT_score SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_score
<integer> <integer> <NumericList> <IntegerList> <integer> <numeric> <numeric>
2:212587234_C/T 1 1196 0.048 58 NA NA NA
SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_pred
<character>
2:212587234_C/T NA
Any ideas on how to overcome this problem with the vcf file?
> sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.12.4 Rsamtools_1.18.2 Biostrings_2.34.0 XVector_0.6.0 GenomicRanges_1.18.3 GenomeInfoDb_1.2.3 IRanges_2.0.0 S4Vectors_0.4.0 [9] Biobase_2.26.0 BiocGenerics_0.12.1 ROC_1.42.0 sva_3.12.0 genefilter_1.48.1 mgcv_1.8-3 nlme_3.1-118 BiocInstaller_1.16.1 [17] data.table_1.9.4 loaded via a namespace (and not attached): [1] annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 BiocParallel_1.0.0 biomaRt_2.22.0 bitops_1.0-6 [9] brew_1.0-6 BSgenome_1.34.0 checkmate_1.5.0 chron_2.3-45 codetools_0.2-9 DBI_0.3.1 digest_0.6.4 fail_1.2 [17] foreach_1.4.2 GenomicAlignments_1.2.1 GenomicFeatures_1.18.2 grid_3.1.2 iterators_1.0.7 lattice_0.20-29 Matrix_1.1-4 plyr_1.8.1 [25] Rcpp_0.11.3 RCurl_1.95-4.3 reshape2_1.4 RSQLite_1.0.0 rtracklayer_1.26.2 sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2 [33] survival_2.37-7 tools_3.1.2 XML_3.98-1.1 xtable_1.7-4 zlibbioc_1.12.0