readVCF (VariantAnnotation) problem with mismatched header and info fields in vcf file
1
0
Entering edit mode
jls2282 • 0
@jls2282-7127
Last seen 7.9 years ago
United States

 

I am having a problem reading the info field on vcf files generated by

the NextGene software. Here is an example:

vcfF is the vcf report as exported from NextGene

>hdr=scanVcfHeader(vcfF)
> rownames(info(hdr))
 [1] "NS"                                            "DP"                                            "AF"                                            "SGACOV"                                       
 [5] "SGASCORE"                                      "SGGENEDIR"                                     "SGANNOT"                                       "SGREFAA"                                      
 [9] "SGAAC"                                         "SGMRNAID"                                      "SGCDSID"                                       "SGTI"                                         
[13] "SGPI"                                          "SGGI"                                          "SGSEGDES"                                      "SGSEGPOS"                                     
[17] "SGAMBGAINPEN"                                  "SGAMBLOSSPEN"                                  "SG_Cosmic(Cosmic_68)_68_COSMIC_ID"             "SG_Cosmic(Cosmic_68)_68_COSMIC_Count"         
[21] "SG_dbNSFP(dbNSFP_2.0)_2_SIFT_score"            "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_score"  "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_pred"   "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HVAR_score" 
[25] "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HVAR_pred"   "SG_dbNSFP(dbNSFP_2.0)_2_LRT_score"             "SG_dbNSFP(dbNSFP_2.0)_2_LRT_pred"              "SG_dbNSFP(dbNSFP_2.0)_2_Mutation_Taster_score"
[29] "SG_dbNSFP(dbNSFP_2.0)_2_Mutation_Taster_pred"  "SG_dbNSFP(dbNSFP_2.0)_2_GERP_NR"               "SG_dbNSFP(dbNSFP_2.0)_2_GERP_RS"               "SG_dbNSFP(dbNSFP_2.0)_2_PhyloP"               
[33] "SG_dbNSFP(dbNSFP_2.0)_2_29way_logOdds"         "SG_dbNSFP(dbNSFP_2.0)_2_LRT_Omega"             "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AF"            "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AFR_AF"       
[37] "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_EUR_AF"        "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AMR_AF"        "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_ASN_AF"        "SG_dbNSFP(dbNSFP_2.0)_2_ESP5400_AA_AF"        
[41] "SG_dbNSFP(dbNSFP_2.0)_2_ESP5400_EA_AF

Note that fields [19:41] are in small caps, which match the header of the vcfF. However, the field descriptors in the info section of the vcfF are all in upper case, e.g.:

NS=1;DP=1196;AF=0.048;SGACOV=58;SGASCORE=0.000;SGGENEDIR=-;SGANNOT=CDS;SGREFAA=G;SGAAC=E;SGMRNAID=8;SGCDSID=7;SGTI=NM_005235.2;SGPI=NP_005226.1;SGGI=ERBB4;SGSEGDES=NT_005403__1..84213159__Genome_Sequence__Chr2;SGSEGPOS=62796652;SGAMBGAINPEN=0.000;SGAMBLOSSPEN=0.000;SG_DBNSFP(DBNSFP_2.0)_2_SIFT_SCORE=0.0300;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HDIV_SCORE=1.0000;0.9810;0.9990;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HDIV_PRED=D;.;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HVAR_SCORE=1.0000;0.7430;0.8600;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HVAR_PRED=D;P;.;SG_DBNSFP(DBNSFP_2.0)_2_LRT_SCORE=0.0000;NA;SG_DBNSFP(DBNSFP_2.0)_2_LRT_PRED=D;.;SG_DBNSFP(DBNSFP_2.0)_2_MUTATION_TASTER_SCORE=NA;SG_DBNSFP(DBNSFP_2.0)_2_MUTATION_TASTER_PRED=.;SG_DBNSFP(DBNSFP_2.0)_2_GERP_NR=5.8300;SG_DBNSFP(DBNSFP_2.0)_2_GERP_RS=5.8300;SG_DBNSFP(DBNSFP_2.0)_2_PHYLOP=2.7610;SG_DBNSFP(DBNSFP_2.0)_2_29WAY_LOGODDS=20.1150;SG_DBNSFP(DBNSFP_2.0)_2_LRT_OMEGA=0.0000;NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AFR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_EUR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AMR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_ASN_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_ESP5400_AA_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_ESP5400_EA_AF=NA

Not surprisingly, when the fields don't match exactly, we get all NA:

> info(readVcf(vcfF,'hg19'))[55,c(1:4,20:23)]
DataFrame with 1 row and 8 columns
                       NS        DP            AF        SGACOV SG_Cosmic(Cosmic_68)_68_COSMIC_Count SG_dbNSFP(dbNSFP_2.0)_2_SIFT_score SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_score
                <integer> <integer> <NumericList> <IntegerList>                            <integer>                          <numeric>                                    <numeric>
2:212587234_C/T         1      1196         0.048            58                                   NA                                 NA                                           NA
                SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_pred
                                                <character>
2:212587234_C/T                                          NA​


Any ideas on how to overcome this problem with the vcf file?
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.12.4 Rsamtools_1.18.2         Biostrings_2.34.0        XVector_0.6.0            GenomicRanges_1.18.3     GenomeInfoDb_1.2.3       IRanges_2.0.0            S4Vectors_0.4.0         
 [9] Biobase_2.26.0           BiocGenerics_0.12.1      ROC_1.42.0               sva_3.12.0               genefilter_1.48.1        mgcv_1.8-3               nlme_3.1-118             BiocInstaller_1.16.1    
[17] data.table_1.9.4        

loaded via a namespace (and not attached):
 [1] annotate_1.44.0         AnnotationDbi_1.28.1    base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8              BiocParallel_1.0.0      biomaRt_2.22.0          bitops_1.0-6           
 [9] brew_1.0-6              BSgenome_1.34.0         checkmate_1.5.0         chron_2.3-45            codetools_0.2-9         DBI_0.3.1               digest_0.6.4            fail_1.2               
[17] foreach_1.4.2           GenomicAlignments_1.2.1 GenomicFeatures_1.18.2  grid_3.1.2              iterators_1.0.7         lattice_0.20-29         Matrix_1.1-4            plyr_1.8.1             
[25] Rcpp_0.11.3             RCurl_1.95-4.3          reshape2_1.4            RSQLite_1.0.0           rtracklayer_1.26.2      sendmailR_1.2-1         splines_3.1.2           stringr_0.6.2          
[33] survival_2.37-7         tools_3.1.2             XML_3.98-1.1            xtable_1.7-4            zlibbioc_1.12.0   
annotation vcf readvcf variantannotation • 1.6k views
ADD COMMENT
0
Entering edit mode
jls2282 • 0
@jls2282-7127
Last seen 7.9 years ago
United States

OK, maybe not the most elegant solution, but here is code that fixes the vcf header on a vcfF file:

 

#Find 1st header block
library(data.table)
row=1
while(substr(fread(vcfF,nrows = 1,skip=row-1,header = F,sep="\n"),1,6)!="##INFO"){
  row=row+1
}
first row<-row
header1<-fread(vcfF,nrows=firstrow-1,header=F,sep='\n')
#Find INFO block
while(substr(fread(vcfF,nrows = 1,skip=row-1,header = F,sep='\n'),1,6)=="##INFO"){
  row=row+1
}
mid row<-row
header2<-data.table(NULL)
for(row in firstrow:(midrow-1)){
  line<-fread(vcfF,nrows=1,skip=row-1,header=F,sep=',')
  line$V1<-toupper(line$V1)
  header2<-c(header2,paste(line,collapse=","))
}
# find last INFO block
row<-midrow
while(substr(fread(vcfF,nrows = 1,skip=row-1,header = F,sep='\n'),1,2)=="##"){
  row=row+1
}
last row<-row
header3<-fread(vcfF,skip=midrow-1,nrows=lastrow-midrow,header=F,sep='\n')
# get tabbed body
body<-fread(vcfF,skip=lastrow-1,header=F,sep='\t')
# write to file
fw<-file(paste0(vcfF,"_fixed"),open = 'wt')
write.table(header1,fw,col.names=F,row.names=F,quote = F)
write.table(unlist(header2),fw,col.names=F,row.names=F,quote = F)
write.table(header3,fw,col.names=F,row.names=F,quote=F)
write.table(body,fw,sep = '\t',col.names = F,row.names=F,quote=F)
close(fw)
ADD COMMENT

Login before adding your answer.

Traffic: 565 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6