Error reading vcf converted from bed file
0
0
Entering edit mode
mjsduncan • 0
@mjsduncan-7510
Last seen 9.7 years ago
United States

greetings!  i'm using the development version of bioconductor with R-devel source from 3/23, rstudio 0.98.1103 running on ubuntu 14.04.

after converting a bed-bim-fam file set to vcf with the latest github pull of plink-ng (plink v1.90), i am getting this error using "readVcf("file.vcf", geno = "B36")" : 

"Error: scanVcf: _DNAencode(): key 73 not in lookup table"

originally i converted file.vcf to file.bcf using "bcftools view -O b -o file.bcf file.vcf" but both ReadVcf() and scanBcfHeader() crashed the R session!  scanVcfHeader() works and using "bcftools view -h" gives:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150324
##source=PLINKv1.90
##contig=<ID=1,length=247189097>
##contig=<ID=2,length=242692821>
##contig=<ID=3,length=199318589>
##contig=<ID=4,length=191182473>
##contig=<ID=5,length=180642522>
##contig=<ID=6,length=170761396>
##contig=<ID=7,length=158812248>
##contig=<ID=8,length=146264219>
##contig=<ID=9,length=140147761>
##contig=<ID=10,length=135284294>
##contig=<ID=11,length=134445627>
##contig=<ID=12,length=132288870>
##contig=<ID=13,length=114109502>
##contig=<ID=14,length=106358709>
##contig=<ID=15,length=100217561>
##contig=<ID=16,length=88677424>
##contig=<ID=17,length=78653170>
##contig=<ID=18,length=76116153>
##contig=<ID=19,length=63785949>
##contig=<ID=20,length=62382908>
##contig=<ID=21,length=46919232>
##contig=<ID=22,length=49558259>
##contig=<ID=23,length=154582607>
##contig=<ID=24,length=27169977>
##contig=<ID=25,length=154908076>
##contig=<ID=26,length=16394>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.2-5-g7fa0d25+htslib-1.2.1-29-gf83dfd2
##bcftools_viewCommand=view -O b -o file.bcf file.vcf
##bcftools_viewCommand=view -h -o w_header file.bcf
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    BI0011_BI0011    BI0017_BI0017 ...

any info on what this error means and if/how the file can be repaired would be greatly appreciated.

> sessionInfo()
R Under development (unstable) (2015-03-23 r68069)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.13.44 Rsamtools_1.19.47         Biostrings_2.35.11        XVector_0.7.4             rtracklayer_1.27.9        GenomicRanges_1.19.47    
 [7] GenomeInfoDb_1.3.15       IRanges_2.1.43            S4Vectors_0.5.22          BiocGenerics_0.13.8      

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.29.20    zlibbioc_1.13.3          GenomicAlignments_1.3.32 BiocParallel_1.1.21      BSgenome_1.35.19         tools_3.3.0             
 [7] Biobase_2.27.2           DBI_0.3.1                lambda.r_1.1.7           futile.logger_1.4        futile.options_1.0.0     bitops_1.0-6            
[13] RCurl_1.95-4.5           biomaRt_2.23.5           RSQLite_1.0.0            GenomicFeatures_1.19.33  XML_3.98-1.1            

 

readvcf plink file conversion bed bioc-devel • 2.3k views
ADD COMMENT
0
Entering edit mode

Please provide us with a reproducible example, i.e., a test file. It sounds like the REF is containing values that do not correspond to DNA letters.

ADD REPLY
0
Entering edit mode

mr lawrence was correct!  using "bcftools query" i pulled out the REF and ALT columns, revealing:

(summary of columns as factors)

#  REF              ALT
#   A:203355   .: 10054
#   C:231249   A:227813
#   D:    69       C:201276
#   G:230594   D:   106
#   I:   110        G:201118
#   T:202890     I:    69
#                       T:227831

this data is almost ten years old, so maybe it's a legacy encoding of insertions & deletions?  removing lines with I or D in REF and ALT with "bfctools view" fixed the import error.

thanks for the tip!

ADD REPLY

Login before adding your answer.

Traffic: 753 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6