Question

Error using readVcf

0

Entering edit mode

Lna • 0

@lna-10651

Last seen 7.9 years ago

Dear all,

I'm trying to obtain annotations for SNPs I analysed in GWASTools. Some time ago it worked fine by using the line

vcfWrite(genoData,vcf.file=vcf.file,id.col="snpID",snp.exclude=snp.excl)

to write the vcf file from the data objects in GWASTools and reading the vcf file with

vcf <- readVcf(vcf.file, genome="hg19")

that is needed to use the VariantAnnotation package.

Now I did the same thing with another set of SNPs and I got the error message:

Fehler: scanVcf: _DNAencode(): invalid DNAString input character: '7' (byte value 55)
path: /home/forschung/data/omni5/GWASTools_corrX/allSNPs.vcf

When I read the "old" vcf file that I created with an older version some months ago, it still works. So I figured that the two vcf files must be different. When I compared them, I found out that the "REF" and "ALT" columns of the old file are empty (containing ".") but there are entries for these columns in the new file.

Could this be the reason for the error? Why is the output different from before? If this is the problem, can I prevent vcfWrite from writing these columns?

Thank you for your help!

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] org.Hs.eg.db_3.4.1                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.28.3                  AnnotationDbi_1.38.1
[5] BiocInstaller_1.26.0                    VariantAnnotation_1.22.1                Rsamtools_1.28.0                        Biostrings_2.44.1
[9] XVector_0.16.0                          SummarizedExperiment_1.6.3              DelayedArray_0.2.7                      matrixStats_0.52.2
[13] GenomicRanges_1.28.3                    GenomeInfoDb_1.12.2                     IRanges_2.10.2                          S4Vectors_0.14.3
[17] GWASTools_1.22.0                        Biobase_2.36.2                          BiocGenerics_0.22.0

loaded via a namespace (and not attached):
[1] quantsmooth_1.42.0       Rcpp_0.12.11             lattice_0.20-35          zoo_1.8-0                digest_0.6.9             lmtest_0.9-35
[7] plyr_1.8.3               MatrixModels_0.4-1       gdsfmt_1.12.0            RSQLite_1.1-2            ggplot2_2.2.1            zlibbioc_1.22.0
[13] rlang_0.1.1              lazyeval_0.2.0           SparseM_1.77             rpart_4.1-11             Matrix_1.2-10            splines_3.4.0
[19] BiocParallel_1.10.1      GWASExactHW_1.01         RCurl_1.95-4.8           biomaRt_2.32.1           munsell_0.4.2            compiler_3.4.0
[25] rtracklayer_1.36.3       mgcv_1.8-17              nnet_7.3-12              tibble_1.3.3             GenomeInfoDbData_0.99.0 DNAcopy_1.50.1
[31] XML_3.98-1.8             GenomicAlignments_1.12.1 MASS_7.3-47              bitops_1.0-6             grid_3.4.0               nlme_3.1-131
[37] gtable_0.1.2             DBI_0.7                  scales_0.4.1             ncdf4_1.16               mice_2.30                sandwich_2.3-4
[43] tools_3.4.0              BSgenome_1.44.0          survival_2.41-3          colorspace_1.2-4         memoise_0.2.1            logistf_1.22
[49] quantreg_5.33

gwastools variantannotation • 2.8k views

ADD COMMENT • link updated 7.9 years ago by Stephanie M. Gogarten ▴ 890 • written 7.9 years ago by Lna • 0

0

Entering edit mode

You could look at the REF column for non-DNA characters. As an aside, it's really unfortunate that two Bioconductor tools are communicating through files.

ADD REPLY • link 7.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

You make a good point about the file communication, but unfortunately VariantAnnotation only takes VCF files as input, so it's up to every other package author to write a coercion method to a VCF object. Based on my experience with doing this for the SeqArray package, that's a substantial amount of work, since the VCF objects are quite complex. Realistically it's not going to happen for GWASTools anytime soon.

A method in VariantAnnotation for creating a VCF object from simpler inputs (integer/character vectors for the fixed columns, a matrix for the genotypes) would make it easier for other packages to communicate with VariantAnnotation.

ADD REPLY • link 7.9 years ago Stephanie M. Gogarten ▴ 890

0

Entering edit mode

If it's easier to write (or at least maintain) your own VCF writer than to construct a VCF object, then indeed the constructor must be very difficult to use. I will have a go at a coercion from GWASTools objects to VCF with an eye towards the difficulties.

ADD REPLY • link 7.9 years ago Michael Lawrence ★ 11k

score 0 · Answer 1 · 2017-06-19

0

Entering edit mode

Stephanie M. Gogarten ▴ 890

@stephanie-m-gogarten-5121

Last seen 10 months ago

University of Washington

Look at getAlleleA(genoData) and getAlleleB(genoData) to see where the non-nucleotide characters are coming from. You might need to correct the "alleleA" and "alleleB" columns in the SNP annotation attached to genoData.

ADD COMMENT • link 7.9 years ago Stephanie M. Gogarten ▴ 890

0

Entering edit mode

Thank you for your answer. When I first used VariantAnnotation the alleleA/alleleB columns were not included in the SNP annotation, so I had no problem with that... In the meanwhile I added the columns from the Illumina table with the SNP information and they include "N","D" and "I" apart from the standard bases. Maybe these entries caused the problem. I deleted the columns from my SNP annotation and now the readVcf function works.

ADD REPLY • link 7.9 years ago Lna • 0

0

Entering edit mode

I get a similar error when trying to read in the first example file.

> vcf1 <- readVcf(fl, "hg19")
Error: scanVcf: scanVcf: scanTabix: '.SigArgs' is shorter than '.SigLength' says it should be
 path: /home/joneill/R/x86_64-pc-linux-gnu-library/3.3/VariantAnnotation/extdata/chr22.vcf.gz
 index: /home/joneill/R/x86_64-pc-linux-gnu-library/3.3/VariantAnnotation/extdata/chr22.vcf.gz.tbi
  path: /home/joneill/R/x86_64-pc-linux-gnu-library/3.3/VariantAnnotation/extdata/chr22.vcf.gz

I have restarted R Studio several times. Is there a work around or a fix?

ADD REPLY • link 6.7 years ago jnoneill • 0