Question

predictCoding error DNAStringSet_translate

0

Entering edit mode

Miss Agnieszka Aleksandra Golicz ▴ 40

@miss-agnieszka-aleksandra-golicz-6531

Last seen 22 months ago

Australia

Hello,

I keep getting an error in predictCoding() which is giving me a bit of a headache.

I read somewhere that it may be due to illegal ALT character but it does not appear to be the case.

> ref <- FaFile("sequences.fa")
> coding <- predictCoding(vcf, txdb, seqSource=ref)
Error in .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
in 'x[[220612]]': not a base at pos 1
> alt(vcf)[[220612]]
A DNAStringSet instance of length 1
width seq
[1] 1 T
> ref(vcf)[[220612]]
1-letter "DNAString" instance
seq: A
> hasOnlyBaseLetters(unlist(alt(vcf)))
[1] TRUE
> hasOnlyBaseLetters(unlist(ref(vcf)))
[1] TRUE

> colSums(alphabetFrequency(unlist(alt(vcf))))
A C G T M R W S Y K
1097599 861657 863029 1099514 0 0 0 0 0 0
V H D B N - + .
0 0 0 0 0 0 0 0

I went ahead and looked up the corresponding line in the .vcf (v4.0) file, but nothing strange seems
to be going on there

C1 29273358 . A T 271 PASS BRF=0.0;FR=0.1031;HP=3;HapScore=3;MGOF=62;MMLQ=35;MQ=41.74;NF=1;NR=8;PP=271;QD= 33.9745791719;SC=AAAACTTCAAATTTAAGGCTT;SbPval=0.65;Source=Platypus;TC=315;TCF=125;TCR=190;TR=9;WE=29273366;WS=29273343 GT:GL:GOF:GQ:NR:NV 0 /0:0.0,-16.23,-207.1:1:99:57:0 1/1:-33.0,-2.23,0.0:31:22:9:9 0/0:0.0,-18.89,-244.1:31:99:66:0 0/0:0.0,-18.27,-233.9:62:99:64:0 0 /0:0.0,-18.27,-232.6:1:99:63:0 0/0:0.0,-9.84,-126.5:1:98:34:0 0/0:0.0,-0.45,-11.0:0:6:3:0 0/0:0.0,-0.4,-11.6:43:5:4:0 0/0:0.0,-0.92,- 19.5:3:10:5:0 0/0:0.0,-2.11,-35.4:5:21:10:0

By the way there must be an easier way to recover SNP coordinates for a given allele (I converted Granges to data.frame)

df <- data.frame(seqnames=seqnames(gr),
starts=start(gr),
ends=end(gr),
names=c(rep(".", length(gr))),
scores=c(rep(".", length(gr))),
strands=strand(gr))

df[220612,]
seqnames starts ends names scores strands
220612 C1 29273358 29273358 . . *

Any idea what else could be causing problems?

Agnieszka

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] VariantAnnotation_1.12.9 Rsamtools_1.18.2 Biostrings_2.34.1
[4] XVector_0.6.0 GenomicFeatures_1.18.3 AnnotationDbi_1.28.1
[7] Biobase_2.22.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4
[10] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9
[4] BiocParallel_1.0.3 biomaRt_2.18.0 bitops_1.0-6
[7] brew_1.0-6 BSgenome_1.30.0 checkmate_1.5.1
[10] codetools_0.2-9 DBI_0.3.1 digest_0.6.4
[13] fail_1.2 foreach_1.4.2 GenomicAlignments_1.2.1
[16] iterators_1.0.7 RCurl_1.95-4.3 RSQLite_1.0.0
[19] rtracklayer_1.26.2 sendmailR_1.2-1 stringr_0.6.2
[22] tools_3.1.2 XML_3.98-1.1 zlibbioc_1.8.0

variantannotation predict coding • 2.9k views

ADD COMMENT • link updated 10.1 years ago by Valerie Obenchain ★ 6.8k • written 10.1 years ago by Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

Also checked alt(vcf) is a DNAStringSetList

> alt(vcf)
DNAStringSetList of length 5514356

ADD REPLY • link 10.1 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

I think that problem may be due to filtering out of non-SNP variants from the original file which contains indels and MNPs. I used a pretty simple custom script which retains only lines which have one nucleotide in ref and one (or multiple single comma separated nucleotides) in ALT. Is the a VariantAnnotation friendly way of doing this (filtering out after reading in vcf file?). I am using Platypus (http://www.well.ox.ac.uk/platypus) for variant calling and as far as I can tell there is no way to make it output SNPs only.

ADD REPLY • link 10.1 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

I think that problem may be due to filtering out of non-SNP variants from the original file which contains indels and MNPs. I used a pretty simple custom script which retains only lines which have one nucleotide in ref and one (or multiple single comma separated nucleotides) in ALT. Is the a VariantAnnotation friendly way of doing this (filtering out after reading in vcf file?). I am using Platypus (http://www.well.ox.ac.uk/platypus) for variant calling and as far as I can tell there is no way to make it output SNPs only.

ADD REPLY • link 10.1 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

score 0 · Answer 1 · 2015-03-17

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.3 years ago

United States

Hi Agnieszka,

predictCoding() should work on all variants, not just SNPS. Can you send the file off-line, email or dropbox so I can take a look?

As to the filtering question, there is a family of functions that filters a VCF on variant type. See the ?isSNV man page.

You also asked about recovering coordinates for a given allele. You can subset the VCF class by a numeric or logical vector. If you want row(s) 'i',

sub <- vcf[i, ]

rowData(sub)

or pull the rowData out and work with that. The rowData() accessor has become rowRanges() in devel.

rowRanges(vcf)[i,]

Isolate SNPs, extract those with allele 'A':

snps <- vcf[isSNV(vcf),]

snps[unlist(alt(snps) %in% "A"), ]

Valerie

ADD COMMENT • link 10.1 years ago Valerie Obenchain ★ 6.8k

1

Entering edit mode

Thanks for sending the files. The problem was that one of the reference codons had an 'N' and could not be translated. The 'N' wasn't present in the original reference allele but it was one of the 3 nucleotides pulled down when the codon was retrieved from the fasta. The code checked for 'N' values in the alternate alleles but not the reference codons. This has been fixed in 1.13.46.

Valerie

ADD REPLY • link 10.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thank you very much.

ADD REPLY • link 10.1 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

For the future could you tell me how to remove problematic lines.

I tried:

invalid <- c(123)

vcf2 <- vcf[!invalid], but this did not seem to work.

Agnieszka

ADD REPLY • link 10.1 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

You shouldn't have to do anything on your end. Lines with 'N' in ref or alt will be ignored, not translated. If they are in the data you'll see a warning (not error):

Warning message:
varAllele values containing 'N' were not translated

or

Warning message:
reference codons containing 'N' were not translated

If you want to remove/keep rows in a VCF object the example you have with c(1,2,3) will work. It's identifying which rows are bad that's difficult. The internal code does some additional subsetting/excluding so the index printed in the error message may not correspond to the same row in the VCF object. Herve just enhanced the error message in Biostrings 2.35.12 so instead of this

'x[[220612]]': not a base at pos 1

the error will print out the exact character that is causing the problems. That will give you a heads up as to what to look for in the VCF.

Valerie

ADD REPLY • link 10.1 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks again. I understand that the fixes are in the devel version of VariantAnnotation. I will have to wait for my admin to install R-devel and will try it out.

ADD REPLY • link 10.1 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

I can confirm that the code works on the full data set. Thanks for the fix!

Agnieszka

ADD REPLY • link 10.1 years ago Miss Agnieszka Aleksandra Golicz ▴ 40