Hi,
I'm noticing what appears to be an incorrect translation of a codon with predictCoding(). It's somewhat difficult to see but below is output of all nonsynonymous amino acid changes. However, variant 3 at bp 1002 shows an M for TTG rather than an L. Is this a bug? Clearly TTG is an L in the standard codon table so I'm not sure how to resolve this (although it can be a start codon in some orgs).
Thanks,
-todd
Here is the code I used to generate a txdb object. This is a custom gff3 file and fasta file for an RSV virus.
txdb <- makeTxDbFromGFF(gff3.file, format="gff3") coding <- predictCoding(vcf, txdb, seqSource=FaFile(fsa.file)) nonsyn <- coding[(mcols(coding)[,13] == "nonsynonymous")]
# GFF3 file
A_NLD_13_005275 manual region 1 1725 . + . ID=id0 A_NLD_13_005275 manual gene 1 1725 . + . ID=gene1;Name=F;gbkey=Gene;gene=F A_NLD_13_005275 manual mRNA 1 1725 . + . ID=rna1;Parent=gene1;gbkey=mRNA;gene=F A_NLD_13_005275 manual exon 1 1725 . + . ID=exon1;Parent=rna1;gbkey=mRNA;gene=F A_NLD_13_005275 manual CDS 1 1725 . + . ID=cds1;Parent=rna1;gbkey=CDS;gene=F
# FASTA file
>A_NLD_13_005275 ATGGAGTTGCCAATCCTCAAAACAAATGCTATTACCACAATCCTTGCTGCAGTCACACTCTGTTTCGCTTCCAGTCAAAACATCACTGAAGAATTTTATCAATCAACATGCAGTGCAGTTAGCAAAGGCTATCTTAGTGCTCTAAGAACTGGTTGGTATACTAGTGTTATAACTATAGAATTAAGTAATATCAAGGAAAATAAGTGTAATGGTACAGACGCTAAGGTAAAATTAATAAAACAAGAATTAGATAAATATAAAAATGCTGTAACAGAATTGCAGTTGCTCATGCAAAGCACACCAGCAGCCAACAGTCGAGCCAGAAGAGAACTACCAAGATTTATGAATTATACACTCAACAATACCAAAAACACCAATGTAACATTAAGTAAGAAAAGGAAAAGAAGATTTCTTGGATTTTTGTTAGGTGTTGGATCTGCAATCGCCAGTGGCATTGCCGTATCCAAGGTCCTGCACCTAGAAGGGGAAGTGAACAAAATCAAAAGTGCTCTACTATCCACAAACAAGGCTGTAGTCAGCTTATCTAATGGAGTCAGTGTCTTAACCAGCAAGGTGTTAGACCTCAAAAACTATATAGATAAACAGTTGTTACCTATTGTTAACAAGCAAAGCTGCAGCATATCAAACATTGAAACTGTGATAGAGTTCCAACAAAAGAACAACAGACTACTAGAGATTACCAGAGAATTTAGTGTTAATGCAGGTGTAACTACACCTGTAAGCACTTATATGTTAACTAATAGTGAGTTATTATCATTAATCAATGATATGCCTATAACAAATGATCAGAAAAAGTTAATGTCCAGCAATGTTCAAATAGTTAGACAGCAAAGTTACTCTATCATGTCAATAATAAAAGAGGAAGTCTTAGCATATGTAGTACAATTACCACTATATGGTGTAATAGATACTCCTTGTTGGAAACTACACACATCCCCTCTATGTACAACCAACACAAAGGAAGGATCCAACATCTGCTTGACAAGAACCGACAGAGGATGGTACTGTGACAATGCAGGATCAGTATCCTTTTTCCCACAAGCTGAAACATGTAAAGTTCAATCGAATCGGGTGTTTTGTGACACAATGAACAGTTTAACATTACCAAGTGAGGTAAATCTCTGCAACATTGACATATTCAACCCCAAATATGATTGCAAAATTATGACTTCAAAAACAGATGTAAGCAGCTCCGTTATCACATCTCTAGGAGCCATTGTGTCATGCTATGGCAAAACCAAATGTACAGCATCCAATAAAAATCGTGGGATCATAAAGACATTCTCTAACGGGTGTGATTATGTATCAAATAAGGGGGTGGATACTGTGTCTGTAGGTAATACATTATATTATGTAAATAAGCAAGAAGGCAAAAGTCTCTATGTAAAAGGTGAACCAATAATAAATTTCTATGATCCATTAGTGTTCCCCTCTGATGAATTTGATGCATCAATATCTCAAGTCAATGAGAAAATTAATCAGAGTCTAGCATTTATCCGTAAATCAGATGAATTATTACATAATGTAAATGCTGGTAAATCCACCACAAATATCATGATAACTACCATAATTATAGTAATTATAGTAATATTGTTAGCATTAATTGCAGTTGGACTGCTTCTATACTGCAAGGCCAGAAGCACACCAGTCACATTAAGTAAGGATCAACTGAGTGGTATAAATAATATTGCATTTAGTAACTGA
GRanges object with 5 ranges and 17 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
A_NLD_13_005275:11_C/T A_NLD_13_005275 [ 11, 11] + | <NA>
A_NLD_13_005275:64_T/A A_NLD_13_005275 [ 64, 64] + | <NA>
A_NLD_13_005275:1002_G/A A_NLD_13_005275 [1002, 1002] + | <NA>
A_NLD_13_005275:1541_A/T A_NLD_13_005275 [1541, 1541] + | <NA>
A_NLD_13_005275:1552_G/T A_NLD_13_005275 [1552, 1552] + | <NA>
REF ALT QUAL
<DNAStringSet> <DNAStringSetList> <numeric>
A_NLD_13_005275:11_C/T C T <NA>
A_NLD_13_005275:64_T/A T A <NA>
A_NLD_13_005275:1002_G/A G A <NA>
A_NLD_13_005275:1541_A/T A T <NA>
A_NLD_13_005275:1552_G/T G T <NA>
FILTER varAllele CDSLOC
<character> <DNAStringSet> <IRanges>
A_NLD_13_005275:11_C/T PASS T [ 11, 11]
A_NLD_13_005275:64_T/A PASS A [ 64, 64]
A_NLD_13_005275:1002_G/A PASS A [1002, 1002]
A_NLD_13_005275:1541_A/T PASS T [1541, 1541]
A_NLD_13_005275:1552_G/T PASS T [1552, 1552]
PROTEINLOC QUERYID TXID CDSID
<IntegerList> <integer> <character> <IntegerList>
A_NLD_13_005275:11_C/T 4 1 1 1
A_NLD_13_005275:64_T/A 22 3 1 1
A_NLD_13_005275:1002_G/A 334 6 1 1
A_NLD_13_005275:1541_A/T 514 12 1 1
A_NLD_13_005275:1552_G/T 518 13 1 1
GENEID CONSEQUENCE REFCODON
<character> <factor> <DNAStringSet>
A_NLD_13_005275:11_C/T F nonsynonymous CCA
A_NLD_13_005275:64_T/A F nonsynonymous TTC
A_NLD_13_005275:1002_G/A F nonsynonymous TTG
A_NLD_13_005275:1541_A/T F nonsynonymous CAT
A_NLD_13_005275:1552_G/T F nonsynonymous GCT
VARCODON REFAA VARAA
<DNAStringSet> <AAStringSet> <AAStringSet>
A_NLD_13_005275:11_C/T CTA P L
A_NLD_13_005275:64_T/A ATC F I
A_NLD_13_005275:1002_G/A TTA M L
A_NLD_13_005275:1541_A/T CTT H L
A_NLD_13_005275:1552_G/T TCT A S
-------
sessionInfo():
R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin16.6.0 (64-bit) Running under: macOS Sierra 10.12.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel methods stats graphics grDevices utils [8] datasets base other attached packages: [1] GenomicFeatures_1.28.4 AnnotationDbi_1.38.1 [3] dplyr_0.7.2 plyr_1.8.4 [5] gridExtra_2.2.1 ggplot2_2.2.1 [7] reshape2_1.4.2 VariantAnnotation_1.22.3 [9] Rsamtools_1.28.0 Biostrings_2.44.2 [11] XVector_0.16.0 SummarizedExperiment_1.6.3 [13] DelayedArray_0.2.7 matrixStats_0.52.2 [15] Biobase_2.36.2 GenomicRanges_1.28.4 [17] GenomeInfoDb_1.12.2 IRanges_2.10.2 [19] S4Vectors_0.14.3 BiocGenerics_0.22.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.12 bindr_0.1 compiler_3.4.1 [4] bitops_1.0-6 tools_3.4.1 zlibbioc_1.22.0 [7] biomaRt_2.32.1 digest_0.6.12 bit_1.1-12 [10] gtable_0.2.0 RSQLite_2.0 memoise_1.1.0 [13] tibble_1.3.3 lattice_0.20-35 BSgenome_1.44.0 [16] pkgconfig_2.0.1 rlang_0.1.1 Matrix_1.2-10 [19] DBI_0.7 bindrcpp_0.2 GenomeInfoDbData_0.99.0 [22] stringr_1.2.0 rtracklayer_1.36.4 bit64_0.9-7 [25] grid_3.4.1 glue_1.1.1 R6_2.2.2 [28] XML_3.98-1.9 BiocParallel_1.10.1 magrittr_1.5 [31] blob_1.1.0 scales_0.4.1 GenomicAlignments_1.12.1 [34] assertthat_0.2.0 colorspace_1.3-2 stringi_1.1.5 [37] lazyeval_0.2.0 munsell_0.4.3 RCurl_1.95-4.8
I'm not sure why we see '**TTG**' for REFCODON - that's where the trouble starts. I can't try to reproduce this unless I know what annotation you used. Please edit your answer to include the annotation and the output of sessionInfo().
Valerie
Sorry, the "**TTG**" was a mistake. I removed them from the post. I've added some code that I used to generate the custom txdb object which is for a RSV virus, the gff3 data, the fasta sequence, and the sessionInfo(). Let me know what else you need.