Hello
I would like to ask for some help on annotating variants in transcripts with predictCoding() in the VariantAnnotation package.
I can get the amino acid change by inputting the gff3 file to make a TxDB object and get variants on gene level.
But I cant get this to work when genes consist of multiple exons or spliced genes. In some gff3 files I see the exon, mRNA and transcript features, but i cannot make VariantAnnotation annotate variants in these transcripts.
Example with spliced genes: See HPV16_genes_PaVE, where part of E1 and part of E2 are spliced together to form E1^E4. This means that the amino acid sequence is changed and in some cases like this the frame may get shifted.
gene E1 have amino acids MADPAGTNGE...
splice transcript E1^E4 have amino acids MADPAAATKYP...
One variant in position 3400 will affect the E2 gene AND the E1^E4 splice transcript, but I can only get the ALT amino acid in gene E2. I'm not sure if the gff3 files are correctly set up to pass the info of the spliced genes to predictCoding() or if it is not possible to do it?
Thank you very much.
Functions used:
library(GenomicFeatures)
library(VariantAnnotation)
txdb <- makeTxDbFromGFF("GFFfile", format="gff3")
vcffile <- readVcf("vcffile.vcf")
faf <- open(FaFile("Ref.fasta"))
vcf_anno <- predictCoding(vcffile, txdb, seqSource = faf)
sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=da_DK.UTF-8 LC_NUMERIC=C LC_TIME=da_DK.UTF-8 LC_COLLATE=da_DK.UTF-8 LC_MONETARY=da_DK.UTF-8
[6] LC_MESSAGES=da_DK.UTF-8 LC_PAPER=da_DK.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=da_DK.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] genbankr_1.20.0 Repitools_1.38.0 forcats_0.5.1 stringr_1.4.0
[5] dplyr_1.0.7 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
[9] tibble_3.1.2 ggplot2_3.3.5 tidyverse_1.3.1 GenomicFeatures_1.44.0
[13] AnnotationDbi_1.54.1 VariantAnnotation_1.38.0 Rsamtools_2.8.0 Biostrings_2.60.1
[17] XVector_0.32.0 SummarizedExperiment_1.22.0 Biobase_2.52.0 GenomicRanges_1.44.0
[21] GenomeInfoDb_1.28.1 IRanges_2.26.0 S4Vectors_0.30.0 MatrixGenerics_1.4.0
[25] matrixStats_0.59.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 rjson_0.2.20 ellipsis_0.3.2 DNAcopy_1.66.0 fs_1.5.0
[6] rstudioapi_0.13 affyio_1.62.0 bit64_4.0.5 fansi_0.5.0 lubridate_1.7.10
[11] xml2_1.3.2 splines_4.1.0 cachem_1.0.5 jsonlite_1.7.2 broom_0.7.8
[16] annotate_1.70.0 cluster_2.1.2 vsn_3.60.0 dbplyr_2.1.1 png_0.1-7
[21] BiocManager_1.30.16 compiler_4.1.0 httr_1.4.2 backports_1.2.1 assertthat_0.2.1
[26] Matrix_1.3-3 fastmap_1.1.0 limma_3.48.1 cli_3.0.0 prettyunits_1.1.1
[31] tools_4.1.0 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.6 affy_1.70.0
[36] rappdirs_0.3.3 tinytex_0.32 Rcpp_1.0.7 cellranger_1.1.0 vctrs_0.3.8
[41] preprocessCore_1.54.0 rtracklayer_1.52.0 xfun_0.24 rvest_1.0.0 lifecycle_1.0.0
[46] restfulr_0.0.13 gtools_3.9.2 XML_3.99-0.6 edgeR_3.34.0 zlibbioc_1.38.0
[51] MASS_7.3-54 scales_1.1.1 BSgenome_1.60.0 hms_1.1.0 yaml_2.2.1
[56] curl_4.3.2 memoise_2.0.0 biomaRt_2.48.2 stringi_1.6.2 RSQLite_2.2.7
[61] genefilter_1.74.0 BiocIO_1.2.0 caTools_1.18.2 Ringo_1.56.0 filelock_1.0.2
[66] BiocParallel_1.26.1 truncnorm_1.0-8 rlang_0.4.11 pkgconfig_2.0.3 bitops_1.0-7
[71] Rsolnp_1.16 lattice_0.20-44 GenomicAlignments_1.28.0 bit_4.0.4 tidyselect_1.1.1
[76] magrittr_2.0.1 R6_2.5.0 gplots_3.1.1 generics_0.1.0 DelayedArray_0.18.0
[81] DBI_1.1.1 gsmoothr_0.1.7 pillar_1.6.1 haven_2.4.1 withr_2.4.2
[86] survival_3.2-11 KEGGREST_1.32.0 RCurl_1.98-1.3 modelr_0.1.8 crayon_1.4.1
[91] KernSmooth_2.23-20 utf8_1.2.1 BiocFileCache_2.0.0 progress_1.2.2 locfit_1.5-9.4
[96] grid_4.1.0 readxl_1.3.1 blob_1.2.1 reprex_2.0.0 digest_0.6.27
[101] xtable_1.8-4 munsell_0.5.0