Hi, I have trouble using the predictCoding function in VariantAnnotation. I have the following gff3 file, which contains:
A gene which consists of 2 coding regions (865-880 and 3358-3620)
A gene which consists of one coding region (2756-3853)
##gff-version 3
##sequence-region HPV16_K02718_1_revised 1 7906
HPV16_K02718_1_revised PaVE region 1 7906 . . . ID=HPV16_K02718_1_revised
HPV16_K02718_1_revised PaVE gene 865 3620 . + . ID=E4_splice
HPV16_K02718_1_revised PaVE mRNA 865 3620 . + . ID=E4_spliceRNA1;Parent=E4_splice
HPV16_K02718_1_revised PaVE exon 865 880 . + . ID=E4exon1;Parent=E4_spliceRNA1
HPV16_K02718_1_revised PaVE exon 3358 3620 . + . ID=E4exon2;Parent=E4_spliceRNA1
HPV16_K02718_1_revised PaVE CDS 865 880 . + 0 ID=E4cds1;Parent=E4_spliceRNA1
HPV16_K02718_1_revised PaVE CDS 3358 3620 . + 2 ID=E4cds2;Parent=E4_spliceRNA1
HPV16_K02718_1_revised PaVE gene 2756 3853 . + . ID=E2
HPV16_K02718_1_revised PaVE mRNA 2756 3853 . + . ID=E2mRNA;Parent=E2
HPV16_K02718_1_revised PaVE CDS 2756 3853 . + 0 ID=E2cds;Parent=E2mRNA
I also have a vcf file with the following variants:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1
HPV16_K02718_1_revised 3360 . G T 5455.04 PASS AC=1;AF=1.00;AN=1;BaseQRankSum=0.624;DP=156;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=31.63;ReadPosRankSum=-0.565;SOR=0.416 GT:AD:DP:GQ:PL 1:1,154:155:99:5465,0
HPV16_K02718_1_revised 3367 . A T 5455.04 PASS AC=1;AF=1.00;AN=1;BaseQRankSum=0.624;DP=156;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=31.63;ReadPosRankSum=-0.565;SOR=0.416 GT:AD:DP:GQ:PL 1:1,154:155:99:5465,0
HPV16_K02718_1_revised 3373 . C T 5455.04 PASS AC=1;AF=1.00;AN=1;BaseQRankSum=0.624;DP=156;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=31.63;ReadPosRankSum=-0.565;SOR=0.416 GT:AD:DP:GQ:PL 1:1,154:155:99:5465,0
HPV16_K02718_1_revised 3374 . T A 5455.04 PASS AC=1;AF=1.00;AN=1;BaseQRankSum=0.624;DP=156;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=31.63;ReadPosRankSum=-0.565;SOR=0.416 GT:AD:DP:GQ:PL 1:1,154:155:99:5465,0
However I get the error:
GRanges object contains 2 out-of-bound ranges located on sequence 1. Note that ranges located on a sequence whose
length is unknown (NA) or on a circular sequence are not considered out-of-bound (use seqlengths() and isCircular()
to get the lengths and circularity flags of the underlying sequences). You can use trim() to trim these ranges. See
?`trim,GenomicRanges-method` for more information.
... when trying to predict the amino acid mutations from these variants with predictCoding. Unless im completely blind, the variants and gff file have no positions outside the sequence, but something is making the function think so.
I think it has something to do with the first CDS region as I can avoid this error by changing the 2nd CDS stop position to =< 3373 (making it 15 bases long, like the first CDS region), but this ofcourse does not allow calls outside 3373 in the E4 gene.
I am very confused by the behavior of the txdb object and predictCoding function and have spent hours trying to figure this out. Can anyone help in creating a working gff file?
Functions used:
txdb <- makeTxDbFromGFF("GFFfile", format="gff3")
vcffile <- readVcf("vcffile.vcf")
faf <- open(FaFile("Ref.fasta"))
vcf_anno <- predictCoding(vcffile, txdb, seqSource = faf)
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=da_DK.UTF-8 LC_NUMERIC=C LC_TIME=da_DK.UTF-8 LC_COLLATE=da_DK.UTF-8 LC_MONETARY=da_DK.UTF-8
[6] LC_MESSAGES=da_DK.UTF-8 LC_PAPER=da_DK.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=da_DK.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] genbankr_1.20.0 Repitools_1.38.0 forcats_0.5.1 stringr_1.4.0
[5] dplyr_1.0.7 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
[9] tibble_3.1.2 ggplot2_3.3.5 tidyverse_1.3.1 GenomicFeatures_1.44.0
[13] AnnotationDbi_1.54.1 VariantAnnotation_1.38.0 Rsamtools_2.8.0 Biostrings_2.60.1
[17] XVector_0.32.0 SummarizedExperiment_1.22.0 Biobase_2.52.0 GenomicRanges_1.44.0
[21] GenomeInfoDb_1.28.1 IRanges_2.26.0 S4Vectors_0.30.0 MatrixGenerics_1.4.0
[25] matrixStats_0.59.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 rjson_0.2.20 ellipsis_0.3.2 DNAcopy_1.66.0 fs_1.5.0
[6] rstudioapi_0.13 affyio_1.62.0 bit64_4.0.5 fansi_0.5.0 lubridate_1.7.10
[11] xml2_1.3.2 splines_4.1.0 cachem_1.0.5 jsonlite_1.7.2 broom_0.7.8
[16] annotate_1.70.0 cluster_2.1.2 vsn_3.60.0 dbplyr_2.1.1 png_0.1-7
[21] BiocManager_1.30.16 compiler_4.1.0 httr_1.4.2 backports_1.2.1 assertthat_0.2.1
[26] Matrix_1.3-3 fastmap_1.1.0 limma_3.48.1 cli_3.0.0 prettyunits_1.1.1
[31] tools_4.1.0 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.6 affy_1.70.0
[36] rappdirs_0.3.3 tinytex_0.32 Rcpp_1.0.7 cellranger_1.1.0 vctrs_0.3.8
[41] preprocessCore_1.54.0 rtracklayer_1.52.0 xfun_0.24 rvest_1.0.0 lifecycle_1.0.0
[46] restfulr_0.0.13 gtools_3.9.2 XML_3.99-0.6 edgeR_3.34.0 zlibbioc_1.38.0
[51] MASS_7.3-54 scales_1.1.1 BSgenome_1.60.0 hms_1.1.0 yaml_2.2.1
[56] curl_4.3.2 memoise_2.0.0 biomaRt_2.48.2 stringi_1.6.2 RSQLite_2.2.7
[61] genefilter_1.74.0 BiocIO_1.2.0 caTools_1.18.2 Ringo_1.56.0 filelock_1.0.2
[66] BiocParallel_1.26.1 truncnorm_1.0-8 rlang_0.4.11 pkgconfig_2.0.3 bitops_1.0-7
[71] Rsolnp_1.16 lattice_0.20-44 GenomicAlignments_1.28.0 bit_4.0.4 tidyselect_1.1.1
[76] magrittr_2.0.1 R6_2.5.0 gplots_3.1.1 generics_0.1.0 DelayedArray_0.18.0
[81] DBI_1.1.1 gsmoothr_0.1.7 pillar_1.6.1 haven_2.4.1 withr_2.4.2
[86] survival_3.2-11 KEGGREST_1.32.0 RCurl_1.98-1.3 modelr_0.1.8 crayon_1.4.1
[91] KernSmooth_2.23-20 utf8_1.2.1 BiocFileCache_2.0.0 progress_1.2.2 locfit_1.5-9.4
[96] grid_4.1.0 readxl_1.3.1 blob_1.2.1 reprex_2.0.0 digest_0.6.27
[101] xtable_1.8-4 munsell_0.5.0
Can we reproduce this without Ref.fasta?