Hi, I am using locateVariants from VariantAnnotation. Intronic variants seem to be associated with the wrong gene, on a different chromosome:
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
variant <- GRanges(seqnames = "chr5", ranges = IRanges(start = 20298238, end = 20298238))
genome(variant) <- "hg19"
anno <- locateVariants(query = variant, subject = TxDb.Hsapiens.UCSC.hg19.knownGene, region = AllVariants())
anno
GRanges object with 1 range and 9 metadata columns:
seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID GENEID PRECEDEID FOLLOWID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList> <character> <CharacterList> <CharacterList>
[1] chr5 20298238 - | intron 277333 277333 1 19778 839
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
However, the transcript identified by locateVariants is located on chromosome 4, not chromosome 5:
dump <- as.list(TxDb.Hsapiens.UCSC.hg19.knownGene)
dump$transcripts[dump$transcripts$tx_id %in% anno$TXID,]
      tx_id    tx_name tx_chrom tx_strand  tx_start    tx_end
19778 19778 uc003hzo.1     chr4         - 110609785 110624629
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.44.0
[3] AnnotationDbi_1.54.1 VariantAnnotation_1.38.0
[5] Rsamtools_2.8.0 Biostrings_2.60.1
[7] XVector_0.32.0 SummarizedExperiment_1.22.0
[9] Biobase_2.52.0 GenomicRanges_1.44.0
[11] GenomeInfoDb_1.28.0 IRanges_2.26.0
[13] S4Vectors_0.30.0 MatrixGenerics_1.4.0
[15] matrixStats_0.61.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 lattice_0.20-45 prettyunits_1.1.1 png_0.1-7
[5] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2 BiocFileCache_2.0.0
[9] R6_2.5.1 RSQLite_2.2.9 httr_1.4.2 pillar_1.6.4
[13] zlibbioc_1.38.0 rlang_0.4.12 progress_1.2.2 curl_4.3.2
[17] rstudioapi_0.13 blob_1.2.2 Matrix_1.4-0 BiocParallel_1.26.0
[21] stringr_1.4.0 RCurl_1.98-1.5 bit_4.0.4 biomaRt_2.48.1
[25] DelayedArray_0.18.0 rtracklayer_1.52.0 compiler_4.1.2 pkgconfig_2.0.3
[29] tidyselect_1.1.1 KEGGREST_1.32.0 tibble_3.1.6 GenomeInfoDbData_1.2.6
[33] XML_3.99-0.8 fansi_0.5.0 crayon_1.4.2 dplyr_1.0.7
[37] dbplyr_2.1.1 GenomicAlignments_1.28.0 bitops_1.0-7 rappdirs_0.3.3
[41] grid_4.1.2 lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.1
[45] stringi_1.7.6 cachem_1.0.6 xml2_1.3.3 ellipsis_0.3.2
[49] filelock_1.0.2 vctrs_0.3.8 generics_0.1.1 rjson_0.2.20
[53] restfulr_0.0.13 tools_4.1.2 bit64_4.0.5 BSgenome_1.60.0
[57] glue_1.6.0 purrr_0.3.4 hms_1.1.1 yaml_2.2.1
[61] fastmap_1.1.0 memoise_2.0.1 BiocIO_1.2.0
Thanks!
There certainly seems to be a problem here. The TXID mapping seems problematic. Thanks for posting and we will get back to you.
Is it clear what it would mean to associate an intronic variant with a TXID? The bug in locateVariants seems pretty clear to me: a linear index is being treated as a string identifier, and we need to fix that. But I think the right answer for this location problem could be to return NA at TXID. The program does correctly say that the query GRanges is at an intron. [Edited to acknowledge my confusion. There could be multiple transcripts associated with an intronic variant and there is no reason not to list them all.]
Hi, I have experienced exactly the same problem recently. Is there any news about this issue? Many thanks!
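For anyone trying to picture the diagnosis given above (a linear index being treated as a string identifier), here is a toy sketch in base R. It is not the locateVariants source. The table, its values, and the names tx, hit_index, buggy_txid and correct_txid are all made up for illustration; it only shows how reporting a row position as if it were a tx_id can point at a transcript on another chromosome.

```r
# Toy transcript table with made-up values (NOT taken from the TxDb).
# Note that tx_id values do not match the row positions.
tx <- data.frame(
  tx_id    = c("2", "1", "3"),
  tx_chrom = c("chr5", "chr4", "chr7"),
  stringsAsFactors = FALSE
)

# Suppose an overlap search reports the hit as a row position (linear index).
hit_index <- 1L

# Failure mode: the row position is coerced to a string and reported as a tx_id.
buggy_txid <- as.character(hit_index)

# Intended behaviour: use the position to look up the actual identifier.
correct_txid <- tx$tx_id[hit_index]

# Resolve both identifiers back to a chromosome.
buggy_chrom   <- tx$tx_chrom[tx$tx_id == buggy_txid]
correct_chrom <- tx$tx_chrom[tx$tx_id == correct_txid]

print(c(buggy = buggy_chrom, correct = correct_chrom))
```

In this toy case the mis-reported identifier resolves to a different chromosome than the hit itself, which is the same kind of mismatch as in the original report.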