I've been using the utility function locateVariants()
from the VariantAnnotation
package and I'm fumbling to understand whether what I'm getting is a bug in the package or really something that is expected. The bottom line is that the features returned by locateVariants()
don't seem to make sense.
I've crafted a toy example using the first variant in my multi-thousand VCF file.
Loading the required packages
Let's start by loading the UCSC gene annotation txdb and the VariantAnnotation package. Then, I removed all non-standard chromosomes. There's some extra-oddities when the alt contigs are present.
r$> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
r$> library(VariantAnnotation)
r$> txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
r$> txdb <- keepStandardChromosomes(txdb)
r$> seqlevels(txdb)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12"
[13] "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
[25] "chrM"
Finding feature overlapping a given SNP
Then I create a GRanges
with the first SNP in my vcf file and use locateVariants()
to find where that SNP falls in
r$> vcf <- GRanges("chr1", IRanges(3204352, width = 1))
r$> all <- locateVariants(vcf, txdb, AllVariants())
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
r$> all
GRanges object with 12 ranges and 9 metadata columns:
seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID GENEID PRECEDEID FOLLOWID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList> <character> <CharacterList> <CharacterList>
[1] chr1 3204352 - | intron 137755 137755 1 228337 6510
[2] chr1 3204352 - | intron 134706 134706 1 228338 6510
[3] chr1 3204352 - | intron 251534 251534 1 228339 6510
[4] chr1 3204352 - | intron 134706 134706 1 228340 6510
[5] chr1 3204352 - | intron 134706 134706 1 228341 6510
... ... ... ... . ... ... ... ... ... ... ... ... ...
[8] chr1 3204352 + | intron 134706 134706 1 384 8764
[9] chr1 3204352 + | intron 134706 134706 1 386 8764
[10] chr1 3204352 + | intron 134706 134706 1 387 8764
[11] chr1 3204352 + | intron 134706 134706 1 388 8764
[12] chr1 3204352 + | promoter <NA> <NA> 1 422 <NA>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
What does not make sense is that the two genes, GENEID 6510 and 8764, are no where near the SNP in the vcf
variable. In fact, gene with GENEID 6510 is on a different chromosome!
r$> locatedGenes <- as.vector(na.exclude(unique(all$GENEID)))
r$> genes <- genes(txdb)
r$> length(findOverlaps(all, genes[locatedGenes]))
[1] 0
r$> genes[locatedGenes]
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
6510 chr19 46774883-46788594 - | 6510
8764 chr1 2555639-2565382 + | 8764
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
Finding real introns affected by variant
The convenience function locateVariants()
finds 11 introns and 1 promoter overlaping with the SNP, let see what introns are really overlapping.
r$> introns <- intronicParts(txdb)
r$> o.l <- findOverlaps(vcf, introns)
r$> o.l
Hits object with 1 hit and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 617
-------
queryLength: 1 / subjectLength: 421596
r$> introns[subjectHits(o.l)]
GRanges object with 1 range and 3 metadata columns:
seqnames ranges strand | tx_id tx_name gene_id
<Rle> <IRanges> <Rle> | <IntegerList> <CharacterList> <CharacterList>
[1] chr1 3186475-3238095 + | 416,418,419,... ENST00000511072.5,ENST00000378391.6,ENST00000270722.10,... 63976
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
Which is an intron part of the gene PRDM16
which in fact overlap the SNP
r$> library(org.Hs.eg.db)
r$> gid <- unlist(introns[subjectHits(o.l)]$gene_id)
r$> select(org.Hs.eg.db, gid, c("SYMBOL", "GENETYPE"), "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
ENTREZID SYMBOL GENETYPE
1 63976 PRDM16 protein-coding
Am I missing something here? The convience method locateVariants()
would be in fact convenient if it make sense as it reduces the amount of steps to go over all the genomic features...
Thanks
r$> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8
[6] LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] org.Hs.eg.db_3.20.0 VariantAnnotation_1.52.0
[3] Rsamtools_2.22.0 Biostrings_2.74.1
[5] XVector_0.46.0 SummarizedExperiment_1.36.0
[7] MatrixGenerics_1.18.1 matrixStats_1.5.0
[9] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0 GenomicFeatures_1.58.0
[11] AnnotationDbi_1.68.0 Biobase_2.66.0
[13] GenomicRanges_1.58.0 GenomeInfoDb_1.42.3
[15] IRanges_2.40.1 S4Vectors_0.44.0
[17] BiocGenerics_0.52.0
loaded via a namespace (and not attached):
[1] SparseArray_1.6.2 bitops_1.0-9 RSQLite_2.3.9 lattice_0.22-6
[5] grid_4.4.2 fastmap_1.2.0 blob_1.2.4 jsonlite_1.9.0
[9] Matrix_1.7-2 restfulr_0.0.15 DBI_1.2.3 httr_1.4.7
[13] BSgenome_1.74.0 UCSC.utils_1.2.0 XML_3.99-0.18 codetools_0.2-20
[17] abind_1.4-8 cli_3.6.4 rlang_1.1.5 crayon_1.5.3
[21] bit64_4.6.0-1 cachem_1.1.0 DelayedArray_0.32.0 yaml_2.3.10
[25] S4Arrays_1.6.0 tools_4.4.2 parallel_4.4.2 BiocParallel_1.40.0
[29] memoise_2.0.1 GenomeInfoDbData_1.2.13 curl_6.2.1 vctrs_0.6.5
[33] R6_2.6.1 png_0.1-8 BiocIO_1.16.0 rtracklayer_1.66.0
[37] zlibbioc_1.52.0 KEGGREST_1.46.0 bit_4.5.0.1 pkgconfig_2.0.3
[41] GenomicAlignments_1.42.0 rjson_0.2.23 compiler_4.4.2 RCurl_1.98-1.16
Ok, I think I found the problem. By looking at the
locateVariants()
source code, I was able to pinpoint that the problems is when trying to find SNPs location within introns.The implementation of
locateVariants()
usesfiveUTRsByTranscript()
,threeUTRsByTranscript()
andintronsByTranscript()
to generate subjects used bymapToTranscripts()
to locate the ranges where the variants fall. Turns out thatmapToTranscripts()
returns the wrong hits when passed a GRangesList fromintronsByTranscript()
.Without digging into the implementation of
mapToTranscripts()
, I'm suspecting that the fact thatintronsByTranscript()
returns a GRangesList where some of the entries have no GRanges is tripping themapToTranscripts
function. Here's the tests:Function to generate random locations within GRanges of a GRangesList
Testing
mapToTranscript
with 5' UTRsTesting
mapToTranscript
with intronsCould one of the authors of the package look into this? Should I open a new tickets as a bug report for
mapToTranscripts()
?