Question

Problem using locateVariants() from the VariantAnnotation package

0

Entering edit mode

Marco • 0

@dc86b518

Last seen 6 weeks ago

United States

I've been using the utility function locateVariants() from the VariantAnnotation package and I'm fumbling to understand whether what I'm getting is a bug in the package or really something that is expected. The bottom line is that the features returned by locateVariants() don't seem to make sense.

I've crafted a toy example using the first variant in my multi-thousand VCF file.

Loading the required packages

Let's start by loading the UCSC gene annotation txdb and the VariantAnnotation package. Then, I removed all non-standard chromosomes. There's some extra-oddities when the alt contigs are present.

r$> library(TxDb.Hsapiens.UCSC.hg38.knownGene)

r$> library(VariantAnnotation)

r$> txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

r$> txdb <- keepStandardChromosomes(txdb)

r$> seqlevels(txdb)
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10" "chr11" "chr12"
[13] "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY" 
[25] "chrM"

Finding feature overlapping a given SNP

Then I create a GRanges with the first SNP in my vcf file and use locateVariants() to find where that SNP falls in

r$> vcf <- GRanges("chr1", IRanges(3204352, width = 1))

r$> all <- locateVariants(vcf, txdb, AllVariants())
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns

r$> all
GRanges object with 12 ranges and 9 metadata columns:
       seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
          <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList> <character> <CharacterList> <CharacterList>
   [1]     chr1   3204352      - |   intron    137755    137755         1      228337                      6510                                
   [2]     chr1   3204352      - |   intron    134706    134706         1      228338                      6510                                
   [3]     chr1   3204352      - |   intron    251534    251534         1      228339                      6510                                
   [4]     chr1   3204352      - |   intron    134706    134706         1      228340                      6510                                
   [5]     chr1   3204352      - |   intron    134706    134706         1      228341                      6510                                
   ...      ...       ...    ... .      ...       ...       ...       ...         ...           ...         ...             ...             ...
   [8]     chr1   3204352      + | intron      134706    134706         1         384                      8764                                
   [9]     chr1   3204352      + | intron      134706    134706         1         386                      8764                                
  [10]     chr1   3204352      + | intron      134706    134706         1         387                      8764                                
  [11]     chr1   3204352      + | intron      134706    134706         1         388                      8764                                
  [12]     chr1   3204352      + | promoter      <NA>      <NA>         1         422                      <NA>                                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

What does not make sense is that the two genes, GENEID 6510 and 8764, are no where near the SNP in the vcf variable. In fact, gene with GENEID 6510 is on a different chromosome!

r$> locatedGenes <- as.vector(na.exclude(unique(all$GENEID)))

r$> genes <- genes(txdb)

r$> length(findOverlaps(all, genes[locatedGenes]))
[1] 0

r$> genes[locatedGenes]
GRanges object with 2 ranges and 1 metadata column:
       seqnames            ranges strand |     gene_id
          <Rle>         <IRanges>  <Rle> | <character>
  6510    chr19 46774883-46788594      - |        6510
  8764     chr1   2555639-2565382      + |        8764
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome

Finding real introns affected by variant

The convenience function locateVariants() finds 11 introns and 1 promoter overlaping with the SNP, let see what introns are really overlapping.

r$> introns <- intronicParts(txdb)

r$> o.l <- findOverlaps(vcf, introns)

r$> o.l
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1         617
  -------
  queryLength: 1 / subjectLength: 421596

r$> introns[subjectHits(o.l)]
GRanges object with 1 range and 3 metadata columns:
      seqnames          ranges strand |           tx_id                                                    tx_name         gene_id
         <Rle>       <IRanges>  <Rle> |   <IntegerList>                                            <CharacterList> <CharacterList>
  [1]     chr1 3186475-3238095      + | 416,418,419,... ENST00000511072.5,ENST00000378391.6,ENST00000270722.10,...           63976
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome

Which is an intron part of the gene PRDM16 which in fact overlap the SNP

r$> library(org.Hs.eg.db)

r$> gid <- unlist(introns[subjectHits(o.l)]$gene_id)

r$> select(org.Hs.eg.db, gid, c("SYMBOL", "GENETYPE"), "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL       GENETYPE
1    63976 PRDM16 protein-coding

Am I missing something here? The convience method locateVariants() would be in fact convenient if it make sense as it reduces the amount of steps to go over all the genomic features...

Thanks

r$> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
 [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.20.0                      VariantAnnotation_1.52.0                
 [3] Rsamtools_2.22.0                         Biostrings_2.74.1                       
 [5] XVector_0.46.0                           SummarizedExperiment_1.36.0             
 [7] MatrixGenerics_1.18.1                    matrixStats_1.5.0                       
 [9] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0 GenomicFeatures_1.58.0                  
[11] AnnotationDbi_1.68.0                     Biobase_2.66.0                          
[13] GenomicRanges_1.58.0                     GenomeInfoDb_1.42.3                     
[15] IRanges_2.40.1                           S4Vectors_0.44.0                        
[17] BiocGenerics_0.52.0                     

loaded via a namespace (and not attached):
 [1] SparseArray_1.6.2        bitops_1.0-9             RSQLite_2.3.9            lattice_0.22-6          
 [5] grid_4.4.2               fastmap_1.2.0            blob_1.2.4               jsonlite_1.9.0          
 [9] Matrix_1.7-2             restfulr_0.0.15          DBI_1.2.3                httr_1.4.7              
[13] BSgenome_1.74.0          UCSC.utils_1.2.0         XML_3.99-0.18            codetools_0.2-20        
[17] abind_1.4-8              cli_3.6.4                rlang_1.1.5              crayon_1.5.3            
[21] bit64_4.6.0-1            cachem_1.1.0             DelayedArray_0.32.0      yaml_2.3.10             
[25] S4Arrays_1.6.0           tools_4.4.2              parallel_4.4.2           BiocParallel_1.40.0     
[29] memoise_2.0.1            GenomeInfoDbData_1.2.13  curl_6.2.1               vctrs_0.6.5             
[33] R6_2.6.1                 png_0.1-8                BiocIO_1.16.0            rtracklayer_1.66.0      
[37] zlibbioc_1.52.0          KEGGREST_1.46.0          bit_4.5.0.1              pkgconfig_2.0.3         
[41] GenomicAlignments_1.42.0 rjson_0.2.23             compiler_4.4.2           RCurl_1.98-1.16

VariantAnnotation vari • 211 views

ADD COMMENT • link 8 weeks ago • updated 7 weeks ago Marco • 0

0

Entering edit mode

Ok, I think I found the problem. By looking at the locateVariants() source code, I was able to pinpoint that the problems is when trying to find SNPs location within introns.

The implementation of locateVariants() uses fiveUTRsByTranscript(), threeUTRsByTranscript() and intronsByTranscript() to generate subjects used by mapToTranscripts() to locate the ranges where the variants fall. Turns out that mapToTranscripts() returns the wrong hits when passed a GRangesList from intronsByTranscript().

Without digging into the implementation of mapToTranscripts(), I'm suspecting that the fact that intronsByTranscript() returns a GRangesList where some of the entries have no GRanges is tripping the mapToTranscripts function. Here's the tests:

Function to generate random locations within GRanges of a GRangesList

get_random_location_within_sub_subject <- function(subject, nItems = 25, seed = NA) {
  ## Randomly select GRangesList items
  if (!is.na(seed)) set.seed(seed)
  targets <- sample(seq_along(subject)[elementNROWS(subject) > 0], nItems)

  ## For each selected items find a location within a sub-item

  randomLocations <- do.call(c, lapply(targets, function(x) {

  ## Pick one of the items in the selected entry
  if (!is.na(seed)) set.seed(seed)
    y <- sample(length(subject[x]), 1)

    ## Get the chromsome of the random sub-item
    chr <- seqnames(subject[[x]][y])

    ## Return a single random postion within the selected sub item
    if (!is.na(seed)) set.seed(seed)
    loc <- sample(seq(start(subject[[x]][y]), end(subject[[x]][y])), 1)

    ## Return a GRanges for that random item
    GRanges(chr, IRanges(loc))
  }))
}

Testing mapToTranscript with 5' UTRs

## Get the 5' UTR from Txdb
> subject <- fiveUTRsByTranscript(txdb)

## Get 25 random location within 5' UTR
> randomLocations <- get_random_location_within_sub_subject(subject, seed = 1234)

## Map the location of the random GRanges to the subjects
> map <- mapToTranscripts(randomLocations, subject)

## Are the mapped items within the ranges of the GRangesList
> all(randomLocations[map$xHits] %within% unlist(range(subject[map$transcriptsHits])))
[1] TRUE

Testing mapToTranscript with introns

## Get the introns from Txdb
> subject <- intronsByTranscript(txdb)

## Get random locations
> randomLocations <- get_random_location_within_sub_subject(subject, seed = 1234)

## Map the locations of the random GRanges to the subjects
> map <- mapToTranscripts(randomLocations, subject)

## Are the mapped items within the ranges of the GRangesList
> all(randomLocations[map$xHits] %within% unlist(range(subject[map$transcriptsHits])))
[1] FALSE

Could one of the authors of the package look into this? Should I open a new tickets as a bug report for mapToTranscripts()?

ADD REPLY • link 7 weeks ago Marco • 0