GenomicFeatures: Select exons by transcript. Any faster option than exonsBy?
3
0
Entering edit mode
stianlagstad ▴ 90
@stianlagstad-9723
Last seen 4.8 years ago

Consider this code:

library(microbenchmark)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

microbenchmark::microbenchmark(
  # Option 1:
  GenomicFeatures::exons(
    txdb,
    vals = list(tx_name = c("uc001aaa.3", "uc010nxq.1")),
    columns = list("EXONNAME", "TXNAME", "GENEID")),
  # Option 2:
  GenomicFeatures::exonsBy(txdb, by = "tx", use.names = TRUE)[c("uc001aaa.3", "uc010nxq.1")],
  times = 10
)

# Option 1 takes an average of 1.6 seconds
# Option 2 takes an average of 6.5 seconds

These two options gets me the same data, but option1 is a lot faster. Is there an easy way I can use option1 and still get the structure that is option2? Or is there a way to make exonsBy faster by only extracting the transcripts I'm interested in?

My end goal is to get a GRangesList object with one GRanges object per transcript, OR a single GRanges object with duplicate entries for exons that appear in multiple transcripts (if that's possible). To begin with I only have the transcript names and the TxDb object.

GenomicFeatures • 4.1k views
ADD COMMENT
1
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States

The exonsBy,TxDb() wrapper is pretty inflexible. You could grab the exons and then run multisplit(exons, exons$tx_id). Or something like that.

ADD COMMENT
0
Entering edit mode

Thanks for answering. I started working on an example to help explain what I want to accomplish:

library(Gviz)
library(GenomicFeatures)

options(ucscChromosomeNames=FALSE)
txdbhg19 <- GenomicFeatures::makeTxDbFromBiomart(
  biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "hsapiens_gene_ensembl",
  host = "dec2013.archive.ensembl.org")

transcriptNames <- c("ENST00000370032", "ENST00000420055")

# ----- APPROACH 1

# Get GrangesList
tst <- exonsBy(txdbhg19, by = "tx", use.names=TRUE)[transcriptNames]
# Unlist to get GRanges
tst <- unlist(tst)
# Add transcript metadata column to make Gviz group correctly
elementMetadata(tst)$transcript <- names(tst)
# Create track and plot
trTrack <- Gviz::GeneRegionTrack(tst)
Gviz::plotTracks(trTrack)
# This is the plot I want to produce

# ----- APPROACH 2

tst2 <- GenomicFeatures::exons(
  txdbhg19,
  vals = list(tx_name = transcriptNames),
  columns = list("EXONNAME", "TXNAME", "GENEID"))

# Problem 1: The TXNAME metadata field now contains transcripts that are not in my original transcript list:
unique(unlist(elementMetadata(tst2)$TXNAME[!elementMetadata(tst2)$TXNAME %in% transcriptNames]))
# [1] "ENST00000370031" "ENST00000402983"
# I can hack this out, but it feels unclean to do this:
elementMetadata(tst2)$TXNAME <- elementMetadata(tst2)$TXNAME[elementMetadata(tst2)$TXNAME %in% transcriptNames]
# I would much rather have it so that I only get the transcript names that I originally wanted. Anyhow, let's continue..

# Each entry in elementMetadata(tst2)$TXNAME now has 1 or more transcript names in it. This is not what I want. I want to somehow "expand" this GRanges object, so that each row only have one TXNAME value. Here's a very naive way to do it:

gr <- GRanges()
for (i in 1:length(tst2)) {
  trNames <- elementMetadata(tst2[i])$TXNAME[[1]]
  trNames <- strsplit(trNames, " ")
  for (j in 1:length(trNames)) {
    newGr <- tst2[i]
    elementMetadata(newGr)$TXNAME <- trNames[[j]]
    gr <- append(gr, newGr)
    print(paste(elementMetadata(tst2[i])$GENEID, trNames[[j]]))
  }
}
elementMetadata(gr)$transcript <- elementMetadata(gr)$TXNAME

trTrack <- Gviz::GeneRegionTrack(gr)
Gviz::plotTracks(gr)
# Error in (function (classes, fdef, mtable)  :
# unable to find an inherited method for function ‘displayPars<-’ for signature ‘"GRanges", "list"’

# Why do I get this error? I was hoping that this gr structure would be similar to the tst structure in approach 1

The goal again is to get a structure similar to the one I get with exonsBy() -> unlist(). Any ideas?

ADD REPLY
1
Entering edit mode

I agree that the TxDb accessors have major usability issues. Probably my least favorite thing in Bioconductor. I agree that a restriction by TXNAME should not return other values in the TXNAME column.

To expand, just use the expand() function.

With regard to (the inefficient) Approach 1, I am surprised that Gviz cannot plot a GRangesList directly.

ADD REPLY
0
Entering edit mode

Thank you! I was not aware of the expand() function.

I still run into problems with Gviz. It's not GRangesList I'm trying to plot, it's GRanges. Do you see what I'm doing wrong?

# (Previous code same as before)
# Each entry in elementMetadata(tst2)$TXNAME now has 1 or more transcript names in it. This is not what I want. I want to somehow "expand" this GRanges object, so that each row only have one TXNAME value. Use expand() to do this
tst2 <- expand(tst2)

class(tst2)
# [1] "GRanges"
# attr(,"package")
# [1] "GenomicRanges"

trTrack <- Gviz::GeneRegionTrack(tst2)
Gviz::plotTracks(tst2)
# Error in (function (classes, fdef, mtable)  :
# unable to find an inherited method for function ‘displayPars<-’ for signature ‘"GRanges", "list"’
ADD REPLY
1
Entering edit mode

It looks like maybe you need to pass trTrack to Gviz::plotTracks() instead of tst2.

ADD REPLY
0
Entering edit mode

Truly embarrassing. Thank you!

ADD REPLY
1
Entering edit mode
Johannes Rainer ★ 2.1k
@johannes-rainer-6987
Last seen 10 weeks ago
Italy

I'm also jumping into the discussion; If you don't need to stick to UCSC annotations you could also switch over to Ensembl based annotations and use the ensembldb package; that one provides about the same functionality than the GenomicFeatures package but in addition provides a filter framework that enables to fetch only specific entries from the database. Like in your example, if you want to fetch only annotations for specific transcripts, you can specify a TxidFilter (the same way you could specify SeqnameFilter, GenenameFilter etc, and also combine different filters; just check the package's vignette for more info).

## Load the annotation package providing Ensembl based annotations for 
## Homo sapiens, Ensembl version 75.
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

transcriptNames <- c("ENST00000370032", "ENST00000420055")

## Fetch the data only for the specified transcripts.
exonsBy(edb, filter=TxidFilter(transcriptNames))

## Or to have more info you could add e.g.
exonsBy(edb, filter=TxidFilter(transcriptNames), columns=c("gene_id", "gene_name", "tx_id", "gene_biotype"))

Hope that helps.

cheers, jo

ADD COMMENT
0
Entering edit mode

This is good stuff. Why not just generalize the EnsDb stuff to include data from arbitrary sources. Then, that could just replace the current TxDb implementation. Or incorporate these ideas into GenomicFeatures.

ADD REPLY
0
Entering edit mode

good point Michael; including that into GenomicFeatures might not be that easy, as it would require to hack directly into the SQL queries.

About EnsDb from arbitrary sources, that should be in theory possible, I would have to investigate a little bit how. Since EnsDbs can already be generated from GTF or GFF files that should not really be so problematic, but I'll have to check that.

 

 

ADD REPLY
0
Entering edit mode

This looks promising. I've been using TxDb&GenomicFeatures with Ensembl data anyway, only using UCSC for examples because of TxDb.Hsapiens.UCSC.hg19.knownGene. Thank you!

ADD REPLY
0
Entering edit mode

Question: I see your Bioconductor packages EnsDb.Hsapiens.v75 and EnsDb.Hsapiens.v79. I'm guessing that this guide will help me build for version 74 (as an example). I'm building a package that right now lets users specify which TxDb object to use, and it's easy enough for most people to use GenomicFeatures::makeTxDbFromBiomart() to create one for the version of ensembl they need (specifying which host to use). If I want to use ensembldb this looks like a more difficult process. Is that right, or am I missing something? I want to make this easy for my users, and I think a manual building process like what's described in the guide is too much to ask of them.

ADD REPLY
0
Entering edit mode

Correct, best is to read the vignette ;) building the annotation packages with the Perl API is nice, but not something for the "standard" user. For that I have the ensembldb::ensDbFromGtf, ensembldb::ensDbFromGff and, probably to most convenient one, ensembldb::ensDbFromAH that allow to build the annotation database from an Gtf, Gff, or from an Gtf file provided by the AnnotationHub package. It's all described in section 10.2 of the vignette.

ADD REPLY
1
Entering edit mode

Thank you. I tried something:

library(ensembldb)
DB <- ensDbFromGtf(gtf = "/home/stian/Desktop/Homo_sapiens.GRCh37.74.gtf")
# Error in ensDbFromGtf(gtf = "/home/stian/Desktop/Homo_sapiens.GRCh37.74.gtf") : 
#   One or more required types are not in the gtf file. Need gene,transcript,exon,CDS but got only exon,CDS.

The gtf I got from here. Is this because that old version of the ensembl data had an incompatible structure?

ADD REPLY
0
Entering edit mode

Yes, I realized that too. Am currently working on a fix for that. Stay tuned ;)

ADD REPLY
0
Entering edit mode

OK, that's fixed now in version 1.3.19. Now you can also use GTF files for Ensembl < 75. This works now both for the ensDbFromGtf and for the ensDbFromAH (to fetch stuff directly from AnnotationHub, so you don't have to download stuff manually).

ADD REPLY
0
Entering edit mode

Cool! For the moment I'm unable to use the devel-version of Bioconductor, but I'll try it as soon as I can.

ADD REPLY
1
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

The TxDb are SQLite data bases, so using dplyr

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(dplyr)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
src = src_sql("sqlite", dbconn(txdb), info=DBI::dbGetInfo(dbconn(txdb)))

with tables for each of the entities and a 'splicing' describing how they join

> src
src:  sqlite 3.8.6 []
tbls: cds, chrominfo, exon, gene, metadata, splicing, transcript

So you could filter the transcripts and keep relevant information

txname = c("uc001aaa.3", "uc010nxq.1")
tbl(src, "transcript") %>% filter(tx_name %in% txname) %>%
            dplyr::select(`_tx_id`, tx_name)

and likewise for the exon table. Then do an inner_join between the transcript and splicing table, and between that and the exon table. The result might be

data <- inner_join(
    inner_join(
        tbl(src, "transcript") %>% filter(tx_name %in% txname) %>%
            dplyr::select(`_tx_id`, tx_name),
        tbl(src, "splicing") %>% dplyr::select(`_tx_id`, `exon_rank`, `_exon_id`)),
    tbl(src, "exon") %>% dplyr::select(-exon_name)) %>%
    dplyr::select(-`_tx_id`) %>% collect

and these could be made first into a GRanges and then split into a GRangesList

data %>% dplyr::select(-tx_name) %>%
    makeGRangesFromDataFrame(keep.extra.columns=TRUE) %>%
    splitAsList(data %>% .[[1]])
ADD COMMENT
0
Entering edit mode

Please tell me you're not serious.

ADD REPLY
0
Entering edit mode

Definitely thinking of the audience when writing the answer... ;)

ADD REPLY
0
Entering edit mode

Thank you! This might be a better solution long-term, since it gives me more control. I haven't tried it, but I'm wondering about this part:

tbl(src, "transcript") %>% filter(tx_name %in% txname) %>%
            dplyr::select(`_tx_id`, tx_name)

Does it first pull out everything (tbl(scr, "transcript")) and then pass it to the filter function? Or does it actually only pull out the information I'm after?

ADD REPLY
1
Entering edit mode

dplyr forms the SQL query without (full) evaluation. The SQL is evaluated with collect() (I don't think that collect is actually necessary in the above, replacing in splitAsList data %>% dplyr::select(tx_name) %>% .[[1]]) and it's up to the SQL engine to make sense of whatever you've asked it to do. Whatever it decides, the computation is in SQL rather than R, and it ends up doing something relatively efficient.

ADD REPLY
0
Entering edit mode

Thank you! dplyr is pushed up on my to-learn-list :)

ADD REPLY
1
Entering edit mode

It would be a terrible long-term solution unless your long term goals are code obfuscation and loss of encapsulation. The correct long-term solution is to fix GenomicFeatures. The advertisement and exposure of the SQL interface to annotation packages has never made sense to me. Exposure of implementation details should only be necessasry in the most extraordinary of use cases. TxDb, ensDb, biomaRt, genbankr, etc, should all implement a unified abstraction.

ADD REPLY
0
Entering edit mode

I agree. I didn't think it through earlier.

ADD REPLY

Login before adding your answer.

Traffic: 633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6