It looks like your GTF is from Ensembl. In that case you could also use ensembldb and EnsDb databases instead of TxDb databases. With ensembldb you can use filters to extract just the data you want. The workflow to generate the database is slightly different:
library(ensembldb)
## Create the database. Note that this creates the
## SQLite database and does not return an EnsDb.
db <- ensDbFromGtf(gtf = "Homo_sapiens.GRCh38.85.gtf")
## Load this database
edb <- EnsDb(db)
## And use it: we'll use a GeneBiotypeFilter to fetch
## only exons of protein coding genes. (ensembldb releases
## before 2.0 spelled this GenebiotypeFilter.)
exonsByGene <- exonsBy(edb, by = "gene", filter = GeneBiotypeFilter("protein_coding"))
While this returns all exons of all protein coding genes, the result can still contain exons of non-coding transcripts of those genes, e.g. transcripts that are targeted for nonsense-mediated mRNA decay.
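One way to narrow this further is to filter on the transcript biotype as well. The lines below are only a sketch under a few assumptions: they use the EnsDb.Hsapiens.v86 annotation package as a stand-in for the database built from the GTF, and they use the filter names from ensembldb 2.0 or later. The variable names are made up for illustration.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v86)  # assumption: stand-in for the EnsDb built from the GTF

edb <- EnsDb.Hsapiens.v86

## Combine both filters; AnnotationFilterList joins them with a logical AND,
## so only exons of protein coding transcripts of protein coding genes remain.
pcFilter <- AnnotationFilterList(
    GeneBiotypeFilter("protein_coding"),
    TxBiotypeFilter("protein_coding")
)
exonsByGenePc <- exonsBy(edb, by = "gene", filter = pcFilter)
```

Note that an exon shared between a coding and a non-coding transcript of the same gene is still returned, because it is part of at least one protein coding transcript.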
I think this solution understates the effort involved. Creating an EnsDb with ensembldb::ensDbFromGtf() is not straightforward and depends on a lot of other software.
While ensembldb::exonsBy(edb, filter = GeneBiotypeFilter("protein_coding")) and ensembldb::addFilter(edb, GeneBiotypeFilter("protein_coding")) may be the best way to restrict an EnsDb to protein coding genes, fetching an already built EnsDb from AnnotationHub is the way to go:
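A rough sketch of the AnnotationHub route, assuming a human annotation is wanted and network access is available. Taking the last hit is only a convenience for illustration; it is usually, but not necessarily, the most recent Ensembl release, so check the printed listing and pick the record that matches your GTF version.

```r
library(AnnotationHub)
library(ensembldb)

## Connect to AnnotationHub (downloads and caches the metadata on first use).
ah <- AnnotationHub()

## List the prebuilt EnsDb databases for human.
hits <- query(ah, c("EnsDb", "Homo sapiens"))
hits

## Assumption: take the last record; replace the index with the
## AH identifier of the Ensembl release you actually need.
edb <- hits[[length(hits)]]

## Attach the filter to the database so every later query uses it.
edbPc <- addFilter(edb, GeneBiotypeFilter("protein_coding"))
exonsByGene <- exonsBy(edbPc, by = "gene")
```

Because the filter is stored in edbPc, calls such as genes(edbPc) or transcripts(edbPc) are restricted to protein coding genes as well.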
Thank you very much for your advice, it works nicely!