I believe that is a typographic error. There isn't (nor IIRC has there ever been) a genesBy
function in either ensembldb
nor AnnotationDbi
. It wouldn't make much sense to have one anyway, because a gene by definition (at least in Bioconductor) contains all the transcripts and all the exons for that gene. In other words, genesBy(EnsDb, "exon")
should return a GRangesList
with a single gene per exon, which is trivial and boring.
But I suppose there are some instances where a particular exon is shared by a bunch of very closely related genes, and it's trivial to roll yer own.
## First get an EnsDb
> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("ensdb","homo sapiens"))
AnnotationHub with 23 records
# snapshotDate(): 2022-10-31
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
title
AH53211 | Ensembl 87 EnsDb for Homo Sapiens
AH53715 | Ensembl 88 EnsDb for Homo Sapiens
AH56681 | Ensembl 89 EnsDb for Homo Sapiens
AH57757 | Ensembl 90 EnsDb for Homo Sapiens
AH60773 | Ensembl 91 EnsDb for Homo Sapiens
... ...
AH95744 | Ensembl 104 EnsDb for Homo sapiens
AH98047 | Ensembl 105 EnsDb for Homo sapiens
AH100643 | Ensembl 106 EnsDb for Homo sapiens
AH104864 | Ensembl 107 EnsDb for Homo sapiens
AH109336 | Ensembl 108 EnsDb for Homo sapiens
> ensdb <- hub[["AH109336"]]
downloading 1 resources
retrieving 1 resource
|======================================================================| 100%
loading from cache
## get exons by gene
> exby <- exonsBy(ensdb, "gene")
## inspect
> exby
GRangesList object of length 70616:
$ENSG00000000003
GRanges object with 20 ranges and 1 metadata column:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <character>
[1] X 100639945-100639991 - | ENSE00001828996
[2] X 100636793-100637104 - | ENSE00001863395
[3] X 100636608-100636806 - | ENSE00001855382
[4] X 100636191-100636689 - | ENSE00001886883
[5] X 100635558-100635746 - | ENSE00003662440
... ... ... ... . ...
[16] X 100632541-100632568 - | ENSE00001849132
[17] X 100632063-100632068 - | ENSE00003731560
[18] X 100630759-100630866 - | ENSE00000868868
[19] X 100627108-100629986 - | ENSE00001459322
[20] X 100627109-100629986 - | ENSE00003730948
-------
seqinfo: 457 sequences (1 circular) from GRCh38 genome
...
<70615 more elements>
## convert to GRanges, add a genes mcol and split
> exby2 <- unlist(exby)
> exby2
GRanges object with 888642 ranges and 1 metadata column:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000000003 X 100639945-100639991 - | ENSE00001828996
ENSG00000000003 X 100636793-100637104 - | ENSE00001863395
ENSG00000000003 X 100636608-100636806 - | ENSE00001855382
ENSG00000000003 X 100636191-100636689 - | ENSE00001886883
ENSG00000000003 X 100635558-100635746 - | ENSE00003662440
... ... ... ... . ...
LRG_999 19 42293592-42293836 + | LRG_999t1e18
LRG_999 19 42293935-42294089 + | LRG_999t1e19
LRG_999 19 42294173-42294304 + | LRG_999t1e20
LRG_999 19 42294604-42294735 + | LRG_999t1e21
LRG_999 19 42294824-42295796 + | LRG_999t1e22
-------
seqinfo: 457 sequences (1 circular) from GRCh38 genome
> exby2$gene_id <- names(exby2)
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
GRanges object contains 8 out-of-bound ranges located on sequence
LRG_432. Note that ranges located on a sequence whose length is unknown
(NA) or on a circular sequence are not considered out-of-bound (use
seqlengths() and isCircular() to get the lengths and circularity flags
of the underlying sequences). You can use trim() to trim these ranges.
See ?`trim,GenomicRanges-method` for more information.
> gnby <- splitAsList(exby2, exby2$exon_id)
> gnby
GRangesList object of length 888642:
$ENSE00000000001
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000244623 7 99876062-99877033 - | ENSE00000000001
gene_id
<character>
ENSG00000244623 ENSG00000244623
-------
seqinfo: 457 sequences (1 circular) from GRCh38 genome
$ENSE00000000002
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000175485 11 6199224-6200186 + | ENSE00000000002
gene_id
<character>
ENSG00000175485 ENSG00000175485
-------
seqinfo: 457 sequences (1 circular) from GRCh38 genome
$ENSE00000000003
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | exon_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000148090 9 91361628-91361918 - | ENSE00000000003
gene_id
<character>
ENSG00000148090 ENSG00000148090
-------
seqinfo: 457 sequences (1 circular) from GRCh38 genome
...
<888639 more elements>
## BORING!
> table(lengths(gnby))
1
888642
The same result isn't true for a TxDb however
Hi Jim,
I think you don't see this with Ensembl annotations because of how Ensembl assigns exon ids. It seems to me that they will always assign different ids to exons that belong to different genes, no matter what, even if those exons are actually the same (i.e. same chromosome/start/end/strand). So instead of splitting
exby2
based on the official Ensembl exon ids (exby2$exon_id
), you might want to split based on an id that is based on the genomic location of the exon only. You can generate such id with something likeexon_altid <- match(exby2, unique(exby2))
.H.
Thanks. I thought it might be a shortcut for
unlist(reduce(exonsBy(ensdb, "gene")))
to get one gene region per gene symbol.