Question

Mapping Ensembl IDs with version numbers using ensembldb annotation packages

Ezgi ▴ 60

I have a series of Salmon quant.sf files that include Ensembl gene and transcript IDs with version numbers. For example:

tibble::tribble(
          ~ensembl_tx,       ~ensembl_gene,
  "ENST00000456328.2", "ENSG00000223972.5",
  "ENST00000450305.2", "ENSG00000223972.5",
  "ENST00000488147.1", "ENSG00000227232.5",
  "ENST00000619216.1", "ENSG00000278267.1",
  "ENST00000473358.1", "ENSG00000243485.5",
  "ENST00000469289.1", "ENSG00000243485.5",
  "ENST00000607096.1", "ENSG00000284332.1",
  "ENST00000417324.1", "ENSG00000237613.2"
  )

I would like to map these to Entrez IDs and get gene start/end, transcript biotype, etc. using an ensembldb package, for example EnsDb.Hsapiens.v86. However, the Ensembl IDs in EnsDb.Hsapiens.v86 do not include any version numbers, e.g.:

library(EnsDb.Hsapiens.v86)
ens <- EnsDb.Hsapiens.v86
ensid <- keys(ens)
gene_ids <- AnnotationDbi::select(
  ens,
  keys = ensid[1:100],
  columns = c("GENEBIOTYPE", "GENESEQSTART", "GENESEQEND", "ENTREZID", "TXID")
)
head(gene_ids)

           GENEID    GENEBIOTYPE GENESEQSTART GENESEQEND ENTREZID            TXID
1 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000373020
2 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000496771
3 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000494424
4 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000612152
5 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000614008
6 ENSG00000000005 protein_coding    100584802  100599885    64102 ENST00000373031

Of course, because of these version numbers, the IDs don't match, so the IDs in my files return nothing when I try to map them using EnsDb.Hsapiens.v86. Is removing the version numbers the only solution here? Are there other annotation packages that include these version numbers? Would I lose any information by removing them? Or, knowing that I'm using v86, for example, would I be able to trace back the version number of a given transcript or gene ID?

AnnotationDbi ensembldb • 3.7k views

score 2 · Accepted Answer · 2021-02-08

Hi Ezgi,

First of all, it would be key to find out which Ensembl release the alignment was based on. You could then get the matching EnsDb database from AnnotationHub. Assuming your alignment was done against Ensembl release 100:

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "EnsDb.Hsapiens.v100")
AnnotationHub with 1 record
# snapshotDate(): 2020-10-27
# names(): AH79689
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2020-04-27
# $title: Ensembl 100 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("100", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH79689"]]' 
edb <- ah[["AH79689"]]

With your code you then get:

ensid <- keys(edb)
gene_ids <- AnnotationDbi::select(edb, keys=ensid[1:100],columns=c("GENEBIOTYPE", "GENESEQSTART", "GENESEQEND", "ENTREZID", "TXID", "TXIDVERSION"))
head(gene_ids)
           GENEID    GENEBIOTYPE GENESEQSTART GENESEQEND ENTREZID
1 ENSG00000000003 protein_coding    100627108  100639991     7105
2 ENSG00000000003 protein_coding    100627108  100639991     7105
3 ENSG00000000003 protein_coding    100627108  100639991     7105
4 ENSG00000000003 protein_coding    100627108  100639991     7105
5 ENSG00000000003 protein_coding    100627108  100639991     7105
6 ENSG00000000005 protein_coding    100584936  100599885    64102
             TXID       TXIDVERSION
1 ENST00000373020 ENST00000373020.9
2 ENST00000496771 ENST00000496771.5
3 ENST00000494424 ENST00000494424.1
4 ENST00000612152 ENST00000612152.4
5 ENST00000614008 ENST00000614008.4
6 ENST00000373031 ENST00000373031.5

So, the versioned transcript IDs are also available as column "tx_id_version" (or "TXIDVERSION" if you use the AnnotationDbi framework). There is no gene version in there, but for genes IMHO it should be OK if you drop the version from the ID.
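As a rough sketch of the two options above (matching transcripts on the versioned ID, dropping the version for genes), something like the following could work. The helper name `strip_version` and the `quant_ids` data frame are just made up for illustration, using IDs from the question; the commented-out part assumes an EnsDb loaded as `edb` as shown earlier and is untested against your data.

```r
# Versioned IDs as they appear in the question's quant.sf files
quant_ids <- data.frame(
  ensembl_tx   = c("ENST00000456328.2", "ENST00000450305.2", "ENST00000488147.1"),
  ensembl_gene = c("ENSG00000223972.5", "ENSG00000223972.5", "ENSG00000227232.5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a trailing ".<digits>" version suffix;
# IDs without such a suffix are returned unchanged
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

quant_ids$GENEID <- strip_version(quant_ids$ensembl_gene)

# Assumption: `edb` is the EnsDb retrieved from AnnotationHub above.
# Query by unversioned gene ID, then join on the versioned transcript ID,
# so a transcript whose version differs from the release shows up as NA:
# anno <- AnnotationDbi::select(
#   edb, keys = unique(quant_ids$GENEID), keytype = "GENEID",
#   columns = c("GENEBIOTYPE", "ENTREZID", "TXIDVERSION")
# )
# merged <- merge(quant_ids, anno,
#                 by.x = c("ensembl_tx", "GENEID"),
#                 by.y = c("TXIDVERSION", "GENEID"),
#                 all.x = TRUE)
```

Rows that come back with NA after the merge would hint that the EnsDb release does not match the one used to build the Salmon index.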

Hope this helps!

cheers, jo