I have a series of Salmon quant.sf files that include Ensembl gene and transcript IDs with version numbers. For example:
tibble::tribble(
~ensembl_tx, ~ensembl_gene,
"ENST00000456328.2", "ENSG00000223972.5",
"ENST00000450305.2", "ENSG00000223972.5",
"ENST00000488147.1", "ENSG00000227232.5",
"ENST00000619216.1", "ENSG00000278267.1",
"ENST00000473358.1", "ENSG00000243485.5",
"ENST00000469289.1", "ENSG00000243485.5",
"ENST00000607096.1", "ENSG00000284332.1",
"ENST00000417324.1", "ENSG00000237613.2"
)
I would like to map these to Entrez IDs, get gene start/end, transcript biotype, etc. using an ensembldb package, for example EnsDb.Hsapiens.v86.
However, the Ensembl IDs in EnsDb.Hsapiens.v86 do not include any version numbers, e.g.:
library(EnsDb.Hsapiens.v86)
ens <- EnsDb.Hsapiens.v86
ensid <- keys(ens)
gene_ids <- AnnotationDbi::select(ens, keys = ensid[1:100],
    columns = c("GENEBIOTYPE", "GENESEQSTART", "GENESEQEND", "ENTREZID", "TXID"))
head(gene_ids)
           GENEID    GENEBIOTYPE GENESEQSTART GENESEQEND ENTREZID            TXID
1 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000373020
2 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000496771
3 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000494424
4 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000612152
5 ENSG00000000003 protein_coding    100627109  100639991     7105 ENST00000614008
6 ENSG00000000005 protein_coding    100584802  100599885    64102 ENST00000373031
Because of these version numbers, the IDs don't match. Consequently, the IDs in my files return nothing when I try to map them using EnsDb.Hsapiens.v86.
Is removing the version numbers the only solution here? Are there other annotation packages that include these version numbers?
Would I lose any information by removing them? Or, knowing that I'm using v86, for example, would I be able to trace back the version number of a given transcript or gene ID?
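To make the version-stripping option concrete, here is a minimal sketch of what I have in mind. It assumes the version is always a trailing ".<digits>" suffix. strip_version() and get_version() are hypothetical helper names of my own, not functions from ensembldb. The version is kept in separate columns so nothing from the original IDs is thrown away, and the lookup against EnsDb.Hsapiens.v86 only runs if that package is installed.

```r
# A few of the IDs from the example above
ids <- data.frame(
  ensembl_tx   = c("ENST00000456328.2", "ENST00000450305.2", "ENST00000488147.1"),
  ensembl_gene = c("ENSG00000223972.5", "ENSG00000223972.5", "ENSG00000227232.5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a trailing ".<digits>" version suffix, if present
strip_version <- function(x) sub("\\.[0-9]+$", "", x)

# Hypothetical helper: return the version as an integer (NA if there is none)
get_version <- function(x) {
  v <- sub("^.*\\.([0-9]+)$", "\\1", x)
  v[!grepl("\\.[0-9]+$", x)] <- NA
  as.integer(v)
}

# Keep unversioned IDs for matching and the versions for later reference
ids$tx_id        <- strip_version(ids$ensembl_tx)
ids$tx_version   <- get_version(ids$ensembl_tx)
ids$gene_id      <- strip_version(ids$ensembl_gene)
ids$gene_version <- get_version(ids$ensembl_gene)

# Look up annotation by unversioned transcript ID (skipped if the
# annotation package is not installed)
if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE)) {
  ens <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86
  anno <- AnnotationDbi::select(
    ens,
    keys    = unique(ids$tx_id),
    keytype = "TXID",
    columns = c("GENEID", "TXBIOTYPE", "GENESEQSTART", "GENESEQEND", "ENTREZID")
  )
  # Left join so transcripts missing from v86 are kept with NA annotation
  annotated <- merge(ids, anno, by.x = "tx_id", by.y = "TXID", all.x = TRUE)
}
```

With the versions kept alongside the stripped IDs, the original identifiers can always be rebuilt. What this sketch cannot tell me is whether the versions in my files are the same ones that Ensembl release 86 is based on.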
Thanks, Jo! This is very useful, I very much appreciate it!