I used the AnnotationDbi::mapIds function with the EnsDb.Mmusculus.v79 package to map Ensembl gene IDs back to Entrez IDs over a long vector of Ensembl IDs.
I expected this to return a 1:1 mapping with mapIds(..., multiVals='first'), but was surprised that it returned several Entrez IDs concatenated with ";" for a given Ensembl gene ID, for instance:
R> mapIds(EnsDb.Mmusculus.v79, 'ENSMUSG00000079658', 'ENTREZID', 'GENEID')
ENSMUSG00000079658
 "67923;102642819"
I've long been working under the assumption that the multiVals parameter is meant to control this, and that it should only return a single identifier when the appropriate value for that parameter is passed (like 'first', 'last', or even 'asNA').
I must say, this finding has shaken the bedrock of all things I thought to be true and I'm having a deja vu moment back to 1999 where I'm asking myself again if I actually might be living inside of The Matrix.
Can I get an assist? Thanks :-)
I'm running on the latest bioc, but just to orient ourselves a bit, here are the versions of the relevant packages:
EnsDb.Mmusculus.v79_2.1.0, ensembldb_2.0.1, AnnotationDbi_1.38.0
That's right. Unfortunately, I'm currently concatenating Entrez Gene identifiers for the same gene using a ";" in the database. That might change in the future, when I find the time to redo the database layout.
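In the meantime, one possible workaround is to split the concatenated values yourself after the lookup. The sketch below uses only base R; pick_entrez is a hypothetical helper (it is not part of ensembldb or AnnotationDbi), and the input vector is typed in by hand to mirror the output shown in the question. It only imitates the 'first' and 'asNA' behaviours on the string that comes back, so treat it as a rough illustration rather than a replacement for multiVals.

```r
# Hypothetical helper, not part of ensembldb or AnnotationDbi:
# split ";"-concatenated IDs and keep at most one value per key.
pick_entrez <- function(ids, which = c("first", "asNA")) {
  which <- match.arg(which)
  # One list element per input ID, holding its individual Entrez IDs
  parts <- strsplit(ids, ";", fixed = TRUE)
  out <- vapply(parts, function(p) {
    # Empty or missing mapping: nothing to return
    if (length(p) == 0L || all(is.na(p))) return(NA_character_)
    # "asNA": treat any ambiguous (multi-valued) mapping as missing
    if (which == "asNA" && length(p) > 1L) return(NA_character_)
    # "first": keep the first listed ID
    p[1L]
  }, character(1))
  names(out) <- names(ids)
  out
}

# Hand-typed stand-in for the named character vector returned by mapIds()
mapped <- c(ENSMUSG00000079658 = "67923;102642819")

first_only <- pick_entrez(mapped)          # keeps "67923"
as_na      <- pick_entrez(mapped, "asNA")  # NA, because two IDs were present
```

Note that "first" here simply means first in the concatenated string, which says nothing about which Entrez ID is the better match for the gene.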
I usually try to stick with whatever group's ID I have in hand, rather than trying to cross-match, because these conflicts are inevitable. So if I have Ensembl IDs, I use the EnsDb packages or biomaRt for annotation. If I have Entrez Gene IDs, then I use the TxDb and org packages for annotation.
This particular gene is a perfect example. Only 67923 is on Chr1. The other Gene ID is just a LOC (LOC102642819), and isn't even part of the latest annotation release (it's part of 105), and according to NCBI is on Chr2. Plus NCBI doesn't even agree on the HUGO symbol:
Instead claiming that Tceb1 is an alias.
Thanks for the quick feedback, James and Johannes.
FWIW, in the meantime I'm going with using biomaRt to map these ... turns out Tceb1 is actually called Eloc now anyways ;-)