Annotating DESeq output using AnnotationDbi mapIds: most results are NA
Mike ▴ 10
@mike-12142
Last seen 3.2 years ago
Canada

I have a list of ~31,000 mouse transcripts with their Ensembl transcript IDs that I'm trying to annotate using AnnotationDbi and the org.Mm.eg.db database. I am using R v3.3.2, and the object containing the IDs is called "temp". My command is:

mapIds(org.Mm.eg.db, keys=row.names(temp), keytype="ENSEMBLTRANS", column="SYMBOL", multiVals="first")

Only ~8500 get annotated with a gene name/symbol while the rest get "NA". If I search some of the NAs on Ensembl they match to transcripts/genes correctly. Some examples:

Transcript ID from "temp"   Result from mapIds   Link to Ensembl record
ENSMUST00000000001          Gnai3                Ensembl
ENSMUST00000000028          NA                   Ensembl
ENSMUST00000000049          NA                   Ensembl
ENSMUST00000000058          Cav2                 Ensembl

You can see that even the 2 NAs have Ensembl transcript records so why are they not getting annotated by AnnotationDbi?

The command also outputs this, which I'm not sure is relevant or something to worry about:

'select()' returned 1:many mapping between keys and columns
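
On that message: it is informational and only says that at least one key matched more than one row. It is unrelated to the NAs, which come from keys with no match at all. The base R sketch below imitates both cases with a toy lookup table; the names TX1..TX3 and SymA..SymC are made-up placeholders, and match() is used here only as a stand-in for what multiVals="first" does.

```r
# Toy stand-in for the table that select() builds internally.
# TX1..TX3 and SymA..SymC are made-up placeholders, not real identifiers.
lookup <- data.frame(
  ENSEMBLTRANS = c("TX1", "TX1", "TX2"),
  SYMBOL       = c("SymA", "SymB", "SymC"),
  stringsAsFactors = FALSE
)
query <- c("TX1", "TX2", "TX3")

# Similar in spirit to multiVals = "first": match() returns the first hit
# per key, and keys absent from the lookup come back as NA.
first_hit <- setNames(lookup$SYMBOL[match(query, lookup$ENSEMBLTRANS)], query)
print(first_hit)

# Count the hits per key to separate "many" (2 or more) from "missing" (0).
hits_per_key <- table(factor(lookup$ENSEMBLTRANS, levels = query))
print(hits_per_key)
```

Here TX1 is the 1:many case (two rows, first one kept) and TX3 is the NA case (no row at all).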

 

AnnotationDbi mapIds org.Mm.eg.db ENSEMBLTRANS • 5.0k views
@james-w-macdonald-5106
Last seen 1 day ago
United States

The org.Mm.eg.db package is based on NCBI annotations, whereas you are trying to annotate using Ensembl IDs. Using one annotation source to annotate another is a recipe for heartache, for a number of reasons. You will be better off using an Ensembl database for the mapping, either through the EnsDb packages that Johannes Rainer provides or through the biomaRt package:

> library(EnsDb.Mmusculus.v79)
> ids <- c("ENSMUST00000000001", "ENSMUST00000000028", "ENSMUST00000000049", "ENSMUST00000000058")
> mapIds(EnsDb.Mmusculus.v79, ids, "SYMBOL","TXNAME")
ENSMUST00000000001 ENSMUST00000000028 ENSMUST00000000049 ENSMUST00000000058
           "Gnai3"            "Cdc45"             "Apoh"             "Cav2"

Or maybe more usefully

> select(EnsDb.Mmusculus.v79, ids, "SYMBOL","TXNAME")
              TXNAME SYMBOL               TXID
1 ENSMUST00000000001  Gnai3 ENSMUST00000000001
2 ENSMUST00000000028  Cdc45 ENSMUST00000000028
3 ENSMUST00000000049   Apoh ENSMUST00000000049
4 ENSMUST00000000058   Cav2 ENSMUST00000000058

Or using biomaRt

> library(biomaRt)
> mart <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart)
  mgi_symbol ensembl_transcript_id
1      Gnai3    ENSMUST00000000001
2      Cdc45    ENSMUST00000000028
3       Apoh    ENSMUST00000000049
4       Cav2    ENSMUST00000000058
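
When attaching symbols to a results table, it is safer not to rely on row order: as far as I know, getBM() does not promise to return rows in the order of the input IDs, and IDs without a match are simply absent from its output. A minimal base R sketch of aligning by ID with match() follows; the fold-change values are made up, and the annotation table is shuffled on purpose to show that order does not matter.

```r
# Made-up results table keyed by transcript ID (fold changes are placeholders).
ids <- c("ENSMUST00000000001", "ENSMUST00000000028",
         "ENSMUST00000000049", "ENSMUST00000000058")
res <- data.frame(log2FoldChange = c(1.2, -0.5, 0.3, 2.1), row.names = ids)

# Annotation table shaped like the getBM() output above, deliberately shuffled.
bm <- data.frame(
  mgi_symbol            = c("Cav2", "Gnai3", "Apoh", "Cdc45"),
  ensembl_transcript_id = ids[c(4, 1, 3, 2)],
  stringsAsFactors = FALSE
)

# Align by ID; transcripts missing from 'bm' would get NA here.
res$symbol <- bm$mgi_symbol[match(rownames(res), bm$ensembl_transcript_id)]
print(res)
```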

Thank you, it's working much better now, but about 10% of them are still missing. It matches everything up to ENSMUST00000195885 and returns NA for all subsequent transcript IDs. Here are the 10 around ENSMUST00000195885:

ENSMUST00000195877 RP24-144H23.5
ENSMUST00000195879 RP23-415F2.1
ENSMUST00000195880 RP24-429G21.6
ENSMUST00000195881 RP23-379F6.3
ENSMUST00000195885 RP24-336M14.2
ENSMUST00000195892 NA
ENSMUST00000195897 NA
ENSMUST00000195905 NA
ENSMUST00000195908 NA
ENSMUST00000195914 NA

Command is:

mapIds(EnsDb.Mmusculus.v79, keys=row.names(temp), column="SYMBOL", keytype="TXNAME", multiVals="first")

Also one of the results is now blank: ENSMUST00000077235

When using org.Mm.eg.db it correctly finds Dhrsx (Ensembl link).

If you want more recent transcripts, you need to use a more recent version of the Ensembl database. The version that Johannes provides is based on Ensembl release 79 (hence the v79 in the name), which is rather old. biomaRt queries the current release by default:

> ids <- paste0("ENSMUST00000", c(195892, 195897, 195905, 195908, 195914))
> mart <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart)
  mgi_symbol ensembl_transcript_id
1     Gm9484    ENSMUST00000195892
2    Gm44357    ENSMUST00000195897
3      Frrs1    ENSMUST00000195905
4    Gm42630    ENSMUST00000195908
5    Gm43174    ENSMUST00000195914

And if we check the archived release 79 (March 2015), none of these IDs exist yet:

> mart2 <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl", host="mar2015.archive.ensembl.org")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart2)
[1] mgi_symbol            ensembl_transcript_id
<0 rows> (or 0-length row.names)
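
The same kind of check can be done locally, without going through biomaRt. The sketch below is one possible way, not the only one: split_by_presence() is a hypothetical helper written for this example, and the second half only runs if EnsDb.Mmusculus.v79 is installed. It uses the TXID keytype on the assumption that the query IDs are Ensembl transcript IDs.

```r
# Hypothetical helper: split query IDs into those present in / absent from
# a set of known keys.
split_by_presence <- function(query, known) {
  list(found = intersect(query, known), missing = setdiff(query, known))
}

# Toy check with two IDs from this thread; 'known' here is a stand-in set.
ids <- paste0("ENSMUST00000", c(195885, 195892))
toy <- split_by_presence(ids, known = "ENSMUST00000195885")
print(toy$missing)

# Real check, only run when the annotation package is installed.
if (requireNamespace("EnsDb.Mmusculus.v79", quietly = TRUE)) {
  library(EnsDb.Mmusculus.v79)
  edb <- EnsDb.Mmusculus.v79
  print(ensemblVersion(edb))             # Ensembl release behind the package
  known <- keys(edb, keytype = "TXID")   # all transcript IDs in that release
  print(split_by_presence(ids, known)$missing)
}
```

IDs reported as missing are not in that release at all, so no choice of column or multiVals will recover them; a newer release is needed.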

The reason for the missing entry might be that in Ensembl version 79 the transcript/gene had not yet been assigned that symbol. Locally I have EnsDb.Mmusculus.v87, and there it is annotated to DHRSX.

 


You beat me by 3 minutes James ;)

Mike, if you need the new EnsDb just drop me a line.

cheers, jo

Hi Johannes,

Any chance you can make EnsDb.Mmusculus.v87 available through the Bioconductor annotation pages? Or any other way? For my data set I (also) would like to make use of the latest annotation info available. :)

Thanks,

Guido

Actually, with the current development version of Bioconductor it is possible to get EnsDb databases for all species in Ensembl 87 from AnnotationHub:

> library(AnnotationHub)
> ah <- AnnotationHub()
> query(ah, c("EnsDb", "mus musculus", "87"))
AnnotationHub with 1 record
# snapshotDate(): 2017-02-07
# names(): AH53222
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# $title: Ensembl 87 EnsDb for Mus Musculus
# $description: Gene and protein annotations for Mus Musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "87", "AHEnsDbs")
# retrieve record with 'object[["AH53222"]]'

As I said, that's in the development version of Bioconductor (3.5), so it is not yet officially released.

In the meantime you can download the corresponding SQLite file from https://cloud.scientificnet.org/index.php/s/q4vZQ1pq96Hl6sq - but beware - the download will be slow. You can then load the corresponding EnsDb by passing the full path of the SQLite file to the EnsDb function.
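
A minimal sketch of that second route, assuming the file has already been downloaded. map_with_sqlite() is a hypothetical wrapper written for this example, and the path in the usage line is a placeholder. The TXID keytype is used on the assumption that the keys are Ensembl transcript IDs.

```r
# Hypothetical wrapper: open an EnsDb from a downloaded SQLite file and map
# transcript IDs to gene symbols. Stops early if a prerequisite is missing.
map_with_sqlite <- function(sqlite_path, ids) {
  if (!file.exists(sqlite_path)) {
    stop("SQLite file not found: ", sqlite_path)
  }
  if (!requireNamespace("ensembldb", quietly = TRUE)) {
    stop("The ensembldb package is not installed")
  }
  edb <- ensembldb::EnsDb(sqlite_path)   # full path to the SQLite file
  AnnotationDbi::mapIds(edb, keys = ids, column = "SYMBOL", keytype = "TXID")
}

# Usage (placeholder path, replace with wherever the file was saved):
# map_with_sqlite("/full/path/to/downloaded_file.sqlite", "ENSMUST00000077235")
```

With the AnnotationHub route, the object retrieved with ah[["AH53222"]] should be usable in place of edb in the same mapIds call.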

Thanks! Will go for the 1st option!
