Question

Annotating DESeq output using AnnotationDbi mapIds and most results say NA

Mike ▴ 10

@mike-12142

Last seen 3.5 years ago

Canada

I have a list of ~31,000 mouse transcripts with their Ensembl transcript IDs that I'm trying to annotate using AnnotationDbi and the org.Mm.eg.db database. I'm using R v3.3.2, and the object containing the IDs is called "temp". My command is:

mapIds(org.Mm.eg.db, keys=row.names(temp), keytype="ENSEMBLTRANS", column="SYMBOL", multiVals="first")

Only ~8500 get annotated with a gene name/symbol while the rest get "NA". If I search some of the NAs on Ensembl they match to transcripts/genes correctly. Some examples:

Transcript ID from "temp"	Result from mapIds	Link to Ensembl record
ENSMUST00000000001	Gnai3	Ensembl
ENSMUST00000000028	NA	Ensembl
ENSMUST00000000049	NA	Ensembl
ENSMUST00000000058	Cav2	Ensembl

You can see that even the two NAs have Ensembl transcript records, so why are they not being annotated by AnnotationDbi?

The command also outputs this, which I'm not sure is relevant or something to worry about:

'select()' returned 1:many mapping between keys and columns

AnnotationDbi mapIds org.Mm.eg.db ENSEMBLTRANS • 5.4k views

ADD COMMENT • link updated 8.2 years ago by James W. MacDonald 68k • written 8.2 years ago by Mike ▴ 10

score 3 · Answer 1 · 2017-01-10

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

The org.Mm.eg.db package is based on NCBI annotations, whereas you are trying to annotate using Ensembl IDs. Trying to use one annotation source to annotate another is a recipe for heartache, for a number of reasons. You will be better off just using Ensembl databases to do the mapping. You can do that by using either the EnsDb packages that Johannes Rainer provides or the biomaRt package:

> library(EnsDb.Mmusculus.v79)
> ids <- c("ENSMUST00000000001", "ENSMUST00000000028", "ENSMUST00000000049", "ENSMUST00000000058")
> mapIds(EnsDb.Mmusculus.v79, ids, "SYMBOL","TXNAME")
ENSMUST00000000001 ENSMUST00000000028 ENSMUST00000000049 ENSMUST00000000058
           "Gnai3"            "Cdc45"             "Apoh"             "Cav2"

Or maybe more usefully

> select(EnsDb.Mmusculus.v79, ids, "SYMBOL","TXNAME")
              TXNAME SYMBOL               TXID
1 ENSMUST00000000001  Gnai3 ENSMUST00000000001
2 ENSMUST00000000028  Cdc45 ENSMUST00000000028
3 ENSMUST00000000049   Apoh ENSMUST00000000049
4 ENSMUST00000000058   Cav2 ENSMUST00000000058

Or using biomaRt

> library(biomaRt)
> mart <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart)
  mgi_symbol ensembl_transcript_id
1      Gnai3    ENSMUST00000000001
2      Cdc45    ENSMUST00000000028
3       Apoh    ENSMUST00000000049
4       Cav2    ENSMUST00000000058

ADD COMMENT • link 8.2 years ago James W. MacDonald 68k

Thank you, it's working much better now, but it is still missing about 10% of them. It matches everything up to ENSMUST00000195885 and returns NA for all subsequent transcript IDs. Here are the 10 around ENSMUST00000195885:

ENSMUST00000195877	RP24-144H23.5
ENSMUST00000195879	RP23-415F2.1
ENSMUST00000195880	RP24-429G21.6
ENSMUST00000195881	RP23-379F6.3
ENSMUST00000195885	RP24-336M14.2
ENSMUST00000195892	NA
ENSMUST00000195897	NA
ENSMUST00000195905	NA
ENSMUST00000195908	NA
ENSMUST00000195914	NA

Command is:

mapIds(EnsDb.Mmusculus.v79, keys=row.names(temp), column="SYMBOL", keytype="TXNAME", multiVals="first")

Also one of the results is now blank: ENSMUST00000077235

When using org.Mm.eg.db it correctly finds Dhrsx (Ensembl link).
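
A quick way to see how many IDs are still unresolved, counting blank symbols as missing, is to post-process the named character vector that mapIds() returns. This is a minimal base-R sketch: the vector `res` is a hand-made stand-in built from IDs quoted in this thread, and `unresolved_ids` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: names of the keys whose symbol is NA or an empty string.
unresolved_ids <- function(res) {
  # nzchar(NA) is TRUE, so entries that are already NA are left alone here
  res[!nzchar(res)] <- NA
  names(res)[is.na(res)]
}

# Stand-in for a mapIds() result (named character vector, key -> symbol)
res <- c(ENSMUST00000195885 = "RP24-336M14.2",
         ENSMUST00000195892 = NA,
         ENSMUST00000077235 = "")

unresolved_ids(res)
```

With the real result you would call `unresolved_ids()` on the output of the mapIds() command above and inspect or count the returned IDs.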

ADD REPLY • link 8.2 years ago Mike ▴ 10

If you want more recent transcripts, you need to use a more recent version of the Ensembl database. The version that Johannes provides is based on Ensembl version 79 (hence the v79 in the name), which is rather old. biomaRt, by default, queries the current Ensembl release:

> ids <- paste0("ENSMUST00000", c(195892, 195897, 195905, 195908, 195914))
> mart <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart)
  mgi_symbol ensembl_transcript_id
1     Gm9484    ENSMUST00000195892
2    Gm44357    ENSMUST00000195897
3      Frrs1    ENSMUST00000195905
4    Gm42630    ENSMUST00000195908
5    Gm43174    ENSMUST00000195914

And if we check the archived version 79 (March 2015), those IDs are not there:

> mart2 <- useMart("ENSEMBL_MART_ENSEMBL", "mmusculus_gene_ensembl", host = "mar2015.archive.ensembl.org")
> getBM(c("mgi_symbol","ensembl_transcript_id"), "ensembl_transcript_id", ids, mart2)
[1] mgi_symbol            ensembl_transcript_id
<0 rows> (or 0-length row.names)
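
The same comparison can be wrapped so that the release is chosen by number rather than by archive host name. This is only a sketch: it assumes a biomaRt version that provides `useEnsembl()` with a `version` argument, that the requested release is still served by the Ensembl archives, and `symbols_for_release` is a hypothetical helper name.

```r
# Hypothetical helper: MGI symbols for mouse transcript IDs in one Ensembl release.
# Assumes biomaRt is installed and the Ensembl servers are reachable.
symbols_for_release <- function(ids, release = NULL) {
  mart <- biomaRt::useEnsembl(
    biomart = "genes",
    dataset = "mmusculus_gene_ensembl",
    version = release          # NULL (the default) means the current release
  )
  biomaRt::getBM(
    attributes = c("mgi_symbol", "ensembl_transcript_id"),
    filters    = "ensembl_transcript_id",
    values     = ids,
    mart       = mart
  )
}
```

Calling `symbols_for_release(ids)` and `symbols_for_release(ids, release = 79)` would then mirror the two queries above. `biomaRt::listEnsemblArchives()` shows which releases are currently available, so check it before relying on an old release number.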

ADD REPLY • link 8.2 years ago James W. MacDonald 68k

The reason for the missing entry might be that in Ensembl version 79 the transcript/gene was not annotated yet to that symbol. Locally I have EnsDb.Mmusculus.v87 and there it is annotated to DHRSX.

ADD REPLY • link 8.2 years ago Johannes Rainer ★ 2.1k

You beat me by 3 minutes James ;)

Mike, if you need the new EnsDb just drop me a line.

cheers, jo

ADD REPLY • link 8.2 years ago Johannes Rainer ★ 2.1k

Hi Johannes,

Any chance you can make EnsDb.Mmusculus.v87 available through the Bioconductor annotation pages? Or any other way? For my data set I (also) would like to make use of the latest annotation info available. :)

Thanks,

Guido

ADD REPLY • link 8.1 years ago Guido Hooiveld ★ 4.1k

Actually, with the current development version it is possible to get EnsDb databases for all species from Ensembl 87 through AnnotationHub:

> library(AnnotationHub)
> ah <- AnnotationHub()
> query(ah, c("EnsDb", "mus musculus", "87"))
AnnotationHub with 1 record
# snapshotDate(): 2017-02-07
# names(): AH53222
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# $title: Ensembl 87 EnsDb for Mus Musculus
# $description: Gene and protein annotations for Mus Musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "87", "AHEnsDbs")
# retrieve record with 'object[["AH53222"]]'

As said, that's in the development version of Bioconductor (3.5), so it is not yet officially available.
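
Once the record is visible, fetching it and using it for the mapping could look like the sketch below. It assumes AnnotationHub and ensembldb are installed and that record AH53222 from the listing above is still served for your Bioconductor version. `symbols_from_hub` is a hypothetical helper name.

```r
# Hypothetical helper: map transcript IDs to symbols with an EnsDb from AnnotationHub.
symbols_from_hub <- function(ids, record = "AH53222") {
  # ensembldb supplies the mapIds() method for EnsDb objects
  stopifnot(requireNamespace("ensembldb", quietly = TRUE))
  ah  <- AnnotationHub::AnnotationHub()
  edb <- ah[[record]]          # downloads and caches the database on first use
  AnnotationDbi::mapIds(edb, keys = ids, column = "SYMBOL",
                        keytype = "TXNAME", multiVals = "first")
}
```

Usage would be `symbols_from_hub(row.names(temp))`, with the same arguments to mapIds() as in the v79 command earlier in the thread.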

In the meantime you can download the corresponding SQLite file from https://cloud.scientificnet.org/index.php/s/q4vZQ1pq96Hl6sq, but beware: the download will be slow. You can then load the database by calling the EnsDb function with the full path to the SQLite file as its argument.
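
For the downloaded file, a minimal sketch of that last step follows. It assumes ensembldb is installed; the file path is whatever you saved the download as, and `symbols_from_sqlite` is a hypothetical helper name.

```r
# Hypothetical helper: open an EnsDb SQLite file and map transcript IDs to symbols.
symbols_from_sqlite <- function(ids, sqlite_file) {
  # Fail early with a clear error if the path is wrong
  stopifnot(file.exists(sqlite_file))
  edb <- ensembldb::EnsDb(sqlite_file)
  AnnotationDbi::mapIds(edb, keys = ids, column = "SYMBOL",
                        keytype = "TXNAME", multiVals = "first")
}
```

Pass the full path of the SQLite file as `sqlite_file`. The existence check is there so that a mistyped path stops with an error before any database call is made.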