How to produce a dataframe with unique ENTREZID to use in gage?
colaneri ▴ 30
@colaneri-7770

Hi, I have a dataframe of read counts (CNTS) with 47,540 unique ENSEMBL IDs. I want to use gage to test for differences in gene expression over gene sets (e.g. KEGG pathways). For example:

SIG.keeg.p <- gage(CNTS, gsets=kegg.sig, ref= ref.idx, samp = samp.idx, compare = "as.group")

To use gage, the rownames of my dataframe have to be ENTREZ IDs. For that purpose I used AnnotationDbi with multiVals="asNA":

Entrez = select(org.Mm.eg.db, keys=row.names(cnts.norm), column="ENTREZID", keytype="ENSEMBL", multiVals="asNA")

According to ?select

"asNA: This will return an NA value whenever there are multiple matches"

Given that, I expected that whenever a key had multiple matches I would find an NA in the ENTREZID column of the Entrez dataframe. In other words, I expected that after removing all rows with NA values I would have a dataframe of one-to-one ENSEMBL-ENTREZ pairs.

However, this is not what I got. There are more than 400 ENSEMBL IDs mapping to more than one ENTREZ ID. See the table below.

ENSEMBL               ENTREZID
ENSMUSG00000000486    54204
ENSMUSG00000000486    100043580
ENSMUSG00000000562    11542
ENSMUSG00000000562    69296
ENSMUSG00000002250    19015
ENSMUSG00000002250    69050
ENSMUSG00000002345    72368
ENSMUSG00000002345    105980076
ENSMUSG00000002379    69875
ENSMUSG00000002379    239760
ENSMUSG00000003680    67706
ENSMUSG00000003680    225895
ENSMUSG00000003812    13423
ENSMUSG00000003812    100503676
ENSMUSG00000004455    19047
ENSMUSG00000004455    434233
ENSMUSG00000006050    24068
ENSMUSG00000006050    225372
ENSMUSG00000008450    68051
ENSMUSG00000008450    621832
ENSMUSG00000008682    110954
ENSMUSG00000008682    434434
ENSMUSG00000010097    53319
ENSMUSG00000010097    66836
ENSMUSG00000015290    27643
ENSMUSG00000015290    100169864
ENSMUSG00000015882    209707
ENSMUSG00000015882    100041576
ENSMUSG00000016559    15081
ENSMUSG00000016559    625328
ENSMUSG00000016559    667250
ENSMUSG00000018378    70393
ENSMUSG00000018378    103841
ENSMUSG00000019857    66403

The same is true in the other direction: more than 200 ENTREZ IDs map to more than one ENSEMBL ID.

I have a few questions:

1- Why did multiVals="asNA" not prevent this ambiguity in the results?

2- Is there any way to prevent this behavior of AnnotationDbi?

3- To produce a dataframe with unique ENTREZ IDs as rownames, I will have to choose one ENSEMBL ID per ENTREZ ID, e.g. between

ENSEMBL               ENTREZ
ENSMUSG00000060208    13216
ENSMUSG00000074440    13216

Which one should I choose, and based on what? Each of these ENSEMBL IDs has its own set of count values in the original CNTS dataframe, meaning that the fold change for ENTREZ 13216 in the gage analysis will depend on which ENSMUSG ID I assign to ENTREZ 13216.

How do you experts deal with this? Or maybe I am missing an important piece of information. In any case, I would really appreciate your help.

ALe

@james-w-macdonald-5106

You misunderstand the help page for select. There are two parts. First the Usage section:

Usage:

       columns(x)
       keytypes(x)
       keys(x, keytype, ...)
       select(x, keys, columns, keytype, ...)
       mapIds(x, keys, column, keytype, ..., multiVals)
       saveDb(x, file)
       loadDb(file, packageName=NA)

Note that the only function that has a multiVals argument is mapIds. Since select has an ellipsis (...) argument, you can pass ANY argument to that function and it will try to match it to the arguments of any functions that it calls. So you won't get an error by passing in random arguments, but if select doesn't call any functions that have a multiVals argument, it will just be ignored (which is what happens). However, mapIds does use it:

> mapIds(org.Mm.eg.db, "ENSMUSG00000000486", "ENTREZID","ENSEMBL", multiVals="asNA")
'select()' returned 1:many mapping between keys and columns
ENSMUSG00000000486
                NA
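
The way an ellipsis can silently swallow an argument can be seen with a minimal base R sketch. The functions `lookup` and `select_like` below are hypothetical stand-ins written for this illustration; they are not AnnotationDbi code and only mimic the situation described above.

```r
# Toy stand-ins (hypothetical, not AnnotationDbi code) showing how `...`
# can accept an argument that nothing downstream ever uses.
lookup <- function(keys, table) {
  unname(table[keys])
}

select_like <- function(keys, table, ...) {
  # `...` accepts multiVals without complaint, but it is never
  # forwarded or inspected, so it has no effect on the result.
  lookup(keys, table)
}

toy_table <- c(a = "1", b = "2")
with_arg    <- select_like(c("a", "b"), toy_table, multiVals = "asNA")
without_arg <- select_like(c("a", "b"), toy_table)

# No error and no change in behavior: both calls return the same values.
identical(with_arg, without_arg)
```

Because no error or warning is raised, a misplaced argument like this is easy to miss; checking the Usage section of the help page, as above, is the reliable way to see which function actually takes it.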

 

colaneri ▴ 30
@colaneri-7770

Right, and thanks for the clarification!

I am still trying to produce a dataframe with uniquely mapped pairs of IDs (e.g. ENSEMBL -> ENTREZID or SYMBOL -> ENTREZID).

Can you tell me why I get such different results using two different databases? And which one would you use to go on to gage and pathview?

 

library(org.Mm.eg.db)
library(EnsDb.Mmusculus.v79)
edb = EnsDb.Mmusculus.v79
entrezIds_Org = as.data.frame(mapIds(org.Mm.eg.db,keys = rownames(cnts.norm), keytype = "SYMBOL",column = "ENTREZID", multiVals = "filter"))
entrezIds_edb = as.data.frame(mapIds(edb, keys = rownames(cnts.norm), keytype = "SYMBOL",column = "ENTREZID", multiVals = "filter"))

RESULTS

> length(entrezIds_edb[!is.na(entrezIds_edb)])
[1] 14435
> length(unique(entrezIds_edb[!is.na(entrezIds_edb)]))
[1] 14428
> length(entrezIds_edb[is.na(entrezIds_edb)])
[1] 9764
> 
> length(entrezIds_Org[!is.na(entrezIds_Org)])
[1] 21565
> length(unique(entrezIds_Org[!is.na(entrezIds_Org)]))
[1] 21565
> length(entrezIds_Org[is.na(entrezIds_Org)])
[1] 3758

As you can see, I retrieved many more ENTREZIDs using org.Mm.eg.db, and multiVals="filter" seems to have done its job: 21565 ENTREZIDs were retrieved (with 24199 SYMBOL keys), and all 21565 were unique.

However, working with the Ensembl database "EnsDb.Mmusculus.v79" I got only 14435 ENTREZIDs, and some of them were not unique (meaning that multiVals="filter" did not work).
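
One thing worth checking here: multiVals acts per key, so it deals with a key that matches several values, but it does not stop two different keys from returning the same value. A minimal base R sketch of how to list such shared values follows; the vector `ids` is made-up toy data shaped like mapIds output (names are keys, values are IDs), not real annotation.

```r
# Toy named vector shaped like mapIds() output: names are keys, values are IDs.
# The symbols and IDs are made up for illustration.
ids <- c(GeneA = "101", GeneB = "102", GeneC = "101", GeneD = NA)

# Drop unmapped keys first
mapped <- ids[!is.na(ids)]

# Keep every entry whose value is shared with another key (many-to-one),
# scanning from both ends so all copies are flagged, not just the repeats.
shared <- mapped[duplicated(mapped) | duplicated(mapped, fromLast = TRUE)]
shared
```

Applied to a real result, the names of `shared` would show which SYMBOL keys collapse onto the same ENTREZID.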

 


Please don't use the 'Add your answer' box to ask another question. If you have additional questions, use the ADD COMMENT link instead.

As to why you get different mappings, that is beyond the scope of this support site. We simply re-package data that is publicly available from NCBI and EMBL-EBI. You should note however that Entrez Gene IDs are something that NCBI uses, and that EMBL-EBI have different IDs. So any mappings between SYMBOL and ENTREZID using the ensembldb package will necessarily be SYMBOL->ENSEMBLID->ENTREZID, and any mappings between Ensembl IDs and Entrez Gene IDs will tend to be fraught.

My general rule is to stay with whomever brung ya to the dance. So either stick with NCBI IDs or EMBL-EBI IDs.
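
If a cross-database mapping cannot be avoided, one conservative option is to keep only the pairs that are unique in both directions. This is an assumption made for illustration, not a recommendation from this thread. The base R sketch below uses the ambiguous ID pairs quoted in the question plus one placeholder row; the placeholder IDs and the names `map`, `n_ens`, `n_ent` and `one_to_one` are hypothetical.

```r
# Mapping table shaped like select() output. The first four rows are ID pairs
# quoted in this thread; the last row is a made-up placeholder.
map <- data.frame(
  ENSEMBL  = c("ENSMUSG00000000486", "ENSMUSG00000000486",
               "ENSMUSG00000060208", "ENSMUSG00000074440",
               "ENS_PLACEHOLDER"),
  ENTREZID = c("54204", "100043580", "13216", "13216", "1"),
  stringsAsFactors = FALSE
)

# Drop keys with no match
map <- map[!is.na(map$ENTREZID), ]

# Count how often each ID occurs on its side of the mapping
n_ens <- as.vector(table(map$ENSEMBL)[map$ENSEMBL])
n_ent <- as.vector(table(map$ENTREZID)[map$ENTREZID])

# Keep rows whose ENSEMBL ID and ENTREZID each occur exactly once
one_to_one <- map[n_ens == 1 & n_ent == 1, ]
one_to_one
```

This discards every ambiguous gene rather than choosing among candidates, so it trades coverage for certainty about which counts belong to which ID.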
