Hi, I have a data frame of read counts (CNTS) with 47,540 unique ENSEMBL IDs. I want to use gage to test for differences in gene expression over gene sets (e.g. KEGG pathways). For example:
SIG.keeg.p <- gage(CNTS, gsets=kegg.sig, ref= ref.idx, samp = samp.idx, compare = "as.group")
To use gage I have to set the row names of my data frame to ENTREZ IDs. For that purpose I used AnnotationDbi with multiVals="asNA":
Entrez = select(org.Mm.eg.db, keys=row.names(cnts.norm), column="ENTREZID", keytype="ENSEMBL", multiVals="asNA")
According to ?select:
"asNA: This will return an NA value whenever there are multiple matches"
Given that, I was expecting that whenever one of my keys matches multiple values I would get an NA in the ENTREZID column of the Entrez data frame. In other words, I was expecting that after removing all the rows with NA values I would have a data frame of one-to-one ENSEMBL-ENTREZ pairs.
However, this is not what I got. There are more than 400 ENSEMBL IDs mapping to more than one ENTREZ ID. See the table below.
ENSEMBL | ENTREZID
ENSMUSG00000000486 | 54204
ENSMUSG00000000486 | 100043580
ENSMUSG00000000562 | 11542
ENSMUSG00000000562 | 69296
ENSMUSG00000002250 | 19015
ENSMUSG00000002250 | 69050
ENSMUSG00000002345 | 72368
ENSMUSG00000002345 | 105980076
ENSMUSG00000002379 | 69875
ENSMUSG00000002379 | 239760
ENSMUSG00000003680 | 67706
ENSMUSG00000003680 | 225895
ENSMUSG00000003812 | 13423
ENSMUSG00000003812 | 100503676
ENSMUSG00000004455 | 19047
ENSMUSG00000004455 | 434233
ENSMUSG00000006050 | 24068
ENSMUSG00000006050 | 225372
ENSMUSG00000008450 | 68051
ENSMUSG00000008450 | 621832
ENSMUSG00000008682 | 110954
ENSMUSG00000008682 | 434434
ENSMUSG00000010097 | 53319
ENSMUSG00000010097 | 66836
ENSMUSG00000015290 | 27643
ENSMUSG00000015290 | 100169864
ENSMUSG00000015882 | 209707
ENSMUSG00000015882 | 100041576
ENSMUSG00000016559 | 15081
ENSMUSG00000016559 | 625328
ENSMUSG00000016559 | 667250
ENSMUSG00000018378 | 70393
ENSMUSG00000018378 | 103841
ENSMUSG00000019857 | 66403
The same is true for ENTREZ IDs. There are also more than 200 ENTREZ IDs mapping to more than one ENSEMBL ID.
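For anyone wanting to reproduce the "one-to-one pairs only" result described above, here is a minimal base R sketch that filters a mapping table by hand. It assumes the mapping is a two-column data frame like the one returned above; the object names (map, one_to_one) and the tiny toy table are made up for illustration. In the AnnotationDbi documentation multiVals is described as an argument of mapIds(); whether select() honors it is not something confirmed here, which is why the filtering is done explicitly.

```r
# Toy mapping table (illustrative subset of pairs quoted in this post;
# the last pair is treated as unique here only for the example).
map <- data.frame(
  ENSEMBL  = c("ENSMUSG00000000486", "ENSMUSG00000000486",
               "ENSMUSG00000060208", "ENSMUSG00000074440",
               "ENSMUSG00000019857"),
  ENTREZID = c("54204", "100043580", "13216", "13216", "66403"),
  stringsAsFactors = FALSE
)

# Drop unmapped keys and exact duplicate pairs first.
map <- unique(map[!is.na(map$ENTREZID), ])

# Count how often each ID occurs on each side of the mapping.
n_ens <- table(map$ENSEMBL)
n_ent <- table(map$ENTREZID)

# Keep only pairs where both IDs occur exactly once (strict one-to-one).
is_unique_ens <- as.vector(n_ens[map$ENSEMBL]) == 1
is_unique_ent <- as.vector(n_ent[map$ENTREZID]) == 1
one_to_one <- map[is_unique_ens & is_unique_ent, ]

print(one_to_one)
```

This discards every ambiguous gene in both directions, which is the strictest choice; it trades coverage for an unambiguous set of row names.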
I have a couple of questions.
1- Why did multiVals="asNA" not prevent this ambiguity in the results?
2- Is there any way to prevent this behavior of AnnotationDbi?
3- To produce a data frame with unique Entrez IDs as row names I will have to choose one, e.g. between
ENSEMBL | ENTREZ
ENSMUSG00000060208 | 13216
ENSMUSG00000074440 | 13216
Which one should I choose, and based on what? Each of these ENSEMBL IDs has its own set of count values in the original CNTS data frame, meaning that the fold change for ENTREZ 13216 in the gage analysis will depend on which ENSMUSG ID I assign to ENTREZ 13216.
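To make the choice above concrete, here is a small base R sketch of two options that are sometimes used when several rows share one Entrez ID: summing their counts, or keeping the row with the highest mean count. Neither is endorsed by this thread as the correct approach. The count matrix, the sample names (s1 to s3), and the last ENSEMBL/Entrez pair are invented for illustration, and the sketch assumes raw counts with rows already matched to Entrez IDs.

```r
# Hypothetical count matrix: rows are ENSEMBL IDs, columns are samples.
cnts <- matrix(
  c(10, 12,  8,
     5,  7,  6,
    20, 18, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSMUSG00000060208", "ENSMUSG00000074440", "ENSMUSG00000019857"),
    c("s1", "s2", "s3")
  )
)

# Entrez ID for each row, in the same order as rownames(cnts).
entrez <- c("13216", "13216", "66403")

# Option A: sum the counts of all ENSEMBL rows sharing an Entrez ID.
summed <- rowsum(cnts, group = entrez)

# Option B: per Entrez ID, keep the ENSEMBL row with the highest mean count.
ord  <- order(rowMeans(cnts), decreasing = TRUE)
keep <- ord[!duplicated(entrez[ord])]
top  <- cnts[keep, , drop = FALSE]
rownames(top) <- entrez[keep]

print(summed)
print(top)
```

Both results have unique Entrez IDs as row names, so either could be passed on to gage. They can give different fold changes for the same Entrez ID, so the choice should be stated alongside the results.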
How do you experts deal with this? Or maybe I am missing an important piece of information. In any case, I would really appreciate your help.
ALe
As to why you get different mappings, that is beyond the scope of this support site. We simply re-package data that is publicly available from NCBI and EMBL-EBI. You should note however that Entrez Gene IDs are something that NCBI uses, and that EMBL-EBI have different IDs. So any mappings between SYMBOL and ENTREZID using the ensembldb package will necessarily be SYMBOL->ENSEMBLID->ENTREZID, and any mappings between Ensembl IDs and Entrez Gene IDs will tend to be fraught.
My general rule is to stay with whomever brung ya to the dance. So either stick with NCBI IDs or EMBL-EBI IDs.