Help me understand org.Hs.eg.db
Daren Tan ▴ 120
@daren-tan-3309
Last seen 10.2 years ago
I am using two approaches to get Entrez ID to gene mappings, as well as gene to Entrez ID mappings. toTable gives the same number of mappings in both directions, but mget doesn't. Which approach should I trust, and why?

> dim(toTable(org.Hs.egSYMBOL2EG))
[1] 39824     2
> dim(toTable(org.Hs.egSYMBOL))
[1] 39824     2
> length(mget(mappedRkeys(org.Hs.egSYMBOL2EG), org.Hs.egSYMBOL2EG))
[1] 39800
> length(mget(mappedLkeys(org.Hs.egSYMBOL), org.Hs.egSYMBOL))
[1] 39824

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] splines tools stats graphics grDevices utils datasets methods base

other attached packages:
 [1] KEGG.db_2.2.5 GOstats_2.8.0 Category_2.8.4 genefilter_1.22.0 survival_2.34-1 RBGL_1.18.0 annotate_1.20.1
 [8] xtable_1.5-4 GO.db_2.2.5 graph_1.20.0 org.Hs.eg.db_2.2.6 RSQLite_0.7-1 DBI_0.2-4 AnnotationDbi_1.4.3
[15] Biobase_2.2.2

loaded via a namespace (and not attached):
[1] cluster_1.11.12 gdata_2.4.2 gplots_2.6.0 GSEABase_1.4.0 gtools_2.5.0-1 xlsReadWritePro_1.4.0
[7] XML_2.1-0
GO GO • 2.3k views
@christof-winter-3264
Last seen 10.2 years ago
Daren Tan wrote, On 04.04.2009 06:06:
> I am using two approaches to get EntrezID to genes mapping, as well as
> genes to EntrezID mappings. toTable gives same number of mappings in
> both directions, but mget doesn't. Which approach should I trust and
> why ?
>
>> dim(toTable(org.Hs.egSYMBOL2EG))
> [1] 39824 2
>> dim(toTable(org.Hs.egSYMBOL))
> [1] 39824 2
>
>> length(mget(mappedRkeys(org.Hs.egSYMBOL2EG), org.Hs.egSYMBOL2EG))
> [1] 39800
>> length(mget(mappedLkeys(org.Hs.egSYMBOL), org.Hs.egSYMBOL))
> [1] 39824

Dear Daren:

It seems that some gene symbols have more than one Entrez Gene ID mapped to them:

> x = mget(mappedRkeys(org.Hs.egSYMBOL2EG), org.Hs.egSYMBOL2EG)
> sum(listLen(x) > 1)
[1] 24

If you really care about the correct number, you could look up those Entrez Gene IDs at NCBI and decide in each case how to count them:

> x[listLen(x) > 1]

HTH,
Christof

--
Christof Winter
Bioinformatics Group
Biotechnologisches Zentrum
Technische Universität Dresden
Tatzberg 47-51
01307 Dresden
Germany
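To make the counting concrete, here is a minimal base R sketch that does not touch org.Hs.eg.db at all. The data frame `links`, its column names, and the symbols and IDs in it are made up for illustration; `split()` and `lengths()` stand in for `mget()` and `listLen()`. It shows why the flat table can have more rows than there are mapped symbols: every ID beyond the first for a symbol adds a row but not a key. In the numbers reported in this thread, 39824 - 39800 = 24, which is what one would see if each of the 24 multi-ID symbols maps to exactly two IDs.

```r
# Hypothetical flat mapping, one row per link (the shape toTable() returns).
links <- data.frame(
  gene_id = c("101", "102", "103", "104"),
  symbol  = c("SYM1", "SYM1", "SYM2", "SYM3"),
  stringsAsFactors = FALSE
)

# Group IDs by symbol: a plain-list stand-in for the mget() result.
ids_by_symbol <- split(links$gene_id, links$symbol)

n_links   <- nrow(links)                       # rows in the flat table
n_symbols <- length(ids_by_symbol)             # distinct mapped symbols
n_multi   <- sum(lengths(ids_by_symbol) > 1)   # symbols with more than one ID

# Every ID beyond the first for a symbol adds a row but not a key.
extra_links <- sum(lengths(ids_by_symbol) - 1)
stopifnot(n_links == n_symbols + extra_links)

# The symbols that would need a manual decision.
ids_by_symbol[lengths(ids_by_symbol) > 1]
```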
Hi guys,

toTable() is designed to give a different result from mappedRkeys() and mappedLkeys(). toTable() just puts the whole mapping in table form, while the mapped(L|R)keys functions return only the unique (left or right) keys that are mapped.

As Christof pointed out, in the case of gene symbols this will sometimes look bad, because gene symbols are really HORRIBLE as identifiers. Gene symbols are not unique, and as a result they are often "correctly" mapped onto several very different genes. So for example, should CHD5 belong to "chromodomain helicase DNA binding protein 5" or to "Coronary heart disease, susceptibility to, 5"? The scientific community still has not resolved all of these "conflicts", so we are stuck with this problem.

For best results, use a real identifier such as an Entrez Gene ID when tracking genes.

Marc
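As a rough illustration of why a non-unique symbol is risky as a key, the base R sketch below joins measurements to an annotation table twice: once by symbol and once by Entrez ID. All tables, column names, symbols, IDs, and values are hypothetical and only meant to show the effect; nothing here calls org.Hs.eg.db.

```r
# Hypothetical annotation: the symbol "SYM1" is shared by two Entrez IDs.
annot <- data.frame(
  entrez = c("101", "102", "103"),
  symbol = c("SYM1", "SYM1", "SYM2"),
  stringsAsFactors = FALSE
)

# Hypothetical measurements keyed by gene symbol.
expr_by_symbol <- data.frame(
  symbol = c("SYM1", "SYM2"),
  value  = c(2.5, 1.0),
  stringsAsFactors = FALSE
)

# Joining on symbol copies the SYM1 measurement onto both IDs.
joined_symbol <- merge(expr_by_symbol, annot, by = "symbol")
nrow(joined_symbol)

# The same measurements keyed by Entrez ID join one-to-one.
expr_by_entrez <- data.frame(
  entrez = c("101", "103"),
  value  = c(2.5, 1.0),
  stringsAsFactors = FALSE
)
joined_entrez <- merge(expr_by_entrez, annot, by = "entrez")
nrow(joined_entrez)
```

The symbol-keyed join returns more rows than there were measurements, while the ID-keyed join does not, which is the practical reason to carry Entrez IDs through an analysis and attach symbols only for display.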
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States
Hi Daren,

First note that for any Bimap object 'x':

  length(mget(mappedRkeys(x), x))

is the same as:

  count.mappedRkeys(x)

but the latter is much more efficient. Furthermore, if 'x' is a right-to-left map like in your case (see 'summary(x)'), then 'count.mappedRkeys(x)' is equivalent to 'count.mappedkeys(x)'.

But generally speaking, there is no reason to expect:

  nrow(toTable(x)) == count.mappedkeys(x)  # generally not true

unless the mapping contained in 'x' is one-to-one.

Explanation: 'toTable(x)' returns a flat representation of Bimap object 'x', e.g.

    Lkey Rkey
  1    a    A
  2    a    B
  3    b    A
  4    d    C

All the edges (or links) of the bipartite graph are listed. Note that right key "A" is mapped to left keys "a" and "b", so this mapping is not one-to-one. The left (or right) keys that don't map to anything don't appear in this table.

'count.mappedRkeys(x)' counts the number of (unique) right keys that map to at least one left key, i.e. 3 in the small example above. So in fact, the following is true for any Bimap object 'x':

  length(unique(toTable(x)[[2]])) == count.mappedkeys(x)  # always TRUE

Hope this helps.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Hervé Pagès wrote:
[...]
> So in fact, the following is true for any Bimap object 'x':
>
>   length(unique(toTable(x)[[2]])) == count.mappedkeys(x)  # always TRUE

Oops, the correct equalities are:

  length(unique(toTable(x)[[1]])) == count.mappedLkeys(x)  # always TRUE
  length(unique(toTable(x)[[2]])) == count.mappedRkeys(x)  # always TRUE

Sorry,
H.
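The corrected equalities can be checked by hand on the four-row toy table from the answer above. The sketch below rebuilds that table as a plain data frame (the name `tab` is an assumption, and no Bimap object is involved), so the `unique()` counts are only stand-ins for what count.mappedLkeys() and count.mappedRkeys() would report on a real Bimap.

```r
# Toy links from the answer above, as a plain data frame (not a Bimap).
tab <- data.frame(
  Lkey = c("a", "a", "b", "d"),
  Rkey = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

n_links <- nrow(tab)                   # one row per link
n_lkeys <- length(unique(tab[[1]]))    # distinct left keys: a, b, d
n_rkeys <- length(unique(tab[[2]]))    # distinct right keys: A, B, C

# The mapping is not one-to-one, so the link count exceeds both key counts.
c(links = n_links, left_keys = n_lkeys, right_keys = n_rkeys)
```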