Question

numInCat in goseq inconsistent with value obtained from org.Hs.eg.db

0

Entering edit mode

siajunren • 0

@siajunren-12197

Last seen 12 months ago

Singapore

Hi,

The number of genes in a GO geneset determined by goseq is different from what I obtained by querying org.Hs.eg.db.

I ran the following:

library(goseq)
data(genes)
pwf.eg=nullp(genes,"hg19","ensGene")
GO.wall.eg=goseq(pwf.eg,"hg19","ensGene", test.cats='GO:BP')
head(GO.wall.eg,5)

It shows that GO:0000278 is top candidate and has 922 for numInCat.

Then I ran the following:

library(org.Hs.eg.db)
Hs=org.Hs.eg.db
Hsmapped_eg=select(Hs, 'GO:0000278', 'ENSEMBL','GO')$ENSEMBL
length(Hsmapped_eg) #only 139

Only got 139 members in this geneset.

Why such a huge discrepancy?

Thanks,
Junren

goseq • 1.3k views

ADD COMMENT • link updated 6.7 years ago by James W. MacDonald 68k • written 6.7 years ago by siajunren • 0

score 3 · Accepted Answer · 2018-08-14

There are two things going on here. First, the Gene Ontology (GO) is set up as a directed acyclic graph, which simply means that we start at the highest, most general level (e.g., Biological Process), and then keep splitting into more specific terms. By default, since all of the lower GO terms are more specific than the higher GO terms, all the genes that are appended to a given GO term are also appended to all of the ancestor terms, all the way up to Biological Process. So as a (fake) example, if you have a term 'Protein phosphorylation' and a term 'Phosphorylation of Map Kinases', then all of the genes that are appended to the latter are also appended to the former, because Protein Phosphorylation is a superset of the lower term.

That's a roundabout way of getting to my point, which is that the GO keytype only gives you the genes that are appended to a particular GO term, not its descendant terms. For that you need the GOALL keytype.

Second, you are starting with Ensembl IDs, but are using the org.Hs.eg.db package (which is based on NCBI annotations) to do the mappings. By default that means you are mapping Ensembl IDs to Entrez Gene IDs, which is not a trivial process. Well, it's a trivial thing to do, but NCBI and EBI/EMBL are two different groups that are trying to do the same thing; there are assumptions and decisions made by each group that tend to cause their annotations to diverge more than you might think.

ANYWAY, why does goseq say there are 922 genes when you think there should only be 139? First, let's map the GO term to all Entrez Gene IDs:

> z <- select(org.Hs.eg.db, "GO:0000278", "ENTREZID", "GOALL")
'select()' returned 1:many mapping between keys and columns

> head(z)
       GOALL EVIDENCEALL ONTOLOGYALL ENTREZID
1 GO:0000278         IEA          BP       25
2 GO:0000278         TAS          BP       25
3 GO:0000278         IMP          BP       90
4 GO:0000278         IDA          BP       91
5 GO:0000278         IDA          BP      199
6 GO:0000278         IDA          BP      207

> dim(z)

[1] 1340    4

So 1340 Entrez Gene IDs per GO term. But look at the first two rows! That's the same Gene ID twice!

> length(unique(z[,4]))
[1] 1056

So that's closer. But not quite. Let's map to Ensembl IDs

> zz <- select(org.Hs.eg.db, unique(as.character(z$ENTREZID)), "ENSEMBL")
'select()' returned 1:many mapping between keys and columns
 > head(zz)
  ENTREZID         ENSEMBL
1       25 ENSG00000097007
2       90 ENSG00000115170
3       91 ENSG00000135503
4      199 ENSG00000204472
5      199 ENSG00000237727
6      199 ENSG00000206428

> length(unique(zz[,2]))
[1] 1180

Huh. Still not 922. But you only measured a certain set of genes!

> sum(unique(zz$ENSEMBL) %in% names(genes))
[1] 922

Et voila! There are 922 Ensembl genes that are in your set of genes, that map to that GO term or any of its descendants, that also map to any Entrez Gene ID.