Question

Large differences in gene content for GO terms between org.Mm.eg.db and msigdbr

0

Entering edit mode

S ▴ 10

@399a8e69

Last seen 20 months ago

Spain

Hello,

I was doing a functional enrichment with ClusterProfiler and wanted to check whether there was any difference in the final enriched terms with data from the library org.Mm.eg.db and the data in the package msigdbr. I saw large differences in my results regarding the entrez ids recognized by each resource (for both the list of significant genes and the gene universe) and therefore in the number of enriched terms (a few with org.Mm.eg.db and none with msigdbr) that I think may not be related to differences in the release date of both resources.

In the documentation for org.Mm.eg.db v3.16.0, it says that org.Mm.egGO is based on data provided by: Gene Ontology http://current.geneontology.org/ontology/go- basic.obo With a date stamp from the source of: 2022-07-01 In the website for msigdbr v7.5.1 the data of the release is March 30th, 2022. https://www.rdocumentation.org/packages/msigdbr/versions/7.5.1.

Here is the code I used to performed a few checks:

library(org.Mm.eg.db)
library(msigdbr)
#Selection of Gene Ontology terms related to Biological Process (GO:BP) with evidence ISO from the library org.Mm.eg.db
all_gos <- as.data.frame(org.Mm.egGO)
all_gos_bp <- subset(all_gos, all_gos$Ontology == "BP" & all_gos$Evidence == "ISO")
#Genes per term
all_gos_genes_per_term <- as.data.frame(table(all_gos_bp$go_id))

#Selection of GO:BP in msigdbr.
c5_gene_sets <- msigdbr(species = "Mus musculus", category = "C5", subcategory = "GO:BP") 
#Selection of GO term and entrez IDs.
c5_go_bp <- c5_gene_sets %>% dplyr::distinct(gs_exact_source, entrez_gene) %>% as.data.frame()
#Genes per term.
c5_genes_per_term <- as.data.frame(table(c5_go_bp$gs_exact_source))


#org.Mm.eg.db and msigdb data union for GO:BP terms
c5msigdb_org.Mm.egGO <- merge(all_gos_genes_per_term, c5_genes_per_term, by="Var1")
colnames(c5msigdb_org.Mm.egGO) <- c("Term","org.Mm.egGO_genes","msigdb_genes")

#Summary
table(c5msigdb_org.Mm.egGO$org.Mm.egGO_genes == c5msigdb_org.Mm.egGO$msigdb_genes)
table(c5msigdb_org.Mm.egGO$org.Mm.egGO_genes > c5msigdb_org.Mm.egGO$msigdb_genes)
summary(c5msigdb_org.Mm.egGO$org.Mm.egGO_genes)
summary(c5msigdb_org.Mm.egGO$msigdb_genes)

The number of terms with the exact number of genes is identical in 142/5060. The number of terms with a larger number of genes in org.Mm.egGO than in C5 from msigdb is 95/5060.

> summary(c5msigdb_org.Mm.egGO$org.Mm.egGO_genes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   4.000   7.964   8.000 727.000 

> summary(c5msigdb_org.Mm.egGO$msigdb_genes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    9.00   21.00   79.02   62.00 1914.00

The information in msigdbr for GO:BP is the same that I can find today in the links it provides to geneontology.org (see the content of c5_gene_sets column gs_url) but different from org.Mm.eg.db that usually shows a much shorter number of genes for some reason.

Could anyone help me to understand why the differences between both resources for GO:BP are so large?

Thanks in advance.

Sheila

org.Mm.eg.db GO msigdbr • 2.1k views

ADD COMMENT • link 2.1 years ago S ▴ 10

score 0 · Answer 1 · 2023-02-23

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 22 hours ago

WEHI, Melbourne, Australia

I view the org.Mm.eg.db annotations as definitive as they come directly from mouse gene/GO associations from geneontology.org.

The msigdbr resource on the other hand isn't even based on mouse GO annotation. It is simply an ortholog mapping of the human GO gene sets that are part of the MSigDb. And the MSigDb GO sets themselves are not a complete representation of human GO annotation, but instead a subset of GO terms chosen to be suitable for Broad Institute GSEA analyses.

ADD COMMENT • link 2.1 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for your answer.

I understand what you say. Nevertheless, I still find it hard to believe that the differences in both resources are only due to the reasons you provide.

There are 5060 different GO:BP terms that are identical between both resources but org.Mm.eg.db contains 8958 terms that belong to evidence = ISO and msigdbr includes 7656, with no clue about evidence or any other applied filter. Here are the first few lines of the dataframe c5msigdb_org.Mm.egGO I created with the code in my previous post.

Number of genes per term in org.Mm.eg.db and msigdbr

Looking at the number of genes for each term in these resources, I always find the numbers in msigdbr much closer to the information that is shown in http://amigo.geneontology.org than the numbers that org.Mm.eg.db provides. The release I have for msigdbr is dated from march last year which might explain the variation in the number of genes compared to what it is recorded in the latest release of geneontology.org.

Here is just one example of what I see in http://amigo.geneontology.org when searching for a particular term. Ex: Term GO:0000019.

AmiGO results for GO:0000019 for Mus musculus

The number of genes in geneontology is 7, the number of genes in org.Mm.eg.db for this term is 1 and for msigdbr is 7.

In addition, I also calculated the number of genes per term to see the distribution in both org.Mm.eg.db and msigdbr this way.

c5msigdb_org.Mm.egGO <- as.data.table(c5msigdb_org.Mm.egGO)
org.Mm.egGO_toplot <- melt(c5msigdb_org.Mm.egGO, id.vars="Term")
colnames(org.Mm.egGO_toplot) <- c("Term", "Source", "Genes_per_term")
library(ggplot2)
ggplot(org.Mm.egGO_toplot, aes(x=Genes_per_term, fill=Source, color=Source)) +
geom_histogram(position="identity", alpha=0.2, bins = 100) + xlim(c(0,100)) + ylab("Number_of_terms")

Distribution of the number of genes per GO:BP term for org.Mm.eg.db and msigdbr

1043 out of the 5060 GO:BP terms shared between these databases have just one gene in org.Mm.eg.db, which seems strange to me. msigdbr only has 5/5060 terms with one gene. In general, distributions are different and org.Mm.eg.db includes a larger number of terms with a small number of genes.

To me, taking as true the information in http://amigo.geneontology.org, it seems as if org.Mm.eg.db is filtering out a large number of genes and I would like to understand why and how this process was done.

Thanks in advance.

ADD REPLY • link 2.1 years ago S ▴ 10

1

Entering edit mode

When I do a GO analysis using org.Mm.eg.db, I find that the GO:0000002 term contains 37 BP genes, much more than the 7 in your table:

> library(limma)
> Genes <- c("17258","23797","27393")
> g <- goana(Genes, species="Mm")
> topGO(g, ont="BP", n=2)
                                       Term Ont   N DE         P.DE
GO:0000002 mitochondrial genome maintenance  BP  37  3 1.752123e-09
GO:0007005       mitochondrion organization  BP 541  3 5.917962e-06

Two reasons might be:

The correct org.Mm.eg.db object for functional enrichment analyses is not org.Mm.egGO but org.Mm.egGO2ALLEGS.
You are restricting org.Mm.eg.db genes to those with ISO evidence but not applying the same restriction to msigdbr. You don't say why. I don't see any need for this restriction when I do GO analyses.

ADD REPLY • link 2.1 years ago Gordon Smyth 52k

0

Entering edit mode

I selected ISO because I manually inspected a few terms and saw that ISO gene sets were closer to the gene lists provided in msigdbr and geneontology.org.

Changing the database to org.Mm.egGO2ALLEGS shows a similar histogram for both org.Mm.eg.db and msigdbr. Looking at the index on the left side panel in the org.Mm.eg.db documentation I wrongly assumed that the items listed there were all the available databases and selected org.Mm.egGO without going further into the documentation, so I did not see that other databases such as org.Mm.egGO2ALLEGS existed.

Thank you very much for your help.

Best regards,

Sheila

ADD REPLY • link 2.1 years ago S ▴ 10