I am performing a GO enrichment analysis using genes from the transcriptome as the reference universe, consisting of 23,761 genes. I then conducted an enrichGO analysis on my differentially expressed genes (DEGs).
However, I noticed that only 17,800 genes from my universe are annotated in the org.Mm.eg.db database, as shown in the results below:
ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0001505 GO:0001505 regulation of neurotransmitter levels 25/207 208/17800 2.039483e-18 4.729560e-15 3.896485e-15
GO:0006836 GO:0006836 neurotransmitter transport 25/207 218/17800 6.415275e-18 7.438512e-15 6.128276e-15
GO:0007269 GO:0007269 neurotransmitter secretion 21/207 166/17800 4.292311e-16 2.488467e-13 2.050143e-13
To understand this difference, I checked how many genes from my universe are properly annotated in the org.Mm.eg.db database using the following code:
# list of gene symbols from the dataframe, representing the universe of genes.
universe <- df$mgi_symbol
# list of all mouse gene symbols from the org.Mm.eg.db database.
gene_symbols <- keys(org.Mm.eg.db, keytype = "SYMBOL")
# Retrieve the associated GO terms for each of the mouse gene symbols.
go_terms <- mapIds(org.Mm.eg.db,
keys = gene_symbols,
column = "GO",
keytype = "SYMBOL",
multiVals = "list")
# Filter genes that do not have valid or non-empty GO terms.
empty_or_na_go_terms <- go_terms[sapply(go_terms, is_empty_or_na)]
# Function to check if a vector of GO terms is valid (not empty, non-null, and non-NA).
has_valid_go_terms <- function(go_terms_vector) {
!is.null(go_terms_vector) && !all(is.na(go_terms_vector)) && length(go_terms_vector) > 0
}
# Filter genes with valid GO terms by applying the function to the list of GO terms.
valid_go_terms <- go_terms[sapply(go_terms, has_valid_go_terms)]
# Extract the names of genes that have valid GO terms from the db.
genes_with_go_terms <- names(valid_go_terms)
print(length(genes_with_go_terms)) #29,891.
# Find the intersection between the genes in the universe and the genes that have valid GO terms in db.
common_genes <- intersect(universe, genes_with_go_terms)
print(length(common_genes)) #18,359.
I found 18,359 genes from my universe that are properly annotated in the org.Mm.eg.db database. However, the enrichGO results indicate only 17,800 annotated genes in the analysis. Where could this difference come from?
```
If you want to comment on an answer, please choose ADD COMMENT rather than ADD ANSWER, since you are not in fact adding an answer.
Ideally you wouldn't be using gene symbols because they are not unique.
I identified 3804 genes with GO annotations. I put my liste of genes in this link enter link description here
By using a slightly different way I get much larger numbers; both for the total number of input that could be annotated (= mapped to IDs), and the number of unique genes with GO-BP annotation.
This partly may be due to the fact that
ALIAS
is used askeytype
(input); see for example below thatH19
maps to 2 genes (ID14955
and50884
). This demonstrates what James said:SYMBOL
let aloneALIAS
are not unique... and thus not preferred as input.Step-by-step:
Note: edited post to better show difference between total number of GOALL annotations, and number of unique genes that in the end were annotated.
I gave you busted SQL code that returned the central key for the DB rather than the gene IDs. Here's the corrected version.
Hi James,
I am likely missing something obvious, but why is the code that I used returning a different number of genes having a GOALL-BP annotation than yours? Your result: 17809 unique genes, my result 9576 genes.
Thanks!
EDIT: Note that the number of unique genes is not correctly determined in the last line of code above!
Number of unique genes is identical to the number calculated by James through his direct query of the database.
See post below: clusterProfiler: BgRatio N value does not match the size of the gene universe that has GO annotations.
Hi Guido,
It's that last filtering step. You are saying the NCBI Gene ID has to be unique AND the ontology has to be BP, which isn't what you want, because
!duplicated
is the same thing as 'return only the first instance of', and you can see that in yourres
object the first instance of that NCBI Gene ID is a CC ontology, so you exclude that one (and any other Gene ID where the first instance of that ID isn't also simultaneously a BP ontology).Thanks James! I indeed missed the obvious... all clear now!