Hi,
I am new to transcriptomic data analysis and I am currently working on pathway analysis. To do so, I am using the clusterProfiler package on R. I'm going to use both enrichGO and enrichKEGG but they both have a "universe" parameter and I am struggling understanding it right.
So, from what I have understood is that, I can either use every genes that I have in my assay (around 56k) and it corresponds to the defaults parameter of the function, ie "all the genes listed in the database will be used as background", or I can use every genes that I have kept to do my differential gene expression analysis, since I have filtered them to get rid of the low counts. I have filtered them as follow, where dds is a DESeq2 object that I got using DESeqDataSetFromMatrix function with the data tables that I have :
keep <- rowSums(counts(dds) >= 10) > 3
dds <- dds[keep,]
I get around 18k genes left. Then, I used the enrichGO function in two differents ways, as follows :
enrich18k <- enrichGO(gene=rownames(subset(resyoungAC, log2FoldChange>=1 & padj<0.05)),
universe=rownames(resyoungAC),
OrgDb = "org.Mm.eg.db",
keyType = "ENSEMBL",
ont="BP")
enrich56k <- enrichGO(gene=rownames(subset(resyoungAC, log2FoldChange>=1 & padj<0.05)),
OrgDb = "org.Mm.eg.db",
keyType = "ENSEMBL",
ont="BP")
Here, resyoungAC is the object that I got using results() function from DESeq. I am filtering to keep only the up-regulated genes. The only difference between both functions is the universe parameter, where I set it as rownames(resyoungAC) in the first function, corresponding to the 18k genes that were left after filtering. The second function is using the default parameter ie the 56k genes (I verified that it was indeed using these genes, and it is).
My problem is that I have few of the most expressed pathways in the enrich18k object that don't make it in the second object (enrich56k) and I can't understand why.
In the enrich18k object, the most expressed pathway is "response to bacterium", with 27/127 as GeneRatio and 497/15489 as BgRatio (padj~7.2e-12). In the enrich56k object, the most expressed pathway is "leukocyte chemotaxis" with 14/127 as GeneRatio and 235/23012 as BgRatio (padj~1.05e-07).
I don't understand why the "response to bacterium" is not in the enrich56k results since my 127 genes of interest remain the same and I apparently have 27 of these genes that are expressed in that pathway in enrich18k.
Thanks a lot for you help.
I just would like to add that by default
clusterProfiler
is conservative in setting the background. That is, it limits the background genes to those that intersect with (i.e. need to be present) in the gene sets you used for analysis (which in your case are the genes that have a GO annotation). The input genes that are not annotated in any gene set (GO category) are removed before analysis, even when you explicitly provide them as input for the argumentuniverse
.Yet, since very recent Guangchuang introduced an option to disable taking this 'forced' intersection.
Use
options(enrichment_force_universe = TRUE)
to force theuniverse
to be kept untouched. See here for more info: https://github.com/YuLab-SMU/clusterProfiler/issues/636#issuecomment-2073994557Make sure to use
DOSE
version 3.31.1 or higher, because a bug was fixed that related to the calculation of the universe.Also, for straight GO analysis for RNA-Seq, I tend to use
goseq
, which corrects for the bias inherent in comparing between genes with different lengths. And it gives you more control over the universe which I would prefer.Hi James, thank you for your answer.
I understand how the pvalue might be significant or not depending on the universe. But I still don't get why, if one enriched term appears in one of both cases, why doesn't it appear in the other ? The pvalue might change (from significant to non significant) but why would the term simply disappear since there are genes that belong to that pathway ? The GeneRatio should stays the same while the BgRatio changes.
If you say you only want terms that are significant, and a given term is only significant in one comparison, you shouldn't expect to see that term in both, should you?
In the enrichGO results we have significant and non-significant enriched terms that are displayed. So why would an enriched term (significantly or not) would appear in one of both and not both ?
The default for
enrichGO
is to subset to those with a p-value<0.05. You don't adjust that default argument, so you shouldn't have any GO term with a p>0.05. As an example,So it is unclear to me why you say that you get non-significant 'enriched' terms displayed.
Well, I was looking at the result table (enrich18k@result, or yy@result in your example), and it gives every pathway even the non-significant ones. So I was expecting to see the same pathways but maybe with a different pvalue. Because I still don't get why the pathway with the biggest GeneRatio does not appear in both cases since this ratio should stays the same (which makes sense since we are looking to the same genes). Maybe the BgRatio would change, which could lead to a non-significant pvalue (from what I know, generatio and bgratio are taken into account in the pvalue calculation), but the pathway should still be there isn't it ?
Again, thank you for your time.
I think Guido already answered this question.