Question

Most enriched pathway is not expressed when modifying universe parameter

0

Entering edit mode

adrip ▴ 10

@eef42a90

Last seen 9 months ago

France

Hi,

I am new to transcriptomic data analysis and I am currently working on pathway analysis. To do so, I am using the clusterProfiler package on R. I'm going to use both enrichGO and enrichKEGG but they both have a "universe" parameter and I am struggling understanding it right.

So, from what I have understood is that, I can either use every genes that I have in my assay (around 56k) and it corresponds to the defaults parameter of the function, ie "all the genes listed in the database will be used as background", or I can use every genes that I have kept to do my differential gene expression analysis, since I have filtered them to get rid of the low counts. I have filtered them as follow, where dds is a DESeq2 object that I got using DESeqDataSetFromMatrix function with the data tables that I have :

keep <- rowSums(counts(dds) >= 10) > 3
dds <- dds[keep,]

I get around 18k genes left. Then, I used the enrichGO function in two differents ways, as follows :

enrich18k <- enrichGO(gene=rownames(subset(resyoungAC, log2FoldChange>=1 & padj<0.05)),
                         universe=rownames(resyoungAC),
                         OrgDb = "org.Mm.eg.db",
                         keyType = "ENSEMBL",
                         ont="BP")

enrich56k <- enrichGO(gene=rownames(subset(resyoungAC, log2FoldChange>=1 & padj<0.05)),
                         OrgDb = "org.Mm.eg.db",
                         keyType = "ENSEMBL",
                         ont="BP")

Here, resyoungAC is the object that I got using results() function from DESeq. I am filtering to keep only the up-regulated genes. The only difference between both functions is the universe parameter, where I set it as rownames(resyoungAC) in the first function, corresponding to the 18k genes that were left after filtering. The second function is using the default parameter ie the 56k genes (I verified that it was indeed using these genes, and it is).

My problem is that I have few of the most expressed pathways in the enrich18k object that don't make it in the second object (enrich56k) and I can't understand why.

In the enrich18k object, the most expressed pathway is "response to bacterium", with 27/127 as GeneRatio and 497/15489 as BgRatio (padj~7.2e-12). In the enrich56k object, the most expressed pathway is "leukocyte chemotaxis" with 14/127 as GeneRatio and 235/23012 as BgRatio (padj~1.05e-07).

I don't understand why the "response to bacterium" is not in the enrich56k results since my 127 genes of interest remain the same and I apparently have 27 of these genes that are expressed in that pathway in enrich18k.

Thanks a lot for you help.

DESeq2 clusterProfiler Pathways RNASeq org.Mm.eg.db • 2.5k views

ADD COMMENT • link updated 10 months ago by James W. MacDonald 68k • written 10 months ago by adrip ▴ 10

score 0 · Answer 1 · 2024-06-12

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

The reason you get different results is because you are changing the universe. The GO hypergeometric test is basically asking if the proportion of genes in a given term are enriched in your set of genes as compared to the universe from which they were selected.

You can think of this in percentage terms. As an example, say that a given term contains 10% of the genes in the universe, but that term is enriched in your set of significant genes, containing 25% of the genes in that set. The term will likely be significant, because that's a big enrichment.

But the comparison is to the proportions in your universe, so let's use another example. Say the term of interest is 10% of the genes in the universe, and it's also 10% of the significant genes. In that case it won't be significant. But if you filter the genes, and after having done so, the proportion of that term that remain in your filtered set has been reduced to only 5% of the genes, you are very likely to have a significant result, because now that 10% in your significant gene set is twice the 5% in your universe.

As a general rule I mostly use the unfiltered set of genes as my universe, and my rationale for doing so is that a priori all of those genes could have been detected, and it's only a posteriori that I find out which genes have been detected. You can of course make the opposite argument, using the conventional balls in an urn analogy. If you think of a hypergeometric test in that heuristic, then when you reach into the urn to select the balls, the only balls you can actually grab are the ones in the urn, not all the balls that could have been in the urn if you were able to measure them.

You as the analyst have to make the decision as to which analogy makes more sense to you, and that you feel more comfortable defending if anybody asks why you made that choice.

ADD COMMENT • link 10 months ago James W. MacDonald 68k

1

Entering edit mode

I just would like to add that by default clusterProfiler is conservative in setting the background. That is, it limits the background genes to those that intersect with (i.e. need to be present) in the gene sets you used for analysis (which in your case are the genes that have a GO annotation). The input genes that are not annotated in any gene set (GO category) are removed before analysis, even when you explicitly provide them as input for the argument universe.

Yet, since very recent Guangchuang introduced an option to disable taking this 'forced' intersection.

Use options(enrichment_force_universe = TRUE) to force the universe to be kept untouched. See here for more info: https://github.com/YuLab-SMU/clusterProfiler/issues/636#issuecomment-2073994557

Make sure to use DOSE version 3.31.1 or higher, because a bug was fixed that related to the calculation of the universe.

ADD REPLY • link 10 months ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Also, for straight GO analysis for RNA-Seq, I tend to use goseq, which corrects for the bias inherent in comparing between genes with different lengths. And it gives you more control over the universe which I would prefer.

ADD REPLY • link 10 months ago James W. MacDonald 68k

0

Entering edit mode

Hi James, thank you for your answer.

I understand how the pvalue might be significant or not depending on the universe. But I still don't get why, if one enriched term appears in one of both cases, why doesn't it appear in the other ? The pvalue might change (from significant to non significant) but why would the term simply disappear since there are genes that belong to that pathway ? The GeneRatio should stays the same while the BgRatio changes.

ADD REPLY • link 10 months ago adrip ▴ 10

0

Entering edit mode

If you say you only want terms that are significant, and a given term is only significant in one comparison, you shouldn't expect to see that term in both, should you?

ADD REPLY • link 10 months ago James W. MacDonald 68k

0

Entering edit mode

In the enrichGO results we have significant and non-significant enriched terms that are displayed. So why would an enriched term (significantly or not) would appear in one of both and not both ?

ADD REPLY • link 10 months ago adrip ▴ 10

0

Entering edit mode

The default for enrichGO is to subset to those with a p-value<0.05. You don't adjust that default argument, so you shouldn't have any GO term with a p>0.05. As an example,

> library(clusterProfiler)
> data(geneList, package = "DOSE")
> de <- names(geneList)[1:100]
> yy <- enrichGO(de, 'org.Hs.eg.db', ont="BP")
> range(yy$pvalue)
[1] 2.176086e-31 7.313847e-03

So it is unclear to me why you say that you get non-significant 'enriched' terms displayed.

ADD REPLY • link 10 months ago James W. MacDonald 68k

0

Entering edit mode

Well, I was looking at the result table (enrich18k@result, or yy@result in your example), and it gives every pathway even the non-significant ones. So I was expecting to see the same pathways but maybe with a different pvalue. Because I still don't get why the pathway with the biggest GeneRatio does not appear in both cases since this ratio should stays the same (which makes sense since we are looking to the same genes). Maybe the BgRatio would change, which could lead to a non-significant pvalue (from what I know, generatio and bgratio are taken into account in the pvalue calculation), but the pathway should still be there isn't it ?

Again, thank you for your time.

ADD REPLY • link 10 months ago adrip ▴ 10

0

Entering edit mode

I think Guido already answered this question.

ADD REPLY • link 10 months ago James W. MacDonald 68k