Hi, sorry if this is an overly simple question but I couldn't find a clear answer on the forums or vignette.
I'm running gene set enrichment using the gseGO function in clusterProfiler. The function needs a ranked list of genes, which I'm planning to rank by log fold change. Should the gene list contain all genes, or only genes below a significance cutoff (e.g. padj < 0.05)? I know some people also rank using something like (signed fold change * -log10(p-value)). Should that metric be computed for all genes or only for those below a significance cutoff?
If both inputs (all genes and padj<0.05) are valid, under what circumstances should you use one over the other?
For GSEA, which is a functional class scoring (FCS) method, you should use all genes, not a subset. If you use only a subset (e.g. the significant genes), then you are effectively performing an over-representation analysis (ORA). For more info on the differences between the two methods (FCS vs ORA) you may want to check the links in this post: Cluster profiler - KEGG analysis
The default ranking metric in the Broad Institute's GSEA software is the so-called Signal2Noise metric, but other metrics can be used as well. FYI: since I use limma for my analyses, I routinely use its moderated t-values as the ranking metric. For more background / food-for-thought on this see the GSEA website at the Broad Institute (https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm#_Metrics_for_Ranking), or e.g. this paper.
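To make the "use all genes" point concrete, here is a minimal sketch of how the ranked input for gseGO() could be built. The table `res` and its values are made up for illustration (in practice it might come from something like limma's topTable() with all genes returned), and the column names are assumptions, not something clusterProfiler requires.

```r
# Hypothetical differential-expression results for ALL tested genes.
# The values are invented; Entrez IDs are used as gene identifiers.
res <- data.frame(
  gene  = c("1017", "1018", "1019", "1020", "1021"),
  logFC = c(2.1, -0.3, 0.8, -1.7, 0.05),
  t     = c(6.4, -0.9, 2.2, -5.1, 0.1),   # e.g. limma moderated t-values
  padj  = c(0.001, 0.60, 0.08, 0.004, 0.95)
)

# Note: no filtering on padj here. Every gene keeps its place in the ranking.
gene_list <- res$t                  # ranking metric (res$logFC would also work)
names(gene_list) <- res$gene        # gseGO() expects a named numeric vector
gene_list <- sort(gene_list, decreasing = TRUE)  # must be sorted, highest first

# With clusterProfiler and an annotation package installed, the call would
# look roughly like this (not run here; a real analysis needs far more genes):
# library(clusterProfiler)
# library(org.Hs.eg.db)
# gsea_res <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db,
#                   ont = "BP", keyType = "ENTREZID")
```

The only thing that changes between ranking metrics is the first assignment to `gene_list`; the naming and sorting steps stay the same.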
Also, to perform an ORA (based on Gene Ontology) in clusterProfiler you will need to use the function enrichGO().
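For comparison, a rough sketch of how the input for an ORA differs: here you do apply the significance cutoff and pass only the selected gene IDs. The toy table and the 0.05 threshold are illustrative assumptions. Supplying all tested genes as the background (`universe`) is my own suggestion, not something enrichGO() requires.

```r
# Hypothetical differential-expression results (same invented toy values).
res <- data.frame(
  gene = c("1017", "1018", "1019", "1020", "1021"),
  padj = c(0.001, 0.60, 0.08, 0.004, 0.95)
)

# ORA input: only the IDs of genes passing the cutoff (no ranking needed).
sig_genes <- res$gene[res$padj < 0.05]

# Optional background: all genes that were tested.
universe <- res$gene

# With clusterProfiler and an annotation package installed, the call would
# look roughly like this (not run here):
# library(clusterProfiler)
# library(org.Hs.eg.db)
# ora_res <- enrichGO(gene = sig_genes, universe = universe,
#                     OrgDb = org.Hs.eg.db, keyType = "ENTREZID", ont = "BP")
```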
Very helpful, thank you!