Hi all,
I am relatively new to GSEA with Bioconductor and enrichplot and and have some questions about enrichplot's dotplot function, in particular about the "split = .sign" option hat divides the plot into an "activated" and "suppressed" group which I split up to separate graphs via facet_grid(~.sign).
I did a gseGO search in R using this code:
AvsB_gse <- gseGO(geneList = AvsB_list,
ont ="MF",
keyType = "UNIPROT",
nPermSimple = 10000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
verbose = TRUE,
OrgDb = org.Hs.eg.db,
pAdjustMethod = "none")
I then plotted my results via enrichplot:
AvsB_EnrichPlot <- dotplot(AvsB_gse,
x = "GeneRatio",
color = "p.adjust",
showCategory = 15,
size = NULL,
split = ".sign",
font.size = 12,
title = "A vs B Genes - Molecular Functions",
orderBy = "x",
label_format = 100,
) +
facet_grid(~.sign) +
theme(panel.spacing = unit(0.5,
"cm",
data = NULL))
When receiving my plot result from the gseGO analysis above based on a sorted list with Gene_IDs and log2ratios, I receive a gene cluster in the "activated" facet of the plot but when checking back the group IDs in the core_enrichment column from the GSEA with the original list entries' log2ratios, all these are in the negative range.
So by my understanding, this group of genes should cluster in the "suppressed" box of the facet as their log2ratios are all in the negative or am I getting this completely wrong and "activated" and "suppressed" are not at all connected to log2values? Unfortunately, I could not find any documentation that explain the use of ".sign" in more detail, any help is highly appreciated!
Best, Tobi
I would be helpful if you provided more information, for example how you sorted your list. Based on what you describe it seems you used
sort
with its default argument(s), which is:sort(x, decreasing=FALSE, ...)
. See?sort
. In other words, your input seems to be ranked form negative (on top) to positive log2ratios (on the bottom of the ranked list)... which perfectly matches with what you apparently observe in the plot.To get additional insights, you may also want to visualize the results in a so-called
gseaplot
and/orridgeplot
. See for more info the respective help pages for these functions.Hi, thanks so much for reaching out!
I did sort my list as described for clusterProfiler in a decreasing manner:
This created a list of 3796 protein IDs which are sorted in a descending order for "x" from 13.38191 (row 2) to -13.86907 (row 3797), so nothing out of the ordinary. This is the list which I ran gseGO on.
As far as I know, once sorted in this way, gseGO doesn't do any additional sorting or mirror the results around unless this changed in some recent updates.
I also checked the source code of dotplot here... https://github.com/YuLab-SMU/enrichplot/blob/b6467013b88a96160253cacc7391ccfa7988ca78/R/dotplot.R ...but I could not find anything that explained what is going on with ".sign" to me.
Thank you for the other plot suggestions, I am going to see how these look like, but I do not think you can split a ridgeplot (not sure about gseaplot).
Are you sure
AvsB_list
is in the proper/expected format? The reason I am asking is because you say that ranking starts from row 2 onward, whereas the first entry (row) already should be included in the ranking...? It should be a named numeric vector. Please provide the output fromhead(AvsB_list)
andstr(AvsB_list)
; you may also want to compare it with the example input:Also, when I run you code using the example input the
gseGO
results anddotplot
do match. This points to the only 'variable' that is not the same between your and my analysis, which is (again) the inputAvsB_list
.Thank you for this thorough analysis of the code, much appreciated! About the ranking, I was referring to the printed list I got via write.csv, which adds a header on row one, sorry for the confusion.
Here is my comparison with your code above using gene_list from DOSE, nothing really sticks out here, except for the use of Uniprot identifiers instead of entrez IDs:
I did also run all your code using gene_list from DOSE on my setup (R-Studio in a markdown format) and received pretty much the same data, albeit with a small difference compared to your plot:
I also can repeat your gseaplot sets (no need to show again) except for the ridgeplot which gives me an error:
When I run my own AvsB_list again with the previously shown code, I get following enrichplot:
Lets pick two interesting GO terms with opposite activation, MHC class genes and serine/threonine kinases:
Now when I do the gseaplot following your example, things seem to be as expected compared to enrichplot (besides different color distribution?):
So having checked all this, everything seems to be alright considering my list.
What still puzzles me and maybe I have not made this more clear in my initial post is the following:
When I check AvsB_list for the associated log2ratios of my protein IDs in the MHC terms (GO:0023026), all of these are negative:
So I am really wondering why these are described as "activated" in the enrichplot facet. These proteins are certainly lower in sample "A vs B" when seen from the comparison by each individual log2ratio. When compared to the serine/threonine group (GO:0004674) however, the individual log2ratios of the IDs listed are even lower:
This does explain the ranking to me in terms of comparing just the GO sets together.
But I do have to say that the terms "activated" and "suppressed" are a bit misleading if this is truly the difference that plots them into the separate facets using ".sign".
I admit that this is a very naive question from someone new to GO terms, but the labeling simply doesn't make sense to me considering the individual log2ratios from the Protein IDs. Genes in both GO sets should be considered "suppressed" as all log2s are negative when comparing sample A to B. Please let me know if I am missing something essential here!
Aha, the
gseaplot2
you posted shows the cause of the confusion! It boils down to the distribution of your ranking metric (logFC), and how the gene set enrichment analysis (GSEA) procedure works. See for example here for a detailed explanation: https://www.pathwaycommons.org/guide/primers/data_analysis/gsea/Simply said, the basis of GSEA is a list in which the genes are ranked on a relevant metric, in your case logFC. Importantly, after the ranking, the values of the ranking metric (logFC) are not used anymore; of relevance is only the position of a gene in that ranked list. The next step in the procedure is to walk from top to down of the ranked list, and check in which gene set (pathway) each of the ranked genes appears. If there is a match, then the score (running sum) of those gene sets increase. This is done until the bottom of the ranked list is reached. If multiple genes that belong to the same gene set are on top of the ranked list, the (normalized) enrichment score of that gene set will be positive, and this gene set is then interpreted to be 'activated', the same applies to genes that are at the bottom of the ranked list; the gene sets that are 'enriched' by these genes have a negative NES, and could be interpreted as being 'suppressed'. What
clusterProfiler
under the hood does, is simply labeling sets having a positive NES as 'activated', and negative NES as 'suppressed'. These two term are then used to 'split' the gene sets in the dotplot.One of the underlying assumptions is that you more or less have similar number of genes with a positive or negative ranking metric. Yet, if you look at the
gseaplot2
, you see at the bottom panel (with y-axis = 'Ranked List Metric') that of the 3796 proteins you used as input, only around 700 have a positive ranking metric. In other words, in your dataset only 700 genes are up-regulated, and 3000 are down-regulated. The GSEA algorithm correctly identifies gene sets that are enriched on the top of your ranked list, and since the values of the ranking metric are not considered at all, these are then just labelled 'activated'. This is why the gene setMHC class II protein complex binding
is labelled like that!By plotting the underlying ranking data, e.g. in a
heatplot
orupsetplot
, it becomes clear that despite the label 'activated' the corresponding genes are actually down-regulated!To illustrate this with some code:
Thank you for this thorough explanation, the links and analysis, this is immensely helpful! Also sorry for my late reply, it has been quite busy around here. I do believe that quite a lot of people are using the split function but (like me) do interpret the resulting data about what is activated and what no in a wrong way.
In biology, terms like "activated" and "suppressed" are always in correlation to a control value (hence application log of fold-change ratios). What the plost using
split = ".sign"
actually shows is just that some GO-terms are more or less affected by change in the presence/position of genes in the list. This is a big difference, I believe the labels for the facets should be named differently in that regard.I only had some question marks popping up once I checked the raw numbers underlying the plot (which is always good practice I guess). Your reply made absolute sense and helped me clarify what is going on, which is great! I will need to be more careful when I make these analyses in the future, doublecheck using your code above and most likely change the labels of these facets. Thank you again for the help, I appreciate that very much!