Question

Looking for documentation about the use of "split = .sign" in enrichplot's dotplot

0

Entering edit mode

Tobias • 0

@ea227b54

Last seen 13 months ago

United States

Hi all,

I am relatively new to GSEA with Bioconductor and enrichplot and and have some questions about enrichplot's dotplot function, in particular about the "split = .sign" option hat divides the plot into an "activated" and "suppressed" group which I split up to separate graphs via facet_grid(~.sign).

I did a gseGO search in R using this code:

AvsB_gse <- gseGO(geneList = AvsB_list, 
                ont ="MF", 
                keyType = "UNIPROT", 
                nPermSimple = 10000, 
                minGSSize = 3, 
                maxGSSize = 800, 
                pvalueCutoff = 0.05, 
                verbose = TRUE, 
                OrgDb = org.Hs.eg.db, 
                pAdjustMethod = "none")

I then plotted my results via enrichplot:

 AvsB_EnrichPlot <- dotplot(AvsB_gse, 
        x = "GeneRatio",
        color = "p.adjust",
        showCategory = 15,
        size = NULL,
        split = ".sign",
        font.size = 12,
        title = "A vs B Genes - Molecular Functions",
        orderBy = "x",
        label_format = 100, 
        ) + 
  facet_grid(~.sign) +
  theme(panel.spacing = unit(0.5, 
                             "cm", 
                             data = NULL))

When receiving my plot result from the gseGO analysis above based on a sorted list with Gene_IDs and log2ratios, I receive a gene cluster in the "activated" facet of the plot but when checking back the group IDs in the core_enrichment column from the GSEA with the original list entries' log2ratios, all these are in the negative range.

So by my understanding, this group of genes should cluster in the "suppressed" box of the facet as their log2ratios are all in the negative or am I getting this completely wrong and "activated" and "suppressed" are not at all connected to log2values? Unfortunately, I could not find any documentation that explain the use of ".sign" in more detail, any help is highly appreciated!

Best, Tobi

enrichplot • 4.3k views

ADD COMMENT • link 15 months ago • updated 14 months ago Tobias • 0

1

Entering edit mode

I would be helpful if you provided more information, for example how you sorted your list. Based on what you describe it seems you used sort with its default argument(s), which is: sort(x, decreasing=FALSE, ...). See ?sort. In other words, your input seems to be ranked form negative (on top) to positive log2ratios (on the bottom of the ranked list)... which perfectly matches with what you apparently observe in the plot.

To get additional insights, you may also want to visualize the results in a so-called gseaplot and/or ridgeplot. See for more info the respective help pages for these functions.

ADD REPLY • link 15 months ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Hi, thanks so much for reaching out!

I did sort my list as described for clusterProfiler in a decreasing manner:

AvsB_list = sort(AvsB_list_unsorted, decreasing = TRUE)

This created a list of 3796 protein IDs which are sorted in a descending order for "x" from 13.38191 (row 2) to -13.86907 (row 3797), so nothing out of the ordinary. This is the list which I ran gseGO on.

As far as I know, once sorted in this way, gseGO doesn't do any additional sorting or mirror the results around unless this changed in some recent updates.

I also checked the source code of dotplot here... https://github.com/YuLab-SMU/enrichplot/blob/b6467013b88a96160253cacc7391ccfa7988ca78/R/dotplot.R ...but I could not find anything that explained what is going on with ".sign" to me.

Thank you for the other plot suggestions, I am going to see how these look like, but I do not think you can split a ridgeplot (not sure about gseaplot).

ADD REPLY • link 15 months ago Tobias • 0

1

Entering edit mode

Are you sure AvsB_list is in the proper/expected format? The reason I am asking is because you say that ranking starts from row 2 onward, whereas the first entry (row) already should be included in the ranking...? It should be a named numeric vector. Please provide the output from head(AvsB_list) and str(AvsB_list); you may also want to compare it with the example input:

> 
> data(geneList, package="DOSE")
> head(geneList)
    4312     8318    10874    55143    55388      991 
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857 
> str(geneList)
 Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
 - attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
>

Also, when I run you code using the example input the gseGO results and dotplot do match. This points to the only 'variable' that is not the same between your and my analysis, which is (again) the input AvsB_list.

> ## load required libraries
> library(clusterProfiler)
> library(enrichplot)
> library(ggplot2)
> library(org.Hs.eg.db)
> 
> ## load example data
> data(geneList, package="DOSE")         
> 
> AvsB_gse <- gseGO(geneList = geneList, ## used sample input
+                 ont ="MF", 
+                 keyType = "ENTREZID",  ## ids are now entrezid, no uniprot
+                 nPermSimple = 10000, 
+                 minGSSize = 3, 
+                 maxGSSize = 800, 
+                 pvalueCutoff = 0.05, 
+                 verbose = TRUE, 
+                 OrgDb = org.Hs.eg.db, 
+                 pAdjustMethod = "none")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize,  :
  For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.
> 
> 
> AvsB_EnrichPlot <- dotplot(AvsB_gse, 
+                            x = "GeneRatio",
+                            color = "p.adjust",
+                            showCategory = 15,
+                            size = NULL,
+                            split = ".sign",
+                            font.size = 7,  ## reduced font size
+                            title = "A vs B Genes - Molecular Functions",
+                            orderBy = "x",
+                            label_format = 100, 
+                             ) + 
+         facet_grid(~.sign) +
+         theme(panel.spacing = unit(0.5, 
+                                    "cm", 
+                                    data = NULL))
> 
> ## show selected regulated genes sets
    > as.data.frame(AvsB_gse)[c("GO:0017116", "GO:0030020"),]
                   ID
GO:0017116 GO:0017116
GO:0030020 GO:0030020
                                                                       Description
GO:0017116                                   single-stranded DNA helicase activity
GO:0030020 extracellular matrix structural constituent conferring tensile strength
           setSize enrichmentScore       NES       pvalue     p.adjust
GO:0017116      17       0.7916671  2.193173 1.741078e-05 1.741078e-05
GO:0030020      34      -0.7170093 -2.210082 4.907464e-07 4.907464e-07
                 qvalue rank                   leading_edge
GO:0017116 0.0020941167 1270 tags=71%, list=10%, signal=63%
GO:0030020 0.0002242969 1890 tags=71%, list=15%, signal=60%
                                                                                                                      core_enrichment
GO:0017116                                                               4174/4171/79075/5888/4175/4173/5984/1763/4176/5982/4172/5983
GO:0030020 1288/1291/1301/80781/1280/1306/1278/1277/81578/1293/1295/1281/50509/1290/1294/1289/1292/1296/8292/1300/1287/7373/1307/1308
> 
> 
> ## plot dotplot
> print(AvsB_EnrichPlot)
>

dotplot

> ## based on dotplot, focus on sets GO:0017116 (single-stranded DNA helicase activity), upregulated, 
> ## and GO:0030020 (extracellular matrix structural constituent conferring tensile strength), downregulated
> 
> ## all plots show that the underlying data makes sense, thus that the respective set is up- resp. down-regulated
> ## and this behavior is consistent with the dotplot and NES scores from as.data.frame(AvsB_gse)!
> 
> gseaplot2(AvsB_gse, geneSetID = c("GO:0017116", "GO:0030020") )
> 
> gseaplot(AvsB_gse, geneSetID=c("GO:0017116"), title = AvsB_gse[rownames(data.frame(AvsB_gse))=="GO:0017116","Description"])
> gseaplot(AvsB_gse, geneSetID=c("GO:0030020"), title = AvsB_gse[rownames(data.frame(AvsB_gse))=="GO:0030020","Description"])
> 
> 
> ridgeplot(AvsB_gse, showCategory= c("GO:0017116", "GO:0030020") )
Picking joint bandwidth of 0.216
>

enter image description here

ADD REPLY • link 15 months ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Thank you for this thorough analysis of the code, much appreciated! About the ranking, I was referring to the printed list I got via write.csv, which adds a header on row one, sorry for the confusion.

Here is my comparison with your code above using gene_list from DOSE, nothing really sticks out here, except for the use of Uniprot identifiers instead of entrez IDs:

> head(AvsB_list)
  Q92817   P36776   Q8N4S0   Q96KP1   Q03252   P18859 
13.38191 12.97564 12.76279 12.10399 11.32148 11.01676
> tail(AvsB_list)
   P10586    O43795    Q16548    Q9HC36    Q9NW15    Q9Y3A6 
-11.55145 -11.56844 -11.88507 -13.04547 -13.83828 -13.86907 
> str(AvsB_list)
 Named num [1:3796] 13.4 13 12.8 12.1 11.3 ...
 - attr(*, "names")= chr [1:3796] "Q92817" "P36776" "Q8N4S0" "Q96KP1" ...
>

I did also run all your code using gene_list from DOSE on my setup (R-Studio in a markdown format) and received pretty much the same data, albeit with a small difference compared to your plot:

enter image description here

I also can repeat your gseaplot sets (no need to show again) except for the ridgeplot which gives me an error:

> ridgeplot(AvsB_gse, showCategory= c("GO:0017116", "GO:0030020") )
Warning: first element used of 'length.out' argumentWarning: NAs introduced by coercionError in seq_len(n) : argument must be coercible to non-negative integer

When I run my own AvsB_list again with the previously shown code, I get following enrichplot:

enter image description here

Lets pick two interesting GO terms with opposite activation, MHC class genes and serine/threonine kinases:

> as.data.frame(AvsB_gse)[c("GO:0023026", "GO:0004674"),]
                   ID                              Description setSize enrichmentScore
GO:0023026 GO:0023026     MHC class II protein complex binding       8       0.5937170
GO:0004674 GO:0004674 protein serine/threonine kinase activity      55      -0.4819219
                 NES      pvalue    p.adjust    qvalue rank
GO:0023026  1.639510 0.036992840 0.036992840 0.6972025 1547
GO:0004674 -1.433715 0.006177215 0.006177215 0.4915929 1435
                              leading_edge
GO:0023026 tags=100%, list=41%, signal=59%
GO:0004674  tags=69%, list=38%, signal=44%
                                                                                                                                                                                                                                                                     core_enrichment
GO:0023026                                                                                                                                                                                                                   P61769/P08238/P50995/P07900/P62258/P05026/P60033/P14618
GO:0004674 P52564/Q9BRS2/P24941/P11274/P25098/Q16644/Q05655/Q13131/Q8IU85/P31749/P54619/Q8N4C8/O94804/P17612/P36507/P51812/Q9P289/Q96S44/Q13501/Q14680/Q9H2G2/Q9H4A3/Q9Y6M4/P48729/Q16513/P41743/Q13153/P17252/Q99759/O60885/Q96KB5/P06493/Q9UHY1/P45984/Q13177/P28482/B4DR80/P00533
>

Now when I do the gseaplot following your example, things seem to be as expected compared to enrichplot (besides different color distribution?):

> AvsB_gseaplot <- gseaplot2(AvsB_gse, 
+                            geneSetID = c("GO:0023026", "GO:0004674")
+ )
> 
> AvsB_gseaplot
>

enter image description here

> MHC_plot <- gseaplot(AvsB_gse, geneSetID=c("GO:0023026"), title = IvsP_MF_gse[rownames(data.frame(AvsB_gse))=="GO:0023026","Description"])
> 
> MHC_plot

enter image description here

> Kinase_plot <- gseaplot(AvsB_gse, geneSetID=c("GO:0004674"), title = IvsP_MF_gse[rownames(data.frame(AvsB_gse))=="GO:0004674","Description"])
> 
> Kinase_plot

enter image description here

So having checked all this, everything seems to be alright considering my list.

What still puzzles me and maybe I have not made this more clear in my initial post is the following:

When I check AvsB_list for the associated log2ratios of my protein IDs in the MHC terms (GO:0023026), all of these are negative:

> as.data.frame(AvsB_list)[c("P61769", "P08238", "P50995", "P07900", "P62258",
+          "P05026", "P60033", "P14618"),]
[1] -0.3660870 -0.6865667 -0.8761021 -0.9202186 -1.1078039 -1.2290911
[7] -1.3357713 -1.4107504
>

So I am really wondering why these are described as "activated" in the enrichplot facet. These proteins are certainly lower in sample "A vs B" when seen from the comparison by each individual log2ratio. When compared to the serine/threonine group (GO:0004674) however, the individual log2ratios of the IDs listed are even lower:

> as.data.frame(AvsB_list)[c("P52564", "Q9BRS2", "P24941", "P11274", 
+                            "P25098", "Q16644", "Q05655", "Q13131", "Q8IU85",
+                            "P31749", "P54619", "Q8N4C8", "O94804", "P17612",
+                            "P36507", "P51812", "Q9P289", "Q96S44", "Q13501",
+                            "Q14680", "Q9H2G2", "Q9H4A3", "Q9Y6M4", "P48729",
+                            "Q16513", "P41743", "Q13153", "P17252", "Q99759",
+                            "O60885", "Q96KB5", "P06493", "Q9UHY1", "P45984",
+                            "Q13177", "P28482", "B4DR80", "P00533"
+                            ),]
 [1] -4.707521 -4.725568 -4.890860 -4.895919 -4.978416 -5.049424 -5.163870
 [8] -5.283019 -5.364784 -5.792321 -5.847517 -5.877914 -5.918623 -5.989008
[15] -6.127324 -6.360098 -6.369443 -6.455713 -6.523632 -6.625598 -6.673634
[22] -6.989434 -7.235785 -7.316042 -7.454785 -7.513516 -7.687555 -7.823971
[29] -8.084195 -8.121636 -8.182946 -8.375751 -8.425330 -8.499023 -8.560701
[36] -9.240499 -9.267659 -9.317141
>

This does explain the ranking to me in terms of comparing just the GO sets together.

But I do have to say that the terms "activated" and "suppressed" are a bit misleading if this is truly the difference that plots them into the separate facets using ".sign".

I admit that this is a very naive question from someone new to GO terms, but the labeling simply doesn't make sense to me considering the individual log2ratios from the Protein IDs. Genes in both GO sets should be considered "suppressed" as all log2s are negative when comparing sample A to B. Please let me know if I am missing something essential here!

ADD REPLY • link 15 months ago Tobias • 0

0

Entering edit mode

Aha, the gseaplot2 you posted shows the cause of the confusion! It boils down to the distribution of your ranking metric (logFC), and how the gene set enrichment analysis (GSEA) procedure works. See for example here for a detailed explanation: https://www.pathwaycommons.org/guide/primers/data_analysis/gsea/

Simply said, the basis of GSEA is a list in which the genes are ranked on a relevant metric, in your case logFC. Importantly, after the ranking, the values of the ranking metric (logFC) are not used anymore; of relevance is only the position of a gene in that ranked list. The next step in the procedure is to walk from top to down of the ranked list, and check in which gene set (pathway) each of the ranked genes appears. If there is a match, then the score (running sum) of those gene sets increase. This is done until the bottom of the ranked list is reached. If multiple genes that belong to the same gene set are on top of the ranked list, the (normalized) enrichment score of that gene set will be positive, and this gene set is then interpreted to be 'activated', the same applies to genes that are at the bottom of the ranked list; the gene sets that are 'enriched' by these genes have a negative NES, and could be interpreted as being 'suppressed'. What clusterProfiler under the hood does, is simply labeling sets having a positive NES as 'activated', and negative NES as 'suppressed'. These two term are then used to 'split' the gene sets in the dotplot.

One of the underlying assumptions is that you more or less have similar number of genes with a positive or negative ranking metric. Yet, if you look at the gseaplot2, you see at the bottom panel (with y-axis = 'Ranked List Metric') that of the 3796 proteins you used as input, only around 700 have a positive ranking metric. In other words, in your dataset only 700 genes are up-regulated, and 3000 are down-regulated. The GSEA algorithm correctly identifies gene sets that are enriched on the top of your ranked list, and since the values of the ranking metric are not considered at all, these are then just labelled 'activated'. This is why the gene set MHC class II protein complex binding is labelled like that!

By plotting the underlying ranking data, e.g. in a heatplot or upsetplot, it becomes clear that despite the label 'activated' the corresponding genes are actually down-regulated!

To illustrate this with some code:

> ## select all human genes
> genes.all <- keys(org.Hs.eg.db)
> 
> 
> ## randomly assign ranking metric (e.g. logFC) to the genes.
> ## make it so that **all genes** are **downregulated**, and sort.
> random.logFC <- runif(length(genes.all), min=-100, max=0) 
> names(random.logFC) <- genes.all
> random.logFC <- sort(random.logFC, decreasing = TRUE)
> 
> 
> ## show that indeed all genes have negative log2FC,
> ## thus are downregulated
> min(random.logFC)
[1] -99.99897
> max(random.logFC)
[1] -0.0008515082
> 
> 
> ## run gseGO
> ## in this case warning on ties can be ignored
> ## to reduce run time, min/max size has been adapted
> AvsB_gse <- gseGO(geneList = random.logFC, ## used random input
+                   ont = "MF", 
+                   keyType = "ENTREZID",
+                   nPermSimple = 10000, 
+                   minGSSize = 15, 
+                   maxGSSize = 500, 
+                   pvalueCutoff = 1, 
+                   verbose = TRUE, 
+                   OrgDb = org.Hs.eg.db,  
+                   pAdjustMethod = "BH")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning message:
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (0% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
> 
> ## check results
> ## note that because random genes were used as input, indeed no gene set 'survived' 
> ## multiple testing adjustment (= p.adjust), yet some sets pop-up as being significant based on uncorrected pvalue.
> AvsB_gse
#
# Gene Set Enrichment Analysis
#
#...@organism    Homo sapiens 
#...@setType     MF 
#...@keytype     ENTREZID 
#...@geneList    Named num [1:191179] -0.000852 -0.000864 -0.001147 -0.001301 -0.001369 ...
 - attr(*, "names")= chr [1:191179] "127891272" "127267426" "129992732" "130063923" ...
#...nPerm        
#...pvalues adjusted by 'BH' with cutoff <1 
#...947 enriched terms found
'data.frame':   947 obs. of  11 variables:
 $ ID             : chr  "GO:0046875" "GO:1901505" "GO:0015932" "GO:0140104" ...
 $ Description    : chr  "ephrin receptor binding" "carbohydrate derivative transmembrane transporter activity" "nucleobase-containing compound transmembrane transporter activity" "molecular carrier activity" ...
 $ setSize        : int  27 55 55 95 18 22 25 32 28 37 ...
 $ enrichmentScore: num  -0.538 -0.475 -0.457 -0.418 -0.571 ...
 $ NES            : num  -1.67 -1.6 -1.54 -1.47 -1.66 ...
 $ pvalue         : num  0.000621 0.000267 0.000737 0.000517 0.003156 ...
 $ p.adjust       : num  0.175 0.175 0.175 0.175 0.332 ...
 $ qvalue         : num  0.175 0.175 0.175 0.175 0.332 ...
 $ rank           : num  64502 76252 76252 62029 50206 ...
 $ leading_edge   : chr  "tags=63%, list=34%, signal=42%" "tags=64%, list=40%, signal=38%" "tags=58%, list=40%, signal=35%" "tags=51%, list=32%, signal=34%" ...
 $ core_enrichment: chr  "1946/1942/6714/1944/1945/2534/4690/6464/2043/358/1948/56899/1949/1947/867/2885/5294" "23443/8034/6523/2542/114789/7355/283600/113829/64078/10237/9197/8714/29957/3177/293/84879/9153/284439/6573/5100"| __truncated__ "23443/51092/8034/84275/114789/7355/283600/113829/64078/10237/9197/29957/3177/293/9153/284439/6573/54847/51000/2"| __truncated__ "475/59335/9694/200558/64901/6944/23633/3836/5824/23534/80224/3843/10073/3838/23039/51194/220869/51501/25842/564"| __truncated__ ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> 
> ## zoom in on normalized enrichment scores
> ## note that positive NES for gene sets are returned,
> ## despite all input genes having negative logFC!
> min(as.data.frame(AvsB_gse)$NES)
[1] -1.665548
> max(as.data.frame(AvsB_gse)$NES)
[1] 1.011848
>
> ## make dotplot
> AvsB_EnrichPlot <- dotplot(AvsB_gse, 
+                            x = "GeneRatio",
+                            color = "p.adjust",
+                            showCategory = 10,
+                            size = NULL,
+                            split = ".sign",
+                            font.size = 7,  ## reduced font size
+                            title = "A vs B Genes - Molecular Functions",
+                            orderBy = "x",
+                            label_format = 100, 
+                             ) + 
+         facet_grid(~.sign) +
+         theme(panel.spacing = unit(0.5, 
+                                    "cm", 
+                                    data = NULL))
> print( AvsB_EnrichPlot)
>

enter image description here

> ## based on dotplot set "protein lysine deacetylase activity" seems to be activated,
> ## and "purine ribonucleotide transmembrane transporter activity" to be suppressed.
> ##
> ## visualize underlying data for these sets in a heat- and upset-plots.
> ## note that all genes are down-regulated, although 1st gene set is reported to be 'activated'!
> ## also: warning can be ignored
> heatplot(AvsB_gse, foldChange=random.logFC, showCategory=c("protein lysine deacetylase activity",
+          "purine ribonucleotide transmembrane transporter activity"))
> 
> upsetplot(AvsB_gse, n=c("protein lysine deacetylase activity",
+          "purine ribonucleotide transmembrane transporter activity"))
geom_line(): Each group consists of only one observation.
 Do you need to adjust the group aesthetic?
geom_line(): Each group consists of only one observation.
 Do you need to adjust the group aesthetic?
>

enter image description here

ADD REPLY • link 14 months ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Thank you for this thorough explanation, the links and analysis, this is immensely helpful! Also sorry for my late reply, it has been quite busy around here. I do believe that quite a lot of people are using the split function but (like me) do interpret the resulting data about what is activated and what no in a wrong way.

In biology, terms like "activated" and "suppressed" are always in correlation to a control value (hence application log of fold-change ratios). What the plost using split = ".sign" actually shows is just that some GO-terms are more or less affected by change in the presence/position of genes in the list. This is a big difference, I believe the labels for the facets should be named differently in that regard.

I only had some question marks popping up once I checked the raw numbers underlying the plot (which is always good practice I guess). Your reply made absolute sense and helped me clarify what is going on, which is great! I will need to be more careful when I make these analyses in the future, doublecheck using your code above and most likely change the labels of these facets. Thank you again for the help, I appreciate that very much!

ADD REPLY • link 14 months ago Tobias • 0