Question

Questions regarding GSVA and limma

Chenxi • 0

Japan

Hello,

First, this is the task I need to finish:

Download GMT files from MSigDB: the hallmark gene sets and the canonical pathways under C2 (curated gene sets).
Prepare two data inputs: the raw counts gathered from the STAR alignment results, and a transformed version of those counts produced with the DESeq2 package. The transformation is $\log_2(n+1)$.
Get enrichment scores with the gsva() function for both inputs and both gene-set collections, with the CDF kernel set to Poisson for one and Gaussian for the other.
Pass the GSVA scores to the limma package and fit a linear model with the lmFit() function. Then select the pathways with an adjusted p-value below 0.01 as the significant gene sets. I learned this step from section 6.2 of the GSVA vignette.
Compare the results: find which significant sets overlap between the two inputs and which do not. Then choose the 5 sets with the smallest adjusted p-values that are significant for only one of the data inputs.
Plot heatmaps of the raw counts and the transformed data for those sets to see whether any pattern explains the differences in p-values and significance.
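
The adjusted p-value cutoff in the steps above can be illustrated with a minimal sketch. limma's topTable() adjusts p-values with the Benjamini-Hochberg method by default; the Python function below reimplements that adjustment for illustration only. The gene-set names and raw p-values are hypothetical, not results from this analysis.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five gene sets (illustrative numbers only).
raw_p = {"SET_A": 0.0001, "SET_B": 0.004, "SET_C": 0.03,
         "SET_D": 0.2, "SET_E": 0.001}
names = list(raw_p)
adj = dict(zip(names, bh_adjust([raw_p[n] for n in names])))

# Same rule as in the task description: adjusted p-value below 0.01.
significant = sorted(n for n in names if adj[n] < 0.01)
print(significant)
```

Because the adjustment depends on the number of tests and on the ranks of all p-values, a gene set can move across the 0.01 threshold when the other p-values in the same collection change, even if its own raw p-value barely moves.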

Questions and Queries

Overall, I want to find out why the transformation can lead to differences in p-values.

When I change the kernel from Poisson to Gaussian in the gsva() function, the enrichment scores change. Why do the scores change, and why do some of them even flip from positive to negative?
How should I interpret the linear models in the limma package? The coefficients are described as fold changes, but how do I interpret a fold change in GSVA scores, which can be positive or negative? What do the coefficients mean, and how should I interpret the p-values? This does not look like an ordinary linear model, where the coefficients are used for prediction and the t-statistics and p-values show whether those coefficients could be 0. How does that apply here, where the predictors are factors? This also leads to the next question.
When applying the limma functions, there are two kinds of design matrix, one with an intercept and one without. How should I understand the effect of these two settings on the coefficients and p-values? In particular, what does the p-value for the intercept term mean?
If this way of choosing significant sets is not theoretically sound, how should I choose the significant gene sets? Which test and which package should I use? And why does the GSVA tutorial select interesting gene sets this way?
I previously made heatmaps of the sets that behaved differently, but I could not see any pattern clear enough to reach a conclusion. That is why I have all these questions. A further question: what does a good heatmap look like, and what should we expect to see?
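
The question about designs with and without an intercept can be made concrete with a small sketch. It assumes a two-group comparison and uses made-up GSVA-like scores for a single gene set. The two designs mirror what R's model.matrix(~ group) and model.matrix(~ 0 + group) would produce, and the fit is plain least squares written in Python for illustration, not limma itself.

```python
def ols_two_columns(X, y):
    """Least-squares coefficients for a design matrix with exactly two columns."""
    # Entries of X'X and X'y for the 2x2 normal equations.
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * yi for r, yi in zip(X, y))
    v = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    return ((d * u - b * v) / det, (a * v - b * u) / det)


# Hypothetical scores for one gene set in six samples (illustrative only).
groups = ["ctrl", "ctrl", "ctrl", "trt", "trt", "trt"]
scores = [-0.30, -0.20, -0.10, 0.10, 0.20, 0.30]

# Design with intercept: column 1 is all ones, column 2 flags the "trt" group.
with_intercept = [(1.0, 1.0 if g == "trt" else 0.0) for g in groups]
# Design without intercept: one indicator column per group.
without_intercept = [(1.0 if g == "ctrl" else 0.0,
                      1.0 if g == "trt" else 0.0) for g in groups]

intercept, trt_minus_ctrl = ols_two_columns(with_intercept, scores)
mean_ctrl, mean_trt = ols_two_columns(without_intercept, scores)

print(intercept, trt_minus_ctrl)  # reference-group mean, difference of means
print(mean_ctrl, mean_trt)        # one mean per group
```

With an intercept, the first coefficient is the mean score of the reference group and the second is the difference between group means, so the p-value of the second coefficient tests the group difference, while the p-value of the intercept only tests whether the reference-group mean is zero. Without an intercept, each coefficient is a group mean, and the group difference has to be tested through an explicit contrast. In both cases the quantity of interest is a difference in mean GSVA scores, not a fold change in expression.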

Guesses

I guess that what changes the GSVA scores is the empirical cumulative distribution function step in GSVA. However, I can't find a way to verify this, because the calculation happens inside gsva() and I can't call it myself; each kernel uses its own formula rather than plain ecdf(). Different values lead to different rankings, so the KS statistic necessarily changes. But does the transformation make sense? Does this calculation make sense?
I know from posts on the Bioconductor forum that the interpretation of logFC for GSVA scores is tricky, and that one uses the p-value because it has a precise interpretation. However, does this p-value cutoff method make sense at all?
Since these are the gene sets with the smallest p-values, meaning they are significant for our data, we should expect to see some differences in the heatmap. I can't find any clear differences, so I am confused.
I guess this is not the right way to choose significant sets, but I can't figure out why the tutorial does it this way or why it doesn't work in this case.
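
The kernel step mentioned in the first guess can be explored outside gsva() with a rough sketch. It follows one reading of the GSVA paper: for each gene, a kernel estimate of the cumulative distribution function is evaluated at every sample, using a Gaussian kernel with bandwidth equal to the gene's standard deviation divided by 4 for continuous data, or a Poisson kernel with rate equal to the observed count plus 0.5 for counts. Those two constants, the function names, and the toy counts are assumptions of this sketch; the package's compiled code may differ in detail.

```python
import math


def gaussian_kernel_cdf(values):
    """Kernel CDF estimate at each sample; Gaussian kernel, bandwidth sd / 4 (assumed)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    h = sd / 4.0

    def phi(z):
        # Standard normal CDF via the error function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return [sum(phi((xj - xi) / h) for xi in values) / n for xj in values]


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def poisson_kernel_cdf(counts, r=0.5):
    """Kernel CDF estimate at each sample; Poisson kernel with rate count + r (assumed)."""
    n = len(counts)
    return [sum(poisson_cdf(xj, xi + r) for xi in counts) / n for xj in counts]


# Hypothetical counts of one gene across six samples (illustrative only).
counts = [0, 3, 8, 20, 55, 150]
log_counts = [math.log2(c + 1) for c in counts]

f_pois = poisson_kernel_cdf(counts)       # integer counts, Poisson kernel
f_gauss = gaussian_kernel_cdf(log_counts)  # log2(n + 1) values, Gaussian kernel

for c, fp, fg in zip(counts, f_pois, f_gauss):
    print(c, round(fp, 3), round(fg, 3))
```

Both estimates preserve the order of the samples within a gene, but the values differ between kernels. Since GSVA ranks genes within each sample using these per-gene values, changing them for every gene can reorder the genes within a sample and therefore change the resulting enrichment score, including its sign.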

I can provide code if necessary. I read the GSVA paper, but I can't completely understand it. Is that because of my lack of patience or intelligence? I hope I have expressed my questions clearly; if anything is confusing, I will do my best to explain further.

Sincerely, Chenxi

limma GSVA • 2.7k views

score 2 · Accepted Answer · 2021-07-01

hi,

The difference you see is probably because your integer counts were not normalized, while your log-transformed continuous values were. The question about the differences between the Poisson and Gaussian kernels has already come up in a previous post; when the data are normalized, both approaches should lead to similar results. We developed the Poisson-kernel approach at a time when we thought it was useful to work with integer counts, but my current recommendation is to derive normalized continuous expression values, such as log-CPMs or log-TPMs, and use the default Gaussian kernel.
Figure 2 of the GSVA paper shows that GSVA scores lead to well-calibrated tests, which has also been our general experience over the years. This is why it is sensible to use a cutoff on adjusted p-values to select differentially expressed gene sets. The question about the log-fold change was already answered in the post you linked, and I have nothing to add unless you ask a more specific question.
I think your third question is not specific to GSVA but about linear models in general. If you're not getting the results you expect, you should provide more details about your analysis, so that others can help and see whether the problem lies in the way you use GSVA.
A heatmap is not a tool for choosing differentially expressed genes or gene sets; it's an exploratory visualization tool for looking at the magnitude of expression changes.
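
The normalization point made at the start of this answer can be illustrated with a minimal sketch. It assumes two hypothetical samples with identical composition, one sequenced twice as deep, and uses a simple log2(CPM + 1) formula written in Python; dedicated tools such as edgeR's cpm() handle the offset differently, so this is not a substitute for them.

```python
import math


def log_cpm(counts, offset=1.0):
    """log2 of counts-per-million plus a small offset, for one sample."""
    lib_size = sum(counts)
    return [math.log2(c / lib_size * 1e6 + offset) for c in counts]


def log_raw(counts):
    """The log2(n + 1) transform applied directly to unnormalized counts."""
    return [math.log2(c + 1) for c in counts]


# Hypothetical counts for three genes; sample_2 is sample_1 at twice the depth.
sample_1 = [10, 90, 900]
sample_2 = [20, 180, 1800]

raw_1, raw_2 = log_raw(sample_1), log_raw(sample_2)
cpm_1, cpm_2 = log_cpm(sample_1), log_cpm(sample_2)

print(raw_1, raw_2)  # differ, although the composition is the same
print(cpm_1, cpm_2)  # agree once library size is accounted for
```

Without the library-size step, sequencing depth alone shifts every gene in a sample, which is the kind of difference that can carry through the kernel estimation into the GSVA scores and the downstream p-values.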