Hm. The concepts seem pretty obvious to me, but then again I suppose they would be.
- Common dispersion is the mean dispersion across all genes.
- Trended dispersion is the mean dispersion across all genes with similar abundance. In other words, the fitted value of the mean-dispersion trend.
I suppose I should qualify this by saying that we don't literally compute the mean in edgeR, but rather take the dispersion value that maximizes the adjusted profile likelihood. This is unlikely to be helpful for your audience, so you can probably just simplify it by saying that it's the mean.
For the tagwise dispersions, perhaps the simplest way to proceed would be to make a BCV plot from the output of estimateDisp with prior.df=0. This will yield dispersion estimates without any empirical Bayes shrinkage, i.e., "raw" tagwise estimates that do not share information across genes. You can then compare this to the BCV plot with the default prior.df; you should see that the points are squeezed towards the trend (or towards the common value in red, if you set trend="none"). This should demonstrate how empirical Bayes shrinkage works, effectively squeezing values together to reduce the effect of estimation uncertainty when you have low numbers of replicates.
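A minimal sketch of that comparison is below. The counts are simulated purely for illustration (2000 genes, two groups of three replicates, a true dispersion of 0.1); none of these numbers come from your data, so substitute your own DGEList and design.

```r
library(edgeR)

# Simulated counts (an assumption for illustration only):
# 2000 genes, 2 groups x 3 replicates, true NB dispersion of 0.1.
set.seed(1)
ngenes <- 2000
group <- factor(rep(c("A", "B"), each = 3))
mu <- rep(2^runif(ngenes, 3, 10), times = 6)  # same mean for a gene in every sample
counts <- matrix(rnbinom(ngenes * 6, mu = mu, size = 1 / 0.1), nrow = ngenes)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~group)

# "Raw" tagwise dispersions: no empirical Bayes shrinkage.
y.raw <- estimateDisp(y, design, prior.df = 0)
# Default behaviour: tagwise dispersions are shrunk towards the trend.
y.eb <- estimateDisp(y, design)

# Side-by-side BCV plots; the points should be more scattered on the left.
par(mfrow = c(1, 2))
plotBCV(y.raw, main = "prior.df = 0")
plotBCV(y.eb, main = "default prior.df")
```

With only three replicates per group, the left panel should show a much wider scatter of tagwise points than the right one, even though every simulated gene has the same true dispersion.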
Now, as for how this affects hypothesis testing - there are three main points, in order of obviousness:
- Larger dispersions = higher variance between replicates, which reduces power to detect DE.
- The performance of the model (and thus of the DE analysis) depends on the accuracy of the dispersion estimate. If there is a strong mean-dispersion trend, the common dispersion is obviously unsuitable. If the gene-specific dispersions vary around the trend, the trended dispersion is unsuitable. The "raw" tagwise estimates are unbiased estimates of the gene-specific dispersions, and should be the most suitable, except...
- The performance of the model also depends on the precision with which the dispersions are estimated. Here, the raw estimates are least stable as they use the least amount of information, whereas the trended (and to a greater extent, common) dispersion estimates share information between genes for greater stability. This is why the shrunken tagwise estimates (that you get with default estimateDisp) are so useful, as they provide a compromise between precision and accuracy.
You may already know that we are now recommending the QL framework with glmQLFit and glmQLFTest for routine GLM-based DE analyses. This introduces another set of concepts, namely the distinction between negative binomial dispersions and quasi-likelihood dispersions. Long story short, the NB dispersions aim to model the mean-dispersion relationship across the entire dataset, while the QL dispersions aim to capture the variability and estimation uncertainty of the dispersion for each gene.
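For reference, a QL analysis might look roughly like the sketch below. Again, the counts are simulated stand-ins (an assumption, not your data), and coef=2 assumes a two-group design with an intercept.

```r
library(edgeR)

# Simulated stand-in data (an assumption): 1000 genes, 2 groups x 3 replicates.
set.seed(2)
ngenes <- 1000
group <- factor(rep(c("A", "B"), each = 3))
counts <- matrix(rnbinom(ngenes * 6, mu = 100, size = 1 / 0.1), nrow = ngenes)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~group)

# NB dispersions: describe the mean-dispersion relationship across the dataset.
y <- estimateDisp(y, design)

# QL dispersions: estimated per gene, then squeezed towards a trend.
fit <- glmQLFit(y, design)
plotQLDisp(fit)

# Test whether the second coefficient (the B vs A log-fold change) is zero.
res <- glmQLFTest(fit, coef = 2)
topTags(res)
```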
Thank you! Going from prior.df=0 to prior.df=x should really clarify the compromise visually!
Do you have any suggestions for showing what a GLM fit actually does, or for explaining this simply? I think the testing of coefficients would be easy to understand, but I completely forgot the fitting part... I know about the gof() plot, but I think this focuses more on the results than on what the fit actually does.
If you're asking about the glmFit function, it fits a negative binomial GLM to the counts for each gene. Exactly how it does so involves a number of interesting tricks and speed-ups (IRLS with Levenberg-Marquardt damping), but this should not be relevant to you or your audience. If you're asking about the concept of fitting a GLM, that is a more general statistical question rather than anything specific to edgeR. I would suggest a more appropriate forum like https://stats.stackexchange.com/, or various online resources on GLMs (starting with linear models may be easier).
Sorry if I wasn't clear. I actually meant whether I could produce a plot like the last figure on this page. Let's say I have done my tagwise dispersion estimates based on the whole data, then select a single gene and its counts and dispersion. Like so:
Based on this information I should be able to produce a plot like that, right? I tried doing so by using mglmOneWay but couldn't figure it out ;(. If I could produce some of these plots, it would be much easier to understand what the two coefficients actually are and why it makes sense to test hypotheses on them.
Just get the estimated coefficients from the DGEGLM object produced by glmFit. These can be interpreted in the same manner as for linear models (keeping in mind that NB GLMs use a log-link, so the coefficients represent log-differences in expression).
Sorry for bothering you again! But I can't figure out what these coefficients mean. Let's say we compare two treatments; this will result in two coefficients: the first one is the baseline for group 1 and the second the 2 vs 1 comparison (according to the user guide). Indeed, when I plot coefficient 1 against coefficient 2, all genes falling on/around the diagonal are not significant. But I don't get what I'm comparing here. As this would be one model, the model would look like this I suppose: y = b0 + treatment * b1 (where treatment is just 1 or 0). So if two conditions had the same means, I would think that b1 is just near zero. However, when I look at the output you referred to, the second coefficient is nearly the same as the first one for non-DE genes. Why is this?
Plotting coefficient 2 against coefficient 1 across all genes is meaningless. You already understand that coefficient 2 represents the log-fold change between groups. It follows that your DE genes will be those where coefficient 2 is significantly different from zero. The behaviour of the first coefficient is not relevant.
Aaah, that's clear! Got all the steps now. Thank you for the great and quick support for the questions!
But the coefficient is much smaller than zero, but the fold change is nearly zero and the gene is not significant
You're not fitting the model that you said you were fitting. In your previous comments, you described a model with an intercept, where the first coefficient represents the log-expression in the "baseline" group and the second coefficient represents the log-fold change. In your latest comment above, you are fitting a no-intercept model, where the two coefficients represent the log-expression in the respective groups. Mixing the two interpretations doesn't make sense. (Also, the coefficients are natural log while the log-fold change is log2.) I suggest you re-read the relevant parts of the edgeR (and limma) user's guides.
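To make the two parametrizations concrete, here is a small base-R sketch for a single gene. The counts are made up, and a Poisson glm is used only as a simple stand-in for the NB fit; glmFit would additionally account for library sizes and dispersions, but the meaning of the coefficients is the same.

```r
# Made-up counts for one gene: 3 replicates in each of two groups.
counts <- c(10, 12, 14, 20, 24, 28)
group <- factor(rep(c("g1", "g2"), each = 3))

# Intercept model: coefficient 1 is the log-mean of g1 (the baseline),
# coefficient 2 is the log-fold change of g2 over g1.
fit.int <- glm(counts ~ group, family = poisson)

# No-intercept model: one log-mean per group, no fold change coefficient.
fit.noint <- glm(counts ~ 0 + group, family = poisson)

b.int <- unname(coef(fit.int))
b.noint <- unname(coef(fit.noint))

# Same fit, different parametrization: the intercept model's second
# coefficient equals the difference of the two no-intercept coefficients.
all.equal(b.int[2], b.noint[2] - b.noint[1], tolerance = 1e-6)

# The coefficients are on the natural log scale; divide by log(2) for log2.
# The group means are 12 and 24, so this should be close to 1.
log2fc <- b.int[2] / log(2)
log2fc
```

In the no-intercept model, a non-DE gene has two nearly equal coefficients, which is why such genes sit on the diagonal when one coefficient is plotted against the other. In the intercept model, the same gene has a second coefficient near zero.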
That's why I said "the model would look like this I suppose", but apparently I was wrong here. But then the plot I mentioned (coefficient 1 vs coefficient 2) does make sense. Anyway, I will re-read it.