I have a small dataset with 3 replicates per condition. One condition is a gene over-expression, the other is a control. I'm used to run DESeq2 with betaPrior = True to have a comparability with the 'old' DESeq2 behavior. I can't apply this old workflow because the expression differences between my conditions it quite small and one gene, namely the over-expressed gene, would stand out and with betaPrior = T it's shrunk to a very small log2 FoldChange. According to the manual this is to be expected which is why I used betaPrior = F and the several lfcShrink methods (normal, ashr and apeglm).
Here is the data set with 3 replicates and the several methods. From the left to the right it is the following and I used coef in the lfcShrink function:
betaPrior = F and method = normal, betaPrior = F, betaPrior = F and method = ashr, betaPrior = F and method = apeglm
If I use two replicates, it's even more visible, especially for apeglm and I wonder what's happening here. Especially most of the not DE are reduced to log2Foldchange of < 0.001 and even some DEG have a log2Foldchange < 0.001
Thanks for posting these images. What you observe is consistent with what we see in testing on the benchmarking data and on simulation data. If you just compare method="normal" to method="apeglm" or "ashr", the differences you are likely to see is that normal will shrink large effects even if they have high precision (so shrinking too much) and allow small effects to float around 0, while apeglm/ashr will not shrink the precise, large effects much at all and the small effects which are indistinguishable from 0 will be shrunk to 0. By indistinguishable I mean that the SE(beta) >> |beta|. This kind of estimator is very good at ranking genes and it also has low error compared to the true LFC (which we know in the benchmarking datasets and the simulation datasets). Also all of these estimators converge to MLE as the sample size gets larger. It's up to you which you want to use for plots. For ranking genes by effect size, I'd recommend apeglm or ashr, as these perform the best in our hands.
One more note: the DEG according to the Wald test can be shrunk a lot by apeglm or ashr, they are different approaches statistically, and you can have a very small p-value for small effect size which nevertheless is not equal to 0. If you want to have more consistent error-bounded sets (FDR/FSR sets) with the shrunken effect sizes, you can use svalue=TRUE when you call lfcShrink(). Then the error-bounded set is consistent with the shrunken effect size, because the same posterior is being used for both.
Thanks, as many times before, for the feedback. After reading your tutorials and comments on threads, I understand the shrinkage. But what is a bit of a concern is providing a ma plot to a biologist. In the past the would a similar plots compared to picture 1 and 2 in https://ibb.co/fjz6ET (betaPrior = F, method = normal and betaPrior = F) and now one delivers the plot for method = apeglm which basically only shows the DEGs and everything else is around 0 and not even visible anymore.
For example here are some genes which are significant, padj < 0.1, in the two replicates per condition scenario
Gene
log2 apeglm
log2 betaPrior = F
Cond A
Cond B
Stat2
0.0013680983
0.78
894.8
1536.7
Rbm43
0.0010062328
0.91
181.3
341.4
Apobec1
0.0009898028
0.8
230.6
402.7
Zcchc24
0.0011278656
0.68
332.6
533.5
That's hard to explain to a biologist who looks at the mean counts per condition and wonders why the log2 foldchange is now almost 0. How would one approach this with a good explanation?
I forgot to mention a feature which might be useful for you. If I simulate a majority of null where LFC=0 with a few spiked in DE genes, you can see the behavior you are talking about, where the precise LFCs are preserved and many LFCs compatible with zero are shrunk close to zero. This is the estimator which gives very good rankings and MSE compared to truth.
If however, you want to use the Cauchy prior in apeglm but not have it adapt to the scale of the data, you can set apeAdapt=FALSE. This uses a non-adaptive Cauchy prior with the standard scale. This will tend to make MA plots that look like type="normal", but it will perform better than type="normal", in that precise, large effects will not be reduced. This is not the default method for apelgm, because the default is more accurate in producing FSR bounded sets which we show in the manuscript. I wouldn’t use the s-values with apeAdapt=FALSE. But this may be a useful in between for you, or using type="ashr". Here's some example code:
Hi Michael,
Thanks, as many times before, for the feedback. After reading your tutorials and comments on threads, I understand the shrinkage. But what is a bit of a concern is providing a ma plot to a biologist. In the past the would a similar plots compared to picture 1 and 2 in https://ibb.co/fjz6ET (betaPrior = F, method = normal and betaPrior = F) and now one delivers the plot for method = apeglm which basically only shows the DEGs and everything else is around 0 and not even visible anymore.
For example here are some genes which are significant, padj < 0.1, in the two replicates per condition scenario
That's hard to explain to a biologist who looks at the mean counts per condition and wonders why the log2 foldchange is now almost 0. How would one approach this with a good explanation?
I forgot to mention a feature which might be useful for you. If I simulate a majority of null where LFC=0 with a few spiked in DE genes, you can see the behavior you are talking about, where the precise LFCs are preserved and many LFCs compatible with zero are shrunk close to zero. This is the estimator which gives very good rankings and MSE compared to truth.
If however, you want to use the Cauchy prior in apeglm but not have it adapt to the scale of the data, you can set
apeAdapt=FALSE
. This uses a non-adaptive Cauchy prior with the standard scale. This will tend to make MA plots that look liketype="normal"
, but it will perform better thantype="normal"
, in that precise, large effects will not be reduced. This is not the default method for apelgm, because the default is more accurate in producing FSR bounded sets which we show in the manuscript. I wouldn’t use the s-values with apeAdapt=FALSE. But this may be a useful in between for you, or usingtype="ashr"
. Here's some example code:I guess from this table we don't know about variability of counts or the MLE. What's the LFC SE?
Also, what does ashr report?