HI,
I got the following plot for one of the DE genes. The p-value in DP cell type (the most right panel) is 7.50e-07 (FDR = 3.04e-05), with log2(fold-change) of 30. But obviously it's caused by the two outlier samples on the right corner. How would I avoid reporting genes like this? I am not sure the two outlier samples are same for other genes, so filtering them with minReplicateForReplace=Inf might not be an ideal solution.
I am using DESeq2_1.16.1.
Thanks,
-Xianjun
Thanks, Michael.
I will definitely add the non-zero counts requirement. But this will work for cases where most samples are at 1 (for example), and two samples are at extremely high (e.g. 100). I am wondering if it's possible (and how) to set the cooksCutoff parameter for case like this? Also, would the lfcSE be useful to filter case like this?
Thanks for referring the apeglm. It seems that lfcShrink() currently only accepts coef (not contrast) for type='apeglm'. In my case, I will have to use contrast to specify the reference level.
You can check the Cook's value for these genes, it's in assays(dds)[["cooks"]]
Probably using lfcShrink with type="normal" would also give LFC's of ~0 for these genes.
Thank you, Mike.
lfcShrink apparently does good job in shrinking the LFC, but seems not affect the p-value. e.g.
> res_after_shrink['ENSG00000257885.1',]
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000257885.1 8.157185 0.2021806 0.1781094 9.210762 3.238169e-20 6.098768e-16
> res_before_shrink['ENSG00000257885.1',]
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000257885.1 8.157185 27.65828 3.002822 9.210762 3.238169e-20 6.098768e-16
Here is the raw values (already require non-zero count >= 4). There is an extreme high value of 350, while most are zero.
"lfcShrink apparently does good job in shrinking the LFC, but seems not affect the p-value"
Yes, sorry if that wasn't clear. You can either filter on the absolute value of LFC, or you can regenerate p-values.
You can filter simply with
res[abs(res$log2FoldChange) > theta,]
lfcShrink() will only change the p-value column if you set lfcThreshold to a nonzero value (so testing that the shrunken LFC is larger than 'theta', and this only works with normal or apeglm), or if you set svalue=TRUE (works for apeglm and ashr only).
One more note: lfcThreshold only works with the latest release v1.20.
Hi Mike, I really appreciate your time and helps, but I don't think we are on the right question: My true question is why samples with one or two outliers can lead to such a fake "significant" p-value? In the case above (example gene in the DP group), most examples have zero expression and only two samples have high level in the MS group. I made a similar example here:
I know DEseq2 uses different test (e.g LRT) with advanced tricks (e.g. considering dispersion, adjust covariates etc.), and I can apply filters like you suggested above, but intuitively why t-test gives p-value 0.166 and DEseq2 gives a very small p-value like 3.238169e-20? This does not make sense to me.
Please help to explain. I might have misunderstand something here.
Hi Xianjun,
I don’t have much more for you here. The GLM pvalue detects the the two 100s vs all 0s. If it were a single large count it would likely be filtered. Shrinkage of LFC alleviate this either way.
Thanks, Mike. That's OK. Nothing is perfect. I can still play with the tricks you teach me and filter out those cases, and there are not that many of them. Have a nice weekend!
Last question on this: Would you think setting betaPrior=TRUE in DEseq(), which calculates p-values for the shrunken LFC, can help for outlier cases like this?
Yes, and certainly if you use lfcThreshold.