I am performing a DGE analysis on plant tissues under different stresses with or without a separate treatment: moderate (D) and severe drought (S), with or without a separate treatment (T). Each combination has an unstressed/untreated control, e.g. MUU (moderate drought control, unstressed and untreated), MDU (moderate drought stressed, no treatment), MDT etc., for both levels of stress and different tissues. The analysis is simple in the sense that we are only analysing stressed/treated samples versus their control in a particular tissue: MDUL vs. MUUL, MDTL vs. MUUL in leaf, for example (each sample has three biological replicates).
In order to later use lfcShrink()
with apeglm
it is necessary to use the coefficient name ("group_MDTL_vs_MUUL"
) as opposed to a contrast ("group","MDTL","MUUL"
). Thus, I releveled the data for each comparison as discussed in the vignette and elsewhere:
dds=DESeqDataSetFromTximport(txi,colData=sampletable,design=~group)
rdds=DESeq(dds)
rdds$group=relevel(rdds$group,ref="MUUL")
rdds=nbinomWaldTest(rdds)
resultsNames(rdds)
[1] "Intercept" ... "group_MDTL_vs_MUUL"
res=results(rdds,name="group_MDTL_vs_MUUL",alpha=0.05,lfcThreshold=1,altHypothesis="greaterAbs")
res_shrink=lfcShrink(dds=rdds,coef="group_MDTL_vs_MUUL",res=res,type="apeglm")
When I was analysing the output for the comparisons I noticed several instances where a gene was called DE despite intuitively seeming like the counts/number of samples with non-zero counts was too low. This included several instances where a gene had literally zero counts in both of the compared samples (bearing in mind the 0.5 pseudocount from plotCounts
):
$ res["Solyc00g500050.1.1",]
log2 fold change (MLE): group MDTL vs MUUL
Wald test p-value: group MDTL vs MUUL
DataFrame with 1 row and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
Solyc00g500050.1.1 31.6121 -20.7792 5.10032 -3.87803 0.000105307
padj
<numeric>
Solyc00g500050.1.1 0.00381324
$ plotCounts(rdds,"Solyc00g500050.1.1",intgroup="group",return=TRUE) %>% filter(group == "MDTL" | group == "MUUL")
count group
MDTL1 0.5 MDTL
MDTL2 0.5 MDTL
MDTL3 0.5 MDTL
MUUL1 0.5 MUUL
MUUL2 0.5 MUUL
MUUL3 0.5 MUUL
Using contrast
rather than name
seemed to prevent these genes being given significant p-values, as did using lfcShrink
and filtering genes by s-value instead. I suspect the issue may be related to the highly variable expression of genes across the tissues and conditions, since a gene may be highly expressed in some tissue/condition and zero in many others? This also makes it difficult to pre-filter such genes.
1) Are there any major errors in the pipeline above?
2) Is there any way to avoid identifying such genes, other than a post-hoc filter?
Thanks!
Did you perform pre-filtering of low counts genes before differential analysis ? Have a look at the DESeq2 vignette on the pre-filtering : http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
Yes, I have done it without and without pre-filtering. The problem, which I don't really make clear above, is that these genes are robustly expressed in other conditions and tissues and so have a reasonable
baseMean
. I would not expect them to be filtered out of the dataset entirely, but I would have expected a gene with no counts in two samples to not be then measured as DE between those two samples (at least, which is what appears to be happening when using the coefficient name inresults
rather thancontrast
).