We have performed RNAseq on a lot of samples of individual flies (~ 20 per condition). After running DESeq and then checking various contrasts (with ihw=TRUE), I find myself with quite a bit of genes that are detected as differentially expressed between 2 conditions. But when I plot their counts, I notice that for a lot of them, it comes from the fact that most individuals have 0 counts while 1 or 2 have a high counts, which ends up being responsible for the significance of the Wald test.
I have read that pre-filtering was recommended, something along the lines of
keep <- rowSums(counts(dds) >= x) >= y
However if you look at the example I am joining to this post : if I filtered out this gene based on the fact that it has more than 0 counts in only 10 individuals, I take the risk of filtering a gene which would have more than 0 counts in 10 individuals of one condition. This would represent half of the individuals of the given condition, which I would consider meaningful.
There are a few of these genes in my dataset too.
What could I do ?
Thank you !
The problem is that your prefilter is not group-aware so the behaviour of seeing these sorts of outliers is expected. This is why I always recommend
filterByExpr
from edgeR (returning a vector of genes to keep which you can apply to your DESeqDataSet) as this does group-aware filtering, thereby ensuring that retained genes have sufficiently large counts to facilitate a groupwise comparison avoiding the sitatuon you see.I am going to try this right now, thanks !