Question

Filtering after DESeq

0

Entering edit mode

caroline.zanchi • 0

@3f9f9566

Last seen 58 minutes ago

Germany

We have performed RNAseq on a lot of samples of individual flies (~ 20 per condition). After running DESeq and then checking various contrasts (with ihw=TRUE), I find myself with quite a bit of genes that are detected as differentially expressed between 2 conditions. But when I plot their counts, I notice that for a lot of them, it comes from the fact that most individuals have 0 counts while 1 or 2 have a high counts, which ends up being responsible for the significance of the Wald test.

I have read that pre-filtering was recommended, something along the lines of

keep <- rowSums(counts(dds) >= x) >= y

However if you look at the example I am joining to this post of 1 gene : if I filtered out this gene based on the fact that it has more than 0 counts in only 10 individuals, I take the risk of filtering a gene which would have more than 0 counts in 10 individuals of one condition. This would represent half of the individuals of the given condition, which I would consider meaningful. There are a few of these genes in my dataset too.

What could I do ?

Thank you !

DESeq2 IHW • 216 views

ADD COMMENT • link 3 days ago • updated 4 hours ago caroline.zanchi • 0

1

Entering edit mode

The problem is that your prefilter is not group-aware so the behaviour of seeing these sorts of outliers is expected. This is why I always recommend filterByExpr from edgeR (returning a vector of genes to keep which you can apply to your DESeqDataSet) as this does group-aware filtering, thereby ensuring that retained genes have sufficiently large counts to facilitate a groupwise comparison avoiding the sitatuon you see.