Question

Filtering after DESeq

0

Entering edit mode

caroline.zanchi • 0

@3f9f9566

Last seen 1 hour ago

Germany

We have performed RNAseq on a lot of samples of individual flies (~ 20 per condition). After running DESeq and then checking various contrasts (with ihw=TRUE), I find myself with quite a bit of genes that are detected as differentially expressed between 2 conditions. But when I plot their counts, I notice that for a lot of them, it comes from the fact that most individuals have 0 counts while 1 or 2 have a high counts, which ends up being responsible for the significance of the Wald test.

I have read that pre-filtering was recommended, something along the lines of

keep <- rowSums(counts(dds) >= x) >= y

However if you look at the example I am joining to this post of 1 gene : if I filtered out this gene based on the fact that it has more than 0 counts in only 10 individuals, I take the risk of filtering a gene which would have more than 0 counts in 10 individuals of one condition. This would represent half of the individuals of the given condition, which I would consider meaningful. There are a few of these genes in my dataset too.

What could I do ?

Thank you !

DESeq2 IHW • 529 views

ADD COMMENT • link 4 days ago • updated 4 hours ago caroline.zanchi • 0

1

Entering edit mode

The problem is that your prefilter is not group-aware so the behaviour of seeing these sorts of outliers is expected. This is why I always recommend filterByExpr from edgeR (returning a vector of genes to keep which you can apply to your DESeqDataSet) as this does group-aware filtering, thereby ensuring that retained genes have sufficiently large counts to facilitate a groupwise comparison avoiding the sitatuon you see.

ADD REPLY • link 1 day ago ATpoint ★ 4.8k

0

Entering edit mode

I am going to try this right now, thanks !

ADD REPLY • link 1 day ago caroline.zanchi • 0

0

Entering edit mode

... I did it. It does filter out quite a bit of genes :

keep <- filterByExpr(dds, group="host_treatment_generation",min.count = 10, min.total.count = 15, large.n = 10, min.prop = 0.7)
dds_keep <- dds[keep,]

dds
class: DESeqDataSet 
dim: 18380 198 
metadata(1): version
assays(6): counts avgTxLength ... H cooks
rownames(18380): FBgn0000003 FBgn0000008 ... FBtr0473837_df_nrg
  FBtr0473838_df_nrg
rowData names(66): baseMean baseVar ... deviance maxCooks
colnames(198): 1 2 ... 1291 1425
colData names(9): library sample ... host_treatment_generation
  host-treat

dds_keep
class: DESeqDataSet 
dim: 13577 198 
metadata(1): version
assays(6): counts avgTxLength ... H cooks
rownames(13577): FBgn0000003 FBgn0000008 ... FBgn0288846 FBgn0288856
rowData names(66): baseMean baseVar ... deviance maxCooks
colnames(198): 1 2 ... 1291 1425
colData names(9): library sample ... host_treatment_generation
  host-treat

However, when I looked at one specific contrast of my filtered dataset,

resIHW1_tidy <- results(dds_keep, contrast=c("host_treatment_generation","A","B"), tidy=TRUE,filterFun = ihw,alpha=0.05, lfcThreshold=1, altHypothesis="greaterAbs")

there are more genes that are significantly differentially expressed compared to what was the case in my un-filtered dataset (16 genes vs. 10). By plotting counts for all treatment levels for these genes, I notice that 6 genes seem to have the exact same expression pattern, plus the value of the stat is the exact same for all of them. Moreover, when I plot the counts for these genes, there is a gene that is only expressed in only 2 individuals, whereas all other individuals have 0 counts. This gene was not detected as significantly differentially expressed in my non filtered dataset.

I guess I am doing something wrong ?

ADD REPLY • link 4 hours ago caroline.zanchi • 0