Filtering after DESeq
0
0
Entering edit mode
@3f9f9566
Last seen 1 hour ago
Germany

We have performed RNAseq on a lot of samples of individual flies (~ 20 per condition). After running DESeq and then checking various contrasts (with ihw=TRUE), I find myself with quite a bit of genes that are detected as differentially expressed between 2 conditions. But when I plot their counts, I notice that for a lot of them, it comes from the fact that most individuals have 0 counts while 1 or 2 have a high counts, which ends up being responsible for the significance of the Wald test.

I have read that pre-filtering was recommended, something along the lines of

keep <- rowSums(counts(dds) >= x) >= y

However if you look at the example I am joining to this post of 1 gene : if I filtered out this gene based on the fact that it has more than 0 counts in only 10 individuals, I take the risk of filtering a gene which would have more than 0 counts in 10 individuals of one condition. This would represent half of the individuals of the given condition, which I would consider meaningful. There are a few of these genes in my dataset too.

What could I do ?

Thank you !

DESeq2 IHW • 529 views
ADD COMMENT
1
Entering edit mode

The problem is that your prefilter is not group-aware so the behaviour of seeing these sorts of outliers is expected. This is why I always recommend filterByExpr from edgeR (returning a vector of genes to keep which you can apply to your DESeqDataSet) as this does group-aware filtering, thereby ensuring that retained genes have sufficiently large counts to facilitate a groupwise comparison avoiding the sitatuon you see.

ADD REPLY
0
Entering edit mode

I am going to try this right now, thanks !

ADD REPLY
0
Entering edit mode

... I did it. It does filter out quite a bit of genes :

keep <- filterByExpr(dds, group="host_treatment_generation",min.count = 10, min.total.count = 15, large.n = 10, min.prop = 0.7)
dds_keep <- dds[keep,]

dds
class: DESeqDataSet 
dim: 18380 198 
metadata(1): version
assays(6): counts avgTxLength ... H cooks
rownames(18380): FBgn0000003 FBgn0000008 ... FBtr0473837_df_nrg
  FBtr0473838_df_nrg
rowData names(66): baseMean baseVar ... deviance maxCooks
colnames(198): 1 2 ... 1291 1425
colData names(9): library sample ... host_treatment_generation
  host-treat

dds_keep
class: DESeqDataSet 
dim: 13577 198 
metadata(1): version
assays(6): counts avgTxLength ... H cooks
rownames(13577): FBgn0000003 FBgn0000008 ... FBgn0288846 FBgn0288856
rowData names(66): baseMean baseVar ... deviance maxCooks
colnames(198): 1 2 ... 1291 1425
colData names(9): library sample ... host_treatment_generation
  host-treat

However, when I looked at one specific contrast of my filtered dataset,

resIHW1_tidy <- results(dds_keep, contrast=c("host_treatment_generation","A","B"), tidy=TRUE,filterFun = ihw,alpha=0.05, lfcThreshold=1, altHypothesis="greaterAbs")

there are more genes that are significantly differentially expressed compared to what was the case in my un-filtered dataset (16 genes vs. 10). By plotting counts for all treatment levels for these genes, I notice that 6 genes seem to have the exact same expression pattern, plus the value of the stat is the exact same for all of them. Moreover, when I plot the counts for these genes, there is a gene that is only expressed in only 2 individuals, whereas all other individuals have 0 counts. This gene was not detected as significantly differentially expressed in my non filtered dataset.

I guess I am doing something wrong ?

ADD REPLY

Login before adding your answer.

Traffic: 827 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6