Hi, I am curious about prefiltering with DESeq2. I understand from this site and from reading the DESeq2 vignette that prefiltering is really unnecessary, since DESeq2 does its own stringent filtering. However, I'm seeing better results with prefiltering (a much higher number of DEGs with a significant adjusted p-value after DESeq2).

I have a one-factor design with two groups of three samples each. One sample is a bit noisier than the others, but there don't seem to be any significant grounds to remove it. With prefiltering, I find the highest number of significant DEGs when I keep only genes with at least 50 normalized counts in each of the six samples.

My question is: is this wrong? Does such a high prefiltering cutoff affect my false positive rate somehow? Though I love that I get triple the number of significant DEGs, I don't want to go down the wrong path in my subsequent pathway analysis due to a prefiltering error.

Thank you for your time.

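My code:

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = countdata, colData = sampleData, design = ~ condition, tidy = FALSE)
# size factors are needed before normalized counts can be extracted
dds <- estimateSizeFactors(dds)
# keep genes with at least 50 normalized counts in all six samples
keep <- rowSums(counts(dds, normalized = TRUE) >= 50) >= 6
dds <- dds[keep, ]
dds <- DESeq(dds)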

Pre-filtering is fine, so you can go with your current settings and report these results. Probably what is happening is that you have a lot of high-dispersion, low-count features that are changing the dispersion trend. It's reasonable to filter in this case. In the user guide we should really make it clear that filtering is not necessary for reducing the multiple testing burden, because independent filtering already takes care of that.
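
To illustrate the multiple-testing point, here is a minimal base-R sketch (not DESeq2's actual implementation) of why dropping features that have little chance of being significant lowers the Benjamini-Hochberg burden. The p-values and the `low_count` flag are made up for illustration. As I understand it, `results()` applies this kind of filter automatically, using the mean normalized count as the filter statistic.

```r
# Toy p-values (hypothetical): 5 features with signal, 95 without
signal_p <- c(0.001, 0.004, 0.008, 0.012, 0.02)
null_p   <- seq(0.05, 0.99, length.out = 95)
pvalue   <- c(signal_p, null_p)

# Hypothetical flag marking low-count features; here 80 of the 95
# no-signal features are flagged, and none of the signal features
low_count <- rep(TRUE, 100)
low_count[1:5] <- FALSE
low_count[5 + seq(5, 75, by = 5)] <- FALSE

# BH adjustment over all 100 features vs. over the 20 that pass the filter
padj_all      <- p.adjust(pvalue, method = "BH")
padj_filtered <- p.adjust(pvalue[!low_count], method = "BH")

n_all      <- sum(padj_all < 0.05)       # 0 features pass
n_filtered <- sum(padj_filtered < 0.05)  # 2 features pass
```

The filter has to be independent of the test statistic under the null (mean count is the usual choice); choosing which features to drop by looking at their p-values would not be valid.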

Wonderful. Thank you so much for the speedy response! It's much appreciated.

Out of curiosity (since I'm quite a newbie to bioinformatics and DESeq2): in explaining my methods pipeline and DESeq2 results, is there a good way to visualize the possibility of "a lot of high dispersion low count features that are changing the dispersion trend", which would justify pre-filtering?

If I do NOT prefilter low counts, run DESeq, and then look at a dispersion plot and a bar plot of the ratio of small p-values against mean count, it seems normal, I think?

[Images: histogram; estimated dispersion plot]

plotDispEsts(dds, ylim = c(1e-6, 1e1))
# results table from the unfiltered run
res <- results(dds)
# bin genes by mean normalized count and compute the fraction of small p-values per bin
qs <- c(0, quantile(res$baseMean[res$baseMean > 0], 0:7/7))
bins <- cut(res$baseMean, qs)
levels(bins) <- paste0("~", round(.5 * qs[-1] + .5 * qs[-length(qs)]))
ratios <- tapply(res$pvalue, bins, function(p) mean(p < .01, na.rm = TRUE))
barplot(ratios, xlab = "mean normalized count", ylab = "ratio of small p values")

What suggests it is that, on the left side of the dispersion plot, you have many features with a mean of < 10 counts whose dispersion sits at the upper limit of the y-axis (the limit is due to the maximal ratio of SD to mean for positive-valued data).
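
The SD-to-mean limit mentioned above can be checked with a small base-R sketch. It is a simplification that ignores size factors and DESeq2's actual dispersion fitting; `cv` and `mom_disp` are helper names made up for this example. For n non-negative values, the sample SD divided by the mean is largest when all the counts sit in a single sample, and it then equals sqrt(n).

```r
# Coefficient of variation: sample SD divided by the mean
cv <- function(x) sd(x) / mean(x)

# Method-of-moments dispersion for a negative binomial,
# assuming var = mean + dispersion * mean^2
mom_disp <- function(x) (var(x) - mean(x)) / mean(x)^2

n <- 6
extreme <- c(120, rep(0, n - 1))  # all counts in one of six samples
typical <- c(8, 5, 7, 6, 9, 7)    # evenly spread low counts

cv_extreme   <- cv(extreme)        # equals sqrt(6), the maximum for n = 6
cv_typical   <- cv(typical)        # much smaller
disp_extreme <- mom_disp(extreme)  # n - 1/mean = 6 - 1/20 = 5.95
```

So with six samples, a crude moment estimate of dispersion cannot exceed 6 however noisy the feature is, which is one way to see why the low-count points pile up against a ceiling.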

Generally, I tend to remove features with single digit counts for all samples.
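
As a rough sketch of how that rule compares with the stricter filter from the question, here are both applied to a made-up count matrix in base R. The numbers are hypothetical. With a real DESeq2 object, `counts` would come from `counts(dds)` or `counts(dds, normalized = TRUE)` instead.

```r
# Toy count matrix (hypothetical values): 4 genes x 6 samples
counts <- matrix(
  c(  3,   5,   2,   7,   4,   6,   # single digits in every sample
      0,   1,  15,   2,   0,   3,   # one sample reaches double digits
     60,  72,  55,  80,  65,  58,   # at least 50 in all six samples
    120,  95,  40, 110, 130, 101),  # high, but one sample below 50
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6))
)

# Lenient rule: drop a gene only if every sample has a single-digit count
keep_lenient <- rowSums(counts >= 10) >= 1

# Strict rule from the question: require >= 50 in all six samples
keep_strict <- rowSums(counts >= 50) >= ncol(counts)

counts_lenient <- counts[keep_lenient, , drop = FALSE]  # gene2, gene3, gene4
counts_strict  <- counts[keep_strict, , drop = FALSE]   # gene3 only
```

Note that the strict rule also drops gene4, which is well expressed in five of the six samples, so it removes more than just low-count noise.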

Thank you very much for explaining that. It is much appreciated.


