Hello everyone,
I am currently working on RNA-Seq data using DESeq2. As described in the manual, you can perform pre-filtering, e.g.:
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
However, it also says: "While it is not necessary to pre-filter low count genes before running the DESeq2 functions...". So, from what I gather, using this threshold (10) or just removing genes with zero counts should yield the same result.
I tried both criteria and, although the summary output is the same, I get different p-values (both unadjusted and BH-adjusted).
dm <- DESeqDataSetFromMatrix(countData = tab, colData = design, design = ~ group)
dm <- dm[rowSums(counts(dm)) > 0, ]
dm <- DESeq(dm)
ashr_zero <- lfcShrink(dm, contrast = c("group", "trt", "untrt"), type = "ashr")
dm <- DESeqDataSetFromMatrix(countData = tab, colData = design, design = ~ group)
dm <- dm[rowSums(counts(dm)) > 10, ]
dm <- DESeq(dm)
ashr_ten <- lfcShrink(dm, contrast = c("group", "trt", "untrt"), type = "ashr")
ashr_zero <- ashr_zero[rownames(ashr_zero) %in% rownames(ashr_ten), ]
all(rownames(ashr_zero) == rownames(ashr_ten))  # to check if I'm comparing the same genes
[1] TRUE
check1 <- vector()
for (i in 1:ncol(ashr_zero)) {
  check1[i] <- all(ashr_zero[, i] == ashr_ten[, i], na.rm = TRUE)
}
check1
[1] TRUE FALSE FALSE FALSE FALSE
Looking at the summary, the independent filtering criterion is the same and the number of genes is different (which is expected, since the threshold of 10 removes more genes than the threshold of zero), but I really don't understand what is causing the difference in p-values.
Thanks!
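It doesn’t really matter, except that once you pick a filtering rule you should note it down and stick with it for computational reproducibility. Small differences are expected: DESeq2 shares information across genes when estimating dispersions, so changing the set of genes changes the fit slightly and, with it, the p-values.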
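To judge how large the differences actually are, it can help to look at their magnitude instead of testing exact equality with `==`. Below is a minimal base R sketch of that idea. The two data frames are made-up toy stand-ins for `ashr_zero` and `ashr_ten` (the values are not from any real analysis), and `column_diffs` is a hypothetical helper, not a DESeq2 function.

```r
# Toy stand-ins for two lfcShrink() result tables (values are made up).
res_zero <- data.frame(
  baseMean       = c(120.5, 33.2, 8.1),
  log2FoldChange = c(1.20, -0.45, 0.02),
  pvalue         = c(1.0e-6, 0.031, NA),
  row.names      = c("geneA", "geneB", "geneC")
)
res_ten <- data.frame(
  baseMean       = c(120.5, 33.2, 8.1),
  log2FoldChange = c(1.19, -0.44, 0.02),
  pvalue         = c(1.2e-6, 0.033, NA),
  row.names      = c("geneA", "geneB", "geneC")
)

# Hypothetical helper: largest absolute difference per column, ignoring NAs.
column_diffs <- function(a, b) {
  # Both tables must describe the same genes and columns, in the same order.
  stopifnot(identical(rownames(a), rownames(b)),
            identical(colnames(a), colnames(b)))
  vapply(colnames(a), function(col) {
    d <- abs(a[[col]] - b[[col]])
    if (all(is.na(d))) NA_real_ else max(d, na.rm = TRUE)
  }, numeric(1))
}

diffs <- column_diffs(res_zero, res_ten)
print(diffs)

# Rank agreement of the p-values across the two filtering rules.
rho <- cor(res_zero$pvalue, res_ten$pvalue,
           method = "spearman", use = "complete.obs")
print(rho)
```

With real results, something like `column_diffs(as.data.frame(ashr_zero), as.data.frame(ashr_ten))` should work after subsetting to the shared genes as in the question. A `baseMean` difference of zero alongside small non-zero differences in the other columns would be consistent with the `TRUE FALSE FALSE FALSE FALSE` output above.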