I’m following the DESeq2 tutorial to perform DGE analysis. I noticed that before running
dds <- DESeq(dds)
it is recommended to remove genes whose counts are 0 in all samples. I have two questions about this step:
- Why only genes with 0 counts in all samples? What about genes whose counts add up to a total of 5 across all samples? Or 10? What would be a reasonable threshold? I’m sure that many people experienced in DGE analysis have a number they regard as a correct and safe threshold. I would like some advice on this question.
- I understand that DESeq2 performs independent filtering: it identifies a threshold based on counts and removes genes whose counts are too low to produce a reliable result. My question is: why bother performing the step above if these genes are going to be filtered out anyway?
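For reference, the pre-filtering step being asked about can be sketched in base R as follows. This is only an illustration on a made-up count matrix (`counts_mat` and its values are hypothetical); the thresholds shown are the ones mentioned in the question, not recommendations.

```r
# Hypothetical toy count matrix: 5 genes (rows) x 4 samples (columns)
counts_mat <- matrix(
  c( 0,  0,  0,  0,
     1,  0,  2,  1,
     3,  2,  4,  1,
    50, 60, 55, 70,
     0,  5,  0,  0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4))
)

# Total count per gene across all samples
total <- rowSums(counts_mat)

# Rule 1: drop only genes that are zero in every sample
keep_nonzero <- total > 0

# Rule 2: drop genes whose total count is below 10
keep_min10 <- total >= 10

filtered <- counts_mat[keep_min10, , drop = FALSE]

# With a DESeqDataSet, the same idea would be written as
#   dds <- dds[rowSums(counts(dds)) >= 10, ]
```

Either rule only shrinks the object before DESeq() is run; the choice between them is what the questions above are about.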
Interesting.
On our dataset of 15 samples, we were removing the rows in which the average count was 1 or less. We even thought about removing the rows in which the average count was 2 or less.
Are you saying it is better to not remove these lines at all? Even the lines where the counts are zero for all samples?
I didn't say it was better. It should make no difference. You could raise the threshold further and it would begin to increase sensitivity, up to a point at which you would be filtering too much (and that point will be different for each experiment).
The question was: what is a safe / reasonable threshold that will work for all experiments? Our recommendation (and the default in DESeq2) is to let the genefilter software optimize the threshold, such that sensitivity (statistical power) is maximized.
There is a separate reference for genefilter if you want to read about this. Also there is a new approach from Wolfgang's group: https://www.bioconductor.org/packages/IHW
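
To make the idea of an optimized threshold concrete, here is a rough base R sketch. It is a simplified illustration of the principle only, not the actual genefilter or DESeq2 procedure, and all data in it are simulated under made-up assumptions (`base_mean`, `pval`, and the way power depends on counts are hypothetical). It tries several mean-count cutoffs, applies Benjamini-Hochberg adjustment to the genes that pass each cutoff, and keeps the cutoff that yields the most rejections.

```r
set.seed(1)
n <- 2000

# Simulated mean count per gene (assumption: exponential, mean 50)
base_mean <- rexp(n, rate = 1 / 50)

# Assumption: 20% of genes are truly DE, but only genes with
# mean count above 10 have any power to show it
is_de   <- runif(n) < 0.2
powered <- base_mean > 10

# Null-like p-values everywhere, small p-values for powered DE genes
pval <- runif(n)
pval[is_de & powered] <- rbeta(sum(is_de & powered), 0.1, 5)

# Number of BH rejections when only genes with mean >= cutoff are tested
count_rejections <- function(cutoff, alpha = 0.1) {
  keep <- base_mean >= cutoff
  sum(p.adjust(pval[keep], method = "BH") < alpha)
}

# Candidate cutoffs: quantiles of the mean count (0% keeps every gene)
cutoffs    <- quantile(base_mean, probs = seq(0, 0.9, by = 0.1))
rejections <- sapply(cutoffs, count_rejections)

# Cutoff giving the largest number of rejections
best_cutoff <- cutoffs[which.max(rejections)]
```

Plotting `rejections` against `cutoffs` shows how the number of discoveries changes as the cutoff moves, which is the trade-off described above. In DESeq2 itself this is handled inside results(), and passing independentFiltering = FALSE turns it off.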