I'm running DESeq to compare two genotypes responding to a treatment. I turn that was not able to see interactions, at least after apply Benj-Hoch for multiple testing correction. I'm thinking in remove genes that are too lowly expressed with the hope that will increase my power. Now my question is, where is the logical threshold..
I like to do something like
dds <- dds[ rowSums(counts(dds)) > x, ]
which will be a good number for "x", if any?
This will reduce the number of tests made and therefore improve the multiple testing correction of Benjamin-Hochberg. I hope!
hi Ale,
That subsetting step was used in old documentation, just to clean up the presentation of results. It has no consequence on the results, because these rows with zero counts for all samples, will have all NA's for log fold changes, p-values or adjusted p-values anyway (they are not included in the p-value adjustment step).
I'm not certain if you have read the Independent Filtering section of the vignette (Section 3.8 of the current release). I'll copy it here:
"The results function of the DESeq2 package performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic is found which optimizes the number of adjusted p values lower than a significance level alpha (we use the standard variable name for significance level, though it is unrelated to the dispersion parameter α). The theory behind independent filtering is discussed in greater detail in Section 4.6. The adjusted p values for the genes which do not pass the filter threshold are set to NA."
Here's the relevant section of the man page
?results
:"By default, independent filtering is performed to select a set of genes for multiple test correction which will optimize the number of adjusted p-values less than a given critical value alpha (by default 0.1). The adjusted p-values for the genes which do not pass the filter threshold are set to NA. By default, the mean of normalized counts is used to perform this filtering, though other statistics can be provided. Several arguments from the filtered_p function of genefilter are provided here to control or turn off the independent filtering behavior."
Here is the relevant section of the manuscript:
"Due to the large number of tests performed in the analysis of RNA-seq and other genome-wide experiments, the multiple testing problem needs to be addressed. A popular objective is control or estimation of the false discovery rate (FDR). Multiple testing adjustment tends to be associated with a loss of power, in the sense that the false discovery rate for a set of genes is often higher than the individual p-values of these genes. However, the loss can be reduced if genes are omitted from the testing that have little or no chance of being detected as differentially expressed, provided that the criterion for omission is independent of the test statistic under the null hypothesis [21] (see Methods). DESeq2 uses the average expression strength of each gene, across all samples, as its filter criterion, and it omits all genes with mean normalized counts below a filtering threshold from multiple testing adjustment. DESeq2 by default will choose a threshold that maximizes the number of genes found at a user-specified target FDR. In Figures 2A-B and 3, genes found in this way to be significant at an estimated FDR of 10% are depicted in red. "
The genes with small counts are automatically filtered by the results() function to not be included in the p-value adjustment. The filter threshold is chosen to optimize the number of DE genes for a given FDR threshold (this is alpha, an argument to results(), set by default to 0.1).
So, to answer your question:
"I'm thinking in remove genes that are too lowly expressed with the hope that will increase my power....can someone give me a reasonable number that I can use at this time to remove genes without jeopardize my analysis?"
Yes, a reasonable threshold (given the objective of maximizing the total number of rejections over the entire dataset) is automatically calculated by DESeq2's results() function, using a function from the genefilter Bioconductor package.