DESeq2: Clarification on independent filtering threshold values
1
0
Entering edit mode
nattzy94 ▴ 20
@nattzy94-23466
Last seen 2.2 years ago
Singapore

I am doing differential expression using DESeq2. I have 2 conditions (Veh0h, treatment56h) and 5 samples each. After generating the gene count matrix from StringTie output files and loading into R as countData, I used the following code to do the differential expression analysis:

countData <- as.matrix(read.csv("/Users/Desktop/gene_count_matrix.csv", row.names="gene_id"))
colData <- data.frame(condition = factor(rep(c("A", "B"), each = 5)))
rownames(colData) <- colnames(countData)
dds <- DESeqDataSetFromMatrix(countData = countData, 
                              colData = colData, design = ~ condition)
dds <- DESeq(dds)
res <- results(dds, addMLE = TRUE)

I wanted to find out more about independent filtering and used the command metadata(res)$filterThreshold. Which gave me the following numbers:

66.02757% 
6.589014

I would like to clarify on what these numbers mean is that DESeq2 has filtered out genes with a mean count of 6.589? And as a result, 66.02% of genes were filtered out of the analysis.

Is this a correct interpretation?

deseq2 • 6.1k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
3
Entering edit mode
@mikelove
Last seen 13 hours ago
United States

Yes but note that the percentile includes genes with all 0s.

ADD COMMENT
0
Entering edit mode

Thanks for the reply. I have another question on independent filtering.

I recently read a paper, where DGE analysis was done with independent filtering set to FALSE. In the supplemental methods, they write:

Differential expression analysis was performed using DESeq2 (5). Independent filtering was not used in this analysis. Pairwise comparisons were performed using a Wald test. To control for false positives due to multiple comparisons in the genome-wide differential expression analysis, we used the false discovery rate (FDR) that was computed using the Benjamini-Hochberg procedure.

Are there any instances in which it might be appropriate to turn off independent filtering? The only reason I can think of is if independent filtering only adds on a small number of significant genes.

ADD REPLY
0
Entering edit mode

Independent filtering should increase power. The FDR is controlled without independent filtering though. See the Bourgon and Ignatiadis papers for more details.

ADD REPLY
0
Entering edit mode

Hi Michael,

I have a follow up question. I plotted the number of rejections against filter quantiles and saw that at the quantile selected, the number of rejections is about 6000 genes. The plot can be seen here: https://ibb.co/F6wV05P

I then looked at the number of genes with valid padj values by doing na.omit(results$padj) and got 20,567 genes. I did the same for the results with independent filtering turned off. Doing na.omit(results_noFilter$padj, I get 34,854 genes. It seems that the filtering has rejected 14,000+ genes.

Is there a discrepancy here or am I missing a detail?

ADD REPLY
0
Entering edit mode

What is surprising/concerning here?

ADD REPLY
0
Entering edit mode

I thought that the difference between na.omit(results$padj) and na.omit(results_noFilter$padj should be the number of rejected genes. In my case 6,000 genes were rejected but the difference between between results$padj and results_noFilt$padj is 14,000.

Why is there an additional 8,000 genes which do not have padj?

ADD REPLY
0
Entering edit mode

The y-axis in the filter plot is the number of rejected null hypotheses (e.g. significant genes).

From the vignette:

For example, we can visualize the optimization by plotting the filterNumRej attribute of the results object. The results function maximizes the number of rejections (adjusted p value less than a significance level), over the quantiles of a filter statistic (the mean of normalized counts). The threshold chosen (vertical line) is the lowest quantile of the filter for which the number of rejections is within 1 residual standard deviation to the peak of a curve fit to the number of rejections over the filter quantiles...

We want the most rejections of null hypotheses (adj p value < alpha).

ADD REPLY
0
Entering edit mode

Okay, I think I get it. NumRej refers to the number of genes with adj p value < alpha. For some reason I kept thinking it was the number of genes with baseMean < filterThreshold baseMean.

Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 743 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6