Question regarding "low counts" in DESeq2
2
0
Entering edit mode
michael • 0
@michael-21908
Last seen 5.0 years ago

I am performing an analysis for diff. expression on insects. The insects are treated with two doses of an insecticide (let's call them "low" and "high").

The setting of the experiment is the following:

  • 5 x Control (biological replicates, several individuals pooled per sample)
  • 5 x exposed to low dosage (biological replicates, several individuals pooled per sample)
  • 5 x exposed to high dosage (biological replicates, several individuals pooled per sample)

I compare once control against "low" and once control against "high".

The analysis is based on a reference genome, since there is a high. qual. genome & annotation available. I mapped with HISAT2, estimated with stringtie and imported with tximport into DESeq2.

For the low concentration I see barely any effect (5 genes DE on padj < 0.05). For the high concentration I have 71 DE genes on padj < 0.05.

BUT, if I have a look on the summary of the two tests, I see that for the low concentration I have zero "low counts". For the high concentration I have 4079 "low counts" (see below). I do not really understand this. Why is this so extreme? Is there a problem with my analysis or is this common?

Summary for "low":

out of 11072 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 3, 0.027%
LFC < 0 (down)     : 2, 0.018%
outliers [1]       : 12, 0.11%
low counts [2]     : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Summary for "high":

out of 11147 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 39, 0.35%
LFC < 0 (down)     : 32, 0.29%
outliers [1]       : 11, 0.099%
low counts [2]     : 4079, 37%
(mean count < 150)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

EDIT: The two "Summaries" above are resulting from DESeq's summary-function. The low counts are referring to the ones filtered by independentFiltering (see https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilttheory).

deseq2 • 2.6k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

Check out the vignette explanation for “independent filtering” which explains how the low counts are identified. The procedure is greedy and heuristic but has a paper describing the idea (Bourgon 2010 PNAS).

Alternatively you can try the IHW method mentioned in the vignette.

ADD COMMENT
0
Entering edit mode

Thank you for the answer. I will have a more detailed look into the vignette and the paper. Already did that but obviously did not get it fully.

Please allow me a short question: Are you surprised that there are no genes filtered by “independent filtering” when there seems to be little to no effect with the "low dosage" but more than 4000genes with the "high dosage" where we see many DE genes?

ADD REPLY
0
Entering edit mode

No not surprised or concerned. The filtering depends on the distribution of small pvalues over the filter statistic. This is explained in the two papers (independent filtering and IHW).

ADD REPLY
0
Entering edit mode

Great, thanks a lot!

ADD REPLY
1
Entering edit mode
Kevin Blighe ★ 4.0k
@kevin
Last seen 25 days ago
Republic of Ireland

You should show all of your code so that we can better understand what may be happening. For example, how are you even generating these summary results?; what is your threshold for 'low counts'?; what is your design formula?

Can it be that a high concentration 'switches off' the expression of genes, which is why 4079 are regarded as low? - please have a think.

Check the expression of some of these 4079 genes via box-and-whisker plots with overlayed scatter plots so that you can get a better idea of their distribution across your sample groups. For example:

mat <- matrix(rexp(200, rate=.1), ncol=10)
rownames(mat) <- paste0('gene', 1:nrow(mat))
colnames(mat) <- paste0('sample', 1:ncol(mat))
mat <- t(mat)
mat <- data.frame(group = c(rep('A',3),rep('B',3),rep('C',4)), mat)
mat
         group      gene1     gene2     gene3      gene4     gene5     gene6
sample1      A  1.3613396  4.818162 30.038703  1.7708178 15.753435 15.132250
sample2      A  1.0373180  7.026840  2.602620 14.2208113 27.123415 17.862783
sample3      A  5.8754663 16.823887  7.520286  5.6289010  2.070628 97.318171
sample4      B  0.5934059  7.526411  3.333351  6.3049553  4.374298 28.860962
sample5      B  7.7361421  2.869511  8.409662  0.8628736 15.582699 20.612012
sample6      B  4.5341852 21.135215 11.396572  0.5477893  1.524821 31.184327
sample7      C  3.1312399  2.827958  2.371193 25.2858262  1.097504 18.409439
sample8      C  0.9892564 19.377189 11.262286 43.7460277  4.238364  7.337152
sample9      C  0.8872756  9.423403 10.656544 11.9628562  2.225797  6.791518
sample10     C 23.8933559  7.671681  8.913326  1.8168262 15.777244 11.085702


boxplot(gene4 ~ group, data = mat)

ddd

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 595 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6