Hi all,
I am analyzing an rna-seq dataset (treated vs control) with three replicates in each. After performing differential expression with DESeq2, there are way too many differentially expressed genes ! Most of them are with low readcounts (<5). Here is MA plot, vertical line at readcount 5.
Here is my summary of results:
>summary(dds.24h.res, alpha = 0.01) out of 41405 with nonzero total read count adjusted p-value < 0.01 LFC > 0 (up) : 7163, 17% LFC < 0 (down) : 6146, 15% outliers [1] : 0, 0% low counts [2] : 11791, 28% (mean count < 1) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
Then I looked at metadata of results for threshold:
>metadata(dds.24h.res)$filterThreshold 48.74028% 0.7043008
Its too low! Then I tried turning of independentFiltering false, yet I have the same results (too many DEGs).
> summary(res.24.noFilt, alpha = 0.01) out of 41405 with nonzero total read count adjusted p-value < 0.01 LFC > 0 (up) : 6910, 17% LFC < 0 (down) : 6056, 15% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 0) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
I also manually filtered raw counts (rowmean > 5) before running DESeq, but still I have many DEGs. PCA results show nice clustering of samples and replicates so no problem there.
Is there any thing I am missing ?
Thanks.
-Anand.
Hi Anand and Michael,
To illustrate that I also have some troubles with few genes that have low read-counts but are DE with a high log2FC threshold. Here is my top DE genes for padj <0.01 and |log2FC| >2. See For example Gene G1 : lo2FC= - 22.01 BUT mean condition 1 (BC.mean)= 59.93 and 0 for condition2 (BV.mean).
So, First : How to deal with these "low read-counts" genes ?
Clearly here the filtering according to Log2FC is not relevant. Moreover It is not possible to validate these genes with QPCR because of low reads.
Second : These genes are DE or not according to the model used in DESeq2 : my design is two conditions (virus-infected versus control) two genotypes B and A. When I analysed the data according to the "Example 2" from DEseq2 vignette (which fit adequately to my design). I obtained these genes. Whereas when I analysed my design as 2 conditions only B Virus/ B control, these genes are not DE.
Any idea ? Comment ?
names
BC.m
BV.m
AC.m
AV.m
log2FC
padj
G1
59
0,00
1048,90
1263,14
-22,01
1,30E-15
G4
8,9
0,41
62,91
110,18
-4,35
2,87E-03
G5
4,40
0,00
110,33
102,44
-5,28
1,56E-03
G6
97,07
0,00
1893,44
1897,97
-9,61
5,94E-04
G7
24,81
2,58
302,67
147,67
-3,27
1,88E-03
G8
67,37
16,34
35,24
60,79
-2,05
8,63E-05
G9
63,68
0,00
1530,28
1982,28
-9,00
2,79E-04
G10
18,91
0,00
365,41
336,67
-7,26
4,10E-03
I'll reply to your post here: DESeq2 : DE genes witth high log2FC and few or null counts in some samples