Hi all,
once in a while I come upon an DESeq analysis which results in a large number of genes with low p.adjust values (~significant genes) but with a somewhat unsatisfying expression pattern: zero counts in nearly all samples. It is very likely I am missing a modelling parameter but I did search for solution and didn't find anything.
For example, in this re-analysis of a published dataset there are two distinct clouds with extreme fold-changes but at the low end of expression:
plotMA(res, ylim=c(min(res$log2FoldChange, na.rm = T), max(res$log2FoldChange, na.rm = T)))
These contain quite a few differentially expressed genes, and when I take the one with the lowest padj
we can see that it has zero counts in all of the control samples (scrambled_KCl)
plotCounts(dds, gene=which.min(res$padj), intgroup="siRNA_stimulus")
gene_id <- rownames(res[which.min(res$padj), ])
samples <- subset(sample_info, siRNA_stimulus %in% c("scrambled_KCl", "NEAT1_KCl")) %>%
dplyr::arrange(siRNA_stimulus) %>%
.$sample_id
cnts[gene_id, samples] %>%
knitr::kable()
x | |
---|---|
ERR841606 | 0 |
ERR841614 | 0 |
ERR841619 | 0 |
ERR841607 | 269 |
ERR841611 | 213 |
ERR841618 | 236 |
On the one had it makes sense that DESeq calls this gene as differentially expressed. However, if we take the 20 genes with the lowest 20 padj values we see the pattern repeating and something even more interesting:
res_20 <- res[order(res$padj), ] %>% head(20)
res_20 %>%
knitr::kable()
cnts[rownames(res_20), samples] %>%
knitr::kable()
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
ENSG00000064787 | 281.8787 | 21.46530 | 1.829856 | 11.730597 | 0 | 0 |
ENSG00000138136 | 301.6668 | -24.53871 | 2.102424 | -11.671629 | 0 | 0 |
ENSG00000261302 | 151.4447 | -22.65529 | 2.098731 | -10.794760 | 0 | 0 |
ENSG00000234913 | 181.8306 | 21.73508 | 2.021428 | 10.752341 | 0 | 0 |
ENSG00000276591 | 226.5125 | 21.68802 | 2.035174 | 10.656591 | 0 | 0 |
ENSG00000223843 | 221.6896 | 22.60131 | 2.181985 | 10.358141 | 0 | 0 |
ENSG00000235649 | 277.0373 | 21.39288 | 2.088070 | 10.245290 | 0 | 0 |
ENSG00000260036 | 170.4672 | 21.07013 | 2.068249 | 10.187428 | 0 | 0 |
ENSG00000261774 | 163.4805 | 21.30677 | 2.128112 | 10.012052 | 0 | 0 |
ENSG00000277304 | 232.6921 | 21.53662 | 2.159207 | 9.974319 | 0 | 0 |
ENSG00000268066 | 187.8949 | 21.43550 | 2.151105 | 9.964878 | 0 | 0 |
ENSG00000223431 | 208.2810 | 21.17024 | 2.135009 | 9.915763 | 0 | 0 |
ENSG00000062096 | 154.0686 | 21.51732 | 2.178597 | 9.876689 | 0 | 0 |
ENSG00000262248 | 128.2776 | -23.73491 | 2.406768 | -9.861733 | 0 | 0 |
ENSG00000259691 | 189.3708 | 20.87996 | 2.120870 | 9.844999 | 0 | 0 |
ENSG00000200997 | 311.3471 | -22.31373 | 2.284490 | -9.767488 | 0 | 0 |
ENSG00000213877 | 195.0722 | 21.81423 | 2.236702 | 9.752854 | 0 | 0 |
ENSG00000201821 | 110.8460 | -24.38071 | 2.510122 | -9.712957 | 0 | 0 |
ENSG00000276957 | 155.2442 | -23.59726 | 2.452138 | -9.623138 | 0 | 0 |
ENSG00000237260 | 162.8905 | -24.11136 | 2.509733 | -9.607141 | 0 | 0 |
ERR841606 | ERR841614 | ERR841619 | ERR841607 | ERR841611 | ERR841618 | |
---|---|---|---|---|---|---|
ENSG00000064787 | 0 | 0 | 0 | 269 | 213 | 236 |
ENSG00000138136 | 0 | 672 | 0 | 0 | 0 | 0 |
ENSG00000261302 | 159 | 0 | 0 | 0 | 0 | 0 |
ENSG00000234913 | 0 | 0 | 0 | 530 | 224 | 124 |
ENSG00000276591 | 0 | 0 | 0 | 162 | 540 | 169 |
ENSG00000223843 | 0 | 0 | 0 | 432 | 675 | 463 |
ENSG00000235649 | 0 | 0 | 0 | 618 | 116 | 0 |
ENSG00000260036 | 0 | 0 | 0 | 88 | 136 | 273 |
ENSG00000261774 | 0 | 0 | 0 | 214 | 76 | 289 |
ENSG00000277304 | 0 | 0 | 0 | 319 | 198 | 235 |
ENSG00000268066 | 0 | 0 | 0 | 294 | 340 | 100 |
ENSG00000223431 | 0 | 0 | 0 | 249 | 217 | 129 |
ENSG00000062096 | 0 | 0 | 0 | 172 | 288 | 259 |
ENSG00000262248 | 112 | 92 | 107 | 0 | 0 | 0 |
ENSG00000259691 | 0 | 0 | 0 | 249 | 131 | 101 |
ENSG00000200997 | 0 | 123 | 0 | 0 | 0 | 0 |
ENSG00000213877 | 0 | 0 | 0 | 143 | 318 | 404 |
ENSG00000201821 | 157 | 86 | 271 | 0 | 0 | 0 |
ENSG00000276957 | 0 | 0 | 267 | 0 | 0 | 0 |
ENSG00000237260 | 457 | 0 | 0 | 0 | 0 | 0 |
Several genes have a single sample with counts > 0 for samples in those groups being compared. In this case one would assume that DESeq would not attribute a low p-value two those because there is little evidence for difference expression in those two groups. Showing the counts for all groups, we can see that this gene happens to be highly expressed in the samples of other groups:
samples_ordered <- sample_info %>%
dplyr::arrange(siRNA_stimulus)
cnts["ENSG00000237260", samples_ordered$sample_id]
plotCounts(dds, gene="ENSG00000237260", intgroup="siRNA_stimulus")
My question, or doubt, is it expected that this gene would have a low p-value? If seems that while it sort of this, the model might consider that there is enough evidence in the the sample set as whole to attribute a low p-value, but IMO as a biologist, it doesn't pass the "eye test".
As a follow question, is there any way to model out these genes? In principle one could do pre-filtering for genes not expressed in less than x% samples of any condition, but for practical reasons I would rather not (merging results downstream), and I was under the assumption that DESeq would handle these cases elegantly.
Cheers!
Quite often there is a need to merge datasets, for example compare log2 fold changes or intersect differential expressed genes between two or more separate analysis. Having different subsets of genes for each of those analysis, whilst not an insurmountable problem, is a bit of pain when merging missing values. This is avoidable the results tables have consistently the same genes*.
But regardless of these minor considerations, the important issue is that there is a general assumption (recommendation) that filtering genes is not necessary in DESeq2 because the model handles low counts etc very well (which it does!). However example above shows that there are caveats with this blanket recommendation.