Question

DESeq: low p-values for genes with zero counts in all but one sample of the compared groups

0

Entering edit mode

António Miguel de Jesus Domingues ▴ 510

@antonio-miguel-de-jesus-domingues-5182

Last seen 12 months ago

Germany

Hi all,

once in a while I come upon an DESeq analysis which results in a large number of genes with low p.adjust values (~significant genes) but with a somewhat unsatisfying expression pattern: zero counts in nearly all samples. It is very likely I am missing a modelling parameter but I did search for solution and didn't find anything.

For example, in this re-analysis of a published dataset there are two distinct clouds with extreme fold-changes but at the low end of expression:

plotMA(res, ylim=c(min(res$log2FoldChange, na.rm = T), max(res$log2FoldChange, na.rm = T)))

These contain quite a few differentially expressed genes, and when I take the one with the lowest padj we can see that it has zero counts in all of the control samples (scrambled_KCl)

plotCounts(dds, gene=which.min(res$padj), intgroup="siRNA_stimulus")

gene_id <- rownames(res[which.min(res$padj), ])
samples <- subset(sample_info, siRNA_stimulus %in% c("scrambled_KCl", "NEAT1_KCl")) %>%
    dplyr::arrange(siRNA_stimulus) %>%
    .$sample_id

cnts[gene_id, samples] %>%
    knitr::kable()

	x
ERR841606	0
ERR841614	0
ERR841619	0
ERR841607	269
ERR841611	213
ERR841618	236

On the one had it makes sense that DESeq calls this gene as differentially expressed. However, if we take the 20 genes with the lowest 20 padj values we see the pattern repeating and something even more interesting:

res_20 <- res[order(res$padj), ] %>% head(20)

res_20 %>%
    knitr::kable()


cnts[rownames(res_20), samples]  %>%
    knitr::kable()

	baseMean	log2FoldChange	lfcSE	stat
ENSG00000064787	281.8787	21.46530	1.829856	11.730597
ENSG00000138136	301.6668	-24.53871	2.102424	-11.671629
ENSG00000261302	151.4447	-22.65529	2.098731	-10.794760
ENSG00000234913	181.8306	21.73508	2.021428	10.752341
ENSG00000276591	226.5125	21.68802	2.035174	10.656591
ENSG00000223843	221.6896	22.60131	2.181985	10.358141
ENSG00000235649	277.0373	21.39288	2.088070	10.245290
ENSG00000260036	170.4672	21.07013	2.068249	10.187428
ENSG00000261774	163.4805	21.30677	2.128112	10.012052
ENSG00000277304	232.6921	21.53662	2.159207	9.974319
ENSG00000268066	187.8949	21.43550	2.151105	9.964878
ENSG00000223431	208.2810	21.17024	2.135009	9.915763
ENSG00000062096	154.0686	21.51732	2.178597	9.876689
ENSG00000262248	128.2776	-23.73491	2.406768	-9.861733
ENSG00000259691	189.3708	20.87996	2.120870	9.844999
ENSG00000200997	311.3471	-22.31373	2.284490	-9.767488
ENSG00000213877	195.0722	21.81423	2.236702	9.752854
ENSG00000201821	110.8460	-24.38071	2.510122	-9.712957
ENSG00000276957	155.2442	-23.59726	2.452138	-9.623138
ENSG00000237260	162.8905	-24.11136	2.509733	-9.607141

	ERR841606	ERR841614	ERR841619	ERR841607	ERR841611	ERR841618
ENSG00000064787	0	0	0	269	213	236
ENSG00000138136	0	672	0	0	0	0
ENSG00000261302	159	0	0	0	0	0
ENSG00000234913	0	0	0	530	224	124
ENSG00000276591	0	0	0	162	540	169
ENSG00000223843	0	0	0	432	675	463
ENSG00000235649	0	0	0	618	116	0
ENSG00000260036	0	0	0	88	136	273
ENSG00000261774	0	0	0	214	76	289
ENSG00000277304	0	0	0	319	198	235
ENSG00000268066	0	0	0	294	340	100
ENSG00000223431	0	0	0	249	217	129
ENSG00000062096	0	0	0	172	288	259
ENSG00000262248	112	92	107	0	0	0
ENSG00000259691	0	0	0	249	131	101
ENSG00000200997	0	123	0	0	0	0
ENSG00000213877	0	0	0	143	318	404
ENSG00000201821	157	86	271	0	0	0
ENSG00000276957	0	0	267	0	0	0
ENSG00000237260	457	0	0	0	0	0

Several genes have a single sample with counts > 0 for samples in those groups being compared. In this case one would assume that DESeq would not attribute a low p-value two those because there is little evidence for difference expression in those two groups. Showing the counts for all groups, we can see that this gene happens to be highly expressed in the samples of other groups:

samples_ordered <- sample_info %>%
    dplyr::arrange(siRNA_stimulus)

cnts["ENSG00000237260", samples_ordered$sample_id]

plotCounts(dds, gene="ENSG00000237260", intgroup="siRNA_stimulus")

My question, or doubt, is it expected that this gene would have a low p-value? If seems that while it sort of this, the model might consider that there is enough evidence in the the sample set as whole to attribute a low p-value, but IMO as a biologist, it doesn't pass the "eye test".

As a follow question, is there any way to model out these genes? In principle one could do pre-filtering for genes not expressed in less than x% samples of any condition, but for practical reasons I would rather not (merging results downstream), and I was under the assumption that DESeq would handle these cases elegantly.

Cheers!

DESeq2 DifferentialExpression • 1.0k views

ADD COMMENT • link 21 months ago António Miguel de Jesus Domingues ▴ 510

score 0 · Answer 1 · 2023-05-08

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 14 hours ago

United States

What practical reason do you have for not wanting to filter these genes? It is very common to filter genes that have zero counts in most samples because you can get spurious results like this.

ADD COMMENT • link 21 months ago James W. MacDonald 67k

0

Entering edit mode

Quite often there is a need to merge datasets, for example compare log2 fold changes or intersect differential expressed genes between two or more separate analysis. Having different subsets of genes for each of those analysis, whilst not an insurmountable problem, is a bit of pain when merging missing values. This is avoidable the results tables have consistently the same genes*.

But regardless of these minor considerations, the important issue is that there is a general assumption (recommendation) that filtering genes is not necessary in DESeq2 because the model handles low counts etc very well (which it does!). However example above shows that there are caveats with this blanket recommendation.

I can also set the values and FC to NA in the results tables for these genes.

ADD REPLY • link 21 months ago António Miguel de Jesus Domingues ▴ 510