Hi,
The default DESeq2::DESeq(betaPrior=TRUE) setting seems to prevent finding single highly significant genes (e.g. ones inserted into a random data set):
library(DESeq2)

ex <- makeExampleDESeqDataSet(n = 60000, m = 10)
design(ex) <- ~ condition

spikeIn <- matrix(as.integer(c(
      5,   0,   177,  69,  119, 57842, 27801, 105138, 97698, 126860,
     10,  11,     0,   0,    2,   806,   319,   1500,  1397,   1150,
    100, 110,    90,  99,   95,   806,   319,   1500,  1397,   1150,
      0,   1,     0,   1,    1,   806,   319,   1500,  1397,   1150)),
  byrow = TRUE, nrow = 4)
counts(ex)[1:nrow(spikeIn), ] <- spikeIn

ex <- DESeq(ex, parallel = TRUE)
res <- results(ex)
head(res, 4)
log2 fold change (MAP): condition B vs A
Wald test p-value: condition B vs A
DataFrame with 4 rows and 6 columns
        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1 39776.4660       1.283372 0.3460637  3.708485 2.085033e-04 3.347492e-01
gene2   496.9917       1.993635 0.3694553  5.396146 6.808749e-08 1.348200e-03
gene3   542.0637       2.819978 0.2780959 10.140308 3.659435e-24 1.086907e-19
gene4   495.0726       6.384450 0.3230821 19.761072 6.442145e-87 3.826827e-82
When setting betaPrior=FALSE, DESeq gives more reasonable p-values:
ex <- DESeq(ex, betaPrior = FALSE, parallel = TRUE)
res <- results(ex)
head(res, 4)
log2 fold change (MLE): condition B vs A
Wald test p-value: condition B vs A
DataFrame with 4 rows and 6 columns
        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1 39776.4660       10.13320 1.1232360  9.021438 1.856363e-19 2.695687e-15
gene2   496.9917        7.80994 0.9318235  8.381351 5.232344e-17 5.698546e-13
gene3   542.0637        3.38759 0.3342198 10.135813 3.831741e-24 8.346299e-20
gene4   495.0726       10.75123 0.9090859 11.826420 2.850400e-32 1.241748e-27
I had not expected such extreme behavior from this default parameter; it means you could easily lose interesting genes in a default analysis.
By the way, combining modelMatrixType = "standard" with parallel = TRUE throws an error (parallel = FALSE works):
ex <- DESeq(ex, modelMatrixType = "standard", parallel = TRUE)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates: 3 workers
mean-dispersion relationship
found already estimated fitted dispersions, removing these
final dispersion estimates, MLE betas: 3 workers
fitting model and testing: 3 workers
Error: BiocParallel errors
  element index: 1, 2, 3
  first error: betaPriorVar should have length 2 to match: (Intercept), conditionB
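For reference, a minimal sketch of the workaround mentioned above (the same call, but with parallelization disabled, which is what worked for me):

## same call as above, but with parallel = FALSE, which runs without the error
ex <- DESeq(ex, modelMatrixType = "standard", parallel = FALSE)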
We had observed the first and last row of the spikeIn matrix in a real data set, and it took some time to figure out what went wrong. To demonstrate that the issue lies with the single gene and not with the whole data set, I tested it in the context of a random data set.
In my experience, one encounters "artificial"-looking data sets more often than expected, and when running the default analysis one would not even notice such extreme profiles.
Right now, for a standard analysis, I would deactivate these default options, at the cost of more false positives (because otherwise interesting candidates get lost).
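As an illustration (my own sketch, not part of the original post), such a "defaults off" analysis might look like the following; exactly which options count as the defaults to deactivate is my assumption (the LFC prior in DESeq() plus the outlier and filtering defaults in results()):

## hypothetical sketch; the particular set of deactivated options is an assumption
ex  <- DESeq(ex, betaPrior = FALSE)            # no LFC shrinkage prior (MLE fold changes)
res <- results(ex,
               cooksCutoff = FALSE,            # no Cook's distance outlier filtering
               independentFiltering = FALSE)   # no independent filtering of low-count genes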
It's not those rows of counts that are the problem, but embedding them in a simulated dataset with ~60,000 genes where the true LFC = 0.
But certainly, the beta prior is best turned off for certain datasets, as we say in the FAQ.
How can it be explained that a quite high log2FoldChange and a low padj are still calculated for the last row of the spikeIn matrix (with betaPrior=TRUE)?
The dispersion estimate will be higher for the first gene than for the last one. The p-values are driven both by the LFC and by the estimates of within-group variance (controlled by the dispersion parameter).
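As a quick check of this explanation (a sketch using standard DESeq2 accessors; not part of the original exchange), one can compare the dispersion estimates of the spike-in genes directly:

## compare gene-wise and final dispersion estimates of the four spike-in genes;
## per the answer above, gene1 (highly variable within groups) should show a
## much larger dispersion than gene4
ex <- DESeq(ex)                                        # default settings (betaPrior = TRUE here)
data.frame(gene        = rownames(ex)[1:4],
           dispGeneEst = mcols(ex)$dispGeneEst[1:4],   # gene-wise estimates
           dispersion  = dispersions(ex)[1:4])         # final estimates used for testing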