Hi,
I am looking for a bit more information on what exactly the intercept-only model represents in DESeq2, ie, when you set the experimental design formula to ~1 (without predictors)?
Potential interpretations:
1) Would you expect that the baseMean output values would represent the simple average of normalized reads across the samples for each gene and that the accompanying p-value would test whether that mean value is significantly different from zero? (http://www.philender.com/courses/linearmodels/notes1/nopredict.html)
2) Or it is less simplistic than that and the baseMean value for each gene is actually the overall Negative Binomial mean? (A: Independent filtering with unequal group sizes?).
I understand that by setting the design formula to an intercept-only model using ~1 that we are “re-estimating the dispersions using only an intercept” (section 2.1.1 Deseq2 manual), however, I’m not sure how that would translate into what the baseMean value represents.
If I look at the normalized matrix output from the intercept-only model and calculate the mean across all the samples per mean, the average value for some genes matches the baseMean value from the intercept-only model output but others do not.
Model1_Interceptonly<-phyloseq_to_deseq2(data, ~1) counts<-counts(Model1_Interceptonly) geoMeans = apply(counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0])))) dds_inter<-estimateSizeFactors(Model1_Interceptonly, geoMeans=geoMeans) varstab _intercept = DESeq(dds_inter, fitType="local") res_inter = results(varstab _intercept) all_res_inter = res_inter[order(res_inter$padj, na.last=NA), ] head(all_res_inter)
log2 fold change (MAP): Intercept
Wald test p-value: Intercept
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
OTU_1 34581.43394 15.077710 0.1684288 89.51976 0.000000e+00 0.000000e+00
OTU_3 1215.52924 10.247369 0.2407054 42.57224 0.000000e+00 0.000000e+00
OTU_2 2966.83166 11.534707 0.3927574 29.36853 1.385981e-189 4.435140e-188
OTU_7 130.43222 7.027156 0.5309173 13.23588 5.445682e-40 1.306964e-38
OTU_281 8.93582 3.159600 0.2910647 10.85532 1.881520e-27 3.612518e-26
OTU_4 61.92271 5.952397 0.5612130 10.60631 2.785427e-26 4.456684e-25
If we extract the normalized values from this model and take the average across all the samples per gene:
norm_counts_intercept_only <-counts(dds_inter, normalized=TRUE) ##[DESEQ2] How to access the normalized data of a DESeqDataSet
Gene | Average across all samples |
OTU_1 | 34581.43394 |
OTU_3 | 1215.52924 |
OTU_2 | 17948.41632 |
OTU_7 | 241.9170603 |
OTU_281 | 11.64005756 |
OTU_4 | 145.9316035 |
Above we can see that OTU_1 and OUT_3 averages are exactly the same as the baseMean estimates, but the others are not.
Questions:
1) what exactly do the baseMean estimates represent in an intercept-only model and what do their p-values represent?
2) Furthermore, would it be correct to interpret the significantly deferentially expressed genes from an intercept-only model as all the genes DE across the whole dataset regardless of treatment?
3) Would it be correct to use the intercept-only model against a model with only one predictor in a Likelihood Ratio Test (section 3.5 of the DESeq2 manual)? Would it be correct to conclude that the DE genes between the two models are those DE due to the one treatment? I have seen that maybe the package author has counselled against this this post (https://stat.ethz.ch/pipermail/bioconductor/2014-May/059679.html), although maybe he was just saying it wasn’t the best reduced model when you have two predictors.
Any insights would be appreciated as I am interested in understanding what these DE genes represent.Thanks!
Thank you so much for the quick, clear and thorough answers!!