I am trying to get a better understanding of how limma and dreamlet handle random effects differently in a confounded pseudobulk scRNA-seq dataset where all of condition 1 are in one batch, and all of condition 2 in a separate batch, where batch is a random variable (rather than a fixed variable).
Where do these analyses diverge, leading to different results?
Load example
library(muscat)
library(SingleCellExperiment)
library(dreamlet)
data(example_sce)
subset_sce <- example_sce[, colData(example_sce)$cluster_id == "B cells"]
Add a "batch" variable. This results in a confounded dataset where all ctrl samples are in batch1 and all stim samples are in batch2.
colData(pb)$batch_id <- c("batch1","batch1","batch2","batch2")
> colData(pb)
DataFrame with 4 rows and 2 columns
group_id batch_id
<factor> <character>
ctrl101 ctrl batch1
ctrl107 ctrl batch1
stim101 stim batch2
stim107 stim batch2
Dreamlet
res.proc <- processAssays(pb, ~group_id + (1|batch_id))
res.dl <- dreamlet(res.proc, ~group_id + (1|batch_id))
> topTable(res.dl$`B cells`, coef = "group_idstim", number = 5)
logFC AveExpr t P.Value adj.P.Val B z.std
PLSCR1 4.076507 8.629915 12.328888 5.754163e-21 2.301665e-19 35.5107902 9.394406
SAMD9 2.860439 7.770536 9.914056 4.707924e-16 9.415848e-15 24.9772906 8.118806
BAZ1A 1.258757 7.999992 4.556386 1.682875e-05 2.243833e-04 2.0288219 4.303281
PPIB -1.082835 7.927141 -3.789278 2.739946e-04 2.739946e-03 -0.6243383 -3.638722
HNRNPDL -0.868700 8.676615 -3.220956 1.783518e-03 1.426815e-02 -2.3691292 -3.124097
Note that if I don't set batch_id as a random variable in the formula, then I get comparable results to limma (ISG15, ISG20 are top DE genes).
res.proc <- processAssays(pb, ~group_id)
res.dl <- dreamlet(res.proc, ~group_id)
> topTable(res.dl$`B cells`, coef = "group_idstim", number = 5)
logFC AveExpr t P.Value adj.P.Val B
ISG15 6.176658 10.230636 19.10827 1.263485e-14 1.070171e-11 23.21015
ISG20 3.600829 11.579442 16.01292 3.922087e-13 1.661004e-10 20.26556
LY6E 4.153726 9.124182 11.23303 2.892466e-10 8.166396e-08 13.71934
PLSCR1 4.076507 8.629915 10.86829 5.210069e-10 1.103232e-07 13.13822
IRF7 3.618931 9.249620 10.39864 1.135347e-09 1.713038e-07 12.38799
Limma
Unlike Dreamlet, these results are the same with or without setting block and correlation (as it is zero anyway and throws a warning).
design <- model.matrix(~ group_id, data=colData(pb))
v <- voom(pb@assays@data$`B cells`, design, plot=TRUE)
dupcor <- limma::duplicateCorrelation(object=v, design=design, block=colData(pb)$batch_id)
fit <- lmFit(v, design, block=colData(pb)$batch_id, correlation=dupcor$consensus.correlation)
fit <- eBayes(fit)
> topTable(fit, coef = "group_idstim", number = 5)
logFC AveExpr t P.Value adj.P.Val B
ISG20 3.5043779 11.579442 17.05821 3.832382e-13 4.855628e-10 20.33819
ISG15 6.1094164 10.230636 16.24125 9.359172e-13 5.929035e-10 18.93230
B2M 0.6715954 15.404921 11.00242 8.728791e-10 3.686460e-07 11.55374
IFIT3 9.0428629 7.531172 10.65543 1.494281e-09 4.733134e-07 10.63147
LY6E 4.2151969 9.124182 10.12225 3.500223e-09 7.675392e-07 11.27878
If the group and batch are perfectly confounded, then you're pretty much stuck.
But let's consider some other issues.
processAssays()
andvoom()
estimate the mean-variance trend in the data, but depend on the formula provided. Changing the formula will change the estimated precision weights and the results of the differential expression analysis.The major difference between how the
dream / dreamlet
andvoom()
handle random effects is thatdream / dreamlet
estimate the random effects separately for each gene, whilevoom()
estimates a single variance term shared across all genes. Thevoom()
approach works for small samples, butdream / dreamlet
can perform better statistically with sufficient sample size. See https://doi.org/10.1093/bioinformatics/btaa687]I do wonder myself what dreamlet() is doing then, since it seems to be doing something in a situation where I would have thought that estimating random effects was not possible.
It's duplicateCorrelation() rather than voom() that estimates random effects, and it estimates a shared correlation rather than a shared variance. Neither of the variance components are assumed equal across genes.
1) the dreamlet model should also fail in this case because the model is degenerate. Indeed, it fails following a convergence check with the latest version:
Can you show your package versions?
2) of course Gordon is right, it is shared _correlation_.
Thank you both for the clarification - I really appreciate it.
Here were the package versions that I was using: