Yes, this is possible, and AFAIK it is possible in DESeq2, too. It's easy to get tripped up by how R builds design matrices from symbolic formulas that contain interactions. When in doubt, expand the interactions explicitly with e.g. interaction().
library(MAST)
ng = 10
nc = 20
cd = expand.grid(A = factor(1:3), B = factor(1:5), rep = 1:nc)
cd$AB = with(cd, interaction(A, B))
mm = model.matrix(~AB, data = cd)
eta = mm %*% c(3,  # intercept: baseline mean for (A=1, B=1)
               -2, # (A=2, B=1) - (A=1, B=1), not tested
               0,  # (A=3, B=1) - (A=1, B=1)
               2,  # (A=1, B=2) - (A=1, B=1), tested
               rep(0, 11)) # rest null
U = matrix(eta, nrow = length(eta), ncol = ng)
U = U + rnorm(length(U))
PV = as.vector( exp(eta) / (1 + exp(eta)))
V = 1*(runif(length(U)) < PV)
dim(V) = dim(U)
Y = U*V
sca = FromMatrix(t(Y),cData = cd)
zz = zlm(~AB, sca = sca)
waldTest(zz, Hypothesis("AB2.5 - AB1.2"))
# Similar to (up to sampling error)
waldTest(zz, Hypothesis("-AB1.2"))
# The AB2.5 coefficient should be approx. zero
waldTest(zz, Hypothesis("AB2.5"))
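As a quick sanity check (my addition, assuming coef() on the ZlmFit returns the per-gene coefficient matrix for the chosen component, as in current MAST versions), the fitted continuous-part coefficients should roughly recover the values used to build eta, so the AB2.5 - AB1.2 contrast should land near 0 - 2 = -2:
cc = coef(zz, 'C')                      # continuous-part coefficients, genes x terms
summary(cc[, "AB2.5"] - cc[, "AB1.2"])  # should be close to -2, up to sampling error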
Really, this is more of a base-R question. Forsake not your prophets Venables and Ripley. Chapter 6 shall provide the Light, the Truth, and the Way.
Hello Andrew:
Thanks for replying. I also have a MAST-specific question. The paper describes how shrinkage is done, but it says little about why it was necessary in the first place. What kind of evidence did you have to justify applying shrinkage?
Regards, Nik
There are two forms of shrinkage happening: one for the logistic regression coefficients, and one for the linear regression $\sigma^2$. Both were developed after a fair amount of iterative exploration of data at the time, though little of this made it into the paper. In practice, the first one is probably more important, because it gives reasonable point estimates (and Wald confidence intervals) when there is linear separation, which is the rule rather than the exception.
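To see why, here is a minimal sketch of mine (not from the paper): a gene detected in every cell of one group and in no cell of another is completely separated, so an unpenalized logistic fit diverges, while a weakly informative prior, as in arm::bayesglm (the same flavour of shrinkage that zlm() applies by default via method = 'bayesglm'), keeps the estimate finite.
library(arm)  # for bayesglm(); illustration only, not MAST internals
group = factor(rep(c("A", "B"), each = 20))
detected = as.integer(group == "B")                   # gene detected only in group B
coef(glm(detected ~ group, family = binomial))        # complete separation: |coefficient| blows up
coef(bayesglm(detected ~ group, family = binomial))   # weakly informative prior: finite, shrunken estimate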
The shrinkage on $\sigma^2$ is often less important given the larger number of replicates in scRNAseq, but can still have an impact since the number of cells with expression $>0$ varies from gene to gene. If filtering for the amount of expression hasn't occurred, then sometimes you are trying to make inference with only a handful of cells.
If you are curious what the $\sigma^2$ shrinkage is doing to your data, you can get the "effective number of cells" and "prior precision" from the model object, and you can turn off shrinkage with
zlm(..., method = 'glm', ebayes = FALSE)
In the simulation I ran, you can see that the effective degrees of freedom is extremely high, reflecting the small number of genes and the exact homogeneity in the variance. The variance is overestimated because, by default, it is fit using just the intercept (the H0 model). See ?ebayes if you are curious for more details.
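Concretely, continuing the simulation above, something like the following sketch (mine, not from the original reply; the priorDOF / priorVar slot names are assumptions based on the ZlmFit class and may differ across MAST versions, so check ?ZlmFit and ?ebayes in your installation):
zz@priorDOF   # prior "effective number of cells" folded into each gene's variance estimate (assumed slot name)
zz@priorVar   # prior variance (reciprocal of the "prior precision") that sigma^2 is shrunk towards (assumed slot name)
zz_noshrink = zlm(~AB, sca = sca, method = 'glm', ebayes = FALSE)  # refit with shrinkage off, for comparison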
Hello Andrew:
You are probably familiar with this paper by Soneson and Robinson. Their overall conclusion is that the best single-cell-specific method was MAST, but that it did only as well as the best of the bulk RNA-seq methods. Do you think they got something wrong?
Hello Andrew:
What happens in contrast estimation if there is a quantitative covariate? E.g. I have the following toy data set:
After I run zlm() w/o shrinkage, the hurdle logFC is:
Given the formula from the reference:
the question is how to set the value of Cov. In general, it can/should be set to its mean, but the means in the continuous and discrete parts differ because the continuous part uses only 6 of the 16 observations. I'd say one should use two distinct means here, but I didn't manage to reproduce the logFC that way. I also tried a single mean computed from all the observations, and that didn't work either. When I use just A in the model I do reproduce the logFC, so apparently the discrepancy is due to Cov. Please help me clarify this.
What you wrote seems accurate. If not otherwise specified, e.g. via logFC(contrast0 = ...), the coefficient of your continuous covariate Cov will be set to zero, which in this case would mean that Cov is equal to zero. This may not be desirable, though it depends on what the covariate is. If you center a continuous covariate so that it has mean zero over all observations, then setting its coefficient to zero is equivalent to estimating the logFC with the covariate set to its mean over all observations (both zero and non-zero). There's no provision to center it with respect to both all and non-zero observations, since this would mean a different contrast0 and contrast1 for each gene.
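For your toy example, a hedged sketch of the centering route (my illustration; the column name Cov, the formula, and the use of your own sca object are assumptions, since the data set wasn't reproduced here):
colData(sca)$Cov = as.numeric(scale(colData(sca)$Cov, scale = FALSE))  # center Cov over ALL observations
fit = zlm(~ A + Cov, sca = sca, method = 'glm', ebayes = FALSE)
logFC(fit)  # with Cov centered, leaving its coefficient at zero holds Cov at its overall mean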
There is a definite difference between MAST and DESeq2 when designs with interactions are considered. In MAST, all of the regression coefficients in the logistic part are shrunken no matter what, whereas in DESeq2 the shrinkage of regression coefficients is disabled when interactions are present:
https://support.bioconductor.org/p/71975/#99550