I'm having a hard time understanding how to pull out the relevant contrasts when using a model with covariates in limma
.
Let's say I have the following gene expression matrix and experimental variables:
library(limma) set.seed(16) mat <- matrix(rnorm(600), 100, 6) age <- c(45, 55, 47, 56, 43, 56) gender <- factor(c("M", "F", "F", "F", "M", "F")) condition <- factor(c(rep("healthy", 3), rep("disease", 3)), levels = c("healthy", "disease"))
If I were to use a very simple design (~ condition
) then I could easily identify genes differentially expressed in disease vs healthy both using a treatment-contrast parametrization (case A below) or using a group-means parametrization and manually defining the contrast (case B below):
# case A: simple design; treatment-contrast parametrization design_A <- model.matrix(~ condition) fit_A <- lmFit(mat, design_A) fit_A <- eBayes(fit_A) tt_A <- topTable(fit_A, coef = "conditiondisease") # case B: simple design; group-means parametrization design_B <- model.matrix(~ 0 + condition) fit_B <- lmFit(mat, design_B) cont.matrix_B <- makeContrasts(diseaseVShealthy = conditiondisease - conditionhealthy, levels = design_B) fit_B <- contrasts.fit(fit_B, cont.matrix_B) fit_B <- eBayes(fit_B) tt_B <- topTable(fit_B, coef = "diseaseVShealthy")
In fact, in this case the tt_A and tt_B tables contain the same results.
However, if I want to use a more complex design (~ age + gender + condition
), I can still do the treatment-contrast parametrization (case C below) but I can no longer do the group-means parametrization because I'm missing the conditionhealthy
column in the design matrix (case D below):
# case C: complex design; treatment-contrast parametrization
design_C <- model.matrix(~ age + gender + condition)
fit_C <- lmFit(mat, design_C)
fit_C <- eBayes(fit_C)
tt_C <- topTable(fit_C, coef = "conditiondisease")
# case D: simple design; group-means parametrization
design_D <- model.matrix(~ 0 + age + gender + condition)
fit_D <- lmFit(mat, design_D)
cont.matrix_D <- makeContrasts(diseaseVShealthy = conditiondisease - conditionhealthy, levels = design_D)
Error in eval(ej, envir = levelsenv) :
object 'conditionhealthy' not found
#fit_D <- contrasts.fit(fit_D, cont.matrix_D)
#fit_D <- eBayes(fit_D)
#tt_D <- topTable(fit_D, coef = "diseaseVShealthy")
I would like to use the group-means parametrization with the more complex design because I think it's much clearer what's going on.
How can I do that?
Thank you!
P.S.: Note that I also don't completely understand the expressions "treatment-contrasts" and "group-means" parametrization, but this is what they are called on the limma user guide (section 9.2).
Thanks Gordon!
Can you please explain why that's the case? I.e.: why only the first categorical variable in the formula gets both levels as coefficients?
Because every following term is interpreted as an additive effect. In Gordon's model above:
~0
.condition
hashealthy
anddisease
levels. There's no intercept, so the function creates a term forhealthy
, and another term fordisease
, representing the average log-expression in each group.age
is real valued, so this just gets added as an extra column.gender
(which is a social construct, I suppose you really mean sex) has male and female. With the default level ordering, i.e., alphanumeric, Female is treated as the reference. This means that the sum of the existing terms, with no other additions, describe the expected log-expression in female samples. For males, the function adds another term for the male vs. female log-fold change, which is presumed to be additive.Hence, you get terms for all levels in condition, but by the time you get to
gender
, you don't get a term for the first reference level. Sometimes you can trickmodel.matrix
into giving you a term, but this usually means that your matrix is not of full rank and that you need to remove the term in order to actually fit the linear model.