Hi,
I am working with a dataset that has precisely the nested structure as exemplified in the section "Group-specific condition effects, individuals nested within groups" of the DESeq2 RNAseq tutorial: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#group-specific-condition-effects-individuals-nested-within-groups
The structure is the same, but I do have more groups, a variable number of individuals per group, and two conditions per invidividual. (I therefore removed all the columns from the design matrix with all zeros for coefficients that do not exist.)
Working through my data, I fit a model exactly as the one listed in the tutorial:
model.matrix(~ grp + grp:ind.n + grp:cnd, coldata)
I got the results for the contrasts that I am interested in via results(). When inspecting the results, and subsequently playing with some toy data, I got some doubts about the correctness of what I did.
My toy example:
Two groups (A, B), three individuals in each (1, 2, 3), two conditions per individual (x, y), two replicates per (group, individual, condition), values of condition y are ~2 higher than condition x:
df <- data.frame(
grp=factor(rep(c("A", "B"), each=12)),
cond=factor(rep(rep(c("x", "y"), each=2), 6)),
ind=factor(rep(1:3, each=4)),
value=rnorm(24, 10) +
c(rnorm(12, -5), rnorm(12, 5)) +
ifelse(df$cond == "x", 0, rnorm(24, 2, 0.5))
)
I then fit two linear models. Here the first one, plus the coefficients:
lm(value ~ grp + grp:ind + grp:cond, df)
Call:
lm(formula = value ~ grp + grp:ind + grp:cond, data = df)
Coefficients:
(Intercept) grpB grpA:ind2 grpB:ind2 grpA:ind3 grpB:ind3 grpA:condy grpB:condy
5.6597 9.6079 -0.1103 1.4499 0.3759 0.4865 2.1481 2.3591
And here the second model, excluding the grp:ind
term:
lm(value ~ grp + grp:cond, df)
Call:
lm(formula = value ~ grp + grp:cond, data = df)
Coefficients:
(Intercept) grpB grpA:condy grpB:condy
5.748 10.165 2.148 2.359
My questions to the above:
1) What is the meaning of the intercept in either of these models?
2) The condition-specific effects per group (grpA:condy, grpB:condy) are the same for both models. If it was taken into account that there are paired samples (for each individual) then this should not be the case. What am I missing?
Any help greatly appreciated!
The denominator of the contrast(s) you will compute is based on the within-group variabiliity (which is supposed to tell you something about the within-population variability), and is used to determine if the t-statistic you compute is unexpectedly large, given the null distribution. If you use the smaller model that ignores the pairing, when you compute the within-group variability you ignore the fact that some of the subjects are the same person, and are expected to be correlated (e.g., have lower variance within subject as compared to between subject). In one sense that's cheating because you may have smaller variance estimates, and more degrees of freedom (you lose one degree of freedom for each coefficient estimated). On the other hand, if there are subject-specific differences that you might want to account for, you may have larger within-group variances than what you would get if you accounted for the pairing. As an example, let's add some sample-specific changes in the expression values
And now if we fit the smaller model, we get
Where the extra variability really messes things up if we ignore the pairing. But if we include the pairing, there's no problem