Dear Community,
I’m very new to the limma statistical methodology, and especially the construction and understanding of design matrix. In detail, I tried to interpret with some 'fake' data in R the example of multifactorial design in the example of 9.7 section page 48. Thus, I created a matrix below, and the 2 conditions like the example in the tutorial:
> set.seed(81) > mat <- matrix(data=sample(1:100), nrow=10, ncol=10) > head(mat) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 16 40 20 84 13 90 72 82 52 14 [2,] 91 9 29 47 80 21 34 15 87 74 [3,] 68 18 67 46 17 94 63 93 92 70 [4,] 1 48 83 77 33 2 81 97 8 71 [5,] 78 36 86 6 58 73 49 85 61 38 [6,] 98 89 56 79 88 23 44 69 31 64 > pdat <- data.frame(Tissue=rep(c("A", "B"), times=5), Condition=rep(c("Diseased","Normal"),each=5)) > pdat Tissue Condition 1 A Diseased 2 B Diseased 3 A Diseased 4 B Diseased 5 A Diseased 6 B Normal 7 A Normal 8 B Normal 9 A Normal 10 B Normal new.set <- new("ExpressionSet", exprs=mat) > head(exprs(new.set)) 1 2 3 4 5 6 7 8 9 10 1 16 40 20 84 13 90 72 82 52 14 2 91 9 29 47 80 21 34 15 87 74 3 68 18 67 46 17 94 63 93 92 70 4 1 48 83 77 33 2 81 97 8 71 5 78 36 86 6 58 73 49 85 61 38 6 98 89 56 79 88 23 44 69 31 64 > pData(new.set) <-pdat > head(pData(new.set)) Tissue Condition 1 A Diseased 2 B Diseased 3 A Diseased 4 B Diseased 5 A Diseased 6 B Normal > pairs <- factor(rep(1:5, each=2)) > Treat <- factor(paste(new.set$Condition, new.set$Tissue, sep=".")) > Treat [1] Diseased.A Diseased.B Diseased.A Diseased.B Diseased.A Normal.B Normal.A Normal.B [9] Normal.A Normal.B Levels: Diseased.A Diseased.B Normal.A Normal.B > design <- model.matrix(~0 +Treat) > colnames(design) [1] "TreatDiseased.A" "TreatDiseased.B" "TreatNormal.A" "TreatNormal.B" > design TreatDiseased.A TreatDiseased.B TreatNormal.A TreatNormal.B 1 1 0 0 0 2 0 1 0 0 3 1 0 0 0 4 0 1 0 0 5 1 0 0 0 6 0 0 0 1 7 0 0 1 0 8 0 0 0 1 9 0 0 1 0 10 0 0 0 1 attr(,"assign") [1] 1 1 1 1 attr(,"contrasts") attr(,"contrasts")$Treat [1] "contr.treatment" > colnames(design) <- levels(Treat) > dupcor <- duplicateCorrelation(new.set, design, block=pairs) > fit <- lmFit(new.set, design, block=pairs, correlation=dupcor$consensus) > cm <- makeContrasts(Disease.A=Diseased.A-Normal.A , levels=design) > fit2 <- contrasts.fit(fit, cm) > fit3 <- eBayes(fit2, trend=TRUE) > top2 <- topTable(fit3, coef=1, number=nrow(fit3), adjust.method="fdr", sort.by="none") > head(top2) logFC AveExpr t P.Value adj.P.Val B 1 -49.677974 48.3 -1.97576747 0.07300290 0.3650145 -4.562943 2 8.592188 48.7 0.30626035 0.76495236 0.8815297 -4.608752 3 -28.543152 62.8 -1.44092558 0.17662419 0.5887473 -4.581712 4 2.134082 50.1 0.06339996 0.95055105 0.9505510 -4.610196 5 17.663417 57.0 0.79132671 0.44497885 0.7111324 -4.600623 6 47.382520 64.1 2.68642944 0.02065864 0.2065864 -4.538251
So, my main questions are the following:
- If my samples are paired(like the tutorial in the limma users guide), I would proceed as the example using duplicate correlation. Thus, what is the main difference, from including the pairs into the design matrix ??
- If I would like to move by including the pairs into the design matrix(maybe not correct but to correctly understand the approach):
# alternative methodology
> f <- relevel(Treat, ref="Normal.A") > design1 <- model.matrix(~0 + pairs + f) > design1 pairs1 pairs2 pairs3 pairs4 pairs5 fDiseased.A fDiseased.B fNormal.B 1 1 0 0 0 0 1 0 0 2 1 0 0 0 0 0 1 0 3 0 1 0 0 0 1 0 0 4 0 1 0 0 0 0 1 0 5 0 0 1 0 0 1 0 0 6 0 0 1 0 0 0 0 1 7 0 0 0 1 0 0 0 0 8 0 0 0 1 0 0 0 1 9 0 0 0 0 1 0 0 0 10 0 0 0 0 1 0 0 1 attr(,"assign") [1] 1 1 1 1 1 2 2 2 attr(,"contrasts") attr(,"contrasts")$pairs [1] "contr.treatment" attr(,"contrasts")$f [1] "contr.treatment" > fit <- lmFit(new.set, design1) > fit2 <- eBayes(fit, trend=TRUE)
....
So, the first 5 coefficients named pairs are the blocking factors for each “object”—but what the coefficients fDiseasedA, fDiseasedB represent exactly?
I mean, for instance, fDiseasedA represents the average log2 fold change between groups Diseased A and Normal A ?
Moreover, why the level "Normal.B" is missing from the design1?
Or it depends on the first level of the f factor ? when I leave it as default, has:
> Treat [1] Diseased.A Diseased.B Diseased.A Diseased.B Diseased.A Normal.B Normal.A Normal.B [9] Normal.A Normal.B Levels: Diseased.A Diseased.B Normal.A Normal.B
Or I have to somehow relevel my Treat factor, in order to have the samed resulted log2 fold change for Diseased.A –Normal.A, as in the first defined difference in cm?
My final question is, if I want to have the same returned log2 fold change of Diseased.A-Normal.A with duplicate correlation & contrasts.fit from the first design matrix, how I should proceed with the second approach—that is when I include the pairs into the model.matrix(second design matrix)?
I can understand that the statistics would change, but the log2 fold change of the same compared groups would be the same, right?
Any explanations or feedback on these questions would be very helpful, as I'm a beginner in analyzing microarray data with limma, and I would like firstly to understand important aspects of formulating various comparisons.
Thank you,
Constantinos
qr(design)$rank!=ncol(design)
. If this is true, then you need to drop some coefficients. Which ones you drop will affect the interpretation of the coefficients, so there is no one-size-fits-all solution. As for the warning, make sure you eliminate the linear dependencies in the design before you runlmFit
.Thank you again for your quick reply Aaron. One thing to insist for your answer-2, because from my opinion as a beginner seems the most “challenging” to interpret correctly. Why in my warning returns only for the coefficient fDiseased.B not estimable, and not also for the others, like the specific pairs ? Moreover, regarding which coefficient I should drop, is highly affected by which coefficients I’m interested in from my experimental design? Could for instance drop more of these inestimable coefficients, and keep for instance one of specific interest ? Or usually, I might firstly look of coefficients that are not “generally estimable” or don’t represent any interesting effect? Like the Normal.B ?
Last update: something from my initial question 2: why in the updated design matrix the reference level “Normal.A” is not appeared in any of the columns(even before dropping any coefficients) ??
Don't worry about the specifics of the warning. All you should take out of it is that you have inestimable coefficients. If you want to "automate" the dropping of coefficients, you could do something like:
... but that sometimes drops terms that makes the interpretation of the remaining coefficients a bit awkward. The best way is to look at the design matrix, figure out what each coefficient means, and decide which ones to drop in order to obtain full column rank (i.e., so that the number of columns is equal to
QR$rank
). I can't think of any general rule of thumb to make this easier, because it depends so much on the design involved, the DE comparisons you're interested in, etc. Simply put, there is no substitute for understanding exactly what you're doing with your design.Thank you again for your explanation !! But regarding my updated question ??
(Last update: something from my initial question 2: why in the updated design matrix the reference level “Normal.A” is not appeared in any of the columns(even before dropping any coefficients) )