Question

Limma Design Matrix for Paired Samples with Confouding Factors

0

Entering edit mode

yuabrahamliu • 0

@yuabrahamliu-17670

Last seen 4.1 years ago

Hi all,

I have a data set with both paired samples and confounding factors. I want to use limma to find the differential expressed genes, but I don't know how can I design the design matrix. Could anyone give some suggestions? The data looks like this,

Sample_Name	Condition	Pair_Name	Source	Days
Sample1	Treated	1	Tissue	25
Sample2	Control	1	Tissue	30
Sample3	Treated	2	Blood	22
Sample4	Control	2	Tissue	31
Sample5	Treated	3	Tissue	32
Sample6	Control	3	Blood	30

The variable Condition, namely Treated or Control, is the main effect need to be explored. Pair_Name reflects the pairing of the samples, namely the samples in pair 1 from the same donator, samples in pair 2 from the 2nd donator, pair 3 from the 3rd. Soure is the source of the sample, some from the tissue of interest, while some from blood, which is a discrete variable, and also a confounding factor. The last variable Days is a continuous variable and another confounding factor.

What I want to do is to select the significantly differential expressed genes between the 2 conditions Treated and Control, while considering the effect of sample pairing and the confounding factors Source and Days. It seems very complicated now and I don't know how to design the design matrix and contrast matrix properly. Could anyone give me some suggestions? Thank you so much!

limma design matrix paired samples confounding factors • 4.7k views

ADD COMMENT • link updated 6.5 years ago by Aaron Lun ★ 28k • written 6.5 years ago by yuabrahamliu • 0

score 0 · Answer 1 · 2018-10-24

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 1 hour ago

The city by the bay

The simplest approach would be to just throw in everything (note the types):

# Pair_Name must be a factor,
# Days must be numeric.
design <- model.matrix(~Condition + Pair_Name + Source + Days)

... and set coef=2 in topTable() to test the condition effect. Of course, this assumes that all effects are additive on the log-scale, and that the effect of Days is linear with respect to log-expression.

You can weaken the first assumption by defining Group <- paste0(Condition, ".", Source) and using ~0 + Group + Pair_Name + Days instead in the design matrix. This accounts for any interactions (i.e., non-additivity) between the disease condition and the tissue of origin, which will probably be present in any real biological system. Each level of Group defines a condition/source combination, so you will need to use makeContrasts() to define comparisons of interest between specific combinations, followed by contrasts.fit(). If you don't care about blood, your contrasts would just be between normal and diseased tissue.

You can weaken the second assumption by defining DaysX <- splines::ns(Days, 3) and using DaysX instead of Days in the design matrix. This fits a spline with respect to the days, which is more flexible than a linear trend. However, it does require that you have more samples than shown in the table, otherwise you won't have enough residual degrees of freedom for variance estimation.

If nothing I said above makes any sense, read the limma user's guide.

ADD COMMENT • link 6.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Very helpful. Thank you Aaron!

ADD REPLY • link 6.5 years ago yuabrahamliu • 0

0

Entering edit mode

Hi Aaron,

Thank you for your help again, but I met a problem when using the method on my data, which is about the pairing information. As shown before, my data are like this,

Sample_Name	Condition	Pair_Name
Sample1	Treated	1
Sample2	Control	1
Sample3	Treated	2
Sample4	Control	2
Sample5	Treated	3
Sample6	Control	3

Actually, I have more than 100 samples, namely, more than 50 pairs, other confounding factors are not shown here. My code on the analysis is like this if only consider Condition and pairing information,

design <- model.matrix(~ Condition + Pair_Name, data = pd)
fit1 <- lmFit(Normdata, design)
fit1 <- eBayes(fit1)
difflimma <- topTable(fit1, coef=2, n=dim(fit1)[1],sort="logFC")

Here, I could only get about 40 genes with a adjusted p-value less than 0.05. While if I only consider Condition and use the following code,

design <- model.matrix(~ Conditon, data = pd)
fit1 <- lmFit(Normdata, design)
fit1 <- eBayes(fit1)
difflimma <- topTable(fit1, coef=2, n=dim(fit1)[1],sort="logFC")

If so, I could get greater than 9000 genes with a adjusted p-value less than 0.05!

I really feel worried about this result. Will the pairing information influence the significant genes so largely? Or, my code has something wrong? Or there are other causes. Actually, if I also consider other confounding factors like Source and Days, as mentioned yesterday, when combined with the pairing information, no genes will have an adjusted p-value less than 0.05. Maybe you have much more experience, do you think this result is reasonable? Thank you so much Aaron.

ADD REPLY • link 6.5 years ago yuabrahamliu • 0

0

Entering edit mode

Your second analysis (without blocking on pair) is probably very anticonservative. If the pair effect is strong, samples from the same pair are going to be highly correlated. If the design matrix does not know about the pairing, the model will not consider these correlations and will effectively overstate the residual d.f. by almost two-fold. This causes underestimation of the uncertainty of the variance estimate and lower-than-expected p-values. The fact that you have so many samples will amplify this effect, resulting in the observed differences in the number of DE genes between analyses.

So, I would say that the first analysis is quite likely to be much more correct than the second analysis. However, you are still missing the other potential explanatory variables, of which Source will probably be quite important. If you include these other variables and you no longer detect DE genes... well, that suggests that the DE genes previously detected were driven by the other variables and not by the condition of interest. (Especially given that Source and Condition do not seem to be balanced.) Perhaps there aren't many (or any) differences between conditions in your system - if the data is trying to tell you something, it's worth listening with an open mind rather than trying to coerce it into giving you DE genes.

ADD REPLY • link 6.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

I see. Really learned something. Thank you so much Aaron!

ADD REPLY • link 6.5 years ago yuabrahamliu • 0

0

Entering edit mode

Hi Aaron,

Yes, if consider pairing information and other confounding factors, the effect of Condition should be very limited on regulating gene expression, so I wanted to show the contribution of each factors to the variance using ANOVA,

fit <- aov(logExp ~ Condition + Source + Days + Pair_Name, data = Normdata)
summaryfit <- summary(fit)

However, what made me very confused was that, according to the F-stat of each variable, including Condition, Source, Days, and Pair_Name, Condition really had the absolutely dominant effect on the variance, that of other factors were very very little. (The averaged F-stat across all the genes was greater than 8, while each of the other factors was less than 1). Hence, do you think this is contradictory to the effect of other factors on DE gene selection shown before? I really feel confused. Sorry.

ADD REPLY • link 6.5 years ago yuabrahamliu • 0

0

Entering edit mode

aov does type I sequential ANOVA, where it tests factors in order (i.e., Condition alone, then Condition + Source, and so on). So the reported statistics for Condition don't even consider the effect of the other factors, which was your entire problem in the first place. You can see this pretty clearly:

set.seed(0)
g <- gl(2, 100) # true groups
b <- factor(c(rep(LETTERS[1:2], c(90, 10)), rep(LETTERS[1:2], c(10, 90)))) # blocking
y <- rnorm(200) + as.integer(b) # making up data - batch effect only.
summary(aov(y ~ g + b)) # wow, 'g' is significant!
summary(aov(y ~ b + g)) # but it actually isn't when you consider 'b'.

You'll get the desired type III ANOVA if you use lm instead:

summary(lm(y ~ g + b))
summary(lm(y ~ b + g)) # same p-values.

ADD REPLY • link 6.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Learned. Thank you so much Aaron!

ADD REPLY • link 6.5 years ago yuabrahamliu • 0

0

Entering edit mode

Hi Aaron,

I noted a strange thing when fit my data, don't know if it has a large influence on my result. The design matrix I used is,

design <- model.matrix(~ Condition + Source + Pair_Name, data = pd)

fit1 <- lmFit(Normdata, design)

However, after this step, I will always get a message as,

Coefficients not estimable: Pair_Name223
Warning message:
Partial NA coefficients for 45101 probe(s)

I don't know the reason of this warning and its influence on the result. My design matrix is like this,

(Intercept)	ConditionControls	SourceTissue	Pair_Name29	Pair_Name47	Pair_Name223
1	0	1	0	1	0
1	1	1	0	1	0
1	0	1	1	0	0
1	1	1	1	0	0
1	0	0	0	0	1
1	1	0	0	0	1

Actually I have much more rows and much more Pair_Name relevant columns. For the Pair_Name relevant columns, each of them has a column sum of 2 because 2 of all the samples belong to that specific Pair_Name. As to this design matrix, the last column is Pair_Name223 and its coefficient cannot be evaluated and all of its coefficients value in the fit1$coefficients table are NA. If I changed the design matrix to,

design <- model.matrix(~Condition + Pair_Name + Source, data = pd)

and let Source become the last column of the design matrix, limma will tell me that the coefficient of Source could not be evaluated this time.

I don't know the reason and will it bring a great influence to the result?

Thank you so much!

ADD REPLY • link 6.5 years ago yuabrahamliu • 0

0

Entering edit mode

This is a standard case of linear dependencies between coefficients. You can see how the intercept is equal to the sum of Pair_Name223 and SourceTissue. This means that there are infinitely possible solutions to the linear system that minimize the sum of squares. For a given intercept value, you could get the same sum of squares by subtracting "x" from the intercept and adding "x" to Pair_Name223 and SourceTissue. There is obviously no unique solution so one coefficient is discarded by the function to achieve identifiability.

In your example above, the pairs are nested within Source, so you can't have both of them in the same model. The immediate solution to the problem would be to simply drop Source from the model.matrix call. Now, the function raises a warning but the issue is symptomatic of a deeper misunderstanding of your experimental design. Unfortunately, my ability to give advice is limited here, as the design matrix in your comment above differs from the experimental set-up in your original post (where the pairing is not nested within the source). In fact, I don't even see how you managed to construct that design matrix from the model.matrix call, I can only assume that you have more samples than the six rows that are shown.