Hi all,
I have a data set with both paired samples and confounding factors. I want to use limma to find the differential expressed genes, but I don't know how can I design the design matrix. Could anyone give some suggestions? The data looks like this,
Sample_Name | Condition | Pair_Name | Source | Days |
Sample1 | Treated | 1 | Tissue | 25 |
Sample2 | Control | 1 | Tissue | 30 |
Sample3 | Treated | 2 | Blood | 22 |
Sample4 | Control | 2 | Tissue | 31 |
Sample5 | Treated | 3 | Tissue | 32 |
Sample6 | Control | 3 | Blood | 30 |
The variable Condition, namely Treated or Control, is the main effect need to be explored. Pair_Name reflects the pairing of the samples, namely the samples in pair 1 from the same donator, samples in pair 2 from the 2nd donator, pair 3 from the 3rd. Soure is the source of the sample, some from the tissue of interest, while some from blood, which is a discrete variable, and also a confounding factor. The last variable Days is a continuous variable and another confounding factor.
What I want to do is to select the significantly differential expressed genes between the 2 conditions Treated and Control, while considering the effect of sample pairing and the confounding factors Source and Days. It seems very complicated now and I don't know how to design the design matrix and contrast matrix properly. Could anyone give me some suggestions? Thank you so much!
Very helpful. Thank you Aaron!
Hi Aaron,
Thank you for your help again, but I met a problem when using the method on my data, which is about the pairing information. As shown before, my data are like this,
Actually, I have more than 100 samples, namely, more than 50 pairs, other confounding factors are not shown here. My code on the analysis is like this if only consider Condition and pairing information,
Here, I could only get about 40 genes with a adjusted p-value less than 0.05. While if I only consider Condition and use the following code,
If so, I could get greater than 9000 genes with a adjusted p-value less than 0.05!
I really feel worried about this result. Will the pairing information influence the significant genes so largely? Or, my code has something wrong? Or there are other causes. Actually, if I also consider other confounding factors like Source and Days, as mentioned yesterday, when combined with the pairing information, no genes will have an adjusted p-value less than 0.05. Maybe you have much more experience, do you think this result is reasonable? Thank you so much Aaron.
Your second analysis (without blocking on pair) is probably very anticonservative. If the pair effect is strong, samples from the same pair are going to be highly correlated. If the design matrix does not know about the pairing, the model will not consider these correlations and will effectively overstate the residual d.f. by almost two-fold. This causes underestimation of the uncertainty of the variance estimate and lower-than-expected p-values. The fact that you have so many samples will amplify this effect, resulting in the observed differences in the number of DE genes between analyses.
So, I would say that the first analysis is quite likely to be much more correct than the second analysis. However, you are still missing the other potential explanatory variables, of which
Source
will probably be quite important. If you include these other variables and you no longer detect DE genes... well, that suggests that the DE genes previously detected were driven by the other variables and not by the condition of interest. (Especially given thatSource
andCondition
do not seem to be balanced.) Perhaps there aren't many (or any) differences between conditions in your system - if the data is trying to tell you something, it's worth listening with an open mind rather than trying to coerce it into giving you DE genes.I see. Really learned something. Thank you so much Aaron!
Hi Aaron,
Yes, if consider pairing information and other confounding factors, the effect of Condition should be very limited on regulating gene expression, so I wanted to show the contribution of each factors to the variance using ANOVA,
However, what made me very confused was that, according to the F-stat of each variable, including Condition, Source, Days, and Pair_Name, Condition really had the absolutely dominant effect on the variance, that of other factors were very very little. (The averaged F-stat across all the genes was greater than 8, while each of the other factors was less than 1). Hence, do you think this is contradictory to the effect of other factors on DE gene selection shown before? I really feel confused. Sorry.
aov
does type I sequential ANOVA, where it tests factors in order (i.e.,Condition
alone, thenCondition + Source
, and so on). So the reported statistics forCondition
don't even consider the effect of the other factors, which was your entire problem in the first place. You can see this pretty clearly:You'll get the desired type III ANOVA if you use
lm
instead:Learned. Thank you so much Aaron!
Hi Aaron,
I noted a strange thing when fit my data, don't know if it has a large influence on my result. The design matrix I used is,
However, after this step, I will always get a message as,
I don't know the reason of this warning and its influence on the result. My design matrix is like this,
Actually I have much more rows and much more Pair_Name relevant columns. For the Pair_Name relevant columns, each of them has a column sum of 2 because 2 of all the samples belong to that specific Pair_Name. As to this design matrix, the last column is Pair_Name223 and its coefficient cannot be evaluated and all of its coefficients value in the fit1$coefficients table are NA. If I changed the design matrix to,
and let Source become the last column of the design matrix, limma will tell me that the coefficient of Source could not be evaluated this time.
I don't know the reason and will it bring a great influence to the result?
Thank you so much!
This is a standard case of linear dependencies between coefficients. You can see how the intercept is equal to the sum of
Pair_Name223
andSourceTissue
. This means that there are infinitely possible solutions to the linear system that minimize the sum of squares. For a given intercept value, you could get the same sum of squares by subtracting "x" from the intercept and adding "x" toPair_Name223
andSourceTissue
. There is obviously no unique solution so one coefficient is discarded by the function to achieve identifiability.In your example above, the pairs are nested within
Source
, so you can't have both of them in the same model. The immediate solution to the problem would be to simply dropSource
from themodel.matrix
call. Now, the function raises a warning but the issue is symptomatic of a deeper misunderstanding of your experimental design. Unfortunately, my ability to give advice is limited here, as the design matrix in your comment above differs from the experimental set-up in your original post (where the pairing is not nested within the source). In fact, I don't even see how you managed to construct that design matrix from themodel.matrix
call, I can only assume that you have more samples than the six rows that are shown.I see. Thank you so much Aaron!