I have reviewed the LIMMA manual (esp. sections 9.6.1 and 9.6.2), and need conceptual help in applying either this approach to my experimental design or suggestions to use an alternative approach.
My microarray RNA data is as follows: Two tissue types, five time points (0, 6, 24, 72, 168 hours), ~20 subjects, with each subject having up to 3 samples collected for each tissue at various time points (IRB restrictions). A wound is made and the time zero collection is the "baseline" expression profile for each tissue.
I wanted to ask the following questions but can not wrap my head around the best approach in terms of setting up the design matrix. Questions:
1) How do the tissues independently respond to wounding over time?
I could compare each tissue after wounding to it's baseline time 0hr doing various contrasts (M6-M0, M24-M0, M72-M0, etc), but the issue I find here is that the two tissues are completely different at time 0hr. This means my "control" or "baseline" that I'm comparing subsequent time points to is different between the two tissues and that I cannot really compare the results (like an intersect or overlap) between the two tissues. Would it make sense to group M0 and C0 as a control group?
Another approach is to compare changes over time using these contrasts (M6-M0, M24-M6, M72-M24, etc.) for each tissue. This leads to four comparisons for each tissue (8 total).
2) How do the two tissues differ in their response to wounding over time?
For question two I thought I could compare each tissue at each time point using contrasts (M0-C0, M6-C6, M24-C24, etc), but this is also a lot of comparisons (4) so I thought a spline based approach to model general changes over time would be more tractable. The positive aspect of the spline is I get changes in expression over time that follow different trends between tissues in one comparison. The downside is that the only way (I can think of) to group genes changing in the same manner over time is to cluster the significant results after adj.P.Val and lfc cutoff. Please see below:
pData$X <- ns(pData$time, df=3)
#Setup model matrix and rename columns ":" to "."
design <- model.matrix(~0+tissue+tissue:X, eset)
colnames(design)
[1] "tissueC" "tissueM" "tissueC.X1" "tissueM.X1" "tissueC.X2" "tissueM.X2"
[7] "tissueC.X3" "tissueM.X3"
# Estimate the correlation between measurements made on the same subject:
corfit <- duplicateCorrelation(eset,design,block=eset$subject)
corfit$consensus
# Fit the model
fit <- lmFit(eset,design,block=pData$subject,correlation=corfit$consensus)
fit2 <- eBayes(fit)
table1 <- topTable(fit2, coef=c(3:8), number=Inf, adjust.method="BH", sort.by="none")
3) I'm wondering how different is the table1 output (conceptually) than the table2 output from the code below in terms of what it is reporting?
cont.mat = makeContrasts(tissueM.X3-tissueC.X3, tissueM.X2-tissueC.X2, tissueM.X1-tissueC.X1, levels = design)
fit.cont =contrasts.fit(fit, cont.mat)
fit.cont=eBayes(fit.cont)
table2 <- topTable(fit.cont, number = Inf, adjust.method = "BH", sort.by = "none")
4) In order to look for genes changing over time in tissue M is the code below correct?
tableM <- topTable(fit2, coef=c(4, 6, 8), number = Inf, adjust.method = "BH", sort.by = "none")
5) In order to look for genes changing over time in tissue C is the code below correct?
tableC <- topTable(fit2, coef=c(3, 5, 7), number = Inf, adjust.method = "BH", sort.by = "none")
Hi Aaron, this is all great information and I really appreciate you taking the time to help! This particular step was a current bottleneck in my project as I wanted to get it right and I'm excited to move forward with it.
Just to clarify,
This tests the null hypothesis that the 6 hour response is the same (not different) between tissues. The alternative hypothesis is that they are not the same, correct?
That is correct. The other time points follow the same principle.
For the results output from the following contrast listed (also above):
In this results file, do the positive test statistics correspond to genes over expressed in M at 6hr (after baseline differences are accounted for) and negative test statistics represent C at 6hr or do I have this backwards? I just wanted to ensure I have it correct.
Hope its okay to post this here rather than creating a new post.
"This approach can be sensible if you specifically want to look for changes in consecutive time points, rather than comparing everything to the zero hour."
When using this model (time1-time0, time2-time1, time3-time2... etc) limma seems to give me genes differentially expressed at any time point. Is this the correct way to identify genes which, for example, are sequentially upregulated more and more with time?
Well, yes. It's just a different formulation of the same null hypothesis. The main difference is that the log-fold changes are reported with respect to consecutive time points rather than with respect to the zero time points. This can be a more helpful summary and allow you to more rapidly select genes of interest, e.g., by taking all genes that have positive log-fold changes between consecutive time points.