Question

limma Time Course Experiment with many time points

1

Entering edit mode

arfranco ▴ 130

@arfranco-8341

Last seen 11 months ago

European Union

I am running a RNA-Seq analysis with these targets

Array                      Treatment    Time.h

Col.0.4h.primed.CEL    heat    4
Col.0.8h.primed.CEL    heat    8
Col.0.24h.primed.CEL    heat    24
Col.0.48h.primed.CEL    heat    48
Col.0.4h.unprimed.CEL    control    4
Col.0.8h.unprimed.CEL    control    8
Col.0.24h.unprimed.CEL    control    24
Col.0.48h.unprimed.CEL    control    48

I am following the chapter 9.6.2 of the limma guide, got the fitting and I am able to get the toptable list of DE genes. But I am lost in some questions

I wrote the following scripts. I got the topTable files without problems, and my adjusted.p values are meaningful

library(limma)

data <- read.delim("raw_rma_normalized_expression_values.txt", as.is = TRUE) 
readTargets("targets2.txt") 
colnames(targets) <- c("Array", "treatment", "time.h") 

library(splines)
X <- ns(targets$time.h, df= 2)
Group <- factor(targets$treatment)
design <- model.matrix(~Group*X)
fit <- lmFit(data, design)
fit <- eBayes(fit)

# The data obtained by topTable
toptable1_4 <- topTable(fit, coef=1:4)
toptable1_2 <- topTable(fit, coef=1:2) 
toptable1_3 <- topTable(fit, coef=1:3)

Now my questions

1. Why the fit <- lmFit(data, design) fails if I define in X <- (targets$time.h, df=2) a value of df >= 3

2. I do not know the meaning of the coefficients being compared in the latest three lanes of code. Why I have 6 coefficients? What should I do to compare my treatmens in time (4h vs the rest of the times) or 8h vs the resto of the time, and so on..

3. What are the consequences of defining a different df value in the ns function ?

4. How can I get graphics and get profit of this data ?

Thank you in advance

limma RNA-Seq • 3.5k views

ADD COMMENT • link updated 9.5 years ago by Aaron Lun ★ 28k • written 9.5 years ago by arfranco ▴ 130

score 8 · Accepted Answer · 2015-10-28

For question 1, you don't have enough residual d.f. to estimate the variance if you set df=3 or greater in the spline. Each increase in df results in an additional coefficient in the spline basis matrix. If you set df=3, you'll have 3 coefficients in X; multiply by two due to the interaction; add 2 for the coefficients for each treatment group; and you end up with 8 coefficients for 8 samples, i.e., zero residual d.f. Thus, lmFit will fail.

For question 2, the DE contrasts in the last three lines (i.e., the various calls to topTable) make no sense. Let's have a look at your colnames(design):

[1] "(Intercept)"  "Groupheat"    "X1"           "X2"           "Groupheat:X1"
[6] "Groupheat:X2"

All your contrasts involve dropping the intercept, which is meaningless. Rather, if you want to test for the effect of time in the control group, you should set coef=3:4. If you want to test for the time effect in the heat group, you should set coef=5:6. You can't test for DE between specific timepoints in this model, as the whole point of the spline fit is to capture the effect across all timepoints. If you want to test for differences in the time effect between treatment groups, you should run:

con <- makeContrasts(X1 - Groupheat_X1, X2 - Groupheat_X2, levels=design) # sub ":" with "_"

... followed by contrasts.fit, eBayes and topTable without any coef specification (the substitution is required, because names with : are invalid in makeContrasts). If you want to test for DE between groups at any timepoint, then you need:

con <- makeContrasts(Groupheat, X1 - Groupheat_X1, X2 - Groupheat_X2, levels=design)

Note that simply dropping Groupheat is not sufficient, due to the presence of the interaction terms. In all of these contrasts, you should be performing tests on all relevant spline coefficients simultaneously. Individual spline coefficients don't have much practical interpretation, and it makes little sense to drop them in isolation. Indeed, you should probably ignore the log-fold changes for these coefficients.

For your third question, specifying a large df in the call to ns results in a more flexible spline and a better fit to the observed expression values. Of course, if you go overboard and use too many df, you'll end up overfitting the spline to your data (in this case, losing residual d.f. for variance estimation). Values between 3 and 5 seem to work well in real applications where the trends are generally monotonic.

Finally, for your fourth question, your graphics will depend on what you want to plot. See plotSA, plotMA and plotMDS for starters, and try help("09.Diagnostics") to get a nice overview of the plotting functions in limma. As for profits; I don't know what you mean. If I could make money off data analysis, I wouldn't be answering questions here for free.