Dear Shan Gao,
There are many legitimate ways to define a design matrix. The best way depends on what questions you want to answer or what hypotheses you want to test.
It looks like you have 35 libraries in total (24 treatment samples and 11 control samples). Is this right? Are all the libraries biologically
indendent?
I assume that the control samples have received some sort of inactive treatment for the same time period as the active samples.
What is it that you want to test? Do you want to find DE genes between treatment and control at each time? Or do you want to find genes are DE at each time vs 0h, adjusting for control differences?
In any case, you need to create a factors Time and Treatment. There are of length equal to the number of samples (35). Time takes values from 0 up to 48. Treatment takes two values, Treatment or Control, say.
If you can respond to these questions, then I'll give you more help with the design matrix and so on.
Best wishes
Gordon
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Dear Shan Gao,
I always advise people to create a "targets file", laying out the treatments associated with each sample, rather than trying to create
factors on the fly in R. This can often avoid problems that might otherwise not be noticed.
For example, the factors you've created below are of different lengths, and I suspect you mean "0h" rather than "oh".
Assuming you fix this, a quick way to create a design matrix would be:
This will create a design matrix with 14 columns. Coefficients 8-14 test for treatment effects at each time.
To test for treatment effect at time 0:
To test for treatment effect at time 1:
and so on.
I don't know of a good reference on using model.matrix(), except perhaps for the limma User's Guide. A very brief introduction to using model formula in R is given in Section 11 of the manual "An Introduction to R" that comes with R.
Best wishes
Gordon
Dear Shao Gao,
I don't really understand your questions, and I'm afraid I have no more time to give you tutorials at the moment.
I will just say that the formula ~time+treatment (that you did not mention in earlier emails) estimates the treatment effect averaged over all the different times. It estimates just one average treatment effect.
The formula I suggested, ~time+time:treatment estimates a separate treatment effect for each time. In other words, it estimates seven
different time-specific treatment effects, one for 0h, one for 6h, etc. It seems clear from the experimental design that this is what is
required.
If you find that using model formula in R is too obscure to understand, then here is a longer method that you might find more intuitive. It
treats your experiment as a oneway layout, from which you can extract all the comparisons you want. You create a combined factor to label all your samples. Lets say you use "T16" to mean treatment at 16h and C16 to mean control at 16h, and so on. So create a combined factor with something like:
Then
Fit the full model (after estimating the dispersions):
Then later on you can make any comparison you want. Suppose you want test for a treatment effect at time 16, that is T16-C16. You can ask for this by:
This gives exactly the same results as the approach I suggested in my last email.
You can use this for any comparison at all. For example, is the treatment effect larger at 6h than at 0h?
I hope this gives you enough to go on with. If not, I hope that
others
can help you.
Best wishes
Gordon
Thanks very much, Dr. Smyth, I start to understand what the manual of R is saying in terms of model formulae, to be honest, that manual seems to be drafted for the user of Limma or other R experts instead of a biologist.