Dear all,
I'm trying to use limma
in order to analyze a large number of Affymetrix microarrays. The experimental setup includes 2-3 replicates, 2-4 different time points and 1-5 different dosage levels.
So a typical targets file could look like the following where the time
and dose
can be non-integer numbers in reality:
> my.targets = data.frame(group = c(rep("untreated", times = 6), rep("treated", times = 18)), time = rep(1:3, each = 2, times = 4), dose = rep(0:3, each = 6), replicate = rep(c("A", "B"), times = 12))
> my.targets group time dose replicate 1 untreated 1 0 A 2 untreated 1 0 B 3 untreated 2 0 A 4 untreated 2 0 B 5 untreated 3 0 A 6 untreated 3 0 B 7 treated 1 1 A 8 treated 1 1 B 9 treated 2 1 A 10 treated 2 1 B 11 treated 3 1 A 12 treated 3 1 B 13 treated 1 2 A 14 treated 1 2 B 15 treated 2 2 A 16 treated 2 2 B 17 treated 3 2 A 18 treated 3 2 B 19 treated 1 3 A 20 treated 1 3 B 21 treated 2 3 A 22 treated 2 3 B 23 treated 3 3 A 24 treated 3 3 B
as you can see the "control" group is synonymous with no dosage.
I would like to evaluate the overall genetic response to treatment and time. So usually I would accomplish this with:
> my.design = model.matrix(~ time * dose, data = my.targets)
since I'm not sure whether the response is to dosage, time or their interaction.
However, I don't know what the baseline expression would be with this design and whether limma
is aware of the within replicates variability. So is it enough to use the following design:
> my.design = model.matrix(~ group + time * dose, data = my.targets)
with which limma
will then use the untreated samples to establish a baseline expression? Or do I need to explicitly include the replicates as well?
Thank you for your insights.
You will probably have to give more information.
1.) Are time and dose integers in that data.frame? If so, that is almost surely not what you want to do (unless e.g., time is say hours, or dose is say, 0, 1, 2, 3 mg/kg or something like that).
2.) What does replicate signify? If I were to naively look at the data you present, I would have to assume you had just two biological replicates, and you treated these two animals with lots of different levels at various times.
You don't want to use group and dose in the model matrix (and probably don't want to use group at all). Group is partially aliased with dose, so you have semi-redundant information in your design matrix.
To clarify James' first point; do you want to consider each time point as a separate group, or do you want to fit the model such that expression changes linearly with the
time
covariate (e.g., expression of 2 at the first time point, 3 at the second, 4 at the third, and so on)? The former is more flexible as it doesn't force genes to follow a linear trend. You end up with a larger model, but that's fine, as you've got plenty of residual d.f. The same argument applies todose
.Thank you for your comments.
time
anddose
can be non-integer numbers. Time is in the range of hours to a maximum of 4 weeks. Doses vary widely and can be anywhere from a fraction to several thousand. The units are different per separate compound considered. The shown targets are for one compound only.I agree that
group
contains redundant information but if I am interested in the change (differentially expressed genes and coefficients) as compared to the untreated samples, is it the most reasonable choice to simply fit a linear model that includes time and dose or should I rather be forming a contrast?In essence this touches upon Aaron Lun's comment. I believe that it might be sufficient to model a linear response of gene expression with time and dose. I have also been considering, however, as described in the userguide, to treat each dosage as a different group and fit splines to the time points per dosage group. Or I could, as Aaron Lun suggested, form groups given by unique combinations of time and dose.
My problem with the last two approaches is that I would like to end up with an overall response per gene to one compound. That means, how strongly does it respond (absolute values of coefficients in the linear model) and certainty (p-value). With the last two approaches I would have to maybe take the mean of the coefficients and aggregate the p-values.