Suppose you have tumor and normal samples treated with compound A with 0, 1, 2 mM for 0, 1, 2 hours, and you want to do DGE analysis using edgeR. You want to answer the question "which genes are differentially expressed in response to compound A": after taking into account of the impact of disease condition, time of treatment, and concentration of A (noted as [A]), you want to focus on the relationship of treatment by A and gene expression, and don't care about disease condition, time of treatment, not even [A].
How do you do it in edgeR?
In R, assuming the log(expression) being normally distributed, you could simply do glm(log(expression) ~ disease_condition + [A] + time. Then you only need to check the slope of [A] and its p-value, right? But since log(expression) is not normally distributed, we cannot use glm in R.
While in edgeR, things seem to be much more complicated. First, both time and [A] cannot be considered to be continuous (correct me if I am wrong); Second, I don't know how to do the DGE analysis.
For simplicity, let's forget about time in the problem -- just consider disease condition and [A] (0 & 1 mM), each with a replicate. Your gene count file looks like the following (T for tumor, N for normal; A0 for [A] = 0, A1 for [A] = 1; column 2~9 are gene counts for 8 samples, one column for one sample):
GENE T_A0 T_A0 T_A1 T_A1 N_A0 N_A0 N_A1 N_A1
GENE_A ...
GENE_B ...
...
You will have four levels: T_A0, T_A1, N_A0, N_A1.
Your code snippet looks like the following:
y <- DGEList(counts=d, group=grp);
y <- calcNormFactors(y); # global normalization
dsgn <- model.matrix(~0+grp); # now allowing intercept
y <- estimateDisp(y, dsgn);
fit <-glmFit(y, dsgn)
The question is how to construct your test (or contrast). Assuming the levels are ordered as above, what would be your contrast? c(-0.5, 0.5, -0.5, 0.5) or something else? Or is there a better approach? Or am I terribly wrong?
Thanks in advance!
You are great! So helpful!
1. Need to digest the difference of glm & lm. Thought they were almost identical except that glm allows "link";
2. OK, makes sense. So if you have enough replicates (e.g. 4 replicates each for 0, 0.1, 0.5, 1mM concentration for disease samples and normal samples), then you can use glm?
3. The reason that made me think edgeR disallows continuous variable is the "group" -- you assign a group to each sample, and even more, in the user's guide section 3.3.1, there is this line:
Group <- factor(paste(targets$Treat,targets$Time,sep="."))
which gives me the impression that an exhaustive combination of "factors" have to be used.
If I want to treat [A] as continuous, what would be "group" & "level" definitions?
How if I treat all variables (disease condition, [A], and time) as continuous, I would have no groups and no levels? What would my R script be?
Thanks much!
For question 2, not really. You still benefit from sharing information between genes and (for QL edgeR) accounting for the uncertainty of dispersion estimation. edgeR is also faster than
glm
when run across thousands of genes, and it provides more robust convergence for additive models. Finally, you don't get the benefit of TMM normalization when you runglm
on the raw counts - this is necessary to avoid composition biases when you have unbalanced up/downregulation.For question 3, the
group
setting is only necessary for classic edgeR. For GLM edgeR, you can just set up a design matrix. Read through Section 3.4 for additive models, and read the case studies in the user's guide. The reason we parametrize experimental designs in terms of groups is that a one-way layout is mathematically equivalent to an interaction model but is much easier to interpret. If you have continuous covariates, this is not possible so you just leave it as an additive model.