We have a multi-factor RNA-Seq experiment with multiple baseline measures that we would like to model using edgeR:
- 2x pre-treatment samples per mouse [time=0]
- 1x post-treatment sample per mouse [time=1]
- 2x treatment groups (all mice are treated at post-treatment)
The goal is to evaluate the treatment effect between groups adjusted for baseline (see full R code below). Note, all mice are untreated at baseline. We identified the following two approaches to model this in edgeR:
Approach 1: use each of the two pre-treatment samples per mouse as part of the model
Model: mouse+time+time*treatment
- 2x pre-treatment samples per mouse [time=0]
- 1x post-treatment sample per mouse [time=1]
- 2x treatment groups (all mice are treated at post-treatment)
Approach 2: calculate the average counts of the two pre-treatment samples per mouse prior to modeling and use it as single baseline value
Model: mouse+time+time*treatment
- 1x one pre-treatment sample [average] per mouse [time=0]
- 1x post-treatment sample per mouse [time=1]
- 2x treatment groups (all mice are treated at post-treatment)
Which of the two approaches do you recommend for estimating the treatment effect for this experiment using edgeR?
Full R code :
library(edgeR);
y = matrix(rnbinom(12000,mu=10,size=10),ncol=12);
time = as.factor(rep(c(0,0,1),4));
treatment = as.factor(rep(c(1,2),2,each=3));
mouse = as.factor(rep(c(1:4),each=3));
d = DGEList(counts=y,group=treatment);
design = model.matrix(~mouse+time+time*treatment )[,-6];
d=estimateGLMCommonDisp(d,design)
d=estimateGLMTrendedDisp(d,design)
d=estimateGLMTagwiseDisp(d,design)
d.fit = glmFit(d,design);
d.lrt = glmLRT(d.fit,'time1:treatment2')
Thank you Gordon for sharing your solution and clarifying that combining pre-treatment samples from the same mouse is neither necessary nor optimal using edgeR.
My main question was which model would be preferred to account for correlations between multiple pre-treatment samples from the same mouse. You fully answered this. Thank you.