Hello All,
I'm currently working through an analysis of 96 RNAseq samples generated from a rather complicated experimental design and need some help constructing my design matrices. Briefly, the experiment involves repeated measures of 12 individuals across two experimental conditions one week apart (No Stress and Stress). Furthermore, individuals are split into two groups, "Risk" or "Control. So factors are as follows:
Participant: 12 levels (1-12); Status: 2 levels [Risk (N=6) or Control (N=6)]; Session: 2 levels (No_Stress or Stress)
Time: 4 levels (1-4, sample times are the same in both sessions, e.g. No_Stress.1 = Stress.1)
This is summarized in the table below.
Participant | Status | Session | Time | Group |
---|---|---|---|---|
1 | Control | No_Stress | 1 | No_Stress.1 |
1 | Control | No_Stress | 2 | No_Stress.2 |
1 | Control | No_Stress | 3 | No_Stress.3 |
1 | Control | No_Stress | 4 | No_Stress.4 |
1 | Control | Stress | 1 | Stress.1 |
1 | Control | Stress | 2 | Stress.2 |
1 | Control | Stress | 3 | Stress.3 |
1 | Control | Stress | 4 | Stress.4 |
... | ||||
12 | Risk | No_Stress | 1 | No_Stress.1 |
12 | Risk | No_Stress | 2 | No_Stress.2 |
12 | Risk | No_Stress | 3 | No_Stress.3 |
12 | Risk | No_Stress | 4 | No_Stress. |
12 | Risk | Stress | 1 | Stress.1 |
12 | Risk | Stress | 2 | Stress.2 |
12 | Risk | Stress | 3 | Stress.3 |
12 | Risk | Stress | 4 | Stress.4 |
I'm interested in the following hypotheses:
i) Which genes are responsive to time (i.e. circadian)?
ii) Which genes are responsive to stress?
iii) Which genes are deferentially responsive to stress as a function of status (Risk versus Control)?
FILTERING AND NORMALIZATION STEPS:
X <- DGEList(counts=counts, genes=rownames(counts)) keep <- rowSums(cpm(X)>=1) >=2 X_filtered <- X[keep,keep.lib.sizes=FALSE] X_filter_norm <- calcNormFactors(X_filtered) X_filter_norm <- estimateGLMCommonDisp(X_filter_norm, method="deviance",robust=TRUE,subset=NULL)
I have no true replicates so I used "estimateGLMCommonDisp" with the settings above as was recommended in Section 2.11 of the edgeR User Guide.
QUESTION 1: Is there any preference to using the combined factor "Group" versus nesting Time within Session.
In other words is
time_matrix1 <- model.matrix(~0+Group)
preferred over
time_matrix2 <- model.matrix(~0+Session+Session:Time)
For Questions 2 & 3 I'm assuming using Group is best, but if not I would replace it with "Session+Session:Time" accordingly below.
QUESTION 2: I don't necessarily care about differences at the participant level, but I would like to control for them. Does my inclusion of Participant as a factor in my design matrix take care of this? For example is the model below sufficient to tests hypotheses i and ii?
time_matrix3 <- model.matrix (~0+Group+Participant) fit <- glmFit(X_filter_norm, time_matrix3)
Or do I need to nest the combined factor within Participant?
time_matrix4 <- model.matrix(~0+Participant+Participant:Group)
QUESTION 3: Is the fully nested model below appropriate for hypothesis iii?
full_matrix <- model.matrix(~Status + Status:Participant + Status:Participant:Group)
Or can I removed the Participant factor and have the matrix below instead?
full_matrix <- model.matrix(~Status + Status:Group)
ANY help or guidance would be much appreciated! I've gotten this far by piecing together different strategies I've read on other posts and in user manuals, but I still haven't found a solid answer or example of a design this complicated.
Thanks!
Just to add to Ryan's answer; you can also look for
Status
-specificGroup
effects. For example:The first 12 terms represent patient-specific effects and can be ignored. The remaining terms represent the
Status
-specific log-fold change of eachGroup
over the no-stress group at time point 1. These can be compared to each other to see if there is any interaction betweenStatus
andGroup
.However, as mentioned, a direct comparison between
Status
is not possible while blocking onParticipant
. In such cases, you have to switch to limma andduplicateCorrelation()
. If you want to continue to use edgeR, you need to subset your dataset so that only one sample from each participant is present, to avoid problems from correlations between participants.Needless to say, make sure
Participant
is indeed a factor.