Dear all,
I have searched for quite a while for questions relating to the following but couldn't find anything.
I am trying to analyse gene expression data from quite a challenging study design. The study was a non-randomised study in which 25 subjects were treated with a combination of two drugs where they were either treated with both, treated with one or other or treated with neither. Samples were taken from patients before treatment and after treatment. To me this represents a 2x2 factorial design with a repeated measure. Unfortunately due to the non-randomised observational nature of the study, the design is unbalanced (i.e. there are unequal numbers of patients in each treatment subgroup). The subgroup assignment is as follows:
Drug A + | Drug A - | |
Drug B + | 6 | 2 |
Drug B - | 9 | 7 |
The primary comparisons of interest are:
- Differences in the gene expression changes between subjects treated and not treated with Drug A
- Differences in the gene expression changes between subjects treated and not treated with Drug B
However I would also like to control for the interaction effect of the presence or absence of the other drug in a given comparison since I believe that there will inevitably be significant interaction effect for some of the genes. In other words in defining the effect of Drug A we would like to take into account whether the Drug A effect is changing whether the subject is receiving Drug B or not.
From my searching on the topic, the conventional method to approach such a design would be a Two-Way factorial ANOVA with repeated measures. However, for reasons I do not fully understand repeated measures ANOVA is not recommended for unbalanced study designs even if SSIII testing is used for F tests. Multiple online sources and consultation with a statistician recommends the use of mixed-effect models for the analysis of unbalanced designs using something like the nlme R package. The statistician also recommended the following model:
gene~TIME*drugA*drugB, random=~1|SUBJECT
Where TIME represents the timepoint (pre and post), drugA and drugB are binary vector indicating subject's treatment assignment and the random argument specifies that the linear model includes a random intercept for each subject. This returns coefficients estimating the "main effects" of drugA and drugB over time and whether there is an interaction between drugB on the effect of drugA over time. Thereby allowing me to test for the specific effect of each treatment and account for the interaction of the other drug.
Now to my question! Since I would rather use a well established method for gene expression analysis such as LIMMA for the analysis of this study and I would rather wish to avoid implementing a mixed model for every gene, would it be appropriate to use LIMMA for this? I have actually gone ahead and carried out this analysis in LIMMA and verified that it returns the same coefficients. I did this using the following:
design<-model.matrix(~TIME*drugA*drugB)
corfit <- duplicateCorrelation(hamaset,design,block=patient)
corfit$consensus
fit <- lmFit(hamaset,design,block=patient,correlation=corfit$consensus)
Note: I used the duplicateCorrelation function to block for inter-subject differences for a paired approach.
So more specifically I'm asking is it appropriate to use LIMMA to test for interaction effects in this unbalanced design using the above method? Will the unequal number of subjects in the treatment subgroups result in biased or violations of the assumptions implemented in these interaction tests?
Any advice or comments on the question or carrying out this type of analysis would be much appreciated.
Best,
Rob
You say have 25 subjects, but the numbers allocated to each subgroup don't match up, i.e., 6+2+9+7 = 24. Also, is the number of samples twice the number of subjects, because you have before and after samples for each subject?
Hi Aaron,
Thanks for your reply. Yes you're correct that was an oversight on my part, at the end of the study we only had usable samples from 24 subjects although 25 were recruited. For simplicity's sake let's take the total number of subjects as 24.
Again for the sake of the example I gave lets say that yes there are 48 samples: 24 baseline and 24 post-treatment. I actually have 2 post-treatment timepoints and there are some missing observations giving me a total sample number of 66. However, I intend to do the baseline vs timepoint 1 and baseline vs timepoint 2 analyses separately in order to account for early and late changes.