Please help me build a design that accounts for both the matched case-control design (there were 1 case [disease=1] and 2 controls [disease=0] in each pair [having the same matchid], matched for age and gender) and the correlation between multiple measurements at different timepoints (time 1, 2, 3) in a given subject. My objective is to compare the microbial differential abundance between cases and controls over all timepoints (not treating each timepoint separately). Thanks a lot!
My dataset is as following:
personid |
matchid |
time |
disease |
---|---|---|---|
1 |
1 |
1 |
1 |
1 |
1 |
2 |
1 |
1 |
1 |
3 |
1 |
2 |
1 |
1 |
0 |
2 |
1 |
2 |
0 |
2 |
1 |
3 |
0 |
3 |
1 |
1 |
0 |
3 |
1 |
2 |
0 |
3 |
1 |
3 |
0 |
4 |
2 |
1 |
1 |
4 |
2 |
2 |
1 |
4 |
2 |
3 |
1 |
5 |
2 |
1 |
0 |
5 |
2 |
2 |
0 |
5 |
2 |
3 |
0 |
6 |
2 |
1 |
0 |
6 |
2 |
2 |
0 |
6 |
2 |
3 |
0 |
7 |
3 |
1 |
1 |
7 |
3 |
2 |
1 |
7 |
3 |
3 |
1 |
8 |
3 |
1 |
0 |
8 |
3 |
2 |
0 |
8 |
3 |
3 |
0 |
9 |
3 |
1 |
0 |
9 |
3 |
2 |
0 |
9 |
3 |
3 |
0 |
If I use edgeR, is the design correct?
design <- model.matrix(~matchid+personid+disease+time+disease:time)
(There are paired examples and repeated measurement examples in the edgeRUsersGuide. However, in repeated measurement examples, repeated measurements at different timepoints were obtained from independent subjects. By contrast, in my study, repeated measurements at different timepoints were obtained from the same subjects.)
Thanks a lot for your professional and quick reply!! I appreciate it very much. There are 3 more questions:
1) If I need to adjust for some confounding factors such as smoking and drinking. Can I just add these variables into the model.matrix? Like design <- model.matrix(~matchid + time +smoking+drinking+ disease) (using duplicateCorrelation with the blocking factor set to personid).
2) Incomplete data further complicated this study. There were heterogeneity in the number and timing of measurements from different subjects in our study (for example, some pairs had only one control [not 2 controls]; some pairs had only 2 measurements [not 3 measurements]). Is the above design still suitable for this data?
3) As I am not familiar with limma, is it suitable for analyzing 16s data (in this study, microbiome composition and specific bacterial abundances were determined through bacterial 16S rRNA gene sequencing)? Compared to edgeR and Deseq2 (more frequently used for the microbiome data in the published papers), does limma have any disadvantages, although it can take the complicated design into account?
Thanks again for your precious time!
1) Yes, you can do that, provided the extra variables are not confounded with
disease
. If they are confounded withmatchid
, then they do not need to be included at all.2) Yes, that should be fine; information from all "pairs" (triplets, really) and time points will be used to estimate the disease effect, so a few missing observations should not cause a problem.
3) I don't know, but if you can model the mean-variance trend accurately, you should be fine. See if the plot produced by
voom
is similar to what you would get for RNA-seq data.Thank you very much!
For the 1st question, actually, the extra variables such as smoking may correlate with disease (these variables are not confounded with matchid), thus we want to obtain a pure correlation between microbiome and disease by adjusting these variables (For example, in multivariate logistic regression models, we can both add interested variables and confounders into the multivariate model).Is it OK for me to add these confounders (may correlate with disease) into the design matrix?