Hi again,
I already posted my question, but was asked to open a new one because it was suggested to use limma instead of DESeq2:
C: DESeq2 design formula: how to account for twin data (MZ and DZ)?
A summary of the relevant parts of my question:
I have to analyze some RNA-seq data, and I have got 400 samples of human whole blood (proportions of cell types available from FACS blood counts), all females, 40-86 years old. My phenotype of interest is disase as compared to healthy controls. There is also an intermediate stage, i.e. there are three disease states, but this intermediate stage is less interesting at the moment.
The problem: There are both monozygotic (MZ, n=220) and dizygotic (DZ, n=120) twins in the data. Additionally, there are singletons (n=60 samples with no twin pair). The different groups are not evenly distributed across my phenotype of interest, and the design is quite unbalanced in general:
- 10 disease samples (2 DZ twins, rest singletons)
- 20 intermediate samples (singletons)
- 370 healthy controls (MZ twins, DZ twins, singletons)
I would like to look at both differential expression, but also differential variability of expression (e.g. DiffVar) between the disease and healthy controls (I will probably have to use subsets of the data to have similar numbers of samples in each group, as I have only 10 disease samples, but 370 controls).
Additionally, we would like to have the possibility to look into twin effects as well, which is one of the reasons why I do not simply want to remove one of the two twins for each of them to have singletons only.
Thus I was wondering about how to correctly specify the design formula for these data. Obviously - apart from covariates like cell subpopulations and perhaps age - I would like to take the twin pairs into account, as I expect a strong genetic effect. So, building blocks using the duplicateCorrelation()
function could be useful, as also pointed out by Michael in my previous post. However, I would expect a much stronger effect for MZ twins than for DZ twins, so I do not think it is a good idea to treat all twin pairs equally (?) Adding the information of MZ and DZ as a covariate to the design formula will also not help, I suppose, as the effect occurs between the twin pairs, not across all individual samples within the group of MZs and DZs. It might work by adding an interaction term?
Any advice on how to build the blocks and/or define the design formula correctly would be highly appreciated.
More details about my doubts and concerns can be found in my original post (see link above).
Thank you very much for your help.
Simone
This is not a direct answer, but have you done anything to quantify the difference in correlation between twins vs the correlation between unrelated individuals? Maybe it will turn out that the excess correlation between twins is not as large as you suspect. I would also try running duplicateCorrelation with only the DZ twins and only the MZ twins to see how much they differ from each other. And I guess you could randomly pair up the unrelated samples and run duplicateCorrelation as a "control" to compare the twins' duplicate correlation values to. These steps might give you information that allows you to simplify the experimental design to the point where modeling it in limma is more straightforward.