Hi all,
I have to analyze some RNA-seq data and I am using DESeq2 for this. However, the experimental design is somewhat complicated in this case and I am unsure about how to correctly define the design matrix for DESeq2.
I am mostly following the “RNA-seq workflow: gene-level exploratory analysis and differential expression” workflow on Bioconductor and have already got a SummarizedExperiment. Now, I want to convert my data into a DESeqDataSet: http://www.bioconductor.org/help/workflows/rnaseqGene/#starting-from-summarizedexperiment. For this step, I need to specify the design formula:
ddsSE <- DESeqDataSet(se, design = ~ [covariates] + status)
About my data:
There are around 400 human peripheral blood samples in the data set. The phenotype of interest is a disease (“status
”). I have 10 disease samples, 17 samples of an intermediate stage, and the rest of the 400 are healthy controls. Samples are all derived from females between 40 and 86 years old, so usually I would like to add a covariate for age. Then, there are different proportions of different blood cell subtypes (proportions obtained by FACS) in each sample, which we would (usually) like to take into account by adding further covariates.
So far, so good. But in these data, I also have twins. So, I would like to take the pairs into account, of course, as I expect a strong genetic effect. The problem: I have both monozygotic (MZ, around 220 samples) and dizygotic (DZ, around 120 samples) twins, and I do not think it is a good idea to treat these pairs equally. Adding the information of MZ and DZ as a covariate will also not help, I suppose, as the effect occurs between the twin pairs, not across all individual samples within the group of MZs and DZs.
To make it even more difficult, I have singletons as well (no sibling sample in the data). I do also not have disease-discordant twins. Instead, I have a DZ twin pair who both have the disease. The rest of twins are all within the healthy control samples.
As stated before, the main phenotype of interest is disease (as compared to healthy). But we would also like to be able to look at the twin-effect on the side, in comparison to some other analyses in another data set (but still for the same study). Thus, simply removing the twins to have singletons only is not a good solution. We would not really loose power for the main comparison of disease vs healthy (I will only use matched singleton subsamples for the healthy group anyhow), but we would have no chance to do the additional analyses on twins we would like to perform.
Summary:
400 samples, whole blood (blood counts/proportions of cell types available), females, 40-86 years old
220 MZ twins, 120 DZ twins, 60 singletons
3 phenotypes of interest:
- 10 disease samples (2 DZ twins, rest singletons)
- 20 intermediate samples (singletons)
- 370 healthy controls (MZ twins, DZ twins, singletons)
We are interested in differential expression, but also in differential variability of expression between the phenotypes of interest.
Additionally, we would like to look at twin effects.
What I thought we could possibly do:
-
Prepare a data set with singletons only for the case-control analyses:
design = ~ age + celltype1 + celltype2 [+ ...] + status
-
Prepare another data set with healthy MZ twins only for the complementary twin analyses
design = ~ age + celltype1 + celltype2 [+ ...] + familyID
Actually, I would prefer to have it all in one, if there was a way of doing so. Because normalization etc. will change expression values between the separate (sub) data sets, and that might render the data and results less comparable. But using the data set as a whole without taking the different types of twins effectively into account would be even worse, in my opinion.
Which strategy would you advice? Would there be an appropriate way of keeping all samples within the data but still model it correctly, taking the different types of twins into account?
Any recommendation on how to do best deal with these data would be highly appreciated.
Thank you!
Simone
Hi Michael,
Thank you very much for your reply. This is interesting, and I have got two more questions:
1) Regarding DESeq2 on MZ and DZ twins
Out of curiosity, if there were no singletons in my data but only MZ and DZ twins, what would be the correct way of modelling it with DESeq2?
Could I apply a strategy similar to what is described in section “Group-specific condition effects, individuals nested within groups” of the DESeq2 vignette, introducing a factor
zyg
with levels MZ and DZ? Or would that be incorrect? Also, if something like this was possible, couldn’t I add a third level forzyg
indicating the singletons (with all unique familyIDs within the group of singletons)?2) Regarding limma on MZ and DZ twins
Thank you also for pointing out that using limma will be better in the case of this data set. I have used limma and
duplicateCorrelation()
to build blocks of paired samples before. However, also here, I am not sure how I could distinguish between MZ and DZ twins correctly. Should I just build blocks of twins irrespective zygosity using the familyIDs? Wouldn't that introduce a bias? And I would probably still need an additional interaction term for the design?Sorry for the confusion, and again, thank you very much for your help.
Simone
If there were no singletons, you would control using DESeq2 the within-twin variation using the methods listed in that section, but there would be more modeling choices to make to determine the design. E.g. do you want separate disease effects for MZ and DZ? I can't say what the design would be without having a longer conversation, but I don't think this is the right analysis path anyway, because you can't add the singletons to a design that controls for within-twin differences and then make comparisons across disease vs healthy.
Regarding how to use limma with duplicateCorrelation and deal with MZ, DZ and singletons, I'd recommend making a new post and tagging 'limma' so the package authors can guide you to a solution.
Thank you, Michael. I understand.
However, would you be able to give a hint about where to find the section about modeling the different twin effects? When I search for the word "twin" in the DESeq2 vignette, I do not get any results, so I am not sure which section you refer to. I looked at the one updated in October 2017 available at http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html I would just like to read about it.
I have posted a new question on how to proceed with limma:
Limma: how to account for twin data (both MZ and DZ twins)?
By "that section" I meant the section you referred to above, "Group-specific condition effects, individuals nested within groups". But now that I re-read your question, I'm not sure this section is relevant. I'm not sure what the point of the MZ and DZ pairs are in the analysis. I actually think duplicateCorrelation() is the only way to go, regardless, because want to control for the correlation and compare across groups that have different sets of pairs (and singletons).