I am to the point where the more research I do on paired sample design, the more confused I become, so hopefully I can get some clarification on our design.
I am trying to specify the appropriate model in DESeq2 for our RNA-seq data. We have 8 pairs of corals that each came from 3 reefs (1, 3, and 4 pairs per reef), and each one in the pair was either susceptible or resistant to bleaching, and for each pair we have 3 time points, pre-, mid-, and post thermal stress. Each sample was sequenced on 3 of 6 lanes of sequencing (3 technical replicates).
We are trying to identify which genes are DE in bleached (sus.) and unbleached (res.) samples at each time point and which genes change across time and which do not. We are not interested in pair-specific expression; in fact I would like to factor OUT pairs by comparing primarily within pairs and then among pairs to eliminate expression induced by the microhabitat in which each pair resides. My model matrix looks like this:
sample lane reef individual pair timepoint bleaching
109A 1 C 109 109-110 pre bleach
109B 1 C 109 109-110 mid bleach
109C 1 C 109 109-110 post bleach
110A 1 C 110 109-110 pre unbleached
110B 1 C 110 109-110 mid unbleached
110C 1 C 110 109-110 post unbleached
121A 1 C 121 121-122 pre bleach
121B 1 C 121 121-122 mid bleach
121C 1 C 121 121-122 post bleach
122A 1 C 122 121-122 pre unbleached
122B 1 C 122 121-122 mid unbleached
122C 1 C 122 121-122 post unbleached
123A 1 C 123 123-124 pre bleach
123B 1 C 123 123-124 mid bleach
123C 1 C 123 123-124 post bleach
124A 1 C 124 123-124 pre unbleached
124B 1 C 124 123-124 mid unbleached
124C 1 C 124 123-124 post unbleached
etc.
So, if I ignore reef since it's convoluted with pair, I get a model design that's potentially:
ddsTXI <- DESeqDataSetFromTximport(txiB, sampleTable, ~ pair + timepoint + bleaching + timepoint:bleaching)
and collapse technical replicates (by $sample, although I'm still skeptical there's not a better way to use this info), and make sure the unbleached and pre-bleaching samples are reference
ddsColl<-collapseReplicates(ddsTXI,meta$sample,meta$quant) ddsColl$bleaching <- relevel(ddsColl$bleaching, ref = "unbleached") ddsColl$timepoint <- relevel(ddsColl$timepoint, ref = "pre")
I get
ddsColl <- DESeq(ddsColl) matrix(resultsNames(ddsColl))
[,1]
[1,] "Intercept"
[2,] "pair_121.122_vs_109.110"
[3,] "pair_123.124_vs_109.110"
[4,] "pair_43.44_vs_109.110"
[5,] "pair_53.54_vs_109.110"
[6,] "pair_65.66_vs_109.110"
[7,] "pair_71.72_vs_109.110"
[8,] "pair_83.84_vs_109.110"
[9,] "timepoint_mid_vs_pre"
[10,] "timepoint_post_vs_pre"
[11,] "bleaching_bleach_vs_unbleached"
[12,] "timepointmid.bleachingbleach"
[13,] "timepointpost.bleachingbleach"
but there is no true reference for pair; the important comparison is WITHIN pair. If I run this through
rld <- rlog(ddsColl, blind=FALSE)
and then do a PCA, I see that one sample dominates the first axis (20% var), which just happens to be the first sample in the list (109A), which I have no a priori reason to expect is strongly different. This makes me suspect I've specified the model incorrectly to accommodate the variance effectively. Any suggestions? Thanks!
Stuart
Thanks for the quick response, Michael. I can understand that. I suppose I am most concerned if I have specified the model correctly to extract what I am interested in. And more importantly, if the PCA results suggest that the model is not sufficiently accommodating some variance inherent to individual or pair. Thanks for your insight!
It looks correct to me.
Regarding your PCA question, see the FAQ in the vignette, we have one on this.
Thanks for your previous help, Michael. For whatever reason, the varianceStabilizingTransformation provided a much more informative transformation for the PCA than rlog (to my eye, at least). It revealed that the variation within individuals is much greater than we would have expected, dwarfing any effect of pairing. I am hoping to get your take on factoring out this individual-level variation.
When I try to include "individual" as such in the model, the model is not full rank, which I presume is because individual is a linear combination of other factors. After looking at the vignette, it seems one potential way to manage this is to add a nesting factor into the model (indeed, as I understand it, the nesting factor itself already exists as "pair"). So the model now looks like:
where "pair:bleaching" is the attempt to control for individual variation using pair as a decoy. Indeed this model produces many more DE intervals (genes) than without "pair:bleaching", and having "pair" alone seems to make next-to-no difference. Does this model appear kosher to you?
Also, can you confirm my understanding of extracting particular results, please? As I understand, specifying
should give me the effect of "bleaching" in the "pre" time period, where "pre" is the reference, while
should give me the effect of "bleaching" in the "mid" time period (and similar for "post"), with the c() being necessary here.
But for time-period effects between non-reference conditions (mid vs post) I need:
and for the difference in interaction with bleaching:
Finally, it turns out that we may be able to get qPCR data for the density of symbionts in each individual at each time period, which would provide a continuous variable that would eliminate "bleaching" and hopefully provide for more powerful correlation with expression. Do you have a sense of pitfalls in going this route rather than with our binary factor? Thanks again!
I’m out of the office for winter break, but will reply when I’m back.
In the vignette, the example is to do ~group + group:individual + group:treatment.
I would recommend to use ~bleaching + bleaching:pair + bleaching:timepoint for your dataset. This will give time-point specific differences relative to the first time point for both levels of "bleaching". This is easier to interpret than the design with a main effect "time point", for users who aren't that comfortable with when to add main effects and interactions.
Thanks. I'll give that model a try.
Do you have any advice about whether to try to fit the model with quantitative symbiont data rather than binning as "bleached/unbleached"?
I’m not sure / no specific advice