DESeq2 model design for paired samples with multiple factors
1
1
Entering edit mode
swillis4 ▴ 10
@swillis4-18928
Last seen 5.9 years ago

I am to the point where the more research I do on paired sample design, the more confused I become, so hopefully I can get some clarification on our design.

I am trying to specify the appropriate model in DESeq2 for our RNA-seq data. We have 8 pairs of corals that each came from 3 reefs (1, 3, and 4 pairs per reef), and each one in the pair was either susceptible or resistant to bleaching, and for each pair we have 3 time points, pre-, mid-, and post thermal stress. Each sample was sequenced on 3 of 6 lanes of sequencing (3 technical replicates).

We are trying to identify which genes are DE in bleached (sus.) and unbleached (res.) samples at each time point and which genes change across time and which do not. We are not interested in pair-specific expression; in fact I would like to factor OUT pairs by comparing primarily within pairs and then among pairs to eliminate expression induced by the microhabitat in which each pair resides. My model matrix looks like this:

sample    lane    reef    individual    pair    timepoint    bleaching
109A    1    C    109    109-110    pre    bleach
109B    1    C    109    109-110    mid    bleach
109C    1    C    109    109-110    post    bleach
110A    1    C    110    109-110    pre    unbleached
110B    1    C    110    109-110    mid    unbleached
110C    1    C    110    109-110    post    unbleached
121A    1    C    121    121-122    pre    bleach
121B    1    C    121    121-122    mid    bleach
121C    1    C    121    121-122    post    bleach
122A    1    C    122    121-122    pre    unbleached
122B    1    C    122    121-122    mid    unbleached
122C    1    C    122    121-122    post    unbleached
123A    1    C    123    123-124    pre    bleach
123B    1    C    123    123-124    mid    bleach
123C    1    C    123    123-124    post    bleach
124A    1    C    124    123-124    pre    unbleached
124B    1    C    124    123-124    mid    unbleached
124C    1    C    124    123-124    post    unbleached
etc.

So, if I ignore reef since it's convoluted with pair, I get a model design that's potentially:

ddsTXI <- DESeqDataSetFromTximport(txiB, sampleTable, ~ pair + timepoint + bleaching + timepoint:bleaching)

and collapse technical replicates (by $sample, although I'm still skeptical there's not a better way to use this info), and make sure the unbleached and pre-bleaching samples are reference

ddsColl<-collapseReplicates(ddsTXI,meta$sample,meta$quant)
ddsColl$bleaching <- relevel(ddsColl$bleaching, ref = "unbleached")
ddsColl$timepoint <- relevel(ddsColl$timepoint, ref = "pre")

I get

ddsColl <- DESeq(ddsColl)
matrix(resultsNames(ddsColl))

      [,1]                            
 [1,] "Intercept"                     
 [2,] "pair_121.122_vs_109.110"       
 [3,] "pair_123.124_vs_109.110"       
 [4,] "pair_43.44_vs_109.110"         
 [5,] "pair_53.54_vs_109.110"         
 [6,] "pair_65.66_vs_109.110"         
 [7,] "pair_71.72_vs_109.110"         
 [8,] "pair_83.84_vs_109.110"         
 [9,] "timepoint_mid_vs_pre"          
[10,] "timepoint_post_vs_pre"         
[11,] "bleaching_bleach_vs_unbleached"
[12,] "timepointmid.bleachingbleach"  
[13,] "timepointpost.bleachingbleach" 

but there is no true reference for pair; the important comparison is WITHIN pair. If I run this through 

rld <- rlog(ddsColl, blind=FALSE)

and then do a PCA, I see that one sample dominates the first axis (20% var), which just happens to be the first sample in the list (109A), which I have no a priori reason to expect is strongly different. This makes me suspect I've specified the model incorrectly to accommodate the variance effectively. Any suggestions? Thanks!

Stuart

deseq2 rnaseq model design • 1.8k views
ADD COMMENT
0
Entering edit mode

Thanks for the quick response, Michael. I can understand that. I suppose I am most concerned if I have specified the model correctly to extract what I am interested in. And more importantly, if the PCA results suggest that the model is not sufficiently accommodating some variance inherent to individual or pair. Thanks for your insight!

ADD REPLY
0
Entering edit mode

It looks correct to me.

Regarding your PCA question, see the FAQ in the vignette, we have one on this.

ADD REPLY
0
Entering edit mode

Thanks for your previous help, Michael.  For whatever reason, the varianceStabilizingTransformation provided a much more informative transformation for the PCA than rlog (to my eye, at least). It revealed that the variation within individuals is much greater than we would have expected, dwarfing any effect of pairing. I am hoping to get your take on factoring out this individual-level variation.

When I try to include "individual" as such in the model, the model is not full rank, which I presume is because individual is a linear combination of other factors. After looking at the vignette, it seems one potential way to manage this is to add a nesting factor into the model (indeed, as I understand it, the nesting factor itself already exists as "pair"). So the model now looks like:

ddsTXI <- DESeqDataSetFromTximport(txiB, sampleTable, ~ pair + pair:bleaching + timepoint + bleaching + timepoint:bleaching)

where "pair:bleaching" is the attempt to control for individual variation using pair as a decoy. Indeed this model produces many more DE intervals (genes) than without "pair:bleaching", and having "pair" alone seems to make next-to-no difference. Does this model appear kosher to you?

Also, can you confirm my understanding of extracting particular results, please? As I understand, specifying

res <- results(ddsColl, contrast=c("bleaching","bleach","unbleached"))

should give me the effect of "bleaching" in the "pre" time period, where "pre" is the reference, while

res2 <- results(ddsColl, contrast=list( c("bleaching_bleach_vs_unbleached","bleachingbleach.timepointmid") ))

should give me the effect of "bleaching" in the "mid" time period (and similar for "post"), with the c() being necessary here.

But for time-period effects between non-reference conditions (mid vs post) I need:

res9 <-results(ddsColl, contrast=list("timepoint_mid_vs_pre", "timepoint_post_vs_pre"))

and for the difference in interaction with bleaching:

res6 <-results(ddsColl, contrast=list("timepointmid.bleachingbleach", "timepointpost.bleachingbleach"))

Finally, it turns out that we may be able to get qPCR data for the density of symbionts in each individual at each time period, which would provide a continuous variable that would eliminate "bleaching" and hopefully provide for more powerful correlation with expression. Do you have a sense of pitfalls in going this route rather than with our binary factor? Thanks again!

 

ADD REPLY
0
Entering edit mode

I’m out of the office for winter break, but will reply when I’m back.

ADD REPLY
0
Entering edit mode

In the vignette, the example is to do ~group + group:individual + group:treatment. 

I would recommend to use ~bleaching + bleaching:pair + bleaching:timepoint for your dataset. This will give time-point specific differences relative to the first time point for both levels of "bleaching". This is easier to interpret than the design with a main effect "time point", for users who aren't that comfortable with when to add main effects and interactions.

ADD REPLY
0
Entering edit mode

Thanks. I'll give that model a try.

Do you have any advice about whether to try to fit the model with quantitative symbiont data rather than binning as "bleached/unbleached"?

ADD REPLY
0
Entering edit mode

I’m not sure / no specific advice

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 21 hours ago
United States

It sounds like you are concerned with the pair coefficients, and the fact that one of the pairs is the reference. The coding of these makes no difference to the coefficients you care about (the main effect coefficients 9-11 and the interaction coefficients 12-13). If you had ~batch + condition with three batches, you would have coefficients batch_2_vs_1 and batch_3_vs_1, and it makes no difference that batch 1 is the reference, the condition effects would be the same if batch 2 or 3 were the reference.

ADD COMMENT

Login before adding your answer.

Traffic: 581 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6