Question

pairwise combining factors or multifactor analysis ? (DESeq2)

2

Entering edit mode

aziza ▴ 20

@aziza-10616

Last seen 8.8 years ago

I have a dataset composed of 2 ecotypes, 2 conditions (control, treatment) and 2 geographic units (24 samples in total). I want to look at which genes are differentially expressed:
1. - between ecotype A and B at control,
2. - between ecotype A and B at treatment

3. - between control and treatment for ecotype A,
4. - between control and treatment for ecotype B or
5- equally I would like to look at different combinations including the geographic unit, between north and south for ecotype A, etc.

and I would even like to raise one level of complexity and include the three factors, for example, difference between north and south at treatment conditions for ecotype A (etc).

Following the DESeq2 manual, section 3.3 (Interactions), my solution is to define a factor as a combination of the factors I want to measure for example:
condition=rep(rep(c("A.ctl","A.treat","B.ctl","B.treat"),each=3),2)

and then use just condition in the design formula:
cond <- DESeqDataSetFromMatrix(countData = countData,colData =colData, design = ~ condition)

and apply the contrast formula to look at two comparisons like so:
condP <- DESeq(cond)
resMTC <- results(condP, contrast = c("condition", "A.treat", "A.ctl"))

but someone told me to do this instead:
resMTC <- results(condP, contrast=c(1, -1, -1, 1))
Is this correct? If so why?

I would also like to know if in this case the multifactor analysis approach is more suitable. If I understand correctly, with the multifactor analysis including for example ecotype+condition, I will get a list of genes that are differentially expressed where the effect of the 2 ecotypes is controlled by the effect of the treatment. So I can extract the up and down regulated genes for the ecotype A and B, but I won't be able to know which genes are up regulated in A and at treatment conditions, whereas with the pairwise approach I would. Is this correct?

I was also suggested that if I do the multifactor analysis I have to use ecotype+condition+ecotype:condition for the full model and then ecotype:condition for the reduced but again wouldn't this analysis only yield the list of DEG between ecotypes A and B with no information about how the treatment/control conditions affect each gene? Otherwise how would I extract that information from the multifactor list.

These are my questions:

- Are ecotypes A and B responding different to the treatment (or to the control) conditions? (Q 1 and 2)

- What genes are DE between control and treatment for each ecotype? (Q 3 and 4)

- Does ecotype A in location 1 respond different to the treatment than the same ecotype in location 2? (Q 5)

Thanks

deseq2 multifactor pairwise • 3.9k views

ADD COMMENT • link 8.9 years ago aziza ▴ 20

score 0 · Answer 1 · 2016-05-02

"but someone told me to do this instead:
resMTC <- results(condP, contrast=c(1, -1, -1, 1))
Is this correct? If so why?"

If you want to simply compare treated and control for ecotype A, the recommendation in the vignette and your code is correct. That numeric contrast doesn't make sense for that comparison.

If you want to make pairwise comparisons including geographic unit, you can add that in when you build the "condition" variable, and continue to use ~condition.

You questions at the top are simply pairwise comparisons of groups, for which it's simplest to create the "condition" variable as you have. Some of your questions at the bottom are for testing differences in the treatment effect across ecotype, in other words, testing an interaction, as described in the vignette. You would use an interaction term for those. If you require additional help with setting up the design for those questions, extracting results and interpreting them, I'd suggest you collaborate with a statistician.

score 0 · Answer 2 · 2016-05-02

0

Entering edit mode

aziza ▴ 20

@aziza-10616

Last seen 8.8 years ago

But according to the manual "Many users begin to add interaction terms to the design formula, when in fact a much simpler approach would give all the results tables that are desired. [...] combine the factors of interest into a single factor with all combinations of the original factors", according to this, adding an interaction term may not be necessary if we use the combination of factors, when is it then determined that an interaction term may or may not be used? I gathered from the manual that I could answer all the question bellow with this approach of 'combination of factors into one'

ADD COMMENT • link 8.9 years ago aziza ▴ 20

0

Entering edit mode

Sorry, then you are not gathering the correct view.

My response above, and the part of the vignette above and below the part you quoted explain when an interaction term is useful.

From the vignette:

"Interaction terms can be added to the design formula, in order to test, for example, if the log2 fold change attributable to a given condition is different based on another factor, for example if the condition effect differs across genotype. Many users begin to add interaction terms..."

And above:

"Some of your questions at the bottom are for testing differences in the treatment effect across ecotype, in other words, testing an interaction..."

Beyond this, you need to consult a local statistician or a person with a quantitative background. There is nothing special about DESeq2 here. You just need someone to explain to you when interaction terms are used in linear models, and to help you formulate your questions quantitatively in a linear model.

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

Hi , Thanks for your guiding, I consulted a statistician and was advised to use a multifactor analysis with all three factors and all possible interactions. He suggested to then remove in a step-wise fashion the insignificant coefficients. He has no experience with RNA-Seq data so now that I came to check the pvalues of the coefficients after running the multifactor test I realise each gene has a different estimate for the coefficient... I guess it is then better to try removing an interaction term from the full model one at a time and compare the fit of the full vs the reduced model, sequencially keeping always the one that performs better. My question is how can you do this is DESEq2?

Thanks

ADD REPLY • link 8.9 years ago aziza ▴ 20

1

Entering edit mode

Great. What aov() does in base R (and so what you are seeing when you use summary.aov) is a series of model comparisons, where terms are sequentially removed. You can do this with DESeq2 by repeatedly running DESeq() with test="LRT" and removing terms sequentially from full and reduced.

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

I was editting my question as you responded. sorry because now it looks messy. Thanks again

ADD REPLY • link 8.9 years ago aziza ▴ 20

0

Entering edit mode

Sorry, what I was asking before is how can I use the step function with DESeq2 results, meaning that I need to include a model name and when I use dds as an object in doesn't work.

step(FullModel,direction="backward",test="F")
Error in terms.default(object) : no terms component nor attribute

ADD REPLY • link 8.9 years ago aziza ▴ 20

1

Entering edit mode

You have to make the comparisons explicitly and manually with DESeq2. It's up to you to specify full and reduced (which can be formula or design matrices), which will give you results similar to a single row in a summary.aov table (for each gene).

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

I post below for ease of reading

ADD REPLY • link 8.9 years ago aziza ▴ 20

score 0 · Answer 3 · 2016-06-01

I am using the LRT function to test if the triple interaction term is significant in the model, so the full model includes the triple interaction whereas the reduced model doesn't. My assumption is that when padj < 0.05, the interaction term is significant (one rejects the null hypothesis that there are no differences between the full and reduced model), so with summary(res) I check the % of genes with significant pvalues and if I get a small percentage of genes I conclude that the interaction term was not significant for a high percentage of genes and I should therefore favour the reduced model over the full.

design(ddsL) <-  formula(~ genetic_unit + condition + ecotype + genetic_unit:ecotype + genetic_unit:condition + ecotype:condition + genetic_unit:ecotype:condition) 
ddsL <- DESeq(ddsL)
resL <- results(ddsL)
summary(resL, alpha = 0.05)

# Compare full to reduced model
ddsML=ddsL
ddsML = estimateSizeFactors(ddsL)
ddsML = estimateDispersions(ddsL)

o=nbinomLRT(ddsML, full = design(ddsL), reduced =  formula(~ genetic_unit + condition + ecotype + genetic_unit:ecotype + genetic_unit:condition + ecotype:condition), maxit = 1000)
o=results(o)
summary(o, alpha = 0.05)

out of 28842 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 193, 0.67%
LFC < 0 (down)   : 40, 0.14%
outliers [1]     : 81, 0.28%
low counts [2]   : 3913, 14%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

So from this example in only 0.81% of the genes the interaction term is significant. Am I interpreting correctly this test?

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] genefilter_1.52.1          RColorBrewer_1.1-2         gplots_3.0.1               limma_3.26.9              
 [5] DESeq2_1.10.1              RcppArmadillo_0.6.700.6.0  Rcpp_0.12.5                SummarizedExperiment_1.0.2
 [9] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3         IRanges_2.4.8              S4Vectors_0.8.11          
[13] DESeq_1.22.1               lattice_0.20-33            locfit_1.5-9.1             Biobase_2.30.0            
[17] BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] futile.logger_1.4.1  plyr_1.8.3           XVector_0.10.0       bitops_1.0-6         futile.options_1.0.0 tools_3.2.5         
 [7] zlibbioc_1.16.0      rpart_4.1-10         RSQLite_1.0.0        annotate_1.48.0      gtable_0.2.0         Matrix_1.2-6        
[13] DBI_0.4-1            gridExtra_2.2.1      cluster_2.0.4        caTools_1.17.1       gtools_3.5.0         grid_3.2.5          
[19] nnet_7.3-12          data.table_1.9.6     AnnotationDbi_1.32.3 XML_3.98-1.4         survival_2.39-4      BiocParallel_1.4.3  
[25] foreign_0.8-66       gdata_2.17.0         latticeExtra_0.6-28  Formula_1.2-1        geneplotter_1.48.0   ggplot2_2.1.0       
[31] lambda.r_1.1.7       Hmisc_3.17-4         scales_0.4.0         splines_3.2.5        xtable_1.8-2         colorspace_1.2-6    
[37] KernSmooth_2.23-15   acepack_1.3-3.3      munsell_0.4.3        chron_2.3-47