Question

Check understanding of linear model

1

Entering edit mode

Jake ▴ 90

@jake-7236

Last seen 2.4 years ago

United States

I just wanted to check my understanding of the linear model used to build the design matrix for DESeq2 and was wondering if someone could tell me if all of this is correct. I have made a schematic of my understanding and what I think is going on below.

design ~ cell_type + treatment + cell_type:treatment
Line 1: intercept of model (baseline expression)
Line 2: effect of treatment in cell type A
Line 3: intrinsic difference in expression in cell type B vs. A
Line 4: effect of treatment in cell type B
Line 5: Interaction term (Line 4 - Line 2) is the difference in effectiveness of treatment in cell type B vs cell type A

So the linear model will be expression = beta1 + beta2*cell_type + beta3*treatment + beta4*interaction where cell_type, treatment, and interaction can take either 0 or 1.

If we first focus on cell type A, cell_type=0, interaction=0. For A control: expression = beta1 + beta2*0 + beta3*0 + beta4*0 = beta1 so in this case line 1. For A treatment: expression = beta1 + beta2*0 + beta3*1 + beta4*0 = beta1 + beta3 so in this case line 1 + line 2.

Looking at cell type B, cell_type=1. For B control: expression = beta1 + beta2*1 + beta3*0 + beta4*0 = beta1 + beta2 so in this case line 1 + line 3. For B treatment: expression = beta1 + beta2*1 + beta3*1 + beta4*1 = beta1 + beta2 + beta3 + beta4 so in this case line 1 + line 3 + line 4 (which is composed of line 2 + the extra interaction effect).

In what instances would you not include an interaction term and just have design ~ cell_type + treatment? When you think the treatment has the same effect in each cell type or when you don't care about cell type specific effects and are just interested in does treatment have any effect vs control?

Thanks

linear model deseq2 • 4.0k views

ADD COMMENT • link updated 9.7 years ago by James W. MacDonald 67k • written 9.7 years ago by Jake ▴ 90

0

Entering edit mode

hi Jake,

James has good answers below. Just a note that I'd recommend using DESeq(dds, betaPrior=FALSE) for designs with interactions, so you get the standard terms as you would in a linear model with interactions. I'm in the process of having DESeq() by default turn off the log fold change prior for designs with interactions, to make things easier.

ADD REPLY • link 9.7 years ago Michael Love 43k

0

Entering edit mode

Hi,

Can you expand on this briefly? I believe this is the relevant section from the help file: "When a log2 fold change prior is used (betaPrior=TRUE), then nbinomWaldTest will by default use expanded model matrices, as described in the modelMatrixType argument, unless this argument is used to override the default behavior or unless the design contains 2 level factors and an interaction term. This ensures that log2 fold changes will be independent of the choice of reference level. In this case, the beta prior variance for each factor is calculated as the average of the mean squared maximum likelihood estimates for each level and every possible contrast. The results function without any arguments will automatically perform a contrast of the last level of the last variable in the design formula over the first level."

So what exactly does setting betaPrior to false do and why is that a good idea for interaction terms?

ADD REPLY • link 9.7 years ago Jake ▴ 90

0

Entering edit mode

Sure, betaPrior=FALSE means that we do not put a prior on log2 fold changes in the model. For more background on the log2 fold change prior, you quickly scan the relevant section of our paper. But the important thing to know is that the inference (the results tables) are accurate with betaPrior=TRUE or betaPrior=FALSE. It's just an additional option which makes the log2 fold change estimates more directly interpretable. With betaPrior=FALSE, the log2 fold change for a simple comparison is simply the average of normalized counts in group B / average of normalized counts in group A.

When we do use a log2 fold change prior (so if betaPrior=TRUE), we need to make sure that the shrinkage effect is symmetric relative to the "reference level" of the factors. To accomplish this even when there are many factor levels, we implemented expanded model matrices, which include terms for all the levels as well as the intercept. We also discuss this in the Methods of the paper. But this means that the terms of an interaction design are slightly different than what you would get with model.matrix. I'm planning to just set betaPrior to FALSE by default if the design contains interaction terms, because users seem to have enough difficulty as it is interpreting the standard terms of an experimental design with interactions.

ADD REPLY • link 9.7 years ago Michael Love 43k

score 2 · Answer 1 · 2015-05-08

1.) Yes

2.) Yes

3.) If you adjust the line, then yes. Ideally the line would extend from the mean of the A samples (e.g., the mean of ALL the A samples), to the mean of the B samples. So it would start at about 2 on the vertical axis, and extend to about 9.

4.) Yes

5.) Yes

I didn't go through all your algebra. I assume it's correct, as you got the lines right, more or less.

In conventional statistics, where you are fitting at most a handful of models, you can pick and choose which model gets an interaction (depending of course on whether or not the interaction is significant). In genomics where you are in essence fitting models wholesale, you don't have this luxury, and in general have to specify a single model for all the genes, which may not be ideal for some of them.

But do note that in the presence of an interaction the main effects are uninterpretable. So for any gene that has a significant interaction term, you cannot simultaneously report a significant main effect (for either cell type or treatment), because the main effects in that instance cannot be interpreted.

To see why this so, consider your plot. You have a considerable interaction here, so you say that the treatment has a much larger effect on cell B than cell A. The main effect for treatment can be interpreted as the change in expression for a gene due to treatment after controlling for cell type. But you cannot control for cell type, because you have already shown that the treatment effect depends on cell type.

Does that make sense?