Continuous variable interaction with Groups in design matrix
1
0
Entering edit mode
jhj89 ▴ 10
@jhj89-9623
Last seen 7.6 years ago

Hi,

I am dealing with continuous covariate in my design matrix as well. I am using edgeR, I have created a following design matrix:

design1 <- model.matrix(~groups+groups:age)

where there are four different groups. 

So, colname(design1) would give:

(Intercept) group2 group3 group4 group1:age group2:age group3:age group4:age

If I were to do a comparison such as: lrt <- glmLRT(fit, coef=2), what am I comparing? Am I merely comparing group1 and group2? From what reading through forum questions/answers, it seems like this comparison is affected by the age covariate included in the design matrix.

But if I were to create a design matrix: design2 <- model.matrix(~group), and perform the same comparison between group1 and group2, would that give a same result?

Also, using design1 again, if I were to compare group2:age with intercept (coef=6), would that be testing for age effect on group2? And comparing group1:age and group2:age would be testing for the differences in age effect between group1 and group2?

Lastly, if I were to include another continuous variable(s) as covariates into the design matrix such that:

design3 <- model.matrix(~groups+groups:age+continuous1+continuous2)

in order to adjust for the covariates continuous1 and continuous2, other than being adjusted for them, would they affect how the above comparisons turn out.

My apologies for asking a lot of questions. I am quite confused by the usage of design matrix, especially with continuous variables interaction. If there is any confusion regarding the questions I ask, please tell me. Thank you.

edger design matrix • 3.8k views
ADD COMMENT
6
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 7 hours ago
The city by the bay

Let's have a look at what design1 is actually doing. For each gene in each group, imagine fitting a line to the log-expression values against the age for all samples of that group. The gradient of this line for group X is the groupX:age term, while the intercept of this line is (Intercept) for group 1 and (Intercept) + groupX for every other group. So, if you drop the second coefficient, you'll be testing for the same intercept between groups 1 and 2, i.e., your null hypothesis is that groups 1 and 2 have the same expression at an age of zero. This doesn't really seem very sensible to me - babies aside, would the behaviour of zero-age samples be particularly interesting? The more relevant comparisons would be:

con1 <- makeContrasts(group1_age - group2_age, levels=design1)
con2 <- makeContrasts(group1_age - group2_age, group2, levels=design1)

... assuming that you've replaced the colons with underscores in colnames(design1). The first contrast tests for a differential response to age between groups 1 and 2, while the second contrast tests for any differences in expression between groups 1 and 2 (as the first contrast will not reject the null hypothesis if the gradient is the same between groups).

Now, for design2, the interpretation of the coefficients is different. The (Intercept) will represent the average log-expression in group 1, while each groupX coefficient will represent the log-fold change of group X over group 1. Dropping coef=2 will do a straight-up, no-frills test for any DE between groups 1 and 2. This is unlikely to give the same result as coef=2 with design1; for starters, you're not accounting for age, which means that there's no comparable concept of "age of zero". The most similar contrast would probably be con2, but that is still likely to be different because your variances will change with the different design matrices.

Moving on; if you drop coef=6, then yes, you will be testing for the age effect in group 2. Similarly, the comparison between group1:age and group2:age (as done in con1) will test for a differential age response between groups.

Finally, it's hard to say how the addition of continuous variables will affect the results in design3. In general, I'd expect a decrease in the sample variance as you're increasing the complexity of the model to account for more sources of variability. However, this will come at the cost of a decrease in power, as you're using up more information to estimate the corresponding coefficients. Whether this is offset by the smaller variance depends on your data, so if you want to know the effect, you'll have to try it with and without the covariates. The more concerning issue is if either of the continuous covariates are confounded with age; if so, then this will also compromise detection of the age response, as any genuine increase in expression due to age can be absorbed into the covariate's coefficient and lost.

ADD COMMENT
0
Entering edit mode

Thank you Aaron for your reply. I think I better understand how the design matrix works now.

I do agree that given the continuous covariate is age, it wouldn't make sense to compare the intercept between group1 and group2. However, this was merely an example - the continuous covariate is glucose levels in patients and I want to find out the difference in effect of glucose levels between group1 and group 2 (for instance) - as you mentioned, this will be comparing group1:age to group2:age

I have tested the design matrix(s) and it looks like the addition of continuous variables severely diminishes the effect of glucose on each group. Using design3, I have obtained only 10 DE genes when coef=6 was done (group2:glucose against intercept) but the number of DE genes increased to ~300 when design1 was used with coef=6 (group2:glucose against intercept) again. I wonder what this means in terms of interaction between continuous covariates and glucose?

 

ADD REPLY
0
Entering edit mode

I'll guess that one or both of the continuous covariates are quite strongly correlated with glucose levels, at least for group 2. Try plotting the former against the latter and see if any relationship shows up between them.

ADD REPLY
0
Entering edit mode

Thank you! I will try that.

ADD REPLY
0
Entering edit mode

Hi Aaron, 

My apologies for trying to revive an old thread but I have a question related to what I asked above.

You mentioned that the decrease in the number of DE genes with the addition of continuous covariates may be due to it being strongly correlated with glucose levels. Since then, I have plotted the glucose levels against the continuous covariates but found no strong correlation. If this is the case, what does the decrease in DE genes imply?

Alternatively, if the addition of continuous covariates does nothing (no change in DE genes), then I would omit that continuous covariate from the model, and if it increase the number of DE genes, I would include it. Is that a right step for analysis?

It's just that I have multiple covariates that I want to adjust for, and I'm not entirely sure what is optimal.

Thank you again for your help. 

ADD REPLY
0
Entering edit mode

As for the correlation between glucose and the continuous covariates - there's two covariates, so you'll get a decrease if there's a correlation between glucose and any linear combination of those covariates, even if there is no strong correlation between glucose and each individual covariate. Try including each covariate separately into the design matrix and dropping coef=6, and see if you get fewer DE genes. Other than that, the only thing I can think of as that you're losing power from the loss in residual d.f. when you include two extra terms. This should only be relevant if your residual d.f. is already low, i.e., the number of parameters is close to the number of samples.

The "optimal" model is the simplest one that allows you to make useful inferences. Include all the factors that are important to interpretation of the experiment. If the dispersion estimates are low, and if you get a decent number of DE genes, then that model is probably good enough. If you've missed an important systematic effect, this will manifest as an inflated dispersion and fewer DE genes, so you can try adding a blocking factor to account for it; if it gives you lower estimates and more DE genes, then it improves the fit and usefulness of the model, so you should keep it in. If not, then leave it out.

ADD REPLY
0
Entering edit mode

Thank you Aaron,

Pardon my lack of knowledge but would adding a blocking factor be the same as adding a covariate (additive model)? If I were to adjust for gender differences, how would considering it as a blocking factor be different from adding it as a covariate? 

Also, I'm facing a slight conundrum where I'm comparing between four groups (two at a time) but depending on what continuous covariate I'm adding, the number of DE genes can change, and the change is different between different comparisons. To be more specific, addition of the second covariate instead of first covariate leads to increase in the number of DE genes for Group A vs Group B, but decrease in the number of DE genes for Group B vs Group C. Would I have to decide on one model to fit every comparison, but can I make a change in the model depending on the comparison?

 

ADD REPLY
1
Entering edit mode

A factor would be a categorical variable, whereas a covariate would be a continuous variable. Sex is either male or female, hence a factor - other variables like age, height, etc. are continuous, hence they are covariates. It doesn't make sense to consider sex as a covariate, unless you have a good quantitative reason to describe an individual as being, e.g., half-male or half-female.

From the sounds of it, you're getting DE genes in each comparison, regardless of the model. Given that they're both working, I wouldn't bother fitting a different model for each comparison. Otherwise, it would become tricky to interpret, especially if you wanted to make a statement about the relative amount of DE in A vs B compared to B vs C. As to which model to use, ask yourself whether the covariates are biologically important, or important to the interpretation of the experiment. If they're not, then leave them out, because they don't seem to provide any extra benefit with respect to more DE genes or lower dispersion estimates.

ADD REPLY
0
Entering edit mode

Got it. I was confused with the terms used. 

And yes, single model sounds like a good idea for what I'm trying to achieve. Thank you again.

ADD REPLY
0
Entering edit mode

Hi Aaron,

I am sorry to revive an old thread but I have a question related to con1 and con2. I did not get the difference between the two  comparisons. coef 1 of con1 and con2 should output the same results, am I wrong? 

Keifa

ADD REPLY
0
Entering edit mode

I have no idea of what you actually did. Please post a new question with your sample metadata and code.

ADD REPLY

Login before adding your answer.

Traffic: 984 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6