Hello!
Long time lurker on Bioconductor but it is finally time to make an account and ask a question. I have experience using DESeq2/edgeR/limma in the past and have always found it relatively straight forward to account for variables that are not necessarily of interest - i.e. technical effects. An example of this would be controlling for a mouse cage in an RNA-seq experiment like so:
design <- model.matrix(~treatment + cage)
fit <- lm(~expr + treatment + cage)
In my specific example I wish to control for cycle. So setting up my design as so:
design = model.matrix(~ response + cycle)
head(design)
(Intercept) B C1LATE C2EARLY C3EARLY C4EARLY C5EARLY SCREEN
1 1 1 0 0 1 0 0 0
2 1 1 0 0 0 0 0 1
3 1 1 0 0 0 0 0 0
4 1 1 0 1 0 0 0 0
5 1 1 0 0 0 1 0 0
6 1 1 0 0 0 0 0 0
I then effectively wish to run a linear model similar to:
fit <- lm(~expr + response + cycle)
However, I'm unsure how to set this up with contrasts in nanostringdiff. My best guess would be to weight the design as so, however I'm not sure if this is accounting for any of the cycle variables since they are given a value of 0 for contrasts.
contrast <- c(-1,1,0,0,0,0,0,0)
result <- glm.LRT(NanoStringData = cur, design.full = design, contrast = contrast)
Any resources or guidance would be greatly appreciated! Ryan
@james-w-macdonald-5106 and @gordon-smyth thank you very much for your prompt replies!
To follow up: From the nanostringdiff documentation, it seems that using this model
I can effectively control for cycle using either
or
As both of these methods will test whether the 2nd column is equal to zero in my model. My remaining questions are:
Again, thank you very much for your help.
Ryan
You are confusing linear regression and ANOVA. In linear regression, you have a continuous covariate and in that context the intercept is, as you mention, the value at which the continuous covariate is equal to zero. And in that context if you set the intercept to zero, you are forcing the intercept to be zero.
In the context of an ANOVA, and in particular in R, if you have an intercept then you are fitting a model using a parameterization in which the intercept estimates the mean of one group, and all the other coefficients for that factor are then differences between the given factor level and the baseline. If you fit a model without an intercept, then you are fitting a model that simply estimates the mean of each group. So if you fit a model using the former parameterization you don't necessarily need to compute a contrast, because all of the coefficients except the intercept are already comparisons. But if you use the latter model then you do need a contrast because the mean of any one group is uninteresting.
If you add in an additional factor, then you are 'controlling for changes in that factor', which in R means that the intercept is now estimating the mean of a baseline group that includes the 'cycle' factor as well. You can tell which one it is because it won't be a column name for your design. As an example,
In that model, the baseline is control, cycle 1. And the second coefficient estimates the difference between treatment and control for cycle 1. And all the other cycle coefficients estimate the difference between cycle N and cycle 1 (so cyclecycle2 is estimating cycle 2 - cycle 1). This implies that you can capture the difference between cycles using the mean (e.g., the variability within each group is comparable), If that assumption holds, then the second coefficient estimates the difference between treatment and control after adjusting for cycle differences.
Hi James,
Your point about the linear model vs ANOVA is well taken and your explanation of the ANOVA method estimation using the difference between group means make sense. The only thing I'm wondering is whether your points about the ANOVA is applicable to the model that I'm fitting using glm.LRT() which fits a generalized linear model of the negative binomial family. Wouldn't my points about forcing coefficients be applicable since I'm using linear regression?
Also - I briefly looked at your bio and saw you were a UM fan - any mixed feelings with Saturday's game vs UW?
A GLM is just an extension of classical linear regression, and if you use factor levels for your predictors it's analogous to ANOVA. The interpretation of the coefficients doesn't change, it's just that the null distribution changes from a normal to (in this case) a negative binomial, so the inference can't be based on a normal or t-distribution. And a likelihood ratio test tells you if the model with the coefficient of interest fits the data better than a model without that coefficient (which is equivalent to testing if that coefficient is plausibly zero). You could use a Wald test instead (which is what DESeq2 uses), and that looks exactly like a t-statistic, which we already know tests for evidence that the coefficient is plausibly zero. So from a user's perspective you can just think about what you are doing in the context of an ANOVA.
I do have mixed feelings about the game. I've had season tickets for years now, and just can't bring myself to go sit with a bunch of people (for whom the naive expectation in MI is 49% unvaccinated) who are bellowing at the top of their lungs for ~3 hours straight. Sounds like a super-spreader event to me. And also Giles Jackson went to UW via the transfer portal, and I don't really like rooting against UM players, even those who transferred. I work for UW, but my fan loyalty is to UM, so GO BLUE!
Err. The null distribution used is a Chi-square. It's the distribution of the residuals that are assumed to be negative binomial...