I am trying to perform some linear regressions with respect to some covariates in limma so that I can predict expression values after controling for 2 covariates, Afactor
and Efactor
. Basically, this is my sample setup:
Afactor Efactor
sample1 a1 e2
sample2 a0 e1
sample3 a0 e2
sample4 a1 e0
sample5 a1 e0
sample6 a1 e1
...
Here, Afactor
is a factor with 2 unique values a1
and a2
, and Efactor
is a factor with 3 unique values e1
, e2
, e3
. To fit it, I'm using limma as follows:
design=model.matrix('~0 + Afactor + Efactor', my_metadata)
fit= lmFit(my_data, design)
This allows me to find the fitting coefficients with fit$coefficients
, that resemble the following:
>head(fit$coefficients)
Afactora1 Afactora2 Efactore1 Efactore2
gene1 2.3 3.7 -1.3 0.9
gene2 1.1 2.1 -2.5 2.1
gene3 1.3 3.1 0.2 1.3
My questions:
- Note that there is no coefficient for
Efactore3
, but why is that so? - Does this have to do with the fact that possible factors for Efactor are
e1
,e2
ande3
? - Shouldn't it be possible to extract the coefficient for
e3
so that I could use each coefficient to estimate gene expression from the fitted model by imputing the values ofAfactor
andEfactor
for each sample?
Hi James, many thanks for your comments and help!
You are correct indeed. I had edited my post and changed the levels in
Afactor
toa0
anda1
, and updated the post to say (mistakenly)a1
anda2
. I should've hadAfactora0
andAfactora1
instead.That had been my interpretation of the coefficients actually: that this would be a fitting coefficient, such that when replacing some given factor levels in it (e.g. sample1
a1
e2
) I would be able to predict the gene expression value as given by the model, and that the coefficient value would represent the mean of all values in that group.The last bit of your sentence was actually why I thought of avoiding the intercept altogether, in order to facilitate the interpretation of the coefficients. Your suggestion of combining the factor levels is quite interesting (though it may pose a challenge if I include many covariates!). Am I correct in assuming the remaining steps in retrieving the adjusted gene expression is by replacing the sample values as I mentioned above?
I had googled a bit for this question and wasn't able to find much, though I'd think this is a common question (though I did find this). I do find many controlling for covariates when contrasting effects
It's not super clear to me exactly what you are after. Maybe you just want covariate-adjusted values? In which case
removeBatchEffect
will do that for you. But do note (as pointed out in the help page for that function) that it's not intended as something you do with your data prior to fitting your 'real' model, which would artificially increase your degrees of freedom.Many thanks James. My aim is exactly to adjust the values for these covariates. In this case I do not want to remove any batch effect (beyond the effect of the covariates), so would you encode all samples with the same batch id to enter in
batch1=some_batch
? In practice, is this the same as employing thelmFit
approach above and using the residuals of the model (residuals.MArrayLM
) for further analysis?I will accept your answer above, but feel free to edit it with any information if you feel it may be useful to others.
Once again many thanks!
Please read the help page for
removeBatchEffect
a bit more closely. There are more arguments than the batch arguments!And yes, it's just fitting the model using the covariates and returning the residuals.