I am analysing an RRBS experiment where my conditions are a mixture of three compounds. These compounds sum to one, so A + B + C = 1, and my conditions are various percentages of A, B and C.
The statisticians who usually analyze clinical data using these models have been recommended to use models from the mixexp
package, specifically the linear and quadratic models.
This means I want to compare three models:
model_null <- ~ 1
model_linear <- ~ 0 + A + B + C
model_quadratic <- ~ 0 + A + B + C + A:B + A:C + B:C
The problem is that testing the coefficients, as recommended multiple times, e.g. here, here and here is not really possible.
I suppose that for comparing the quadratic to the linear model, I can really just look at whether the interaction terms are significant, but for comparing the linear to the null model is not possible by testing whether A, B and C are significant since there is no intercept.
As I understand it, this model better handles to covariance between the three compounds than a regular ~A + B
model would.
My goal is to select sites where the mixture affects the methylation level, and then afterwards work with the estimated coefficients, so I don't need to be able to test the significance of individual compounds. I really only want to detect if the site is affected at all.
Is there some way to select significant sites using such a model? Maybe fitting a model of the form ~A + B
, selecting sites that are significant and then estimating the coefficients using a ~0 + A + B + C
model?
I realize that there might not be a best model for all sites, but I was hoping to select one model that fit most sites and use that for all sites.
Thank you for the quick reply. I took the naming from the
mixexp
publication, but never mind that.The three models cannot be exactly the same as evidenced in the code below:
The first two are identical, but only the last model is of full rank.
My apologies. I had actually assumed from your use of
:
that A, B and C were defined as factors. If A, B and C had been factors then the three models would be the same.Even in your case, the three models are the same once the non-estimable columns are removed from mod1 and mod2. That would be done automatically by limma but isn't done automatically by edgeR.
I'm going to edit my answer to give different advice that doesn't involve a non-full rank matrix.
I have also consulted Scheffe (1958) on mixture models and I see now that your
model_quadratic
is indeed a quadratic model when A, B and C add to unity. The linearly dependency between A, B and C makes the extra terms that usually appear in a quadratic formula redundant.Thank you for the update, it helped to clarify it for me. Would the following procedure then be how to select an appropriate model:
mod1 <- ~ 1 + B + C + A:B + A:C + B:C
and testglmLRT(coef = c("A:B", "A:C", "B:C"))
.mod2 <- ~ 0 + A + B + C + A:B + A:C + B:C
and extract the coefficients whereglmLRT(coef = c("B", "C", "A:B", "A:C", "B:C"))
in mod1 are significant.mod1
, fit the modelsmod3 <- ~A + B
andmod4 <- ~ 0 + A + B + C
and extract the coefficients inmod4
whereglmLRT(coef = c("B", "C"))
is significant inmod3
.I just realized that I can fit the model
~ 0 + A + B + C + A:B + A:C + B:C
and if the columns A:B, A:C and B:C are significant, use the estimated coefficients directly. If they are not I can fit the~ A + B
and~ 0 + A + B + C
models and, since they are identical, use~ A + B
to select significant sites. I am marking this answer as accepted. Thank you for your time.