I have microarray data for which the columns represent the disease condition status of each individual.
I plan to fit a linear model and then make a contrast amongst the different columns using limma. My question is: how do I make the most informative comparison when contrasting all of the columns against all of the other disease stasuses...
Other<-ifelse(Information_file$Diagnostic.category=="Other diagnoses", 1,0)
TB<-ifelse(Information_file$Diagnostic.category=="TB", 1,0)
So as you can see above I am trying to test for a significant difference between TB and all other diagnoses and I need to eliminate any confounding to see if there are any differentially expressed genes that are genuinely due to TB and not just one of the other diagnoses....
The first attempt that I made at this was using the following design matrix:
> head(design)
TB Intercept Other Sex Age
[1,] 1 1 0 1 155
[2,] 1 1 0 2 16
[3,] 1 1 0 1 22
[4,] 1 1 0 1 114
[5,] 1 1 0 2 56
[6,] 1 1 0 2 47
And the following code produced a nice plot:
fit<-lmFit(E.ncRNA1, design)
contrast.matrix <- makeContrasts(TB, levels=design)
fit <- contrasts.fit(fit, contrast.matrix)
fit <- eBayes(fit)
volcanoplot(fit, highlight=10, main="TB vs Everything")
However my supervisor says that this is not statistically robust as:
- including an intercept over-specifies the model
- he'd like to know if there's a non-zero difference between the TB and Other columns i.e. specifying 1 for the TB covariate and -1 for the OD covariate
Hence, lets say my new design matrix is designed as follows:
> head(design)
TB Other Sex Age
[1,] 1 0 1 155
[2,] 1 0 2 16
[3,] 1 0 1 22
[4,] 1 0 1 114
[5,] 0 -1 2 56
[6,] 0 -1 2 47
And yet the exclusion of an Intercept gives rise to plots that look like this:
So overall my questions are as follows how do I make the most valuable comparison to test for a non-zero difference between TB and other diseases without including an intercept?
If I'm getting you right, and you are just trying to test for TB effects, and control for age and sex (assuming they are not confounded with each other), then assuming that:
clinical1$Diagnostic.category
are "Other diagnosis" and "TB"; andage.breaks
(like 'young' and 'old', or 'young', 'medium', 'old', etc.)I'd construct your design like so:
You can then test against the second coefficient (probably names something like
statusTB
, right?)If you want to further split your "other" diagnostic category into more specific diseases (assuming you have enough degrees of freedom and your ages and sex span the disease states) you can do similar moves.