I have posted a similar question before but it was poorly formulated so I'll try to make it better this time.
I have 258 samples and they are all settled in 26 different clusters, with some clusters with only 1 sample and some with over 40. The clusters are mutually exclusive. I will illustrate it with an example due to the large number of clusters and samples.
Let's say I have 12 samples in 4 blocks
> cluster <- as.factor(rep(c("A","B","C","D"), each = 3))
> design<-model.matrix(~cluster)
The design matrix will look like
> design
(Intercept) clusterB clusterC clusterD
1 1 0 0 0
2 1 0 0 0
3 1 0 0 0
4 1 1 0 0
5 1 1 0 0
6 1 1 0 0
7 1 0 1 0
8 1 0 1 0
9 1 0 1 0
10 1 0 0 1
11 1 0 0 1
12 1 0 0 1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$cluster
[1] "contr.treatment"
If I want to compare A with B+C+D, do I just do this?
> makeContrasts("Intercept-(clusterB+clusterC+clusterD)/3",levels = design)
Contrasts
Levels Intercept-(clusterB+clusterC+clusterD)/3
Intercept 1.0000000
clusterB -0.3333333
clusterC -0.3333333
clusterD -0.3333333
And for B vs A+C+D, do I just do this?
> makeContrasts("clusterB-(Intercept+clusterC+clusterD)/3",levels = design)
Contrasts
Levels clusterB-(Intercept+clusterC+clusterD)/3
Intercept -0.3333333
clusterB 1.0000000
clusterC -0.3333333
clusterD -0.3333333
I need to use the intercept because mod0
argument in svaseq()
, if not taking the intercept as the null model, won't work, but this make it kind of complicated for myself.
Thanks for correcting the terminology. I know very little about bioinformatics or statistics as you can tell.
So it will be like, carry on from the example:
And the contrast matrix will be like
Is this correct?
Assuming that you want to compare clusterA with the mean of clustersB-D, yes.
Hope this thread can still find you James. I have just realised that there's one more questions I have. The size of the cluster Is not even, which means it will be like:
If I am to do a B vs the entire population (the idea is what B is compare to the overall mean), should I use weighted mean in my contrast matrix instead? LIke:
Probably not. While you could make the argument that the mean of the groups with smaller numbers of observations are less reliable, that's not a compelling argument. The mean is a pretty efficient statistic, so that is less of a concern than the variance, which is less efficient.
In the context of an ANOVA, where you are using a variance estimate based on (basically) the mean of the within-group variances, having widely different group sizes can be problematic because the smaller groups provide less reliable estimates, but the computation doesn't take that into account. So you would be 'fixing' the part that probably isn't in need of being fixed, and ignoring the part that is more likely to need help.
And if this were a thing that people did, you can rest assured that there would be something in the limma User's Guide detailing how it's done. But there are no such instructions, because it isn't a thing.