Question

Design matrix with intercept for multiple cluster comparison

0

Entering edit mode

kentfung ▴ 20

@kentfung-17051

Last seen 5.0 years ago

I have posted a similar question before but it was poorly formulated so I'll try to make it better this time.

I have 258 samples and they are all settled in 26 different clusters, with some clusters with only 1 sample and some with over 40. The clusters are mutually exclusive. I will illustrate it with an example due to the large number of clusters and samples.

Let's say I have 12 samples in 4 blocks

> cluster <- as.factor(rep(c("A","B","C","D"), each = 3))
> design<-model.matrix(~cluster)

The design matrix will look like

> design
   (Intercept) clusterB clusterC clusterD
1            1        0        0        0
2            1        0        0        0
3            1        0        0        0
4            1        1        0        0
5            1        1        0        0
6            1        1        0        0
7            1        0        1        0
8            1        0        1        0
9            1        0        1        0
10           1        0        0        1
11           1        0        0        1
12           1        0        0        1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$cluster
[1] "contr.treatment"

If I want to compare A with B+C+D, do I just do this?

> makeContrasts("Intercept-(clusterB+clusterC+clusterD)/3",levels = design)
           Contrasts
Levels      Intercept-(clusterB+clusterC+clusterD)/3
  Intercept                                1.0000000
  clusterB                                -0.3333333
  clusterC                                -0.3333333
  clusterD                                -0.3333333

And for B vs A+C+D, do I just do this?

> makeContrasts("clusterB-(Intercept+clusterC+clusterD)/3",levels = design)
           Contrasts
Levels      clusterB-(Intercept+clusterC+clusterD)/3
  Intercept                               -0.3333333
  clusterB                                 1.0000000
  clusterC                                -0.3333333
  clusterD                                -0.3333333

I need to use the intercept because mod0 argument in svaseq(), if not taking the intercept as the null model, won't work, but this make it kind of complicated for myself.

limma sva • 1.2k views

ADD COMMENT • link updated 5.6 years ago by centralmenu10 • 0 • written 5.6 years ago by kentfung ▴ 20

score 3 · Accepted Answer · 2019-08-30

3

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

No, that's not how you do things.

First, about your terminology. In general, 'blocks' sounds like some sort of batch-effect like thing that one would normally want to account for, but that is uninteresting. If you mean 'groups' or 'treatments' or 'phenotypes' or some interesting thing that one might care to know something about, then you should call them what they are. If these are blocks of samples that were processed separately, or for which you expect there to be technical differences, then it's rather uncommon for people to be testing for differences (you usually control for those differences, rather than test for them). Aaaanyway,

If you have an intercept, and you are using the default treatment contrasts (which is what you are doing here), then the intercept is a 'baseline' group (in your case clusterA), and all the other coefficients are estimates of the difference between a given cluster and the baseline. So clusterB is actually clusterB - clusterA, and inherently is a contrast.

So it makes no sense to do what you are doing. For example, your first contrast is actually clusterA - (clusterB - clusterA + clusterC - clusterA + clusterD - clusterA)/3, which is definitely something, but probably not what you intend.

You are correct that svaseq needs a model with an intercept, but that has nothing to do with anything but running svaseq. The requirement for an intercept doesn't extend to the modeling step. So you can run svaseq with an intercept, and then add those surrogate variables to your design matrix, and then use a design matrix without an intercept to make any comparison you desire using makeContrasts.

ADD COMMENT • link 5.6 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks for correcting the terminology. I know very little about bioinformatics or statistics as you can tell.

So it will be like, carry on from the example:

> dat0 = as.matrix(raw_data$counts)  # this is the actual count matrix
> mod1 = design
> mod0 = cbind(mod1[,1])
> set.seed(12345)
> svseq = svaseq(dat0,mod1,mod0)
Number of significant surrogate variables is:  16 
Iteration (out of 5 ):1  2  3  4  5  

> new_design <- model.matrix(~0+cluster)
> new_design <- cbind(new_design,svseq$sv)
> voom_exprs = voom(raw_data, new_design, normalize="none", plot = TRUE)

And the contrast matrix will be like

> makeContrasts("clusterA-(clusterB+clusterC+clusterD)/3",levels = new_design)

Is this correct?

ADD REPLY • link 5.6 years ago kentfung ▴ 20

0

Entering edit mode

Assuming that you want to compare clusterA with the mean of clustersB-D, yes.

ADD REPLY • link 5.6 years ago James W. MacDonald 68k

0

Entering edit mode

Hope this thread can still find you James. I have just realised that there's one more questions I have. The size of the cluster Is not even, which means it will be like:

> cluster <- as.factor(rep(c("A","B","C","D"), c(3,4,5,3)))
> cluster
 [1] A A A B B B B C C C C C D D D
Levels: A B C D
> design<-model.matrix(~0+cluster)
> design
   clusterA clusterB clusterC clusterD
1         1        0        0        0
2         1        0        0        0
3         1        0        0        0
4         0        1        0        0
5         0        1        0        0
6         0        1        0        0
7         0        1        0        0
8         0        0        1        0
9         0        0        1        0
10        0        0        1        0
11        0        0        1        0
12        0        0        1        0
13        0        0        0        1
14        0        0        0        1
15        0        0        0        1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$cluster
[1] "contr.treatment"

If I am to do a B vs the entire population (the idea is what B is compare to the overall mean), should I use weighted mean in my contrast matrix instead? LIke:

> makeContrasts("clusterB-(3*clusterA+5*clusterC+3*clusterD)/(3+5+3)",levels = design)
          Contrasts
Levels     clusterB-(3*clusterA+5*clusterC+3*clusterD)/(3+5+3)
  clusterA                                          -0.2727273
  clusterB                                           1.0000000
  clusterC                                          -0.4545455
  clusterD                                          -0.2727273

ADD REPLY • link 5.6 years ago kentfung ▴ 20

1

Entering edit mode

Probably not. While you could make the argument that the mean of the groups with smaller numbers of observations are less reliable, that's not a compelling argument. The mean is a pretty efficient statistic, so that is less of a concern than the variance, which is less efficient.

In the context of an ANOVA, where you are using a variance estimate based on (basically) the mean of the within-group variances, having widely different group sizes can be problematic because the smaller groups provide less reliable estimates, but the computation doesn't take that into account. So you would be 'fixing' the part that probably isn't in need of being fixed, and ignoring the part that is more likely to need help.

And if this were a thing that people did, you can rest assured that there would be something in the limma User's Guide detailing how it's done. But there are no such instructions, because it isn't a thing.

ADD REPLY • link 5.6 years ago James W. MacDonald 68k