Question

Suggestion for next revision of edgeR User Guide re Design matrix setup

0

Entering edit mode

pdunn • 0

@pdunn-23624

Last seen 4.9 years ago

Hi I would like to suggest that the authors of the edgeR User Guide include some more explicit information in their next revision about how to set up (or fix) the design matrix for more complex analyses. The issue is that you will get a "Design matrix is not full rank" error when you do an analysis with interactions (like section 3.5 "Comparisons both between and within subjects" in the User Guide) and you have unequal sample sizes in your groups. The example in section 3.5 has equal sample sizes in the different groups of patients, so there is no error.

After a few days of trial and error and mostly futile searches of the web, I found the simple answer on SeqAnswers here. Simply manually edit the design matrix so it has no columns that are all zeros. In retrospect, I should have figured this out sooner, but there are a lot of questions about this on the web, and not a lot of clear answers. It would be easy to just add a few sentences about this to the User Guide.

Thanks, Peter

edger error • 1.7k views

ADD COMMENT • link updated 4.9 years ago by Gordon Smyth 52k • written 4.9 years ago by pdunn • 0

0

Entering edit mode

I admit that it is annoying to find out the right answer in the internet. I already faced the same error with limma, but found quite quickly the answer. EdgeR uses the same logic as limma for specifying the design matrix, so it's a good idea to include limma in your research. This forum is open to everyone who already searched a solution on his/her own. Don't waste your time, post sooner next time. Best.

ADD REPLY • link 4.9 years ago SamGG ▴ 360

0

Entering edit mode

The error you experience is most likely due to linear dependencies that you created with your design, not due to odd numbers in sample sizes. I suggest reading https://support.bioconductor.org/p/68092/ as a starting point. If you post your group information and which comparisons you aim to make then people can help you putting together a design that is probably much simpler than a complicated interaction model, e.g. a factorial design while still giving the desired results.

ADD REPLY • link 4.9 years ago ATpoint ★ 4.8k

0

Entering edit mode

Indeed, the design formula via the SeqAnswers link is overly complex and will result in problems irrespective of the program used:

~ status + status:animal + status:tempo

ADD REPLY • link 4.9 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

I think the OP is talking about this situation, which does indeed result in (in this case) a single column with all zeros.

> d.f <- data.frame(Disease = factor(rep(c("Healthy","Disease"), c(6,4))), Patient = factor(rep(c(1:3,1:2), each = 2)), Treatment = factor(rep(c("None","Hormone"), 5)))
> d.f
   Disease Patient Treatment
1  Healthy       1      None
2  Healthy       1   Hormone
3  Healthy       2      None
4  Healthy       2   Hormone
5  Healthy       3      None
6  Healthy       3   Hormone
7  Disease       1      None
8  Disease       1   Hormone
9  Disease       2      None
10 Disease       2   Hormone
> model.matrix(~Disease+Disease:Patient+Disease:Treatment, d.f)
   (Intercept) DiseaseHealthy DiseaseDisease:Patient2 DiseaseHealthy:Patient2
1            1              1                       0                       0
2            1              1                       0                       0
3            1              1                       0                       1
4            1              1                       0                       1
5            1              1                       0                       0
6            1              1                       0                       0
7            1              0                       0                       0
8            1              0                       0                       0
9            1              0                       1                       0
10           1              0                       1                       0
   DiseaseDisease:Patient3 DiseaseHealthy:Patient3 DiseaseDisease:TreatmentNone
1                        0                       0                            0
2                        0                       0                            0
3                        0                       0                            0
4                        0                       0                            0
5                        0                       1                            0
6                        0                       1                            0
7                        0                       0                            1
8                        0                       0                            0
9                        0                       0                            1
10                       0                       0                            0
   DiseaseHealthy:TreatmentNone
1                             1
2                             0
3                             1
4                             0
5                             1
6                             0
7                             0
8                             0
9                             0
10                            0
attr(,"assign")
[1] 0 1 2 2 2 2 3 3
attr(,"contrasts")
attr(,"contrasts")$Disease
[1] "contr.treatment"

attr(,"contrasts")$Patient
[1] "contr.treatment"

attr(,"contrasts")$Treatment
[1] "contr.treatment"

ADD REPLY • link 4.9 years ago James W. MacDonald 68k

score 3 · Answer 1 · 2020-05-31

I suppose a small section could be added describing common mistakes, but to me, this is what the support site is for. I would go so far as to say that the real problem is that, somehow, the Bioconductor Support pages have disappeared from Google's top hits and also that the search interface on this website is rubbish.

Moving on:

Simply manually edit the design matrix so it has no columns that are all zeros.

Were it so easy.

Arbitrarily removing columns from a design matrix is fraught with danger. The remaining columns may or may not retain their original scientific interpretation, and so it is essential that you understand how your design matrix relates to your experimental factors. If you lack this understanding, I'm afraid that the entire DE analysis is going to be very tough for you: but hey, if everyone could do complex analyses without asking for help, we wouldn't need staff statisticians.

For the sake of completeness and to stop people from mentioning this again, there actually is a way of converting a design matrix to full rank via the QR decomposition. However, this is not a useful solution as columns are more-or-less arbitrarily discarded and so you still have to trawl through the design matrix to figure out the meaning of the remaining coefficients.

score 3 · Answer 2 · 2020-05-31

Dear Peter,

Thanks for using edgeR and for reading the User's Guide. We are planning a major revision of the User's Guide this year and, as part of that, we will expand Section 3.5 to cover the case where the two top-level groups have different numbers of subjects.

My preference for unbalanced multilevel models is not to fit the interaction model recommended in Section 3.5 but instead to build the design matrix up in a way that redundant columns are avoided in the first place. I have given advice about how to do this on this forum in the past but the relevant posts are hard to find given that search engine support for this forum is so poor.

It is true, in the context of Section 3.5, that you probably can safely remove the design columns that are all zero. You should appreciate though that this is a very special phenomenon that is specific to this particular experimental design and to the nested factorial models suggested in that section of the User's Guide. In the vast majority of cases where people get the Design matrix is not full rank error, they have fitted an overparametrized model that is inappropriate for their experiment. Removing columns allows edgeR to run but yields results that might not be meaningful.

In equivalent situations, the limma package automatically removes design columns to make the design of full rank and issues a warning that it has done so. We could have taken the same approach in edgeR, but we decided to make it an error in order to force users to rethink their model. It's not clear which is the better approach. In my own work, I always make sure the design matrix is of full rank in the first place and do not rely on automatic fixes that might lead to unexpected effects.

A rewrite of Section 3.5

Here is a reanalysis of example in Section 3.5 of the User's Guide. This new approach is equivalent to that in the User's when the disease groups are balanced but will work regardless of the number of patients in each disease group.

First, setup the relevant factors. Note that in this approach we do not need to renumber the patients:

Patient <- factor(rep(1:9,each=2))
Treatment <- rep(c("None","Hormone"),9)
Disease <- rep(c("Healthy","Disease1","Disease2"),each=6)

Assemble the design matrix:

design <- model.matrix(~Patient)
Healthy.Hormone <- Disease=="Healthy" & Treatment=="Hormone"
Disease1.Hormone <- Disease=="Disease1" & Treatment=="Hormone"
Disease2.Hormone <- Disease=="Disease2" & Treatment=="Hormone"
design <- cbind(design, Healthy.Hormone, Disease1.Hormone, Disease2.Hormone)

After estimating the dispersions (code not shown), we can fit a linear model:

fit <- glmQLFit(y, design)

To find genes responding to the hormone in Healthy patients:

qlf <- glmQLFTest(fit, coef="Healthy.Hormone")
topTags(qlf)

To find genes responding to the hormone in Disease1 patients:

qlf <- glmQLFTest(fit, coef="Disease1.Hormone")
topTags(qlf)

To find genes responding to the hormone in Disease2 patients:

qlf <- glmQLFTest(fit, coef="Disease2.Hormone")
topTags(qlf)

To find genes that respond to the hormone in any disease group:

qlf <- glmQLFTest(fit, coef=10:12)
topTags(qlf)

To find genes that respond differently to the hormone in Disease1 vs Healthy patients:

qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,-1,1,0))
topTags(qlf)

To find genes that respond differently to the hormone in Disease2 vs Healthy patients:

qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,-1,0,1))
topTags(qlf)

To find genes that respond differently to the hormone in Disease2 vs Disease1 patients:

qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,-1,1))
topTags(qlf)