Nested design vs averaging coefficients
i.sudbery ▴ 40
@isudbery-8266

I am performing a differential expression analysis (it happens to be on ATAC counts, but I think that shouldn't matter?) using DESeq2.

My experimental design has several experimental variables, but to get to the bottom of the design I will concentrate on two: disease vs non-disease, and subtype. Subtype information exists only for the disease samples, not for the non-disease ones. There are 27 normals and 5 disease samples, and the disease samples have three of one subtype and one of the other.

I could think of two ways to identify disease (i.e. sub-type A or B) relevant genes. Instead of showing 33 samples, I'll just show a minimal example of the same thing.

First, averaging over the two subtypes:

design =    ~ 0 + disease_and_subtype
  subtypeA subtypeB subtypenormal
1        1        0             0
2        1        0             0
3        0        1             0
4        0        1             0
5        0        0             1
6        0        0             1

and testing the contrast given by contrast = list(c("subtypenormal"), c("subtypeA", "subtypeB")), listValues = c(1, -1/2)
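For readers who want to reproduce the toy matrix above, here is a minimal base R sketch. The data frame `samples` and the column name `subtype` are assumptions chosen so that the column names match the printout; they are not taken from the real data set. The numeric vector at the end is the same contrast written out coefficient by coefficient (normal minus the mean of A and B).

```r
# Toy layout from the post: 2 subtype A, 2 subtype B, 2 normal samples.
# `samples` and `subtype` are illustrative names, not from the real data.
samples <- data.frame(
  subtype = factor(c("A", "A", "B", "B", "normal", "normal"))
)

# No-intercept design: one indicator column per group.
mm_means <- model.matrix(~ 0 + subtype, data = samples)
print(mm_means)

# Numeric form of the contrast: normal minus the mean of A and B.
contrast_vec <- c(subtypeA = -1/2, subtypeB = -1/2, subtypenormal = 1)
contrast_vec <- contrast_vec[colnames(mm_means)]  # align with column order
print(contrast_vec)
```

The weights sum to zero, so the contrast compares the normal group with the unweighted average of the two subtypes, regardless of how many samples each subtype has.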

The second alternative is to nest subtype within disease (and remove the empty matrix columns):

design = ~disease + disease:subtype
  (Intercept) diseaseTRUE  diseaseTRUE:subtypeB
1           1           1                    0
2           1           1                    0
3           1           1                    1
4           1           1                    1
5           1           0                    0
6           1           0                    0

and testing the coefficient diseaseTRUE.
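A minimal base R sketch of how this nested matrix can be built for the toy example. Two things here are assumptions rather than details from the post: the normal samples are assigned to the reference subtype level "A" (they have no real subtype), and the names `samples`, `mm_full` and `mm_nested` are illustrative.

```r
# Toy layout: 4 disease samples (2 of subtype A, 2 of subtype B), 2 normals.
samples <- data.frame(
  disease = factor(c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE"),
                   levels = c("FALSE", "TRUE")),
  # Assumption: normals get the reference level "A" as a placeholder.
  subtype = factor(c("A", "A", "B", "B", "A", "A"))
)

# The full matrix contains a column for subtype B within non-disease,
# which is all zeros because no normal sample has subtype B.
mm_full <- model.matrix(~ disease + disease:subtype, data = samples)
print(colnames(mm_full))

# Remove the empty (all-zero) columns, as described in the post.
mm_nested <- mm_full[, colSums(mm_full != 0) > 0, drop = FALSE]
print(mm_nested)
```

In this matrix the subtype A disease samples have a 1 only in the intercept and `diseaseTRUE` columns, which is a hint about what `diseaseTRUE` actually measures.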

To my mind these are equivalent. But the first method gives 25,000 significant regions, while the second gives 11.

Clearly I am misunderstanding something about these designs, and I'd be grateful if someone could point out what. I guess the advice might be just to forget the subtype, and test the disease state irrespective, but I'd still like to understand what is going on.

@mikelove

The first design is testing whether the average over disease subtypes is different than normal.

The second design is just comparing the reference level of disease to normal. This is due to the way interactions work when there is a main effect in the formula as well. We have a diagram in the vignette.
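The difference between the two tests can be checked numerically on the six-sample toy layout. The sketch below uses made-up, noise-free group means on the log scale (A = 1, B = 3, normal = 0, purely hypothetical) and plain least squares via `qr.solve` rather than DESeq2 itself, so it only illustrates what each quantity estimates, not the significance testing.

```r
# Hypothetical noise-free group means (log scale); not real data.
mu <- c(A = 1, B = 3, normal = 0)
group <- c("A", "A", "B", "B", "normal", "normal")
y <- unname(mu[group])

# Design 1: one column per group (~ 0 + subtype).
X1 <- cbind(subtypeA      = c(1, 1, 0, 0, 0, 0),
            subtypeB      = c(0, 0, 1, 1, 0, 0),
            subtypenormal = c(0, 0, 0, 0, 1, 1))
b1 <- setNames(qr.solve(X1, y), colnames(X1))
# Contrast from the post: normal minus the mean of A and B.
avg_contrast <- sum(c(-1/2, -1/2, 1) * b1)

# Design 2: nested matrix with the empty column removed.
X2 <- cbind("(Intercept)"          = 1,
            diseaseTRUE            = c(1, 1, 1, 1, 0, 0),
            "diseaseTRUE:subtypeB" = c(0, 0, 1, 1, 0, 0))
b2 <- setNames(qr.solve(X2, y), colnames(X2))

print(avg_contrast)        # normal - (A + B) / 2
print(b2["diseaseTRUE"])   # A - normal only
```

With these means the averaged contrast is 0 - (1 + 3)/2 = -2, while `diseaseTRUE` is 1 - 0 = 1: it sees only subtype A, the reference level, and the B-specific difference is absorbed by `diseaseTRUE:subtypeB`. Note also that the contrast as written is oriented normal minus disease, the opposite direction to `diseaseTRUE`.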

It seems like your null hypothesis is that all subtypes are similar to normal? You could do an LRT comparing your second design to ~1.


I guess what I'm thinking is that there will be some effects that are subtype-specific, and some that are general to the disease, and we want to isolate the disease-general effects. By accounting for the subtype effect, I thought we might reduce an unwanted source of variance. I think testing against ~1 would also find things where either subtype A or subtype B differed from normal or from each other, so you'd get the subtype-specific effects rather than the disease-general ones.


Sounds like you want the first design then.
