Question

Nested design vs averaging coefficients

i.sudbery ▴ 40

I am performing a differential expression analysis using DESeq2 (it happens to be on ATAC counts, but I assume that shouldn't matter).

My experimental design has several variables, but to get to the bottom of the design I will concentrate on two: disease vs non-disease, and subtype. Subtype information exists only for the disease samples, not for the non-disease samples. There are 27 normals and 5 disease samples, and the disease samples have three of one subtype and one of the other.

I could think of two ways to identify disease-relevant genes (i.e. relevant to subtype A or B). Instead of showing 33 samples, I'll just show a minimal example of the same thing.

First, averaging over the two subtypes:

design = ~ 0 + disease_and_subtype
  subtypeA subtypeB subtypenormal
1        1        0             0
2        1        0             0
3        0        1             0
4        0        1             0
5        0        0             1
6        0        0             1

and testing the contrast contrast = list(c("subtypenormal"), c("subtypeA", "subtypeB")), listValues = c(1, -1/2)
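Written out as a formula, this contrast can be sketched as follows. The symbols \beta_A, \beta_B and \beta_{normal} are notation introduced here for illustration (they are assumed to stand for the fitted log2-scale coefficients of the three design-matrix columns, not names that DESeq2 prints). With listValues = c(1, -1/2), the first list element is weighted by 1 and the second by -1/2:

```latex
\[
\mathrm{LFC}_{\text{contrast}}
  = \beta_{\text{normal}} - \tfrac{1}{2}\left(\beta_{A} + \beta_{B}\right)
\]
```

Under this reading, a positive value means higher signal in normal than in the average of the two disease subtypes.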

The second alternative is to nest subtype within disease (and remove the empty matrix columns):

design = ~disease + disease:subtype
  (Intercept) diseaseTRUE  diseaseTRUE:subtypeB
1           1           1                    0
2           1           1                    0
3           1           1                    1
4           1           1                    1
5           1           0                    0
6           1           0                    0

and testing the coefficient diseaseTRUE.

To my mind these are equivalent. But the first method gives 25,000 significant regions, while the second gives 11.

Clearly I am misunderstanding something about these designs, and I'd be grateful if someone could point out what. I guess the advice might be just to forget the subtype and test the disease state irrespective of it, but I'd still like to understand what is going on.

updated 5.4 years ago by Michael Love 43k • written 5.4 years ago by i.sudbery ▴ 40

score 0 · Answer 1 · 2019-11-29

Michael Love 43k

The first design is testing whether the average over disease subtypes is different than normal.

In the second design, the diseaseTRUE coefficient just compares the reference level of disease (subtypeA in the example) to normal. This is due to the way interactions work when there is a main effect in the formula as well. We have a diagram in the vignette.
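A short worked derivation may make this concrete. It is only a sketch: the symbols \mu_{normal}, \mu_A, \mu_B (group means on the log2 scale) and \beta_0, \beta_{disease}, \beta_{disease:B} (the coefficients for the columns (Intercept), diseaseTRUE and diseaseTRUE:subtypeB) are notation assumed here, read off the rows of the second design matrix in the question:

```latex
\begin{align*}
\mu_{\text{normal}} &= \beta_0 \\
\mu_{A} &= \beta_0 + \beta_{\text{disease}} \\
\mu_{B} &= \beta_0 + \beta_{\text{disease}} + \beta_{\text{disease}:B} \\[4pt]
\Rightarrow\quad \beta_{\text{disease}} &= \mu_{A} - \mu_{\text{normal}},
\qquad \beta_{\text{disease}:B} = \mu_{B} - \mu_{A} \\[4pt]
\tfrac{1}{2}\left(\mu_{A} + \mu_{B}\right) - \mu_{\text{normal}}
  &= \beta_{\text{disease}} + \tfrac{1}{2}\,\beta_{\text{disease}:B}
\end{align*}
```

So, under these assumptions, the averaged disease-vs-normal comparison in the second design corresponds to the weights (0, 1, 1/2) over the three columns, not to diseaseTRUE alone. Its sign is also flipped relative to the first design's contrast, which puts normal in the numerator.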

It seems like your null hypothesis is that all subtypes are similar to normal? You could do an LRT comparing your second design to ~1.

written 5.4 years ago by Michael Love 43k

I guess what I'm thinking is that there will be some effects that are subtype-specific and some that are general to the disease, and we want to isolate the disease-general effects. By accounting for the subtype effect, I thought we might reduce an unwanted source of variance. I think testing against ~1 would also find things where either subtypeA or subtypeB differed from normal or from each other, so you'd get the subtype-specific effects rather than the disease-general ones.
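This concern can be stated as hypotheses. The sketch below assumes notation not used elsewhere in the thread: \mu_{normal}, \mu_A and \mu_B are the log2-scale group means, and \beta_{disease}, \beta_{disease:B} are the coefficients for diseaseTRUE and diseaseTRUE:subtypeB. An LRT of ~disease + disease:subtype against ~1 removes both coefficients at once, so its null is:

```latex
\begin{align*}
H_0 &: \beta_{\text{disease}} = 0 \ \text{and}\ \beta_{\text{disease}:B} = 0
      \iff \mu_{A} = \mu_{B} = \mu_{\text{normal}} \\
H_1 &: \text{at least one of } \mu_{A},\ \mu_{B} \text{ differs from } \mu_{\text{normal}}
      \text{ or from the other}
\end{align*}
```

Under this reading, a region where only one subtype differs from normal, or where the two subtypes differ from each other, can be called significant even if there is no effect shared across the disease.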