Question

Phenotype effect - building a contrast matrix - grouping vs interaction

0

Entering edit mode

Jonathan ▴ 10

@c31cf0e5

Last seen 11 weeks ago

United States

Hi, I have an RNASeq experiment with control and cohort/diseased samples, whereas the cohort samples can be lesional or non-lesional. The cohort itself can have a unique phenotype/characteristic (which does not appear in controls). Thus I have three main research questions:

What is the (age&sex adjusted) effect of the disease: Lesional vs. Non-Lesional, Lesional vs. Control, Non-Lesional vs. Control.
What is the (age&sex adjusted) effect of the disease in a sub-group analysis: Lesional/Phenotype+ vs Non-Lesional/Phenotype-, and Lesional/Phenotype- vs Non-Lesional/Phenotype-.
What is the (age&sex adjusted) effect of the phenotype in both cohort groups: Lesional/Phenotype+ vs Lesional/Phenotype-, and Non-Lesional/Phenotype+ vs Non-Lesional/Phenotype-.

Following the user-guide, I ended up wondering which approach is better for my design/contrast matrices, and more importantly, what are their differences.

For the following code vignette, Src is the sample origin in which CN is controls, NL is non-lesional, LS is lesional; Pheno ("Y"/"N") is whether cohort samples have the phenotype or not; I use "B" for phenotype-agonistic comparisons.

Approach #1 - Grouping samples type ("Src") and phenotype ("Pheno"):

pData %>% mutate(grp = paste0(Src, "_", Pheno)) %>% (... converting to a factor ...) # will create the groups: CN_N, NL_N, LS_N, NL_Y, LS_Y
design<-model.matrix("~0 + grp + age + sex", data=DGE.cpm$samples) # the design table will have columns as groups above, plus age and sex columns.
contrasts_table = makeContrasts(
     LS_B.vs.CN = (grpLS_Y + grpLS_N) / 2 - grpCN,
     LS_B.vs.NL_B = (grpLS_Y + grpLS_N) / 2 - (grpNL_Y + grpNL_N) / 2, 
     LS_Y.vs.NL_Y = grpLS_Y - grpNL_Y,
     LS_N.vs.NL_N = grpLS_N - grpNL_N,
     LS_Y.vs.LS_N = grpLS_Y - grpLS_N, 
     NL_Y.vs.NL_N = grpNL_Y - grpNL_N, 
     DiffOfDiff = (grpLS_Y - grpLS_N) - (grpNL_Y - grpNL_N)
)  # ... also, including other relevant comparisons such NL_B.vs.CN, NL_Y.vs.NL_N, LS_N.vs.CN, NL_N.vs.CN

Approach #2 - Interaction between samples type ("Src") and phenotype ("Pheno"):

design<-model.matrix("~0 + Src + Src:Pheno + age + sex", data=DGE.cpm$samples)
design<-design %>% as.data.frame() %>% dplyr::select(-SrcCN:PhenoY)   # as control group ("CN") does not have a phenotype, it must be removed to keep the matrix full-rank
contrasts_table = makeContrasts(
     LS_B.vs.CN = SrcLS + SrcLS.PhenoY/2 - SrcCN,    # Is this the way to do it?
     LS_B.vs.NL_B = SrcLS + SrcLS.PhenoY/2 - SrcNL - SrcNL.PhenoY/2,   # Is this the way to do it?
     LS_Y.vs.NL_Y = SrcLS + SrcLS.PhenoY - SrcNL - SrcNL.PhenoY
     LS_N.vs.NL_N = SrcLS - SrcNL, 
     LS_Y.vs.LS_N = SrcLS.PhenoY  # Is this the way to do it?
     NL_Y.vs.NL_N = SrcNL.PhenoY  # Is this the way to do it?
     DiffOfDiff = SrcLS.PhenoY - SrcNL.PhenoY
)  # ... again, including other relevant comparisons such NL_B.vs.CN, LS_N.vs.CN, NL_N.vs.CN

Running the voom-duplicate-correlation-limma pipeline yields a similar number of DEGs, except for the LS_Y.vs.LS_N / NL_Y.vs.NL_N comparisons, which made me curious.

Thank you!

limma • 1.3k views

ADD COMMENT • link updated 18 months ago by Gordon Smyth 52k • written 19 months ago by Jonathan ▴ 10

score 3 · Accepted Answer · 2023-09-11

3

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 4 minutes ago

WEHI, Melbourne, Australia

The two design matrices are equivalent and will give the same results for equivalent contrasts, except for approximations introduced by the contrasts.fit function. (If you were using limma-trend in instead of limma-voom, the results from the two design matrices would be completely indentical for correctly computed contrasts.)

I recommend the first approach as it is transparent and shows you explicitly what each contrast is comparing. The second approach just complicates everything and has no advantages for the questions you ask.

ADD COMMENT • link 19 months ago Gordon Smyth 52k

0

Entering edit mode

Thank you. Would that recommendation will also hold if there are two phenotypes?

In following the vignette, "BB" is agnostic for both phenotype, "BY" is agnostic for the first phenotype and "Y" for the second phenotype, etc.

pData %>% mutate(grp = paste0(Src, "_", Pheno1, "_", Pheno2)) %>% ... factoring ...
design<-model.matrix("~0 + grp + age + sex", data=DGE.cpm$samples)
contrasts_table = makeContrasts(
     LS_BB.vs.CN = (grpLS_NN + grpLS_YN + grpLS_NY + grpLS_YY) / 4 - grpCN,
     LS_BB.vs.NL_BB = (grpLS_NN + grpLS_YN + grpLS_NY + grpLS_YY) / 4 - (grpNL_NN + grpNL_NL + grpNL_NY + grpNL_YY) / 4, 
     LS_BY.vs.NL_BY = (grpLS_NY + grpLS_YY) / 2 - (grpNL_NY + grpNL_YY) / 2, 
     NL_YB.vs.NL_NB = (grpNL_YY + grpNL_YN) / 2 - (grpNL_NY + grpNL_NN) / 2, 
     NL_NY.vs.NL_NN = grpNL_NY - grpNL_NN, 
     ... # etc etc
)

ADD REPLY • link 19 months ago Jonathan ▴ 10

0

Entering edit mode

Would that recommendation will also hold if there are two phenotypes?

Yes.

"BB" is agnostic for both phenotype, "BY" is agnostic for the first phenotype ...

I don't really accept the idea of contrasts being "agnostic". I feel that effects can only be "agnostic" to a second factor if the effects of the two factors are purely additive. I don't generally find the idea of averaging effects over the levels of another factor to lead to something that is scientifically interpretable and meaningful. I generally find it better to compute separate effects for each level of the second factor. In this case, that would mean separate origin effects for each phenotype and separate phenotype effects for each origin. To say the same thing in more statistical terminology, I don't attempt to interpret main effects in the presence of interactions.

That's just my preference however. It is your experiment and your data, so you are free to make whatever contrasts you feel are useful to you and you are willing to defend in a publication.

ADD REPLY • link 19 months ago Gordon Smyth 52k

0

Entering edit mode

Thank you. I'm not entirely sure I understand what you mean by

I generally find it better to compute separate effects for each level of the second factor. In this case, that would mean separate origin effects for each phenotype and separate phenotype effects for each origin.

Can you please clarify or elaborate a little?

ADD REPLY • link 19 months ago Jonathan ▴ 10

1

Entering edit mode

I am telling you that LS_B.vs.NL_B is completely uninteresting. You need instead to examine LS_Y.vs.NL_Y and LS_N.vs.NL_N separately.

ADD REPLY • link 18 months ago Gordon Smyth 52k