Hi, I have an RNASeq experiment with control and cohort/diseased samples, whereas the cohort samples can be lesional or non-lesional. The cohort itself can have a unique phenotype/characteristic (which does not appear in controls). Thus I have three main research questions:
- What is the (age&sex adjusted) effect of the disease: Lesional vs. Non-Lesional, Lesional vs. Control, Non-Lesional vs. Control.
- What is the (age&sex adjusted) effect of the disease in a sub-group analysis: Lesional/Phenotype+ vs Non-Lesional/Phenotype-, and Lesional/Phenotype- vs Non-Lesional/Phenotype-.
- What is the (age&sex adjusted) effect of the phenotype in both cohort groups: Lesional/Phenotype+ vs Lesional/Phenotype-, and Non-Lesional/Phenotype+ vs Non-Lesional/Phenotype-.
Following the user-guide, I ended up wondering which approach is better for my design/contrast matrices, and more importantly, what are their differences.
For the following code vignette, Src is the sample origin in which CN is controls, NL is non-lesional, LS is lesional; Pheno ("Y"/"N") is whether cohort samples have the phenotype or not; I use "B" for phenotype-agonistic comparisons.
Approach #1 - Grouping samples type ("Src") and phenotype ("Pheno"):
pData %>% mutate(grp = paste0(Src, "_", Pheno)) %>% (... converting to a factor ...) # will create the groups: CN_N, NL_N, LS_N, NL_Y, LS_Y
design<-model.matrix("~0 + grp + age + sex", data=DGE.cpm$samples) # the design table will have columns as groups above, plus age and sex columns.
contrasts_table = makeContrasts(
LS_B.vs.CN = (grpLS_Y + grpLS_N) / 2 - grpCN,
LS_B.vs.NL_B = (grpLS_Y + grpLS_N) / 2 - (grpNL_Y + grpNL_N) / 2,
LS_Y.vs.NL_Y = grpLS_Y - grpNL_Y,
LS_N.vs.NL_N = grpLS_N - grpNL_N,
LS_Y.vs.LS_N = grpLS_Y - grpLS_N,
NL_Y.vs.NL_N = grpNL_Y - grpNL_N,
DiffOfDiff = (grpLS_Y - grpLS_N) - (grpNL_Y - grpNL_N)
) # ... also, including other relevant comparisons such NL_B.vs.CN, NL_Y.vs.NL_N, LS_N.vs.CN, NL_N.vs.CN
Approach #2 - Interaction between samples type ("Src") and phenotype ("Pheno"):
design<-model.matrix("~0 + Src + Src:Pheno + age + sex", data=DGE.cpm$samples)
design<-design %>% as.data.frame() %>% dplyr::select(-SrcCN:PhenoY) # as control group ("CN") does not have a phenotype, it must be removed to keep the matrix full-rank
contrasts_table = makeContrasts(
LS_B.vs.CN = SrcLS + SrcLS.PhenoY/2 - SrcCN, # Is this the way to do it?
LS_B.vs.NL_B = SrcLS + SrcLS.PhenoY/2 - SrcNL - SrcNL.PhenoY/2, # Is this the way to do it?
LS_Y.vs.NL_Y = SrcLS + SrcLS.PhenoY - SrcNL - SrcNL.PhenoY
LS_N.vs.NL_N = SrcLS - SrcNL,
LS_Y.vs.LS_N = SrcLS.PhenoY # Is this the way to do it?
NL_Y.vs.NL_N = SrcNL.PhenoY # Is this the way to do it?
DiffOfDiff = SrcLS.PhenoY - SrcNL.PhenoY
) # ... again, including other relevant comparisons such NL_B.vs.CN, LS_N.vs.CN, NL_N.vs.CN
Running the voom-duplicate-correlation-limma pipeline yields a similar number of DEGs, except for the LS_Y.vs.LS_N / NL_Y.vs.NL_N comparisons, which made me curious.
Thank you!
Thank you. Would that recommendation will also hold if there are two phenotypes?
In following the vignette, "BB" is agnostic for both phenotype, "BY" is agnostic for the first phenotype and "Y" for the second phenotype, etc.
Yes.
I don't really accept the idea of contrasts being "agnostic". I feel that effects can only be "agnostic" to a second factor if the effects of the two factors are purely additive. I don't generally find the idea of averaging effects over the levels of another factor to lead to something that is scientifically interpretable and meaningful. I generally find it better to compute separate effects for each level of the second factor. In this case, that would mean separate origin effects for each phenotype and separate phenotype effects for each origin. To say the same thing in more statistical terminology, I don't attempt to interpret main effects in the presence of interactions.
That's just my preference however. It is your experiment and your data, so you are free to make whatever contrasts you feel are useful to you and you are willing to defend in a publication.
Thank you. I'm not entirely sure I understand what you mean by
Can you please clarify or elaborate a little?
I am telling you that
LS_B.vs.NL_B
is completely uninteresting. You need instead to examineLS_Y.vs.NL_Y
andLS_N.vs.NL_N
separately.