Hello,
I've carried out a number of differential expression analyses using the Limma-Voom pipeline for a rather complicated data set. I'd just like to clarify that my approach detailed below is appropriate.
The design of the experiment is very much like those described in sections 9.6 and 9.7 of the Limma manual (time-course experiments and multi-level experiments).
Our data set comprises of RNA-Seq data from patients undertaking treatment with one of two drugs (drug A and drug B). For each subject in our data set, samples were collected at three time points from two different tissue types on the body. Samples from diseased tissue and healthy tissue were collected at week 0 (before commencing treatment), week 2 (two weeks after commencing treatment), and week 10 (ten weeks after commencing treatment).
In addition, each patient was genotyped for an allele of interest (we'll call it Allele.X), and designated as Allele.X-positive (1 or 2 copies of allele) or Allele.X-negative (0 copies of allele).
Therefore, a snippet of the data set looks something like this:
Subject Sample Drug Time Tissue Allele.X
Subject.01 Sample.01 Drug.A Week.0 Diseased Positive
Subject.01 Sample.02 Drug.A Week.2 Diseased Positive
Subject.01 Sample.03 Drug.A Week.10 Diseased Positive
Subject.01 Sample.04 Drug.A Week.0 Healthy Positive
Subject.01 Sample.05 Drug.A Week.2 Healthy Positive
Subject.01 Sample.06 Drug.A Week.10 Healthy Positive
Subject.02 Sample.07 Drug.B Week.0 Diseased Negative
Subject.02 Sample.08 Drug.B Week.2 Diseased Negative
Subject.02 Sample.09 Drug.B Week.10 Diseased Negative
Subject.02 Sample.10 Drug.B Week.0 Healthy Negative
Subject.02 Sample.11 Drug.B Week.2 Healthy Negative
Subject.02 Sample.12 Drug.B Week.10 Healthy Negative
In total, 35 patients were treated with drug A and 31 were treated with drug B.
In the drug A group, 20 were Allele.X-negative and 15 were were Allele.X-positive. In the drug B group, 18 were Allele.X-negative and 13 were Allele.X-positive.
I would like to identify genes that change in expression over the course of treatment in both tissue types in both drug groups, irrespective of Allele.X status, i.e: week 2 and week 10 compared to week 0 in ALL of the drug A group and ALL of the drug B group.
I would also like to do the above comparisons, but also taking into account Allele.X status.
My approach do this has been to analyse each drug group separately, so when looking at drug A for example, this is done by subsetting the clinical data and expression data...
drug = "Drug.A"
clinical.data <- clinical.data[clinical.data$Drug == "Drug.A",]
exp.data <- exp.data[, clinical.data$Sample]
...and creating an interaction variable that combines tissue, time, and Allele.X status...
clinical.data$Tissue.Time.Allele.X = paste(clinical.data$Tissue, clinical.data$Time, clinical.data$Allele.X, sep = ".")
...and an interaction variable that combines subject and tissue...
clinical.data$Subject.Tissue = paste(clinical.data$Subject, clinical.data$Tissue, sep = ".")
The Tissue.Time.Allele.X variable is used to define the design...
limma_design = model.matrix(~ 0 + Tissue.Time.Allele.X, data = keep_samp)
voomed_counts = voom(edata, limma_design)
...and the Subject.Tissue variable is used to for sample blocking...
# Block correlation
my_cor = duplicateCorrelation(voomed_counts, limma_design, block = clinical.data$Subject.Tissue)
fit = lmFit(voomed_counts, limma_design, correlation = my_cor$cor, block = clinical.data$Subject.Tissue)
The makeContrasts function is used to extract the relevant contrasts...
# Extract contrasts of interest
my_contrasts <- makeContrasts(
Diseased_week2_vs_week0_All =
((15 * Diseased.Week.2.Positive + 20 * Diseased.Week.2.Negative) / 35) -
((15 * Diseased.Week.0.Positive + 20 * Diseased.Week.0.Negative) / 35),
Diseased_week10_vs_week0_All =
((15 * Diseased.Week.10.Positive + 20 * Diseased.Week.10.Negative) / 35) -
((15 * Diseased.Week.0.Positive + 20 * Diseased.Week.0.Negative) / 35),
Healthy_week2_vs_week0_All =
((15 * Diseased.Week.2.Positive + 20 * Diseased.Week.2.Negative) / 35) -
((15 * Diseased.Week.0.Positive + 20 * Diseased.Week.0.Negative) / 35),
Healthy_week10_vs_week0_All =
((15 * Healthy.Week.10.Positive + 20 * Healthy.Week.10.Negative) / 35) -
((15 * Healthy.Week.0.Positive + 20 * Healthy.Week.0.Negative) / 35),
Diseased_week2_vs_week0_Pos = Diseased.Week.2.Positive - Diseased.Week.0.Positive,
Diseased_week2_vs_week0_Neg = Diseased.Week.2.Negative - Diseased.Week.0.Negative,
Diseased_week10_vs_week0_Pos = Diseased.Week.10.Positive - Diseased.Week.10.Positive,
Diseased_week10_vs_week0_Neg = Diseased.Week.10.Negative - Diseased.Week.10.Negative,
Healthy_week2_vs_week0_Pos = Healthy.Week.2.Positive - Healthy.Week.0.Positive,
Healthy_week2_vs_week0_Neg = Healthy.Week.2.Negative - Healthy.Week.0.Negative,
Healthy_week10_vs_week0_Pos = Healthy.Week.10.Positive - Healthy.Week.10.Positive,
Healthy_week10_vs_week0_Neg = Healthy.Week.10.Negative - Healthy.Week.10.Negative
levels = limma_design
)
# Compute contrasts
fit <- contrasts.fit(fit, my_contrasts)
fit <- eBayes(fit)
Hopefully it is clear that for the contrasts in all of the patients within the drug group, the average expression (weighted by sample number) of the Allele.X-positive and negative patients is being contrasted.
So my main question - is the above approach valid?
Thank you! And sorry for long post, but hopefully it is clear what I'm getting at.
Thank you for your answer.
My reasoning for averaging the expression for the contrasts in all of the patients comes from this post... https://support.bioconductor.org/p/69764/#69765 ...but I may have misinterpreted it.
And regarding blocking at the subject/tissue level - the two different tissue types from each subject are so different in expression that I though it would make sense to effectively treat them as different subjects.
The post you reference shows that you can do an unweighted average. There is no reason to do a weighted average in this context. Put a different way, the post you link to is showing how to compute the contrast of one group versus the average of two other groups, not how to adjust for perceived accuracy of one group versus the other. The average is very efficient, and there's no reason to think that an average based on 20 subjects is materially more accurate than one based on 15.
The idea behind a mixed model is that the observations within a particular group are correlated and you want to account for that correlation structure in the model (statistical tests usually assume that all the observations are independent of each other). Even if the two tissues have very different expression levels, the correlation between all the tissues within a subject are likely to be high, and that's what you want to control for. And as I already mentioned, if you have complete observations for all subjects (you measure each tissue the same number of times for each subject), it's probably better to simply block on subject.
Thank you, I think I understand. I was thinking that because my subgroups are unequal in size (unlike in the post I linked to) the average should be weighted.
Sorry to keep dragging you back here, but I have one final query... do you think there is any benefit to contrasting the different time points in all the subjects, and in the alleleX-positive and negative patients separately, in the same model? I.e. rather than all this averaging expression business, I should just have one model that contrasts using a Tissue.Time interaction variable (a model that doesn't consider alleleX), and another model that contrasts using a Tissue.Time.AlleleX variable (a model which does consider alleleX positive and negative subjects separately) ?
That's something you need to decide as the analyst. It may be that the allele affects very few, if any genes, in which case it's burning degrees of freedom to include it. Or maybe the allele is the main hypothesis?
You can always make the most specific coefficient and then average to get at less specific questions. In other words you could include the allele and then average time/treatment combinations that have different alleles to do the same thing as your first model.