Dear all,
I'm trying to do differential expression analysis using limma with a somewhat complicated design. I have an experiment with three levels of treatment (untreated, and treatment I and II) in over 200 donors (A, B...ZZ) of different genotype each. Due to experimental limitations, the samples have been treated in the same way but in 15 different batches. Each batch includes 8-40 different donors each, and 2 donors are present in every batch (donor A and B), while others are present in a variety of batches (some in 2, some in 3 or 4). So the majority of donors are present only in one batch, and there is not a clear baseline. Each donor has all three levels of treatment in each batch in which it is present. So the metadata looks like this:
donor treatment batch
untreated_A_batch13 A untreated batch13
untreated_B_batch17 B untreated batch17
untreated_B_batch2 B untreated batch2
untreated_C_batch3 C untreated batch3
untreated_D_batch6 D untreated batch6
....
II_ZZ_batch15 ZZ II batch15
II_ZZ_batch16 ZZ II batch16
I wanted to answer two questions:
- What is the effect of each treatment, controlling for the donor and batch effects?
- Is there an interaction of donor x treatment?
To answer question 1 I am controlling for batch using the blocking method explained on [p. 127][1] of the manual (duplicateCorrelation()
, under "18.1.9 Linear modeling") and following the usual voom()
-> lmFit()
-> eBayes()
, and fitting the design excluding batch:
mat = model.matrix(~ 0 + treatment + donor, metadata)
This works fine for the contrasts treatmentI-treatmentuntreated and treatmentII-treatmentuntreated. My only issue is that the adj.P.Val are tiny (and <0.05 in 91% of genes). Is this just due to the amount of replicates (donors) that I have, or should I be worried? There are ~10% with abs(logFC)>1, which is in the line of what I've seen in other experiments.
Now the problem is question 2. I create a new design matrix like this:
mat = model.matrix(~ 0 + donor*treatment, metadata)
The column names of the design matrix look like this if I set donorA and treatment I as reference:
"donorA" "donorB" "donorC" (....) "treatmentII" "treatmentuntreated" "donorB:treatmentII"
"donorC:treatmentII" (...) "donorB:treatmentuntreated" "donorC:treatmentuntreated" (...) "donorZZ:treatmentuntreated"
Because the batch and donor effects are correlated, I expect correcting using duplicateCorrelation()
will remove any signal I may have in the donors that are present in only one batch. So because of this and to prevent the matrix not being full rank, I manually remove all columns from the metadata matrix that pertain to donors that are not shared across all batches (so, I remove "donorC" to "donor ZZ", but leave their interaction columns: so for example "donorZZ:treatmentuntreated" would remain.
v = voom(dge, design = mat, plot=TRUE)
cor = duplicateCorrelation(v, design=mat, block = metadata$batch)
#voom weights may have changed:
v = voom(dge, design = mat, plot=TRUE, block = metadata$pool, correlation = cor$consensus)
cor = duplicateCorrelation(v, design = mat, block = metadata$pool) # extract cor again
cor$consensus
# fit
fit = lmFit(v, design=mat,block = metadata$pool, correlation = cor$consensus)
Now I fit the contrast matrix. For every donor in mat
above (I abbreviated the code for clarity):
cont = c( paste0("(donor",donor,".treatmentII-donor", donor,".treatmentuntreated)-(treatmentII-treatmentuntreated)-((donorA + donorB)/2)"))
colnames(contrast.matrix) = c(paste0(donor,"_II_interaction"))
contrast.matrix = makeContrasts(contrasts = cont,
levels=mat)
The contrast matrix looks like this:
Contrasts
Levels B_II_interaction C_II_interaction D_II_interaction
donorA -0.5 -0.5 -0.5
donorB -0.5 -0.5 -0.5
treatmentII -1 -1 -1
treatmentuntreated 1 1 1
donorB.treatmentII 1 0 0
(...)
donorB.treatmentuntreated -1 0 0
- Am I correct in my interpretation that treatmentII - treatmentuntreated is the effect of treatment II vs untreated for the average of all donors, and not the 2 donors that I'm retaining in the matrix to estimate the donor effects? What would be then the caveat of including only two donors vs including the other donors that are only partially shared across batches?
- Is N_II_interaction = (donorN:treatmentII - donorN:treatmentuntreated)-(treatmentII - treatmentuntreated)-((donorA + donorB)/2) interpretable as the interaction for donor N with respect of the average of donor A and B? I noticed that calculating this for donor B gives more differentially expressed genes than for the other donors, why is this?
- Can you suggest a better way of fitting these interactions in an interpretable way for all donors?
I hope I have included all the information needed, let me know if not. Thanks in advance for your advice.
Clarification: This is pseudobulk data from scRNA-seq, so I could potentially create replicates for some donors within each batch and treatment, provided I have enough cells for that donor (some are much more abundant than others).
Do you observe all three treatments for every donor?
If a donor appears in more than one batch, are all three treatments repeated for that donor in each of the batches?
Are there any repeats of the same donor with the same treatment in the same batch?
Yes, all three treatments are present for every donor, in all the batches in which that donor appears. There are no repeats of the same donor with the same treatment in the same batch (but I could create some from different cells of the same donor - these is pseudobulk data from scRNA-seq)
I guess you mean using duplicateCorrelation as is done for cell line in Section 18.1.19 of the limma User's Guide. (That's page 127 of the latest User's Guide.)
Yes, sorry, duplicateCorrelation as in p.127 section "18.1.9 Linear modeling"