Dear Bioconductor Experts.
I have RNAseq data with somewhat difficult experimental design and need advice how to deal with it.
Briefly, there are 10 paired samples before and after treatment.Samples are derived from human beings with some disorder which were participating in proof of concept clinical study (initially there were more patients involved, but at the end of the experiment only 10 valid pairs left). Things would be easy for simple paired design experiment, but 4 of the patients were treated at 'baseline' level starting treatment at week 0 till week 12, another 6 patients were treated with placebo for the same 12 weeks and then treated with the drug for another 12 weeks.
One of the pairs in 'baseline' group is a possible outlier (sample pair 3 on the MDS and PCA plots), reducing the number of pairs to three for this subgroup (also both samples in this pair have much lower library sizes of around 13 mils as compared to other samples, which are in a range of 18-22 mills).
MDS shows no separation according to the treatment - all samples are mixed, except sample 3 pair stands far apart. And I also don't see any clear batch effect between 'baseline' and 'placebo' groups. Below are MDS (open circles = wk0, filled circles = wk12) before and after removeBatchEffect():
Q1: Is it justified to remove this sample pair from the analysis based only on the fact of the smaller library sizes?
Some R code:
Week <- factor(exp.design$time, levels = c('wk0', 'wk12')) Batch0 <- factor(exp.design$batch0, levels = c('baseline', 'placebo')) Donor <- factor(exp.design$ind) Donor.n <- factor(exp.design$ind.n) # nested within 'Batch0' Exp.Design <- data.frame(Donor, Donor.n, Week, Batch0) y <- DGEList(count = Count.Data, genes = gene.annot, samples = Exp.Design) keep <- rowSums(cpm(y) > 0.5) >= 2 table(keep) y <- y[keep, , keep.lib.sizes = F] y <- calcNormFactors(y) design <- model.matrix( ~ Donor + Week, Exp.Design) y <- estimateDisp(y, design, robust = T) fit <- glmQLFit(y, design, robust = T) res <- glmQLFTest(fit) # no DEGs
Biologist responsible for the project insists that all 10 pairs should be treated together as one group, however not surprisingly there were no DEGs (not in deseq2 and not in edgeR in both QLF and LRT pipelines), so I included batch effect in the model (actually a placebo effect):
design1 <- model.matrix(~ Batch0 + Batch0 : Donor.n + Batch0 : Week, Exp.Design) all.zero <- apply(design1, 2, function(x) all(x==0)) idx <- which(all.zero) design1 <- design1[ , -idx] unname(design1) y <- estimateDisp(y, design1, robust = T) colnames(design1) fit <- glmQLFit(y, design1, robust = T) plotQLDisp(fit) res <- glmQLFTest(fit, coef = "Batch0.baseline:Weekwk12") # no DEGs res <- glmQLFTest(fit, coef = "Batch0placebo:Weekwk12") # no DEGs res <- glmQLFTest(fit, coef = 11:12) # no DEGs res <- glmQLFTest(fit, contrast = c(0,0,0,0,0,0,0,0,0,0,-1,1)) # no DEGs
However, the same workflow with LRT model produces some DEGs:
fit <- glmFit(y, design1) res <- glmLRT(fit, coef = "Batch0baseline:Weekwk12") is.de <- decideTestsDGE(res, p.value = 0.1) summary(is.de) Batch0baseline:Weekwk12 Down 89 NotSig 17398 Up 84 res <- glmLRT(fit, coef = "Batch0placebo:Weekwk12") is.de <- decideTestsDGE(res, p.value = 0.1) summary(is.de) Batch0placebo:Weekwk12 Down 18 NotSig 17526 Up 27 res <- glmLRT(fit, coef = 11:12) is.de <- decideTestsDGE(res, p.value = 0.1) summary(is.de) Batch0baseline:Weekwk12+Batch0placebo:Weekwk12 NotSig 17196 Sig 375 res <- glmLRT(fit, contrast = c(0,0,0,0,0,0,0,0,0,0,-1,1)) is.de <- decideTestsDGE(res, p.value = 0.1) summary(is.de) -1*Batch0baseline:Weekwk12 1*Batch0placebo:Weekwk12 Down 219 NotSig 16989 Up 363
Here more questions arise:
Q2: Why QLF produces no DEGs, while LRT gives some DEGs?
Q3: How to interpret results? I wanted to account for the batch effect and to answer the question, which genes are regulated in response to treatment irrespective of the 'placebo' effect. Instead, I have DEGs in the baseline group, DEGs in the placebo group, DEGs which respond to treatment in any of the two groups (by the way Q4: why the number is much higher than a sum of the two?), and a difference between 'placebo' versus 'baseline'.
comment removed