Dear Bioconductor Community,
I have an RNAseq experiment and was wondering if I could get some input on my design matrix (here used for an analysis in edgeR).
Tissue samples from 7 healthy individuals were evaluated after treatment with a drug (XTC) or placebo at two time points, 1 and 2 h after treatment, for all participants, and after 6 h for 3 of the participants. For every time point, except the 6 h time point, we have matching samples from each patient.
The target frame looks like this:
Patient Treatment TimeHours
P1XTC2 P1 XTC 2
P1XTC1 P1 XTC 1
P1Placebo2 P1 Placebo 2
P1Placebo1 P1 Placebo 1
P2XTC2 P2 XTC 2
P2XTC1 P2 XTC 1
P2XTC6 P2 XTC 6
P2Placebo2 P2 Placebo 2
P2Placebo1 P2 Placebo 1
P3XTC2 P3 XTC 2
P3XTC1 P3 XTC 1
P3XTC6 P3 XTC 6
P3Placebo2 P3 Placebo 2
P3Placebo1 P3 Placebo 1
P4XTC2 P4 XTC 2
P4XTC1 P4 XTC 1
P4XTC6 P4 XTC 6
P4Placebo2 P4 Placebo 2
P4Placebo1 P4 Placebo 1
P5XTC2 P5 XTC 2
P5XTC1 P5 XTC 1
P5XTC6 P5 XTC 6
P5Placebo2 P5 Placebo 2
P5Placebo1 P5 Placebo 1
P6XTC2 P6 XTC 2
P6XTC1 P6 XTC 1
P6Placebo2 P6 Placebo 2
P6Placebo1 P6 Placebo 1
P7XTC2 P7 XTC 2
P7XTC1 P7 XTC 1
P7Placebo2 P7 Placebo 2
P7Placebo1 P7 Placebo 1
The count matrix like this:
Gene P1XTC2 P1XTC1 P1Placebo2 P1Placebo1 P2XTC2 P2XTC1 P2XTC6
Gene1 130041 294709 693018 392083 271540 149505 450974
Gene2 125 473 548 251 926 327 171
Gene3 0 22 25 16 34 6 12
Gene4 56870 195167 194034 230998 140803 83530 344095
Gene5 256 540 1153 999 808 291 485
Gene6 18813 51631 101162 74585 27923 19618 68625
Gene7 19 27 69 41 88 10 27
Gene8 7817 18459 33485 20042 22944 9367 22605
Gene9 776 1449 2652 1985 1303 911 1448
Gene10 5089 9470 24026 13697 11781 4119 10764
Now, fitting and testing the model:
> library(edgeR)
> Group = factor(paste(targets$Treatment, targets$TimeHours, sep = "."))
> Group = relevel(Group, ref = "Placebo.1")
> cbind(targets, Group)
> d = DGEList(counts, group = Group)
> d = calcNormFactors(d)
> cpm_d = cpm(d, normalized.lib.sizes = TRUE)
> d = d[rowSums(cpm_d > 1) >= 2, ]
> d = estimateDisp(d)
> patient = factor(targets$Patient)
> design = model.matrix(~ 0 + patient + Group)
> colnames(design) = gsub("patient|Group", "", colnames(design))
> fit = glmFit(d, design)
If defined like this, the comparisons are based on the 1 h placebo (Placebo.1),
which as expected does not appear as a coefficient in the design matrix.
> colnames(fit)
[1] "P1" "P2" "P3" "P4" "P5" "P6" "P7" "Placebo.2"
[9] "XTC.1" "XTC.2" "XTC.6”
Now testing for the different comparisons of interest.
## 1 hour treatment vs. 1 hour placebo
> lrt = glmLRT(fit, coef = 9)
## 2 hour treatment vs. 2 hour placebo
> lrt = glmLRT(fit, contrast = c(-1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0))
## 6 hour treatment vs. 1 hour placebo, no 6 hour placebo available.
> lrt = glmLRT(fit)) # last coefficient, no need to specify.
I have tested a variety of design matrices (with and without intercept, not considering the pairing etc.) and the results are quite robust across all these tests. But I would be thankful for some insightful comments from the community - of course pointing at obvious mistakes/misconceptions, or giving examples of some potentially informative alternative approaches.
Cheers,
Kemal
--
Dr. Kemal Akat
HHMI/The Rockefeller University
Laboratory of RNA Molecular Biology
1230 York Avenue, Box #186
New York, NY 10065
R> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] locfit_1.5-9.1 edgeR_3.8.2 limma_3.22.1 colorout_1.0-3 setwidth_1.0-3
loaded via a namespace (and not attached):
[1] grid_3.1.2 lattice_0.20-29 tools_3.1.2