Hello,
I am struggling with the analysis of an unbalanced intervention study, and would appreciate any advice on this.
I got a dataset from a coworker in which he analyzed gene expression in tissue biopsies from healthy (N=14) and obese (N=29) subjects at baseline (B), and after 2 interventions that were performed in the obese subjects only [Treat1 and Treat2]. In total my dataset consists of 69 arrays, comprised of 14 healthy baseline arrays, and for the obese subjects 13 complete array pairs (baseline vs after) for Treat1, 13 complete pairs for Treat2, and 3 incomplete pairs (2 arrays for obese subjects baseline only, and 1 array for Treat2 only).
Of interest are:
-
genes differentially expressed at baseline between all 14 healthy samples vs all 28 obese baseline samples.
-
genes differentially expressed in obese subjects after Treatment1 (after vs before in subset of 13 ppl (=26 arrays) + 2 unpaired baseline arrays.
-
genes differentially expressed in obese subjects after Treatment2 (after vs before in subset of 13 ppl +26 arrays) + 1 unpaired Treat2 array + 2 unpaired baseline arrays)
-
genes that respond different between the 2 treatments (“delta” of the 2 “delta’s”, i.e. comparison 2 vs 3).
In this experiment I thus would like to make comparisons within and between subjects.
AFAIK i do have two options:
-
analyze using a paired design, i.e. block on SubjectID
-
analyze according the multilevel example in the limma user guide (thus using the function duplicateCorrelation).
However, what would be the best approach in mycase?
Advantage of A is that paired (sub-)structure of the design is optimally used, disadvantage is that design is not of full rank and that some coefficients are thus not estimable. The latter is not really an issue because it are two coefs for two subjects?
Advantage of B is that design is of full rank, but the paired substructure is not optimally used, and hence is less powerful?
Any insights are appreciated!
Thanks,
Guido
> targets Filename Status Time SubjectID 1 DELTA_G131_A01_1_PP002_T0.CEL lean base s2 2 DELTA_G131_A03_3_PP003_T0.CEL lean base s3 3 DELTA_G131_A05_5_PP004_T0.CEL lean base s4 4 DELTA_G131_A07_7_PP005_T0_2.CEL lean base s5 5 DELTA_G131_A09_9_PP006_T0.CEL lean base s6 6 DELTA_G131_A11_11_PP007_T0.CEL lean base s7 7 DELTA_G131_B01_15_PP009_T0.CEL lean base s9 8 DELTA_G131_B03_17_PP010_T0.CEL lean base s10 9 DELTA_G131_B05_23_PP013_T0.CEL lean base s13 10 DELTA_G131_B07_27_PP015_T0.CEL lean base s15 11 DELTA_G131_B09_29_PP101_BL_T0.CEL obese baseT2 s101 12 DELTA_G131_B11_31_PP101_FU_T0.CEL obese T2 s101 13 DELTA_G131_C01_33_PP102_BL_T0.CEL obese baseT1 s102 14 DELTA_G131_C03_35_PP102_FU_T0.CEL obese T1 s102 15 DELTA_G131_C05_41_PP104_BL_T0.CEL obese baseT1 s104 16 DELTA_G131_C07_43_PP104_FU_T0.CEL obese T1 s104 17 DELTA_G131_C09_45_PP105_BL_T0.CEL obese baseT2 s105 18 DELTA_G131_C11_47_PP105_FU_T0.CEL obese T2 s105 19 DELTA_G131_D01_49_PP106_BL_T0.CEL obese baseT1 s106 20 DELTA_G131_D03_51_PP106_FU_T0.CEL obese T1 s106 21 DELTA_G131_D05_53_PP107_BL_T0.CEL obese baseT2 s107 22 DELTA_G131_D07_55_PP107_FU_T0.CEL obese T2 s107 23 DELTA_G131_D09_63_PP110_BL_T0.CEL obese baseT2 s110 24 DELTA_G131_D11_65_PP110_FU_T0.CEL obese T2 s110 25 DELTA_G131_E01_67_PP111_BL_T0.CEL obese baseT1 s111 26 DELTA_G131_E03_69_PP111_FU_T0.CEL obese T1 s111 27 DELTA_G131_E05_71_PP112_BL_T0.CEL obese baseT2 s112 28 DELTA_G131_E07_73_PP112_FU_T0.CEL obese T2 s112 29 DELTA_G131_E09_75_PP113_BL_T0.CEL obese baseT1 s113 30 DELTA_G131_E11_77_PP113_FU_T0.CEL obese T1 s113 31 DELTA_G131_F01_82_PP115_BL_T0.CEL obese baseT1 s115 32 DELTA_G131_F03_84_PP115_FU_T0.CEL obese T1 s115 33 DELTA_G131_F09_90_PP117_BL_T0.CEL obese baseT2 s117 34 DELTA_G131_F11_92_PP117_FU_T0.CEL obese T2 s117 35 DELTA_G131_G01_94_PP118_BL_T0.CEL obese baseT1 s118 36 DELTA_G131_G03_96_PP118_FU_T0.CEL obese T1 s118 37 DELTA_G131_G05_104_PP121_BL_T0.CEL obese baseT1 s121 38 DELTA_G131_G07_106_PP121_FU_T0.CEL obese T1 s121 39 DELTA_G131_G09_108_PP122_BL_T0.CEL obese baseT2 s122 40 DELTA_G131_G11_110_PP122_FU_T0.CEL obese T2 s122 41 DELTA_G131_H01_112_PP123_BL_T0.CEL obese baseT2 s123 42 DELTA_G131_H03_114_PP123_FU_T0.CEL obese T2 s123 43 DELTA_G131_H05_116_PP124_BL_T0.CEL obese baseT1 s124 44 DELTA_G131_H07_118_PP124_FU_T0.CEL obese T1 s124 45 DELTA_G131_H09_128_PP127_BL_T0.CEL obese baseT2 s127 46 DELTA_G131_H11_130_PP127_FU_T0.CEL obese T2 s127 47 DELTA_G132_A05_13_PP008.T0.CEL lean base s8 48 DELTA_G132_A09_19_PP011.T0.CEL lean base s11 49 DELTA_G132_B07_21_PP012.T0.CEL lean base s12 50 DELTA_G132_C05_25_PP014.T0.CEL lean base s14 51 DELTA_G132_C09_37_PP103.BL.T0.CEL obese baseT2 s103 52 DELTA_G132_D07_39_PP103.FU.T0_2.CEL obese T2 s103 53 DELTA_G132_E05_57_PP108.BL.T0.CEL obese baseT1 s108 54 DELTA_G132_E09_59_PP108.FU.T0.CEL obese T1 s108 55 DELTA_G132_F07_120_PP125.BL.T0.CEL obese baseT1 s125 56 DELTA_G132_G05_122_PP125.FU.T0.CEL obese T1 s125 57 DELTA_G132_G09_124_PP126.BL.T0.CEL obese baseT1 s126 58 DELTA_G132_H07_126_PP126.FU.T0.CEL obese T1 s126 59 DELTA_G133_A05_61_PP109.BL.T0.CEL obese baseT1 s109 60 DELTA_G133_A09_79_PP114.BL.T0.CEL obese T2 s114 61 DELTA_G133_B09_98_PP119.BL.T0.CEL obese baseT2 s119 62 DELTA_G133_C07_100_PP119.FU.T0.CEL obese T2 s119 63 DELTA_G133_D05_102_PP120.BL.T0.CEL obese baseT2 s120 64 DELTA_G133_D09_132_PP128.BL.T0.CEL obese baseT2 s128 65 DELTA_G133_E07_134_PP128.FU.T0.CEL obese T2 s128 66 DELTA_G133_F05_136_PP129.BL.T0.CEL obese baseT1 s129 67 DELTA_G133_F09_138_PP129.FU.T0.CEL obese T1 s129 68 DELTA_G133_G07_86_PP116.BL.T0.CEL obese baseT2 s116 69 DELTA_G133_H05_88_PP116.FU.T0.CEL obese T2 s116 > # in yellow the 3 incomplete pairs
> Status <- factor(targets$Status) > Time <- factor(targets$Time) > Subject <- factor(targets$SubjectID) > TS <- as.factor(paste(Status, Time, sep=".")) > TS [1] lean.base lean.base lean.base lean.base lean.base lean.base [7] lean.base lean.base lean.base lean.base obese.baseT2 obese.T2 [13] obese.baseT1 obese.T1 obese.baseT1 obese.T1 obese.baseT2 obese.T2 [19] obese.baseT1 obese.T1 obese.baseT2 obese.T2 obese.baseT2 obese.T2 [25] obese.baseT1 obese.T1 obese.baseT2 obese.T2 obese.baseT1 obese.T1 [31] obese.baseT1 obese.T1 obese.baseT2 obese.T2 obese.baseT1 obese.T1 [37] obese.baseT1 obese.T1 obese.baseT2 obese.T2 obese.baseT2 obese.T2 [43] obese.baseT1 obese.T1 obese.baseT2 obese.T2 lean.base lean.base [49] lean.base lean.base obese.baseT2 obese.T2 obese.baseT1 obese.T1 [55] obese.baseT1 obese.T1 obese.baseT1 obese.T1 obese.baseT1 obese.T2 [61] obese.baseT2 obese.T2 obese.baseT2 obese.baseT2 obese.T2 obese.baseT1 [67] obese.T1 obese.baseT2 obese.T2 Levels: lean.base obese.baseT1 obese.baseT2 obese.T1 obese.T2 >
Option A; include Subject in model, but design is not full rank...
> design <- model.matrix((~0+TS+Subject)) > is.fullrank(design) [1] FALSE > nonEstimable(design) [1] "Subjects128" "Subjects129" >
Option B; use duplicateCorrelation()
> design <- model.matrix(~0+TS) #thus not including the subjects > is.fullrank(design) [1] TRUE > nonEstimable(design) NULL > corfit <- duplicateCorrelation(x.norm,design,block=Subject)
Define contrasts
> cont.matrix <- makeContrasts( + ObeseBLvsLeanBL = (TSobese.baseT1 + TSobese.baseT2 )/2 - TSlean.base, #combine all obese baseline samples + EffectT1 = (TSobese.T1 - TSobese.baseT1), + EffectT2 = (TSobese.T2 - TSobese.baseT2), + deltaDelta = ( (TSobese.T2 - TSobese.baseT2) - (TSobese.T1 - TSobese.baseT1)), + levels=design) > > cont.matrix Contrasts Levels ObeseBLvsLeanBL EffectT1 EffectT2 deltaDelta TSlean.base -1.0 0 0 0 TSobese.baseT1 0.5 -1 0 1 TSobese.baseT2 0.5 0 -1 -1 TSobese.T1 0.0 1 0 -1 TSobese.T2 0.0 0 1 1 >
Option A:
> fit <- lmFit(x.norm,design) Coefficients not estimable: Subjects128 Subjects129 Warning message: Partial NA coefficients for 19654 probe(s) > fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2, trend=TRUE) Loading required package: splines > topTable(fit2) Error in topTableF(fit, number = number, genelist = genelist, adjust.method = adjust.method, : F-statistics not found in fit > topTable(fit2,coef=1) logFC AveExpr t P.Value adj.P.Val B 388595_at -1.2992596 -0.045251715 -5.402161 9.033393e-06 0.1775423 2.24502824 93463_at 1.0838054 -0.033891183 4.762181 5.223624e-05 0.3696189 1.05278388 10589_at -0.7425038 0.002192174 -4.711510 5.950177e-05 0.3696189 0.96356396 5075_at -0.7834098 0.017699312 -4.424018 1.317521e-04 0.3696189 0.41041130 8394_at -0.5483462 -0.007113591 -4.396152 1.398632e-04 0.3696189 0.36951056 5111_at -0.6544413 0.044793358 -4.389222 1.425396e-04 0.3696189 0.35623703 150280_at -0.8668590 0.016500401 -4.368414 1.533131e-04 0.3696189 0.30437320 3850_at -0.7982704 0.000636201 -4.313728 1.779197e-04 0.3696189 0.20002697 29100_at 0.5633564 -0.028783088 4.260558 2.025360e-04 0.3696189 0.10966118 4286_at -0.7714238 0.035521729 -4.250192 2.114479e-04 0.3696189 0.07875228 >
Option B (note that the statistical signifcance for coef=1 is much 'better' than above):
> design <- model.matrix(~0+TS) > fit <- lmFit(x.norm,design,block=Subject,correlation=corfit$consensus) > fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2, trend=TRUE) > topTable(fit2) ObeseBLvsLeanBL EffectT1 EffectT2 deltaDelta AveExpr 5768_at 0.27094472 0.102441091 -0.03123944 -0.1336805291 0.0131168620 100129455_at -0.26193499 0.368394971 0.10007335 -0.2683216244 -0.0002613564 11316_at 0.16576403 0.131316402 0.13082179 -0.0004946125 0.0003524714 2357_at 0.35247065 -0.003587640 -0.03853153 -0.0349438859 -0.0179288959 57533_at 0.12907861 0.041385510 -0.17499073 -0.2163762354 -0.0194017923 9645_at 0.28792718 0.004117408 -0.10544675 -0.1095641611 0.0226279000 55022_at 0.40774391 0.046476910 -0.09294465 -0.1394215643 0.0688917460 81553_at 0.25856635 0.010798014 -0.07423535 -0.0850333635 -0.0243488464 91544_at 0.17854136 0.078838131 0.01881281 -0.0600253259 0.0441462794 22891_at -0.06227555 -0.029604714 -0.20926173 -0.1796570190 -0.0234380020 F P.Value adj.P.Val 5768_at 12.948352 8.272455e-07 0.01625868 100129455_at 9.932042 1.574822e-05 0.08406391 11316_at 9.915626 1.601338e-05 0.08406391 2357_at 9.850647 1.710876e-05 0.08406391 57533_at 9.177903 3.417815e-05 0.10274447 9645_at 9.125528 3.608937e-05 0.10274447 55022_at 8.835055 4.887136e-05 0.10274447 81553_at 8.640098 5.998237e-05 0.10274447 91544_at 8.619696 6.128604e-05 0.10274447 22891_at 8.541327 6.657034e-05 0.10274447 > topTable(fit2,coef=1) logFC AveExpr t P.Value adj.P.Val B 5768_at 0.2709447 0.01311686 5.176671 2.102617e-06 0.02330378 4.581468 2357_at 0.3524707 -0.01792890 5.144169 2.383376e-06 0.02330378 4.471316 9645_at 0.2879272 0.02262790 5.017521 3.872730e-06 0.02330378 4.044732 81553_at 0.2585663 -0.02434885 4.845884 7.418486e-06 0.02330378 3.473728 1848_at 0.4078608 0.03655012 4.831124 7.841526e-06 0.02330378 3.425026 54472_at 0.1947691 0.06115485 4.796373 8.932633e-06 0.02330378 3.310631 1462_at 0.2264794 -0.01457601 4.785774 9.293975e-06 0.02330378 3.275814 55022_at 0.4077439 0.06889175 4.780315 9.485613e-06 0.02330378 3.257895 4688_at 0.2567551 0.01126808 4.714461 1.212391e-05 0.02647592 3.042476 51099_at 0.3188907 -0.05336680 4.553293 2.196215e-05 0.04159022 2.521259 >
Thanks Aaron for your insights, much appreciated.
To be clear, do you suggest to split the analysis in 2 parts; i.e. one analysis for aim 1 using duplicateCorrelation, and a second analysis blocking by SubjectID for aim 2-4? And in both cases use the complete data set (all arrays)?
That's right. Of course, even if you do use the complete data set, the unpaired obese samples won't be useful for the
SubjectID
-blocked analysis and will be automatically ignored.