Question

Limma blocking on unbalanced factor

1

Entering edit mode

obarton ▴ 10

@obarton-11523

Last seen 8.6 years ago

Hello,

I am performing a differential analysis with limma, I have three groups: a control group, and two disease groups (D1 and D2), and I want to perform two contrasts, D1- CTRL and D2-CTRL whilst blocking for a variable “site”.

The site variable is quite unbalanced:

> table(sampDat$joinedGroup, sampDat$site)
      
       arm back leg scalp
  CTRL  14   99 100     0
  D1    37    2  44     0
  D2    10   90   1    19

As you can see the control group has 0 scalp samples, and D2 has 19 scalp samples. Further, there are very low numbers of back in D1 and leg in D2.

My current solution is a standard cell means model with site as a blocking factor:

f <- factor(sampDat$joinedGroup)

site <- factor(sampDat$site)

design <- model.matrix(~0 + f + site)
fit <- lmFit(affyTable, design)

contrastMatrix  <- makeContrasts(fD1-fCTRL,
                                 fD2-fCTRL,
                                 levels = colnames(design))

fit2 <- contrasts.fit(fit, contrastMatrix)
fit2 <- eBayes(fit2)

tt_1 <- topTable(fit2, coef = 1, adjust.method = "BH",n = "inf", p = 0.05)

Which gives me expected results.

My question is in regards to the unbalanced blocking factor. I'm trying to understand how limma is dealing with the unbalanced design, particularly with regards to the scalp level. Could someone comment on what is the model estimating for this level?

Thanks,

Owen

limma • 1.2k views

ADD COMMENT • link updated 8.6 years ago by Aaron Lun ★ 28k • written 8.6 years ago by obarton ▴ 10

score 3 · Answer 1 · 2016-09-21

The scalp samples won't contribute to any of the contrasts (other than increasing the residual d.f. for estimating the variance) because there's nothing to which to compare the D2 samples. The sitescalp coefficient will then directly represent the average log-fold change of the D2 scalp samples over the D2 arm samples (as arm should be the reference level for the site factor, being alphabetically first). In contrast, for the other non-arm sites, the value of the corresponding coefficient will represent the log-fold change over the arm samples, averaged across all disease types (not just D2) because they all have available samples. Note that the fD1, fD2 and fCNTRL coefficients directly represent the average log-expression in arm samples for the corresponding disease type, so there's no need for a separate sitearm coefficient.

This is all fine, by the way. limma accounts for the fact that it's unbalanced, it's not something you have to worry about. In terms of experimental design, you would have more power if the design were properly balanced, but that can't be helped if there are logistical constraints preventing that. In any case, the lack of balance doesn't invalidate the analysis.