Question

edgeR: Choice of reference group in nested model

0

Entering edit mode

ri.lars ▴ 40

@rilars-6776

Last seen 6.0 years ago

I have an experiment where each sample is measured multiple times and samples are nested within the experimental group. I have noticed that my choice of sample reference level seems to influence the results. Below is a minimal example to show what I mean:

library(edgeR)
counts <- matrix(c(6,2,6,21,19,15,17,18,20,12), nrow = 1)
counts <- rbind(counts, 25 - counts)

setup <- data.frame(sample = factor(c("1","1","1","2","2","1","1","2","2","2")),
                    group = factor(c(rep("a", 5), rep("b", 5))))

tester <- function(){
  design <- model.matrix(~ group + group:sample, setup)
  y <- DGEList(counts = counts, group = setup$group, genes = c("test", "test2"))
  y <- estimateDisp(y, design=design, tagwise = FALSE, trend.method = "none")
  fit <- glmFit(y, design)
  topTags(glmLRT(fit, coef = 2))
}

tester()
# Coefficient:  groupb 
#   genes     logFC   logCPM       LR       PValue          FDR
# 1  test  1.750700 19.03706 16.39638 5.138319e-05 0.0001027664
# 2 test2 -1.164792 18.81775 10.49939 1.194136e-03 0.0011941364

# Set reference group to "2" instead of "1"
setup$sample <- relevel(setup$sample, "2")
tester()
# Coefficient:  groupb 
#   genes      logFC   logCPM        LR    PValue       FDR
# 2 test2  0.7228214 18.81775 1.9869241 0.1586629 0.3173257
# 1  test -0.2612434 19.03706 0.7314366 0.3924179 0.3924179

I have a couple of questions: Is this intended behaviour, and if so, why?

Is my model specification incorrect or can I change it to ensure consistent results?

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.20.9 limma_3.34.9

loaded via a namespace (and not attached):
[1] compiler_3.4.3  tools_3.4.3     Rcpp_0.12.16    grid_3.4.3      locfit_1.5-9.1  lattice_0.20-35

edgeR model.matrix nested design reference group • 1.5k views

ADD COMMENT • link updated 6.6 years ago by Aaron Lun ★ 28k • written 6.6 years ago by ri.lars ▴ 40

score 3 · Accepted Answer · 2018-04-17

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 15 hours ago

The city by the bay

Because the second coefficient represents different things in the two design matrices.

Let's look at the design matrix when 1 is the reference level for sample. Column names are:

(Intercept) groupb groupa:sample2 groupb:sample2

If you examine the entries of design, you will find that the intercept represents the average expression of the A-1 combination. The groupb coefficient represents the log-fold change in B-1 over A-1. groupa:sample2 represents the log-fold change in A-2 over A-1, while groupb:sample2 is the log-fold change in B-2 over B-1.

Now, let's have a look at the design matrix after the call to relevel(), which has column names:

(Intercept) groupb groupa:sample1 groupb:sample1

Now, the intercept represents the average expression of the A-2 combination, the new "overall baseline" across the two factors. The groupb coefficient represents the log-fold change in B-2 over A-2, which is clearly different from B-1 vs A-1. groupa:sample1 represents the log-fold change in A-1 over A-2, while groupb:sample1 represents the log-fold change in B-1 over B-2.

In short, don't be fooled by the fact that the columns have the same names before and after relevel(). You need to look at the design matrix to figure out what the coefficients actually represent.

ADD COMMENT • link 6.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for the comprehensive explanation, that cleared things up quite a bit. What I am really interested in would be the log-fold change in the average of B1 and B2 over the average of A1 and A2. Can I design a contrast with my current model specification that would test this? Something like ((Intercept) + groupa:sample2)/2 - (groupb + groupb:sample2)/2. (this formulation seems wrong).

Alternatively I could fit a model of the form ~ 0 + group:sample, and then make a contrast of (b1 + b2)/2 - (a1 + a2)/2, that should give me the difference I am after, correct?

ADD REPLY • link 6.6 years ago ri.lars ▴ 40

1

Entering edit mode

Looking at the first design matrix:

A1: Intercept
B1: Intercept + groupb
A2: Intercept + groupa:sample2
B2: Intercept + groupb + groupb:sample2

So to create the null hypothesis equivalent to (A1 + A2)/2 = (B1 + B2)/2, you do:

(Intercept + Intercept + groupa:sample2)/2
= (Intercept + groupb + Intercept + groupb + groupb:sample2)/2

... which can be simplified to:

(groupa:sample2 - groupb:sample2)/2 - groupb = 0

... the left-hand-side of which is the expression to use in makeContrasts().

If you think that looks unnecessarily complicated, you'd be right; a far simpler approach would be to combine group and sample into a single factor, and use a no-intercept model as you've propose in your second option.

ADD REPLY • link 6.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

That is excellent, I find the construction of contrasts in R to be confusing and not all packages has the same flexibility as edgeR does so this will definitely help me in the future.

ADD REPLY • link 6.6 years ago ri.lars ▴ 40