Question

"Blocking" in design matrix prior to limma DE analysis

bhakti.dwivedi • 0

Hi,

I have the following design. Three cell types, A, B, and C obtained from three subjects.

   subject  condition
    1   A
    1   B
    1   C
    2   A
    2   B
    2   C
    3   A
    3   B
    3   C

I would like to compare the three conditions pairwise, with A as the primary reference: B vs A, C vs A, and C vs B. I am not interested in differences between the subjects and would like to adjust for them, although the data (PCA plots etc.) clearly show separation by condition and similarity among subjects. The data are RNA-seq counts, filtered for genes and normalized (TMM with voom).

I am thinking of "blocking" using the design matrix:

design <- model.matrix(~subject+condition)

This generates only five columns, which I read as three for the subjects and two for the conditions. Where is the third condition? And is the intercept the first subject? Is this correct, or am I doing something wrong?

How should I define contrasts to detect genes differentially expressed in condition B vs condition A, condition C vs condition A, condition C vs condition B, and in any of the three conditions? Do I specify them as below?

DGE <- DGEList(counts = exprdatafltd, group = metadata)
y <- calcNormFactors(DGE, method = "TMM")
v <- voom(y, design, plot = TRUE)
fit <- lmFit(v, design)
fit <- contrasts.fit(fit, coefficient=?)
fit <- eBayes(fit)

Appreciate any help or suggestions! Thank you.

limma

updated 5.0 years ago by James W. MacDonald • written 5.0 years ago by bhakti.dwivedi

score 1 · Answer 1 · 2020-03-24

A simple way to figure out what the coefficients are is to look at the rows, one by one. So your design matrix looks like

> model.matrix(~subject + condition, df)
  (Intercept) subject2 subject3 condition2 condition3
1           1        0        0          0          0
2           1        0        0          1          0
3           1        0        0          0          1
4           1        1        0          0          0
5           1        1        0          1          0
6           1        1        0          0          1
7           1        0        1          0          0
8           1        0        1          1          0
9           1        0        1          0          1

Right? Each row corresponds to one sample. The first row has a single 1, in the intercept column, and that sample is subject 1, condition A, so that is what the intercept represents. The next row has an additional 1 in the condition2 column, and that sample is subject 1, condition B. So we can infer that the condition2 coefficient is the difference between condition B and condition A for subject 1. Or you can do it algebraically:

Subj1_condB = Subj1_condA + X
#solve for X
X = Subj1_condB - Subj1_condA

Following that logic, the fifth column (condition3) is Subj1_condC - Subj1_condA. So what is column 2? It's the difference between Subj2_condA and Subj1_condA (you can do the algebra). And column 3 is Subj3_condA - Subj1_condA.
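If you want to check this reading numerically, here is a minimal base R sketch. The data frame `df`, the effect sizes, and the single made-up "gene" are all assumptions for illustration (they are not from the original post); the values are noise-free and additive, so a least-squares fit should simply hand back the numbers that were put in.

```r
# Hypothetical sample table mirroring the design in the question
df <- data.frame(
  subject   = factor(rep(1:3, each = 3)),
  condition = factor(rep(1:3, times = 3))  # 1 = A, 2 = B, 3 = C
)
design <- model.matrix(~ subject + condition, df)

# Made-up, noise-free expression values for one gene:
# baseline 10 (subject 1, condition A), subject shifts 0/2/5,
# condition shifts 0/1/3
subj_eff <- c(0, 2, 5)
cond_eff <- c(0, 1, 3)
expr <- 10 + subj_eff[as.integer(df$subject)] +
             cond_eff[as.integer(df$condition)]

# Ordinary least-squares fit with the same design matrix
coefs <- lm.fit(design, expr)$coefficients
print(round(coefs, 6))
```

With these toy numbers the intercept comes back as the subject 1, condition A value, the `subject2` and `subject3` coefficients as the subject shifts, and `condition2` and `condition3` as the B - A and C - A differences.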

Heuristically, you can think of it this way: you are making comparisons between conditions for subject 1, and using the data from subjects 2 and 3 by shifting them to the same level as subject 1 (by subtracting out the between-subject differences). Does that make sense?
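Given that interpretation of the columns, the three comparisons in the question can be written as weights on the coefficients. The sketch below is one possible way to set that up, not a tested pipeline: `df`, the toy expression values, and the names `BvsA`, `CvsA`, `CvsB` are assumptions for illustration, and the limma calls are left as comments because they need the real `v` object from the question.

```r
# Hypothetical sample table mirroring the design in the question
df <- data.frame(
  subject   = factor(rep(1:3, each = 3)),
  condition = factor(rep(1:3, times = 3))  # 1 = A, 2 = B, 3 = C
)
design <- model.matrix(~ subject + condition, df)

# One column per comparison; rows follow the columns of `design`.
# C vs B is (C - A) - (B - A), i.e. condition3 - condition2.
contr <- cbind(
  BvsA = c(0, 0, 0,  1, 0),
  CvsA = c(0, 0, 0,  0, 1),
  CvsB = c(0, 0, 0, -1, 1)
)
rownames(contr) <- colnames(design)

# Toy check on made-up, noise-free values for one gene
expr  <- 10 + c(0, 2, 5)[as.integer(df$subject)] +
              c(0, 1, 3)[as.integer(df$condition)]
coefs <- lm.fit(design, expr)$coefficients
est   <- drop(t(contr) %*% coefs)
print(est)

# With limma loaded and `v` from voom(), the same matrix could be used as:
# fit  <- lmFit(v, design)
# fit2 <- eBayes(contrasts.fit(fit, contrasts = contr))
# topTable(fit2, coef = "CvsB")              # one comparison
# topTable(fit2, coef = c("BvsA", "CvsA"))   # F-test: any condition difference
```

Only two of the three contrasts are needed for the "any difference" test, because C vs B is a linear combination of the other two.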