Question

Help with limma contrast matrix

0

Entering edit mode

chris86 ▴ 420

@chris86-8408

Last seen 5.4 years ago

UCL, United Kingdom

Hi

I thought I had contrast matrices and limma worked out, but it appears I don't. The data frame is a series of columns (samples) with rows (genes) and I am splitting this up according on the different groups to compare which are in order.

len1 <- 7
len2 <- 12
len3 <- 5

group <- factor(c(rep(1, times = len1), rep(2, times = len2), rep(3, times = len3)))

design <- model.matrix(~0+group)
matrix <- data.matrix(data6)
colnames(design)[1] <- 's1'
colnames(design)[2] <- 's2'
colnames(design)[3] <- 's3'

contrast.matrix <- makeContrasts('s1-s2','s2-s3','s1-s3', levels = design)

fit <- lmFit(matrix,design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

top2 <- topTable(fit2,coef=1,number=Inf,sort.by="P")
sig <- subset(top2, adj.P.Val<0.05)
nrow(sig)

This gives me 400 differentially expressed genes. All I want to do is compare s1 gene profile with s2. However, when I exclude s3 (remove it from the data frame) and just run the following:

group <- factor(c(rep(1, times = len1), rep(2, times = len2)))
design <- model.matrix(~group) # from EdgeR
matrix <- data.matrix(data6)

#

fit <- lmFit(matrix,design)
fit <- eBayes(fit)
top2 <- topTable(fit,coef=2,number=Inf,sort.by="P")
sig <- subset(top2, adj.P.Val<0.05)

This gives a 1000 differentially expressed genes.

So the two methods are different, but all I thought I was doing in the first step was comparing 2 gene profiles so why the difference?

Thanks.

limma • 1.6k views

ADD COMMENT • link updated 9.7 years ago by Aaron Lun ★ 28k • written 9.7 years ago by chris86 ▴ 420

score 4 · Accepted Answer · 2015-08-24

When you fit an ANOVA model and then compute contrasts, the variance estimate is based on the sums of squares of error (SSE) from the model fit. This is basically the average of the intra-group variances for all groups in the model, and relies on the assumption that each group has relatively similar variances. In the first analysis you did, your variance estimate was based in part on the s3 group, which is obviously much more variable than s1 and s2. In the second analysis you only used s1 and s2 for the variance estimate, and by removing the noisier group you increased the number of genes found significant.

score 2 · Accepted Answer · 2015-08-24

The variance is estimated using all samples in the data set. So, in your first approach, you'll be using the information in s3 to estimate the variance for each gene. This means that the expression of the s3 samples (or specifically, the variability thereof) will affect the size of the p-values when testing for DE between s1 and s2, even though you're never actually comparing to s3 itself. Presumably, the samples in the s3 group are more variable, such that their removal from the analysis in the second approach results in a lower variance (or higher prior d.f.) and more DE.

Note that this isn't usually the case - typically, more samples provide more stable variance estimates and greater power to detect DE. I'd check the MDS plot and make sure there aren't any outliers or batch effects within the s3 group. If you're seeing aberrant behaviour, you might consider using the arrayWeights function to down-weight any of the odd samples. If there's nothing obviously wrong, though, I'd be inclined to analyse everything together (i.e., your first approach). For all we know, a larger true variance may be the correct state of affairs, and the second approach is overly liberal.

P.S. I'm also assuming that you've taken the s3 samples out of matrix in your second approach, otherwise nothing should work.