Question

Difference in Goodness of fit when applying limma to one-colour vs. two-colour arrays

0

andreas.schuettler • 0

@andreasschuettler-8486

European Union

Hello,

I have a question about the outcome of the limma workflow when comparing one-colour studies with two-colour common reference studies. I noticed that the R-squared seems to be, on average, much higher for the models in one-colour studies than in two-colour studies, and I do not have an explanation for this.

As an example, I took two datasets from the limma User's Guide and followed the proposed workflow. I calculated the R-squared as proposed in this thread: "limma eBayes: how to determine goodness of fit?"

1. One Colour:

#####load data and libraries##########
source("http://bioconductor.org/biocLite.R")
biocLite("ecoliLeucine")
library("ecoliLeucine")
library(limma)
library(affy)
Data <- ecoliLeucine

#####limma workflow
eset <- rma(Data)
strain <- c("lrp-","lrp-","lrp-","lrp-","lrp+","lrp+","lrp+","lrp+")
design <- model.matrix(~factor(strain))
colnames(design) <- c("lrp-","lrp+vs-")
fit <- lmFit(eset, design)
fit <- eBayes(fit)
tabletop_0<-topTable(fit, coef=2, n=40, adjust="BH")

##Goodness of fit
sst<-rowSums(exprs(eset)^2)
ssr<-sst-fit$df.residual*fit$sigma^2
rsq<-ssr/sst
summary(rsq)

2. Two Colour:

load("../Apoa1.RData") ###downloaded from http://bioinf.wehi.edu.au/limma

MA <- normalizeWithinArrays(RG)
design <- cbind("Control-Ref"=1,"KO-Control"=MA$targets$Cy5=="ApoAI-/-")
fit <- lmFit(MA, design)
fit <- eBayes(fit)
tabletop_1<-topTable(fit,coef=2,number=15,genelist=fit$genes$NAME)

##Goodness of fit
sst<-rowSums(MA$M^2)
ssr<-sst-fit$df.residual*fit$sigma^2
rsq<-ssr/sst
summary(rsq)

So is there anything wrong with my calculation of R-squared (or anything else)? Or do the two-colour arrays have a worse fit? Or am I missing something important?

I appreciate any comments and help...

Best

Andreas

limma two-colour common reference goodness of fit • 2.0k views

updated 9.8 years ago by Gordon Smyth 52k • written 9.8 years ago by andreas.schuettler • 0

2

James W. MacDonald 68k

@james-w-macdonald-5106

United States

I'm not sure there is a real take-home message here. There are any number of things that could conspire to make the one-color array data have larger R-squared values than the two-color data.

For example, the E. coli data will tend to be more similar to technical replicates than the mouse data. Even though mice are highly inbred, taking several aliquots from the same solution of E. coli and growing in replicate flasks is not likely to impart much biological variability, so all things equal, I would expect lower intra-group variability for the E. coli data than the mouse data.

I didn't try to track down the provenance of the ApoA1 data, but it is highly likely that those data were generated by different people in a different lab at a different time than the E. coli data. Any one of those differences could impart higher intra-group variability to the ApoA1 data, which you are interpreting as platform differences.

If you really wanted to see if there is a difference between one and two-color data, the MAQC has a big data set on GEO that you could play with (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5350), where they took pools of the same RNA and sent it to multiple different labs for analysis, using multiple different one and two-color arrays. That would be closer to an apples vs apples comparison.

updated 9.8 years ago by Gordon Smyth 52k • written 9.8 years ago by James W. MacDonald 68k

score 4 · Accepted Answer · 2015-07-27

The reason you are getting much higher Rsq for the single channel platform is that you are not computing Rsq correctly; in particular, the expression for sst is not correct. The calculation you are using gives incorrectly large values for both platforms: slightly too large for the two colour platform and very much too large for the single channel platform.

I won't give you corrected formulas, because Rsq doesn't seem very useful to me. It just computes correlation between the predictor and the log-expression values, i.e., evaluates differential expression, and the topTable results from limma are a better way to achieve the same aim. You could even transform the moderated t-statistics to correlations if you wanted (but why would you?).
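For readers who nevertheless want to see the size of the effect, here is a minimal sketch on simulated data, not on either dataset from the question. It assumes the conventional definition of R-squared, with the total sum of squares taken about each gene's mean, and contrasts it with the uncentred version used in the question. The helper `rsq_both` and all simulated values are hypothetical. The last lines check the t-to-correlation remark for ordinary (unmoderated) t-statistics in a two-group design; extending it to moderated t-statistics would need a further assumption about which degrees of freedom to use.

```r
# Illustration on simulated data (all numbers are made up); base R only.
set.seed(1)
n_genes <- 200
group   <- factor(rep(c("A", "B"), each = 4))
design  <- model.matrix(~group)            # intercept + group difference

# Per-gene least squares fit, returning both versions of R-squared
rsq_both <- function(y, design) {
  res <- t(qr.resid(qr(design), t(y)))             # residuals, genes x arrays
  rss <- rowSums(res^2)                            # residual sum of squares
  sst_uncentered <- rowSums(y^2)                   # as in the question
  sst_centered   <- rowSums((y - rowMeans(y))^2)   # about the gene-wise mean
  cbind(uncentered = 1 - rss / sst_uncentered,
        centered   = 1 - rss / sst_centered)
}

noise <- function() matrix(rnorm(n_genes * 8, sd = 0.3), n_genes, 8)

# One-colour-like data: log2 intensities far from zero, no true group effect
y_one <- noise() + runif(n_genes, 6, 12)
# Two-colour-like data: log-ratios centred near zero, no true group effect
y_two <- noise() + rnorm(n_genes, sd = 0.1)

r_one <- rsq_both(y_one, design)
r_two <- rsq_both(y_two, design)
print(summary(r_one))
print(summary(r_two))

# Ordinary t-statistic for the group coefficient, converted to a squared
# correlation: r^2 = t^2 / (t^2 + residual df)
coefs    <- t(qr.coef(qr(design), t(y_one)))
df_resid <- ncol(y_one) - ncol(design)
rss      <- rowSums(t(qr.resid(qr(design), t(y_one)))^2)
sigma    <- sqrt(rss / df_resid)
se_unscaled <- sqrt(diag(solve(crossprod(design))))[2]
t_ord      <- coefs[, 2] / (sigma * se_unscaled)
rsq_from_t <- t_ord^2 / (t_ord^2 + df_resid)
print(all.equal(unname(rsq_from_t), unname(r_one[, "centered"])))
```

On log-intensities in the range 6 to 12 the uncentred sum of squares is dominated by each gene's mean level, so that version sits close to 1 whatever the fit. Log-ratios are centred near zero, so the inflation there is much smaller.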

If you want to compare the precision of the different microarray platforms, see this article for a careful platform comparison:

http://www.ncbi.nlm.nih.gov/pubmed/17118209

See the following article for a theoretical discussion of when a two colour common reference experiment will outperform a single channel version of the same platform:

http://www.biomedcentral.com/1471-2105/14/165

But that doesn't mean that an early home-made two-colour array (as used for the ApoA1 case study) will perform better than a commercial single channel array with a completely different chemistry (as used for the E. coli case study).