technical replicates (again!): a summary

0

Entering edit mode

Ramon Diaz ★ 1.1k

@ramon-diaz-159

Last seen 10.2 years ago

Dear Gordon, Naomi, and BioC list, The issue of how to deal with technical replicates (such as those obtained when we do dye-swaps of the same biological samples in cDNA arrays) has come up in the BioC list several times. What follows is a short summary, with links to mails in BioC plus some questions/comments. There seem to be three major ways of approaching the issue: 1. Treat the technical replicates as ordinary replicates ************************************************************* E.g., Gordon Smyth in sept. 2003 (https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/0 02405.html) However, this makes me (and Naomi Altman ---e.g., https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-December/003 340.html) uneasy (tech. reps. are not independent biological reps. which leads to the usual inflation of dfs and deflation of se). I guess part of the key to Gordon's suggestion is his comment that even if the s.e. are slightly underestimated, the ordering is close to being the optimal one. But I don't see why the ordering out to be much worse if we use the means of technical replicates as in 3. below. (Haven't done the math, but it seems that, specially in the pressence of strong tech. reps. covariance and small number of independent samples we ought to be better of using the means of the tech. reps). 2. Mixed effects models with subject as random effect (e.g., via lme). ********************************************************************** ******** In late August of 2003 I asked about these issues, and Gordon seemed to agree that trying the lme approach could be a way to go. (https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-August/0022 24.html). However, in September, I posted an aswer and included code that, at least for some cases, shows potential problems with using lme when the number of technical replicates is small. (https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/0 02430.html) Nevertheless, Naomi Altman reports using lme/mixed models in a couple of emails (https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-December/00 3191.html; (https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/003 481.html). After reading about randomizedBlock (package statmod) in a message in BioC (I think from Gordon), I have tried aggain the mixed models approach, since with tech. reps and no other random effects, we should be able to use randomizedBlock. Details in 5. below: 3. Take the average of the technical replicates **************************************************** Except for being possibly conservative (and not estimating tech. reps. variance component), I think this is a "safe" procedure. This is what I have ended up doing routinely after my disappointing tries with lme (https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/0 02430.html) and what Bill Kenworthy seemed to end up doing (https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/003 493.html). I think this is also what is done at least some times in literature (e.g., Huber et al., 2002, Bioinformatics, 18, S96--S104 [the vsn paper]). ********* 4. Dealing with replicates in future versions of limma *********************************************************** Now, in Sept. 2004 Gordon mentioned that an explicit treatment of tech. reps. would be in a future version of limma ( https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/ 002411.html) and I wonder if Gordon meant via mixed-effects models, or some other way, or if there has been some progress in this area. 5. Using randomizedBlock ***************************** In a simple set up of control and treatment with dye-swaps, I have done some numerical comparisons of the outcome of a t-test on the mean of the technical replicates with lme and with randomizedBlock. [The function is attached]. The outcome of the t-test of the means of replicates and randomizedBlock, in terms of the t-statistic, tends to be the same (if we "positivize" the dye swaps). The only difference, then, lies in the df we then use to put a p-value on the statistic. But I don't see how we can use the dfs from randomizedBlock: they seem way too large. Where am I getting lost? Best, R. -- Ram?n D?az-Uriarte Bioinformatics Unit Centro Nacional de Investigaciones Oncol?gicas (CNIO) (Spanish National Cancer Center) Melchor Fern?ndez Almagro, 3 28029 Madrid (Spain) Fax: +-34-91-224-6972 Phone: +-34-91-224-6900 http://bioinfo.cnio.es/~rdiaz PGP KeyID: 0xE89B3462 (http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc) -------------- next part -------------- library(nlme) library(limma) library(statmod) options(contrasts = c("contr.helmert", "contr.poly")) f3 <- function(effect.size = 0, s.biol = 2, s.tech = 1, biological.repls = 5, n.tech.reps = 10){ ## For one-sample comparisons, with dye-swaps ## generates data, and fits models using both limma ## and lme and one-sample t-tests on the mean of the ## technical replicates if(getOption("contrasts")[1] != "contr.helmert") stop("You need to set Helmert contrasts in options") if(n.tech.reps %% 2) stop("Need even number of n.tech.reps") ## simulate data tech.rep.effect <- rnorm(biological.repls, mean = 0, sd = s.biol) tech.rep.effect <- rep(tech.rep.effect, rep(n.tech.reps, biological.repls)) data.l <- data <- rnorm(biological.repls * n.tech.reps, mean = 0, sd = s.tech) + tech.rep.effect + effect.size dim(data.l) <- c(1, biological.repls * n.tech.reps) ## design matrices, etc. tech.reps <- factor(rep(1:biological.repls, rep(n.tech.reps, biological.repls))) dm1 <- model.matrix(data ~ tech.reps) dye.swap.codes <- rep(c(-1, 1), (n.tech.reps * biological.repls)/2) data.sp <- data.l * dye.swap.codes data.s <- data * dye.swap.codes ## model fitting tmp1 <- lm.series(data.l, dm1) tmp2 <- lm.series(data.sp, dm1 * dye.swap.codes) tmp3 <- summary(lme(data ~ 1, random = ~ 1|tech.reps, control = list(tolerance = 1e-10, msTol = 1e-10))) tmp4 <- summary(lme(data.s ~ -1 + dye.swap.codes, random = ~ 1|tech.reps, control = list(tolerance = 1e-10, msTol = 1e-10))) tmp5 <- randomizedBlock(data ~ 1, random = tech.reps, tol = 1e-10) tmp6 <- randomizedBlock(data.s ~ -1 + dye.swap.codes, random = tech.reps, tol = 1e-10) tmp1 <- c(tmp1$coefficients[[1]], tmp1$stdev.unscaled[[1]], tmp1$sigma, tmp1$sigma *tmp1$stdev.unscaled[[1]], tmp1$df) names(tmp1) <- c("coeff", "unscaled.stdev", "sigma", "scaled se", "df") tmp2 <- c(tmp2$coefficients[[1]], tmp2$stdev.unscaled[[1]], tmp2$sigma, tmp2$sigma *tmp2$stdev.unscaled[[1]], tmp2$df) rB <- rbind(c(tmp5$coefficients[[1]], tmp5$se.coefficients, tmp5$df.residual, sqrt(tmp5$varcomp)), c(tmp6$coefficients[[1]], tmp6$se.coefficients, tmp6$df.residual, sqrt(tmp6$se.varcomp))) rB <- cbind(rB, rB[, 1]/rB[, 2]) colnames(rB)[c(1:3, 6)] <- c("coeff", "se.coeff", "df", "t-value") rownames(rB) <- c("rB.positivized", "rB.swap.design") cat("\n\n *************\n\n") mean.by.tech.rep <- tapply(data, tech.reps, mean) the.t.results <- unlist(t.test(mean.by.tech.rep)[1:6]) the.t.se <- sqrt(var(mean.by.tech.rep)/length(unique(tech.reps))) lme.and.t <- rbind(lme.positivized.t.table = tmp3$tTable, lme.swap.design.t.table = tmp4$tTable, t.test.results = c(the.t.results[6], the.t.se, the.t.results[c(2, 1, 3)])) rownames(lme.and.t) <- c("lme.positivized", "lme.swap", "t.test") return(list( lme.and.limma = rbind( lm.series.positivized = tmp1, lm.series.swap.design = tmp2, lme.positivized = c(tmp3$coefficients[[1]], sqrt(tmp3$varFix)/tmp3$sigma, tmp3$sigma, sqrt(tmp3$varFix), tmp3$tTable[[3]]), lme.swap.design = c(tmp4$coefficients[[1]], sqrt(tmp4$varFix)/tmp4$sigma, tmp4$sigma, sqrt(tmp4$varFix), tmp4$tTable[[3]])), lme.and.t, randomizedBlockaAnalyses = rB)) } ## example call: f3(2, 2, 1, 3, 2) f3(0, 2, 1, 3, 2) f3(0, 2, 4, 3, 2) f3(0, 12, 4, 3, 2) f3(2, 12, 4, 3, 2) f3(10, 12, 4, 3, 2)

GO Cancer vsn GO Cancer vsn • 1.9k views

ADD COMMENT • link updated 20.7 years ago by Naomi Altman ★ 6.0k • written 20.7 years ago by Ramon Diaz ★ 1.1k

0

Entering edit mode

Ramon Diaz ★ 1.1k

@ramon-diaz-159

Last seen 10.2 years ago

I forgot to add that, under > 3. Take the average of the technical replicates > **************************************************** > Except for being possibly conservative (and not estimating tech. reps. > variance component), I think this is a "safe" procedure. > This is what I have ended up doing routinely after my disappointing tries > with lme > (https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September /002430 >.html) and what Bill Kenworthy seemed to end up doing > (https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/0 03493.h >tml). > > I think this is also what is done at least some times in literature (e.g., > Huber et al., 2002, Bioinformatics, 18, S96--S104 [the vsn paper]). > > ********* If we have multiple treatments or conditions, and one single type of control, and we hybridize each treatment against the control, then we can analyze these experiments, after taking the mean of the tech. reps., following, for instance, the example in section 7.2 of the limma mannual ("Affymetrix and Other single-channel designs"). I think this set-up is not that rare in some places. R.

ADD COMMENT • link 20.7 years ago Ramon Diaz ★ 1.1k

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.6 years ago

United States

Dear Ramon, I did not look through all of your R-code, but here is what statistical theory says: If you treat the biological subject as a random effect, then the mixed model ANOVA tests of all the treatment and other effects are identical to what you get if you average the technical reps (within treatment). (This assumes a balanced dye-swap design, with no missing observations.) If you run lme on the experiment, and set everything up correctly, this is what you will get and the df will be the same. If you use SAS PROC MIXED with a Type III analysis, you will also get this. If you use SAS PROC MIXED with the REML (default) option, you may get something different because of the way the variance of the random effects are estimated. When the variances match those obtained from the Type III analysis, then you will get the same ANOVA as the other methods. However, when the variance are estimated to be 0, then the biological subject effect is (in effect) dropped from the model, so that the technical and biological replicates are treated identically. (This does not happen in lme, which also uses REML, because lme estimates the log(variance) which cannot ever be 0.) After struggling with this (and huge amounts of imbalance due to bad spots, unequal spot replication, etc) I am pretty much settling for doing the averaging (over all technical reps) before sending my data to the ANOVA routine. (I should do a weighted analysis is account for this, but so far I have not.) To do this, I have to do a lot of slow preprocessing. However, because I am working in a non-medical environment in which the number of replicates is generally small, the number of treatments is generally large, and investigator patience is reasonably good, this seems to be the best solution for our group. The biggest draw-back is that to date I have not fully automated the procedure, so I cannot off-load the data-processing to a research assistant (or publish it to this list). Regarding randomized blocks: We have a RCB design if each biological sample gets all of the treatments (e.g. cancer cells and normal cells from the same subject; different organs from the same mouse). In this case, the "error" for testing treatment effects is biological sample * treatment, and the technical replicates appear as "pure error". The d.f. should be correct. If I have time, I will look carefully at your design and model over the weekend to see if I can determine why you are not getting the right d.f. I hope that this is helpful, and apologize for not giving this all of the attention needed. --Naomi At 10:49 AM 3/31/2004, Ramon Diaz-Uriarte wrote: >Dear Gordon, Naomi, and BioC list, > >The issue of how to deal with technical replicates (such as those obtained >when we do dye-swaps of the same biological samples in cDNA arrays) has come >up in the BioC list several times. What follows is a short summary, with >links to mails in BioC plus some questions/comments. > > >There seem to be three major ways of approaching the issue: > > >1. Treat the technical replicates as ordinary replicates >************************************************************* >E.g., Gordon Smyth in sept. 2003 >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/ 002405.html) > >However, this makes me (and Naomi Altman ---e.g., >https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-December/00 3340.html) > >uneasy (tech. reps. are not independent biological reps. which leads to the >usual inflation of dfs and deflation of se). > >I guess part of the key to Gordon's suggestion is his comment that even if >the >s.e. are slightly underestimated, the ordering is close to being the optimal >one. But I don't see why the ordering out to be much worse if we use the >means of technical replicates as in 3. below. (Haven't done the math, but it >seems that, specially in the pressence of strong tech. reps. covariance and >small number of independent samples we ought to be better of using the means >of the tech. reps). > > >2. Mixed effects models with subject as random effect (e.g., via lme). >********************************************************************* ********* > >In late August of 2003 I asked about these issues, and Gordon seemed to agree >that trying the lme approach could be a way to go. >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-August/002 224.html). > >However, in September, I posted an aswer and included code that, at least for >some cases, shows potential problems with using lme when the number of >technical replicates is small. >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/ 002430.html) > >Nevertheless, Naomi Altman reports using lme/mixed models in a couple of >emails >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-December/0 03191.html; > >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/00 3481.html). > >After reading about randomizedBlock (package statmod) in a message in BioC (I >think from Gordon), I have tried aggain the mixed models approach, since with >tech. reps and no other random effects, we should be able to use >randomizedBlock. Details in 5. below: > > >3. Take the average of the technical replicates >**************************************************** >Except for being possibly conservative (and not estimating tech. reps. >variance component), I think this is a "safe" procedure. >This is what I have ended up doing routinely after my disappointing tries >with >lme >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/ 002430.html) >and what Bill Kenworthy seemed to end up doing >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/00 3493.html). > >I think this is also what is done at least some times in literature (e.g., >Huber et al., 2002, Bioinformatics, 18, S96--S104 [the vsn paper]). > >********* > >4. Dealing with replicates in future versions of limma >*********************************************************** > >Now, in Sept. 2004 Gordon mentioned that an explicit treatment of tech. reps. >would be in a future version of limma >( >https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/0 02411.html) >and I wonder if Gordon meant via mixed-effects models, or some other way, or >if there has been some progress in this area. > > > >5. Using randomizedBlock >***************************** >In a simple set up of control and treatment with dye-swaps, I have done some >numerical comparisons of the outcome of a t-test on the mean of the technical >replicates with lme and with randomizedBlock. [The function is attached]. The >outcome of the t-test of the means of replicates and randomizedBlock, in >terms of the t-statistic, tends to be the same (if we "positivize" the dye >swaps). The only difference, then, lies in the df we then use to put a >p-value on the statistic. But I don't see how we can use the dfs from >randomizedBlock: they seem way too large. Where am I getting lost? > > >Best, > > >R. > > > >-- >Ram?n D?az-Uriarte >Bioinformatics Unit >Centro Nacional de Investigaciones Oncol?gicas (CNIO) >(Spanish National Cancer Center) >Melchor Fern?ndez Almagro, 3 >28029 Madrid (Spain) >Fax: +-34-91-224-6972 >Phone: +-34-91-224-6900 > >http://bioinfo.cnio.es/~rdiaz >PGP KeyID: 0xE89B3462 >(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc) > > > >_______________________________________________ >Bioconductor mailing list >Bioconductor@stat.math.ethz.ch >https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Bioinformatics Consulting Center Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD COMMENT • link 20.7 years ago Naomi Altman ★ 6.0k

Login before adding your answer.