Dear Gordon,
Thanks a lot for your answers. If I may summarize your points, I
understand you are saying that:
a) Gene-wise mixed-effects models, or gene-wise t-tests after averaging,
might not be a good idea precisely because they are fitted gene by gene,
which creates problems from the FDR perspective.
I think I understand this point. I forgot to add that after fitting each
model (either a t-test on the means of the tech. reps. or a gene-wise lme
fit) we can, as you had already suggested, use the eBayes function,
feeding it the appropriate components, to obtain moderated t-statistics.
b) Taking means of tech. reps. can be problematic when there are missing
values or the design is unbalanced.
c) We can fit the tech. reps. as fixed effects.
d) We can fit all tech. reps. as just plain reps (i.e., ignore the tech.
reps. issue) and use the usual limma machinery (including eBayes). This
way we might slightly underestimate the variances, but we might get more
stable variances and be better off in terms of FDR. (A minimal sketch of
what I mean by this follows the list.)
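
As a concrete reference for (d), something like the following is what I
have in mind (a minimal sketch only: the MA object and the column layout
are hypothetical, with three biological samples each hybridized twice with
a dye-swap, every array treated as an ordinary replicate, and the swaps
handled by the sign of the design column):

library(limma)
# hypothetical direct Trt-vs-Ctl comparison; MA holds normalized log-ratios
design <- cbind(TrtVsCtl = c(1, -1, 1, -1, 1, -1))  # -1 = dye-swapped array
fit <- lmFit(MA, design)   # every array enters as an ordinary replicate
fit <- eBayes(fit)         # the usual limma moderation
topTable(fit, coef = "TrtVsCtl")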
R.
> Dear Ramon,
>
> Thanks for your constructive post. I'm sorry my replies will have to be
> too brief.
>
> At 01:49 AM 1/04/2004, Ramon Diaz-Uriarte wrote:
> >Dear Gordon, Naomi, and BioC list,
> >
> >The issue of how to deal with technical replicates (such as those
> >obtained when we do dye-swaps of the same biological samples in cDNA
> >arrays) has come up in the BioC list several times. What follows is a
> >short summary, with links to mails in BioC plus some questions/comments.
> >
> >
> >There seem to be three major ways of approaching the issue:
>
> There is another practical approach which is to fit the technical
> replicate as a fixed effect rather than as a random effect. See Naomi's
> addition to the discussion summary. This works when the number of levels
> is not too large and there are a respectable number of replicates per
> level. It's not the absolutely ideal solution, but it can be good for
> people with the right sort of data who want to do something here and now
> with existing software.
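
If I understand the fixed-effects suggestion correctly, for something like
a common-reference design it might look roughly as follows (a sketch only,
and please correct me if I have misread you: the MA2 object and the layout
are hypothetical, with two RNA samples each hybridized on three arrays
against the reference, and one fixed-effect coefficient per RNA sample):

library(limma)
# one coefficient per RNA sample; arrays within a sample are tech. reps.
rna <- factor(c("A", "A", "A", "B", "B", "B"))
design <- model.matrix(~ 0 + rna)
colnames(design) <- levels(rna)
fit <- lmFit(MA2, design)                  # MA2: normalized log-ratios
cont <- cbind(AvsB = c(A = 1, B = -1))     # contrast between the RNA samples
fit2 <- eBayes(contrasts.fit(fit, cont))
topTable(fit2, coef = "AvsB")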
>
> >1. Treat the technical replicates as ordinary replicates
> >*************************************************************
> >E.g., Gordon Smyth in Sept. 2003
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/002405.html)
> >
> >However, this makes me (and Naomi Altman ---e.g.,
> >https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-December/003340.html)
> >uneasy (tech. reps. are not independent biological reps., which leads to
> >the usual inflation of dfs and deflation of se).
> >
> >I guess part of the key to Gordon's suggestion is his comment that even
> >if the s.e. are slightly underestimated, the ordering is close to being
> >the optimal one. But I don't see why the ordering ought to be much worse
> >if we use the means of technical replicates as in 3. below. (I haven't
> >done the math, but it seems that, especially in the presence of strong
> >tech. reps. covariance and a small number of independent samples, we
> >ought to be better off using the means of the tech. reps.)
>
> See comment below. There are some situations where I think this is still
> the best method within the limitations of existing Bioconductor
> software, especially when the number of independent samples is small.
>
> >2. Mixed effects models with subject as random effect (e.g., via lme).
> >***********************************************************************
> >
> >In late August of 2003 I asked about these issues, and Gordon seemed to
> >agree that trying the lme approach could be a way to go
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-August/002224.html).
> >
> >However, in September, I posted an answer and included code that, at
> >least for some cases, shows potential problems with using lme when the
> >number of technical replicates is small
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/002430.html).
> >
> >Nevertheless, Naomi Altman reports using lme/mixed models in a couple of
> >emails
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-December/003191.html;
> >https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/003481.html).
> >
> >After reading about randomizedBlock (package statmod) in a message in
> >BioC (I think from Gordon), I have tried again the mixed models
> >approach, since with tech. reps. and no other random effects, we should
> >be able to use randomizedBlock. Details in 5. below:
>
> I think that fitting any statistical model, random effects or otherwise,
> to each gene in a microarray is risky in terms of false discovery rate
> without some sort of moderation across the genes. See comment below.
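
For concreteness, the kind of per-gene lme fit I have been trying (subject
to exactly the moderation caveat above) looks roughly like this minimal
sketch, where the data frame d is hypothetical: M holds the positivized
log-ratios for one gene and sample identifies the biological sample that
groups the technical replicates:

library(nlme)
# one gene: random intercept for the biological sample; the technical
# replicates contribute the residual variance
fit <- lme(M ~ 1, random = ~ 1 | sample, data = d)
summary(fit)$tTable   # estimate, SE, df, t and p for the Trt-vs-Ctl effect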
>
> >3. Take the average of the technical replicates
> >****************************************************
> >Except for being possibly conservative (and not estimating the tech.
> >reps. variance component), I think this is a "safe" procedure.
> >This is what I have ended up doing routinely after my disappointing
> >tries with lme
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/002430.html)
> >and what Bill Kenworthy seemed to end up doing
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/003493.html).
> >
> >I think this is also what is done, at least sometimes, in the literature
> >(e.g., Huber et al., 2002, Bioinformatics, 18, S96--S104 [the vsn paper]).
>
> Two comments. Firstly, this strategy only works when you have a balanced
> design and no missing values. It would be hard for me to recommend it as
> a universal strategy because people send me emails all the time with
> half of their data missing.
>
> Secondly, this is "safe" in the usual statistical sense of ensuring the
> size of your test, for an individual gene, is not larger than the
> nominal size. But you are worrying only about bias when noise is also an
> issue. In the microarray context, it can pay in terms of overall false
> discovery rate to introduce bias into estimation of the variances if it
> makes the variances more stable.
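
To make 3. concrete, what I have been doing routinely amounts roughly to
the following sketch (with the eBayes step added as per (a) above). The MA
object and its column layout are hypothetical: columns 1-2, 3-4 and 5-6
are dye-swapped technical replicates of three biological samples.

library(limma)
# positivize the dye-swaps and average each pair of technical replicates,
# leaving one Trt-vs-Ctl log-ratio per gene and biological sample
Mavg <- cbind((MA$M[, 1] - MA$M[, 2]) / 2,
              (MA$M[, 3] - MA$M[, 4]) / 2,
              (MA$M[, 5] - MA$M[, 6]) / 2)
fit <- lmFit(Mavg, design = matrix(1, ncol(Mavg), 1))  # intercept-only fit
fit <- eBayes(fit)   # moderated one-sample t on the per-sample means
topTable(fit, coef = 1)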
>
> >*********
> >
> >4. Dealing with replicates in future versions of limma
> >***********************************************************
> >
> >Now, in Sept. 2003 Gordon mentioned that an explicit treatment of tech.
> >reps. would be in a future version of limma
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/002411.html)
> >and I wonder if Gordon meant via mixed-effects models, or some other
> >way, or if there has been some progress in this area.
>
> I and a student have done quite a bit of work in this direction, but I'm
> not ready yet to put out public software. My personal view, not to be
> forced on anyone else, is that fitting genewise mixed-effect models is
> not enough when the number of microarrays is small.
>
> >5. Using randomizedBlock
> >*****************************
> >In a simple setup of control and treatment with dye-swaps, I have done
> >some numerical comparisons of the outcome of a t-test on the mean of
> >the technical replicates with lme and with randomizedBlock. [The
> >function is attached]. The outcome of the t-test on the means of
> >replicates and randomizedBlock, in terms of the t-statistic, tends to
> >be the same (if we "positivize" the dye swaps). The only difference,
> >then, lies in the df we then use to put a p-value on the statistic. But
> >I don't see how we can use the dfs from randomizedBlock: they seem way
> >too large. Where am I getting lost?
>
> The function randomizedBlock() does REML for the variance components
> plus weighted least squares for the fixed effect coefficients for
> general unbalanced mixed models with only two variance components
> (including the usual error variance). It isn't restricted to RCB
> designs, which of course are much simpler. As far as I know, there is no
> theory establishing what is the right d.f. for testing contrasts in such
> models, although I have my own ideas and lme() does something ad hoc and
> conservative. What randomizedBlock() returns is simply the df from the
> weighted least squares, i.e., the df take into account estimation of the
> fixed effects only and not estimation of the variance components.
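
For reference, the per-gene call I have been making with randomizedBlock()
is roughly the following sketch. The data frame d is hypothetical, and the
argument names are my reading of the statmod documentation, so please
correct me if I have them wrong:

library(statmod)
# one gene: positivized log-ratios, with the biological sample as the
# single blocking (random) factor; REML for the two variance components
# plus weighted least squares for the fixed effect, as described above
rb <- randomizedBlock(d$M ~ 1, random = d$sample)
rb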
>
> Best
> Gordon
>
> >Best,
> >
> >
> >R.
> >
> >
> >
> >--
> >Ramón Díaz-Uriarte
> >Bioinformatics Unit
> >Centro Nacional de Investigaciones Oncológicas (CNIO)
> >(Spanish National Cancer Center)
> >Melchor Fernández Almagro, 3
> >28029 Madrid (Spain)
> >Fax: +-34-91-224-6972
> >Phone: +-34-91-224-6900
> >
> >http://bioinfo.cnio.es/~rdiaz
> >PGP KeyID: 0xE89B3462
> >(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)
--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)