Dear Gordon,
Thanks a lot for your answers. If I may summarize your points, I
understand you are saying that:
a) Gene-wise mixed-effects models, or gene-wise t-tests after averaging,
might not be a good idea precisely because they are fitted gene by gene,
which creates problems from the FDR perspective.
I think I understand this point. I forgot to add that after fitting each
model (either a t-test on the means of the tech. reps. or a gene-wise lme
fit) we can, as you had already suggested, use the eBayes function,
feeding it the appropriate components, to obtain moderated t-statistics.
b) Taking means of tech. reps. can be problematic when there are missing
values or the design is unbalanced.
c) We can fit the tech. reps. as fixed effects.
d) We can fit all tech. reps. as just plain reps (i.e., ignore the tech.
reps. issue) and use the usual limma machinery (including eBayes). This
way we might slightly underestimate the variances, but we might get more
stable variances and be better off in terms of FDR. (A minimal sketch of
what I mean by this follows the list.)
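
As a concrete reference for (d), something like the following is what I
have in mind (a minimal sketch only: the MA object and the column layout
are hypothetical, with three biological samples each hybridized twice with
a dye-swap, every array treated as an ordinary replicate, and the swaps
handled by the sign of the design column):

library(limma)
# hypothetical direct Trt-vs-Ctl comparison; MA holds normalized log-ratios
design <- cbind(TrtVsCtl = c(1, -1, 1, -1, 1, -1))  # -1 = dye-swapped array
fit <- lmFit(MA, design)   # every array enters as an ordinary replicate
fit <- eBayes(fit)         # the usual limma moderation
topTable(fit, coef = "TrtVsCtl")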
R.
> Dear Ramon,
>
> Thanks for your constructive post. I'm sorry my replies will have to be
> too brief.
>
> At 01:49 AM 1/04/2004, Ramon Diaz-Uriarte wrote:
> >Dear Gordon, Naomi, and BioC list,
> >
> >The issue of how to deal with technical replicates (such as those
> >obtained when we do dye-swaps of the same biological samples in cDNA
> >arrays) has come up in the BioC list several times. What follows is a
> >short summary, with links to mails in BioC plus some questions/comments.
> >
> >
> >There seem to be three major ways of approaching the issue:
>
> There is another practical approach which is to fit the technical
> replicate as a fixed effect rather than as a random effect. See Naomi's
> addition to the discussion summary. This works when the number of levels
> is not too large and there are a respectable number of replicates per
> level. It's not the absolutely ideal solution, but it can be good for
> people with the right sort of data who want to do something here and now
> with existing software.
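
If I understand the fixed-effects suggestion correctly, for something like
a common-reference design it might look roughly as follows (a sketch only,
and please correct me if I have misread you: the MA2 object and the layout
are hypothetical, with two RNA samples each hybridized on three arrays
against the reference, and one fixed-effect coefficient per RNA sample):

library(limma)
# one coefficient per RNA sample; arrays within a sample are tech. reps.
rna <- factor(c("A", "A", "A", "B", "B", "B"))
design <- model.matrix(~ 0 + rna)
colnames(design) <- levels(rna)
fit <- lmFit(MA2, design)                  # MA2: normalized log-ratios
cont <- cbind(AvsB = c(A = 1, B = -1))     # contrast between the RNA samples
fit2 <- eBayes(contrasts.fit(fit, cont))
topTable(fit2, coef = "AvsB")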
>
> >1. Treat the technical replicates as ordinary replicates
> >*************************************************************
> >E.g., Gordon Smyth in Sept. 2003
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/002405.html)
> >
> >However, this makes me (and Naomi Altman ---e.g.,
> >https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-December/003340.html)
> >uneasy (tech. reps. are not independent biological reps., which leads to
> >the usual inflation of dfs and deflation of se).
> >
> >I guess part of the key to Gordon's suggestion is his comment that even
> >if the s.e. are slightly underestimated, the ordering is close to being
> >the optimal one. But I don't see why the ordering ought to be much worse
> >if we use the means of technical replicates as in 3. below. (I haven't
> >done the math, but it seems that, especially in the presence of strong
> >tech. reps. covariance and a small number of independent samples, we
> >ought to be better off using the means of the tech. reps.)
>
> See comment below. There are some situations where I think this is still
> the best method within the limitations of existing Bioconductor
> software, especially when the number of independent samples is small.
>
> >2. Mixed effects models with subject as random effect (e.g., via lme).
> >***********************************************************************
> >
> >In late August of 2003 I asked about these issues, and Gordon seemed to
> >agree that trying the lme approach could be a way to go
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-August/002224.html).
> >
> >However, in September, I posted an answer and included code that, at
> >least for some cases, shows potential problems with using lme when the
> >number of technical replicates is small
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/002430.html).
> >
> >Nevertheless, Naomi Altman reports using lme/mixed models in a couple of
> >emails
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-December/003191.html;
> >https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/003481.html).
> >
> >After reading about randomizedBlock (package statmod) in a message in
> >BioC (I think from Gordon), I have tried again the mixed models
> >approach, since with tech. reps. and no other random effects, we should
> >be able to use randomizedBlock. Details in 5. below:
>
> I think that fitting any statistical model, random effects or otherwise,
> to each gene in a microarray is risky in terms of false discovery rate
> without some sort of moderation across the genes. See comment below.
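
For concreteness, the kind of per-gene lme fit I have been trying (subject
to exactly the moderation caveat above) looks roughly like this minimal
sketch, where the data frame d is hypothetical: M holds the positivized
log-ratios for one gene and sample identifies the biological sample that
groups the technical replicates:

library(nlme)
# one gene: random intercept for the biological sample; the technical
# replicates contribute the residual variance
fit <- lme(M ~ 1, random = ~ 1 | sample, data = d)
summary(fit)$tTable   # estimate, SE, df, t and p for the Trt-vs-Ctl effect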
>
> >3. Take the average of the technical replicates
> >****************************************************
> >Except for being possibly conservative (and not estimating the tech.
> >reps. variance component), I think this is a "safe" procedure.
> >This is what I have ended up doing routinely after my disappointing
> >tries with lme
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/002430.html)
> >and what Bill Kenworthy seemed to end up doing
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-January/003493.html).
> >
> >I think this is also what is done, at least sometimes, in the literature
> >(e.g., Huber et al., 2002, Bioinformatics, 18, S96--S104 [the vsn paper]).
>
> Two comments. Firstly, this strategy only works when you have a balanced
> design and no missing values. It would be hard for me to recommend it as
> a universal strategy because people send me emails all the time with
> half of their data missing.
>
> Secondly, this is "safe" in the usual statistical sense of ensuring the
> size of your test, for an individual gene, is not larger than the
> nominal size. But you are worrying only about bias when noise is also an
> issue. In the microarray context, it can pay in terms of overall false
> discovery rate to introduce bias into estimation of the variances if it
> makes the variances more stable.
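
To make 3. concrete, what I have been doing routinely amounts roughly to
the following sketch (with the eBayes step added as per (a) above). The MA
object and its column layout are hypothetical: columns 1-2, 3-4 and 5-6
are dye-swapped technical replicates of three biological samples.

library(limma)
# positivize the dye-swaps and average each pair of technical replicates,
# leaving one Trt-vs-Ctl log-ratio per gene and biological sample
Mavg <- cbind((MA$M[, 1] - MA$M[, 2]) / 2,
              (MA$M[, 3] - MA$M[, 4]) / 2,
              (MA$M[, 5] - MA$M[, 6]) / 2)
fit <- lmFit(Mavg, design = matrix(1, ncol(Mavg), 1))  # intercept-only fit
fit <- eBayes(fit)   # moderated one-sample t on the per-sample means
topTable(fit, coef = 1)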
>
> >*********
> >
> >4. Dealing with replicates in future versions of limma
> >***********************************************************
> >
> >Now, in Sept. 2003 Gordon mentioned that an explicit treatment of tech.
> >reps. would be in a future version of limma
> >(https://www.stat.math.ethz.ch/pipermail/bioconductor/2003-September/002411.html)
> >and I wonder if Gordon meant via mixed-effects models, or some other
> >way, or if there has been some progress in this area.
>
> I and a student have done quite a bit of work in this direction, but I'm
> not ready yet to put out public software. My personal view, not to be
> forced on anyone else, is that fitting genewise mixed-effect models is
> not enough when the number of microarrays is small.
>
> >5. Using randomizedBlock
> >*****************************
> >In a simple setup of control and treatment with dye-swaps, I have done
> >some numerical comparisons of the outcome of a t-test on the mean of
> >the technical replicates with lme and with randomizedBlock. [The
> >function is attached]. The outcome of the t-test on the means of
> >replicates and randomizedBlock, in terms of the t-statistic, tends to
> >be the same (if we "positivize" the dye swaps). The only difference,
> >then, lies in the df we then use to put a p-value on the statistic. But
> >I don't see how we can use the dfs from randomizedBlock: they seem way
> >too large. Where am I getting lost?
>
> The function randomizedBlock() does REML for the variance components
> plus weighted least squares for the fixed effect coefficients for
> general unbalanced mixed models with only two variance components
> (including the usual error variance). It isn't restricted to RCB
> designs, which of course are much simpler. As far as I know, there is no
> theory establishing what is the right d.f. for testing contrasts in such
> models, although I have my own ideas and lme() does something ad hoc and
> conservative. What randomizedBlock() returns is simply the df from the
> weighted least squares, i.e., the df take into account estimation of the
> fixed effects only and not estimation of the variance components.
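
For reference, the per-gene call I have been making with randomizedBlock()
is roughly the following sketch. The data frame d is hypothetical, and the
argument names are my reading of the statmod documentation, so please
correct me if I have them wrong:

library(statmod)
# one gene: positivized log-ratios, with the biological sample as the
# single blocking (random) factor; REML for the two variance components
# plus weighted least squares for the fixed effect, as described above
rb <- randomizedBlock(d$M ~ 1, random = d$sample)
rb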
>
> Best
> Gordon
>
> >Best,
> >
> >
> >R.
> >
> >
> >
> >--
> >Ramón Díaz-Uriarte
> >Bioinformatics Unit
> >Centro Nacional de Investigaciones Oncológicas (CNIO)
> >(Spanish National Cancer Center)
> >Melchor Fernández Almagro, 3
> >28029 Madrid (Spain)
> >Fax: +-34-91-224-6972
> >Phone: +-34-91-224-6900
> >
> >http://bioinfo.cnio.es/~rdiaz
> >PGP KeyID: 0xE89B3462
> >(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)
--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)