Entering edit mode
Dear Albyn,
It is always helpful to give a code example, preferably something
which
can be run by others, but at least showing us the details of the
situation. There isn't any way to confirm from your email that the
issue
you mention is the root cause of the problem. Your colleague's data
example would have to quite unusual for this to be true.
I have plenty of sympathy with researchers carefully examining the
microarray image and marking scratches or other imperfections, but it
is
hard to imagine that they would flag tens of thousands of spots in
this
way and, if they did, it might be better to drop the array entirely.
lmFit() already computes exact standard errors.
In many large data problems, in which many contrasts might be
experimented
with, it is convenient to be able to compute a basic fit object and
then
to apply various contrasts posthoc. This was especially true in the
early
days of microarray analysis with somewhat slower computers than today.
This is the reason for the separate contrasts.fit() function.
I made an early design decision that fit objects could be order
ngenes*nparameters in size, but not order ngenes*nparameters^2, so it
was
not possible to store genewise QR decompositions. This leads to the
computational efficiency issue, which applies to contrasts.fit, not to
lmFit.
If a dataset has so many spot-specific weights of different sizes, so
that
the approximation involved in contrasts.fit() is not trustworthy, then
it
is simple enough to run lmFit() a couple of times with different
parametrizations to get all the contrasts desired.
I have considered in the past adding a 'contrasts' argument to lmFit,
so
that users could fit the basic regression and simultaneously extract
all
desired contrasts, using the exact covariance matrices. (This is not
quite
as simple as you might think, because lmFit supports a number of
different
covariance structures, robust fits etc, so the changes have to be made
in
a number of low level functions. And it changes the structure of the
fit
object which is passed on.) I can still do so if there is a demand
for
it.
Best wishes
Gordon
> Date: Mon, 14 Dec 2009 11:15:05 -0800
> From: Albyn Jones <jones at="" reed.edu="">
> Subject: [BioC] missing values in limma/contrasts.fit
> To: bioconductor at stat.math.ethz.ch
> Message-ID: <20091214191505.GD11908 at laplace.reed.edu>
> Content-Type: text/plain; charset=us-ascii
>
> Dear BioConductor Folk
>
> The help file for contrasts.fit states:
>
> "Warning. For efficiency reasons, this function does not
> re-factorize the design matrix for each probe. A consequence is
> that, if the design matrix is non-orthogonal and the original
fit
> included quality weights or missing values, then the unscaled
> standard deviations produced by this function are approximate
> rather than exact. The approximation is usually acceptable...."
>
> My attention was attracted to the statement when a colleague in
> biology asked me why one would get different sets of probes
identified
> as differentially expressed, depending on which individual or
> biological sample was selected as the reference in a balanced loop
> design.
>
> My experience, admittedly limited, suggests that the computational
> efficiency gain is not worth the loss of accuracy. Even if one has
to
> sacrifice the efficiency of a single pass through the raw data, at
> least one gets correct results. I have hacked a version of lmFit to
> evaluate contrasts with standard errors based on the exact
covariance
> matrix. It runs esssentially as quickly as lmFit, so I find the
> efficiency argument uncompelling.
>
> A search of the archive produced several discussions of missing
values
> in limma. The main argument I see is Gordon Smyth's (Date:
2008-03-08)
>
> "The ideal solution is not to introduce missing values into your
> data in the first place. In my experimence, missing values are
> almost always avoidable. I have never seen a situation where it
> was necessary or desirable to introduce a large proportion of
> missing values."
>
> My colleagues in biology report that they inspect their arrays
> visually and note probes which have been scratched, probes covered
by
> background blobs and the like. These categories seem to satisfy the
> missing-at-random criterion: the probe is marked NA not because it
is
> saturated or below background, but because it was unreadable for
> reasons unrelated to the response.
>
> I'd appreciate feedback: has anyone else already done this? Would
> others find this useful? Are there objections I have overlooked?
>
> albyn
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}