Hi David, BioC list,
Apologies in advance for the length of this email.
I have a few things to add to the advice already given; some might also
be relevant to the thread that Ben Bolstad mentioned in his reply:
http://files.protsuggest.org/biocond/html/1816.html
You asked if anyone has looked at this problem. I have studied
'subset-based' RMA strategies, including: the extrapolation approach
(take e.g. 50 chips and extrapolate that model to get RMA values for the
rest of the chips); partitioning the entire set of chips into sets of
manageable size (however many you can do in a run, like 50); and doing
this partitioning multiple times and averaging to get RMA values. The
'partitioning' approaches depend on having the entire set available.
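To make the extrapolation idea concrete, here is a minimal Python sketch
of its normalization half only: a reference distribution is estimated
once from a training set and then frozen, so each later chip is mapped
onto it on its own. The function names and the toy numbers are
assumptions made for illustration; real RMA also involves background
correction and probe-effect estimation, which are not shown.

```python
def reference_distribution(training_chips):
    """Mean of the sorted intensities across the training chips."""
    sorted_chips = [sorted(chip) for chip in training_chips]
    n_probes = len(sorted_chips[0])
    return [sum(s[i] for s in sorted_chips) / len(sorted_chips)
            for i in range(n_probes)]

def normalize_to_reference(chip, reference):
    """Map one chip onto the frozen reference by within-chip rank."""
    order = sorted(range(len(chip)), key=lambda i: chip[i])
    out = [0.0] * len(chip)
    for rank, probe in enumerate(order):
        out[probe] = reference[rank]
    return out

# Toy data: two training chips and one later chip, four probes each.
training = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 6.0, 3.0]]
reference = reference_distribution(training)

new_chip = [9.0, 7.0, 8.0, 6.5]
normalized_new = normalize_to_reference(new_chip, reference)
print(normalized_new)
```

Because the reference is fixed, the values for a given chip do not
change as more chips arrive. The price is that any artifact in the
training set is carried into every later chip.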
To get an idea of how much RMA values can vary, as well as how
inferences might vary, please see
http://mbi.osu.edu/2004/ws1materials/goldstein.pdf
I have a working manuscript on this and will be happy to send a preprint
when it's submitted.
You also ask if anyone has a solution. Unfortunately, I have to say no
here (at least for myself), but I also think that there will not be a
general solution. Rather, the way the issue is approached will depend on
the specifics of the study. There are many ways to get 1000 chips. For
instance, a lab may process a bunch of stored samples over a relatively
short period of time; alternatively, the same lab may process samples
coming in over a longer period of time, as in a prospective trial where
patients are recruited into the study over time. Another common
possibility is that multiple centers are collaborating on a larger
trial, with each center doing some processing of chips.
There may be different types of problems and artifacts in each of these
scenarios. For example, the first 50 chips in a study occurring over a
period of time may be qualitatively different from subsequent sets of
chips if there is a time trend for some reason. In the multi-center
case, between-lab variability is likely to be an important artifact.
Ben made the point that what you need are:
1. A consistent normalization step
2. Probe effects estimates made based on a reasonable number of arrays
I could not agree more with 1; however, in my opinion the problem is how
to achieve it. Some people seem to think that quantile normalization of
all chips together will safely remove all artifactual differences
between chips. This is emphatically _not_ true (and many people are
recognizing this). In an experiment replicated by the same lab a few
months apart (using different animals each time but following the same
protocols in all experimental aspects), the experimental 'batch' effect
persists even if you RMA all chips together. This is really easy to see
if you just cluster samples based on RMA values: the major split is
between the two replications. So, if you're hoping to get rid of this
kind of effect merely by RMAing all chips together, I think you are
likely to be disappointed. I have a preprint of this study if you want
more details.
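A small simulation can illustrate why this happens. The Python sketch
below is not RMA and does not use the data from the study above; it
applies a simplified quantile normalization to made-up chips in which a
hypothetical batch artifact shifts only a subset of probes. All names
and numbers are assumptions chosen for illustration.

```python
import random

def quantile_normalize(chips):
    """Give every chip the same intensity distribution: the mean of the
    sorted values across all chips, assigned back by within-chip rank."""
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(s[i] for s in sorted_chips) / len(chips)
                 for i in range(n_probes)]
    normalized = []
    for chip in chips:
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

# Simulated log-scale data (every number here is made up).
rng = random.Random(0)
n_probes, n_per_batch = 200, 4
base = [6.0 + 6.0 * ((i * 37) % n_probes) / n_probes for i in range(n_probes)]
affected = range(20)  # probes hit by the hypothetical batch artifact

def simulate_chip(batch_shift):
    """One chip: base level, optional probe-specific shift, small noise."""
    return [base[i] + (batch_shift if i in affected else 0.0)
            + rng.gauss(0.0, 0.1) for i in range(n_probes)]

batch_a = [simulate_chip(0.0) for _ in range(n_per_batch)]
batch_b = [simulate_chip(2.0) for _ in range(n_per_batch)]
normalized = quantile_normalize(batch_a + batch_b)

def batch_gap(probes):
    """Mean (batch B minus batch A) difference over the given probes."""
    a = [normalized[c][p] for c in range(n_per_batch) for p in probes]
    b = [normalized[c][p]
         for c in range(n_per_batch, 2 * n_per_batch) for p in probes]
    return sum(b) / len(b) - sum(a) / len(a)

gap_affected = batch_gap(range(20))
gap_other = batch_gap(range(20, n_probes))
print(f"affected probes: {gap_affected:.2f}, other probes: {gap_other:.2f}")
```

After normalization every chip has exactly the same distribution of
values, yet the affected probes still differ between the two batches,
because the artifact changes which probes are high or low within a chip
rather than the overall shape of the distribution.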
As for 2, I think that the number of arrays is only one component. The
arrays should also be somehow 'representative'. In practice, this might
be difficult to achieve. As you say, a moving target won't be easy to
hit (and will also cause confusion).
It is not only reasonable but, I would say, necessary that the
scientists examine early/preliminary results. What I would do in this
case is RMA the 'preliminary' set together if possible and base early
analyses on that. As more chips come in, I would most likely re-RMA once
'enough' had arrived. However, you still need to carry out careful
exploratory analyses to ensure that you are really removing the
artifacts that you think you are. What you should look for depends on
the specifics of your study. Persistent artifacts will need to be
removed by other means (by regression, for example).
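As one possible reading of 'by regression', the Python sketch below
removes the linear effect of a known batch indicator from the values of
a single probe set by ordinary least squares. The function name and the
example numbers are assumptions for illustration. Note that if batch is
confounded with the biological factor of interest, this adjustment
removes real signal along with the artifact.

```python
def regress_out(values, batch):
    """Subtract the fitted batch effect (simple least squares on one
    covariate) from each value, keeping the overall mean unchanged."""
    n = len(values)
    mean_y = sum(values) / n
    mean_x = sum(batch) / n
    sxx = sum((x - mean_x) ** 2 for x in batch)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(batch, values))
    slope = sxy / sxx  # for a 0/1 indicator: difference of batch means
    return [y - slope * (x - mean_x) for x, y in zip(batch, values)]

# Made-up expression values for one probe set on six chips.
values = [7.0, 7.2, 6.9, 8.1, 8.0, 8.2]
batch = [0, 0, 0, 1, 1, 1]  # 0 = first replication, 1 = second

adjusted = regress_out(values, batch)
print([round(v, 3) for v in adjusted])
```

After adjustment the two batches have the same mean for this probe set,
while the within-batch differences between chips are preserved.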
In the event that you are unable to RMA all your chips together, I would
recommend multiple partitioning to get 'final' RMA values for all chips.
This is in contrast to extrapolating from a single subset. Yes, the RMA
values will change, which may be confusing and an audit nightmare, but
you will give yourself some protection against 'locking in' an artifact
by averaging over different sets (which are likely to have different
artifacts). I see this as a major benefit.
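The multiple-partitioning scheme can be sketched in Python as follows.
This is only an outline under stated assumptions: `toy_summarize` is a
hypothetical stand-in for one multi-chip run and is NOT real RMA; it
merely shares the property that a chip's values depend on the other
chips in the run. The block size, number of repeats, and data are
likewise made up.

```python
import random

def toy_summarize(block):
    """Stand-in for one multi-chip run (not real RMA): subtract a
    per-probe effect estimated from the chips in this block only."""
    ids = list(block)
    n_probes = len(block[ids[0]])
    probe_effect = [sum(block[c][p] for c in ids) / len(ids)
                    for p in range(n_probes)]
    return {c: [block[c][p] - probe_effect[p] for p in range(n_probes)]
            for c in ids}

def multi_partition_average(chips, block_size, n_repeats, summarize, seed=0):
    """Randomly partition the chips into blocks, summarize each block,
    repeat with fresh partitions, and average each chip's values."""
    rng = random.Random(seed)
    ids = sorted(chips)
    totals = {c: [0.0] * len(chips[c]) for c in ids}
    for _ in range(n_repeats):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        # The last block is smaller if block_size does not divide evenly.
        for start in range(0, len(shuffled), block_size):
            block_ids = shuffled[start:start + block_size]
            result = summarize({c: chips[c] for c in block_ids})
            for c in block_ids:
                for p, value in enumerate(result[c]):
                    totals[c][p] += value
    return {c: [t / n_repeats for t in totals[c]] for c in ids}

# Twelve made-up chips with five probes each.
chips = {f"chip{c:02d}": [float((c * 7 + p * 3) % 11) for p in range(5)]
         for c in range(12)}
averaged = multi_partition_average(chips, block_size=4, n_repeats=5,
                                   summarize=toy_summarize)
print(averaged["chip00"])
```

Each chip appears in exactly one block per repeat, so every chip is
averaged over the same number of runs, each time alongside a different
random set of companions.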
Don't hesitate to write back, on or off list, if any of this seems
unclear.
Best regards,
Darlene
On Fri, 2005-06-03 at 09:07 +0100, David Kipling wrote:
> Hi
>
> This is not a "how do I process 1000 chips with RMA" but rather
> something slightly different.
>
> We're starting to get projects coming thru our Affy core that involve
> 1000+ chips. Obviously we can use MAS5 to process the .cel files, and
> irrespective of what happens with subsequent chips in the project the
> expression values from those chips will stay the same because of the
> single-chip nature of the algorithm.
>
> It would be nice to run, in parallel, RMA-style processing of the
> data. The issue this raises for me relates to the desire of the
> scientists to look at their data before the end of the project (e.g.
> you'd want to explore the first 200 cancer samples rather than wait
> for all 1000 to be done), which is understandable. My concern is that
> the multi-chip nature of RMA means that, for any specific .cel file,
> the expression values will depend on the other chips included in the
> run, and so the expression values from that .cel file will be
> different in the early stages (200 chips) and at the end (1000 chips).
> Such a 'moving target' dataset may be confusing and would certainly
> cause an audit headache.
>
> Has anyone explored this issue and proposed a solution? It's entirely
> possible that I am being totally paranoid and that after 100+ chips in
> a dataset the expression values plateau out and are stable in the face
> of additional .cel files being included; I don't yet have access to
> big-enough datasets to critically address that. I do have some
> recollection in the deep mists of time a comment (?from Ben Bolstad?)
> suggesting the use of a standard 'training set' of (say) 50 chips, to
> which you would add your new chips one at a time and process.
>
> All comments, thoughts, or experiences gratefully received!
>
> Regards
>
> David
>
>
>
> Prof David Kipling
> Department of Pathology
> School of Medicine
> Cardiff University
> Heath Park
> Cardiff CF14 4XN
>
> Tel: 029 2074 4847
> Email: KiplingD@cardiff.ac.uk
>
--
Darlene Goldstein
École Polytechnique Fédérale de Lausanne (EPFL)
Institut de Mathématiques
Batiment MA, Station 8    Tel: +41 21 693 2552
CH-1015 Lausanne          Fax: +41 21 693 4303
SWITZERLAND