Hi Lev,
> I would like to make some further points regarding
> filtering. Firstly, the bimodal behaviour of log
> transformed signals shown in the plots that I have
> posted (raw and filtered raw,
> http://tmgarden.cloud.prohosting.com/images/) is
> probably something specific to AB1700 and some other
> platforms, not Affymetrix though. Therefore,
> filtering of Affy data may not be a good idea.
> Secondly, it just happens that by filtering on
> signal-to-noise >=3 (threshold specified by ABI to
> distinguish badly measured signals) I remove the
> first peak of the distribution. I have observed this
> phenomenon for many AB1700 datasets and thus think
> that this first peak corresponding to low
> signal-to-noise probes is artificial and does not
> reflect real signal (I may be wrong here).
Actually, a bimodal distribution is exactly what I would expect to see
if a sizable percentage of probes on the array were not expressed in
your particular sample. This is very common for whole genome arrays,
and I often see it on Affymetrix arrays, where the total percent
present can be as low as 30-40%. Thus your two peaks are the
unexpressed probes (effectively "zero" but measured with error) and
the expressed probes, which might or might not have a normal
distribution. I don't think this is particular to AB1700 datasets, and
I don't think the first peak is "artificial"; instead, it represents
probes that are not expressed.
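
To make the mixture idea concrete, here is a minimal simulation sketch in
Python/NumPy. All of the numbers (35% of probes expressed, the means and
spreads of the two components) are made up for illustration and are not
estimated from AB1700 or Affymetrix data; the function name is hypothetical
too. It only shows that "zeros measured with error" plus expressed probes
give two peaks on the log2 scale.

```python
import numpy as np

def simulate_log2_signals(n_probes=20000, frac_expressed=0.35, seed=0):
    """Simulate log2 signals as a mixture of unexpressed and expressed probes.

    Every parameter value is an assumption chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    expressed = rng.random(n_probes) < frac_expressed
    # Unexpressed probes: true signal is "zero", so only background noise is seen.
    background = rng.normal(loc=5.0, scale=0.6, size=n_probes)
    # Expressed probes: signal well above background, with a wider spread.
    signal = rng.normal(loc=10.0, scale=1.5, size=n_probes)
    log2_signal = np.where(expressed, signal, background)
    return log2_signal, expressed

log2_signal, expressed = simulate_log2_signals()

# Crude text histogram: one '#' per 100 probes in each bin.
counts, edges = np.histogram(log2_signal, bins=40)
for count, left in zip(counts, edges[:-1]):
    print(f"{left:5.1f} {'#' * (count // 100)}")
```

The printed histogram should show a tall, narrow peak near background and a
broader one at higher intensity, with a valley in between.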
> Thirdly,
> as I pointed out before, low signal-to-noise does not
> always indicate low raw signal for a probe. My plots
> clearly show this. Therefore, this is not a case
> of discarding low-expressed probes from the
> analysis. I understand that filtering might lead to
> losing some interesting probes, but this is a
> trade-off between false positive and false negative
> results. So, it may be better for you to save some
> money and effort during validation stages.
Again, I would argue that you are throwing out "zeros", not
low-expressed probes. If you were to count, for each probe, the number
of arrays on which it fell below your filter criterion, what you would
probably find is an extreme bimodal distribution, where most probes are
either above background on all arrays or below background on all
arrays. I think it's fine to filter out (after normalization) those
that are below background on ALL arrays, which can cut out a
substantial chunk of probes and reduce the number of tests going into
the FDR correction. Usually there is only a small percentage of probes
that are above background on some arrays and below on other arrays. To
be conservative, I leave these in because they will not affect the FDR
calculations all that much and I don't want to lose probes that may be
off in one treatment and on in another treatment. Sorry I don't have a
graph of a typical bimodal distribution of "present" calls to show you,
but I'm at home today.
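
The per-probe count described above could be sketched roughly like this in
Python/NumPy. I'm assuming a probes x arrays matrix of signal-to-noise
values and the S/N >= 3 cutoff mentioned earlier in the thread; the function
names and the tiny example matrix are made up for illustration.

```python
import numpy as np

SN_THRESHOLD = 3.0  # the S/N >= 3 cutoff discussed earlier in this thread

def count_failing_arrays(sn, threshold=SN_THRESHOLD):
    """For each probe (row), count the arrays (columns) with S/N below threshold."""
    return (np.asarray(sn) < threshold).sum(axis=1)

def keep_unless_failing_everywhere(sn, threshold=SN_THRESHOLD):
    """Boolean mask: True for probes that pass the cutoff on at least one array."""
    sn = np.asarray(sn)
    return count_failing_arrays(sn, threshold) < sn.shape[1]

# Made-up S/N values: 4 probes (rows) x 3 arrays (columns).
sn = np.array([
    [12.0, 15.0, 9.0],  # above the cutoff on all arrays -> kept
    [0.5, 1.2, 0.8],    # below the cutoff on all arrays -> dropped
    [4.0, 0.9, 1.1],    # mixed -> kept, to be conservative
    [2.9, 3.0, 2.5],    # exactly at the cutoff on one array -> kept
])
n_fail = count_failing_arrays(sn)
keep = keep_unless_failing_everywhere(sn)

# Number of probes failing on 0, 1, 2, ... arrays; on real data I would
# expect most of the mass at the two ends.
print(np.bincount(n_fail, minlength=sn.shape[1] + 1))
print(keep)
```

The mask would then be applied to the rows of the normalized expression
matrix, not to the raw data.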
> Also, it is often assumed that log transformed raw
> signal is roughly Normal. Is this assumption
> required for the normalization stage? If so, then
> removing the peak corresponding to low
> signal-to-noise should be advantageous.
The log transformation does help to compress the range of expression
values and reduces the dependence of the variance on the mean, but I
can't recall it being said anywhere that the data should be normal
after transformation. Furthermore, normality is not an assumption for
normalization, only that the distributions for each array should be the
SAME, whatever the shape of the distribution. Unless there is something special about
AB1700 arrays (I confess I don't have any experience with them), I
think the bimodality represents real measured signal for all arrays,
and it's better to use all available data for the pre-processing
steps, but after normalization it's fine to remove probes that fail to
pass a conservative filter on ALL arrays. Even if you want to use your
filter of removing probes that have >50% "bad" signals within a
treatment, remove a probe only if it has >50% "bad" signals in ALL
treatments.
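
As a rough sketch of that last rule in Python/NumPy: a probe is flagged for
removal only if more than 50% of its signals are "bad" within every
treatment. I'm assuming "bad" means S/N below 3, as earlier in the thread;
the function name, treatment labels and example values are hypothetical.

```python
import numpy as np

def bad_in_all_treatments(sn, treatments, threshold=3.0, max_bad_frac=0.5):
    """True for probes with more than max_bad_frac "bad" signals in EVERY treatment.

    sn: probes x arrays matrix of signal-to-noise values.
    treatments: one label per array (column).
    """
    sn = np.asarray(sn)
    treatments = np.asarray(treatments)
    bad = sn < threshold
    # One boolean vector per treatment: does the probe fail within that treatment?
    fails = [
        bad[:, treatments == t].mean(axis=1) > max_bad_frac
        for t in np.unique(treatments)
    ]
    # Remove only the probes that fail within all treatments.
    return np.all(fails, axis=0)

# Made-up example: 3 probes, 3 control arrays and 3 treated arrays.
treatments = ["ctrl", "ctrl", "ctrl", "trt", "trt", "trt"]
sn = np.array([
    [0.5, 1.0, 0.7, 8.0, 9.5, 7.2],  # off in ctrl, on in trt -> keep
    [0.4, 0.9, 5.0, 1.1, 0.3, 0.8],  # >50% bad in both treatments -> remove
    [6.0, 7.0, 1.0, 5.5, 0.9, 6.1],  # mostly good in both -> keep
])
remove = bad_in_all_treatments(sn, treatments)
print(remove)
```

The first probe is the case I worry about: it looks "bad" in one treatment
but is clearly on in the other, so it stays in the analysis.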
Cheers,
Jenny
>
>
> Jenny Drnevich <drnevich at uiuc.edu> wrote:
>
> Hi Lev,
>
> There have been several discussions about when to filter out data on
> this list previously, and the consensus has been to NOT filter until
> after all pre-processing steps (e.g., normalization) have been done.
> One reason is that one array may have had a higher background than
> others, and so more data values would be removed in your scheme,
> which can be problematic for many normalization routines. I also
> would caution you against removing "badly measured signals" from your
> data set even after pre-processing. While these numbers may not be as
> accurate as larger numbers, they represent very low expression or no
> expression. Would you remove all the zeros from any set of data? My
> rationale is that had there been distinct expression, you would have
> measured it, therefore the low values near background are valid, if
> not as completely accurate. In the worst case scenario, you would
> miss genes that weren't expressed in one treatment but were expressed
> in another treatment because you were throwing out all the data from
> the non-expressed treatment. If the signals were "badly measured" in
> ALL samples, then I would remove that entire probe from the analysis
> (after pre-processing), but not if they were badly measured in only a
> few samples.
>
> That's my two cents,
> Jenny
>
> At 08:59 AM 7/12/2007, Lev Soinov wrote:
> > Dear List,
> > I have posted a similar question before, but would like to ask you again
> > about filtering strategies. I have some AB1700 data and filter on signal to
> > noise ratios before normalization. The rationale is to get rid of badly
> > measured signals before actual processing of the data. Two jpg
> > histograms of log2 signal distributions, before (raw.jpg) and after
> > (filtered.jpg) filtering, can be seen in this location:
> > http://tmgarden.cloud.prohosting.com/images/
> > Could you please have a look at the distributions and comment on whether
> > this is correct to filter before normalization as this changes
> > the distribution of signals a lot?
> > Thank you very much for your help.
> > Lev.
> >
> >
> >
> >_______________________________________________
> >Bioconductor mailing list
> >Bioconductor at stat.math.ethz.ch
> >https://stat.ethz.ch/mailman/listinfo/bioconductor
> >Search the archives:
> >http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> Jenny Drnevich, Ph.D.
>
> Functional Genomics Bioinformatics Specialist
> W.M. Keck Center for Comparative and Functional
> Genomics
> Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
>
> 330 ERML
> 1201 W. Gregory Dr.
> Urbana, IL 61801
> USA
>
> ph: 217-244-7355
> fax: 217-265-5066
> e-mail: drnevich at uiuc.edu
Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu