Dear Wolfgang,
With all respect, I meant exactly what I said.
You have taken the discussion out of context, and some of your claims
are wrong in my opinion.
On Sun, 26 May 2013, Wolfgang Huber wrote:
> Dear Gordon
>
>> The literature tends to say that the reason for filtering is to reduce
>> the amount of multiple testing, but in truth the increase in power from
>> this is only slight. The more important reason for filtering in most
>> applications is to remove highly variable genes at low intensities.
>> The importance of filtering is highly dependent on how you
>> pre-processed your data. Filtering is less important if you (i) use a
>> good background correction or normalising method that damps down
>> variability at low intensities and (ii) use eBayes(trend=TRUE) which
>> accommodates a mean-variance trend.
You have taken out of context one paragraph from my reply to Miriam:
https://www.stat.math.ethz.ch/pipermail/bioconductor/2013-May/052816.html
I was answering a specific question about the limma package, but you
have lost that context. You don't even include the date of the post you
are replying to.
> With all respect, I think this paragraph mixes up two separate issues
> and can benefit from clarification.
>
> 1. While literature can probably be found to support any statement, the
> above-cited reason is indeed bogus when multiple testing is performed
> with an FDR objective.
Not bogus. Just less important than some other considerations.
> The paper by Bourgon et al. motivates filtering differently, namely by
> using a filter criterion that is independent of the test statistic under
> the null (thus does not affect type-I error; some subtlety is discussed
> in that paper) but dependent under the alternative (thus improves
> power).
This is a good time to recall that the question was about filtering with
the limma package, not about filtering in conjunction with t-tests or
permutation tests. Your paper (Bourgon et al.) provides no motivation for
filtering in conjunction with limma. Quite the opposite, your paper
concludes (incorrectly IMO) on its final page that limma needs to be used
unfiltered.

In reality, filtering low intensity probes (not low variance probes) is
usually of benefit to limma, and we do this routinely for nearly all
analyses in my lab. This is for a number of reasons.

First there is the generic (not specific to limma) reason that probes that
are not detecting real signal to any worthwhile degree for any sample
cannot be detecting DE to any worthwhile degree. Therefore there is a
positive correlation between mean log intensity and true DE.

Second there is the limma-specific reason that probes that are not
detecting signal above background levels in any sample tend to have
atypical variances, both in absolute size and in terms of mean-variance
relationship, compared to probes that are responding to genuine biological
signal. In other words, non-expressed or dead probes have variances that
cannot be considered to be sampled from the same population as the
variances of regular expressed probes. It is desirable to get rid of
these atypical probes so that limma can concentrate on the behaviour of
probes of genuine interest.

Filtering by mean log-intensity does not cause any problems for the limma
probabilistic model. Indeed it generally improves concordance with the
empirical Bayes assumptions.
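A minimal sketch of filtering by mean log-intensity before the empirical
Bayes step, for readers who want to try it. Everything here is an
assumption made for illustration: the data are simulated, the two-group
design is invented, and the cutoff of 5 is a hypothetical value that
would need to be chosen from the data in a real analysis.

```r
library(limma)

set.seed(1)

# Simulated log2-intensities: 2000 probes on 6 arrays (illustrative only)
n_probes <- 2000
base <- runif(n_probes, min = 3, max = 14)  # per-probe mean log-intensity
y <- base + matrix(rnorm(n_probes * 6, sd = 0.3), n_probes, 6)

# Two groups of three arrays (hypothetical design)
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

# Keep probes whose mean log-intensity exceeds a hypothetical cutoff.
# The filter uses only the row means, not the group labels.
cutoff <- 5
keep <- rowMeans(y) > cutoff

# Fit the linear model on the retained probes, then moderate the
# variances while allowing for a mean-variance trend
fit <- lmFit(y[keep, ], design)
fit <- eBayes(fit, trend = TRUE)

# Top-ranked probes for the group effect (second design column)
topTable(fit, coef = 2, number = 5)
```

Comparing the results with and without the `keep` subset on one's own
data is a simple way to see how much the filter matters for a given
platform and preprocessing method.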
> 2. "Highly variable genes at low intensities" are indeed a problem of
> bad preprocessing and are better dealt with at that level, not by
> filtering.
I agree in most cases, but it's not universally true. Pre-processing
methods that damp down variability at low intensities also tend to
attenuate fold changes. In some applications it can be legitimate to allow
higher variability at low intensities in order to maintain dynamic range
in the fold changes. voom is one such application where the preprocessed
and normalized expression values are deliberately kept more variable at
the low end than the high end.
> Nowadays, the commonly used methods for expression microarray or RNA-Seq
> analysis that I am aware of avoid that problem.
Yes, the high variability is gone but the non-expressed probes are still
atypical. With most commonly used methods, the non-expressed probes now
have atypically small variances. For example, the RMA algorithm (used in
your paper) yields a mean-variance relationship that increases at low
intensities then decreases again at high intensities. The lowest
intensity probes have variances almost zero. This effect is even stronger
using the vst algorithm for Illumina BeadArrays (you are an author of the
vst paper). This method typically generates a very pronounced
(increasing) mean-variance trend for probes at very low levels.
Anyone can see this by using the plotSA() function in limma to plot the
mean-variance relationship.
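A short sketch of how such a plot can be produced. The data are
simulated, and the low-intensity probes are given artificially small
variances purely to mimic the pattern described above; the intensity
threshold of 5 and the standard deviations are assumptions made for
illustration, not values from any real platform.

```r
library(limma)

set.seed(2)

# Simulated log2-intensities: 2000 probes on 6 arrays (illustrative only)
n_probes <- 2000
base <- runif(n_probes, min = 3, max = 14)  # per-probe mean log-intensity

# Assumption: probes below intensity 5 behave like non-expressed probes
# and get a much smaller standard deviation than the rest
sd_probe <- ifelse(base < 5, 0.05, 0.3)
y <- base + matrix(rnorm(n_probes * 6, sd = sd_probe), n_probes, 6)

# Two groups of three arrays (hypothetical design)
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

fit <- lmFit(y, design)
fit <- eBayes(fit, trend = TRUE)

# Plot the residual standard deviations (on a square-root scale) against
# average log-intensity, with the fitted variance trend overlaid
plotSA(fit)
```

With real data, only the last three calls are needed once an expression
matrix and design are available.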
Atypically low variances undermine the potential benefits from the
empirical Bayes algorithm just as atypically large variances do, so the
benefit that derives from filtering non-expressed probes remains.

The reason I worded my post in terms of high variances was simply because
the strongest and most frequent arguments for filtering were made over 10
years ago when large variances were common.
> 3. The question when & how independent filtering (as in 1) is beneficial
> is quite unrelated to preprocessing.
I strongly disagree. The benefit that may or may not come from filtering
is intimately connected to the behaviour of the data, especially to the
mean-variance trend, and this depends intimately on the platform and on
the preprocessing.
Sincerely
Gordon
> You are right that FDR is a property of the whole selected gene list,
> not of individual genes, and that different approaches exist for
> spending the type-I error budget wisely, by weighting different genes
> differently; of which independent filtering is one and trended eBayes
> (which is not the default option in limma) may be another.
>
> Best wishes
> Wolfgang
>
> Reference:
> Bourgon et al. PNAS 2010: http://www.pnas.org/content/107/21/9546