Johan van Heerden
Dear All,
I have scoured the BioC mailing list in search of a clear answer
regarding the filtering of data sets prior to differential testing,
in an attempt to circumvent the multiple testing problem. Although
several opinions have been expressed over the last couple of years I
have not yet found a convincing argument for or against this practice.
I would like to make a comment and would appreciate any constructive
feedback, as I am not a statistician but a biologist.
As far as I can see the problem has been divided into 2 categories:
(1) "Supervised" and (2) "Unsupervised" filtering, where (1) is based
on some knowledge regarding the functional classes present in the
data, as opposed to (2) which does not consider any such information.
Several criticisms have been raised against the "Supervised" approach,
with many people calling it flawed logic. My first comments are
regarding the logic of "Supervised" filtering.
As an example: A data set consisting of two classes (Treatment 1 and
Treatment 2) has been generated. A fold-change filter is then used to
enrich the data set for genes that show class-specific activity (i.e.
select only genes that show a mean x-fold change between classes).
This filtered data set is then used for differential testing.
My first question is: How is this different (especially when working
with "whole-genome" arrays) from having custom arrays constructed from
genes known to show a response to some treatment? That is, arrays are
selectively printed with genes that are known or expected to show a
response. This is a type of "filtering" step that will yield arrays
with highly reduced gene sets. This scenario can result from prior
knowledge about pathways or can arise from a discovery-based
microarray experiment, where a researcher produces whole-genome arrays
and from there selects "responsive" genes for the creation of targeted
(or custom) arrays. Surely this step-wise sample space reduction
should be subject to the same criticism?
Secondly, the supervised fold-change filter should not affect the
test statistic of each individual gene, but will have profound effects
on the adjusted p-values, since fewer tests enter the adjustment. I
have checked this only for t-tests and am not sure what the effect on
more complex differential testing methods would be. If the only
effects of the "supervised" filtering step are the enrichment of
class-specific responsive genes and a reduction in the severity of the
p-value ADJUSTMENT (without affecting the actual statistic), this
could surely be a very useful way of filtering data?
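
To make the arithmetic behind this point concrete, here is a minimal
sketch in Python. The p-values, fold changes, the 2-fold cutoff and the
helper name bh_adjust are all hypothetical, and the Benjamini-Hochberg
procedure is assumed as the adjustment (no particular method is named
above). It only shows that the raw p-values are untouched by the filter
while the adjusted values depend on how many tests are adjusted
together; it says nothing about whether the adjustment remains
statistically valid after such a filter.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene results: raw t-test p-values and log2 fold changes.
raw_p = [0.001, 0.004, 0.03, 0.20, 0.35, 0.50, 0.62, 0.75, 0.88, 0.95]
log2_fc = [2.1, 1.8, 1.2, 0.3, -0.2, 0.1, 0.4, -0.1, 0.2, 0.0]

# "Supervised" filter (assumed cutoff): keep genes with |log2 FC| >= 1,
# i.e. at least a 2-fold mean change between the two classes.
kept = [i for i, fc in enumerate(log2_fc) if abs(fc) >= 1.0]

adj_all = bh_adjust(raw_p)                            # adjust over all genes
adj_filtered = bh_adjust([raw_p[i] for i in kept])    # adjust over kept genes

for pos, i in enumerate(kept):
    print(f"gene {i}: raw p = {raw_p[i]:.3f}, "
          f"adjusted (all genes) = {adj_all[i]:.3f}, "
          f"adjusted (filtered) = {adj_filtered[pos]:.3f}")
```

In this toy example the raw p-value of each kept gene is identical in
both runs; only the number of tests m entering the adjustment changes.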
With regard to the "unsupervised" approaches: These approaches define
some overall variability threshold which can be used to filter out
genes that don't show a minimum degree of variability regardless of
class. As far as I can tell there are several issues with this
approach. (1) Some genes will be naturally "noisy", i.e. will show
high levels of fluctuation regardless of class. These genes are likely
to pass a filter based on degree of variability. (2) Some genes might
show low levels of variability (with small changes between classes)
and could be important, but will be excluded if a filter is based on
degree of variability.
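
The two concerns above can be illustrated with a small sketch, again
in Python. The gene names, expression values and variance cutoff are
invented purely for illustration, and the overall sample variance is
assumed as the variability measure (other measures, such as the IQR,
are equally plausible). The filter never looks at the class labels.

```python
from statistics import mean, variance

# Hypothetical log2 expression values; the first four samples are
# Treatment 1 and the last four are Treatment 2.
expr = {
    "noisy_gene":      [5.0, 9.0, 4.0, 8.5, 8.0, 4.5, 9.5, 5.5],
    "subtle_gene":     [7.0, 7.1, 6.9, 7.0, 7.4, 7.5, 7.3, 7.4],
    "responsive_gene": [5.0, 5.2, 4.9, 5.1, 8.0, 8.1, 7.9, 8.2],
}
VAR_CUTOFF = 0.5  # hypothetical overall-variance threshold

# "Unsupervised" filter: variance across all samples, class labels ignored.
overall_var = {gene: variance(values) for gene, values in expr.items()}
kept = [gene for gene in expr if overall_var[gene] >= VAR_CUTOFF]

# Between-class difference in means, computed only to inspect the outcome.
class_diff = {gene: mean(values[4:]) - mean(values[:4])
              for gene, values in expr.items()}

for gene in expr:
    status = "kept" if gene in kept else "removed"
    print(f"{gene}: overall variance = {overall_var[gene]:.3f}, "
          f"class difference = {class_diff[gene]:.2f}, {status}")
```

With these made-up numbers the noisy gene passes the filter despite a
negligible class difference, while the gene with a small but consistent
shift between classes is removed.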
I would greatly appreciate some feedback on these comments,
specifically some statistical substantiation as to why a "supervised"
approach is "flawed", given the similar experimental strategies
described in the paragraph on this approach.
Many Thanks!!
Johan van Heerden