Filtering gene list prior to statistical testing
2
0
Entering edit mode
@johan-van-heerden-2873
Last seen 10.2 years ago
Dear All,

I have scoured the BioC mailing list in search of a clear answer regarding the filtering of data sets prior to differential testing, in an attempt to circumvent the multiple testing problem. Although several opinions have been expressed over the last couple of years, I have not yet found a convincing argument for or against this practice. I would like to make a comment and would appreciate any constructive feedback, as I am not a statistician but a biologist.

As far as I can see the problem has been divided into 2 categories: (1) "supervised" and (2) "unsupervised" filtering, where (1) is based on some knowledge regarding the functional classes present in the data, as opposed to (2), which does not consider any such information. Several criticisms have been raised against the supervised approach, with many people calling it flawed logic. My first comments are regarding the logic of supervised filtering.

As an example: a data set consisting of two classes (Treatment 1 and Treatment 2) has been generated. A fold-change criterion is then used to enrich the data set for genes that show between-class activity (i.e. select only genes that show a mean x-fold change between classes). This filtered data set is then used for differential testing.

My first question is: how is this different (especially when working with "whole-genome" arrays) from having custom arrays constructed from genes known to show a response to some treatment? That is, arrays will then be selectively printed with genes that are known or expected to show a response. This is a type of "filtering" step that will yield arrays with highly reduced gene sets. This scenario can result from prior knowledge about pathways, or can arise from a discovery-based microarray experiment, where a researcher produces whole-genome arrays and from there selects "responsive" genes for the creation of targeted (or custom) arrays. Surely this step-wise sample space reduction should be subject to the same criticism?

Secondly, the supervised fold-change filter should not affect the statistic of each individual gene, but will have profound effects on the adjusted p-values. I have checked this only for t-tests and am not sure what the effect on more complex statistical differential testing methods would be. If the only effect of the supervised filtering step is the enrichment of class-specific responsive genes and a reduction in the severity of the p-value ADJUSTMENT (without affecting the actual statistic), this could surely be a very useful way of filtering data?

With respect to the unsupervised approaches: these define some overall variability threshold which can be used to filter out genes that don't show a minimum degree of variability, regardless of class. As far as I can tell there are several issues with this approach: (1) some genes will be naturally "noisy", i.e. will show high levels of fluctuation regardless of class, and these genes are likely to be retained by a filter based on degree of variability; (2) some genes might show low levels of variability (with small changes between classes) and could be important, but will be excluded if a filter is based on degree of variability.

I would greatly appreciate some feedback on these comments, specifically some statistical substantiation as to why a supervised approach is "flawed", given the similar experimental strategies described in the paragraph on this approach.

Many thanks!!
Johan van Heerden
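As a small illustration of the second point, here is a minimal R sketch (simulated data, plain two-sample t-tests with BH adjustment): the fold-change filter leaves each gene's t-statistic and raw p-value untouched and only changes the multiple-testing adjustment.

set.seed(1)
expr <- matrix(rnorm(10000 * 10), nrow = 10000)   # simulated log2 expression
cl <- rep(1:2, each = 5)                          # two classes, 5 arrays each

## per-gene t-test p-values: these do not depend on any later filtering
pvals <- apply(expr, 1, function(x) t.test(x[cl == 1], x[cl == 2])$p.value)

## supervised filter: mean log2 fold change of at least log2(1.2)
lfc  <- rowMeans(expr[, cl == 1]) - rowMeans(expr[, cl == 2])
keep <- abs(lfc) >= log2(1.2)

## BH adjustment before vs after filtering: the raw p-values of the kept
## genes are identical, but the adjusted values typically drop because
## fewer tests enter the correction
adj.all  <- p.adjust(pvals, method = "BH")[keep]
adj.filt <- p.adjust(pvals[keep], method = "BH")
summary(adj.all - adj.filt)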
Pathways • 2.0k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States
Hi Johan,

Johan van Heerden wrote:
> My first question is: How is this different (especially when working with "whole-genome" arrays) from having custom arrays constructed from genes known to show a response to some treatment? [...] Surely this step-wise sample space reduction should be subject to the same criticism?

It is. When normalizing microarray data, the assumption being made is that many (most) of the genes being measured are not actually changing expression. What most normalization schemes do is line up the bulk of the data so that on average the log fold change is zero. If we can't make this assumption (e.g., it is possible that _all_ genes are up-regulated in one sample), then without having some housekeeping genes to use for the normalization, there is no way to normalize the data without making some strong and possibly unwarranted assumptions.

So the main argument as I see it against doing supervised sample space reduction is that you may be removing the main assumption of most normalization schemes. The normalization is really the important thing here, as you are trying to remove unwanted technical variation that will have a much larger effect on your statistics than the multiple testing issue.

> Secondly, the supervised fold-change filter should not affect the statistic of each individual gene, but will have profound effects on the adjusted p-values. [...]
> Wrt the "unsupervised" approaches: [...] (1) Some genes will be naturally "noisy" [...] (2) Some genes might show low levels of variability (with small changes between classes) and could be important, but will be excluded if a filter is based on degree of variability.

This is all true, but again I think the normalization issue is much more important, and that is where we really want to make sure we are doing a good job.

These days people are getting much less interested in a list of differentially expressed genes, as these are often too large to be useful anyway. The real underlying goal of most experiments, IMO, is to find pathways that are perturbed by some treatment/condition/whatever. In this case one really doesn't care about multiple testing, and instead is just using the t-stats (or whatever) in a GSEA-type statistic to measure the difference in sets of genes.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
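To illustrate the normalization point, a toy sketch of the "most genes don't change" assumption (simulated log-ratios with simple global median centering, not any particular Bioconductor routine):

set.seed(2)
## 95% of genes flat, 5% truly up-regulated, plus a global technical
## offset of +0.5 (e.g. a dye or labeling effect)
M <- c(rnorm(9500, mean = 0), rnorm(500, mean = 2)) + 0.5

## global median centering relies on the unchanged majority
M.norm <- M - median(M)
median(M.norm)    # ~0: the flat bulk is restored to zero

## if the "uninteresting" majority were removed before normalizing, the
## median would track real regulation, and centering would subtract
## biology instead of the technical offset
M.sub <- M[abs(M) > 1]
median(M.sub)     # no longer estimates the technical offset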
ADD COMMENT
0
Entering edit mode
Dear Jim,

Thanks for the feedback. Your point about the implications for normalization is noted, but my question relates to filtering post-normalization. I.e. the scenario is as follows: data are normalized (using all "good" data points, "good" being based on GenePix quality criteria) and the efficiency of the normalization is assessed (using visualizations and stats) based on the assumptions that you stated. The data are then pre-processed (missing values are imputed, replicates are averaged, etc.). Once a "clean" data set is obtained (where we are fairly certain that most technical noise has been removed), we filter out all genes that don't show at least a 1.2-fold change between classes (i.e. not very stringent; the aim is to remove within-class "flat" patterns), which reduces our sample space by about 1/3 (highlighting again that most genes show no or very little change). Further analysis is then done on this data set (i.e. functional enrichment of gene groups as well as "straight" differential testing).

What I would like to know is whether there is any good reason why we should keep the genes that show "no" class-based behaviour (in terms of fold changes), as these contain very little or no information. Including them does not change the types of functional classes found to be enriched, but it pushes all adjusted p-values (doing differential testing) into very non-significant ranges. Eliminating them improves the adjusted p-values greatly. The "top" x genes using the filtered data set are exactly the same as for the unfiltered one, except they have significant adjusted p-values. I am aware that a significant p-value is not the be-all-and-end-all of microarray research, but it is an unfortunate reality that most people do attach great importance to this, and "significant" values are required to make any substantiated assertions (prior to downstream validation, of course!).

Can any serious criticisms be raised against this type of "post-normalization" sample space reduction?

Thanks,
Johan van Heerden
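For concreteness, the filtering step described above might look like this in R (a sketch assuming a hypothetical normalized log2 expression matrix expr.norm and a class label vector cl):

## expr.norm: normalized log2 expression (rows = genes, cols = arrays)
mean.lfc <- rowMeans(expr.norm[, cl == 1]) - rowMeans(expr.norm[, cl == 2])

## keep genes showing at least a 1.2-fold mean change between classes,
## i.e. remove the within-class "flat" patterns
keep <- abs(mean.lfc) >= log2(1.2)
expr.filt <- expr.norm[keep, ]
mean(!keep)   # fraction removed (~1/3 in the scenario above)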
ADD REPLY
0
Entering edit mode
Hi Johan,

I tend to filter out genes with low stdev across all arrays in the experiment before doing any tests. This decreases your possibility for multiple testing errors, although I still apply an MTP. Since your total number of tests is lower, you see better adjusted p-values. To understand this better, look at the calculation for whatever MTP you use and see how N factors into it. Also, low stdev usually equals no change, like you say.

Since you will likely go on to validate anything important with qPCR, I say cut them out before testing. If for some reason it gives you false positives (unlikely given the MTP and stringency) you can weed them out later.

I filter these things out using the shorth. In my opinion, it gets rid of most 'absent' things as well as genes with low stdev but true signal.

Hope that helps.

Siobhan

S. A. Braybrook
Graduate Student, Harada Lab
Section of Plant Biology
University of California, Davis
Davis, CA 95616
Ph 530.752.6980

The time is always right, to do what is right.
- Martin Luther King, Jr.
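One way to express this kind of unsupervised filter, using shorth() from the genefilter package as a data-driven cutoff (a sketch with a hypothetical matrix expr.norm; see the genefilter documentation for the package's recommended usage):

library(genefilter)

## per-gene standard deviation across all arrays, ignoring class labels
sds <- rowSds(expr.norm)

## use the shorth (midpoint of the shortest half) of the sd distribution
## as the threshold: genes below it vary little anywhere in the experiment
keep <- sds > shorth(sds)
expr.filt <- expr.norm[keep, ]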
ADD REPLY
0
Entering edit mode
I had a stats professor tell me that, theoretically, you are performing a test each time you filter a gene (based on fold change or standard deviation, etc.). So if you go along with that line of thinking, you wouldn't really be helping with the multiple testing problem and may be artificially lowering your p-values. I am new to this area of research, though, so I don't know what the standard practices are for publishing this type of result.

-Steve
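This caution can be checked by simulation: under a global null (pure noise, no gene truly changing), a fold-change filter computed from the same data preferentially keeps the genes with the most extreme random differences, which are exactly the genes with the smallest t-test p-values (a sketch with simulated data):

set.seed(3)
expr  <- matrix(rnorm(10000 * 10), nrow = 10000)   # pure noise: no true DE
pvals <- apply(expr, 1, function(x) t.test(x[1:5], x[6:10])$p.value)

## fold-change filter computed from the same data that will be tested
lfc  <- rowMeans(expr[, 1:5]) - rowMeans(expr[, 6:10])
keep <- abs(lfc) >= log2(1.5)

## the kept genes are enriched for small p-values even though nothing
## is truly differentially expressed
mean(pvals < 0.05)         # ~0.05, as expected under the null
mean(pvals[keep] < 0.05)   # clearly inflated, despite no real changes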
ADD REPLY
0
Entering edit mode
I think Siobhan's approach is pretty sensible.

One thing to bear in mind is that we may be agonizing over the parallel testing issue with counterproductive results. Yes, it is common to see p-values adjusted for family-wise error rates or false discovery rates. However, it is a well-known effect that paying too much attention to statistical significance measures can result in lower concordance between replicate experiments than you would see if you simply took the most highly regulated genes. As a result, the FDA and its MAQC group recommend ranking expression by fold change down to some biologically sensible level (say about 2-fold), then removing from the remaining group those genes that fail to meet some common significance threshold (say 0.05). They think altogether too much has been made of p-values in general, and I agree. I do not think the goal here is to quibble over which genes are "most significant" statistically in a single experiment when those same genes fail to reproduce in identical repeat experiments.

We actually did 3 sets (N = 5) of the same experiment over several years and found that statistical measures (e.g. Bioc RankProd) that were closer to straight fold change yielded far more concordance with our data than a variety of other tests which were essentially penalized t-tests. I also did a bunch of simulations and found similar results with theoretical data, so I can see how this works. Basically, it happens when you have non-standard distributions where the effect size and variance are of similar magnitude.

Anyway, if you are feeling distraught over seemingly good statistical practices (like p-value correction) that lead to short, weird lists, you are certainly not alone. Google MAQC and check out their Sept 2006 papers in Nature Biotechnology.

Tom
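In R, that recommendation might be sketched as follows (hypothetical per-gene vectors lfc, on the log2 scale, and pvals; this is an illustration of the described procedure, not the MAQC code):

## rank all genes by absolute fold change
ord    <- order(abs(lfc), decreasing = TRUE)
ranked <- data.frame(lfc = lfc[ord], p = pvals[ord])

## MAQC-style list: at least 2-fold regulated (|log2 FC| >= 1), then drop
## anything failing a common, non-stringent significance threshold
maqc.list <- ranked[abs(ranked$lfc) >= 1 & ranked$p < 0.05, ]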
ADD REPLY
0
Entering edit mode
Hi Johan, You've raised some good points, and i hope that your post attracts a number of replies. For me, i routinely filter out genes that are Absent on all arrays, as detected by mas5calls on older affy data, or some threshold on DABG calls for newer affy data; This removes ~30% of the genes on the array that are detected too close to background in ALL SAMPLES. These are the data points that are least likely to be measured correctly, and least likely to be validated by other approaches. Note I'm not saying that there aren't interesting changes down here, is just that you have to understand/work within the limits of the technology itself. my 2 cents Mark ----------------------------------------------------- Mark Cowley, BSc (Bioinformatics)(Hons) Peter Wills Bioinformatics Centre Garvan Institute of Medical Research ----------------------------------------------------- On 25/06/2008, at 5:36 PM, Johan van Heerden wrote: > Dear Jim, > > Thanks for the feedback. Your point about the implications for > Normalization is noted, but my question relates to filtering post > Normalization. I.e. the scenario is as follows: Data is normalized > (using all "good" data points - "good" being based on Genepix > quality criteria) and effeciency of Normalization is assessed (using > visualizations and stats) based on the assumptions that you stated. > This data is then pre-processed (missing values are imputed, > replicates are averaged etc. Once a "clean" data set is obtained > (where we are fairly certain that most technical noise has been > removed), we filter out all genes that don't show at least a 1.2fold > change between classes (i.e. not very stringent, the aim is to > remove within class "flat" patterns), which reduces our sample space > by about 1/3 (highlighting again that most genes show no or very > little change). Furthur analysis is then done on this data set (i.e. > Functional enrichment of gene groups as well as > "straight" differential testing). > > What I would like to know is if there is any good reason why we > should keep the genes that show "no" class-based behaviour (in terms > of fold-changes) - as these contain very little or no information. > Including them does not change the types of functional classes found > to be enriched, but it pushes all adjusted p-values (doing > differential testing) into very non-significant ranges. Eliminating > them improves the adjusted p-values greatly. The "top" x genes > using the "filtered" data set is exactly the same as the unfiltered > one, except they have significant adjusted p-values. I am aware that > a significant p-value is not the be-all-and-end-all of microarray > research, but it is an unfortunate reality that most people do > attach great importance to this and "significant" values are > required to make any substatiated assertions (prior to downstream > validation of course!). > > Can any serious criticisms be raised against this type of "post- > normalization" sample space reduction? > > Thanks, > Johan van Heerden > > > > > --- On Tue, 6/24/08, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote: > >> From: James W. 
MacDonald <jmacdon at="" med.umich.edu=""> >> Subject: Re: [BioC] Filtering gene list prior to statistical testing >> To: jvhn1 at yahoo.com >> Cc: bioconductor at stat.math.ethz.ch >> Date: Tuesday, June 24, 2008, 5:30 PM >> Hi Johan, >> >> Johan van Heerden wrote: >>> Dear All, >>> >>> I have scoured the BioC mailing list in search of a >> clear answer >>> regarding the filtering of a data sets prior to >> differential testing, >>> in an attempt to circumvent the multiple testing >> problem. Although >>> several opinions have been expressed over the last >> couple of years I >>> have not yet found a convincing argument for or >> against this >>> practice. I would like to make a comment and would >> appreciate any >>> constructive feedback, as I am not a Statistician but >> a Biologists. >>> >>> As far as I can see the problem has been divided into >> 2 categories: >>> (1) "Supervised" and (2) >> "Unsupervised" filtering, where (1) is based >>> on some knowledge regarding the functional classes >> present in the >>> data, as opposed to (2) which does not consider any >> such information. >>> Several criticism have been raised against the >> "Supervised" approach, >>> with many people calling it flawed logic. My first >> comments are >>> regarding the logic of "Supervised" >> filtering. >>> >>> As an example: A data set consisting of two classes >> (Treatment 1 and >>> Treatment 2) has been generated. A fold-change is >> then used to >>> enrich the data set for genes that show within class >> activity (i.e. >>> select only genes that show a mean x-fold change >> between classes). >>> This filtered data set is then used for differential >> testing. >>> >>> My first question is: How is this different >> (especially when working >>> with "whole-genome" arrays) from having >> custom arrays constructed >>> from genes known show a response to some treatment. >> I.e. Arrays will >>> then be selectively printed with genes that are known >> to or expected >>> to show a response. This is a type of >> "filtering" step that will >>> yield arrays with highly reduced gene sets. This >> scenario can result >>> from known knowledge about pathways or can arrise from >> a discovery >>> based microarray experiment, where a researcher >> produces whole genome >>> arrays and from there select "responsive" >> genes for the creation of >>> targeted (or custom arrays). Surely this step-wise >> sample space >>> reduction should be subject to the same criticism? >> >> It is. When normalizing microarray data, the assumption >> being made is >> that many (most) of the genes being measured are not >> actually changing >> expression. What most normalization schemes do is line up >> the bulk of >> the data so on average the log fold change is zero. If we >> can't make >> this assumption (e.g., it is possible that _all_ genes are >> up-regulated >> in one sample), then without having some housekeeping genes >> to use for >> the normalization, there is no way to normalize the data >> without making >> some strong and possibly unwarranted assumptions. >> >> So the main argument as I see it against doing supervised >> sample space >> reduction is that you may be removing the main assumption >> of most >> normalization schemes. The normalization is really the >> important thing >> here, as you are trying to remove unwanted technical >> variation that will >> have a much larger effect on your statistics than the >> multiple testing >> issue. 
--- On Tue, 6/24/08, James W. MacDonald <jmacdon at med.umich.edu> wrote:

>> Subject: Re: [BioC] Filtering gene list prior to statistical testing
>> Date: Tuesday, June 24, 2008, 5:30 PM
>>
>> Hi Johan,
>>
>> Johan van Heerden wrote:
>>> [original question snipped; it is quoted in full at the top of
>>> this thread] [...] Surely this step-wise sample space reduction
>>> should be subject to the same criticism?
>>
>> It is. When normalizing microarray data, the assumption being made
>> is that many (most) of the genes being measured are not actually
>> changing expression. What most normalization schemes do is line up
>> the bulk of the data so that on average the log fold change is zero.
>> If we can't make this assumption (e.g., it is possible that _all_
>> genes are up-regulated in one sample), then without having some
>> housekeeping genes to use for the normalization, there is no way to
>> normalize the data without making some strong and possibly
>> unwarranted assumptions.
>>
>> So the main argument as I see it against doing supervised sample
>> space reduction is that you may be violating the main assumption of
>> most normalization schemes. The normalization is really the
>> important thing here, as you are trying to remove unwanted technical
>> variation that will have a much larger effect on your statistics
>> than the multiple testing issue.
>>
>>> [...] If the only effect of the "supervised" filtering step is the
>>> enrichment of class-specific responsive genes and a reduction in
>>> the severity of the p-value ADJUSTMENT (without affecting the
>>> actual statistic), this could surely be a very useful way of
>>> filtering data? [...] (1) Some genes will be naturally "noisy"
>>> [...] (2) Some genes might show low levels of variability [...]
>>> but will be excluded if a filter is based on degree of variability.
>>
>> This is all true, but again I think the normalization issue is much
>> more important, and that is where we really want to make sure we are
>> doing a good job.
>>
>> These days people are getting much less interested in a list of
>> differentially expressed genes, as these are often too large to be
>> useful anyway. The real underlying goal of most experiments, IMO, is
>> to find pathways that are perturbed by some
>> treatment/condition/whatever. In this case one really doesn't care
>> about multiple testing, and instead is just using the t-stats (or
>> whatever) in a GSEA-type statistic to measure the difference in
>> sets of genes.
>>
>> Best,
>>
>> Jim
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> Affymetrix and cDNA Microarray Core
>> University of Michigan Cancer Center
>> 1500 E. Medical Center Drive
>> 7410 CCGC
>> Ann Arbor MI 48109
>> 734-647-5623
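Johan's point about the severity of the adjustment is easy to see
mechanically: Benjamini-Hochberg adjusted p-values depend on how many
tests enter the procedure, even when the per-gene statistics are
untouched. A toy illustration, with invented numbers:

    raw <- c(1e-05, 2e-04, 5e-04, runif(9997, 0.05, 1))  # 10,000 "genes"
    adj.all <- p.adjust(raw, method = "BH")

    # pretend a filter kept 3,000 genes, among them the three small p-values
    adj.filt <- p.adjust(raw[1:3000], method = "BH")

    adj.all[1:3]   # adjusted against 10,000 tests
    adj.filt[1:3]  # adjusted against 3,000 tests: noticeably smaller

Whether those smaller adjusted values still support valid inference
when the filter itself used the class labels is precisely the question
Robert Gentleman's answer below addresses.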
0
Entering edit mode
rgentleman ★ 5.5k
@rgentleman-7725
Last seen 9.6 years ago
United States
Hi,

Good questions, and ones best discussed with a local statistical
expert; mailing lists are typically not good resources for finding out
about complex statistical issues. They do a much better job of
providing help with using the software. That said, a few hints below.

Johan van Heerden wrote:
> [original question snipped; it is quoted in full at the top of this
> thread] [...] Surely this step-wise sample space reduction should be
> subject to the same criticism?

If you use one data set to select the genes, and a second one to
analyze only those genes selected, then all is fine, and one expects
to see appropriate statistical behavior of most quantities. This is
basically what would happen if you did design a special array for your
setting. If you use the same data set to do both, then pretty much all
the necessary assumptions have been violated, and no meaningful
inference can be made from the p-values. This is Stats 101 (or at
least it used to be).
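That select-on-one-data-set, test-on-another prescription can be
sketched with a simple split of the samples, reusing the illustrative
`m` and `classes` from the earlier sketch (a real analysis would split
within each class to keep the groups balanced):

    set.seed(1)
    sel  <- sample(ncol(m), ncol(m) %/% 2)  # samples used only for selection
    rest <- setdiff(seq_len(ncol(m)), sel)  # held-out samples used for testing

    fc <- rowMeans(m[, sel][, classes[sel] == "Treatment1"]) -
          rowMeans(m[, sel][, classes[sel] == "Treatment2"])
    keep <- abs(fc) >= log2(1.2)            # genes chosen on the first half

    # p-values computed only for the selected genes, only on held-out samples
    pvals <- apply(m[keep, rest], 1,
                   function(x) t.test(x ~ classes[rest])$p.value)
    adj   <- p.adjust(pvals, method = "BH")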
> [...] If the only effect of the "supervised" filtering step is the
> enrichment of class-specific responsive genes and a reduction in the
> severity of the p-value ADJUSTMENT (without affecting the actual
> statistic), this could surely be a very useful way of filtering data?

This makes no sense to me; consult a local expert with a more explicit
statement of what you don't understand.

> Wrt the "unsupervised" approaches: [...] (1) Some genes will be
> naturally "noisy", i.e. will show high levels of fluctuation
> regardless of class. [...] (2) Some genes might show low levels of
> variability (with small changes between classes) and could be
> important, but will be excluded if a filter is based on degree of
> variability.

Yes to (1) and to (2). For (1), you know that these genes may be
informative about some phenotype (and typically they are, but perhaps
not the one you get; whence the name non-specific filtering). Genes
that vary little across all samples are typically not informative for
any phenotype, and hence not for the one(s) you might be interested
in. For (2), microarray technology has its limits; that is one of
them. If genes that exhibit that type of behavior are likely to be
important to you, then you need a different tool. Put a slightly
different way, keeping genes that exhibit that sort of behavior seems
to enrich your pool for non-informative genes/probes; most of us are
trying to enrich for informative ones (your use case may be
different).

> [...] I would greatly appreciate some feedback on these comments,
> specifically some statistical substantiation as to why a
> "supervised" approach is "flawed" [...]

Local experts are more likely to give you the help you want, and
certainly posting with a signature is likely to be more successful
here too.

Robert

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
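The non-specific filtering Robert describes is available off the
shelf; a minimal sketch using the genefilter package, assuming a
normalized ExpressionSet `eset` (the 50% IQR cutoff shown is just one
common choice, not a recommendation from the post):

    library(genefilter)

    # drop the half of the probesets with the lowest IQR across all
    # samples; no class labels are used anywhere in this step
    eset.ns <- varFilter(eset, var.func = IQR,
                         var.cutoff = 0.5, filterByQuantile = TRUE)

Because the class labels never enter the filter, this kind of
reduction is far less problematic for downstream testing than a
supervised fold-change filter applied to the same data.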