Machine learning, cross validation and gene selection


Daniel Brewer ★ 1.9k


Hello,

I am getting a bit confused about gene selection and machine learning, and I was wondering if you could help me out. I have a dataset that is classified into two groups, and my aim is to get a small number of genes (10-20) in a gene signature that I will, in theory, be able to apply to other datasets to optimally classify the samples. As I do not have separate test and training sets, I am using leave-one-out cross-validation to help determine the robustness. I have read that one should perform gene selection for each split of the samples, i.e.

1) Select one group as the test set
2) On the remainder select genes
3) Apply machine learning algorithm
4) Test whether the test set is correctly classified
5) Go to one

If you do this, you might get different genes each time, so how do you get your "final" optimal gene classifier?

Many thanks

Dan

--
Daniel Brewer, Ph.D.
Institute of Cancer Research
Molecular Carcinogenesis
Email: daniel.brewer at icr.ac.uk

GO Cancer • 1.7k views

ADD COMMENT • link updated 14.6 years ago by Vincent J. Carey, Jr. 6.7k • written 14.6 years ago by Daniel Brewer ★ 1.9k


Vincent J. Carey, Jr. 6.7k


Traditionally the purpose of cross-validation is to reduce bias in model appraisal. The "resubstitution estimate" of classification accuracy uses the training data to appraise the model derived from the training data, and is typically optimistically biased; this is the subject of a substantial literature. Cross-validation introduces a series of partitions into training and test sets, so that a collection of appraisals that are independent of the training data is obtained, and these are summarized.

When the training process involves feature selection, this should be part of each cross-validation step. Clearly this process leads to a collection of chosen feature sets, likely with different elements at each step. There is no '"final" optimal' classifier implied by the procedure, but surveying the features chosen at each step may provide insight into commonly selected or informative features.

Random forests have a variable importance measure derived from a bootstrapping approach similar in some respects to cross-validation; a varSelRF package or function was discussed in recent list entries. The MLInterfaces package, and probably many others such as CMA, provide tools to control and interpret cross-validation with embedded feature selection.

Be careful what you wish for -- what exactly do you mean by 'optimal classifier'?

On Wed, Sep 1, 2010 at 10:55 AM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:
> [...]
> If you do this, you might get different genes each time, so how do you
> get your "final" optimal gene classifier?
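The procedure described above (gene selection repeated inside every cross-validation split, followed by a survey of which genes keep being chosen) can be sketched as follows. This is only an illustration under stated assumptions: it is written in plain Python rather than with MLInterfaces or CMA, the data are simulated, and the t-like score and nearest-centroid classifier are stand-ins chosen for brevity, not the methods anyone in this thread used. All function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def make_toy_data(n_per_class=10, n_genes=50, n_informative=5, seed=0):
    """Simulated expression data (assumption): the first genes differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += 3.0 * label  # shift informative genes in class 1
            X.append(row)
            y.append(label)
    return X, y


def select_genes(X, y, k):
    """Rank genes by a t-like score computed on the training samples only."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        spread = pstdev(a) + pstdev(b) + 1e-9
        scores.append((abs(mean(a) - mean(b)) / spread, g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def fit_centroids(X, y, genes):
    """Nearest-centroid classifier restricted to the selected genes."""
    return {
        lab: [mean(row[g] for row, l in zip(X, y) if l == lab) for g in genes]
        for lab in set(y)
    }


def predict(centroids, row, genes):
    """Return the label whose centroid is closest to the sample."""
    def dist(c):
        return sum((row[g] - cj) ** 2 for g, cj in zip(genes, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def loocv(X, y, k):
    """Leave-one-out CV with gene selection repeated inside every fold."""
    correct, counts = 0, Counter()
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, k)  # held-out sample is not used
        counts.update(genes)
        model = fit_centroids(X_train, y_train, genes)
        correct += predict(model, X[i], genes) == y[i]
    return correct / len(X), counts


X, y = make_toy_data()
accuracy, selection_counts = loocv(X, y, k=5)
print(f"LOOCV accuracy: {accuracy:.2f}")
for gene, n in selection_counts.most_common(8):
    print(f"gene {gene}: selected in {n} of {len(X)} folds")
```

The accuracy estimates how well the whole recipe (selection plus classifier) generalizes, while the selection counts show which genes are stable across splits. Neither output is itself a "final" classifier.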

ADD COMMENT • link 14.6 years ago Vincent J. Carey, Jr. 6.7k


Many thanks for the detailed reply. That is very informative.

What I mean by optimal is the collection of genes that any further studies should use. For example, say I have a cancer/normal dataset and I want to find the top 10 genes that will classify the tumour type according to an SVM. I would like to know the set of genes plus SVM parameters that could be used in further experiments to see if it could be used as a diagnostic test.

Thanks again

Dan

On 01/09/2010 4:48 PM, Vincent Carey wrote:
> [...]
> Be careful what you wish for -- what exactly do you mean by 'optimal
> classifier'?

ADD REPLY • link 14.6 years ago Daniel Brewer ★ 1.9k


Hi,

On Wed, Sep 1, 2010 at 12:05 PM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:
> Many thanks for the detailed reply. That is very informative. What I
> mean by optimal is the collection of genes that any further studies
> should use. For example, say I have a cancer/normal dataset and I want
> to find the top 10 genes that will classify the tumour type according to
> an SVM. I would like to know the set of genes plus SVM parameters that
> could be used in further experiments to see if it could be used as a
> diagnostic test.

Here is another view on this:

The *real* purpose of your leave-one-out/whatever cross validation is to assess how well your model can generalize to unknown data.

During this CV phase for your SVM, for instance, you would take this opportunity to determine the optimal value for your parameters (maybe the cost param, or nu, or whatever) -- maybe you could avg. the value of the best parameter found during each fold (the one that gives the best classification accuracy(?)) as your "final" parameter(s).

Also, during the CV you will want to see how different each model is -- not just how good your model's accuracy is on the test set. Maybe you can look at the concordance of the top features? If they are the same features, are they weighted equally, etc.

Once you have sufficiently convinced yourself that an SVM with your type of data, with your fine-tuned parameter values, can "admirably" generalize to unseen data, then you have reached the objective of the cross validation phase.

You could then take *all* of your data and rebuild your model (w/ your params) and use the model that falls out of this as the hammer you will use to attack data that is *really* unseen.
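A minimal sketch of the "tune during CV, then rebuild on all of the data" workflow described above, in plain Python on simulated data. Several things here are assumptions made for illustration: the tuned parameter is the signature size `k` of a nearest-centroid classifier (a stand-in for an SVM cost parameter), and the value is picked by best overall CV accuracy rather than by averaging per-fold winners as suggested above, since with single-sample folds a per-fold "best" value is not well defined. All names are hypothetical.

```python
import random
from statistics import mean, pstdev


def simulate(n_per_class=12, n_genes=40, n_informative=4, seed=1):
    """Toy two-class data; only the first n_informative genes carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += 2.0 * label
            X.append(row)
            y.append(label)
    return X, y


def top_genes(X, y, k):
    """Indices of the k genes with the largest t-like class separation."""
    def score(g):
        a = [r[g] for r, lab in zip(X, y) if lab == 0]
        b = [r[g] for r, lab in zip(X, y) if lab == 1]
        return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def fit(X, y, genes):
    """Class centroids over the chosen genes."""
    return {lab: [mean(r[g] for r, l in zip(X, y) if l == lab) for g in genes]
            for lab in (0, 1)}


def classify(model, row, genes):
    """Assign the sample to the nearest class centroid."""
    return min(model, key=lambda lab: sum(
        (row[g] - c) ** 2 for g, c in zip(genes, model[lab])))


def loocv_accuracy(X, y, k):
    """LOOCV accuracy with gene selection redone inside each fold."""
    hits = 0
    for i in range(len(X)):
        Xtr, ytr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(Xtr, ytr, k)
        hits += classify(fit(Xtr, ytr, genes), X[i], genes) == y[i]
    return hits / len(X)


X, y = simulate()
candidates = [2, 4, 8, 16]

# Step 1: use CV only to compare parameter values.
cv_scores = {k: loocv_accuracy(X, y, k) for k in candidates}
# Prefer the smaller signature when accuracies tie.
best_k = max(candidates, key=lambda k: (cv_scores[k], -k))

# Step 2: rebuild once on *all* samples with the chosen value.
final_genes = top_genes(X, y, best_k)
final_model = fit(X, y, final_genes)
print("CV accuracy per k:", cv_scores)
print("chosen k:", best_k, "final genes:", sorted(final_genes))
```

One caveat on the design: because `best_k` was chosen using the same folds, its CV accuracy is somewhat optimistic as an estimate of performance on new data. Wrapping the tuning in an outer cross-validation loop (nested CV) is the usual remedy.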
Some random comments:

If you are going to use an SVM and are looking to "prune" the features it selects, you might want to look into L1-penalized SVMs, à la:

http://bioinformatics.oxfordjournals.org/cgi/content/full/25/13/1711

(there's an R package there)

Looking down this alley may also be fruitful:

http://cran.r-project.org/web/packages/LiblineaR/

Another way to do that using "normal" SVMs is to perform recursive feature elimination ... these are all things you can google :-)

I'm guessing those packages (and the papers they lead to) will probably give you some more information on how you might go about choosing your "final" model in some principled manner ...

FWIW, I might pursue the "penalized" types of classifiers a bit more aggressively if I were in the shoes that it sounds like you are wearing (I'm a big fan of the glmnet package -- which also does penalized logistic regression) ... but you know what they say about taking advice found on mailing lists ... ;-)

Hope that was helpful,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
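To illustrate why "penalized" classifiers are attractive for building a small signature, here is a rough sketch of L1-penalized logistic regression fitted by proximal gradient descent (soft-thresholding). It is not glmnet, LiblineaR, or an L1-SVM: the solver, step size, penalty value, and simulated data are all assumptions chosen for a short, dependency-free example, and the function names are hypothetical. The point it shows is that the L1 penalty sets many gene weights exactly to zero, so gene selection happens as part of model fitting.

```python
import math
import random


def simulate(n_per_class=20, n_genes=30, n_informative=3, seed=2):
    """Toy two-class data; only the first n_informative genes carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += 2.0 * label
            X.append(row)
            y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Shrink w toward zero by t; values within t of zero become exactly zero."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=500):
    """Minimise mean logistic loss + lam * sum(|w|); the intercept is unpenalised."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        grad_b = sum(resid) / n
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(p)]
        b -= step * grad_b
        # Gradient step on the loss, then the L1 proximal (shrinkage) step.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


X, y = simulate()
w, b = fit_l1_logistic(X, y, lam=0.15)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print(f"{len(selected)} of {len(w)} genes kept:", selected)
```

Increasing `lam` shrinks the signature and decreasing it lets more genes in, so in practice the penalty would itself be chosen by cross-validation, as discussed earlier in the thread.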

ADD REPLY • link 14.6 years ago Steve Lianoglou ★ 13k
