RandomForest, supervised machine learning and uncertainty

January Weiner ▴ 370

Dear all,

I am using random forests for supervised machine learning. My set of biomarkers is quite good at distinguishing samples from different classes.

However, I would get an even better classification if I could introduce a class of "Unknown" or "Unclassified" samples. Given that alrf is the RF object

    alrf <- randomForest( group ~ ., data=all )

I take a look at the matrix alrf$votes. In almost all of the misclassified cases the votes were close to a tie; some correctly classified cases were also close to a tie.

If I define an additional group called "Undefined", this group will be larger than the set of misclassified cases (as some correctly classified cases will also fall into it). However, the error rate *outside* of that group will be almost negligible. From a purely pragmatic point of view, such a situation is preferable in biomarker discovery: it is better to admit that you don't know something than to risk a misclassification.

And here is my question: is there a standard method of creating such a class? For example, for a given sample i, I use

    sum( ( votes[i,] - max( votes[i,] ) )^2 )

or the difference between the two top votes for that sample. But I think that this approach is not sufficient.

Best regards,
j.

--
Dr. January Weiner 3
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin, Germany
Web: www.mpiib-berlin.mpg.de
Tel: +49-30-28460514
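To make the two doubt scores from the question concrete, here is a minimal base-R sketch. It assumes a votes matrix of per-class vote fractions shaped like alrf$votes; the toy values, the helper names (ssq_from_max, top_two_margin) and the 0.2 cutoff are all made up for illustration and are not part of randomForest.

```r
# Toy stand-in for alrf$votes: rows are samples, columns are classes,
# entries are vote fractions (each row sums to 1). Values are made up.
votes <- rbind(
  s1 = c(A = 0.90, B = 0.05, C = 0.05),
  s2 = c(A = 0.45, B = 0.40, C = 0.15),
  s3 = c(A = 0.10, B = 0.52, C = 0.38)
)

# Score 1 (from the post): sum of squared differences from the top vote.
ssq_from_max <- function(v) sum((v - max(v))^2)

# Score 2 (from the post): margin between the two highest vote fractions.
top_two_margin <- function(v) {
  s <- sort(v, decreasing = TRUE)
  unname(s[1] - s[2])
}

ssq    <- apply(votes, 1, ssq_from_max)
margin <- apply(votes, 1, top_two_margin)

# Hypothetical rule: call a sample "Undefined" if its margin is below 0.2.
cutoff <- 0.2
called <- colnames(votes)[max.col(votes, ties.method = "first")]
called[margin < cutoff] <- "Undefined"

print(data.frame(ssq, margin, called))
```

With more than two classes the two scores can rank samples differently, because the squared-difference score also depends on the votes for the third and later classes, while the margin only looks at the top two.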

ADD COMMENT • link updated 14.0 years ago by Vincent J. Carey, Jr. 6.7k • written 14.0 years ago by January Weiner ▴ 370

Vincent J. Carey, Jr. 6.7k

On Wed, Dec 8, 2010 at 5:43 AM, January Weiner <january.weiner at mpiib-berlin.mpg.de> wrote:

> Is there a standard method of creating such a class? For example, for
> a given sample i, I use sum( ( votes[i,] - max( votes[i,] ) )^2 ) or
> the difference between the two top votes for a given sample. But I
> think that this approach is not sufficient.

I don't think there is anything like a "standard method" for this task, but if I read you correctly you are addressing the extension of the decision task from two classes to two classes plus "doubt". This is discussed at some length in Ripley's "Pattern Recognition and Neural Networks" book; see the comments on the "error-reject" curve on p. 20 and on the "safety threshold" concept on p. 22.

The MLInterfaces vignette has an application just at the end that, as written, turns out to be nugatory: the doubt interval is too narrow to capture any classification for the data in use. If you change the code to

    douPred[smallDou(0.35, 0.65)] <- "doubt"

one prediction is converted to "doubt". This issue deserves more attention.
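The doubt-interval idea can be sketched in base R without MLInterfaces. This is not the vignette's code: the probabilities below are invented, in_doubt is a hypothetical helper, and the narrow interval (0.49, 0.51) is chosen arbitrarily to show how an interval can be too tight to flag anything.

```r
# Hypothetical predicted probabilities of class "pos" for six samples
# in a two-class problem (values are made up for illustration).
p_pos <- c(0.97, 0.62, 0.48, 0.36, 0.30, 0.05)

# Helper (not from MLInterfaces): TRUE where p lies strictly inside (lo, hi).
in_doubt <- function(p, lo, hi) p > lo & p < hi

# Ordinary two-class call, then overwrite the calls inside the doubt interval.
pred <- ifelse(p_pos >= 0.5, "pos", "neg")
pred[in_doubt(p_pos, 0.35, 0.65)] <- "doubt"
print(table(pred))

# An interval that is too narrow for these data flags no sample at all.
n_narrow <- sum(in_doubt(p_pos, 0.49, 0.51))
print(n_narrow)
```

The width of the interval is the tuning knob here: widening it moves more samples into "doubt" and should leave fewer borderline calls among the rest.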

ADD COMMENT • link 14.0 years ago Vincent J. Carey, Jr. 6.7k

On Wed, Dec 8, 2010 at 7:07 AM, Vincent Carey <stvjc@channing.harvard.edu> wrote:

> I don't think there is anything like a "standard method" for this
> task, but if I read you correctly you are addressing the extension of
> the decision task from two classes to two classes plus "doubt".

I'll just add here that when thinking about biomarker selection and clinical prediction, one must be aware of the often imbalanced costs (to the patient) of misclassification (which could include the "unclassified" cases), depending on the actual details of the clinical scenario.

Sean

ADD REPLY • link 14.0 years ago Sean Davis 21k

Thank you, Vincent, for the answer.

> task, but if I read you correctly you are addressing the extension of
> the decision task from two classes to two classes plus "doubt".

Yes, although I do have more than two classes, and I would like to stick to random forests. Say, extend the RF decision task from N classes to N + 1 classes. The problem is well described in the discussion of the "safety threshold" in the Ripley book.

The simple solution is to define a "doubt function" d on the votes matrix from the RF, such as the one that I have mentioned, and then plot the size of the "doubt class" and the error rate in the remaining classes against d. That would help in making a decision, or would actually count as a result for my study.

@Sean Davis:

> I'll just add here that when thinking about biomarker selection and clinical prediction,
> one must be aware of the often imbalanced costs (to the patient) of misclassification
> (which could include the "unclassified" cases), depending on the actual details of
> the clinical scenario.

This is precisely why I would like to consider the "doubt class". The costs of having an unclassified result are definitely different (and most likely lower) than the costs of a false negative.

Cheers,
j.
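The plot described in this reply (size of the doubt class and error rate among the remaining samples, as a function of a doubt threshold d) could be computed roughly as follows. This is a sketch under assumptions: the doubt score is taken to be the top-two vote margin, the margins, labels and calls are invented, and error_reject is a hypothetical helper name.

```r
# Made-up example: top-two vote margins, true labels and RF calls for 8 samples.
margin <- c(0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05)
truth  <- c("A", "B", "A", "C", "B", "A", "C", "B")
called <- c("A", "B", "A", "C", "A", "A", "B", "C")

# For each threshold d: the fraction of samples sent to "doubt" (margin < d)
# and the error rate among the samples that are still classified.
error_reject <- function(margin, truth, called, d_grid) {
  t(sapply(d_grid, function(d) {
    keep <- margin >= d
    c(d = d,
      doubt_frac = mean(!keep),
      error_rate = if (any(keep)) mean(called[keep] != truth[keep]) else NA_real_)
  }))
}

er_curve <- error_reject(margin, truth, called, d_grid = c(0, 0.15, 0.35, 0.5))
print(er_curve)

# To draw the curve (left commented out so the sketch runs without a device):
# plot(er_curve[, "doubt_frac"], er_curve[, "error_rate"], type = "b",
#      xlab = "fraction in doubt class", ylab = "error rate among classified")
```

If the margins and calls come from out-of-bag votes, the resulting error rates are out-of-bag estimates; choosing d on the same data and then reporting the error at that d will tend to be optimistic, so a held-out set for the final figure would be a safer design.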

ADD REPLY • link 14.0 years ago January Weiner ▴ 370

Hi January,

If the situation is as you describe it, you do not need the third class. Stay with the two original classes, and when a new case arrives: if at least 60% (or some other threshold) of the votes are for class 1, call it class 1; if at least 60% (a threshold not necessarily equal to the previous one) of the votes are for class 2, call it class 2; otherwise call it undefined.

Best regards,
Moshe.
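The per-class threshold rule suggested here might look like this in base R. It is only a sketch: the vote fractions and thresholds are invented, call_with_thresholds is a hypothetical helper, and the rule is applied to the top-voted class, which matches the description above whenever every threshold exceeds 0.5 and also extends to more than two classes.

```r
# Made-up vote fractions for a two-class forest (each row sums to 1).
votes <- rbind(
  c(class1 = 0.75, class2 = 0.25),
  c(class1 = 0.55, class2 = 0.45),
  c(class1 = 0.32, class2 = 0.68),
  c(class1 = 0.41, class2 = 0.59)
)

# Minimum vote fraction required to call each class; the values are
# arbitrary and need not be equal across classes.
thresholds <- c(class1 = 0.60, class2 = 0.65)

call_with_thresholds <- function(votes, thresholds) {
  top      <- max.col(votes, ties.method = "first")        # index of top class
  top_vote <- votes[cbind(seq_len(nrow(votes)), top)]      # its vote fraction
  needed   <- unname(thresholds[colnames(votes)[top]])     # that class's bar
  ifelse(top_vote >= needed, colnames(votes)[top], "undefined")
}

calls <- call_with_thresholds(votes, thresholds)
print(calls)
```

Unequal thresholds are one simple way to reflect unequal misclassification costs: a class whose false calls are more costly can be given a higher bar.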

ADD REPLY • link 14.0 years ago Moshe Olshansky ▴ 120
