feature selection

Karen.Chancellor@asu.edu ▴ 20

Hello Bioconductor folk,

Can any of the Bioconductor packages be used on a .pcl file, rather than starting with the raw data? I am starting with a .pcl file containing approximately 900 genes and 50 samples, which I have read using read.table. The classification is known, and there are 3 classes of samples.

I am interested in reducing the number of genes. I would like to use the R randomForest package for this task. Is this appropriate? I'm new to this, so I will appreciate any help.

Thanks,
Karen
.- --. ....- -.-. -.-.

Written 21.7 years ago by Karen.Chancellor@asu.edu; updated 21.7 years ago by Liaw, Andy
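The question mentions reading a .pcl file with read.table. A minimal sketch of that step follows, assuming the common tab-delimited PCL layout (an ID column, a NAME column, a GWEIGHT column, then one column per sample, with an EWEIGHT row directly under the header). The file contents, gene IDs, and sample names here are made up for illustration; check the layout of your own file before relying on the column positions.

```r
# Write a tiny, made-up .pcl file so the example is self-contained.
pcl_path <- tempfile(fileext = ".pcl")
writeLines(c(
  "UID\tNAME\tGWEIGHT\tS1\tS2\tS3\tS4",
  "EWEIGHT\t\t\t1\t1\t1\t1",
  "g1\tgene one\t1\t0.12\t-0.40\t1.31\t0.05",
  "g2\tgene two\t1\t-1.10\t0.22\t\t0.90",     # one missing value
  "g3\tgene three\t1\t0.75\t0.10\t-0.33\t-0.81"
), pcl_path)

# Read it as plain tab-delimited text; empty fields become NA.
pcl <- read.table(pcl_path, header = TRUE, sep = "\t", quote = "",
                  comment.char = "", na.strings = c("", "NA"),
                  stringsAsFactors = FALSE, check.names = FALSE)

# Drop the EWEIGHT row, then keep only the sample columns
# (assumed here to start at column 4) as a numeric matrix.
pcl <- pcl[pcl[[1]] != "EWEIGHT", ]
expr <- as.matrix(pcl[, -(1:3)])
rownames(expr) <- pcl[[1]]

dim(expr)  # genes in rows, samples in columns
```

Most classifiers in R, randomForest included, expect samples in rows, so `t(expr)` is usually what gets passed on.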

Nicholas Lewin-Koh ▴ 430

Hi Karen,

I don't know that starting with randomForest and using the importance values is the best way to start. I would suggest first filtering the data in different ways, for example by keeping the 200 genes with the largest F values. If your question is to identify differentially expressed genes, then you really want a multiple comparisons approach; the multcomp package is quite good.

If the interest is a classification rule, try filtering in different ways, as suggested above, and then try some exploratory discriminant analysis. I have gotten good results with the fda function in the mda package on CRAN. Use the gen.ridge method option, which gives penalized discriminant analysis. This lets you look at the projections and determine whether the states are separable. You can also look at the coefficients for each variable. After some careful EDA, then go for the classification.

Nicholas

Posted 21.7 years ago by Nicholas Lewin-Koh
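The filtering step suggested above (keep the 200 genes with the largest F values) can be sketched in base R as follows. The data are simulated stand-ins for the 900 x 50 matrix in the question, and the class labels and effect size are assumptions made only so the example runs. The last few lines show one way the fda/gen.ridge call might look; they run only if the mda package is installed, and the choice of 20 genes there is arbitrary.

```r
set.seed(1)
n_genes <- 900
n_samples <- 50

# Simulated stand-in for the real data: genes in rows, samples in columns.
cls <- factor(rep(c("A", "B", "C"), length.out = n_samples))
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("g", seq_len(n_genes)),
                               paste0("s", seq_len(n_samples))))
# Make the first 10 genes truly informative by shifting class B.
expr[1:10, cls == "B"] <- expr[1:10, cls == "B"] + 3

# One-way ANOVA F statistic for each gene against the class labels.
f_stat <- apply(expr, 1, function(x) anova(lm(x ~ cls))[["F value"]][1])

# Keep the 200 genes with the largest F values.
top200 <- names(sort(f_stat, decreasing = TRUE))[1:200]
expr_top <- expr[top200, ]

# Optional: penalized discriminant analysis on a handful of the top genes.
if (requireNamespace("mda", quietly = TRUE)) {
  dat <- data.frame(cls = cls, t(expr_top[1:20, ]))
  pda_fit <- mda::fda(cls ~ ., data = dat, method = mda::gen.ridge)
}
```

Note that any classifier tuned on genes selected this way should be evaluated with the filtering repeated inside each cross-validation fold; otherwise the error estimate is optimistic.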

Liaw, Andy ▴ 360

First, some disclaimers:

1. I don't work with gene expression data, so I lack the insights that others have.
2. I maintain the randomForest package, and use it a lot, so count on me being biased.

Now, if Karen's objective is finding differentially expressed genes, I agree that randomForest is overkill. However, for classification as well as data exploration, randomForest can be a very handy tool. What we have found, through both simulated and real (non-genomic) data, is that the variable importance measures can be very effective. I don't see anything wrong with using them to identify potentially "interesting" genes. There are some points to keep in mind, though:

1. We had found "measure 1" of variable importance to be uninformative in some situations, and not very stable even with a large number of trees. Leo had decided to abandon measures 1 and 3. In the next version of the package, only measures 2 and 4 are computed. Both of these are quite stable (with, say, 500 or more trees).
2. In most cases that we have seen, randomForest is extremely tolerant of noise variables, in the sense that the cross-validated error rates do not improve significantly as the number of variables is reduced, for data sets where we know there is a large number of noise variables. While reducing the number of variables may be a necessity for other classifiers, it doesn't affect RF much most of the time.
3. Considering #2 above, the value of the importance measures is really mostly for "interpretation" or exploration. There's an obvious drawback, though: the measures do not give any hints on trend or direction. To gain further insight into the structure of the data, one should use the information provided by variable importance and carry out further exploration with other tools (e.g., fit more "interpretable" models using the most important variables, but be careful not to read too much into the performance of such models, as selection bias has crept in).

That's my $0.02 for the day...
Andy

> -----Original Message-----
> From: Nicholas Lewin-Koh [mailto:nikko@hailmail.net]
> Sent: Monday, March 24, 2003 10:52 PM
> To: Karen.Chancellor@asu.edu
> Cc: bioconductor@stat.math.ethz.ch
> Subject: Re: [BioC] feature selection

Posted 21.7 years ago by Liaw, Andy
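The variable-importance workflow described in the reply above might look roughly like this with a current release of the randomForest package from CRAN. The data are simulated (900 genes, 50 samples, 3 classes, with 10 genes made informative), so every number it prints is illustrative only. Current releases expose two importance types through importance(): type = 1 (permutation-based mean decrease in accuracy) and type = 2 (mean decrease in node impurity). How these correspond to the measure numbers used in the 2003 post is not asserted here.

```r
library(randomForest)  # CRAN package: install.packages("randomForest")

set.seed(42)
n_samples <- 50
n_genes <- 900

# Simulated stand-in data: samples in rows, genes in columns.
cls <- factor(rep(c("A", "B", "C"), length.out = n_samples))
x <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
            dimnames = list(NULL, paste0("g", seq_len(n_genes))))
# Make the first 10 genes informative by shifting class B.
x[cls == "B", 1:10] <- x[cls == "B", 1:10] + 2

# Fit the forest with importance measures switched on.
rf <- randomForest(x = x, y = cls, ntree = 1000, importance = TRUE)
print(rf)  # includes the out-of-bag error estimate

# Two importance measures, one row per gene.
imp_acc  <- importance(rf, type = 1)  # mean decrease in accuracy
imp_gini <- importance(rf, type = 2)  # mean decrease in Gini impurity

# Candidate "interesting" genes: the 20 with the highest importance.
top_genes <- rownames(imp_acc)[order(imp_acc[, 1], decreasing = TRUE)][1:20]

# Cross-validated error as the number of predictors is halved repeatedly,
# to check how much variable reduction matters for the forest itself.
cv <- rfcv(trainx = x, trainy = cls, cv.fold = 5)
cv$error.cv
```

As the reply cautions, genes picked this way are a starting point for further exploration; the importance scores carry no information about the direction of an effect.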
