Hello Bioconductor folk,
Can any of the Bioconductor packages be used on a .pcl file, rather than
starting with the raw data?
I am starting with a .pcl file containing approximately 900 genes and 50
samples, which I have read using read.table. The classification is known,
and there are 3 classes of samples. I am interested in reducing the number
of genes. I would like to use the R randomForest package for this task.
Is this appropriate? I'm new to this, so I will appreciate any help.
Thanks
Karen
Hi Karen,
I am not sure that starting with randomForest and using its importance
values is the best way to begin. I would suggest first filtering the data
in different ways, for example keeping the 200 genes with the largest F
values. If your question is to identify differentially expressed genes,
then you really want a multiple comparisons approach. The multcomp package
is quite good. If the interest is a classification rule, try filtering in
different ways, as suggested above, and then try some exploratory
discriminant analysis.
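A rough sketch of the F-value filter mentioned above, in base R. This is
not code from the thread: the simulated matrix, the class labels, and the
names expr, cls, and f_stat are assumptions for illustration only, with
genes in rows and samples in columns.

```r
# Simulated stand-in for the expression data: 900 genes x 50 samples, 3 classes.
set.seed(1)
n_genes <- 900
cls <- factor(rep(c("A", "B", "C"), times = c(17, 17, 16)))
expr <- matrix(rnorm(n_genes * length(cls)), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Make the first 30 genes genuinely differ between classes.
expr[1:30, cls == "B"] <- expr[1:30, cls == "B"] + 2

# One-way ANOVA F statistic for a single gene across the classes.
f_stat <- function(y, g) {
  k <- nlevels(g)
  n <- length(y)
  grp_mean <- tapply(y, g, mean)
  grp_n <- tapply(y, g, length)
  ss_between <- sum(grp_n * (grp_mean - mean(y))^2)
  ss_within <- sum((y - grp_mean[as.character(g)])^2)
  (ss_between / (k - 1)) / (ss_within / (n - k))
}

# One F value per gene (row).
f_values <- apply(expr, 1, f_stat, g = cls)

# Keep the 200 genes with the largest F values.
top200 <- names(sort(f_values, decreasing = TRUE))[1:200]
expr_filtered <- expr[top200, ]
```

The cutoff of 200 is the figure suggested above, not a recommendation;
trying several cutoffs is in the spirit of "filtering in different ways".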
I have gotten good results with the fda function in the mda package on
CRAN. Use the gen.ridge method option, which gives penalized discriminant
analysis. This lets you look at the projections and simply determine
whether the states are separable. You can also look at the coefficients
for each variable. After some careful EDA, then go for the classification.
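A minimal sketch of that approach, assuming the mda package is installed.
The simulated data, the 20 pre-filtered genes, and the penalty lambda = 1
are assumptions for illustration, not values from this thread. Samples are
in rows here because fda takes a formula and a data frame.

```r
library(mda)  # CRAN package providing fda() and gen.ridge

set.seed(2)
cls <- factor(rep(c("A", "B", "C"), times = c(17, 17, 16)))
# Simulated stand-in for an already-filtered set of 20 genes (samples in rows).
x <- matrix(rnorm(length(cls) * 20), ncol = 20,
            dimnames = list(NULL, paste0("gene", 1:20)))
x[cls == "B", 1:3] <- x[cls == "B", 1:3] + 2
x[cls == "C", 4:6] <- x[cls == "C", 4:6] - 2
dat <- data.frame(cls = cls, x)

# Penalized discriminant analysis: fda() with the gen.ridge regression method.
# lambda is the ridge penalty passed through to gen.ridge; 1 is arbitrary.
fit <- fda(cls ~ ., data = dat, method = gen.ridge, lambda = 1)

# Discriminant variates: project the samples and plot to judge separability.
proj <- predict(fit, newdata = dat, type = "variates")
plot(proj[, 1:2], col = as.integer(cls), pch = 19,
     xlab = "Discriminant variate 1", ylab = "Discriminant variate 2")

# Coefficients of each gene on the discriminant variates.
gene_coefs <- coef(fit)

# Resubstitution confusion table (optimistic; not an honest error estimate).
conf_tab <- table(predicted = predict(fit, newdata = dat, type = "class"),
                  actual = cls)
conf_tab
```

With real data the penalty would need tuning, and the confusion table
should come from held-out samples or cross-validation.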
Nicholas
First, some disclaimers:
1. I don't work with gene expression data, so I lack the insights that
others have.
2. I maintain the randomForest package and use it a lot, so count on me
being biased.
Now, if Karen's objective is finding differentially expressed genes, I
agree that randomForest is overkill. However, for classification as well
as data exploration, randomForest can be a very handy tool. What we have
found, with both simulated and real (non-genomic) data, is that the
variable importance measures can be very effective. I don't see anything
wrong with using them to identify potentially "interesting" genes.
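As an illustration only, here is a sketch of ranking genes by variable
importance with the interface in current randomForest releases, where
importance() takes type = 1 (mean decrease in accuracy) or type = 2 (mean
decrease in Gini impurity). The sketch does not assume any mapping between
these types and numbered measures from older releases. The simulated data
and ntree = 1000 are assumptions.

```r
library(randomForest)

set.seed(3)
cls <- factor(rep(c("A", "B", "C"), times = c(17, 17, 16)))
# Simulated stand-in: 50 samples (rows) x 100 genes; first 5 genes informative.
x <- matrix(rnorm(length(cls) * 100), ncol = 100,
            dimnames = list(NULL, paste0("gene", 1:100)))
x[cls == "B", 1:5] <- x[cls == "B", 1:5] + 2
x[cls == "C", 1:5] <- x[cls == "C", 1:5] - 2

# importance = TRUE also computes the permutation-based measure.
rf <- randomForest(x = x, y = cls, ntree = 1000, importance = TRUE)

# type = 1: mean decrease in accuracy; type = 2: mean decrease in Gini impurity.
imp_acc <- importance(rf, type = 1)[, 1]
imp_gini <- importance(rf, type = 2)[, 1]

# Genes ranked by each measure; candidates for follow-up, not a final list.
head(sort(imp_acc, decreasing = TRUE), 10)
head(sort(imp_gini, decreasing = TRUE), 10)

# Dot chart of both measures for the top 15 genes.
varImpPlot(rf, n.var = 15)
```

Refitting with a different seed and comparing the rankings is a quick way
to see how stable the measures are for a given number of trees.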
There are some points to keep in mind, though:
1. We have found "measure 1" of variable importance to be uninformative
in some situations, and not very stable even with a large number of trees.
Leo has decided to abandon measures 1 and 3. In the next version of the
package, only measures 2 and 4 are computed. Both of these are quite
stable (with, say, 500 or more trees).
2. In most cases that we have seen, randomForest is extremely tolerant of
noise variables, in the sense that the cross-validated error rates do not
improve significantly as the number of variables is reduced, for data sets
where we know there are a large number of noise variables. While reducing
the number of variables may be a necessity for other classifiers, it
doesn't affect RF much most of the time.
3. Considering #2 above, the value of the importance measures is really
mostly for "interpretation" or exploration. There's an obvious drawback,
though: the measures do not give any hints about trends or directions. To
gain further insight into the structure of the data, one should use the
information provided by variable importance and carry out further
exploration with other tools (e.g., fit more "interpretable" models using
the most important variables, but be careful not to read too much into the
performance of such models, as selection bias has crept in).
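One way to check points 2 and 3 on a given data set is the rfcv() function
in current randomForest releases. It cross-validates models with
successively fewer predictors and ranks the variables inside each fold,
which avoids the selection bias of picking genes on the full data first.
The simulated data and the choices cv.fold = 5 and step = 0.5 are
assumptions for illustration.

```r
library(randomForest)

set.seed(4)
cls <- factor(rep(c("A", "B", "C"), times = c(17, 17, 16)))
# Simulated stand-in: 50 samples (rows) x 100 genes; first 5 genes informative.
x <- matrix(rnorm(length(cls) * 100), ncol = 100,
            dimnames = list(NULL, paste0("gene", 1:100)))
x[cls == "B", 1:5] <- x[cls == "B", 1:5] + 2
x[cls == "C", 1:5] <- x[cls == "C", 1:5] - 2

# Nested cross-validation: at each step the number of predictors is halved,
# keeping the most important ones as ranked on the training folds only.
cv <- rfcv(trainx = x, trainy = cls, cv.fold = 5, step = 0.5)

# Cross-validated error rate against the number of variables retained.
cv_summary <- data.frame(n_var = cv$n.var, cv_error = cv$error.cv)
cv_summary
```

If the error rate stays roughly flat as n_var shrinks, that matches the
noise tolerance described in point 2.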
That's my $0.02 for the day...
Andy
> -----Original Message-----
> From: Nicholas Lewin-Koh [mailto:nikko@hailmail.net]
> Sent: Monday, March 24, 2003 10:52 PM
> To: Karen.Chancellor@asu.edu
> Cc: bioconductor@stat.math.ethz.ch
> Subject: Re:[BioC] feature selection
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
>