Question

How to use an ExpressionSet object to select genes with different gene selection methods

0

babumanish837 ▴ 10

@babumanish837-8404

Last seen 9.5 years ago

India

I want to select the top k genes from the GDS data and then apply a classification algorithm to find out how well different gene selection algorithms (t-test, chi-square test, mRMR, etc.) perform relative to one another. I used the following R code to generate an expression set from the GDS data.

library(GEOquery)

gds4515=getGEO(filename="GDS4515.soft.gz")

eset=GDS2eSet(gds4515,do.log2=TRUE)

I don't know what to do next. Do I first have to normalize the data, or do something else? If I have to normalize it, how can I do that? And what should I do after that?

microarray biobase geoquery • 3.5k views

ADD COMMENT • link updated 9.6 years ago by svlachavas ▴ 840 • written 9.6 years ago by babumanish837 ▴ 10

1

GDS records have been normalized by the submitter. If you agree that the normalization is appropriate, you could proceed with your analysis. You say "select top k genes" and then "apply some classification algorithm" and then "gene selection algorithms". I am not at all clear on what you are actually trying to do.

ADD REPLY • link 9.6 years ago Sean Davis 21k

0

Dear Sean Davis,

I am working on a project in which I have to compare the performance of different gene selection (feature selection) algorithms, e.g. t-test, chi-square test, mRMR, etc. I am working with two-class microarray colon cancer data. First I will divide the data into two parts, (1) a training set and (2) a test set, and apply the above algorithms to the training set. Since a microarray dataset contains very few samples and a large number of genes (features), I want to reduce the number of genes with the different feature or gene selection algorithms and compare their performance against each other. To compare performance, I will use a classification algorithm, e.g. an SVM, to classify the test set.

ADD REPLY • link 9.6 years ago babumanish837 ▴ 10
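
The workflow described above (split, select genes on the training set only, classify the test set) could be sketched roughly as follows. This is only an illustration in base R under several assumptions: the matrix `x` is simulated and stands in for `exprs(eset)`, the helper names `top_k_ttest` and `nearest_centroid` are hypothetical, and a nearest-centroid classifier is used as a simple stand-in for the SVM so that no extra packages are needed.

```r
set.seed(1)

# Simulated stand-in for exprs(eset): 200 genes x 40 samples (hypothetical data)
n_genes <- 200
n_samples <- 40
x <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
y <- factor(rep(c("normal", "tumor"), each = n_samples / 2))

# Make the first 10 genes truly different between the two classes
x[1:10, y == "tumor"] <- x[1:10, y == "tumor"] + 2

# Stratified split: 14 training samples per class, the rest form the test set
train_idx <- unlist(lapply(split(seq_len(n_samples), y),
                           function(i) sample(i, 14)))
test_idx <- setdiff(seq_len(n_samples), train_idx)

# Rank genes by Welch t-test p-value, using the training samples only
top_k_ttest <- function(x_train, y_train, k) {
  pvals <- apply(x_train, 1, function(g) t.test(g ~ y_train)$p.value)
  names(sort(pvals))[seq_len(k)]
}

# Nearest-centroid classifier (a simple stand-in for the SVM)
nearest_centroid <- function(x_train, y_train, x_test) {
  centroids <- sapply(levels(y_train), function(cl)
    rowMeans(x_train[, y_train == cl, drop = FALSE]))
  # Squared Euclidean distance from each test sample to each class centroid
  d <- apply(x_test, 2, function(s) colSums((centroids - s)^2))
  factor(levels(y_train)[apply(d, 2, which.min)], levels = levels(y_train))
}

selected <- top_k_ttest(x[, train_idx], y[train_idx], k = 10)
pred <- nearest_centroid(x[selected, train_idx], y[train_idx],
                         x[selected, test_idx])
accuracy <- mean(pred == y[test_idx])
```

To compare selection methods, the same split and classifier would be reused while only the function that produces `selected` changes. Keeping the selection inside the training set is what makes the test-set accuracy a fair comparison.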

score 0 · Answer 1 · 2015-09-10

Dear Babumanish837,

You could first check the comprehensive vignette (http://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html), which describes in detail how to use the GEOquery package. Generally, you would first normalize your ExpressionSet and then apply some kind of non-specific filtering (e.g. non-specific intensity filtering or another combined filter) to obtain a subset for your classification procedure. In this specific case, since you have already applied a log2 transformation and have your expression set, you could proceed as follows:

1) Inspect how the data look with a boxplot: boxplot(as.data.frame(exprs(eset)))

2) Use other plots for an exploratory analysis (histograms, PCA plots, QQ plots, MDS plots) to inspect your data further.

3) The choice of filter is somewhat arbitrary and depends on the experimental design. For instance, you could perform a statistical test (e.g. with limma) and then select a subset of the differentially expressed genes (DEGs) as candidates for classification. Or use another combined filtering procedure, like the one described in the multtest R package, implemented here with genefilter functions:

library(genefilter)
e <- exprs(eset)  # log2 expression matrix (probesets x samples)
# Double filter: at least 40% of the samples have an intensity above 100,
# and the coefficient of variation (sd/mean) is between 0.7 and 10
my_fun <- filterfun(pOverA(p = 0.4, A = 100), cv(a = 0.7, b = 10))
my_filter <- genefilter(2^e, my_fun)  # undo the log2 transform, then apply the filter
eset_filter <- eset[my_filter, ]      # keep the "reliable" probesets
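
To make the two criteria concrete, here is a small base R sketch of what the double filter computes, as I understand the `pOverA` and `cv` definitions in genefilter. The toy matrix `intens` and its row names are made up for illustration and are on the original (unlogged) intensity scale.

```r
# Toy intensity matrix on the unlogged scale: 4 probesets x 5 samples
intens <- rbind(
  low_flat   = c(20, 25, 22, 24, 21),       # never above 100
  high_flat  = c(500, 510, 495, 505, 490),  # expressed, but barely varies
  high_var   = c(50, 900, 120, 2000, 60),   # above 100 in 3 of 5 samples, variable
  rare_spike = c(10, 12, 11, 3000, 9)       # above 100 in only 1 of 5 samples
)

# pOverA(p = 0.4, A = 100): proportion of samples with intensity > 100 is >= 0.4
pass_intensity <- rowMeans(intens > 100) >= 0.4

# cv(a = 0.7, b = 10): sd / |mean| lies between 0.7 and 10
cv_values <- apply(intens, 1, sd) / abs(rowMeans(intens))
pass_cv <- cv_values >= 0.7 & cv_values <= 10

# A probeset is kept only if it passes both criteria
keep <- pass_intensity & pass_cv
```

In this toy case only `high_var` passes both criteria: `low_flat` and `rare_spike` fail the intensity criterion, and `high_flat` fails the variability criterion. The thresholds themselves are a judgment call and should be adapted to the platform and the study.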

Note also that the limma User's Guide describes excellent preprocessing steps and various filtering approaches for many kinds of studies, but the final choice is up to you.

Best,

Efstathios