That looks good--thanks
A question and a suggestion follow.
1. You say you used a linear kernel--Did you find this to be best
after
testing and/or optimising the other kernels?
2. A set of wrapper functions (for multtest, ipred, and e1071) that
consistently interface the affy data object with a few good feature
selection methods and classification methods might be a useful new
addition
to BioC.
Stephen Henderson
-----Original Message-----
From: Ramon Diaz [mailto:rdiaz@cnio.es]
Sent: Monday, January 27, 2003 1:38 PM
To: bioconductor@stat.math.ethz.ch
Cc: amateos@cnio.es
Subject: Re: [BioC] How to do k-fold validation using SVM
Sorry for jumping into the thread so late: I just read the posting
today.
Anyway, I have used the following code a couple of times, and maybe it
is of
some help to you. In each "round" (for each set of training data) I
select the best K genes (where best means with largest abs. value of t
statistic) and then fit the svm using those K genes.
For a couple of reasons, I use mt.maxT (from the multtest library) to
get
the
t statistic, but you can modify the function at your convenience (like
an ANOVA for > 3 or whatever you want). Note also that I use a linear
kernel.
Hope it helps,
Ramón
gene.select <- function(data, class, size = NULL, threshold = NULL) {
# t.stat <- apply(data, 2, function(x) {abs(t.test(x ~
class)$statistic)})
this is slower than
tmp <- mt.maxT(t(data), class, B= 1)[, c(1, 2)]
selected <- tmp[seq(1:size), 1]
return(selected)
}
cross.valid.svm <- function(data, y, knumber = 10, size = 200) {
## data is structured as subjects in rows, genes in columns
## (and thus is transposed inside gene.select to be fed to
mt.maxT).
## If you want leave-one-out, set knumber = NULL
## size is the number of genes used when building the classifier.
## those are the "best" genes, based on a t-statistic.
## First, selecting the data subsets for cross-validation
if(is.null(knumber)) {
knumber <- length(y)
leave.one.out <- TRUE
}
else leave.one.out <- FALSE
N <- length(y)
if(knumber > N) stop(message = "knumber has to be <= number of
subjects")
reps <- floor(N/knumber)
reps.last <- N - (knumber-1)*reps
index.select <- c( rep(seq(from = 1, to = (knumber - 1)), reps),
rep(knumber, reps.last))
index.select <- sample(index.select)
cv.errors <- matrix(-99, nrow = knumber, ncol = 4)
## Fit model for each data set.
for(sample.number in 1:knumber) {
## gene selection
gene.subset <- gene.select(data[index.select != sample.number,
],
y[index.select != sample.number],
size = size)
## predict from svm on that subset
y.f <- factor(y)
test.set <- data[index.select == sample.number, gene.subset]
if(is.null(dim(test.set))) test.set <- matrix(test.set, nrow =
1) ##
for leave-one-out
predicted <- predict(svm(data[index.select != sample.number,
gene.subset],
y.f[index.select != sample.number],
kernel = "linear"), test.set)
cv.errors[sample.number, 1] <- length(which((y.f[index.select
==
sample.number] == 1)
& predicted == 0))
cv.errors[sample.number, 2] <- length(which((y.f[index.select
==
sample.number] == 0)
& predicted == 1))
cv.errors[sample.number, 3] <- length(which(y.f[index.select
== sample.number] != predicted))
cv.errors[sample.number, 4] <- length(predicted)
}
cv.errors <- data.frame(cv.errors)
names(cv.errors) <- c("true.1.pred.0", "true.0.pred.1",
"total.error", "number.predicted")
average.error.rate <- sum(cv.errors[, 3])/sum(cv.errors[, 4])
return(list(cv.errors = cv.errors, average.error.rate =
average.error.rate))
}
## An example code:
cross.valid.svm(matrix.covar, class.vector, k = 10, size = 10)
--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
http://bioinfo.cnio.es/~rdiaz
**********************************************************************
This email and any files transmitted with it are confidential an ...
[[dropped]]