Dick Beyer <dbeyer@u.washington.edu> writes:
> Hi Kasper,
>
> Thanks for pointing out my problem with pamr.train. On closer
> examination, my problem seems slightly different than what I asked
> about earlier as it is occurring in pamr.cv.
>
> Every class has 3 samples, so pamr.train is ok, but not pamr.cv:
>
>>table(z)
> z
> 1 2 3 4 5 6 7 8
> 3 3 3 3 3 3 3 3
>>my.data <- list(x=dendmat,y=factor(z))
>>my.train <- pamr.train(my.data)
> 123456789101112131415161718192021222324252627282930
>> my.cv <- pamr.cv(my.train, my.data)
> Fold 1 :Error in nsc(x[, -folds[[ii]]], y = argy[-folds[[ii]]], x[, folds[[ii]], :
> Error: each class must have >1 sample
>
> Has anyone seen this in pamr.cv before?
Probably still the same problem. Even though your original sample was
ok, when you do CV, each of the CV training sets must have at least two
samples in every category.
E.g. take a y-vector like
1,1,2,2,2,2
If you do 3-fold CV you must divide your set into 3 test sets, e.g. (if
you do not do randomization)
1,1
2,2
2,2
The corresponding training sets would be
2,2,2,2
1,1,2,2
1,1,2,2
so in this case you have a problem with the first training set, as it
contains no samples of class 1 at all. This is in principle only a
problem with small sample sizes, but if you have one or more categories
containing only a few samples you might run into this.
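
To make the toy example concrete, here is a minimal base-R sketch that counts how many samples of each class are left in every training set. The names (folds, train_counts, fold_ok) are just for illustration and have nothing to do with pamr's internals.

```r
# Toy labels from the example above: two samples of class 1, four of class 2.
y <- c(1, 1, 2, 2, 2, 2)

# Three consecutive test folds (no randomization), given as index vectors.
folds <- list(1:2, 3:4, 5:6)

# For each fold, count how many samples of every class remain for training.
# Rows are classes, columns are folds.
train_counts <- sapply(folds, function(test_idx) {
  table(factor(y[-test_idx], levels = sort(unique(y))))
})
print(train_counts)

# A fold is usable only if every class keeps at least two training samples.
fold_ok <- apply(train_counts, 2, function(n) all(n >= 2))
print(fold_ok)
```

The first column of train_counts shows zero samples of class 1, which is the situation the error message complains about.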
As far as I can ascertain, in your case it is doing 3-fold CV. This
means that each test set is a sample of size 8 from your z
vector. Unless you sample exactly one of each of the 8 categories,
you will have the error, because any category that loses two or more
samples to the test set is left with at most one in the training
set. So you have way too few samples of each category... Leave-one-out
CV would work though, since removing a single sample still leaves two
in its category. But is it really possible to make good class
predictions based on 3 samples of each class?
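
A rough way to see how unlikely a usable split is: the base-R sketch below assumes plain random folds (I have not checked how pamr.cv actually builds its folds) and compares them with a stratified split and with leave-one-out. All helper names (folds_ok, random_folds, stratified_folds) are made up for this illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

z <- rep(1:8, each = 3)  # 8 classes, 3 samples each, as in table(z)

# TRUE if every class keeps at least `min_n` samples in every training set.
folds_ok <- function(y, folds, min_n = 2) {
  all(sapply(folds, function(test_idx) {
    all(table(factor(y[-test_idx], levels = unique(y))) >= min_n)
  }))
}

# Random k-fold split: shuffle the indices and deal them into k test sets.
random_folds <- function(y, k = 3) {
  split(sample(seq_along(y)), rep(seq_len(k), length.out = length(y)))
}

# Fraction of random 3-fold splits in which all training sets are usable.
ok_random <- mean(replicate(2000, folds_ok(z, random_folds(z))))

# Stratified split: each fold takes exactly one sample from every class.
stratified_folds <- function(y) {
  fold_id <- ave(seq_along(y), y, FUN = function(i) sample(seq_along(i)))
  split(seq_along(y), fold_id)
}
ok_stratified <- folds_ok(z, stratified_folds(z))

# Leave-one-out: every test set is a single sample.
ok_loo <- folds_ok(z, as.list(seq_along(z)))

print(c(random = ok_random, stratified = ok_stratified, loo = ok_loo))
```

The random fraction should come out close to zero, while the stratified and leave-one-out splits always pass. I believe pamr.cv has a folds argument that takes such a list of test indices, but that is an assumption on my part, so check ?pamr.cv before relying on it.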
/Kasper
> *******************************************************************************
> Richard P. Beyer, Ph.D. University of Washington
> Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695
> Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100
> Seattle, WA 98105-6099
> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
> *******************************************************************************
>
> On Wed, 28 Jul 2004, Kasper Daniel Hansen wrote:
>
>> Dick Beyer <dbeyer@u.washington.edu> writes:
>>
>> > I am having trouble with pamr.train and subsequently pamr.cv.
>> >
>> > In the pamr documentation, the following works:
>> >
>> > set.seed(120)
>> > x <- matrix(rnorm(1000*20),ncol=20)
>> > y <- sample(c(1:4),size=20,replace=TRUE)
>> > mydata <- list(x=x,y=y)
>> > mytrain <- pamr.train(mydata)
>> > mycv <- pamr.cv(mytrain,mydata)
>> >
>> > But if you change the seed, it doesn't:
>> >
>> > set.seed(1123)
>> > x <- matrix(rnorm(1000*20),ncol=20)
>> > y <- sample(c(1:4),size=20,replace=TRUE)
>> > mydata <- list(x=x,y=y)
>> > mytrain <- pamr.train(mydata)
>> > Error in nsc(data$x[gene.subset, sample.subset], y = y, proby = proby, :
>> > Error: each class must have >1 sample
>> >
>> > There is discussion in the documents
>> > (http://www-stat.stanford.edu/~tibs/PAM/Rdist/doc/readme.html) about
>> > "fragile" functions, but I have not been able to understand how to
>> > make this error go away. If anyone has had this problem or has some
>> > advice, I would be eternally grateful.
>>
>> If you look at the y-vector you will notice it looks like this
>> > table(y)
>> y
>> 1 2 3 4
>> 1 6 5 8
>>
>> Hence there is only 1 sample with a class of "1". Of course this
>> happens when you sample 20 times from a set of 4 values. From the
>> error message it seems that the method requires at least two samples
>> from every class.
>>
>> Possible solutions (quick solutions, I am not too familiar with pamr):
>> - increase the size, so that a class with only one sample is very
>> unlikely.
>> - fit the data, disregarding the single sample and using only 3
>> classes
>>
>> /Kasper
>>
>> --
>> Kasper Daniel Hansen, Research Assistant
>> Department of Biostatistics, University of Copenhagen
>>
--
Kasper Daniel Hansen, Research Assistant
Department of Biostatistics, University of Copenhagen