He, Yiwen NIH/CIT
Hi,
I have R version 2.0.1 and Bioconductor 1.5 on both a PC and a Unix machine.
I was trying to use the impute.knn function of the impute package on a
dataset of 7332 genes and 3 arrays:
> library(impute)
> dim(dd)
[1] 7332 3
> is.matrix(dd)
[1] TRUE
> dd.imputed <- impute.knn(dd)
When run on the PC (Windows XP), R crashes after a few seconds. When run on
the Unix box, I see the following output:
Cluster size 7332 broken into 5667 1665
Cluster size 5667 broken into 4141 1526
Cluster size 4141 broken into 1796 2345
Cluster size 1796 broken into 840 956
Done cluster 840
Done cluster 956
Done cluster 1796
The R session then closed, so the clustering started but was aborted
somewhere in the middle.
I searched the archive and found another report of this problem, for a
dataset of 30000 x 2, but it had no answers.
I found some interesting behavior when playing around with the parameters
and data size:
1).
> impute.knn(dd, k=3) works, but for k bigger than 3, R crashes as described.
2).
> dd2 <- cbind(dd,dd)
> dim(dd2)
[1] 7332 6
> impute.knn(dd2, k=8) works, but for k bigger than 8, R crashes.
3).
> dd3 <- cbind(dd, dd, dd)
> dim(dd3)
[1] 7332 9
> impute.knn(dd3) works. (k defaults to 10)
> impute.knn(dd3, k=17) R crashes.
I also tried other parameters, but they didn't help.
My conclusion is that the number of neighbors (k) is critical here. However,
it's not clear how to choose it based on the data size.
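For anyone who wants to look at this without the original data, here is a minimal sketch in base R that builds a matrix of the same shape (7332 x 3) and summarizes its missing values. The name dd_demo, the random values, and the 5% missing rate are assumptions for illustration only; they are not the actual dataset, and it has not been confirmed that such a matrix triggers the crash.

```r
# Synthetic stand-in with the same dimensions as dd in the report.
# The values and the 5% missing rate are assumptions, not the real data.
set.seed(1)
n_genes  <- 7332
n_arrays <- 3
dd_demo <- matrix(rnorm(n_genes * n_arrays), nrow = n_genes, ncol = n_arrays)

# Knock out a random 5% of the entries.
n_missing <- round(0.05 * length(dd_demo))
dd_demo[sample(length(dd_demo), n_missing)] <- NA

# Number of missing values in each gene (row), tabulated.
na_per_row <- rowSums(is.na(dd_demo))
print(table(na_per_row))

# Number of genes with no missing values at all.
n_complete <- sum(na_per_row == 0)
cat("complete rows:", n_complete, "of", n_genes, "\n")

# Fraction of missing values in each array (column).
print(colMeans(is.na(dd_demo)))
```

Running the same summaries on the real dd, and then trying impute.knn on it with different values of k, would show whether the crash depends on the missing-value pattern or only on the matrix dimensions.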
Can anybody help, or at least point me to the maintainer of the impute
package?
Thanks, Yiwen
Yiwen He
Contractor
Center for Information Technology
National Institutes of Health