Hi,
I'm working with the impute::impute.knn function and noticed that the imputation result changes depending on the row order of the input matrix. I expected that setting rng.seed would give the same result regardless of row order. Although the results differ, the Spearman and Pearson correlations between samples are still very high. Is this a bug? If not, I'd appreciate any help understanding this behavior.
Thanks!
#Load the impute library
> library(impute)
#Load the documentation sample data
> data(khanmiss)
> khan.expr <- khanmiss[-1, -(1:2)]
#Add row names for easier sorting later
> rownames(khan.expr) <- paste0("Gene", 1:2308)
#Check that no random seed exists prior to running imputation on the data set as is
> if(exists(".Random.seed")) rm(.Random.seed)
#Run imputation
> Result1_OriginalOrder <- impute.knn(as.matrix(khan.expr), rng.seed = 500)
Cluster size 2308 broken into 1509 799
Cluster size 1509 broken into 401 1108
Done cluster 401
Done cluster 1108
Done cluster 1509
Done cluster 799
#Create a new row order for the expression matrix
> khan.expr_neworder <- khan.expr[c(2308:1),]
#Clear out any random number state that may be stored
> if(exists(".Random.seed")) rm(.Random.seed)
#Run imputation on the new matrix order
> Result2_ChangedOrder <- impute.knn(as.matrix(khan.expr_neworder), rng.seed = 500)
Cluster size 2308 broken into 1458 850
Done cluster 1458
Done cluster 850
#Extract the imputed results and match row-order between the two data frames
> result1_data <- Result1_OriginalOrder$data
> result2 <- Result2_ChangedOrder$data
> result2_data <- result2[rownames(result1_data),]
#Confirm that these results are different; otherwise all.equal would be TRUE
> all.equal(result1_data, result2_data)
[1] "Mean relative difference: 0.1456079"
#Restore the original row order from the reordered matrix and confirm it matches the original
> if(exists(".Random.seed")) rm(.Random.seed)
> khan.expr_originalorder <- khan.expr_neworder[c(2308:1),]
> all.equal(khan.expr, khan.expr_originalorder)
[1] TRUE
#Run imputation on the restored original order
> Result3_OriginalOrder <- impute.knn(as.matrix(khan.expr_originalorder), rng.seed = 500)
Cluster size 2308 broken into 1509 799
Cluster size 1509 broken into 401 1108
Done cluster 401
Done cluster 1108
Done cluster 1509
Done cluster 799
> result3_data <- Result3_OriginalOrder$data
# Confirm that the original matrix order produces the same result
> all.equal(result3_data, result1_data)
[1] TRUE
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] impute_1.76.0
loaded via a namespace (and not attached):
[1] digest_0.6.33 fastmap_1.1.1 xfun_0.41 lattice_0.22-5 knitr_1.45 parallel_4.3.1
[7] htmltools_0.5.7 rmarkdown_2.25 cli_3.6.1 ape_5.7-1 grid_4.3.1 compiler_4.3.1
[13] rstudioapi_0.15.0 tools_4.3.1 nlme_3.1-164 phylotools_0.2.2 evaluate_0.23 yaml_2.3.8
[19] Rcpp_1.0.11 rlang_1.1.2
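Since observed values are left untouched by imputation, the two runs can only differ in cells that were NA in the input. The sketch below restricts the comparison to those cells. The helper name compare_imputed and the toy matrices are made up for illustration and are not part of the impute package.

```r
# Hypothetical helper: compare two imputed matrices only on the cells
# that were missing in the original input.
compare_imputed <- function(original, imputed_a, imputed_b) {
  stopifnot(identical(dim(original), dim(imputed_a)),
            identical(dim(original), dim(imputed_b)))
  miss <- is.na(original)          # cells that had to be imputed
  a <- imputed_a[miss]
  b <- imputed_b[miss]
  c(n_imputed    = sum(miss),
    max_abs_diff = max(abs(a - b)),
    pearson      = cor(a, b),
    spearman     = cor(a, b, method = "spearman"))
}

# Toy data standing in for two imputation runs (values are invented)
original <- matrix(c(1, NA, 3, 4, NA, 6, 7, 8, NA), nrow = 3)
run_a <- original; run_a[is.na(original)] <- c(2.0, 5.0, 9.0)
run_b <- original; run_b[is.na(original)] <- c(2.5, 5.5, 8.0)
toy <- compare_imputed(original, run_a, run_b)
print(toy)
```

With the objects from the session above, the call would be compare_imputed(as.matrix(khan.expr), result1_data, result2_data), assuming all three matrices are in the same row order.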
Thanks! I'm still confused about why maxp has this effect. From the documentation, imputation happens gene-wise, but if all neighbors are missing for a gene, the overall column mean for that block of genes is used as the imputed value. That block would in turn depend on the recursive two-means clustering into blocks of at most maxp genes.
In what case are all neighbors missing, and how does that relate to rowmax? And should maxp be set as high as possible for the best reproducibility?
From
?impute.knn
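When you run with the first ordering it says
Cluster size 2308 broken into 1509 799
Cluster size 1509 broken into 401 1108
and when you reorder it says
Cluster size 2308 broken into 1458 850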
So in the first case you are breaking your data into three subsets (401, 1108, and 799 genes) and then imputing each missing value using the 10 nearest neighbors that also happen to be in the same subset of genes. When you reorder, the two-means clustering splits the data differently and you end up with only two clusters (1458 and 850 genes). If you are imputing from a different subset of genes, you should expect different results. If you set maxp = p, where p is the number of rows, then you don't subcluster, so the row order no longer matters.
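A minimal sketch of that suggestion, reusing the khanmiss example from the question: maxp is set to the number of rows so that no two-means splitting takes place, and the two row orders are imputed and compared. The variable names are illustrative. Whether all.equal reports TRUE here is an expectation based on the explanation above rather than a verified result, since ties among neighbors could in principle still be broken differently.

```r
library(impute)

# Documentation sample data, prepared as in the question
data(khanmiss)
khan.expr <- as.matrix(khanmiss[-1, -(1:2)])
rownames(khan.expr) <- paste0("Gene", seq_len(nrow(khan.expr)))

# Same matrix with the row order reversed
reversed <- khan.expr[rev(seq_len(nrow(khan.expr))), ]

# maxp = number of rows: the whole matrix is one block, so no subclustering
p <- nrow(khan.expr)
res_fwd <- impute.knn(khan.expr, maxp = p, rng.seed = 500)$data
res_rev <- impute.knn(reversed,  maxp = p, rng.seed = 500)$data

# Put the second result back into the first result's row order, then compare
res_rev <- res_rev[rownames(res_fwd), ]
print(all.equal(res_fwd, res_rev))
```

The trade-off is run time: the split into blocks exists to keep the neighbor search cheap, so a single block of all genes will be slower on large matrices.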