Running multiple KNN queries in parallel
SamGG ▴ 360
@samgg-6428
France/Marseille/Inserm

Hi,

I know that it is possible to parallelize a single KNN query.

I would like to run multiple KNN queries in parallel instead, although I am unsure whether this has any benefit over parallelizing a single query.

I tried the following without success. I am on Windows, so I am using the "snow" backend. It is probably not possible to pass an object that holds an external pointer to the workers, but I would appreciate it if someone could confirm this.

Best.

data(iris)

# Convert to a numeric matrix, dropping the species column
X <- as.matrix(iris[,-5])

# Build a search index
library(BiocNeighbors)

prebuilt <- buildIndex(X, BNPARAM = AnnoyParam(
  ntrees = 50,
  distance = "Euclidean"
))
out2 <- queryKNN(prebuilt, X, k=5)

# Set up parallelization
library(BiocParallel)

FUN <- function(x, prebuilt) {
  suppressPackageStartupMessages({
    library(BiocNeighbors)
  })
  queryKNN(prebuilt, x, k=5)
}

# check FUN; this works
FUN(as.data.frame(X[1:10,]), prebuilt)

# Define a 2-worker SOCK Snow cluster.
snow <- SnowParam(workers = 2, type = "SOCK")

# RUN: creates the cluster and distributes the work; this fails
bplapply(split(as.data.frame(X), 1:5), FUN, prebuilt, BPPARAM = snow)
BiocParallel BiocNeighbors • 297 views
Aaron Lun ★ 28k
@alun
The city by the bay

That is correct: you can't pass the prebuilt objects to different processes. MulticoreParam() might work on POSIX systems, but I'm not sure how forking interacts with memory addresses.

If you want to make use of multiple cores, it's easier to parallelize a single query via num.threads=. If you can't do that, recreate the prebuilt object in each process.
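As a rough sketch of those two options, reusing the iris example from the question: the first call assumes a BiocNeighbors version whose queryKNN() accepts num.threads=, and the second rebuilds the index inside each worker instead of shipping the prebuilt object. The names build_and_query, ref, chunks, res_threads and res_workers are made up for illustration. Annoy is an approximate, randomized method, so an index rebuilt in each worker is not guaranteed to return exactly the same neighbors as the original one.

```r
library(BiocNeighbors)
library(BiocParallel)

X <- as.matrix(iris[, -5])

# Option 1: one process, several threads for a single query.
# Assumes queryKNN() supports num.threads= in the installed version.
prebuilt <- buildIndex(X, BNPARAM = AnnoyParam(ntrees = 50, distance = "Euclidean"))
res_threads <- queryKNN(prebuilt, X, k = 5, num.threads = 2)

# Option 2: several processes, each building its own index.
# Only plain R objects (the query chunk and the reference matrix) are sent
# to the workers, so no external pointer crosses a process boundary.
build_and_query <- function(x, ref) {
  suppressPackageStartupMessages(library(BiocNeighbors))
  idx <- buildIndex(ref, BNPARAM = AnnoyParam(ntrees = 50, distance = "Euclidean"))
  queryKNN(idx, as.matrix(x), k = 5)
}

# Split the query rows into 5 chunks of 30 rows each.
chunks <- split(as.data.frame(X), rep(1:5, length.out = nrow(X)))

# The reference matrix is passed as 'ref' to avoid clashing with
# bplapply()'s own 'X' argument.
snow <- SnowParam(workers = 2, type = "SOCK")
res_workers <- bplapply(chunks, build_and_query, ref = X, BPPARAM = snow)
```

Option 2 pays the index construction cost once per chunk, so it is only likely to be worthwhile when each chunk carries much more query work than the build itself.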

I suppose it would be possible for each algorithm's prebuilt index to serialize its internal state into a file that could be read back into each process to recreate the index (e.g., as is done by the Annoy and HNSW libraries). However, this requires a round trip to the filesystem and increases memory usage across all processes; it's not really worth it given that we can more easily parallelize the query within a single process.

Thank you for your detailed answer, I appreciate it and your work.
