Question

Running multiple KNN queries in parallel

SamGG ▴ 360

@samgg-6428

France/Marseille/Inserm

Hi,

I know that it is possible to parallelize a single KNN query.

I want to run multiple KNN queries in parallel, although I am unsure whether there is any benefit compared to the previous method.

I tried the following, but it did not succeed. I am on Windows, so I am using the "snow" option. It is probably not possible to pass an object that holds an external pointer to the workers, but I would appreciate it if someone could confirm.

Best.

data(iris)

# Convert to a numeric matrix, dropping the species column
X <- as.matrix(iris[,-5])

# Build a search index
library(BiocNeighbors)

prebuilt <- buildIndex(X, BNPARAM = AnnoyParam(
  ntrees = 50,
  distance = "Euclidean"
))
out2 <- queryKNN(prebuilt, X, k=5)

# Set up parallelization
library(BiocParallel)

FUN <- function(x, prebuilt) {
  suppressPackageStartupMessages({
    library(BiocNeighbors)
  })
  queryKNN(prebuilt, x, k=5)
}

# check FUN; this works
FUN(as.data.frame(X[1:10,]), prebuilt)

# Define a 2-worker SOCK Snow cluster.
snow <- SnowParam(workers = 2, type = "SOCK")

# RUN: creates the cluster and distributes the work; this fails
bplapply(split(as.data.frame(X), 1:5), FUN, prebuilt, BPPARAM = snow)

BiocParallel BiocNeighbors • 461 views

4 months ago · SamGG ▴ 360

score 2 · Accepted Answer · 2024-12-07

That is correct: you can't pass the prebuilt objects to different processes, because the external pointer they hold is only valid in the process that created it. MulticoreParam() might work on POSIX systems, but I'm not sure how forking interacts with memory addresses.

If you want to use multiple cores, it's easier to parallelize the query itself via num.threads=. If you can't do that, recreate the prebuilt object in each process.
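Both suggestions can be sketched as follows. This is a minimal illustration rather than a tested recipe: it assumes a BiocNeighbors version whose queryKNN() accepts num.threads=, and the names mat, chunks and query_chunk are made up for the example. Note that the second variant rebuilds the index once per chunk, so it only pays off when building is cheap relative to querying.

```r
library(BiocNeighbors)
library(BiocParallel)

mat <- as.matrix(iris[, -5])
param <- AnnoyParam(ntrees = 50, distance = "Euclidean")

# Option 1: one process, several threads inside the query itself.
# Assumes queryKNN() supports num.threads= in the installed version.
prebuilt <- buildIndex(mat, BNPARAM = param)
out_threads <- queryKNN(prebuilt, mat, k = 5, num.threads = 2)

# Option 2: several processes, each rebuilding its own index.
# Only plain R objects (row indices, the matrix, the parameter object)
# are sent to the workers; the external pointer never leaves the worker.
chunks <- split(seq_len(nrow(mat)), rep(1:5, length.out = nrow(mat)))

query_chunk <- function(rows, data, param) {
  idx <- BiocNeighbors::buildIndex(data, BNPARAM = param)
  BiocNeighbors::queryKNN(idx, data[rows, , drop = FALSE], k = 5)
}

snow <- SnowParam(workers = 2, type = "SOCK")
out_chunks <- bplapply(chunks, query_chunk, data = mat, param = param,
                       BPPARAM = snow)
```

Each element of out_chunks is the usual list with index and distance matrices for the rows in that chunk, so the pieces still have to be reordered if the original row order matters.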

I suppose it would be possible for each algorithm's prebuilt index to serialize its internal state into a file that could be read back into each process to recreate the index (e.g., as is done by the Annoy and HNSW libraries). However, this requires a round trip to the filesystem and increases memory usage across all processes; it's not really worth it given that we can more easily parallelize the query within a single process.
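For illustration only, the file round trip described above looks roughly like this when using the RcppAnnoy package directly. This is an assumption-laden sketch: it bypasses BiocNeighbors entirely, nothing here implies that BiocNeighbors exposes such save/load functionality, and the second index is loaded in the same session simply to keep the example short.

```r
library(RcppAnnoy)

mat <- as.matrix(iris[, -5])

# Build an Annoy index and write it to disk.
# Annoy item ids are 0-based, hence the i - 1L.
ann <- new(AnnoyEuclidean, ncol(mat))
for (i in seq_len(nrow(mat))) {
  ann$addItem(i - 1L, mat[i, ])
}
ann$build(50)

path <- tempfile(fileext = ".ann")
ann$save(path)

# A worker process would receive only the file path (a plain string),
# then recreate its own index object from the file.
ann_worker <- new(AnnoyEuclidean, ncol(mat))
ann_worker$load(path)

# 5 nearest neighbours of the first observation (0-based item ids)
nn <- ann_worker$getNNsByVector(mat[1, ], 5)
```

Every process that loads the file holds its own handle to the index, which is the extra filesystem and memory cost the answer refers to.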