This repository contains an implementation of the approximate k-Nearest Neighbors (ANN) algorithm using multiple parallelization techniques: Sequential, OpenMP, OpenCilk, and Pthreads
The project implements the k-Nearest Neighbors (k-NN) algorithm, which finds the closest data points to a given query point. This project aims to improve the k-NN algorithm's performance using parallelization strategies to handle large datasets efficiently. The project uses the ANN (Approximate k-NN) approach for faster processing at the cost of a small loss in accuracy (recall).
Before running the script, ensure that you have the following installed:
- make: For building the project.
- OpenMP: Install the OpenMP library for multi-threaded execution.
- OpenCilk: Install the OpenCilk runtime for parallel execution.
- Pthreads: Ensure that the Pthreads library is available for thread-based parallelization.
- OpenBLAS: For fast linear algebra computations, used in distance calculation.
- Linux/Unix environment: For running the bash scripts and performance tools.
To compile and execute different implementations on specified datasets, follow these steps.
Script location:
The script used for running benchmarks and experiments is located at:
`./Results/run.bash`- Note: Before running parallel implementations for a dataset, you must first execute the
Sequentialimplementation (need the results for recall and speedup calculations)
bash run_script.sh <type> <dataset> <sampling_reduction> <candidate_reduction> [MIN_SIZE_factor] [THREADS_NUM]-
<type>: The implementation type (Sequential,OpenMP,OpenCilk,Pthreads). -
<dataset>: The dataset to run the program on (mnist,fashion-mnist,sift). -
<sampling_reduction>: The sampling reduction factor (positive integer). -
<candidate_reduction>: The candidate reduction factor (positive integer). -
[MIN_SIZE_factor](optional): To divide MIN_SIZE (default is1). -
[THREADS_NUM](optional): The number of threads to use (default is1).- Note: For
Pthreads, the number of threads must be manually specified in the code by changing theMAX_THREADSdefine in./src/knn_Pthreads.c.
- Note: For
-
Run Sequential Implementation:
bash run.bash Sequential mnist
-
Run OpenCilk Implementation with 12 threads:
This command will run the
OpenCilkimplementation on themnistdataset with a sampling reduction of25, candidate reduction of200, and a minimum size factor of1. It will use12threads.bash run.bash OpenCilk mnist 25 200 1 12
After the script has completed running, the results will be saved in the following directory structure:
-
Analytic Results:
- Location:
./<type>/<dataset>/Analytic/results_<dataset>_<sampling_reduction>_<candidate_reduction>_<minsize_factor>.txt - Contains indices of k nearest neighbors for each query.
- Location:
-
Statistics
- Location:
./<type>/<dataset>/Statistics/statistics_<dataset>_<sampling_reduction>_<candidate_reduction>_<minsize_factor>.txt - Summarizes performance metrics, including execution time, recall, and queries per second.
- Location:
The directory structure will be created if it does not already exist.