Benchmarking
Running Standardized Benchmarks
To run standardized ANN benchmarks, either use the Python module to write your own benchmark scripts or run the example program ggnn_benchmark, which is compiled alongside the C++ library and can be applied to arbitrary datasets.
Everything dataset-specific can be configured via the following command line parameters:
base: Path to the base dataset .fvecs or .bvecs file.
subset (optional): To load only a subset of the base dataset, specify the size of that subset here. Only the first subset many points will be loaded. By default, or if set to 0, the entire base dataset file will be loaded.
query: Path to the query dataset .fvecs or .bvecs file.
gt (optional): Path to the ground truth indices .ivecs file.
Note
If not given, the ground truth will be brute-forced, if possible.
If a file name is given, but the file does not exist, the brute-forced result will be stored.
graph_dir (optional): Directory for loading/storing the GGNN graph or graph shards.
Note
If the directory already contains a GGNN graph, it will be loaded and construction will be skipped. Otherwise, the constructed graph will be stored in this directory.
Note
If left empty, the graph will be discarded when the program ends.
If necessary (i.e., if GPU memory is insufficient to keep all shards loaded), GGNN will swap out shards from GPUs to RAM and disk automatically in multi-shard settings.
In that case, GGNN graph shards will be stored in the current working directory.
k_build (optional, default 24): Number of neighbors per point in the search graph (see Search Graph Parameters).
tau_build (optional, default 0.5): Slack factor for search graph construction (see Search Graph Parameters).
refinement_iterations (optional, default 2): Number of iterations for search graph refinement.
k_query (optional, default 10): Number of neighbors to search for (see Query Parameters).
measure (optional, default euclidean): Distance measure (euclidean or cosine) (see Distance Measures).
shard_size (optional): Number of points per shard. With sharding, the base dataset is split into equally-sized shards. This parameter defines the size of one shard.
Caution
The number of points in the base dataset must be evenly divisible by the shard size. The resulting number of shards must be evenly divisible by the number of GPUs.
gpu_ids (optional): CUDA device indices of the GPUs to be used by GGNN, separated by spaces. E.g., '0 1 2 3'.
Note
Using multiple GPUs requires sharding (see shard_size).
Tip
CUDA device indices can be influenced by the CUDA Environment Variables CUDA_VISIBLE_DEVICES and CUDA_DEVICE_ORDER.
grid_search (optional): If set, run a larger sweep of queries with \(\tau_{query} \in [0.7, 2.0]\) rather than just a small set of queries.
v (optional): Verbosity level between 0 and 4 (maximum verbosity).
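The divisibility constraints on shard_size and gpu_ids can be checked before launching a long-running benchmark. The following is a minimal sketch, not part of GGNN: the function name count_shards is hypothetical, and treating shard_size 0 as "no sharding" is an assumption based on the example invocation, which passes 0 together with a single GPU.

```python
def count_shards(n_base: int, shard_size: int, n_gpus: int) -> int:
    """Return the number of shards for a layout, or raise ValueError.

    Hypothetical helper mirroring the documented constraints:
    - the number of base points must be divisible by shard_size
    - the number of shards must be divisible by the number of GPUs
    - using multiple GPUs requires sharding
    """
    if n_base <= 0 or n_gpus <= 0 or shard_size < 0:
        raise ValueError("n_base and n_gpus must be positive, shard_size non-negative")

    # Assumption: shard_size == 0 means the dataset is not sharded.
    if shard_size == 0:
        if n_gpus > 1:
            raise ValueError("using multiple GPUs requires sharding")
        return 1

    if n_base % shard_size != 0:
        raise ValueError("base dataset size is not divisible by shard_size")

    n_shards = n_base // shard_size
    if n_shards % n_gpus != 0:
        raise ValueError("number of shards is not divisible by the number of GPUs")
    return n_shards
```

For example, one million base points with a shard size of 250,000 on four GPUs give four shards, one per GPU, whereas a shard size of 300,000 is rejected.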
./build/ggnn_benchmark \
    --base /path/to/sift_base.fvecs \
    --query /path/to/sift_query.fvecs \
    --gt /path/to/sift_groundtruth.ivecs \
    --graph_dir ./ \
    --tau_build 0.5 \
    --refinement_iterations 2 \
    --k_build 24 \
    --k_query 10 \
    --measure euclidean \
    --shard_size 0 \
    --subset 0 \
    --gpu_ids 0 \
    --grid_search false
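When gt is omitted, the ground truth is brute-forced. To illustrate what exact ground truth means for the two supported measures, here is a minimal NumPy sketch. It is not GGNN's implementation; the function name brute_force_knn is hypothetical, and because it builds the full query-by-base distance matrix in memory it is only suitable for small datasets.

```python
import numpy as np


def brute_force_knn(base, query, k, measure="euclidean"):
    """Return the indices of the k exact nearest base points per query.

    Illustrative only: computes all pairwise distances at once.
    """
    base = np.asarray(base, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)

    if measure == "euclidean":
        # Squared Euclidean distance; the ranking equals that of the true distance.
        dist = (
            (query ** 2).sum(axis=1)[:, None]
            - 2.0 * (query @ base.T)
            + (base ** 2).sum(axis=1)[None, :]
        )
    elif measure == "cosine":
        # Cosine distance = 1 - cosine similarity (assumes no zero vectors).
        base_n = base / np.linalg.norm(base, axis=1, keepdims=True)
        query_n = query / np.linalg.norm(query, axis=1, keepdims=True)
        dist = 1.0 - query_n @ base_n.T
    else:
        raise ValueError("measure must be 'euclidean' or 'cosine'")

    # Sort each row by distance and keep the k closest indices.
    # int32 matches the element type of .ivecs files.
    return np.argsort(dist, axis=1, kind="stable")[:, :k].astype(np.int32)
```

The two measures can rank the same points differently: a base point that is far away but parallel to the query is the closest neighbor under cosine and among the farthest under euclidean.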
ANN-Benchmarks / HDF5
In order to run a benchmark from ANN-Benchmarks, you might want to load a dataset from an HDF5 file. You can do so with a simple Python script:
import h5py
import numpy as np
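# Load an ANN-Benchmarks-style HDF5 dataset.
# path_to_dataset is the path to the downloaded .hdf5 file.
with h5py.File(path_to_dataset, 'r') as f:
    base = np.array(f['train'])       # base vectors
    query = np.array(f['test'])       # query vectors
    gt = np.array(f['neighbors'])     # ground truth neighbor indices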
See also the example file examples/python/sift1m_hdf5.py.
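To feed such a dataset to ggnn_benchmark instead of the Python module, the arrays can be exported to the .fvecs and .ivecs files that the program reads. The sketch below assumes the common TEXMEX layout, in which every vector is stored as a little-endian 32-bit integer dimension followed by its components (float32 for .fvecs, int32 for .ivecs). The helpers write_vecs and read_vecs are hypothetical and not part of GGNN.

```python
import numpy as np


def write_vecs(path, vectors, dtype):
    """Write a 2D array as an .fvecs ("<f4") or .ivecs ("<i4") file."""
    vectors = np.ascontiguousarray(vectors, dtype=dtype)
    n, dim = vectors.shape
    # One packed record per vector: int32 dimension, then dim components.
    record = np.dtype([("dim", "<i4"), ("vec", dtype, (dim,))])
    out = np.empty(n, dtype=record)
    out["dim"] = dim
    out["vec"] = vectors
    out.tofile(path)


def read_vecs(path, dtype):
    """Read an .fvecs ("<f4") or .ivecs ("<i4") file into a 2D array."""
    # The first int32 of the file is the dimension shared by all vectors.
    dim = int(np.fromfile(path, dtype="<i4", count=1)[0])
    record = np.dtype([("dim", "<i4"), ("vec", dtype, (dim,))])
    return np.fromfile(path, dtype=record)["vec"]
```

With the arrays loaded above, write_vecs('base.fvecs', base, '<f4'), write_vecs('query.fvecs', query, '<f4'), and write_vecs('gt.ivecs', gt, '<i4') would produce the three input files; the file names are placeholders.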
Reference Configurations
The default values in the ggnn_benchmark program are chosen for the SIFT1M dataset. For other datasets, set the parameters as documented in the GGNN paper.
Note
We will update this documentation shortly to reference all necessary configurations.
For now, check the .cu files per dataset under src in the release_0.5 branch
and the official paper GGNN: Graph-based GPU Nearest Neighbor Search.