Vamana Graph Index

In this section, we cover the API and usage of the Vamana graph-based index.

class pysvs.Vamana

Top level class for the Vamana graph index.

__init__(self: pysvs.Vamana, config_path: str, graph_loader: pysvs.GraphLoader, data_loader: Union[pysvs.VectorDataLoader, pysvs.LVQLoader, pysvs.LeanVecLoader, pysvs.SerializedObject], distance: pysvs.DistanceType = <DistanceType.L2: 0>, query_type: pysvs.DataType = <DataType.float32: 9>, enforce_dims: bool = False, num_threads: int = 1) → None

Load a Vamana style index from disk.

Parameters:

config_path – Path to the directory where the index configuration file was generated.
graph_loader – The loader class for the graph.
data_loader – The loader for the dataset. See comment below for accepted types.
distance – The distance function to use.
query_type – The data type of the queries.
enforce_dims –
Require that the compiled dimensionality of the returned index matches the dimensionality provided in the data_loader argument. If a match is not found, an exception is thrown.

This is meant to ensure that specialized dimensionality is provided without falling back to generic implementations. Leaving the dims out when constructing the data_loader will with enable_dims = True will always attempt to use a generic implementation.
num_threads – The number of threads to use for queries (can be changed after loading).

The top level type is an abstract type backed by various specialized backends that will be instantiated based on their applicability to the particular problem instance.

The arguments upon which specialization is conducted are:

data_loader: Both kind (type of loader) and inner aspects of the loader like data type, quantization type, and number of dimensions.
distance: The distance measure being used.

Specializations compiled into the binary are listed below.

Method 0:

data_loader: VectorDataLoader with element type float32 and any dimension OR unknown – (union alternatives 0, 3)
distance: all values

Method 1:

data_loader: VectorDataLoader with element type float16 and any dimension OR unknown – (union alternatives 0, 3)
distance: all values

Method 2:

data_loader: VectorDataLoader with element type uint8 and any dimension OR unknown – (union alternatives 0, 3)
distance: all values

Method 3:

data_loader: VectorDataLoader with element type int8 and any dimension OR unknown – (union alternatives 0, 3)
distance: all values

Method 4:

data_loader: LVQLoader 4x0 (sequential) with any dimensions OR unknown – (union alternatives 1, 3)
distance: L2

Method 5:

data_loader: LVQLoader 4x0 (sequential) with any dimensions OR unknown – (union alternatives 1, 3)
distance: MIP

Method 6:

data_loader: LVQLoader 4x8 (sequential) with any dimensions OR unknown – (union alternatives 1, 3)
distance: L2

Method 7:

data_loader: LVQLoader 4x8 (sequential) with any dimensions OR unknown – (union alternatives 1, 3)
distance: MIP

Method 8:

data_loader: LVQLoader 4x8 (turbo<16x8>) with any dimensions OR unknown – (union alternatives 1, 3)
distance: L2

Method 9:

data_loader: LVQLoader 4x8 (turbo<16x8>) with any dimensions OR unknown – (union alternatives 1, 3)
distance: MIP

Method 10:

data_loader: LVQLoader 8x0 (sequential) with any dimensions OR unknown – (union alternatives 1, 3)
distance: L2

Method 11:

data_loader: LVQLoader 8x0 (sequential) with any dimensions OR unknown – (union alternatives 1, 3)
distance: MIP

Method 12:

data_loader: LVQLoader 8x8 (sequential) with any dimensions OR unknown – (union alternatives 1, 3)
distance: L2

Method 13:

data_loader: LVQLoader 8x8 (sequential) with any dimensions OR unknown – (union alternatives 1, 3)
distance: MIP

Method 14:

data_loader: LeanVecLoader dims-anyxany OR unknown – (union alternatives 2, 3)
distance: L2

Method 15:

data_loader: LeanVecLoader dims-anyxany OR unknown – (union alternatives 2, 3)
distance: MIP

Method 16:

data_loader: LeanVecLoader dims-anyxany OR unknown – (union alternatives 2, 3)
distance: L2

Method 17:

data_loader: LeanVecLoader dims-anyxany OR unknown – (union alternatives 2, 3)
distance: MIP

Method 18:

data_loader: LeanVecLoader dims-anyxany OR unknown – (union alternatives 2, 3)
distance: L2

Method 19:

data_loader: LeanVecLoader dims-anyxany OR unknown – (union alternatives 2, 3)
distance: MIP

Method 20:

data_loader: LeanVecLoader dims-anyxany OR unknown – (union alternatives 2, 3)
distance: L2

Method 21:

data_loader: LeanVecLoader dims-anyx64 OR unknown – (union alternatives 2, 3)
distance: L2

Method 22:

data_loader: LeanVecLoader dims-anyxany OR unknown – (union alternatives 2, 3)
distance: MIP

Method 23:

data_loader: LeanVecLoader dims-768x160 OR unknown – (union alternatives 2, 3)
distance: MIP

static build(*args, **kwargs)

Overloaded function.

build(parameters: pysvs.VamanaBuildParameters, py_data: numpy.ndarray[float16], distance_type: pysvs.DistanceType, num_threads: int = 1) -> pysvs.Vamana

Construct a Vamana index over the given data, returning a searchable index.

Parameters:

parameters – Parameters controlling graph construction. See the documentation of this class.
py_data – The dataset to index. NOTE: PySVS will maintain an internal copy of the dataset. This may change in future releases.
distance_type – The distance type to use for this dataset.
num_threads – The number of threads to use for index construction. Default: 1.

build(parameters: pysvs.VamanaBuildParameters, py_data: numpy.ndarray[numpy.float32], distance_type: pysvs.DistanceType, num_threads: int = 1) -> pysvs.Vamana

Construct a Vamana index over the given data, returning a searchable index.

Parameters:

parameters – Parameters controlling graph construction. See the documentation of this class.
py_data – The dataset to index. NOTE: PySVS will maintain an internal copy of the dataset. This may change in future releases.
distance_type – The distance type to use for this dataset.
num_threads – The number of threads to use for index construction. Default: 1.

build(parameters: pysvs.VamanaBuildParameters, py_data: numpy.ndarray[numpy.uint8], distance_type: pysvs.DistanceType, num_threads: int = 1) -> pysvs.Vamana

Construct a Vamana index over the given data, returning a searchable index.

Parameters:

parameters – Parameters controlling graph construction. See the documentation of this class.
py_data – The dataset to index. NOTE: PySVS will maintain an internal copy of the dataset. This may change in future releases.
distance_type – The distance type to use for this dataset.
num_threads – The number of threads to use for index construction. Default: 1.

build(parameters: pysvs.VamanaBuildParameters, py_data: numpy.ndarray[numpy.int8], distance_type: pysvs.DistanceType, num_threads: int = 1) -> pysvs.Vamana

Construct a Vamana index over the given data, returning a searchable index.

Parameters:

parameters – Parameters controlling graph construction. See the documentation of this class.
py_data – The dataset to index. NOTE: PySVS will maintain an internal copy of the dataset. This may change in future releases.
distance_type – The distance type to use for this dataset.
num_threads – The number of threads to use for index construction. Default: 1.

build(build_parameters: pysvs.VamanaBuildParameters, data_loader: Union[pysvs.VectorDataLoader, pysvs.LVQLoader, pysvs.LeanVecLoader], distance_type: pysvs.DistanceType, num_threads: int = 1) -> pysvs.Vamana

Construct a Vamana index over the given data file, returning a searchable index.

Parameters:

build_parameters (pysvs.VamanaBuildParameters) – Hyper-parameters controlling index build.
data_loader – The source of the data on-disk. Can either be pysvs.DataFile to represent a standard uncompressed dataset, or a compressed loader.
distance_type – The similarity-function to use for this index.
num_threads – The number of threads to use for index construction. Default: 1.

The top level type is an abstract type backed by various specialized backends that will be instantiated based on their applicability to the particular problem instance.

The arguments upon which specialization is conducted are:

data_loader: Both kind (type of loader) and inner aspects of the loader like data type, quantization type, and number of dimensions.
distance: The distance measure being used.

Specializations compiled into the binary are listed below.

Method 0:

data_loader: VectorDataLoader with element type float32 and any dimension – (union alternative 0)
distance: all values

Method 1:

data_loader: VectorDataLoader with element type float16 and any dimension – (union alternative 0)
distance: all values

Method 2:

data_loader: VectorDataLoader with element type uint8 and any dimension – (union alternative 0)
distance: all values

Method 3:

data_loader: VectorDataLoader with element type int8 and any dimension – (union alternative 0)
distance: all values

Method 4:

data_loader: LVQLoader 4x0 (sequential) with any dimensions – (union alternative 1)
distance: L2

Method 5:

data_loader: LVQLoader 4x0 (sequential) with any dimensions – (union alternative 1)
distance: MIP

Method 6:

data_loader: LVQLoader 4x8 (sequential) with any dimensions – (union alternative 1)
distance: L2

Method 7:

data_loader: LVQLoader 4x8 (sequential) with any dimensions – (union alternative 1)
distance: MIP

Method 8:

data_loader: LVQLoader 4x8 (turbo<16x8>) with any dimensions – (union alternative 1)
distance: L2

Method 9:

data_loader: LVQLoader 4x8 (turbo<16x8>) with any dimensions – (union alternative 1)
distance: MIP

Method 10:

data_loader: LVQLoader 8x0 (sequential) with any dimensions – (union alternative 1)
distance: L2

Method 11:

data_loader: LVQLoader 8x0 (sequential) with any dimensions – (union alternative 1)
distance: MIP

Method 12:

data_loader: LeanVecLoader dims-anyxany – (union alternative 2)
distance: L2

Method 13:

data_loader: LeanVecLoader dims-anyxany – (union alternative 2)
distance: MIP

Method 14:

data_loader: LeanVecLoader dims-anyxany – (union alternative 2)
distance: L2

Method 15:

data_loader: LeanVecLoader dims-anyxany – (union alternative 2)
distance: MIP

Method 16:

data_loader: LeanVecLoader dims-anyxany – (union alternative 2)
distance: L2

Method 17:

data_loader: LeanVecLoader dims-anyxany – (union alternative 2)
distance: MIP

Method 18:

data_loader: LeanVecLoader dims-anyxany – (union alternative 2)
distance: L2

Method 19:

data_loader: LeanVecLoader dims-anyx64 – (union alternative 2)
distance: L2

Method 20:

data_loader: LeanVecLoader dims-anyxany – (union alternative 2)
distance: MIP

Method 21:

data_loader: LeanVecLoader dims-768x160 – (union alternative 2)
distance: MIP

property dimensions: Return the logical number of dimensions for each vector in the dataset.

property experimental_backend_string

Get a string identifying the full-type of the backend implementation.

This property is experimental and subject to change without a deprecation warning.

Type:: Read Only (str)

experimental_calibrate(*args, **kwargs)

Overloaded function.

experimental_calibrate(self: pysvs.Vamana, queries: numpy.ndarray[float16], groundtruth: numpy.ndarray[numpy.uint32], num_neighbors: int, target_recall: float, calibration_parameters: pysvs.VamanaCalibrationParameters = <pysvs.VamanaCalibrationParameters object at 0x7f41207680f0>) -> pysvs.VamanaSearchParameters

NOTE: This method is experimental and subject to change or removal without notice.

Run an experimental calibration routine to select the best search parameters.

Parameters:

queries – Queries used to drive the calibration process.
groundtruth – The groundtruth for the given query set.
num_neighbors – The number of nearest neighbors to calibrate for.
target_recall – The target num_neighbors-recall-at-num_neighbors. If such a recall is possible, then calibration will find parameters that optimize performance at this recall level.
calibration_parameters – The hyper-parameters to use during calibration.

Returns:

The best pysvs.VamanaSearchParameters found.

The calibration routine will also configure the index with the best found parameters. Note that calibration will use the number of threads already assigned to the index and can therefore be used to tune the algorithm to different threading amounts.

See also: pysvs.VamanaCalibrationParameters

experimental_calibrate(self: pysvs.Vamana, queries: numpy.ndarray[numpy.float32], groundtruth: numpy.ndarray[numpy.uint32], num_neighbors: int, target_recall: float, calibration_parameters: pysvs.VamanaCalibrationParameters = <pysvs.VamanaCalibrationParameters object at 0x7f412062a3b0>) -> pysvs.VamanaSearchParameters

NOTE: This method is experimental and subject to change or removal without notice.

Run an experimental calibration routine to select the best search parameters.

Parameters:

queries – Queries used to drive the calibration process.
groundtruth – The groundtruth for the given query set.
num_neighbors – The number of nearest neighbors to calibrate for.
target_recall – The target num_neighbors-recall-at-num_neighbors. If such a recall is possible, then calibration will find parameters that optimize performance at this recall level.
calibration_parameters – The hyper-parameters to use during calibration.

Returns:

The best pysvs.VamanaSearchParameters found.

The calibration routine will also configure the index with the best found parameters. Note that calibration will use the number of threads already assigned to the index and can therefore be used to tune the algorithm to different threading amounts.

See also: pysvs.VamanaCalibrationParameters

experimental_reset_performance_parameters(self: pysvs.Vamana) → None

Reset the internal performance-only parameters to built-in heuristics. This can be useful if experimenting with different dataset implementations which may need different values for performance-only parameters (such as prefetchers).

Calling this method should not affect recall.

property num_threads

Get and set the number of threads used to process queries.

Type:: Read/Write (int)

property query_types: Return the query element types this index is specialized for.

reconstruct(self: pysvs.Vamana, ids: numpy.ndarray[numpy.uint64]) → numpy.ndarray[numpy.float32]

save(self: pysvs.Vamana, config_directory: str, graph_directory: str, data_directory: str) → None

Save a constructed index to disk (useful following index construction).

Parameters:

config_directory – Directory where index configuration information will be saved.
graph_directory – Directory where graph will be saved.
data_directory – Directory where the dataset will be saved.

Note: All directories should be separate to avoid accidental name collision with any auxiliary files that are needed when saving the various components of the index.

If the directory does not exist, it will be created if its parent exists.

It is the caller’s responsibilty to ensure that no existing data will be overwritten when saving the index to this directory.

search(*args, **kwargs)

Overloaded function.

search(self: pysvs.Vamana, queries: numpy.ndarray[float16], n_neighbors: int) -> tuple

Perform a search to return the n_neighbors approximate nearest neighbors to the query.

Parameters:

queries – Numpy Vector or Matrix representing the queries. If the argument is a vector, it will be treated as a single query. If the argument is a matrix, individual queries are assumed to the rows of the matrix. Returned results will have a position-wise correspondence with the queries. That is, the N-th row of the returned IDs and distances will correspond to the N-th row in the query matrix.
n_neighbors – The number of neighbors to return for this search job.

Returns:

A tuple (I, D) where I contains the n_neighbors approximate (or exact) nearest neighbors to the queries and D contains the approximate distances.

Note: This form is returned regardless of whether the given query was a vector or a matrix.

search(self: pysvs.Vamana, queries: numpy.ndarray[numpy.float32], n_neighbors: int) -> tuple

Perform a search to return the n_neighbors approximate nearest neighbors to the query.

Parameters:

queries – Numpy Vector or Matrix representing the queries. If the argument is a vector, it will be treated as a single query. If the argument is a matrix, individual queries are assumed to the rows of the matrix. Returned results will have a position-wise correspondence with the queries. That is, the N-th row of the returned IDs and distances will correspond to the N-th row in the query matrix.
n_neighbors – The number of neighbors to return for this search job.

Returns:

A tuple (I, D) where I contains the n_neighbors approximate (or exact) nearest neighbors to the queries and D contains the approximate distances.

Note: This form is returned regardless of whether the given query was a vector or a matrix.

search(self: pysvs.Vamana, queries: numpy.ndarray[numpy.uint8], n_neighbors: int) -> tuple

Perform a search to return the n_neighbors approximate nearest neighbors to the query.

Parameters:

queries – Numpy Vector or Matrix representing the queries. If the argument is a vector, it will be treated as a single query. If the argument is a matrix, individual queries are assumed to the rows of the matrix. Returned results will have a position-wise correspondence with the queries. That is, the N-th row of the returned IDs and distances will correspond to the N-th row in the query matrix.
n_neighbors – The number of neighbors to return for this search job.

Returns:

A tuple (I, D) where I contains the n_neighbors approximate (or exact) nearest neighbors to the queries and D contains the approximate distances.

Note: This form is returned regardless of whether the given query was a vector or a matrix.

search(self: pysvs.Vamana, queries: numpy.ndarray[numpy.int8], n_neighbors: int) -> tuple

Perform a search to return the n_neighbors approximate nearest neighbors to the query.

Parameters:

queries – Numpy Vector or Matrix representing the queries. If the argument is a vector, it will be treated as a single query. If the argument is a matrix, individual queries are assumed to the rows of the matrix. Returned results will have a position-wise correspondence with the queries. That is, the N-th row of the returned IDs and distances will correspond to the N-th row in the query matrix.
n_neighbors – The number of neighbors to return for this search job.

Returns:

A tuple (I, D) where I contains the n_neighbors approximate (or exact) nearest neighbors to the queries and D contains the approximate distances.

Note: This form is returned regardless of whether the given query was a vector or a matrix.

property search_parameters

Get/set the current search parameters for the index. These parameters modify both the algorithmic properties of search (affecting recall) and non-algorthmic properties of search (affecting queries-per-second).

Example

This example will cover the following topics:

Building a pysvs.Vamana index from a dataset.

Performing queries to retrieve neighbors from a pysvs.Vamana.

Saving the index to disk.

Loading a pysvs.Vamana from disk.

Compressing an on-disk dataset and loading/searching with a compressed pysvs.Vamana.

The complete example is included at the end of this file.

Preamble

First, it is assumed that pysvs is installed and available to Python. We first need to perform a couple of imports.

import os
import pysvs

Then, we generate a sample dataset using the pysvs.generate_test_dataset() generation function. This function generates a data file, a query file, and the ground truth.

Note

The pysvs.generate_test_dataset() function generates datasets randomly with no semantic meaning for the elements within it.

Recall values for this dataset are usually lower than for real datasets.

# Create a test dataset.
# This will create a directory "example_data_vamana" and populate it with three
# entries:
# - data.fvecs: The test dataset.
# - queries.fvecs: The test queries.
# - groundtruth.ivecs: The groundtruth.
test_data_dir = "./example_data_vamana"
pysvs.generate_test_dataset(
    10000,                      # Create 10000 vectors in the dataset.
    1000,                       # Generate 1000 query vectors.
    128,                        # Set the vector dimensionality to 128.
    test_data_dir,              # The directory where results will be generated.
    data_seed = 1234,           # Random number seed for reproducibility.
    query_seed = 5678,          # Random number seed for reproducibility.
    num_threads = 4,            # Number of threads to use.
    distance = pysvs.DistanceType.L2,   # The distance type to use.
)

Building and Index

Now that data has been generated, we need to construct an index over that data. The index is a graph connecting related data vectors in such a way that searching for nearest neighbors yields good results. The first step is to construct an instance of pysvs.VamanaBuildParameters to describe the hyper-parameters of the graph we wish to construct. Don’t worry too much about selecting the correct values for these hyper-parameters right now. This usually involves a bit of experimentation and is dataset dependent.

Now that we’ve established our parameters, it is time to construct the index. Note the use of pysvs.VectorDataLoader to indicate both the file path and the element type of the fvecs file on disk. Passing the dims is optional, but may yield performance benefits if given.

# Build the index.
index = pysvs.Vamana.build(
    parameters,
    pysvs.VectorDataLoader(
        os.path.join(test_data_dir, "data.fvecs"), pysvs.DataType.float32
    ),
    pysvs.DistanceType.L2,
    num_threads = 4,
)

Searching the Index

The graph is now built and we can perform queries over the graph. First, we load the queries and the computed ground truth for our example dataset using pysvs.read_vecs().

# Load the queries and ground truth.
queries = pysvs.read_vecs(os.path.join(test_data_dir, "queries.fvecs"))
groundtruth = pysvs.read_vecs(os.path.join(test_data_dir, "groundtruth.ivecs"))

Performing queries is easy. First establish a base-line search window size (pysvs.Vamana.search_window_size). This provides a parameter by which performance and accuracy can be traded. The larger search_window_size is, the higher the accuracy but the lower the performance. Note that search_window_size must be at least as large as the desired number of neighbors.

# Set the search window size of the index and perform queries.
index.search_window_size = 30
I, D = index.search(queries, 10)

# Compare with the groundtruth.
recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
print(f"Recall = {recall}")
assert(recall == 0.8288)

We use pysvs.Vamana.search() to find the 10 approximate nearest neighbors to each query. Then, we use pysvs.k_recall_at() to compute the 10-recall at 10 of the returned neighbors, checking to confirm the accuracy.

The code snippet below demonstrates how to vary the search window size to change the achieved recall.

# Set the search window size of the index and perform queries.
index.search_window_size = 30
I, D = index.search(queries, 10)

# Compare with the groundtruth.
recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
print(f"Recall = {recall}")
assert(recall == 0.8288)

Saving the Index

If you are satisfied with the performance of the generated index, you can save it to disk to avoid rebuilding it in the future. This is done with pysvs.Vamana.save().

# Finally, we can save the results.
index.save(
    os.path.join(test_data_dir, "example_config"),
    os.path.join(test_data_dir, "example_graph"),
    os.path.join(test_data_dir, "example_data"),
)

Note

The pysvs.Vamana index currently uses three files for saving. All three are needed to be able to reload the index.

One file for the graph.
One file for the data (in a different form from fvecs).
One small metadata file.

This is subject to change in the future.

Reloading a Saved Index

To reload the index from file, use pysvs.Vamana.__init__() with a pysvs.VectorDataLoader for the second argument. This informs the dispatch mechanisms that we’re loading an uncompressed data file from disk.

# We can reload an index from a previously saved set of files.
index = pysvs.Vamana(
    os.path.join(test_data_dir, "example_config"),
    pysvs.GraphLoader(os.path.join(test_data_dir, "example_graph")),
    pysvs.VectorDataLoader(
        os.path.join(test_data_dir, "example_data"), pysvs.DataType.float32
    ),
    pysvs.DistanceType.L2,
    num_threads = 4,
)

# We can rerun the queries to ensure everything works properly.
index.search_window_size = 30
I, D = index.search(queries, 10)

# Compare with the groundtruth.
recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
print(f"Recall = {recall}")
assert(recall == 0.8288)

Performing queries is identical to before.

Using Vector Compression

The library has experimental support for performing online vector compression. The second argument to pysvs.Vamana.__init__() can be one of the compressed loaders (Loaders), which will compress an uncompressed dataset on the fly. To see which loaders are applicable to which methods, study the signature in pysvs.vamana.__init__() carefully.

Specifying the loader is all that is required to use vector compression. Note that vector compression is usually accompanied by an accuracy loss for the same search window size and may require increasing the window size to compensate.

    index = pysvs.Vamana(
        os.path.join(test_data_dir, "example_config"),
        pysvs.GraphLoader(os.path.join(test_data_dir, "example_graph")),
        compressed_loader,
        # Optional keyword arguments
        distance = pysvs.DistanceType.L2,
        num_threads = 4
    )

    # Compare with the groundtruth..
    index.search_window_size = 30
    I, D = index.search(queries, 10)
    recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
    print(f"Compressed recall: {recall}")
    assert(recall == 0.8223)

Entire Example

This ends the example demonstrating the features of the pysvs.Vamana index. The entire executable code is shown below. Please reach out with any questions.

# Import `unittest` to allow for automated testing.
import unittest

# [imports]
import os
import pysvs
# [imports]

DEBUG_MODE = False
def assert_equal(lhs, rhs, message: str = ""):
    if DEBUG_MODE:
        print(f"{message}: {lhs} == {rhs}")
    else:
        assert lhs == rhs, message

def run_test_float(index, queries, groundtruth):
    expected = {
        10: 0.5664,
        20: 0.7397,
        30: 0.8288,
        40: 0.8837,
    }

    for window_size in range(10, 50, 10):
        index.search_window_size = window_size
        I, D = index.search(queries, 10)
        recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
        assert_equal(
            recall, expected[window_size], f"Standard Search Check ({window_size})"
        )

def run_test_two_level4_8(index, queries, groundtruth):
    expected = {
        10: 0.5482,
        20: 0.7294,
        30: 0.8223,
        40: 0.8756,
    }

    for window_size in range(10, 50, 10):
        index.search_window_size = window_size
        I, D = index.search(queries, 10)
        recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
        assert_equal(
            recall, expected[window_size], f"Compressed Search Check ({window_size})"
        )

def run_test_build_two_level4_8(index, queries, groundtruth):
    expected = {
        10: 0.5484,
        20: 0.7295,
        30: 0.8221,
        40: 0.8758,
    }

    for window_size in range(10, 50, 10):
        index.search_window_size = window_size
        I, D = index.search(queries, 10)
        recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
        assert_equal(
            recall, expected[window_size], f"Compressed Search Check ({window_size})"
        )

# Shadow this as a global to make it available to the test-case clean-up.
test_data_dir = None

def run():

    # ###
    # Generating test data
    # ###

    # [generate-dataset]
    # Create a test dataset.
    # This will create a directory "example_data_vamana" and populate it with three
    # entries:
    # - data.fvecs: The test dataset.
    # - queries.fvecs: The test queries.
    # - groundtruth.ivecs: The groundtruth.
    test_data_dir = "./example_data_vamana"
    pysvs.generate_test_dataset(
        10000,                      # Create 10000 vectors in the dataset.
        1000,                       # Generate 1000 query vectors.
        128,                        # Set the vector dimensionality to 128.
        test_data_dir,              # The directory where results will be generated.
        data_seed = 1234,           # Random number seed for reproducibility.
        query_seed = 5678,          # Random number seed for reproducibility.
        num_threads = 4,            # Number of threads to use.
        distance = pysvs.DistanceType.L2,   # The distance type to use.
    )
    # [generate-dataset]


    # ###
    # Building the index
    # ###

    # [build-parameters]
    # Now, we can build a graph index over the data set.
    parameters = pysvs.VamanaBuildParameters(
        graph_max_degree = 64,
        window_size = 128,
    )
    # [build-parameters]

    # [build-index]
    # Build the index.
    index = pysvs.Vamana.build(
        parameters,
        pysvs.VectorDataLoader(
            os.path.join(test_data_dir, "data.fvecs"), pysvs.DataType.float32
        ),
        pysvs.DistanceType.L2,
        num_threads = 4,
    )
    # [build-index]

    # [build-index-fromNumpyArray]
    # Build the index.
    data = pysvs.read_vecs(os.path.join(test_data_dir, "data.fvecs"))
    index = pysvs.Vamana.build(
        parameters,
        data,
        pysvs.DistanceType.L2,
        num_threads = 4,
    )
    # [build-index-fromNumpyArray]


    # ###
    # Searching the index
    # ###

    # [load-aux]
    # Load the queries and ground truth.
    queries = pysvs.read_vecs(os.path.join(test_data_dir, "queries.fvecs"))
    groundtruth = pysvs.read_vecs(os.path.join(test_data_dir, "groundtruth.ivecs"))
    # [load-aux]

    # [perform-queries]
    # Set the search window size of the index and perform queries.
    index.search_window_size = 30
    I, D = index.search(queries, 10)

    # Compare with the groundtruth.
    recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
    print(f"Recall = {recall}")
    assert(recall == 0.8288)
    # [perform-queries]

    # [search-window-size]
    # We can vary the search window size to demonstrate the trade off in accuracy.
    for window_size in range(10, 50, 10):
        index.search_window_size = window_size
        I, D = index.search(queries, 10)
        recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
        print(f"Window size = {window_size}, Recall = {recall}")
    # [search-window-size]

    ##### Begin Test
    run_test_float(index, queries, groundtruth)
    ##### End Test


    # ###
    # Saving the index
    # ###

    # [saving-results]
    # Finally, we can save the results.
    index.save(
        os.path.join(test_data_dir, "example_config"),
        os.path.join(test_data_dir, "example_graph"),
        os.path.join(test_data_dir, "example_data"),
    )
    # [saving-results]


    # ###
    # Reloading a saved index
    # ###

    # [loading]
    # We can reload an index from a previously saved set of files.
    index = pysvs.Vamana(
        os.path.join(test_data_dir, "example_config"),
        pysvs.GraphLoader(os.path.join(test_data_dir, "example_graph")),
        pysvs.VectorDataLoader(
            os.path.join(test_data_dir, "example_data"), pysvs.DataType.float32
        ),
        pysvs.DistanceType.L2,
        num_threads = 4,
    )

    # We can rerun the queries to ensure everything works properly.
    index.search_window_size = 30
    I, D = index.search(queries, 10)

    # Compare with the groundtruth.
    recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
    print(f"Recall = {recall}")
    assert(recall == 0.8288)
    # [loading]

    ##### Begin Test
    run_test_float(index, queries, groundtruth)
    ##### End Test

    # [only-loading]
    # We can reload an index from a previously saved set of files.
    index = pysvs.Vamana(
        os.path.join(test_data_dir, "example_config"),
        pysvs.GraphLoader(os.path.join(test_data_dir, "example_graph")),
        pysvs.VectorDataLoader(
            os.path.join(test_data_dir, "example_data"), pysvs.DataType.float32
        ),
        pysvs.DistanceType.L2,
        num_threads = 4,
    )
    # [only-loading]

    # [runtime-nthreads]
    index.num_threads = 4
    # [runtime-nthreads]


    # ###
    # Search using vector compression
    # ###

    # [search-compressed-loader]
    data_loader = pysvs.VectorDataLoader(
        os.path.join(test_data_dir, "example_data"),  # Uncompressed data
        pysvs.DataType.float32,
        dims = 128    # Passing dimensionality is optional
    )
    B1 = 4    # Number of bits for the first level LVQ quantization
    B2 = 8    # Number of bits for the residuals quantization
    padding = 32
    strategy = pysvs.LVQStrategy.Turbo
    compressed_loader = pysvs.LVQLoader(data_loader,
        primary=B1,
        residual=B2,
        strategy=strategy, # Passing the strategy is optional.
        padding=padding # Passing padding is optional.
    )
    # [search-compressed-loader]

    # [search-compressed]
    index = pysvs.Vamana(
        os.path.join(test_data_dir, "example_config"),
        pysvs.GraphLoader(os.path.join(test_data_dir, "example_graph")),
        compressed_loader,
        # Optional keyword arguments
        distance = pysvs.DistanceType.L2,
        num_threads = 4
    )

    # Compare with the groundtruth..
    index.search_window_size = 30
    I, D = index.search(queries, 10)
    recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
    print(f"Compressed recall: {recall}")
    assert(recall == 0.8223)
    # [search-compressed]

    ##### Begin Test
    run_test_two_level4_8(index, queries, groundtruth)
    ##### End Test

    # [build-index-compressed]
    # Build the index.
    index = pysvs.Vamana.build(
        parameters,
        compressed_loader,
        pysvs.DistanceType.L2,
        num_threads = 4
    )
    # [build-index-compressed]

    # 1. Building Uncompressed
    # 2. Loading Uncompressed
    # 3. Loading with a recompressor

    # We can rerun the queries to ensure everything works properly.
    index.search_window_size = 30
    I, D = index.search(queries, 10)

    # Compare with the groundtruth.
    recall = pysvs.k_recall_at(groundtruth, I, 10, 10)
    print(f"Recall = {recall}")
    assert(recall == 0.8221)
    # [loading]

    ##### Begin Test
    run_test_build_two_level4_8(index, queries, groundtruth)
    ##### End Test

#####
##### Main Executable
#####

if __name__ == "__main__":
    run()

#####
##### As a unit test.
#####

class VamanaExampleTestCase(unittest.TestCase):
    def tearDown(self):
        if test_data_dir is not None:
            print(f"Removing temporary directory {test_data_dir}")
            os.rmdir(test_data_dir)

    def test_all(self):
        run()