Git Product home page Git Product logo

unum-cloud / usearch Goto Github PK

View Code? Open in Web Editor NEW
1.6K 21.0 89.0 3.99 MB

Fast Open-Source Search & Clustering engine Γ— for Vectors & πŸ”œ Strings Γ— in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram πŸ”

Home Page: https://unum-cloud.github.io/usearch/

License: Apache License 2.0

CMake 2.25% Python 15.62% Rust 7.02% C++ 54.75% JavaScript 0.31% Java 1.69% Swift 0.98% Objective-C++ 1.01% Objective-C 0.57% Dockerfile 0.04% C 4.02% Go 1.33% Jupyter Notebook 2.67% Batchfile 0.13% C# 5.15% TypeScript 2.45%
kann vector-search approximate-nearest-neighbor-search simd search similarity-search database faiss search-engine webassembly

usearch's Introduction

USearch

Smaller & Faster Single-File
Similarity Search Engine for Vectors & πŸ”œ Texts


Discord Β Β Β  LinkedIn Β Β Β  Twitter Β Β Β  Blog Β Β Β  GitHub

Spatial β€’ Binary β€’ Probabilistic β€’ User-Defined Metrics
C++ 11 β€’ Python 3 β€’ JavaScript β€’ Java β€’ Rust β€’ C 99 β€’ Objective-C β€’ Swift β€’ C# β€’ GoLang β€’ Wolfram
Linux β€’ MacOS β€’ Windows β€’ iOS β€’ WebAssembly β€’ SQLite3


Technical Insights and related articles:

Comparison with FAISS

FAISS is a widely recognized standard for high-performance vector search engines. USearch and FAISS both employ the same HNSW algorithm, but they differ significantly in their design principles. USearch is compact and broadly compatible without sacrificing performance, primarily focusing on user-defined metrics and fewer dependencies.

FAISS USearch Improvement
Indexing time ⁰
100 Million 96d f32, f16, i8 vectors 2.6 Β· 2.6 Β· 2.6 h 0.3 Β· 0.2 Β· 0.2 h 9.6 Β· 10.4 Β· 10.7 x
100 Million 1536d f32, f16, i8 vectors 5.0 Β· 4.1 Β· 3.8 h 2.1 Β· 1.1 Β· 0.8 h 2.3 Β· 3.6 Β· 4.4 x
Codebase length ΒΉ 84 K SLOC 3 K SLOC maintainable
Supported metrics Β² 9 fixed metrics any metric extendible
Supported languages Β³ C++, Python 10 languages portable
Supported ID types ⁴ 32-bit, 64-bit 32-bit, 40-bit, 64-bit efficient
Filtering ⁡ ban-lists any predicates composable
Required dependencies ⁢ BLAS, OpenMP - light-weight
Bindings ⁷ SWIG Native low-latency
Python binding size ⁸ ~ 10 MB < 1 MB deployable

⁰ Tested on Intel Sapphire Rapids, with the simplest inner-product distance, equivalent recall, and memory consumption while also providing far superior search speed. ¹ A shorter codebase of usearch/ over faiss/ makes the project easier to maintain and audit. ² User-defined metrics allow you to customize your search for various applications, from GIS to creating custom metrics for composite embeddings from multiple AI models or hybrid full-text and semantic search. ³ With USearch, you can reuse the same preconstructed index in various programming languages. ⁴ The 40-bit integer allows you to store 4B+ vectors without allocating 8 bytes for every neighbor reference in the proximity graph. ⁡ With USearch the index can be combined with arbitrary external containers, like Bloom filters or third-party databases, to filter out irrelevant keys during index traversal. ⁢ Lack of obligatory dependencies makes USearch much more portable. ⁷ Native bindings introduce lower call latencies than more straightforward approaches. ⁸ Lighter bindings make downloads and deployments faster.

Base functionality is identical to FAISS, and the interface must be familiar if you have ever investigated Approximate Nearest Neighbors search:

$ pip install numpy usearch

import numpy as np
from usearch.index import Index

index = Index(ndim=3)

vector = np.array([0.2, 0.6, 0.4])
index.add(42, vector)

matches = index.search(vector, 10)

assert matches[0].key == 42
assert matches[0].distance <= 0.001
assert np.allclose(index[42], vector, atol=0.1) # Ensure high tolerance in mixed-precision comparisons

More settings are always available, and the API is designed to be as flexible as possible.

index = Index(
    ndim=3, # Define the number of dimensions in input vectors
    metric='cos', # Choose 'l2sq', 'haversine' or other metric, default = 'ip'
    dtype='f32', # Quantize to 'f16' or 'i8' if needed, default = 'f32'
    connectivity=16, # Optional: Limit number of neighbors per graph node
    expansion_add=128, # Optional: Control the recall of indexing
    expansion_search=64, # Optional: Control the quality of the search
    multi=False, # Optional: Allow multiple vectors per key, default = False
)

Serialization & Serving Index from Disk

USearch supports multiple forms of serialization:

  • Into a file defined with a path.
  • Into a stream defined with a callback, serializing or reconstructing incrementally.
  • Into a buffer of fixed length or a memory-mapped file that supports random access.

The latter allows you to serve indexes from external memory, enabling you to optimize your server choices for indexing speed and serving costs. This can result in 20x cost reduction on AWS and other public clouds.

index.save("index.usearch")

loaded_copy = index.load("index.usearch")
view = Index.restore("index.usearch", view=True)

other_view = Index(ndim=..., metric=...)
other_view.view("index.usearch")

Exact vs. Approximate Search

Approximate search methods, such as HNSW, are predominantly used when an exact brute-force search becomes too resource-intensive. This typically occurs when you have millions of entries in a collection. For smaller collections, we offer a more direct approach with the search method.

from usearch.index import search, MetricKind, Matches, BatchMatches
import numpy as np

# Generate 10'000 random vectors with 1024 dimensions
vectors = np.random.rand(10_000, 1024).astype(np.float32)
vector = np.random.rand(1024).astype(np.float32)

one_in_many: Matches = search(vectors, vector, 50, MetricKind.L2sq, exact=True)
many_in_many: BatchMatches = search(vectors, vectors, 50, MetricKind.L2sq, exact=True)

If you pass the exact=True argument, the system bypasses indexing altogether and performs a brute-force search through the entire dataset using SIMD-optimized similarity metrics from SimSIMD. When compared to FAISS's IndexFlatL2 in Google Colab, USearch may offer up to a 20x performance improvement:

  • faiss.IndexFlatL2: 55.3 ms.
  • usearch.index.search: 2.54 ms.

User-Defined Metrics

While most vector search packages concentrate on just two metrics, "Inner Product distance" and "Euclidean distance", USearch allows arbitrary user-defined metrics. This flexibility allows you to customize your search for various applications, from computing geospatial coordinates with the rare Haversine distance to creating custom metrics for composite embeddings from multiple AI models, like joint image-text embeddings. You can use Numba, Cppyy, or PeachPy to define your custom metric even in Python:

from numba import cfunc, types, carray
from usearch.index import Index, MetricKind, MetricSignature, CompiledMetric

@cfunc(types.float32(types.CPointer(types.float32), types.CPointer(types.float32)))
def python_inner_product(a, b):
    a_array = carray(a, ndim)
    b_array = carray(b, ndim)
    c = 0.0
    for i in range(ndim):
        c += a_array[i] * b_array[i]
    return 1 - c

metric = CompiledMetric(pointer=python_inner_product.address, kind=MetricKind.IP, signature=MetricSignature.ArrayArray)
index = Index(ndim=ndim, metric=metric, dtype=np.float32)

Similar effect is even easier to achieve in C, C++, and Rust interfaces. Moreover, unlike older approaches indexing high-dimensional spaces, like KD-Trees and Locality Sensitive Hashing, HNSW doesn't require vectors to be identical in length. They only have to be comparable. So you can apply it in obscure applications, like searching for similar sets or fuzzy text matching, using GZip compression-ratio as a distance function.

Filtering and Predicate Functions

Sometimes you may want to cross-reference search-results against some external database or filter them based on some criteria. In most engines, you'd have to manually perform paging requests, successively filtering the results. In USearch you can simply pass a predicate function to the search method, which will be applied directly during graph traversal. In Rust that would look like this:

let is_odd = |key: Key| key % 2 == 1;
let query = vec![0.2, 0.1, 0.2, 0.1, 0.3];
let results = index.filtered_search(&query, 10, is_odd).unwrap();
assert!(
    results.keys.iter().all(|&key| key % 2 == 1),
    "All keys must be odd"
);

Memory Efficiency, Downcasting, and Quantization

Training a quantization model and dimension-reduction is a common approach to accelerate vector search. Those, however, are only sometimes reliable, can significantly affect the statistical properties of your data, and require regular adjustments if your distribution shifts. Instead, we have focused on high-precision arithmetic over low-precision downcasted vectors. The same index, and add and search operations will automatically down-cast or up-cast between f64_t, f32_t, f16_t, i8_t, and single-bit representations. You can use the following command to check, if hardware acceleration is enabled:

$ python -c 'from usearch.index import Index; print(Index(ndim=768, metric="cos", dtype="f16").hardware_acceleration)'
> sapphire
$ python -c 'from usearch.index import Index; print(Index(ndim=166, metric="tanimoto").hardware_acceleration)'
> ice

Using smaller numeric types will save you RAM needed to store the vectors, but you can also compress the neighbors lists forming our proximity graphs. By default, 32-bit uint32_t is used to enumerate those, which is not enough if you need to address over 4 Billion entries. For such cases we provide a custom uint40_t type, that will still be 37.5% more space-efficient than the commonly used 8-byte integers, and will scale up to 1 Trillion entries.

USearch uint40_t support

Indexes for Multi-Index Lookups

For larger workloads targeting billions or even trillions of vectors, parallel multi-index lookups become invaluable. Instead of constructing one extensive index, you can build multiple smaller ones and view them together.

from usearch.index import Indexes

multi_index = Indexes(
    indexes: Iterable[usearch.index.Index] = [...],
    paths: Iterable[os.PathLike] = [...],
    view: bool = False,
    threads: int = 0,
)
multi_index.search(...)

Clustering

Once the index is constructed, USearch can perform K-Nearest Neighbors Clustering much faster than standalone clustering libraries, like SciPy, UMap, and tSNE. Same for dimensionality reduction with PCA. Essentially, the Index itself can be seen as a clustering, allowing iterative deepening.

clustering = index.cluster(
    min_count=10, # Optional
    max_count=15, # Optional
    threads=..., # Optional
)

# Get the clusters and their sizes
centroid_keys, sizes = clustering.centroids_popularity

# Use Matplotlib to draw a histogram
clustering.plot_centroids_popularity()

# Export a NetworkX graph of the clusters
g = clustering.network

# Get members of a specific cluster
first_members = clustering.members_of(centroid_keys[0])

# Deepen into that cluster, splitting it into more parts, all the same arguments supported
sub_clustering = clustering.subcluster(min_count=..., max_count=...)

The resulting clustering isn't identical to K-Means or other conventional approaches but serves the same purpose. Alternatively, using Scikit-Learn on a 1 Million point dataset, one may expect queries to take anywhere from minutes to hours, depending on the number of clusters you want to highlight. For 50'000 clusters, the performance difference between USearch and conventional clustering methods may easily reach 100x.

Joins, One-to-One, One-to-Many, and Many-to-Many Mappings

One of the big questions these days is how AI will change the world of databases and data management. Most databases are still struggling to implement high-quality fuzzy search, and the only kind of joins they know are deterministic. A join differs from searching for every entry, requiring a one-to-one mapping banning collisions among separate search results.

Exact Search Fuzzy Search Semantic Search ?
Exact Join Fuzzy Join ? Semantic Join ??

Using USearch, one can implement sub-quadratic complexity approximate, fuzzy, and semantic joins. This can be useful in any fuzzy-matching tasks common to Database Management Software.

men = Index(...)
women = Index(...)
pairs: dict = men.join(women, max_proposals=0, exact=False)

Read more in the post: Combinatorial Stable Marriages for Semantic Search πŸ’

Functionality

By now, the core functionality is supported across all bindings. Broader functionality is ported per request. In some cases, like Batch operations, feature parity is meaningless, as the host language has full multi-threading capabilities and the USearch index structure is concurrent by design, so the users can implement batching/scheduling/load-balancing in the most optimal way for their applications.

C++ 11 Python 3 C 99 Java JavaScript Rust GoLang Swift
Add, search, remove βœ… βœ… βœ… βœ… βœ… βœ… βœ… βœ…
Save, load, view βœ… βœ… βœ… βœ… βœ… βœ… βœ… βœ…
User-defined metrics βœ… βœ… βœ… ❌ ❌ ❌ ❌ ❌
Batch operations ❌ βœ… ❌ ❌ βœ… ❌ ❌ ❌
Filter predicates βœ… ❌ βœ… ❌ ❌ βœ… ❌ ❌
Joins βœ… βœ… ❌ ❌ ❌ ❌ ❌ ❌
Variable-length vectors βœ… ❌ ❌ ❌ ❌ ❌ ❌ ❌
4B+ capacities βœ… ❌ ❌ ❌ ❌ ❌ ❌ ❌

Application Examples

USearch + AI = Multi-Modal Semantic Search

USearch Semantic Image Search

AI has a growing number of applications, but one of the coolest classic ideas is to use it for Semantic Search. One can take an encoder model, like the multi-modal UForm, and a web-programming framework, like UCall, and build a text-to-image search platform in just 20 lines of Python.

import ucall
import uform
import usearch

import numpy as np
import PIL as pil

server = ucall.Server()
model = uform.get_model('unum-cloud/uform-vl-multilingual')
index = usearch.index.Index(ndim=256)

@server
def add(key: int, photo: pil.Image.Image):
    image = model.preprocess_image(photo)
    vector = model.encode_image(image).detach().numpy()
    index.add(key, vector.flatten(), copy=True)

@server
def search(query: str) -> np.ndarray:
    tokens = model.preprocess_text(query)
    vector = model.encode_text(tokens).detach().numpy()
    matches = index.search(vector.flatten(), 3)
    return matches.keys

server.run()

A more complete demo with Streamlit is available on GitHub. We have pre-processed some commonly used datasets, cleaned the images, produced the vectors, and pre-built the index.

Dataset Modalities Images Download
Unsplash Images & Descriptions 25 K HuggingFace / Unum
Conceptual Captions Images & Descriptions 3 M HuggingFace / Unum
Arxiv Titles & Abstracts 2 M HuggingFace / Unum

USearch + RDKit = Molecular Search

Comparing molecule graphs and searching for similar structures is expensive and slow. It can be seen as a special case of the NP-Complete Subgraph Isomorphism problem. Luckily, domain-specific approximate methods exist. The one commonly used in Chemistry is to generate structures from SMILES and later hash them into binary fingerprints. The latter are searchable with binary similarity metrics, like the Tanimoto coefficient. Below is an example using the RDKit package.

from usearch.index import Index, MetricKind
from rdkit import Chem
from rdkit.Chem import AllChem

import numpy as np

molecules = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO')]
encoder = AllChem.GetRDKitFPGenerator()

fingerprints = np.vstack([encoder.GetFingerprint(x) for x in molecules])
fingerprints = np.packbits(fingerprints, axis=1)

index = Index(ndim=2048, metric=MetricKind.Tanimoto)
keys = np.arange(len(molecules))

index.add(keys, fingerprints)
matches = index.search(fingerprints, 10)

That method was used to build the "USearch Molecules", one of the largest Chem-Informatics datasets, containing 7 billion small molecules and 28 billion fingerprints.

USearch + POI Coordinates = GIS Applications... on iOS?

USearch Maps with SwiftUI

With Objective-C and Swift iOS bindings, USearch can be easily used in mobile applications. The SwiftVectorSearch project illustrates how to build a dynamic, real-time search system on iOS. In this example, we use 2-dimensional vectorsβ€”encoded as latitude and longitudeβ€”to find the closest Points of Interest (POIs) on a map. The search is based on the Haversine distance metric but can easily be extended to support high-dimensional vectors.

Integrations

Citations

@software{Vardanian_USearch_2023,
doi = {10.5281/zenodo.7949416},
author = {Vardanian, Ash},
title = {{USearch by Unum Cloud}},
url = {https://github.com/unum-cloud/usearch},
version = {2.11.7},
year = {2023},
month = oct,
}

usearch's People

Contributors

al13n321 avatar alexbarev avatar arman-ghazaryan avatar ashvardanian avatar darvinharutyunyan avatar davvard avatar dluc avatar dranger003 avatar gurgenyegoryan avatar intitni avatar ishkhan42 avatar jlarmstrongiv avatar kimihailv avatar laitassou avatar mbautin avatar mgevor avatar nairihar avatar ngalstyan4 avatar omkar-ranadive avatar plurch avatar ruitard avatar sandorkonya avatar semantic-release-bot avatar simonw avatar sroussey avatar superkelvint avatar vaheandonians avatar var77 avatar vovor avatar weedge avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

usearch's Issues

Bug: MultiIndex returns strange search results

Describe the bug

MultiIndex (even from one index) gives wrong nearest neighbours.

Steps to reproduce

import numpy as np
from usearch.index import Indexes, Index

np.random.seed(42)
data = np.random.random((10_000, 256))

index = Index(ndim=256)
index.add(np.arange(data.shape[0]), data)

gt = np.arange(1000)[:, np.newaxis]
print('Single index recall@1:', (index.search(data[:1000], 1).keys == gt).sum(axis=1).mean())

multi = Indexes(
    indexes=[index]
)

print('Multi index recall@1:', (multi.search(data[:1000], 1).keys == gt).sum(axis=1).mean())

Output:

Single index recall@1: 0.927
Multi index recall@1: 0.001

Expected behavior

Similar performance

USearch version

2.5.1

Operating System

Ubuntu 23.04

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Python runtime crashes after calling `Index.restore()`

Describe the bug

Loading empty index causes python runtime terminated.

Steps to reproduce

from pathlib import Path
from usearch.index import Index
import numpy as np

# ensure there is no file prior to test
index_path = Path('index.usearch')
if index_path.exists():
    index_path.unlink()

index = Index(ndim=3)
index.save(index_path)
index.reset()
assert index_path.exists()

# Here python interpreter just exits
# terminate called after throwing an instance of 'std::runtime_error'
# what():  Resource temporarily unavailable
try:
    index = Index.restore(str(index_path)) 
except Exception:
    pass

Expected behavior

Python runtime exits without errors. Or an exception is thrown that Python code could catch.

USearch version

v1.1.1

Operating System

Ubuntu 22.04.2

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: JavaScript Browser Documentation

Describe the bug

No documentation can be found for using usearch in the browser.

Wasmer does not have relevant documentation either.

See Discord discussion

Steps to reproduce

Try to import usearch in the browser

Expected behavior

More details or an example is provided on how to import and setup usearch in the browser

USearch version

v1.1.0

Operating System

Browser

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Invalid: C must work with synthetic data

Describe the bug

All tests across all languages don't depend on any external content. The same should be done for C to simplify its usage.

Steps to reproduce

Expected behavior

USearch version

v1.1.1

Operating System

Ubuntu 22.04

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Invalid: Update version across all files

Describe the bug

The .github/workflows/update_version.sh should be patched to update README.md and USEARCH_VERSION_MAJOR, USEARCH_VERSION_MINOR, USEARCH_VERSION_PATCH the main header file.

Steps to reproduce

Run the .github/workflows/update_version.sh and check if all the files have been changed.

Expected behavior

All files must be refreshed.

USearch version

v0.14.0

Operating System

Ubuntu 22.04

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Multiple Allocators

Describe what you are looking for

As a performance optimization, we can have a separate allocator for graph nodes that will continuously grow linked arenas and only shrink once all the entries have to be removed. That would require changing the core template class to have one more allocator type.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

C++ implementation

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Java lib cannot be found on maven

Describe the bug

cloud.unum.usearch:0.2.3 this failed to resolve.

Steps to reproduce

cloud.unum.usearch:0.2.3 this failed to resolve.

Expected behavior

cloud.unum.usearch:0.2.3 should succeed.

USearch version

0.2.3

Operating System

android

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: parsing more complex tensor's scalars types

Describe what you are looking for

Currently, usearch can parse only exact dtypes (f8, f32 and etc). But tensors can have more complex description of scalars' type. For instance, numpy __array_interface__ property can return something like this:

{'data': (140283485917200, False), 'strides': None, 'descr': [('', '<f4')], 'typestr': '<f4', 'shape': (1183514, 100), 'version': 3}

So, it'd be useful If usearch supported dtypes like this.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Standardize the web-interface for Docker container

Describe what you are looking for

We currently have a Docker image with a few undocumented features. Those are highly dependent on UCall. To make the interface more generic, we need to add support for array arguments passed from the command line, or a generic JSON-RPC call, without necessarily serializing it in a NumPy fashion.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

Other

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Question/Feature: Does "view" allow adding new items directly to disk?

Describe what you are looking for

AFAIK, "view" allows us to memory map the index to disk. This way, we can load an index that doesn't fit into RAM.

I was just wondering if that will also work for adding items to an index.

If so, what is the process?

  1. Instantiate the index
  2. index.save()
  3. index.view() from file
  4. index.add()

If that does work, is it necessary to call index.save() again at any point, or will each index.add() operation directly write to disk?

If memory mapping does not work for adding items, then we will always need a machine with enough RAM to hold the entire index at least for the creation of that index, or for any adding operation. Is that correct?

Thanks.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: `usearch_get` returns wrong value in C99 interface

Describe the bug

The function usearch_get consistently returns 1, regardless of the positive number of vectors actually copied to the buffer. I observed that in C# and then went down to C99 implementation.

I am using the newest main-dev branch.

Steps to reproduce

That function assumed to be inserted into c/test.c and called directly from main()

void test_get_multiple_vectors_from_one_key() {
    // initialization
    usearch_error_t error = NULL;
    size_t vector_dimension = 2;
    usearch_init_options_t opts = create_options(vector_dimension);
    opts.multi = true;
    usearch_index_t idx = usearch_init(&opts, &error);
    ASSERT(!error, error);

    // adding vectors
    usearch_reserve(idx, 4, &error);
    ASSERT(!error, error);
    float vector[] = {1, 2, 3, 4, 5, 6, 7, 8};
    usearch_key_t key = 1;
    for (size_t i = 0; i < 4; ++i) {
        usearch_add(idx, key, &vector[i * vector_dimension], usearch_scalar_f32_k, &error);
        ASSERT(!error, error);
    }

    // retrieve vectors from index
    float* vector_buffer = (float*)malloc(10 * sizeof(float));
    size_t count = usearch_get(idx, key, 4, vector_buffer, usearch_scalar_f32_k, &error);

    // 1 will be printed
    printf("count: %d\n", (int)count);

    // Though this prints 7, 8, 5, 6, 3, 4, 1, 2
    for (int i = 0; i < 8; ++i)
        printf("%.f ", vector_buffer[i]);

    usearch_free(idx, &error);
    ASSERT(!error, error);
    free(vector_buffer);
}

Expected behavior

usearch_get() return the correct number of vectors copied to the buffer.

USearch version

v1.1.0

Operating System

Ubuntu 22.04

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Invalid: Memory alignment

Describe the bug

There is a data alignment issue, that we have accounted for ahead of time, but haven't implemented the solution yet.
The index_config_t::vector_alignment controls, how to store vectors of add-ed elements in allocated memory.
Some platforms won't have an issue loading an 8-byte double, with an address that is not a multiple of 8.
Others will have that issue.
So the index_gt class has to be patched.

Steps to reproduce

Run the C++ unit test (available in VS code launchers) with -fsanitize=alignment address sanitizer flag set.

Expected behavior

No errors or warnings.

USearch version

v0.22.3

Operating System

Any

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: "Upload docs to release" action does not work correctly

Describe the bug

"Upload docs to release" action does not work correctly. Files are not uploaded into the latest release.

Steps to reproduce

Run an action on GitHub named "Release"

Expected behavior

Uploaded docs into the latest release

USearch version

v0.10.0

Operating System

Any

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: k parameter seem to restrict search, not output

Describe the bug

The k parameter of the search method seems to restrict not how many of the top matching items are returned, but how many of the items in the index are being searched.

This means that a full similarity search is not being performed unless k >= len(index).

I'm observing this behavior using the Python bindings. I have not checked the other implementations.

Steps to reproduce

Given this index definition:

    index = usearch.Index(
        dim=1024,  # Define the number of dimensions in input vectors
        metric="cos",  # Choose the "metric" or "distance", default = 'ip', optional
        dtype=dtype_string,  # Quantize to 'f16' or 'i8q100' if needed, default = 'f32', optional
        connectivity=16,  # How frequent should the connections in the graph be, optional
        expansion_add=200,  # Control the recall of indexing, optional
        expansion_search=5000,  # Control the quality of search, optional
    )

And given a corpus of 100 documents, where the document with label 32 is the best match:

for k in [200, 3]:
    matches, distances, count = index.search(query_embedding_matrix, k)

    l_dicts_matches_distances = [{"abs_id": m, "similarity": d} for m, d in zip(matches[0], distances[0])]
    bestdoc = sorted(l_dicts_matches_distances, key=lambda x: x["similarity"], reverse=True)[0]
    found = bestdoc["abs_id"] == 32
    
    print(f"{k=} {'βœ…' if found else '❌'}")

The best document is found only if k >= len(corpus):

k=200 βœ…
k=3 ❌

Expected behavior

If I run a search with k=1, I expect that...

  1. the full index will still be searched but
  2. only the best match will be returned

USearch version

0.1.9

Operating System

Ubuntu 20.04.5 LTS

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Removing vectors

Describe what you are looking for

Removing efficiently isn’t trivial, but can be a minimal base solution to adapt USearch to caching tasks, and integrate into Semantic Kernel, LangChain, etc.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Wolfram CI integration

Describe what you are looking for

We want the Paclet to pass tests in GitHub CI and get uploaded to the Marketplace. Docs

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Progress tracking and conditional termination from Python

Describe what you are looking for

Most long-running tasks in the C++ layer have an optional progress callback argument that can report the progress and potentially terminate the job somewhere in the middle. Python and other language bindings don't currently use that feature, externally slicing the np.ndarray of tasks and updating tqdm indicator after every batch.

That, however, does not apply to cluster, join, save, and load. So we need to pass some functions from Python to C++ bindings, that will be invoked only from one thread to avoid GIL issues.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Seg Fault on index.add in Rust

Describe the bug

I receive a segmentation fault when trying to add to an index in Rust, the error can be a direct segmentation fault or a "process didn't exit successfully", even when trying code from the docs. Is there something I am missing?

Steps to reproduce

Create IndexOptions, create index, create a vector slice, and pass it into index.add() alongside a key

Expected behavior

Index should add with no issues and be able to be searchable

USearch version

1.1.1

Operating System

Windows, Arch Linux

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Index only adds 12 items at a time (Python)

Hi Ash,

great library!

I'm having some trouble:

My indices contain fewer items than I add.

In fact, I'm having this issue even when running the sample Python code from your README.md.

import numpy as np
import usearch

index = usearch.Index(
    dim=256, # Define the number of dimensions in input vectors
    metric='cos', # Choose the "metric" or "distance", default = 'ip', optional
    dtype='f16', # Quantize to 'f16' or 'i8q100' if needed, default = 'f32', optional
    connectivity=16, # How frequent should the connections in the graph be, optional
    expansion_add=128, # Control the recall of indexing, optional
    expansion_search=64, # Control the quality of search, optional
)

n = 100
labels = np.array(range(n), dtype=np.longlong)
vectors = np.random.uniform(0, 0.3, (n, index.ndim)).astype(np.float32)

# check for duplicate vectors
vectors_deduped = np.unique(vectors, axis=0)
assert len(vectors_deduped) == len(vectors), "Duplicate vectors found"

# ensure we have n labels and vectors
assert len(vectors) == len(labels) == n, "Number of vectors and labels must match"

# You can avoid copying the data
# Handy when build 1B+ indexes of memory-mapped files
index.add(labels, vectors, copy=True)
assert len(index) == n, f"Index should contain {n} items, but contains {len(index)}"

Output:

`AssertionError: Index should contain 100 items, but contains 12``

Curiously:

  • I'm getting 12 items, no matter whether I add 100 items or 1_000. If I add 12 or fewer, the assertion passes. As soon as I pass 13, only 12 are added.
  • If I re-run the index.add() line, the index size increments by 12 every time.
  • I tried initializing the index with capacity=100, but that also didn't change things.
  • The parameter copy=True also apparently doesn't influence this behavior. I tried it both True and False.

My questions:

  1. What am I doing wrong?

  2. What does copy=True actually do? I didn't find any further documentation. It allows the index to just reference the label/vector data as long as it's stored in, say, an numpy array? Also, does copy=True mean "Yes, copy the data", or does it mean "Avoid copying the data"? Since you're explicity passing it when stating "You can avoid copying the data", it's a bit ambiguous.

Thank you!

Bug: RISC-V compilation failure

Describe the bug

As reported by @rshu1ze and the ClickHouse team, there is a compilation issue on RISC-V hardware.

Steps to reproduce

We need a compilation pipeline on GitHub CI that would compile USearch for RISC-V.

Expected behavior

Passing builds.

USearch version

v1.1.1

Operating System

Any

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Molecular Search (Bitwise Tanimoto) probably doesn't work well.

Describe the bug

I have tried the tutorial with bitwise tanimoto (by the way it should be MetricKind.Tanimoto, MetricKind.BitwiseTanimoto throws import error)

from usearch.index import Index, MetricKind
from rdkit import Chem
from rdkit.Chem import AllChem

import numpy as np

molecules = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO')]
encoder = AllChem.GetRDKitFPGenerator()

fingerprints = np.vstack([encoder.GetFingerprint(x) for x in molecules])
fingerprints = np.packbits(fingerprints, axis=1)

index = Index(ndim=2048, metric=MetricKind.BitwiseTanimoto)
labels = np.arange(len(molecules))

index.add(labels, fingerprints)
matches: Matches = index.search(fingerprints, 10)

The problem is, I tried it in practice (adding this way cca 100K molecules from chembl), and while the search speed is good, ithe results are nonsensical.

Example query:
Screenshot 2023-07-18 at 22 01 20

Example output:
Screenshot 2023-07-18 at 22 01 37

i haven't had the time to dive into details what might be wrong
another observation

Steps to reproduce

Take any dataset of molecules, I used chembl_31, add the molecule into index as per tutorial and display the results.

Expected behavior

Expected output (ground truth best tanimoto neighbors on the dataset):
Screenshot 2023-07-18 at 22 04 18

USearch version

latest

Operating System

Ubuntu

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Compilation issue with a defined __AVX512__ without _Float16 support

Describe the bug

While working on the lanterndb project, we discovered a bug that made compilation fail. Compilation failed for my machine since it had __AVX512F__ defined, but environment did not support _Float16. See code below:

#if !defined(USEARCH_USE_NATIVE_F16)
#if defined(__AVX512F__)
#define USEARCH_USE_NATIVE_F16 1
#elif defined(USEARCH_DEFINED_ARM)
#define USEARCH_USE_NATIVE_F16 1
#else
#define USEARCH_USE_NATIVE_F16 0
#include <fp16/fp16.h>
#endif
#else
#define USEARCH_USE_NATIVE_F16 0
#include <fp16/fp16.h>
#endif

There is another underlying issue in the else condition of the same code, since the code throws away the environment's choice, even if USEARCH_USE_NATIVE_F16 is pre-defined,

A better choice is to

  • Directly check the thing we require directly i.e. instead of checking if __AVX512__ is defined, we check if _Float16 is defined.
  • Handle the user choice in the else condition as well

We have filed the bug internally here:
lanterndata/lantern#59

Attempted to solve it here:
Ngalstyan4#6

Steps to reproduce

  1. Build the lanterndb project (which internally builds usearch) following QuickStart instructions.
  2. On the make install step, I hit the error message at 7%: error: _Float16 does not name a type

Expected behavior

Build should be successful and continue to 100%

USearch version

v0.22.1

Operating System

Ubuntu 22.04

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: C# bindings and Semantic Kernel integration

Describe what you are looking for

With most of the common programming languages covered, the only major one we don't have bindings for is C#.
Having that will open doors to integrate USearch into Semantic Kernel C# backend, as well as a number of Unity applications.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Merging Small Indexes

Describe what you are looking for

As a performance optimization, we can implement merge for existing indexes instead of moving (inserting) elements one by one.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Index.save method raise TypeError on `Path` argument

Describe the bug

Index.save method has following declaration: save(path: Optional[PathLike] = None). From python docs Path implements PathLike protocol. So I expected that passing Path object is fine, but I got an error.

TypeError: save(): incompatible function arguments. The following argument types are supported:
    1. (self: usearch.compiled.Index, path: str) -> None
Invoked with: <usearch.compiled.Index object at 0x7f5955d810b0>, PosixPath('index.usearch')

I recommend to use either str type annotation or add support for Path objects: e.g. just calling str(path).

Steps to reproduce

import numpy as np
from pathlib import Path
from usearch.index import Index

index = Index(ndim=3)
vector = np.array([0.2, 0.6, 0.4])
index.add(42, vector)
index.save(Path("index.usearch"))

Expected behavior

File "index.usearch" created.

USearch version

0.22.3

Operating System

Ubuntu 20.04.6

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: index.hpp leaks memory

Describe the bug

the simple c++ db (index.hpp) leaks. I ran it with asan, and got the following output:

=================================================================
==884330==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 1024 byte(s) in 1 object(s) allocated from:
    #0 0x7f8e43669587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x559417f9d685 in unum::usearch::index_gt<unum::usearch::cos_gt<float, float>, unsigned long, unsigned int, std::allocator<char>, std::allocator<char> >::search_to_find_in_base_(unsigned int, unum::usearch::span_gt<float const>, unsigned long, unum::usearch::index_gt<unum::usearch::cos_gt<float, float>, unsigned long, unsigned int, std::allocator<char>, std::allocator<char> >::context_t&) const (/home/green/workspace/clip.cpp/build/bin/image-search+0x46685)
    #2 0x559417fa1883 in unum::usearch::index_gt<unum::usearch::cos_gt<float, float>, unsigned long, unsigned int, std::allocator<char>, std::allocator<char> >::search(unum::usearch::span_gt<float const>, unsigned long, unum::usearch::search_config_t) const (/home/green/workspace/clip.cpp/build/bin/image-search+0x4a883)
    #3 0x559417fa4053 in main (/home/green/workspace/clip.cpp/build/bin/image-search+0x4d053)
    #4 0x7f8e4301d082 in __libc_start_main ../csu/libc-start.c:308

Direct leak of 960 byte(s) in 24 object(s) allocated from:
    #0 0x7f8e43669587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x559417f99b1e in unum::usearch::index_gt<unum::usearch::cos_gt<float, float>, unsigned long, unsigned int, std::allocator<char>, std::allocator<char> >::view(char const*) (/home/green/workspace/clip.cpp/build/bin/image-search+0x42b1e)
    #2 0x559417fa3820 in main (/home/green/workspace/clip.cpp/build/bin/image-search+0x4c820)
    #3 0x7f8e4301d082 in __libc_start_main ../csu/libc-start.c:308

Direct leak of 512 byte(s) in 1 object(s) allocated from:
    #0 0x7f8e43669587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x559417fa0099 in unum::usearch::index_gt<unum::usearch::cos_gt<float, float>, unsigned long, unsigned int, std::allocator<char>, std::allocator<char> >::search(unum::usearch::span_gt<float const>, unsigned long, unum::usearch::search_config_t) const (/home/green/workspace/clip.cpp/build/bin/image-search+0x49099)
    #2 0x559417fa4053 in main (/home/green/workspace/clip.cpp/build/bin/image-search+0x4d053)
    #3 0x7f8e4301d082 in __libc_start_main ../csu/libc-start.c:308

SUMMARY: AddressSanitizer: 2496 byte(s) leaked in 26 allocation(s).

Steps to reproduce

checkout this pr monatis/clip.cpp#5 (or main if merged)

Expected behavior

no leakage :)

USearch version

v0.19.3

Operating System

ubuntu 20.04

Hardware architecture

x86

Which interface are you using?

C++ implementation

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Crash when adding a labeled vector, using the Swift binding

Describe the bug

I modified the test in the package to run in an XCTestCase, and it crashes when adding a labeled vector at the line index.add(label: 42, vector: vectorA[...])

The package crashes at index.hpp line 889 with Thread 1: EXC_BAD_ACCESS (code=1, address=0x488)

Thread 1 Queue : com.apple.main-thread (serial)
#0	0x00000001183a6d80 in unum::usearch::sorted_buffer_gt<unum::usearch::index_gt<unum::usearch::index_punned_dense_metric_t, unsigned int, unsigned int, unum::usearch::aligned_allocator_gt<char, 64ul>, unum::usearch::memory_mapping_allocator_gt<1ul> >::candidate_t, unum::usearch::index_gt<unum::usearch::index_punned_dense_metric_t, unsigned int, unsigned int, unum::usearch::aligned_allocator_gt<char, 64ul>, unum::usearch::memory_mapping_allocator_gt<1ul> >::compare_by_distance_t, unum::usearch::aligned_allocator_gt<unum::usearch::index_gt<unum::usearch::index_punned_dense_metric_t, unsigned int, unsigned int, unum::usearch::aligned_allocator_gt<char, 64ul>, unum::usearch::memory_mapping_allocator_gt<1ul> >::candidate_t, 64ul> >::clear() at /Users/x/Developer/Projects/usearch/include/usearch/index.hpp:889
#1	0x00000001183a6678 in unum::usearch::index_gt<unum::usearch::index_punned_dense_metric_t, unsigned int, unsigned int, unum::usearch::aligned_allocator_gt<char, 64ul>, unum::usearch::memory_mapping_allocator_gt<1ul> >::add(unsigned int, unum::usearch::span_gt<char const>, unum::usearch::add_config_t) at /Users/x/Developer/Projects/usearch/include/usearch/index.hpp:1783
#2	0x00000001183a63bc in unum::usearch::index_gt<unum::usearch::index_punned_dense_metric_t, unsigned int, unsigned int, unum::usearch::aligned_allocator_gt<char, 64ul>, unum::usearch::memory_mapping_allocator_gt<1ul> >::add_result_t unum::usearch::index_punned_dense_gt<unsigned int, unsigned int>::add_<float>(unsigned int, float const*, unum::usearch::add_config_t, std::__1::function<bool (char const*, unsigned long, char*)> const&) at /Users/x/Developer/Projects/usearch/include/usearch/index_punned_dense.hpp:373
#3	0x00000001183a61c8 in unum::usearch::index_gt<unum::usearch::index_punned_dense_metric_t, unsigned int, unsigned int, unum::usearch::aligned_allocator_gt<char, 64ul>, unum::usearch::memory_mapping_allocator_gt<1ul> >::add_result_t unum::usearch::index_punned_dense_gt<unsigned int, unsigned int>::add_<float>(unsigned int, float const*, std::__1::function<bool (char const*, unsigned long, char*)> const&) at /Users/x/Developer/Projects/usearch/include/usearch/index_punned_dense.hpp:450
#4	0x000000011834aabc in unum::usearch::index_punned_dense_gt<unsigned int, unsigned int>::add(unsigned int, float const*) at /Users/x/Developer/Projects/usearch/include/usearch/index_punned_dense.hpp:258
#5	0x000000011834a984 in -[USearchIndex addSingle:vector:] at /Users/x/Developer/Projects/usearch/objc/USearchObjective.mm:126
#6	0x00000001183b0e50 in closure #1 in USearchIndex.add(label:vector:) at /Users/x/Developer/Projects/usearch/swift/Index+Sugar.swift:16
#7	0x0000000107bb0e7c in partial apply for closure #1 in USearchIndex.add(label:vector:) ()
#8	0x000000019d3c3d9c in ArraySlice.withContiguousStorageIfAvailable<Ο„_0_0>(_:) ()
#9	0x0000000107bb0d18 in USearchIndex.add(label:vector:) at /Users/x/Developer/Projects/usearch/swift/Index+Sugar.swift:15
#10	0x0000000107b49858 in Test.testUnit() at /Users/x/Developer/Projects/usearch/swift/Test.swift:18

I checked the CI output, actually, the test was not running.

warning: 'usearch': found 1 file(s) which are unhandled; explicitly declare them as resources or exclude from the target
    /Users/runner/work/usearch/usearch/swift/Test.swift

[0/1] Planning build
Building for debugging...
[1/3] Emitting module USearchTests
[2/3] Compiling USearchTests Test.swift
[2/3] Linking USearchPackageTests
Build complete! (7.68s)
Test Suite 'All tests' started at 2023-06-27 10:08:35.166
Test Suite 'All tests' passed at 2023-06-27 10:08:35.168.
	 Executed 0 tests, with 0 failures (0 unexpected) in 0.000 (0.002) seconds

Steps to reproduce

  1. wrap the testUnit() in swift/Test.swift in class Test: XCTestCase {}.
  2. run the test

Expected behavior

No crash.

USearch version

v0.19.0

Operating System

macOS 13.4 (22F66)

Hardware architecture

Arm

Which interface are you using?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Go library build error

Describe the bug

Running the example I get this error:

# github.com/unum-cloud/usearch/golang
../go/pkg/mod/github.com/unum-cloud/usearch/[email protected]/lib.go:10:10: fatal error: '../c/usearch.h' file not found
#include "../c/usearch.h"
         ^~~~~~~~~~~~~~~~
1 error generated.

Looking through the compiled package I see that this path doesn't exist.

Reference: https://github.com/unum-cloud/usearch/tree/main/golang

Steps to reproduce

  • create a Go project
  • go get https://github.com/unum-cloud/usearch
  • try to run the example go run .

Expected behavior

library runs without issue

USearch version

v0.0.0-20230720201502-ff268f175932

Operating System

Mac OS Ventura

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Get labels and vectors from index

Describe what you are looking for

I'd like to be able to access all labels stored inside an index. I'd also like to be able to access vectors from the index by id.

hnswlib has these methods:

  • index.get_items(ids)
  • get_ids_list()

I find them useful, and am missing them in usearch.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Parallel `view`

Describe what you are looking for

Working with large datasets, it's a pity when your 200+ vCPU machine is idle, waiting for one thread to parse all the files.
Let's adjust view function to use the executor passed. It will require structural changes to the file, to be able to jump from one offset to another.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: JS keys should be `bigint`

Describe the bug

Index::Add currently validates the key expecting it to be a number.

Steps to reproduce

N/A

Expected behavior

The validation should expect it to be a bigint.

USearch version

#20566e0ecdd99564150602dd96610e4be377e080

Operating System

N/A

Hardware architecture

Arm

Which interface are you using?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Slow index.add

Describe the bug

Index creation, vector addition and search loop is 90x slower than identical loop for faiss.

Reproduced in Google colab.

Steps to reproduce

import faiss
from usearch.index import Index

import numpy as np

# Set the seed for reproducibility
np.random.seed(42)

# Generate 500 random vectors of length 1024 with values between 0 and 1
vectors = np.random.rand(500, 1024)
vector=np.random.rand(1024)

%%timeit
# FAISS

indexf = faiss.IndexFlatL2(vectors.shape[-1])
indexf.add(np.array(vectors).astype(np.float32))
D, I = indexf.search(np.array([vector]), 50)
# 1.14 ms Β± 4.34 Β΅s per loop (mean Β± std. dev. of 7 runs, 1000 loops each)


%%timeit

index = Index(
    ndim=vectors.shape[-1], # Define the number of dimensions in input vectors
    metric='l2_sq', # Choose 'cos', 'l2sq', 'haversine' or other metric, default = 'ip'
  
   )
index.add(labels= np.arange(len(vectors)), vectors=vectors)
matches, distances, count = index.search(vector, 50, exact=True)
  
# 94.7 ms Β± 20.9 ms per loop (mean Β± std. dev. of 7 runs, 10 loops each)

Expected behavior

n/a

USearch version

latest pip

Operating System

ubuntu

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

The automated release is failing 🚨

🚨 The automated release from the main branch failed. 🚨

I recommend you give this issue a high priority, so other packages depending on you can benefit from your bug fixes and new features again.

You can find below the list of errors reported by semantic-release. Each one of them has to be resolved in order to automatically publish your package. I’m sure you can fix this πŸ’ͺ.

Errors are usually caused by a misconfiguration or an authentication problem. With each error reported below you will find explanation and guidance to help you to resolve it.

Once all the errors are resolved, semantic-release will release your package the next time you push a commit to the main branch. You can also manually restart the failed CI job that runs semantic-release.

If you are not sure how to resolve this, here are some links that can help you:

If those don’t help, or if this issue is reporting something you think isn’t right, you can always ask the humans behind semantic-release.


Cannot push to the Git repository.

semantic-release cannot push the version tag to the branch main on the remote Git repository with URL https://[secure]@github.com/unum-cloud/usearch.git.

This can be caused by:


Good luck with your project ✨

Your semantic-release bot πŸ“¦πŸš€

Bug: Add/search function use questions

Describe the bug

I used the C++ example on the official website, but I encountered a problem with the add and search function. I don’t know which function should be used here in metric_at&& metric. The following is the official website code.

I hope you can give me an example that can be used. Thank you.

Steps to reproduce

index_gt<metric_cos_gt> index;
float vec[3] = {0.1, 0.3, 0.2};

index.reserve(10);
index.add(/* key: / 42, / vector: / {&vec[0], 3}); ??? has problem.
auto results = index.search(/
query: / {&vec[0], 3}, 5 / neighbors */);

Expected behavior

I hope you can give me a case where C++ can be used

USearch version

lastest

Operating System

windows 11

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: WASM Build

Describe what you are looking for

I would like to run this off-line in the browser have you considered adding a WASM build?

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

Other

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Python `Index.restore` returns `None` when called with `pathlib.Path` object

Describe the bug

Index.restore returns None when called with pathlib.Path object. I suppose this could be fixed by using os.fspath before passing path to _index_dense_metadata. If it works maybe it should be done everywhere in the module where PathLike object is passed to the _compiled.

def metadata(path: os.PathLike) -> Optional[dict]:
if not os.path.exists(path):
return None
try:
return _index_dense_metadata(path)
except Exception:
return None

Steps to reproduce

from pathlib import Path
from usearch.index import Index
import numpy as np

# ensure there is no file prior to test
index_path = Path('index.usearch')
if index_path.exists():
    index_path.unlink()

index = Index(ndim=3)
vector = np.array([0.2, 0.6, 0.4])
index.add(42, vector)

index.save(index_path)
index.reset()

index = Index.restore(str(index_path))
assert index is not None  # okay

index = Index.restore(index_path)
assert index is not None  # assertion error

Expected behavior

No AssertionError occurs.

USearch version

v1.1.1

Operating System

Ubuntu 22.04.2

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Brute force (KNN) search without index

Describe what you are looking for

Please add an option to run brute force (KNN) search - without an index.
It can be useful for use cases where a small number of searches has to be done.
It's available in both faiss and lance.

Thanks for this awesome library!

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

"cosine distance" inverted?

Running the python example from the docs:

import numpy as np
import usearch

index = usearch.Index(
    ndim=256, # Define the number of dimensions in input vectors
    metric='cos', # Choose the "metric" or "distance", default = 'ip', optional
    dtype='f32', # Quantize to 'f16' or 'i8q100' if needed, default = 'f32', optional
    connectivity=16, # How frequent should the connections in the graph be, optional
    expansion_add=128, # Control the recall of indexing, optional
    expansion_search=64, # Control the quality of search, optional
)

vector = np.random.uniform(0, 0.3, (index.ndim)).astype(np.float32)
index.add(42, vector)
matches, distances, count = index.search(vector, 10)

I get "distances" of array([0.99999994], dtype=float32).

Looks like this is inverted - the distance between vector and itself should be 0 (float eps notwithstanding).

This is a problem with big indexes, where the number of results is much smaller than the number of vectors in the index. I want the lowest distance vectors, which is the opposite . AFACIT there is no way to invert the order used in the search function.

Am I missing something here, or is there an option to invert the order of vectors returned by search?

Feature: support for retrieving metadata from index file in C interface

Describe what you are looking for

Hey, I found an unobvious behaviour with the Go code related to the index pointer and configuration. Here's what's going on:

  1. In the Load function only the index pointer is set up, but the configuration is left with incorrect dimensions.

    usearch/golang/lib.go

    Lines 341 to 356 in e686a3d

    func (index *Index) Load(path string) error {
    if index.opaque_handle == nil {
    panic("Index is uninitialized")
    }
    c_path := C.CString(path)
    defer C.free(unsafe.Pointer(c_path))
    var errorMessage *C.char
    C.usearch_load((C.usearch_index_t)(unsafe.Pointer(index.opaque_handle)), c_path, (*C.usearch_error_t)(&errorMessage))
    if errorMessage != nil {
    return errors.New(C.GoString(errorMessage))
    }
    return nil
    }

  2. In the Get function, a vector is created with a length based on dimensions from config.

    vector = make([]float32, index.config.Dimensions)

    Same in the Search function
    if len(query) != int(index.config.Dimensions) {

  3. The dimensions come from the config struct, but it looks like it might not be set up properly. Since the Load function doesn't set the dimensions, and user can forget to do that explicitly.

One solution might be to set dimensions in Load function.

Or better to add support for metadata loading in C interface, as it is done for Python.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug:

Describe the bug

usearch exact search do not report the keys properly even in the latest version: 2.5.1

Steps to reproduce

test code:

from usearch.index import search, MetricKind, Matches, BatchMatches
import numpy as np

# Generate 10'000 random vectors with 1024 dimensions
vectors = np.random.rand(1000, 64).astype(np.float32)

one_in_many: Matches = search(vectors, vectors[100], 1, MetricKind.Cos, exact=True)
print(one_in_many[0])

print:

Match(key=0, distance=0.0)

Expected behavior

The expected behavior is like the approximate one - we want

Match(key=100, distance=0.0)

USearch version

v2.5.1

Operating System

OSX

Hardware architecture

Arm

Which interface are you using?

Python bindings

Contact Details

happy to reply to comment or send me an email

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: cannot catch Runtime exception when index fails to load

Describe the bug

We are using USearch from swift
When a saved index cannot be loaded because, say, the file has been corrupted during save(), the lib crashes.
We get a runtime exception that cannot be caught.

My question is as follows (sorry for being newbie in c++):

  • why are the save() and load() function declared as noexcept since fread can fail() ?

There is a mechanism in Swift to ensure that C exception can be handled via ObjC bridging...requires some easy code...but since these methods will call std:terminate when there is an error in loading the index, it won't help

Steps to reproduce

create an index and corrupt the file on disk
try to load it
it crashes the runtime

Expected behavior

the runtime should not crash
exception should be sent to the caller of the library

USearch version

1.3.0

Operating System

macOS

Hardware architecture

Arm

Which interface are you using?

C++ implementation

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Distance is negative

Describe the bug

Shouldn't distances for all metrics be non-negative? When I used the Inner Product as a distance measure, I found that it could produce negative values when using Index.search().

The same behavior is observed in the C# binding, so I suspect the same will occur when using the C99 interface.

Steps to reproduce

import numpy as np
from usearch.index import Index, Matches

index = Index(
    ndim=3, 
    metric='ip', 
    dtype='f32', 
)

keys = np.arange(5)
vectors = np.array(
    [
        [1, 1, 1],
        [-1, -1, -1],
        [1, 2, 3],
        [-1, -2, -3],
        [1, -2, -3],
    ], 
    dtype=np.float32
)
search_vector = np.array([1, 1, 1], dtype=np.float32)

index.add(keys, vectors)
matches: Matches = index.search(search_vector, 4)

for match in matches:
    assert(match.distance > -1e-6), f"Negative distance {match.distance}"

Expected behavior

Distance is always >= -epsilon, where epsilon is very small.

USearch version

v2.3.1

Operating System

Ubuntu 22.04

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: unable to install with wasmer

Describe the bug

Following the JavaScript docs for the browser, running:

wasmer install unum/usearch

Results in

error: Unable to find "install" in the registry
╰─▢ 1: Not found

Steps to reproduce

  • install wasmer brew install wasmer
  • try to install usearch wasmer install unum/usearch

Expected behavior

usearch installs correctly

USearch version

2.1.2

Operating System

MacOS

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: Windows support for memory mapping

Describe what you are looking for

We can use the safe WinAPI interface directly to memory-map serialized .usearch files.
See: _sopen_s , _fstat64 , CreateFileMapping, MapViewOfFile, UnmapViewOfFile.
Here is an example implementation of mmap.
We don't need it, nor should we use macros in the header.
Still, use it for inspiration :)

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Documenting Memory Usage

Describe what you are looking for

Recent benchmarks by @davvard suggest, that even for the same f32 vectors USearch uses up to 4x less memory than FAISS. We should double-check those numbers and measure the memory consumption on the same test dataset we have used in the FAISS comparison on the front page.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: possibility of a streaming API?

Describe what you are looking for

Caveat, this question may not make sense: I'm coming at it from a node.js perspective and have little understanding of the underlying C++ implementation.

When handling large datasets in node, it's customary to pass streams around instead of loading it all into memory as a string. Given that usearch can already serialize indices to disk, it would be really neat if I could open a stream when fetching/reading a document and connect it to a usearch index for streamed serialization.

Would something like even that be possible? I'd be happy to try and implement it if so, with guidance.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

C++ implementation

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Feature: WAI(WebAssembly Interfaces) for USearch WASM binary

Describe what you are looking for

Provide WebAssembly Interfaces(WAI) over USearch C interface for USearch WASM binary.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: Freeze on add method

Describe the bug

Interpreter execution freezes after adding vector with label same as one removed before.

Steps to reproduce

import numpy as np
from usearch.index import Index

index = Index(ndim=3)
vector = np.array([0.2, 0.6, 0.4])
index.add(42, vector)
index.remove(42)
index.add(42, vector) # Here execution freezes

Expected behavior

Interpreter finish execution with no errors.

USearch version

v0.22.3

Operating System

Ubuntu 22.04.1

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Bug: require(usearch) doesn’t work in node.js

Describe the bug

This was brought up and discussed in discord channel.

Node 18.0.0
Error: Cannot find module 'usearch' at Module.require (node:internal/modules/cjs/loader:1005:19) {code: 'MODULE_NOT_FOUND', requireStack: Array(16), stack: 'Error: Cannot find module 'usearch'Require s…re (node:internal/modules/cjs/loader:1005:19)', message: 'Cannot find module 'usearch'

even though my package.json has the dependency installed for "usearch": "^0.9.5”

Also, when directly using usearch.node as require(usearch.node), we get below error:
error: /src/binaries/usearch.node: invalid ELF header

Steps to reproduce

in a node js file:
npm install usearch

then add below code
var usearch = require('usearch’);var index = new usearch.Index({ metric: 'cos', connectivity: 16, dimensions: 3 });index.add(42, new Float32Array([0.2, 0.6, 0.4]));var results = index.search(new Float32Array([0.2, 0.6, 0.4]), 10);console.log(results);

Expected behavior

ability to use require(β€˜usearch’)

USearch version

v0.9.5

Operating System

Ubuntu

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.