pablocael / pynear

A python library for efficient KNN search within metric spaces using multiple distance functions.

License: MIT License

CMake 0.07% C++ 98.47% C 0.82% Python 0.63% Makefile 0.01%

pynear's Issues

Feature request: more distance metrics

It would be great to support more distance metrics, especially some which cannot be emulated by pre/post-processing of the data.

Manhattan metric (https://en.wikipedia.org/wiki/Taxicab_geometry)
Chebyshev metric (https://en.wikipedia.org/wiki/Chebyshev_distance), useful for many practical applications in robotics
Jensen-Shannon metric (https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence), applications in machine learning and many other fields using statistics
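
For discussion, here is a minimal scalar sketch of the three metrics listed above. The function names are hypothetical and not part of pynear, and no SIMD is used. The Jensen-Shannon function returns the square root of the divergence (natural log), which is the form that satisfies the metric axioms, and it assumes the inputs are probability vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Manhattan (L1) distance: sum of absolute coordinate differences.
double manhattan(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += std::fabs(a[i] - b[i]);
    return sum;
}

// Chebyshev (L-infinity) distance: largest absolute coordinate difference.
double chebyshev(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double best = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) best = std::max(best, std::fabs(a[i] - b[i]));
    return best;
}

// Jensen-Shannon distance: sqrt of the Jensen-Shannon divergence.
// Assumes p and q are probability vectors (non-negative, summing to 1).
double jensen_shannon(const std::vector<double>& p, const std::vector<double>& q) {
    assert(p.size() == q.size());
    double divergence = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        const double m = 0.5 * (p[i] + q[i]);
        // Terms with a zero probability contribute nothing (0 * log 0 := 0).
        if (p[i] > 0.0) divergence += 0.5 * p[i] * std::log(p[i] / m);
        if (q[i] > 0.0) divergence += 0.5 * q[i] * std::log(q[i] / m);
    }
    return std::sqrt(std::max(divergence, 0.0));  // clamp tiny negative rounding error
}
```

None of these reduces to a Euclidean distance through simple pre/post-processing, which is why they would need their own distance functions.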

I would also like to see the Haversine metric (https://en.wikipedia.org/wiki/Haversine_formula). It can be emulated by first transforming spherical latitude/longitude coordinates to Cartesian coordinates, but that transform is a bit cumbersome and increases memory usage from 2 to 3 dimensions, so supporting it directly would be useful. It's useful in GPS applications.

I can help implement them, but my SIMD/AVX knowledge is very limited.
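
To make the trade-off concrete, here is a rough sketch of both routes: a direct haversine function on (latitude, longitude) in degrees, and the Cartesian emulation, where each point becomes a 3D unit vector and the Euclidean chord length is converted back to a great-circle distance. All names are hypothetical, and the Earth radius is an assumed mean value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

const double kPi = 3.14159265358979323846;
const double kEarthRadiusKm = 6371.0;  // assumed mean Earth radius

double to_radians(double degrees) { return degrees * kPi / 180.0; }

// Direct route: great-circle distance in km from two (lat, lon) pairs in degrees.
double haversine_km(double lat1, double lon1, double lat2, double lon2) {
    const double phi1 = to_radians(lat1);
    const double phi2 = to_radians(lat2);
    const double s1 = std::sin((phi2 - phi1) / 2.0);
    const double s2 = std::sin(to_radians(lon2 - lon1) / 2.0);
    const double h = s1 * s1 + std::cos(phi1) * std::cos(phi2) * s2 * s2;
    return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(std::min(1.0, h)));
}

// Emulation route: store each point as a 3D unit vector (3 values instead of 2).
struct Vec3 { double x, y, z; };

Vec3 to_unit_vector(double lat, double lon) {
    assert(lat >= -90.0 && lat <= 90.0);
    const double phi = to_radians(lat);
    const double lambda = to_radians(lon);
    return Vec3{std::cos(phi) * std::cos(lambda), std::cos(phi) * std::sin(lambda), std::sin(phi)};
}

// Euclidean chord length between two unit vectors.
double chord(const Vec3& a, const Vec3& b) {
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Post-processing step: convert a chord length back to kilometers on the sphere.
double chord_to_km(double c) {
    return 2.0 * kEarthRadiusKm * std::asin(std::min(1.0, c / 2.0));
}
```

The chord length grows monotonically with the great-circle distance, so nearest-neighbor ordering is the same either way; the emulation only costs the extra coordinate and the conversion steps.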

Migrate all recursive functions to iterative with stack
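
As an illustration of the kind of change meant here, the sketch below shows a recursive pre-order traversal of a generic binary tree next to an equivalent iterative version that uses an explicit `std::stack`. The `Node` type is hypothetical and is not the library's VP-tree node.

```cpp
#include <cassert>
#include <memory>
#include <stack>
#include <vector>

// Hypothetical binary tree node, used only for this illustration.
struct Node {
    int value;
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
    explicit Node(int v) : value(v) {}
};

// Recursive pre-order traversal: call depth grows with tree depth.
void visit_recursive(const Node* node, std::vector<int>& out) {
    if (node == nullptr) return;
    out.push_back(node->value);
    visit_recursive(node->left.get(), out);
    visit_recursive(node->right.get(), out);
}

// Iterative pre-order traversal: pending nodes live on a heap-allocated stack.
void visit_iterative(const Node* root, std::vector<int>& out) {
    std::stack<const Node*> pending;
    if (root != nullptr) pending.push(root);
    while (!pending.empty()) {
        const Node* node = pending.top();
        pending.pop();
        assert(node != nullptr);  // only non-null nodes are ever pushed
        out.push_back(node->value);
        // Push right first so that left is popped (visited) first.
        if (node->right) pending.push(node->right.get());
        if (node->left) pending.push(node->left.get());
    }
}
```

The iterative form avoids deep call stacks on unbalanced or very large trees.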

Update installation instructions to use PyPI repo

The README install instructions point to a local build of the package, installed by running `pip install .`

We can add a build section containing the current instructions and a separate install section with the PyPI instructions.

Add multiplatform support (Linux, Windows, macOS)

Use benchmarks baseline to improve performance

Especially for low-dimensional data using the L2 functions and for high-dimensional data using the binary index

Squared Euclidean not metric

Hi, great library!

But I think it's incorrect to use the squared Euclidean optimization, since VP trees require a true metric, and I think the squared distance is not one (it does not satisfy the triangle inequality).
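
A small counterexample shows the concern. Take three points on a line, x = 0, y = 1, z = 2: the squared distance from x to z is 4, which is larger than the sum 1 + 1 of the two shorter squared distances, so the triangle inequality fails. The helper names below are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Squared Euclidean distance between two 1D points.
double squared_distance_1d(double a, double b) { return (a - b) * (a - b); }

// Plain Euclidean distance between two 1D points.
double distance_1d(double a, double b) { return std::fabs(a - b); }

// True when d(x, z) <= d(x, y) + d(y, z) for the three given distances.
bool triangle_holds(double dxz, double dxy, double dyz) {
    assert(dxz >= 0.0 && dxy >= 0.0 && dyz >= 0.0);
    return dxz <= dxy + dyz;
}

// Checks the inequality for x, y, z under the squared distance.
bool squared_satisfies_triangle(double x, double y, double z) {
    return triangle_holds(squared_distance_1d(x, z), squared_distance_1d(x, y),
                          squared_distance_1d(y, z));
}

// Checks the inequality for x, y, z under the plain distance.
bool euclidean_satisfies_triangle(double x, double y, double z) {
    return triangle_holds(distance_1d(x, z), distance_1d(x, y), distance_1d(y, z));
}
```

Squaring preserves the ordering of distances, so it is safe for comparing candidates, but VP-tree pruning relies on the triangle inequality, so bounds computed from squared values can discard the wrong subtree.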

Prevent multiple data copies by using move operator

When building the index and searching from Python, data can be copied twice, both for index data and for query data. We can improve performance by avoiding these copies using move operators.
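
A minimal sketch of the idea on the C++ side, assuming a flat `std::vector<float>` buffer. The class `FlatIndexSketch` is hypothetical and does not reflect the actual pynear classes or the Python binding layer: the constructor takes the buffer by value and moves it into the member, so a caller that passes an rvalue hands over the allocation instead of copying it.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical index that owns a flat row-major buffer of points.
class FlatIndexSketch {
public:
    // Taking the buffer by value lets callers choose: pass an lvalue (one copy)
    // or pass std::move(buffer) (no copy, the allocation is transferred).
    FlatIndexSketch(std::vector<float> points, std::size_t dimension)
        : points_(std::move(points)), dimension_(dimension) {
        assert(dimension_ > 0);
        assert(points_.size() % dimension_ == 0);
    }

    const float* data() const { return points_.data(); }
    std::size_t size() const { return points_.size() / dimension_; }

private:
    std::vector<float> points_;
    std::size_t dimension_;
};
```

With `FlatIndexSketch index(std::move(buffer), 2);` the index ends up owning the same allocation that `buffer` held, so no element is copied.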

Pypi package

Since we are already building wheels on CI, it would be easy to upload them to PyPI automatically through GitHub Actions as well. We need to agree on a name, though. Should we use pyvptree (it's still available on PyPI: https://pypi.org/project/pyvptree/) or another, more generic name related to nearest neighbor search?

`isEmpty()` issues

https://github.com/pablocael/pyvptree/blob/f764e562e641d2616e2d2ddfcc404942cc6ecffa/pyvptree/include/VPTree.hpp#L43

Should the second condition check `_indexEnd`? Also, both variables are unsigned integers, which can never hold a negative value such as -1. I would feel better with explicit casts for -1 everywhere (this would also reduce compiler warnings).
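
One way to make the intent explicit is a named sentinel instead of a bare -1. The sketch below is hypothetical: the struct and member names are assumptions and do not reproduce the linked `VPTree.hpp` code. The sentinel equals `static_cast<std::size_t>(-1)`, so the comparison behaves the same while avoiding signed/unsigned warnings.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Named sentinel for "index not set"; same value as static_cast<std::size_t>(-1).
constexpr std::size_t kUnsetIndex = std::numeric_limits<std::size_t>::max();

// Hypothetical partition holding a half-open range of indices.
struct PartitionSketch {
    std::size_t indexStart = kUnsetIndex;
    std::size_t indexEnd = kUnsetIndex;

    // Empty when either bound is still unset; both bounds are checked explicitly.
    bool isEmpty() const {
        return indexStart == kUnsetIndex || indexEnd == kUnsetIndex;
    }

    void setRange(std::size_t start, std::size_t end) {
        assert(start <= end);
        indexStart = start;
        indexEnd = end;
    }
};
```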

Improve tests

Add more tests

Implement fallback non-SIMD functions to support non-SIMD architectures

ARM (and possibly other architectures) does not support the x86 SIMD instructions (such as AVX) used here.
Create a fallback implementation for specific platforms (like ARM64) and use non-SIMD builds for those platforms.
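
A possible shape for this, sketched under the assumption that selection happens at compile time through the `__AVX__` macro that compilers define when AVX is enabled. The function names are hypothetical and are not the library's actual distance functions.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Portable scalar implementation: compiles on every architecture.
float l2_squared_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Squared L2 distance: uses AVX when available, otherwise the scalar fallback.
float l2_squared(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {  // process 8 floats per iteration
        const __m256 d = _mm256_sub_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(d, d));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float lane : lanes) sum += lane;
    // Remaining elements (fewer than 8) go through the scalar path.
    return sum + l2_squared_scalar(a + i, b + i, n - i);
#else
    return l2_squared_scalar(a, b, n);
#endif
}
```

On a build without AVX (for example ARM64) only the scalar branch is compiled, so the same source builds everywhere.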

`stylize` doesn't support windows

I suggest using https://pypi.org/project/clang-format/ and https://pypi.org/project/black/ (or https://pypi.org/project/yapf/ if you prefer) for formatting directly. Both can be installed with pip and don't require further apt or brew installs, since clang-format offers wheels with the necessary binaries.

Select proper implementation based on dimension
Name indices for each size, like BinaryIndex256 and BinaryIndex64, and assert that dimension * 8 = size.

This is because Hamming distances are optimized for a fixed number of bits.

Right now we use a fixed 256 bits for Hamming:

```cpp
// Hamming distance between two points, with the bit width hard-coded to 256.
int64_t dist_hamming(const arrayli &p1, const arrayli &p2) {
    return hamming<256>(reinterpret_cast<const uint64_t *>(&p1[0]), reinterpret_cast<const uint64_t *>(&p2[0]));
}
```

So users can simply pass data with the wrong dimension, and it will fail.
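
A rough sketch of the proposal, with hypothetical names (`BinaryIndexSketch` and its members are assumptions, not existing pynear classes): the bit width becomes a template parameter, and a check rejects data whose dimension in bytes times 8 does not match it. The popcount goes through `std::bitset` to stay portable; an optimized build would likely use a hardware popcount instead.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical binary index parameterized by its bit width.
template <std::size_t Bits>
struct BinaryIndexSketch {
    static_assert(Bits % 64 == 0, "Bits must be a multiple of 64");

    // Rejects rows whose size in bytes does not fill exactly `Bits` bits.
    static void check_dimension(std::size_t dimension_bytes) {
        if (dimension_bytes * 8 != Bits) {
            throw std::invalid_argument("data dimension does not match index bit width");
        }
    }

    // Hamming distance over Bits / 64 words of 64 bits each.
    static std::size_t hamming(const std::uint64_t* a, const std::uint64_t* b) {
        assert(a != nullptr && b != nullptr);
        std::size_t distance = 0;
        for (std::size_t w = 0; w < Bits / 64; ++w) {
            distance += std::bitset<64>(a[w] ^ b[w]).count();
        }
        return distance;
    }
};

// One named index type per supported size, as suggested above.
using BinaryIndex64Sketch = BinaryIndexSketch<64>;
using BinaryIndex256Sketch = BinaryIndexSketch<256>;
```

With this layout, passing 8-byte rows to the 256-bit index fails with a clear error instead of reading past the end of the data.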
