pablocael / pynear
A Python library for efficient KNN search within metric spaces using multiple distance functions.
License: MIT License
It would be great to support more distance metrics, especially some which cannot be emulated by pre/post-processing of the data.
I would also like to see the Haversine metric (https://en.wikipedia.org/wiki/Haversine_formula). It can be emulated by first transforming spherical latitude/longitude coordinates to Cartesian ones, but that transform is a bit cumbersome and increases memory usage from 2 to 3 dimensions, so it would be useful to support it directly. It's useful in GPS applications.
I can help implement them, but my SIMD/AVX knowledge is very limited.
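To make the Haversine request concrete, here is a minimal sketch in plain Python of the direct formula next to the Cartesian workaround described above. The function names and the Earth radius constant are assumptions chosen for illustration; none of this is pynear API.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, for illustration only


def haversine(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.0.
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))


def to_unit_cartesian(lat, lon):
    """The workaround: map (lat, lon) in degrees to a 3D point on the unit sphere."""
    phi, lam = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lam),
            math.cos(phi) * math.sin(lam),
            math.sin(phi))


def chord_to_arc(chord, radius=EARTH_RADIUS_KM):
    """Convert a Euclidean chord length on the unit sphere back to arc length."""
    return 2 * radius * math.asin(min(1.0, chord / 2))
```

With the workaround, an L2 index would be built over the 3D points and the returned chord lengths converted back with `chord_to_arc`. The chord length grows monotonically with the arc length, so the neighbor ordering is the same, at the cost of the extra dimension mentioned above.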
The README install instructions currently describe a local package build (running pip install .).
We can add a build section containing the current instructions and a separate install section with the PyPI instructions.
Especially for low dimensionality using L2 functions and for high dimensionality using the binary index.
Hi, great library!
But I think it's incorrect to use the squared Euclidean optimization, since VP trees require a true metric, which I think the squared distance is not.
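A small check illustrates the concern. The sketch below uses plain Python with hypothetical helper names and tests the triangle inequality for three collinear points under both distances.

```python
import math


def euclidean(p, q):
    return math.dist(p, q)


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def satisfies_triangle_inequality(dist, a, b, c):
    """Check dist(a, c) <= dist(a, b) + dist(b, c) for one triple of points."""
    return dist(a, c) <= dist(a, b) + dist(b, c)


a, b, c = (0.0,), (1.0,), (2.0,)
print(satisfies_triangle_inequality(euclidean, a, b, c))          # True: 2 <= 1 + 1
print(satisfies_triangle_inequality(squared_euclidean, a, b, c))  # False: 4 > 1 + 1
```

Squaring preserves the ordering of distances, so it is still safe for ranking candidates. The problem arises only where a search step relies on the triangle inequality to discard subtrees.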
When building an index and searching from Python, data can be copied twice, both for the index data and for the query data. We can improve performance by avoiding these copies with move semantics.
Since we are already building wheels on CI, it would be easy to upload them to PyPI automatically through GitHub Actions as well. We need to agree on a name, though. Should we use pyvptree (it's still available on PyPI https://pypi.org/project/pyvptree/) or another name, maybe something more generic related to nearest neighbor search?
Should the second condition be _indexEnd? Also, both variables are unsigned integers, which are never -1. I would feel better with explicit casts for -1 everywhere (this would also reduce compiler warnings).
Add more tests
ARM (and possibly other architectures) has no support for the SIMD operations currently used.
Create a fallback implementation for specific platforms (like ARM64) and use non-SIMD builds for those platforms.
I suggest using https://pypi.org/project/clang-format/ and https://pypi.org/project/black/ (or https://pypi.org/project/yapf/ if you prefer) for formatting directly. Both can be pip installed and don't require further apt or brew installs, since clang-format offers wheels with the necessary binaries.
Pickle support would be nice to have. All scikit-learn or scipy trees are pickle'able.
Create different datasets, such as normally distributed clusters and others, including uniformly random data (the worst case).
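One possible way to generate the two kinds of data mentioned above is sketched here with the standard library only. The function names, cluster count, and spread are assumptions for illustration, not existing benchmark code.

```python
import random


def uniform_dataset(n, dim, seed=0):
    """n points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def clustered_dataset(n, dim, n_clusters=8, spread=0.05, seed=0):
    """n points drawn from Gaussian blobs around randomly placed centers."""
    rng = random.Random(seed)
    centers = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    points = []
    for _ in range(n):
        center = rng.choice(centers)
        points.append([rng.gauss(c, spread) for c in center])
    return points
```

Fixing the seed keeps runs comparable when the same dataset is fed to different index types.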
We can make use of the fact that binary distances have a limited number of states and try something similar to a BK-tree, or keep a map between possible pairs and their distances, to speed up binary index calculation.
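As a rough illustration of the lookup idea (not a BK-tree, and not how pynear computes Hamming distances), the sketch below precomputes the bit count of every possible byte value and sums table lookups over the XOR of two byte strings. The names are hypothetical.

```python
# Bit count for every possible byte value, computed once up front.
POPCOUNT = [bin(value).count("1") for value in range(256)]


def hamming_bytes(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length byte strings via table lookup."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(POPCOUNT[x ^ y] for x, y in zip(a, b))


print(hamming_bytes(b"\x00\xff", b"\x0f\xff"))  # 4
```

The table stays small because each byte has only 256 states. A table over all pairs of full-width vectors would not fit in memory, which is why the lookup is applied per byte here.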
Since binary indices must be used with the proper data dimension, we have two options:
This is because Hamming distances are optimized for a specific bit size:
Right now we use a fixed 256 bits for Hamming:
int64_t dist_hamming(const arrayli &p1, const arrayli &p2) {
    return hamming<256>(reinterpret_cast<const uint64_t *>(&p1[0]),
                        reinterpret_cast<const uint64_t *>(&p2[0]));
}
So users can simply pass data with the wrong dimension and it will fail.
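One way to fail early instead would be a dimension check on the Python side before the data reaches the index. The helper below is a hypothetical sketch, not part of pynear. It assumes each row is a bytes-like object and that the expected width is the fixed 256 bits from the snippet above.

```python
EXPECTED_BITS = 256               # matches the fixed hamming<256> above
EXPECTED_BYTES = EXPECTED_BITS // 8


def check_binary_rows(rows):
    """Raise ValueError if any row is not exactly EXPECTED_BYTES long."""
    for i, row in enumerate(rows):
        if len(row) != EXPECTED_BYTES:
            raise ValueError(
                f"row {i} has {len(row) * 8} bits, expected {EXPECTED_BITS}"
            )
    return rows
```

Calling such a check when data is added and again at query time would turn a silent out-of-bounds read into a clear error message.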