Comments (5)
@HannanKan These are just empirical figures that we've found to work well on a range of general datasets. If you want a more specific answer, the theory of graph ANNS says that R (the graph degree) should be greater than or equal to the intrinsic dimensionality of the dataset. Intrinsic dimensionality is a tricky notion here, because it depends on how you choose to measure dimensionality. It can be through PCA, or some other measure of choice. For most human-generated (and some machine-generated) datasets, the intrinsic dimensionality is much lower than the actual dimensionality, and as such, graph ANNS algorithms give much better performance than other algorithms that do not make use of this information.
As for a suitable value of L, we've only found through our experiments that a value between 1.5x and 2x the value of R suffices for most datasets. You can always increase L as much as you want, but you'll see diminishing returns while the build time blows up.
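As a rough illustration of the 1.5x to 2x rule of thumb above, here is a minimal sketch. The helper name `build_list_size_range` is hypothetical and not part of DiskANN; it only does the arithmetic.

```python
import math


def build_list_size_range(R: int) -> tuple[int, int]:
    """Return (low, high) build-time L values as 1.5x and 2x the degree R.

    Hypothetical helper: it encodes the empirical rule of thumb from this
    thread, not anything DiskANN computes internally.
    """
    if R <= 0:
        raise ValueError("R must be positive")
    return math.ceil(1.5 * R), 2 * R


low, high = build_list_size_range(64)
print(f"R=64 -> try L between {low} and {high}")
```

For R = 100 the same helper gives 150 to 200, which falls inside the 75 to 200 range the README suggests for L.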
Hope this helps.
from diskann.
According to the README, the degree of the graph index is recommended to be set between 60 and 150, and the size of the search list during index building between 75 and 200. What is the reason for giving these reference ranges?
To add to Shikhar's points, since there is an element of randomness to the graph construction, the degree needs to be at least log n (n = number of points in the index), and we find that 2 log n to 4 log n is suitable. With log taken base 2, log n is about 30 for a billion points, so 64 to 128 is reasonable.
The search list size needed to hit a given recall is a function of the "hardness" of the dataset, and Shikhar has given some idea of how hardness could be quantified. In practice, however, we empirically try a few values to see what matches the scenario requirements. For datasets like BIGANN-1B and DEEP-1B, this parameter range corresponds to reasonable results with a build time of around 4 days.
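The 2 log n to 4 log n guideline can be sketched as below. This assumes the logarithm is base 2 (which matches the billion-point figures quoted above); the function name `degree_range` is hypothetical.

```python
import math


def degree_range(num_points: int) -> tuple[int, int]:
    """Return (2*log2(n), 4*log2(n)), each rounded up.

    Assumption: log is base 2. This is a rule-of-thumb calculator,
    not a DiskANN API.
    """
    if num_points < 2:
        raise ValueError("need at least 2 points")
    log_n = math.log2(num_points)
    return math.ceil(2 * log_n), math.ceil(4 * log_n)


print(degree_range(1_000_000_000))  # -> (60, 120)
```

The result for a billion points, (60, 120), is close to the 64 to 128 range mentioned above once rounded to convenient values.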
Grateful for your explanation @ShikharJ @harsha-simhadri. Why does increasing L in the building phase lead to better search performance (recall and QPS)?
My understanding is: a large L results in a large visited set V (returned by the GreedySearch function), and a large V increases the probability of introducing long-range edges (in the RobustPrune function).
Is that right?
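To make the first half of that reasoning concrete, here is a simplified sketch of a best-first search bounded by a list of size L that returns the visited set. It is not DiskANN's GreedySearch implementation: the 1-D points, the toy graph, and every name below are assumptions chosen for illustration, and RobustPrune is not modeled.

```python
def greedy_search(graph, points, start, query, list_size):
    """Best-first search that keeps at most list_size candidates.

    Returns (closest candidate found, set of visited nodes).
    """
    def dist(i):
        return abs(points[i] - query)

    candidates = {start}
    visited = set()
    while True:
        unvisited = candidates - visited
        if not unvisited:
            break
        # Expand the closest candidate that has not been visited yet.
        p = min(unvisited, key=lambda i: (dist(i), i))
        visited.add(p)
        candidates.update(graph[p])
        # Truncate the candidate list to the list_size closest nodes.
        ranked = sorted(candidates, key=lambda i: (dist(i), i))
        candidates = set(ranked[:list_size])
    best = min(candidates, key=lambda i: (dist(i), i))
    return best, visited


# Toy data: 30 points on a line, each linked to neighbors within 2 steps.
n = 30
points = [float(i) for i in range(n)]
graph = {i: [j for j in (i - 2, i - 1, i + 1, i + 2) if 0 <= j < n]
         for i in range(n)}

best_small, visited_small = greedy_search(graph, points, 0, 15.0, list_size=1)
best_large, visited_large = greedy_search(graph, points, 0, 15.0, list_size=4)
print(len(visited_small), len(visited_large))
```

On this toy graph both runs find the same nearest point, but the larger list visits more nodes. During index construction, a larger visited set would give the pruning step a bigger pool of candidate neighbors to choose from.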
@HannanKan Sorry for the late reply. Yes, this is correct.
Closing for now.