Question about R, L parameter setup. (diskann, closed)

microsoft commented on May 23, 2024

Question about R, L parameter setup.

from diskann.

Comments (5)

ShikharJ commented on May 23, 2024

@HannanKan These are just empirical figures that we've found to work well for some general datasets. If you want a more specific answer, the theory of graph ANNS dictates that R should be greater than or equal to the intrinsic dimensionality of the dataset. Intrinsic dimensionality is a tricky notion here, because it depends on how you choose to measure it: through PCA, or some other measure of your choice. For most human-generated (and some machine-generated) datasets, the intrinsic dimensionality is much lower than the actual dimensionality, and as such, graph ANNS algorithms perform much better than other algorithms that do not make use of this information.
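One way to make the PCA-based notion concrete is sketched below. This is only an illustration, not something DiskANN provides: the function name `pca_intrinsic_dim` and the 95% explained-variance threshold are assumptions chosen for the example, and other thresholds or estimators would give different numbers.

```python
import numpy as np

def pca_intrinsic_dim(X, variance_kept=0.95):
    """Count the principal components needed to explain `variance_kept`
    of the total variance (the threshold is an assumption, not a DiskANN rule)."""
    Xc = X - X.mean(axis=0)                      # centre the data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s ** 2                                 # proportional to per-component variance
    cum = np.cumsum(var) / var.sum()             # cumulative explained-variance ratio
    k = int(np.searchsorted(cum, variance_kept)) + 1
    return min(k, len(var))

# Toy data: 64-dimensional vectors that actually live in a 5-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.standard_normal((1000, 5))
embed = rng.standard_normal((5, 64))
X = latent @ embed

print("ambient dim:", X.shape[1], "PCA estimate:", pca_intrinsic_dim(X))
```

Under this measure, the estimate for the toy data cannot exceed 5 even though the vectors have 64 coordinates, which is the kind of gap between actual and intrinsic dimensionality described above.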

As for what value of L is suitable, we've only found through our experiments that a value between 1.5x and 2x the value of R suffices for most datasets. You can always increase L as much as you want, but you'll only get diminishing returns while the build time blows up.

Hope this helps.

harsha-simhadri commented on May 23, 2024

> According to the README, the degree of the graph index is recommended to be set between 60 and 150, and the size of the search list during index building between 75 and 200. What is the reason for giving these reference ranges?

To add to Shikhar's points, since there is an element of randomness to the graph construction, the degree needs to be at least log n (n = number of points in the index), and we find that 2 log n to 4 log n is suitable. So for a billion points, 64 to 128 is reasonable.
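The arithmetic behind these ranges can be sketched as follows. The helper name `suggest_build_params` is hypothetical, and the base-2 logarithm is an assumption (it is the base under which a billion points lands near the 64 to 128 range quoted here). The L range simply applies the 1.5x to 2x rule of thumb from earlier in the thread to the ends of the R range.

```python
import math

def suggest_build_params(n_points):
    """Rule-of-thumb ranges from this thread, not an official DiskANN API.

    R: 2*log2(n) to 4*log2(n)   (log base 2 is an assumption)
    L: 1.5x the low end of R to 2x the high end of R
    """
    log_n = math.log2(n_points)
    r_low, r_high = math.ceil(2 * log_n), math.ceil(4 * log_n)
    l_low, l_high = math.ceil(1.5 * r_low), 2 * r_high
    return {"R": (r_low, r_high), "L": (l_low, l_high)}

print(suggest_build_params(10**9))
# {'R': (60, 120), 'L': (90, 240)}
```

Rounding 60 and 120 up to the nearest powers of two gives the 64 to 128 figure; these are starting points for experiments rather than guarantees.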

The search list size needed to hit a given recall is a function of the "hardness" of the dataset, and Shikhar has given some idea of how hardness could be quantified. However, we empirically try a few values to see what matches scenario requirements. For datasets like BIGANN-1B and DEEP-1B, this parameter size corresponds to reasonable results with a build time of around 4 days.

HannanKan commented on May 23, 2024

Grateful for your explanation, @ShikharJ @harsha-simhadri. Why does increasing L in the building phase lead to better search performance (recall and QPS)?

My understanding is: a large L results in a large visited set V (returned by the GreedySearch function), and a large V increases the probability of introducing long-range edges (in the RobustPrune function).

Is that right?
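To make the question concrete, here is a minimal toy sketch of the two steps, written from the description of GreedySearch and RobustPrune in the DiskANN paper rather than taken from the DiskANN code base. The names `greedy_search` and `robust_prune`, the `alpha` value, and the random toy graph (a ring plus random edges, so every node is reachable) are all assumptions made for illustration.

```python
import numpy as np

def greedy_search(graph, points, start, query, L):
    """Beam search with list size L; returns (beam sorted by distance, visited set V)."""
    dist = lambda i: float(np.linalg.norm(points[i] - query))
    beam, visited = {start}, set()
    while beam - visited:
        p = min(beam - visited, key=dist)        # closest unvisited candidate
        visited.add(p)
        beam |= set(graph[p])                    # add its out-neighbours
        if len(beam) > L:                        # keep only the L closest
            beam = set(sorted(beam, key=dist)[:L])
    return sorted(beam, key=dist), visited

def robust_prune(points, p, candidates, alpha, R):
    """Pick at most R out-neighbours for p from the candidate set."""
    d = lambda a, b: float(np.linalg.norm(points[a] - points[b]))
    pool = sorted(set(candidates) - {p}, key=lambda v: d(p, v))
    kept = []
    while pool and len(kept) < R:
        best = pool.pop(0)                       # closest remaining candidate
        kept.append(best)
        # Drop candidates that `best` already covers: alpha * d(best, v) <= d(p, v).
        pool = [v for v in pool if alpha * d(best, v) > d(p, v)]
    return kept

# Toy setup (assumed values): 400 random points, out-degree 8.
rng = np.random.default_rng(0)
n, dim, R = 400, 8, 8
points = rng.standard_normal((n, dim))
graph = [[(i + 1) % n] + [int(j) for j in rng.choice(n, size=R - 1, replace=False)]
         for i in range(n)]

p, start = 123, 0                                # point being inserted, entry node
for L in (10, 50):
    _, visited = greedy_search(graph, points, start, points[p], L)
    new_edges = robust_prune(points, p, visited, alpha=1.2, R=R)
    print(f"L={L}: |V|={len(visited)}, edges kept={len(new_edges)}")
```

In this sketch the search cannot stop before at least L nodes have been visited (as long as that many are reachable), so a larger L hands `robust_prune` a larger and more varied candidate pool to choose from.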

ShikharJ commented on May 23, 2024

@HannanKan Sorry for the late reply. Yes, this is correct.

ShikharJ commented on May 23, 2024

Closing for now.
