zjulearning / nsg Goto Github PK

Navigating Spreading-out Graph For Approximate Nearest Neighbor Search

License: MIT License

CMake 2.32% C++ 97.08% Dockerfile 0.60%

nsg's Issues

Comparison with SPTAG

Dear Team,

Amazing work on NSG! A question, please: is there a comparison available against https://github.com/microsoft/SPTAG? I didn't dive very deep into SPTAG, but I wonder how it does against NSG.

Many thanks

Why not optimize the tree_grow function in Build with a union-find set?

I learned a lot about C++ programming style from your code. However, I think there is still room to improve performance by using a union-find set to check the connectivity of the NSG.
That might also make the non-recursive DFS function unnecessary. Am I right? :)
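
The suggestion above can be sketched as follows. This is a minimal illustration, not code from the repository: `DisjointSet`, `DisconnectedComponents`, and the adjacency-list layout are hypothetical names and assumptions. Note that a union-find set only tracks undirected (weak) connectivity, whereas reachability from the navigating node along directed edges is a stronger property, so this would be an approximation of a DFS-based check rather than a drop-in replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Disjoint-set (union-find) with path halving and union by size.
class DisjointSet {
 public:
  explicit DisjointSet(std::size_t n) : parent_(n), size_(n, 1) {
    std::iota(parent_.begin(), parent_.end(), 0u);
  }

  unsigned Find(unsigned x) {
    assert(x < parent_.size());
    while (parent_[x] != x) {
      parent_[x] = parent_[parent_[x]];  // path halving
      x = parent_[x];
    }
    return x;
  }

  void Union(unsigned a, unsigned b) {
    a = Find(a);
    b = Find(b);
    if (a == b) return;
    if (size_[a] < size_[b]) std::swap(a, b);
    parent_[b] = a;  // attach the smaller tree under the larger one
    size_[a] += size_[b];
  }

 private:
  std::vector<unsigned> parent_;
  std::vector<std::size_t> size_;
};

// Returns one representative node for every component that does not
// contain `root`, treating all edges as undirected (an assumption).
std::vector<unsigned> DisconnectedComponents(
    const std::vector<std::vector<unsigned>>& graph, unsigned root) {
  DisjointSet ds(graph.size());
  for (unsigned u = 0; u < graph.size(); ++u) {
    for (unsigned v : graph[u]) ds.Union(u, v);
  }
  const unsigned root_set = ds.Find(root);
  std::vector<unsigned> reps;
  for (unsigned u = 0; u < graph.size(); ++u) {
    if (ds.Find(u) == u && u != root_set) reps.push_back(u);
  }
  return reps;
}
```

Each returned representative could then be linked back to the main component, which is roughly what a tree-growing step needs to know.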

The InterInsert function in the Build logic is implemented incorrectly

nsg/src/index_nsg.cpp

Line 332 in 4da5ee5

for (unsigned t = 0; t < result.size(); t++) {

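After edges are pruned with the triangle inequality, the number of remaining edges (that is, the number of elements in result) may well be smaller than range. In that case, the distance of the element just past the last one must be set to -1.

That is, the following code should be added after the for loop: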
if (result.size() < range) { des_pool[result.size()].distance = -1; }
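
A self-contained sketch of the proposed fix, under the assumption that each node owns a fixed-size slot of `range` entries and that a distance of -1 marks the end of the valid entries. `PoolEntry` and `CopyWithSentinel` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the neighbor record stored in the pool.
struct PoolEntry {
  unsigned id;
  float distance;
};

// Copies the pruned neighbors into a fixed-size slot of `range` entries and,
// if the slot is not full, writes the -1 end marker right after the last one.
void CopyWithSentinel(const std::vector<PoolEntry>& result,
                      PoolEntry* des_pool, std::size_t range) {
  assert(result.size() <= range);
  for (std::size_t t = 0; t < result.size(); ++t) {
    des_pool[t] = result[t];
  }
  if (result.size() < range) {
    des_pool[result.size()].distance = -1.0f;  // end-of-list marker
  }
}
```

Without the marker, a reader that scans the slot until it sees -1 could pick up stale entries left over from an earlier, longer neighbor list.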

Algorithm and Code bugs?

Hi. I have some questions regarding the algorithm and the code:

The paper does not mention what the InterInsert function does. What is its relevance?
There is a DFS function, but it is not used, contrary to what the paper says.
Since the DFS function is not used, how is connectivity ensured?
When performing InterInsert, in the case where temp_pool.size() > range, the final pool may end up with fewer nodes than range. But no distance is set to -1 to mark the end of the pool.
When performing InterInsert, in the case where temp_pool.size() > range, how do we ensure that the resulting graph still has a monotonic path from the navigating node to every other node?

Compile errors when running make -j

Hi,

I am trying to compile by following the four steps in the README. Unfortunately, this does not work for me.
This is the output I get: pastebin. What should I do differently?

Building NSG index for single core CPU

Hello, I'm interested in your work and I'm trying to reproduce your result.

I want to reproduce the result for the DEEP100M dataset on a single core.
However, test_nn_descent fails with a segmentation fault.
Can you let me know how you built the kNN graphs for this dataset?
Also, can you share both the kNN graph and the NSG index?

Thanks

The distance calculation of inner product is wrong.

The inner product of two vectors represents their similarity, so the vectors with the largest inner products should be returned as the kNN. This is the opposite of L2 distance, where the smallest values are the nearest.
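
One common way to reuse code that keeps the smallest values is to negate the inner product, so that a larger similarity becomes a smaller "distance". The sketch below only illustrates that idea; `InnerProduct` and `NegatedInnerProduct` are hypothetical helpers, not the repository's distance classes.

```cpp
#include <cassert>
#include <cstddef>

// Plain inner product of two vectors of length `dim`.
float InnerProduct(const float* a, const float* b, std::size_t dim) {
  assert(a != nullptr && b != nullptr);
  float sum = 0.0f;
  for (std::size_t i = 0; i < dim; ++i) sum += a[i] * b[i];
  return sum;
}

// Negated inner product: the most similar vector gets the smallest value,
// so it sorts first in a pool ordered by ascending distance.
float NegatedInnerProduct(const float* a, const float* b, std::size_t dim) {
  return -InnerProduct(a, b, dim);
}
```

Keep in mind that the negated inner product is not a true metric (it can be negative and does not satisfy the triangle inequality), so pruning rules that rely on metric properties may behave differently with it.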

logic error?

in file index_nsg.cpp:248

if (pool[start].id == q) start++;

in file index_nsg.cpp:251

while (result.size() < range && (++start) < pool.size() && start < maxc) {

If line 248 executes, it affects line 251: the loop no longer runs exactly maxc times.
So I think line 248 should be:

if (pool[start].id == q) start++, maxc ++;

how to build on windows?

As the title says.

unnecessary "flags[id] = true"

nsg/src/index_nsg.cpp

Line 549 in eceb09e

flags[id] = true;

Here, flags[id] = true must already hold.

MRNG edge selection implement

In index_nsg.cpp, method sync_prune():

if (djk < p.distance)
{
occlude = true;
break;
}

I do not understand what "occlude" means. Why does djk < p.distance lead to the conclusion occlude = true?

thanks
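
The rule being asked about can be sketched as follows. This is an illustration of an occlusion-style edge selection written from the snippet above, not the repository's sync_prune: `Point`, `SquaredL2`, and `SelectEdges` are hypothetical names. A candidate p is skipped ("occluded") when an already selected neighbor is closer to p than the query node is, because p is then presumed reachable through that neighbor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::vector<float>;

float SquaredL2(const Point& a, const Point& b) {
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// `candidates` must be sorted by ascending distance to `q`.
// Returns the indices of the candidates kept as out-edges of `q`.
std::vector<std::size_t> SelectEdges(const Point& q,
                                     const std::vector<Point>& candidates,
                                     std::size_t range) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0;
       i < candidates.size() && selected.size() < range; ++i) {
    const float d_qp = SquaredL2(q, candidates[i]);  // plays the role of p.distance
    bool occlude = false;
    for (std::size_t j : selected) {
      const float d_jp = SquaredL2(candidates[j], candidates[i]);  // djk
      if (d_jp < d_qp) {  // a kept neighbor is closer to p than q is
        occlude = true;
        break;
      }
    }
    if (!occlude) selected.push_back(i);
  }
  return selected;
}
```

For q = (0,0) and candidates (1,0), (0,1.5), (2,0), the point (2,0) is dropped because the kept neighbor (1,0) is closer to it than q is, while (0,1.5) is kept because it lies in a different direction.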

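Hello, I would like to ask: once the index has been built, how does the algorithm support inserting new data, either a single item or several?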

How can I calculate mAP?

./test_nsg_optimized_search DATA_PATH QUERY_PATH NSG_PATH SEARCH_L SEARCH_K RESULT_PATH
DATA_PATH is the path of the base data in fvecs format.
QUERY_PATH is the path of the query data in fvecs format.
NSG_PATH is the path of the pre-built NSG index in previous section.
SEARCH_L controls the quality of the search results: the larger the better, but slower. SEARCH_L cannot be smaller than SEARCH_K.
SEARCH_K controls the number of result neighbors we want to query.
RESULT_PATH is the query results in ivecs format.

Memory leak

in file index_nsg.cpp:394

SimpleNeighbor *cut_graph_ = new SimpleNeighbor[nd_ * (size_t)range];

The memory allocated for cut_graph_ is never deleted, which leads to a memory leak.
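
The direct fix would be a matching delete[] once the buffer is no longer needed. As an alternative, the sketch below shows how a smart pointer releases the buffer automatically. It is a minimal illustration that assumes C++14 or later; `SimpleNeighborSketch` and `AllocateCutGraph` are hypothetical stand-ins, not the repository's types.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the neighbor record type.
struct SimpleNeighborSketch {
  unsigned id;
  float distance;
};

// Allocates nd * range records. The unique_ptr frees the array when it goes
// out of scope, so no explicit delete[] is needed on any exit path.
std::unique_ptr<SimpleNeighborSketch[]> AllocateCutGraph(std::size_t nd,
                                                         unsigned range) {
  assert(nd > 0 && range > 0);
  // make_unique value-initializes the array, so every field starts at zero.
  return std::make_unique<SimpleNeighborSketch[]>(
      nd * static_cast<std::size_t>(range));
}
```

A raw pointer can still be obtained with get() and passed to functions that expect one, as long as they do not take ownership.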

How to integrate into a search engine, such as ElasticSearch?

How can NSG be integrated into a search engine, such as ElasticSearch? I look forward to your feedback.

Uninitialised variable ep_

The first init of ep_ occurs at the end of IndexNSG::init_graph. Before that, IndexNSG::get_neighbors is called and uses ep_. This sometimes causes segfaults.

What's the intended initial value of ep_? If it's 0, it should be explicitly initialised to 0 (the default behaviour is compiler-dependent, and there is no guarantee that it will be initialised to 0 automatically). If it's an arbitrary valid node index, leave it to the user by making it protected instead of private; this is my preferred solution, since ep_ is read in IndexNSG::Load.
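
The first option can be sketched with a default member initializer. This is only an illustration of the idea: `IndexSketch` and its accessors are hypothetical and do not mirror the real IndexNSG interface.

```cpp
#include <cassert>

class IndexSketch {
 public:
  explicit IndexSketch(unsigned num_points) : nd_(num_points) {}

  unsigned entry_point() const { return ep_; }

  // Lets the caller choose any valid node index as the entry point.
  void set_entry_point(unsigned ep) {
    assert(ep < nd_);
    ep_ = ep;
  }

 private:
  unsigned nd_;
  unsigned ep_ = 0;  // default member initializer: never read uninitialized
};
```

With the initializer in place, code that reads the entry point before it has been computed sees node 0 instead of an indeterminate value.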

syn dataset performance

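The paper mentions randomly generated datasets drawn from uniform and Gaussian distributions. Does this mean that, for a 128-dimensional vector, its 128 values follow a uniform or a Gaussian distribution? With datasets generated this way, the performance of the graph index drops sharply, and the annoy library outperforms the graph index, which does not match the results in the paper. I hope you can clarify. Thanks!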

How to choose parameters

I have converted my images into 4096-dimensional vectors.
The training data consists of around 200,000 pictures.
To build the graph in NSG, I need to set the parameters, including:

L controls the quality of the NSG, the larger the better, L > R.
R controls the index size of the graph, the best R is related to the intrinsic dimension of the dataset.

Could you suggest values for these parameters?

an Error when reading binary file

In nsg/tests/test_nsg_index.cpp:9, the function load_data reads the variable dim from the head of the binary file, moves the file pointer to the end of the file to get its size in bytes, calculates the total number of features, and then moves the file pointer back to the beginning of the file. At that point, it reads the features directly without first reading the dim variable again, which may lead to problems.
BTW, thank you for sharing the source code.
Best regards,
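
For reference, a minimal fvecs reader is sketched below. It assumes the usual fvecs layout, in which every vector is stored as a 4-byte integer dimension followed by that many 4-byte floats, and it skips the per-vector header explicitly. `FvecsData` and `LoadFvecs` are hypothetical names, not the repository's load_data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

struct FvecsData {
  unsigned dim = 0;
  std::size_t num = 0;
  std::vector<float> values;  // row-major, num * dim floats
};

FvecsData LoadFvecs(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) throw std::runtime_error("cannot open " + path);

  // The first 4 bytes give the dimension of the first (and every) vector.
  std::int32_t dim = 0;
  in.read(reinterpret_cast<char*>(&dim), sizeof(dim));
  if (!in || dim <= 0) throw std::runtime_error("bad fvecs header");

  // File size divided by record size gives the number of vectors.
  in.seekg(0, std::ios::end);
  const std::size_t bytes = static_cast<std::size_t>(in.tellg());
  const std::size_t record =
      sizeof(std::int32_t) + static_cast<std::size_t>(dim) * sizeof(float);
  if (bytes % record != 0) throw std::runtime_error("truncated fvecs file");

  FvecsData out;
  out.dim = static_cast<unsigned>(dim);
  out.num = bytes / record;
  out.values.resize(out.num * out.dim);
  assert(out.values.size() == out.num * out.dim);

  in.seekg(0, std::ios::beg);
  for (std::size_t i = 0; i < out.num; ++i) {
    in.seekg(sizeof(std::int32_t), std::ios::cur);  // skip this vector's header
    in.read(reinterpret_cast<char*>(out.values.data() + i * out.dim),
            static_cast<std::streamsize>(out.dim * sizeof(float)));
  }
  if (!in) throw std::runtime_error("read error in " + path);
  return out;
}
```

If the per-vector header were not skipped, the dimension field of each record would be reinterpreted as a float and every following value would be shifted by one position.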

NSG on Deep100m dataset

Hi, I want to run NSG on the Deep100m dataset.
Would you kindly send me the link to the dataset?
Also, can you share the Faiss, efanna refine, and SSG/NSG configs for Deep100m?
Thanks,

some questions about NSG

The approximate medoid of the dataset is found by searching the kNN graph with the greedy best-first search algorithm (Algorithm 1). Why not compare against all other nodes to choose the closest one?
The dataset occupies a lot of memory, and a PQ metric is mentioned in the code. Will this be added in the future?
Is NSG-Naive much faster than NSG at the indexing stage? Have you compared memory usage, data pre-processing time, precision, and response scenarios?
Have you ever used the kNN graph of LargeVis, and compared it with Faiss?
Incremental indexing is mentioned in the paper. How is it done? For now, is the only option to split the dataset and update one partition to avoid full updates?

Best wishes.

Does this support adding and deleting vectors?

Why is the vector size of sift_groundtruth.ivecs, sift_query.fvecs and query results in ANN_SIFT1M different?

Why is the vector size of sift_groundtruth.ivecs, sift_query.fvecs and query results in ANN_SIFT1M different? How can precision be calculated?
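
The three files hold different things, which is why their row lengths differ: the query file stores feature vectors, the ground-truth file stores neighbor ids per query, and the result file stores SEARCH_K ids per query. A recall computation over already loaded id lists can be sketched as below. `RecallAtK` is a hypothetical helper, and it assumes each result list contains no duplicate ids and each ground-truth row has at least k entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of returned ids (top k per query) that appear among the
// first k ground-truth ids of the same query, averaged over all queries.
double RecallAtK(const std::vector<std::vector<unsigned>>& results,
                 const std::vector<std::vector<unsigned>>& groundtruth,
                 std::size_t k) {
  assert(results.size() == groundtruth.size());
  assert(k > 0 && !results.empty());
  std::size_t hits = 0;
  for (std::size_t q = 0; q < results.size(); ++q) {
    assert(groundtruth[q].size() >= k);
    const auto gt_begin = groundtruth[q].begin();
    const auto gt_end = gt_begin + static_cast<std::ptrdiff_t>(k);
    const std::size_t n = std::min(k, results[q].size());
    for (std::size_t i = 0; i < n; ++i) {
      if (std::find(gt_begin, gt_end, results[q][i]) != gt_end) ++hits;
    }
  }
  return static_cast<double>(hits) /
         static_cast<double>(results.size() * k);
}
```

Only the first k columns of the ground truth are compared, so a ground-truth file with more neighbors per query than SEARCH_K can be used directly.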

Really Need your help

Can you tell me how to evaluate the recall for your project: https://github.com/ZJULearning/pixel_link

retset vector index issue

https://github.com/ZJULearning/nsg/blob/eceb09e2754d978b4c8b07e3248018987feb597a/src/index_nsg.cpp#L111C1-L123C49

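Should line 118 be changed to retset[L]? After that, line 123 sorts only the first L elements.
There seem to be four occurrences of this problem in this file?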

Estimating the Local Intrinsic Dimensionality

Hi, I wonder if you could open source the code of estimating the local intrinsic dimensionality? Thanks!

how to search by multi-cores

Running test_nsg_search uses only one core.

Inconsistent I/O approach

"Because there is no unified format for input data, users may need to write input function to read your own data."

This doesn't hold for the call Build() -> Load_nn_graph(). If I want to write a wrapper for class IndexNSG that deals with my own I/O, I can't use Build() and override Load_nn_graph() to do nothing, which would be the simplest solution. (The alternative is copy-pasting Build() and deleting the call to Load_nn_graph(), which is bad practice.)

This problem should be fixed by making IndexNSG::Load_nn_graph() virtual.

Another possible solution is not calling Load_nn_graph if parameter nn_graph_path doesn't exist.
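
The first proposal can be sketched as follows. The class and method names here (`IndexBaseSketch`, `InMemoryIndexSketch`, `LoadNnGraph`) are hypothetical and only mimic the shape of the problem: a build routine that calls a loading hook, which a wrapper overrides to supply a graph it already holds in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class IndexBaseSketch {
 public:
  virtual ~IndexBaseSketch() = default;

  // Build calls the loading hook, then works on whatever graph it produced.
  void Build(const std::string& nn_graph_path) {
    LoadNnGraph(nn_graph_path);
    built_ = !graph_.empty();
  }

  bool built() const { return built_; }

  const std::vector<unsigned>& neighbors(std::size_t i) const {
    assert(i < graph_.size());
    return graph_[i];
  }

 protected:
  // Default hook: a real implementation would read the file at `path`.
  virtual void LoadNnGraph(const std::string& path) { (void)path; }

  std::vector<std::vector<unsigned>> graph_;

 private:
  bool built_ = false;
};

// Wrapper that ignores the path and supplies its own in-memory graph.
class InMemoryIndexSketch : public IndexBaseSketch {
 public:
  explicit InMemoryIndexSketch(std::vector<std::vector<unsigned>> g)
      : preloaded_(std::move(g)) {}

 protected:
  void LoadNnGraph(const std::string&) override { graph_ = preloaded_; }

 private:
  std::vector<std::vector<unsigned>> preloaded_;
};
```

Because the hook is virtual, Build() stays untouched and the wrapper only replaces the I/O step.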

Unable to replicate example workflow in README -- stack smashing

$ kgraph/index -I 14 -L 150 -S 10 -R 100 /mnt/SIFT1M/sift_base.kg.data /mnt/SIFT1M/kgraph.result
Generating control...
Initializing...
iteration: 1 recall: 0 accuracy: 2.23433 cost: 0.00038 M: 10 delta: 1 time: 5.81898 one-recall: 0 one-ratio: 3.53372
iteration: 2 recall: 0.002 accuracy: 1.13812 cost: 0.000637557 M: 10 delta: 0.855773 time: 9.32155 one-recall: 0.01 one-ratio: 2.71138
iteration: 3 recall: 0.0344 accuracy: 0.621388 cost: 0.00109574 M: 11.5365 delta: 0.834505 time: 14.0836 one-recall: 0.08 one-ratio: 2.16501
iteration: 4 recall: 0.1912 accuracy: 0.295963 cost: 0.00163154 M: 11.8474 delta: 0.782809 time: 18.7587 one-recall: 0.23 one-ratio: 1.74459
iteration: 5 recall: 0.5088 accuracy: 0.0961833 cost: 0.00223746 M: 12.6159 delta: 0.664009 time: 23.4768 one-recall: 0.64 one-ratio: 1.28671
iteration: 6 recall: 0.7808 accuracy: 0.0235666 cost: 0.00298158 M: 15.1331 delta: 0.432179 time: 29.2621 one-recall: 0.95 one-ratio: 1.00885
iteration: 7 recall: 0.9024 accuracy: 0.00668195 cost: 0.00395733 M: 21.148 delta: 0.196402 time: 35.0734 one-recall: 0.97 one-ratio: 1.00386
iteration: 8 recall: 0.9488 accuracy: 0.00301341 cost: 0.00498334 M: 27.3224 delta: 0.088411 time: 40.8331 one-recall: 0.97 one-ratio: 1.00269
iteration: 9 recall: 0.9672 accuracy: 0.00172632 cost: 0.00577787 M: 31.3168 delta: 0.0513151 time: 45.942 one-recall: 0.98 one-ratio: 1.00231
iteration: 10 recall: 0.978 accuracy: 0.000938965 cost: 0.00626418 M: 33.429 delta: 0.0371458 time: 49.2892 one-recall: 0.99 one-ratio: 1.00059
iteration: 11 recall: 0.9816 accuracy: 0.000794071 cost: 0.00652194 M: 34.4583 delta: 0.0312213 time: 51.9613 one-recall: 0.99 one-ratio: 1.00059
iteration: 12 recall: 0.9824 accuracy: 0.000786144 cost: 0.00664935 M: 34.9466 delta: 0.0286689 time: 53.9705 one-recall: 0.99 one-ratio: 1.00059
iteration: 13 recall: 0.9828 accuracy: 0.000784735 cost: 0.00671028 M: 35.1763 delta: 0.0275193 time: 55.3717 one-recall: 0.99 one-ratio: 1.00059
iteration: 14 recall: 0.9832 accuracy: 0.000671959 cost: 0.00674068 M: 35.2903 delta: 0.0269744 time: 57.0438 one-recall: 1 one-ratio: 1
0
62.858780s wall, 1093.320000s user + 74.460000s system = 1167.780000s CPU (1857.8%)

$ nsg/tests/kgraph2ivec /mnt/SIFT1M/kgraph.result /mnt/SIFT1M/sift.150nngraph
KNNGRAPH
2
0
1000000
*** stack smashing detected ***: nsg/tests/kgraph2ivec terminated
Aborted (core dumped)

Could you please suggest how to fix this? The experiment was run on Ubuntu 16 LTS with gcc 5.4.0.

Inconsistent Greedy Search

Hello!

I've looked closely at the paper, and it explicitly mentions that, for proper NSG construction, the greedy search has to be performed from just the navigating node:

Link is calling get_neighbors to perform the greedy search from the navigating node. However, the following lines don't match the paper https://github.com/ZJULearning/nsg/blob/master/src/index_nsg.cpp#L171-L177

Here the pool is set to the neighbors of the navigating node, and the rest of the pool is filled with random points in the graph. So for large L, the majority of the points are nowhere near the navigating node. This seems to contradict the paper. What is the reason for this? Doesn't it worsen the monotonic structure of the graph?

Supported distance metrics

Hi,

I want to know which distance metrics nsg supports.

In the README, I could not find any metric parameter for the input.

low recall accuracy

test_nsg_optimized_search_recall.cpp.tar.gz

I'm trying to reproduce your benchmark on the sift1M dataset. I adjusted the parameters of the kNN graph and the search. However, I failed to improve the recall to 0.993.

[root@cdsl-gpu-a07 tests]# ./test_nndescent ../../sift/sift_base.fvecs sift.50NN.graph 50 70 10 10 100
data dimension: 128
recall : 0.0003
iter: 0
recall : 0.002
iter: 1
recall : 0.0221
iter: 2
recall : 0.1011
iter: 3
recall : 0.2899
iter: 4
recall : 0.511
iter: 5
recall : 0.6378
iter: 6
recall : 0.6833
iter: 7
recall : 0.6955
iter: 8
recall : 0.6987
iter: 9
1000000
Time cost: 40.4963

[root@cdsl-gpu-a07 tests]# ./test_nsg_index ../../sift/sift_base.fvecs ../../efanna_graph/tests/sift.50NN.graph 256 100 sift.50NN.nsg
100:34:2:0
indexing time: 102.874

[root@cdsl-gpu-a07 tests]# ./test_nsg_optimized_search_recall ../../sift/sift_base.fvecs ../../sift/sift_query.fvecs sift.50NN.nsg 100 1 ../../sift/sift_groundtruth.ivecs
34
search time: 9.37674
Compute recalls
R@1 = 0.9906
R@10 = 0.9906
R@100 = 0.9906

Attached is the code of test_nsg_optimized_search_recall, which is based on test_nsg_optimized_search.cpp and (from the faiss repo) demo_sift1M.cpp.

As a reference, the nmslib HNSW sample code gave a recall of 0.999 on sift1M with 500 queries.
What I need is exact 1NN search on a big dataset. It's fine to compromise a bit on R@1 and performance.

Could you help me?
Thanks!

How much memory is used?

10,000,000 128-dimension vectors
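
A rough back-of-the-envelope estimate can be sketched as below. It is not a measurement of NSG: it assumes 4-byte floats and 4-byte node ids, and it models the graph as one fixed-width row per node holding a neighbor count plus `max_degree` ids, which is a hypothetical layout. `RawVectorBytes` and `GraphBytes` are illustrative helpers.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for the raw float vectors alone.
std::uint64_t RawVectorBytes(std::uint64_t num_points, std::uint64_t dim) {
  return num_points * dim * sizeof(float);
}

// Bytes for a graph stored as fixed-width rows: one count field plus
// max_degree neighbor ids per node (an assumed layout).
std::uint64_t GraphBytes(std::uint64_t num_points, std::uint64_t max_degree) {
  assert(max_degree > 0);
  return num_points * (max_degree + 1) * sizeof(unsigned);
}
```

Under these assumptions, 10,000,000 vectors of 128 dimensions take 10,000,000 × 128 × 4 = 5,120,000,000 bytes (about 5.12 GB) for the raw data. For a hypothetical maximum degree of 50, the graph would add 10,000,000 × 51 × 4 = 2,040,000,000 bytes (about 2.04 GB). Working memory during index construction and search comes on top of this.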

IndexNSG::InterInsert lock issue

https://github.com/ZJULearning/nsg/blob/eceb09e2754d978b4c8b07e3248018987feb597a/src/index_nsg.cpp#L308C7-L308C7

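Here a lock is taken to read des_pool, and later another lock is taken to write des_pool. If a function processing another node reads between this thread's read and write, and then writes its data after this thread has written, isn't this thread's work effectively wasted?

Of course, handling it this way may be fine. While des_pool is not yet full, each write just appends data at the end, so there is no problem. When des_pool is full, how exactly is the data updated (is the incoming edge reversed and added as an outgoing edge too?), and perhaps it makes no difference?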

Recall for rand dataset with d=30

I generated a set of n points which are drawn uniformly at random from the unit D-dim sphere. I generated a query dataset from the same distribution with n/10 points.
I used nndescent from efanna to generate an approximate NN graph (with 98% recall@100) and then used NSG index construction with L=200, R=200, C=600. I used NSG search with Ls=500, K=100.

The recall@100 for D=128 dimensions is in the high 90s, which is good.
However, recall@100 for D=30 dimensions does not go past 67%. I have tuned most of the parameters exposed in the command line.
This problem also arises with grid graphs in low dimensions and other d<100 dimension datasets we have internally.

Do you face similar problems or are you able to get high recall for random D<50 dimensional base and query datasets?

Thanks!
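
For anyone trying to reproduce this setup, one standard way to draw points uniformly from the surface of the unit sphere is to sample independent Gaussian coordinates and normalize the vector. The sketch below assumes that "unit D-dim sphere" refers to the surface rather than the solid ball; `SampleUnitSphere` is a hypothetical helper, not part of NSG or efanna.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Draws one point uniformly at random from the surface of the unit sphere
// in `dim` dimensions by normalizing a standard Gaussian vector.
std::vector<float> SampleUnitSphere(std::size_t dim, std::mt19937& rng) {
  assert(dim > 0);
  std::normal_distribution<float> gauss(0.0f, 1.0f);
  std::vector<float> v(dim);
  double norm2 = 0.0;
  do {  // retry in the (practically impossible) all-zero case
    norm2 = 0.0;
    for (float& x : v) {
      x = gauss(rng);
      norm2 += static_cast<double>(x) * x;
    }
  } while (norm2 == 0.0);
  const float inv = static_cast<float>(1.0 / std::sqrt(norm2));
  for (float& x : v) x *= inv;
  return v;
}
```

Normalizing uniform coordinates instead of Gaussian ones would not give a uniform distribution on the sphere, because the cube's corners would be over-represented.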

IndexNSG

A variable in the IndexNSG class is uninitialized, and an exception is thrown at line 81 of index_nsg.cpp when running test_nsg_search.

Provide Python interface

Is it possible to provide a Python 3 interface to NSG? Thanks.

when will nsg support python?

a simple question

Hello, I'm reading the implementation of NSG again. In index_nsg.cpp:330, after the edge selection strategy executes, I think it still cannot ensure result.size() == range. So after putting result into des_pool, I think it is necessary to check whether result.size() < range and, if so, set the distance of the first invalid element to -1.

Unnecessary deallocation

The command free(data); (index_nsg.cpp:465) deallocates an external argument. This serves no purpose in the class IndexNSG and complicates code that relies on data later on.

I suggest removing it.
