
DiskANN


DiskANN is a suite of scalable, accurate and cost-effective approximate nearest neighbor search algorithms for large-scale vector search that supports real-time changes and simple filters. The code is based on ideas from the DiskANN, Fresh-DiskANN and Filtered-DiskANN papers, with further improvements. It was originally forked from the code for the NSG algorithm.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

See guidelines for contributing to this project.

Linux build:

Install the following packages through apt-get

sudo apt install make cmake g++ libaio-dev libgoogle-perftools-dev clang-format libboost-all-dev

Install Intel MKL

Ubuntu 20.04 or newer

sudo apt install libmkl-full-dev

Earlier versions of Ubuntu

Install Intel MKL either by downloading the oneAPI MKL installer or using apt (we tested with build 2019.4-070 and 2022.1.2.146).

# OneAPI MKL Installer
wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18487/l_BaseKit_p_2022.1.2.146.sh
sudo sh l_BaseKit_p_2022.1.2.146.sh -a --components intel.oneapi.lin.mkl.devel --action install --eula accept -s

Build

mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 

Windows build:

The Windows version has been tested with the Enterprise editions of Visual Studio 2022, 2019 and 2017. It should also work with the Community and Professional editions without any changes.

Prerequisites:

  • CMake 3.15+ (available with Visual Studio 2019+ or from https://cmake.org)
  • NuGet.exe (install from https://www.nuget.org/downloads)
    • The build script will use NuGet to get MKL, OpenMP and Boost packages.
  • DiskANN git repository checked out together with submodules. To check out submodules after git clone:
git submodule init
git submodule update
  • Environment variables:
    • [optional] If you would like to override the Boost library listed in windows/packages.config.in, set BOOST_ROOT to your Boost folder.

Build steps:

  • Open the "x64 Native Tools Command Prompt for VS 2019" (or the corresponding version) and change to the DiskANN folder
  • Create a "build" directory inside it
  • Change to the "build" directory and run
cmake ..

OR for Visual Studio 2017 and earlier:

<full-path-to-installed-cmake>\cmake ..

This will create a diskann.sln solution. Now you can:

  • Open it from VisualStudio and build either Release or Debug configuration.
  • <full-path-to-installed-cmake>\cmake --build build
  • Use MSBuild:
msbuild.exe diskann.sln /m /nologo /t:Build /p:Configuration="Release" /property:Platform="x64"
  • This will also build the gperftools submodule for the libtcmalloc_minimal dependency.
  • Generated binaries are stored in the x64/Release or x64/Debug directories.

Usage:

Please see the following pages on using the compiled code:

Please cite this software in your work as:

@misc{diskann-github,
   author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan and Patel, Yash},
   title = {{DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search}},
   url = {https://github.com/Microsoft/DiskANN},
   version = {0.6.1},
   year = {2023}
}


diskann's Issues

What is the difference between Index and PQFlashIndex?

tests/build_disk_index and tests/build_memory_index use Index, while tests/search_disk_index uses PQFlashIndex.

How do Index and PQFlashIndex work together?

Segmentation fault when build_disk_index with Multithreading

What happened:

I generated a small .bin data file with a simple program.

#include <cstdint>
#include <fstream>
using namespace std;
int main()
{
  // Layout: [npts : int32][dim : int32][row-major int8 data]
  ofstream myFile ("data.bin", ios::out | ios::binary);
  int mockNum = 10;
  int dimension = 10;
  myFile.write((char*)&mockNum, sizeof(int));
  myFile.write((char*)&dimension, sizeof(int));
  for (int i = 0; i < mockNum; i++)
  {
    for (int j = 0; j < dimension; j++)
    {
      int8_t x = (int8_t)(i + j);
      myFile.write((char*)&x, sizeof(int8_t));
    }
  }
  myFile.close();
  return 0;
}
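For reference (an editor's sketch, not part of the original report), the same file can be produced in Python, assuming the layout above: a 4-byte point count, a 4-byte dimension, then row-major int8 values:

```python
import struct

mock_num, dimension = 10, 10

with open("data.bin", "wb") as f:
    f.write(struct.pack("<ii", mock_num, dimension))  # 8-byte header
    for i in range(mock_num):
        # Row i holds the int8 values i, i+1, ..., i+dimension-1.
        f.write(bytes((i + j) & 0xFF for j in range(dimension)))
```

The resulting file is 8 + 10 × 10 = 108 bytes, which matches the "size: 108" reported in the log.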

But I get a segmentation fault when running build_disk_index with multithreading.

successful command

./tests/build_disk_index int8 ../cmake-build-debug/tests/data.bin output/index 60 75 10 10 1

core dump command

./tests/build_disk_index int8 ../cmake-build-debug/tests/data.bin output/index 60 75 10 10 2

message log

Starting index build: R=60 L=75 Query RAM budget: 1.0469e+10 Indexing ram budget: 10 T: 2
Compressing 10-dimensional data into 10 bytes per vector.
Opened: ../cmake-build-debug/tests/data.bin, size: 108, cache_size: 108
Training data loaded of size 10
Stat(output/index_pq_pivots.bin) returned: 0
Reading bin file output/index_pq_pivots.bin ...
Metadata: #pts = 256, #dims = 10...
PQ pivot file exists. Not generating again
Opened: ../cmake-build-debug/tests/data.bin, size: 108, cache_size: 108
Stat(output/index_pq_pivots.bin) returned: 0
Reading bin file output/index_pq_pivots.bin_centroid.bin ...
Metadata: #pts = 10, #dims = 1...
Reading bin file output/index_pq_pivots.bin_rearrangement_perm.bin ...
Metadata: #pts = 10, #dims = 1...
Reading bin file output/index_pq_pivots.bin_chunk_offsets.bin ...
Metadata: #pts = 11, #dims = 1...
Reading bin file output/index_pq_pivots.bin ...
Metadata: #pts = 256, #dims = 10...
Loaded PQ pivot information
Processing points [0, 10)...done.
Full index fits in RAM, building in one shot
Number of frozen points = 0
Reading bin file ../cmake-build-debug/tests/data.bin ...Metadata: #pts = 10, #dims = 10, aligned_dim = 16...allocating aligned memory, 160 bytes...done. Copying data... done.
Using AVX2 distance computation
Starting index build...
Number of syncs: 40
Completed (round: 0, sync: 1/40 with L 75) sync_time: 0.00149s; inter_time: 2.958e-05s
Completed (round: 0, sync: 3/40 with L 75) sync_time: 0.001415s; inter_time: 2.37e-05s
Segmentation fault (core dumped)

core trace

Starting index build...
Number of syncs: 40

Thread 1 "build_disk_inde" received signal SIGSEGV, Segmentation fault.
tcmalloc::SLL_PopRange (end=, start=, N=8, head=0x1387c60) at src/linked_list.h:88
88 tmp = SLL_Next(tmp);
(gdb) bt
#0 tcmalloc::SLL_PopRange (end=, start=, N=8, head=0x1387c60) at src/linked_list.h:88
#1 tcmalloc::ThreadCache::FreeList::PopRange (end=, start=, N=8, this=0x1387c60) at src/thread_cache.h:238
#2 tcmalloc::ThreadCache::ReleaseToCentralCache (this=this@entry=0x1387c40, src=src@entry=0x1387c60, cl=, N=8, N@entry=32) at src/thread_cache.cc:206
#3 0x00007f03767b878c in tcmalloc::ThreadCache::ListTooLong (this=0x1387c40, list=0x1387c60, cl=) at src/thread_cache.cc:164
#4 0x00000000004d8020 in __gnu_cxx::new_allocator::deallocate (this=0x7ffe91b27f10, __p=0x1b3c0c0) at /usr/include/c++/9/ext/new_allocator.h:128
#5 0x00000000004d3ab5 in std::allocator_traits<std::allocator >::deallocate (__a=..., __p=0x1b3c0c0, __n=2) at /usr/include/c++/9/bits/alloc_traits.h:469
#6 0x00000000004cfb7c in std::_Vector_base<unsigned int, std::allocator >::_M_deallocate (this=0x7ffe91b27f10, __p=0x1b3c0c0, __n=2) at /usr/include/c++/9/bits/stl_vector.h:351
#7 0x00000000004cde10 in std::_Vector_base<unsigned int, std::allocator >::~_Vector_base (this=0x7ffe91b27f10, __in_chrg=) at /usr/include/c++/9/bits/stl_vector.h:332
#8 0x00000000004cde61 in std::vector<unsigned int, std::allocator >::~vector (this=0x7ffe91b27f10, __in_chrg=) at /usr/include/c++/9/bits/stl_vector.h:680
#9 0x00000000005129ae in std::__shrink_to_fit_aux<std::vector<unsigned int, std::allocator >, true>::_S_do_it (__c=...) at /usr/include/c++/9/bits/allocator.h:265
#10 0x000000000050ea67 in std::vector<unsigned int, std::allocator >::_M_shrink_to_fit (this=0x1b3e240) at /usr/include/c++/9/bits/vector.tcc:693
#11 0x0000000000509896 in std::vector<unsigned int, std::allocator >::shrink_to_fit (this=0x1b3e240) at /usr/include/c++/9/bits/stl_vector.h:987
#12 0x000000000051c7bf in diskann::Index<signed char, int>::_ZN7diskann5IndexIaiE4linkERNS_10ParametersE._omp_fn.2(void) () at /workspace/DiskANN/src/index.cpp:875
#13 0x00007f036f6d08f8 in __kmp_api_GOMP_parallel (task=0x7, data=0x1b3c0c0, num_threads=0, flags=8) at ../../src/kmp_gsupport.cpp:1430
#14 0x00000000004f77cd in diskann::Index<signed char, int>::link (this=0x1b40200, parameters=...) at /workspace/DiskANN/src/index.cpp:867
#15 0x00000000004f02b2 in diskann::Index<signed char, int>::build (this=0x1b40200, parameters=..., tags=...) at /workspace/DiskANN/src/index.cpp:988
#16 0x00000000004ca33c in diskann::build_merged_vamana_index (base_file=..., _compareMetric=diskann::L2, L=75, R=60, sampling_rate=150000, ram_budget=10, mem_index_path=..., medoids_file=..., centroids_file=...)
at /workspace/DiskANN/src/aux_utils.cpp:371
#17 0x00000000004c7abc in diskann::build_disk_index (dataFilePath=0x7ffe91b2d217 "../cmake-build-debug/tests/data.bin", indexFilePath=0x7ffe91b2d23b "output/index", indexBuildParameters=0x7ffe91b2b140 "60 75 10 10 2",
_compareMetric=diskann::L2) at /workspace/DiskANN/src/aux_utils.cpp:712
#18 0x00000000004be7a7 in build_index (dataFilePath=0x7ffe91b2d217 "../cmake-build-debug/tests/data.bin", indexFilePath=0x7ffe91b2d23b "output/index", indexBuildParameters=0x7ffe91b2b140 "60 75 10 10 2")
at /workspace/DiskANN/tests/build_disk_index.cpp:15
#19 0x00000000004be1ad in main (argc=9, argv=0x7ffe91b2b3d8) at /workspace/DiskANN/tests/build_disk_index.cpp:34

How DiskANN initialize the beginning random graph?

Hi, I'm working on some evaluations of the DiskANN code, but I can't figure out how DiskANN builds the very first-stage graph as described in the paper. I also checked the link() function, but found nothing about the random initialization.

Question about R, L parameter setup.

According to the README, the degree of the graph index is recommended to be set between 60 and 150, and the size of the search list during index building between 75 and 200. What is the reasoning behind these recommended ranges?

remove outdated copy of robin-map in `include/tsl/`

include/tsl/ contains an outdated copy of the robin-map project by @Tessil. Please remove the copy from the DiskANN codebase and depend on it externally instead, so that builds can use the latest version of robin-map.

I suggest changing the README.md and Dockerfile to install the robin-map-dev package on Debian-based distros, and using CMake's FetchContent module on Microsoft Windows.

missing "float"

The commands in workflow/in_memory_index.md seem to be missing the type argument:

./tests/utils/fvecs_to_bin float data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
./tests/utils/fvecs_to_bin float data/sift/sift_query.fvecs data/sift/sift_query.fbin

instead of

./tests/utils/fvecs_to_bin data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
./tests/utils/fvecs_to_bin data/sift/sift_query.fvecs data/sift/sift_query.fbin
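For context, the .fvecs layout stores each vector as a 4-byte dimension count followed by that many float32 values, while the .fbin layout has a single 8-byte header (point count, then dimension) followed by the raw float32 data. A minimal converter sketch (the function name is hypothetical, not the project's tool; it assumes a non-empty file with a uniform dimension):

```python
import struct

def fvecs_to_fbin(src_path: str, dst_path: str) -> None:
    vectors = []
    with open(src_path, "rb") as src:
        while True:
            header = src.read(4)
            if not header:
                break
            (dim,) = struct.unpack("<i", header)   # per-vector dimension
            vectors.append(src.read(4 * dim))      # dim float32 values
    with open(dst_path, "wb") as dst:
        dst.write(struct.pack("<ii", len(vectors), dim))  # single header
        for v in vectors:
            dst.write(v)
```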

Error if destination missing

Index building fails with a non-specific DiskANNException if the destination folder does not yet exist.

I would consider testing at runtime whether (a) the destination exists and (b) it can be written to, and erroring out prior to any computation steps. Fail early and fail hard, I'd say :)
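The suggested fail-early check is straightforward; a sketch of the idea (in Python for brevity, though the project itself is C++; the function name is illustrative):

```python
import os

def check_destination(index_path: str) -> None:
    """Raise before any expensive index-building work if the output location is unusable."""
    directory = os.path.dirname(index_path) or "."
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"destination folder does not exist: {directory}")
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"destination folder is not writable: {directory}")
```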

[FEATURE REQUEST] Python bindings for parsing data structure

Dear Team, I was wondering if Python support for reading the data structures would be helpful. It would help the community better understand and further develop DiskANN.
Feature requests include:

  • reading and loading, into Python data structures, the variables required to do graph traversal
  • a Pythonic version of graph traversal

Some info

Hello,

just wondering whether documentation is maintained for this project, and whether a Python wrapper built with pybindgen is available?

README and build_disk_index are inconsistent with each other

build_disk_index requires memory budget settings for the [M] and [B] arguments, as a float, in gigabytes. The README doesn't state the expected units, which could use some help.

Further, the README doesn't mention the PQ_disk_bytes argument.

In addition, PQ_disk_bytes is described as optional in the source code, but it is required even though its default value is 0. I'd consider either (a) defaulting it to 0 when it isn't provided, or (b) making it required and documenting that 0 is a perfectly fine value.
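Option (a) — defaulting the argument when it is omitted — is the usual CLI convention; sketched here with Python's argparse (the flag name mirrors the C++ positional argument and is purely illustrative):

```python
import argparse

# Illustrative front-end: a genuinely optional PQ_disk_bytes with a documented default.
parser = argparse.ArgumentParser(description="illustrative build_disk_index front-end")
parser.add_argument("--PQ_disk_bytes", type=int, default=0,
                    help="0 stores full vectors on disk (a perfectly fine value)")

args = parser.parse_args([])      # flag omitted...
assert args.PQ_disk_bytes == 0    # ...falls back to the documented default
```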

distance metric parameter: Not Mentioned in README

if (std::string(argv[2]) == std::string("mips"))

if (std::string(argv[ctr]) == std::string("mips"))

(The same check appears in several of the test drivers.)

This distance metric parameter is not mentioned in the README; the README needs to be updated.

Problem of cmake cannot find mkl.h

Hello, I'm trying to test the DiskANN code, and I have some problems with the environment.
It seems that Intel currently only provides MKL via oneAPI, and it is incompatible with the CMakeLists.txt.
Can you share the MKL version used in the current repo?
Thanks a lot!

index_build_prefix and build_memory_index error

If you try to build an index using build_memory_index and provide only a path without a prefix fragment, the build fails at the ~99% mark.

Reproduction steps:

mkdir /tmp/myindex
chmod 777 /tmp/myindex
$BUILD/tests/build_memory_index --index_path_prefix=/tmp/myindex/ --[other arguments elided]

Results:

Starting index build with R: 32  Lbuild: 50  alpha: 1.2  #threads: 24
L2: Using AVX2 distance computation DistanceL2Float
Using only first 5841480 from file..
Starting index build with 5841480 points...
99.2899% of index build completed.Starting final cleanup..done. Link time: 300.633s
Index built with degree: max:32  avg:22.1867  min:1  count(deg<2):11166
Indexing time: 303.848
basic_ios::clear: iostream error
Index build failed.
cd /tmp/myindex
>>>  bash: cd: /tmp/myindex: No such file or directory

Note: if you do provide a full prefix, such as /tmp/myindex/randomprefixthingherewhatever, it works just fine. This only occurs when you give it a directory without an index prefix.

Assertion:
No matter what, we should not be deleting the myindex folder from the example above when attempting to write. If we want to make the file prefix fragment a strict requirement rather than allowing just an output folder, that is fine, but we should detect that scenario and error out accordingly, ideally before any index-building work is done.

An even better approach may be to split --index_path_prefix into --index_output_directory and --index_prefix as two required, non-empty strings. That would make the scenario easy to detect and would elevate the importance of a non-empty file prefix as a clear requirement for the user.

what does parameter "PQ_disk_bytes" mean?

I would appreciate it if you could tell me what the PQ_disk_bytes parameter controls. The only hint in the source code is "(for very large dimensionality, use 0 for full vectors)", and I still can't work it out.

tcmalloc: large alloc error

I run DiskANN on my server with 128 GB of main memory and a 1 TB SSD.
My dataset has 10M points, each of which is 200-dimensional.
I build DiskANN and run the script
./tests/build_disk_index float mips /home/user1/ann/dataset/data10m/data10m /home/user1/ann/dataset/index/index 100 100 64g 128g 64 20
I vary the last parameter from 2 to 100 (2, 20, 50, 100), and I always get a "tcmalloc: large alloc" error followed by a segmentation fault (core dump).

Look forward to your help!

Segmentation fault (core dumped) when building the index

Hello,

Building a DiskANN index from Glove-100 dataset with the following command:

./build/tests/build_disk_index float mips input_data.bin . 70 100 1.5 2 4 0

fails with a Segmentation fault (core dumped) message. This is all of the output in the terminal:

Using Inner Product search, so need to pre-process base data into temp file. Please ensure there is additional (n*(d+1)*4) bytes for storing pre-processed base vectors, apart from the intermin indices and final index.
Pre-processing base file by adding extra coordinate
Writing bin: ._disk.index_max_base_norm.bin
bin: #pts = 1, #dims = 1, size = 12B
Finished writing bin.
Starting index build: R=70 L=100 Query RAM budget: 1.34218e+09 Indexing ram budget: 2 T: 4
Compressing 101-dimensional data into 100 bytes per vector.
Opened: ._prepped_base.bin, size: 478139664, cache_size: 67108864
Training data loaded of size 100003
 Stat(._pq_pivots.bin) returned: 0
Reading bin file ._pq_pivots.bin ...
Metadata: #pts = 256, #dims = 101...
PQ pivot file exists. Not generating again
Opened: ._prepped_base.bin, size: 478139664, cache_size: 67108864
 Stat(._pq_pivots.bin) returned: 0
Reading bin file ._pq_pivots.bin_centroid.bin ...
Metadata: #pts = 101, #dims = 1...
Reading bin file ._pq_pivots.bin_rearrangement_perm.bin ...
Metadata: #pts = 101, #dims = 1...
Reading bin file ._pq_pivots.bin_chunk_offsets.bin ...
Metadata: #pts = 101, #dims = 1...
Reading bin file ._pq_pivots.bin ...
Metadata: #pts = 256, #dims = 101...
Loaded PQ pivot information
Processing points  [0, 1183514)..tcmalloc: large alloc 1211924480 bytes == 0x55b74b514000 @  0x7ffbf9622680 0x7ffbf9642ff4 0x55b6cdfe08b0 0x55b6cdf9607c 0x55b6cdf2c315 0x55b6cdf1ea7f 0x55b6cdf1e292 0x7ffbf1a540b3 0x55b6cdf1d3ae
tcmalloc: large alloc 1211924480 bytes == 0x55b74b514000 @  0x7ffbf9622680 0x7ffbf9642ff4 0x55b6cdfe08b0 0x55b6cdf9607c 0x55b6cdf2c315 0x55b6cdf1ea7f 0x55b6cdf1e292 0x7ffbf1a540b3 0x55b6cdf1d3ae
.done.
Full index fits in RAM budget, should consume at most 1.10047GiBs, so building in one shot
Number of frozen points = 0
Reading bin file ._prepped_base.bin ...Metadata: #pts = 1183514, #dims = 101, aligned_dim = 104...allocating aligned memory, 492341824 bytes...done. Copying data... done.
Using AVX2 distance computation
Starting index build...
Number of syncs: 289
[1]    2703529 segmentation fault (core dumped)  ./build/tests/build_disk_index float mips scripts/input_data.bin . 70 100 1.5

This is the Python script used to build the input data binary:

import h5py
import numpy as np

glove_h5py = h5py.File("./glove-100-angular.hdf5", "r")

dataset = glove_h5py['train'][:]  # load the full train set into memory

# Normalize each row to unit length.
normalized_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]
# DiskANN's float reader expects float32 data after the 8-byte header.
normalized_dataset = normalized_dataset.astype(np.float32)

N, dim = normalized_dataset.shape

byteorder = 'little'
with open('./input_data.bin', 'wb') as out:
    out.write(N.to_bytes(4, byteorder=byteorder))
    out.write(dim.to_bytes(4, byteorder=byteorder))
    out.write(normalized_dataset.tobytes())
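One quick sanity check for a file written this way is to read back the 8-byte header and confirm that the point count, dimension, and file size are mutually consistent (float32 data implies 4 bytes per value). A small sketch (the function name is illustrative):

```python
import os
import struct

def check_bin_header(path: str) -> tuple[int, int]:
    """Verify that a float32 .bin file's header matches its actual size."""
    with open(path, "rb") as f:
        npts, dim = struct.unpack("<ii", f.read(8))
    expected = 8 + 4 * npts * dim  # header + float32 payload
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"size mismatch: expected {expected} bytes, found {actual}")
    return npts, dim
```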

Any idea what's going wrong? Thank you.

Report a bug about computing query norm in MIPS mode

for (uint32_t i = 0; i < this->data_dim; i++) {

loop condition should be
for (uint32_t i = 0; i < (this->data_dim-1); i++) {
this->data_dim is assigned at
this->data_dim = pq_file_dim;

and pq_file_dim is assigned at

get_bin_metadata(pq_table_bin, pq_file_num_centroids, pq_file_dim);

pq_file_dim is equal to the number of dimensions of the vectors in the index file. However, in MIPS mode the original vectors are transformed with an additional dimension, sqrt(1 - ||x||/||w||), so the vectors in the index file have one more dimension than the original vectors, whose dimensionality matches that of the queries.
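For background, a common MIPS-to-L2 reduction (an editor's sketch of the general technique, which may differ in detail from DiskANN's exact code path) scales each base vector by the maximum base norm M and appends sqrt(1 - ||x/M||²) as an extra coordinate, while queries simply get a 0 appended — which is why indexed vectors end up one dimension longer than queries:

```python
import math

def transform_base(x: list[float], max_norm: float) -> list[float]:
    """Scale by the max base norm and append the norm-completing coordinate."""
    scaled = [v / max_norm for v in x]
    extra = math.sqrt(max(0.0, 1.0 - sum(v * v for v in scaled)))
    return scaled + [extra]          # result is unit-norm, with dim + 1 entries

def transform_query(q: list[float]) -> list[float]:
    return q + [0.0]                 # queries keep their values, gain a zero
```

With this transform every base vector has unit norm, so minimizing L2 distance to the padded query is equivalent to maximizing the inner product with the original base vector.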

What's the purpose of warmup in search_disk_index?

Hello, after reading search_disk_index.cpp I find that the macro WARMUP is true by default. That means that after loading the hot data into the cache, it performs the same search process again as a "warmup", which seems pointless, because cached_beam_search contains no cache-update operation. It seems that generate_cache_list_from_sample_queries has already warmed the system up.

broken link to `CONTRIBUTING.md`

The link to CONTRIBUTING.md in the README.md is broken because the file is called CONTRIBUTING.MD (uppercase .MD), and many filesystems, as well as GitHub URLs, are case-sensitive.

I suggest the right fix would be to rename CONTRIBUTING.MD (uppercase .MD) to CONTRIBUTING.md (lowercase .md).

Disk Index and MIPS Search

For a MIPS disk index, the dimension is artificially increased by 1, so the data_dim member is one more than the inherent dimension of the data. There are places in the code where query[data_dim - 1] is accessed, which might cause access violations.

Does this need to be fixed?

Support non AVX CPU

When I tried to build DiskANN, I met

error: there are no arguments to ‘_mm128_loadu_ps’ that depend on a template parameter, so a declaration of ‘_mm128_loadu_ps’ must be available [-fpermissive]
  378 |   tmp1 = _mm128_loadu_ps(addr1);

The error is reported on https://github.com/microsoft/DiskANN/blob/main/src/distance.cpp#L405 .

From the x86 intrinsics list, I didn't see _mm128_loadu_ps, only _mm_loadu_ps. I tried manually changing it to _mm_loadu_ps, and other errors popped up.

I also checked /proc/cpuinfo, and it looks like my CPU supports neither AVX nor AVX2, which I guess is the reason.

Design decisions and discussions

Hi Team,

I am interested in DiskANN as a concept. However, I am not able to find any design-decision discussions for major parts of DiskANN. Can anyone point me to them?

I am most interested in answers to questions like these:

  1. Does the whole vector index need to be in memory to serve queries?
  2. How are updates handled? If my scenario involves frequent updates, is DiskANN still a good fit?
  3. How are deletions handled? Are they tombstoned with a flag, or is the graph reorganized (given that the index is graph-based)?
  4. How do the performance benchmarks compare with other popular ANN libraries in terms of memory footprint, graph build time, and query time?

Segmentation fault while trying to search disk indexes

Hi,
I am trying to test DiskANN on a small dataset of about 10,000 vectors (I will be dealing with billions in production). I was able to build the indices for the dataset successfully, but upon running the search command:

./tests/search_disk_index float mips ../embeddings/bing 0 1 0 ../queryfile.bin null 0 ../result 1

I get a segmentation fault.
I am attaching the build query as well:

./tests/build_disk_index float mips ../datafile.bin ../embeddings/bing 70 100 1.5 2 4 0

And the query output for searching

Search parameters: #threads: 1, beamwidth to be optimized for each L value
Reading bin file ../queryfile.bin ...Metadata: #pts = 128, #dims = 768, aligned_dim = 768...allocating aligned memory, 393216 bytes...done. Copying data... done.
 Stat(null) returned: -1
Using inner product distance function
Reading bin file ../embeddings/bing_pq_compressed.bin ...
Metadata: #pts = 10000, #dims = 100...
Reading bin file ../embeddings/bing_pq_pivots.bin ...
Metadata: #pts = 256, #dims = 769...
 Stat(../embeddings/bing_pq_pivots.bin_chunk_offsets.bin) returned: 0
Reading bin file ../embeddings/bing_pq_pivots.bin_rearrangement_perm.bin ...
Metadata: #pts = 769, #dims = 1...
Reading bin file ../embeddings/bing_pq_pivots.bin_chunk_offsets.bin ...
Metadata: #pts = 101, #dims = 1...
PQ data has 100 bytes per point.
Reading bin file ../embeddings/bing_pq_pivots.bin_centroid.bin ...
Metadata: #pts = 769, #dims = 1...
PQ Pivots: #ctrs: 256, #dims: 769, #chunks: 100
Loaded PQ centroids and in-memory compressed vectors. #points: 10000 #dim: 769 #aligned_dim: 776 #chunks: 100
 Stat(../embeddings/bing_disk.index_pq_pivots.bin) returned: -1
 Tellg: 40964096 as u64: 40964096
Disk-Index File Meta-data: # nodes per sector: 1, max node len (bytes): 3360, max node degree: 70
Setting up thread-specific contexts for nthreads: 1
allocating ctx: 0x7f5690d9a000 to thread-id:140009772226432
 Stat(../embeddings/bing_disk.index_medoids.bin) returned: -1
Loading centroid data from medoids vector data of 1 medoid(s)
 Stat(../embeddings/bing_disk.index_max_base_norm.bin) returned: 0
Reading bin file ../embeddings/bing_disk.index_max_base_norm.bin ...
Metadata: #pts = 1, #dims = 1...
Setting re-scaling factor of base vectors to 1
done..
Caching 0 BFS nodes around medoid(s)
 Stat(../embeddings/bing_sample_data.bin) returned: 0
Reading bin file ../embeddings/bing_sample_data.bin ...Metadata: #pts = 10000, #dims = 769, aligned_dim = 776...allocating aligned memory, 31040000 bytes...done. Copying data... done.
Loading the cache list into memory....done.
     L   Beamwidth             QPS    Mean Latency    99.9 Latency        Mean IOs         CPU (s)
==========================================================================================================
Segmentation fault (core dumped)

Feature Request: Fixed distance search

Hi,

This is such a great tool! I am wondering if there might be an easy way to retrieve all indices approximately within a distance R of the input, rather than the k nearest neighbors. I have an application that would require more exhaustive searching and would benefit from this greatly! Thank you for your consideration!

Best,
Han

`clang-format-4.0` package in `README.md` is outdated

The clang-format-4.0 package mentioned in README.md is no longer available in recent Debian based distributions, the latest release it is present in is Debian stretch and Ubuntu bionic.

I suggest to replace it with clang-format instead, which is a meta-package that pulls in the most recent version of clang-format available in the current distro release.

What’s the use of frozen points?

While reading the source code of DiskANN, I noticed a new concept, frozen points, which does not appear in the DiskANN paper. I wonder what frozen points are used for. Hoping for your reply!

Why does more threads result in more IOs?

The documentation says num_threads: search using the specified number of threads in parallel, one thread per query. More will result in more IOs, so find the balance depending on the bandwidth of the SSD.

Why do more threads result in more IOs? Isn't the total number of IOs fixed?
