
DiskANN


DiskANN is a suite of scalable, accurate and cost-effective approximate nearest neighbor search algorithms for large-scale vector search that supports real-time changes and simple filters. The code is based on ideas from the DiskANN, Fresh-DiskANN and Filtered-DiskANN papers, with further improvements. It was originally forked from the code for the NSG algorithm.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

See guidelines for contributing to this project.

Linux build:

Install the following packages through apt-get

sudo apt install make cmake g++ libaio-dev libgoogle-perftools-dev clang-format libboost-all-dev

Install Intel MKL

Ubuntu 20.04 or newer

sudo apt install libmkl-full-dev

Earlier versions of Ubuntu

Install Intel MKL either by downloading the oneAPI MKL installer or using apt (we tested with build 2019.4-070 and 2022.1.2.146).

# OneAPI MKL Installer
wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18487/l_BaseKit_p_2022.1.2.146.sh
sudo sh l_BaseKit_p_2022.1.2.146.sh -a --components intel.oneapi.lin.mkl.devel --action install --eula accept -s

Build

mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 

Windows build:

The Windows version has been tested with the Enterprise editions of Visual Studio 2022, 2019 and 2017. It should also work with the Community and Professional editions without any changes.

Prerequisites:

  • CMake 3.15+ (available with Visual Studio 2019+ or from https://cmake.org)
  • NuGet.exe (install from https://www.nuget.org/downloads)
    • The build script will use NuGet to get MKL, OpenMP and Boost packages.
  • DiskANN git repository checked out together with submodules. To check out submodules after git clone:
git submodule init
git submodule update
  • Environment variables:
    • [optional] If you would like to override the Boost library listed in windows/packages.config.in, set BOOST_ROOT to your Boost folder.

Build steps:

  • Open the "x64 Native Tools Command Prompt for VS 2019" (or the corresponding version) and change to the DiskANN folder
  • Create a "build" directory inside it
  • Change to the "build" directory and run
cmake ..

OR for Visual Studio 2017 and earlier:

<full-path-to-installed-cmake>\cmake ..

This will create a diskann.sln solution. Now you can:

  • Open it from VisualStudio and build either Release or Debug configuration.
  • <full-path-to-installed-cmake>\cmake --build build
  • Use MSBuild:
msbuild.exe diskann.sln /m /nologo /t:Build /p:Configuration="Release" /property:Platform="x64"
  • This will also build the gperftools submodule for the libtcmalloc_minimal dependency.
  • Generated binaries are stored in the x64/Release or x64/Debug directories.

Usage:

Please see the following pages on using the compiled code:

Please cite this software in your work as:

@misc{diskann-github,
   author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan and Patel, Yash},
   title = {{DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search}},
   url = {https://github.com/Microsoft/DiskANN},
   version = {0.6.1},
   year = {2023}
}


diskann's Issues

What is the difference between Index and PQFlashIndex?

tests/build_disk_index and tests/build_memory_index use Index, while tests/search_disk_index uses PQFlashIndex.

How do Index and PQFlashIndex work together?

Segmentation fault when build_disk_index with Multithreading

What happened:

I generated a small .bin data file with a simple program.

#include <cstdint>
#include <fstream>
using namespace std;
int main()
{
  // Layout: [npts : int32][dim : int32][row-major int8 data]
  ofstream myFile ("data.bin", ios::out | ios::binary);
  int mockNum = 10;
  int dimension = 10;
  myFile.write((char*)&mockNum, sizeof(int));
  myFile.write((char*)&dimension, sizeof(int));
  for (int i = 0; i < mockNum; i++)
  {
    for (int j = 0; j < dimension; j++)
    {
      int8_t x = (int8_t)(i + j);
      myFile.write((char*)&x, sizeof(int8_t));
    }
  }
  myFile.close();
  return 0;
}
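For reference (an editor's sketch, not part of the original report), the same file can be produced in Python, assuming the layout above: a 4-byte point count, a 4-byte dimension, then row-major int8 values:

```python
import struct

mock_num, dimension = 10, 10

with open("data.bin", "wb") as f:
    f.write(struct.pack("<ii", mock_num, dimension))  # 8-byte header
    for i in range(mock_num):
        # Row i holds the int8 values i, i+1, ..., i+dimension-1.
        f.write(bytes((i + j) & 0xFF for j in range(dimension)))
```

The resulting file is 8 + 10 × 10 = 108 bytes, which matches the "size: 108" reported in the log.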

But I get a segmentation fault when running build_disk_index with multithreading.

successful command

./tests/build_disk_index int8 ../cmake-build-debug/tests/data.bin output/index 60 75 10 10 1

core dump command

./tests/build_disk_index int8 ../cmake-build-debug/tests/data.bin output/index 60 75 10 10 2

message log

Starting index build: R=60 L=75 Query RAM budget: 1.0469e+10 Indexing ram budget: 10 T: 2
Compressing 10-dimensional data into 10 bytes per vector.
Opened: ../cmake-build-debug/tests/data.bin, size: 108, cache_size: 108
Training data loaded of size 10
Stat(output/index_pq_pivots.bin) returned: 0
Reading bin file output/index_pq_pivots.bin ...
Metadata: #pts = 256, #dims = 10...
PQ pivot file exists. Not generating again
Opened: ../cmake-build-debug/tests/data.bin, size: 108, cache_size: 108
Stat(output/index_pq_pivots.bin) returned: 0
Reading bin file output/index_pq_pivots.bin_centroid.bin ...
Metadata: #pts = 10, #dims = 1...
Reading bin file output/index_pq_pivots.bin_rearrangement_perm.bin ...
Metadata: #pts = 10, #dims = 1...
Reading bin file output/index_pq_pivots.bin_chunk_offsets.bin ...
Metadata: #pts = 11, #dims = 1...
Reading bin file output/index_pq_pivots.bin ...
Metadata: #pts = 256, #dims = 10...
Loaded PQ pivot information
Processing points [0, 10)...done.
Full index fits in RAM, building in one shot
Number of frozen points = 0
Reading bin file ../cmake-build-debug/tests/data.bin ...Metadata: #pts = 10, #dims = 10, aligned_dim = 16...allocating aligned memory, 160 bytes...done. Copying data... done.
Using AVX2 distance computation
Starting index build...
Number of syncs: 40
Completed (round: 0, sync: 1/40 with L 75) sync_time: 0.00149s; inter_time: 2.958e-05s
Completed (round: 0, sync: 3/40 with L 75) sync_time: 0.001415s; inter_time: 2.37e-05s
Segmentation fault (core dumped)

core trace

Starting index build...
Number of syncs: 40

Thread 1 "build_disk_inde" received signal SIGSEGV, Segmentation fault.
tcmalloc::SLL_PopRange (end=, start=, N=8, head=0x1387c60) at src/linked_list.h:88
88 tmp = SLL_Next(tmp);
(gdb) bt
#0 tcmalloc::SLL_PopRange (end=, start=, N=8, head=0x1387c60) at src/linked_list.h:88
#1 tcmalloc::ThreadCache::FreeList::PopRange (end=, start=, N=8, this=0x1387c60) at src/thread_cache.h:238
#2 tcmalloc::ThreadCache::ReleaseToCentralCache (this=this@entry=0x1387c40, src=src@entry=0x1387c60, cl=, N=8, N@entry=32) at src/thread_cache.cc:206
#3 0x00007f03767b878c in tcmalloc::ThreadCache::ListTooLong (this=0x1387c40, list=0x1387c60, cl=) at src/thread_cache.cc:164
#4 0x00000000004d8020 in __gnu_cxx::new_allocator::deallocate (this=0x7ffe91b27f10, __p=0x1b3c0c0) at /usr/include/c++/9/ext/new_allocator.h:128
#5 0x00000000004d3ab5 in std::allocator_traits<std::allocator >::deallocate (__a=..., __p=0x1b3c0c0, __n=2) at /usr/include/c++/9/bits/alloc_traits.h:469
#6 0x00000000004cfb7c in std::_Vector_base<unsigned int, std::allocator >::_M_deallocate (this=0x7ffe91b27f10, __p=0x1b3c0c0, __n=2) at /usr/include/c++/9/bits/stl_vector.h:351
#7 0x00000000004cde10 in std::_Vector_base<unsigned int, std::allocator >::~_Vector_base (this=0x7ffe91b27f10, __in_chrg=) at /usr/include/c++/9/bits/stl_vector.h:332
#8 0x00000000004cde61 in std::vector<unsigned int, std::allocator >::~vector (this=0x7ffe91b27f10, __in_chrg=) at /usr/include/c++/9/bits/stl_vector.h:680
#9 0x00000000005129ae in std::__shrink_to_fit_aux<std::vector<unsigned int, std::allocator >, true>::_S_do_it (__c=...) at /usr/include/c++/9/bits/allocator.h:265
#10 0x000000000050ea67 in std::vector<unsigned int, std::allocator >::_M_shrink_to_fit (this=0x1b3e240) at /usr/include/c++/9/bits/vector.tcc:693
#11 0x0000000000509896 in std::vector<unsigned int, std::allocator >::shrink_to_fit (this=0x1b3e240) at /usr/include/c++/9/bits/stl_vector.h:987
#12 0x000000000051c7bf in diskann::Index<signed char, int>::_ZN7diskann5IndexIaiE4linkERNS_10ParametersE._omp_fn.2(void) () at /workspace/DiskANN/src/index.cpp:875
#13 0x00007f036f6d08f8 in __kmp_api_GOMP_parallel (task=0x7, data=0x1b3c0c0, num_threads=0, flags=8) at ../../src/kmp_gsupport.cpp:1430
#14 0x00000000004f77cd in diskann::Index<signed char, int>::link (this=0x1b40200, parameters=...) at /workspace/DiskANN/src/index.cpp:867
#15 0x00000000004f02b2 in diskann::Index<signed char, int>::build (this=0x1b40200, parameters=..., tags=...) at /workspace/DiskANN/src/index.cpp:988
#16 0x00000000004ca33c in diskann::build_merged_vamana_index (base_file=..., _compareMetric=diskann::L2, L=75, R=60, sampling_rate=150000, ram_budget=10, mem_index_path=..., medoids_file=..., centroids_file=...)
at /workspace/DiskANN/src/aux_utils.cpp:371
#17 0x00000000004c7abc in diskann::build_disk_index (dataFilePath=0x7ffe91b2d217 "../cmake-build-debug/tests/data.bin", indexFilePath=0x7ffe91b2d23b "output/index", indexBuildParameters=0x7ffe91b2b140 "60 75 10 10 2",
_compareMetric=diskann::L2) at /workspace/DiskANN/src/aux_utils.cpp:712
#18 0x00000000004be7a7 in build_index (dataFilePath=0x7ffe91b2d217 "../cmake-build-debug/tests/data.bin", indexFilePath=0x7ffe91b2d23b "output/index", indexBuildParameters=0x7ffe91b2b140 "60 75 10 10 2")
at /workspace/DiskANN/tests/build_disk_index.cpp:15
#19 0x00000000004be1ad in main (argc=9, argv=0x7ffe91b2b3d8) at /workspace/DiskANN/tests/build_disk_index.cpp:34

How DiskANN initialize the beginning random graph?

Hi, I'm working on some evaluations of the DiskANN code, but I can't figure out how DiskANN builds the very first-stage graph as described in the paper. I also checked the link() function, but found nothing about the random initialization.

Question about R, L parameter setup.

According to the README, the degree of the graph index is recommended to be set between 60 and 150, and the size of the search list during index building between 75 and 200. What is the reasoning behind these recommended ranges?

remove outdated copy of robin-map in `include/tsl/`

include/tsl/ contains an outdated copy of the robin-map project by @Tessil. Please remove the copy from the DiskANN codebase and depend on it externally instead, so that builds can use the latest version of robin-map.

I suggest changing the README.md and Dockerfile to install the robin-map-dev package on Debian-based distros, and using CMake's FetchContent module on Microsoft Windows.

missing "float"

The commands in workflow/in_memory_index.md seem to be missing the type argument:

./tests/utils/fvecs_to_bin float data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
./tests/utils/fvecs_to_bin float data/sift/sift_query.fvecs data/sift/sift_query.fbin

instead of

./tests/utils/fvecs_to_bin data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
./tests/utils/fvecs_to_bin data/sift/sift_query.fvecs data/sift/sift_query.fbin
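For context, the .fvecs layout stores each vector as a 4-byte dimension count followed by that many float32 values, while the .fbin layout has a single 8-byte header (point count, then dimension) followed by the raw float32 data. A minimal converter sketch (the function name is hypothetical, not the project's tool; it assumes a non-empty file with a uniform dimension):

```python
import struct

def fvecs_to_fbin(src_path: str, dst_path: str) -> None:
    vectors = []
    with open(src_path, "rb") as src:
        while True:
            header = src.read(4)
            if not header:
                break
            (dim,) = struct.unpack("<i", header)   # per-vector dimension
            vectors.append(src.read(4 * dim))      # dim float32 values
    with open(dst_path, "wb") as dst:
        dst.write(struct.pack("<ii", len(vectors), dim))  # single header
        for v in vectors:
            dst.write(v)
```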

Error if destination missing

Index building fails with a non-specific DiskANNException if the destination folder does not yet exist.

I would consider testing at runtime whether (a) the destination exists and (b) it can be written to, and erroring out prior to any computation steps. Fail early and fail hard, I'd say :)
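The suggested fail-early check is straightforward; a sketch of the idea (in Python for brevity, though the project itself is C++; the function name is illustrative):

```python
import os

def check_destination(index_path: str) -> None:
    """Raise before any expensive index-building work if the output location is unusable."""
    directory = os.path.dirname(index_path) or "."
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"destination folder does not exist: {directory}")
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"destination folder is not writable: {directory}")
```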

[FEATURE REQUEST] Python bindings for parsing data structure

Dear Team, I was wondering if Python support for reading the data structures would be helpful. It would help the community better understand and further develop DiskANN.
Feature requests include:

  • reading and loading, into Python data structures, the variables required to do graph traversal
  • a Pythonic version of graph traversal

Some info

Hello,

just wondering whether documentation is maintained for this project, and whether a Python wrapper built with pybindgen is available?

README and build_disk_index are inconsistent with each other

build_disk_index requires memory budget settings for the [M] and [B] arguments, as a float, in gigabytes. The README doesn't state the expected units, which could use some help.

Further, the README doesn't mention the PQ_disk_bytes argument.

In addition, PQ_disk_bytes is described as optional in the source code, but it is required even though its default value is 0. I'd consider either (a) defaulting it to 0 when it isn't provided, or (b) making it required and documenting that 0 is a perfectly fine value.
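Option (a) — defaulting the argument when it is omitted — is the usual CLI convention; sketched here with Python's argparse (the flag name mirrors the C++ positional argument and is purely illustrative):

```python
import argparse

# Illustrative front-end: a genuinely optional PQ_disk_bytes with a documented default.
parser = argparse.ArgumentParser(description="illustrative build_disk_index front-end")
parser.add_argument("--PQ_disk_bytes", type=int, default=0,
                    help="0 stores full vectors on disk (a perfectly fine value)")

args = parser.parse_args([])      # flag omitted...
assert args.PQ_disk_bytes == 0    # ...falls back to the documented default
```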

distance metric parameter: Not Mentioned in README

if (std::string(argv[2]) == std::string("mips"))

if (std::string(argv[ctr]) == std::string("mips"))

(The same check appears in several of the test drivers.)

This distance metric parameter is not mentioned in the README; the README needs to be updated.

Problem of cmake cannot find mkl.h

Hello, I'm trying to test the DiskANN code, and I have some problems with the environment.
It seems that Intel currently only provides MKL via oneAPI, and it is incompatible with the CMakeLists.txt.
Can you share the MKL version used in the current repo?
Thanks a lot!

index_build_prefix and build_memory_index error

If you try to build an index using build_memory_index and provide only a path without a prefix fragment, the build fails at the ~99% mark.

Reproduction steps:

mkdir /tmp/myindex
chmod 777 /tmp/myindex
$BUILD/tests/build_memory_index --index_path_prefix=/tmp/myindex/ --[other arguments elided]

Results:

Starting index build with R: 32  Lbuild: 50  alpha: 1.2  #threads: 24
L2: Using AVX2 distance computation DistanceL2Float
Using only first 5841480 from file..
Starting index build with 5841480 points...
99.2899% of index build completed.Starting final cleanup..done. Link time: 300.633s
Index built with degree: max:32  avg:22.1867  min:1  count(deg<2):11166
Indexing time: 303.848
basic_ios::clear: iostream error
Index build failed.
cd /tmp/myindex
>>>  bash: cd: /tmp/myindex: No such file or directory

Note: if you do provide a full prefix, such as /tmp/myindex/randomprefixthingherewhatever, it works just fine. This only occurs when you give it a directory without an index prefix.

Assertion:
No matter what, we should not be deleting the myindex folder from the example above when attempting to write. If we want to make the file prefix fragment a strict requirement rather than allowing just an output folder, that is fine, but we should detect that scenario and error out accordingly, ideally before any index-building work is done.

An even better approach may be to split --index_path_prefix into --index_output_directory and --index_prefix as two required, non-empty strings. That would make the scenario easy to detect and would elevate the importance of a non-empty file prefix as a clear requirement for the user.

what does parameter "PQ_disk_bytes" mean?

I would appreciate it if you could tell me what the PQ_disk_bytes parameter controls. The only hint in the source code is "(for very large dimensionality, use 0 for full vectors)", and I still can't work it out.

tcmalloc: large alloc error

I run DiskANN on my server with 128 GB of main memory and a 1 TB SSD.
My dataset has 10M points, each of which is 200-dimensional.
I build DiskANN and run the script
./tests/build_disk_index float mips /home/user1/ann/dataset/data10m/data10m /home/user1/ann/dataset/index/index 100 100 64g 128g 64 20
I vary the last parameter from 2 to 100 (2, 20, 50, 100), and I always get a "tcmalloc: large alloc" error followed by a segmentation fault (core dump).

Look forward to your help!

Segmentation fault (core dumped) when building the index

Hello,

Building a DiskANN index from Glove-100 dataset with the following command:

./build/tests/build_disk_index float mips input_data.bin . 70 100 1.5 2 4 0

fails with a Segmentation fault (core dumped) message. This is all of the output in the terminal:

Using Inner Product search, so need to pre-process base data into temp file. Please ensure there is additional (n*(d+1)*4) bytes for storing pre-processed base vectors, apart from the intermin indices and final index.
Pre-processing base file by adding extra coordinate
Writing bin: ._disk.index_max_base_norm.bin
bin: #pts = 1, #dims = 1, size = 12B
Finished writing bin.
Starting index build: R=70 L=100 Query RAM budget: 1.34218e+09 Indexing ram budget: 2 T: 4
Compressing 101-dimensional data into 100 bytes per vector.
Opened: ._prepped_base.bin, size: 478139664, cache_size: 67108864
Training data loaded of size 100003
 Stat(._pq_pivots.bin) returned: 0
Reading bin file ._pq_pivots.bin ...
Metadata: #pts = 256, #dims = 101...
PQ pivot file exists. Not generating again
Opened: ._prepped_base.bin, size: 478139664, cache_size: 67108864
 Stat(._pq_pivots.bin) returned: 0
Reading bin file ._pq_pivots.bin_centroid.bin ...
Metadata: #pts = 101, #dims = 1...
Reading bin file ._pq_pivots.bin_rearrangement_perm.bin ...
Metadata: #pts = 101, #dims = 1...
Reading bin file ._pq_pivots.bin_chunk_offsets.bin ...
Metadata: #pts = 101, #dims = 1...
Reading bin file ._pq_pivots.bin ...
Metadata: #pts = 256, #dims = 101...
Loaded PQ pivot information
Processing points  [0, 1183514)..tcmalloc: large alloc 1211924480 bytes == 0x55b74b514000 @  0x7ffbf9622680 0x7ffbf9642ff4 0x55b6cdfe08b0 0x55b6cdf9607c 0x55b6cdf2c315 0x55b6cdf1ea7f 0x55b6cdf1e292 0x7ffbf1a540b3 0x55b6cdf1d3ae
tcmalloc: large alloc 1211924480 bytes == 0x55b74b514000 @  0x7ffbf9622680 0x7ffbf9642ff4 0x55b6cdfe08b0 0x55b6cdf9607c 0x55b6cdf2c315 0x55b6cdf1ea7f 0x55b6cdf1e292 0x7ffbf1a540b3 0x55b6cdf1d3ae
.done.
Full index fits in RAM budget, should consume at most 1.10047GiBs, so building in one shot
Number of frozen points = 0
Reading bin file ._prepped_base.bin ...Metadata: #pts = 1183514, #dims = 101, aligned_dim = 104...allocating aligned memory, 492341824 bytes...done. Copying data... done.
Using AVX2 distance computation
Starting index build...
Number of syncs: 289
[1]    2703529 segmentation fault (core dumped)  ./build/tests/build_disk_index float mips scripts/input_data.bin . 70 100 1.5

This is the Python script used to build the input data binary:

import h5py
import numpy as np

glove_h5py = h5py.File("./glove-100-angular.hdf5", "r")

dataset = glove_h5py['train'][:]  # load the full train set into memory

# Normalize each row to unit length.
normalized_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]
# DiskANN's float reader expects float32 data after the 8-byte header.
normalized_dataset = normalized_dataset.astype(np.float32)

N, dim = normalized_dataset.shape

byteorder = 'little'
with open('./input_data.bin', 'wb') as out:
    out.write(N.to_bytes(4, byteorder=byteorder))
    out.write(dim.to_bytes(4, byteorder=byteorder))
    out.write(normalized_dataset.tobytes())
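One quick sanity check for a file written this way is to read back the 8-byte header and confirm that the point count, dimension, and file size are mutually consistent (float32 data implies 4 bytes per value). A small sketch (the function name is illustrative):

```python
import os
import struct

def check_bin_header(path: str) -> tuple[int, int]:
    """Verify that a float32 .bin file's header matches its actual size."""
    with open(path, "rb") as f:
        npts, dim = struct.unpack("<ii", f.read(8))
    expected = 8 + 4 * npts * dim  # header + float32 payload
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"size mismatch: expected {expected} bytes, found {actual}")
    return npts, dim
```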

Any idea what's going wrong? Thank you.

Report a bug about computing query norm in MIPS mode

for (uint32_t i = 0; i < this->data_dim; i++) {

loop condition should be
for (uint32_t i = 0; i < (this->data_dim-1); i++) {
this->data_dim is assigned at
this->data_dim = pq_file_dim;

and pq_file_dim is assigned at

get_bin_metadata(pq_table_bin, pq_file_num_centroids, pq_file_dim);

pq_file_dim is equal to the number of dimensions of the vectors in the index file. However, in MIPS mode the original vectors are transformed with an additional dimension, sqrt(1 - ||x||/||w||), so the vectors in the index file have one more dimension than the original vectors, whose dimensionality matches that of the queries.
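For background, a common MIPS-to-L2 reduction (an editor's sketch of the general technique, which may differ in detail from DiskANN's exact code path) scales each base vector by the maximum base norm M and appends sqrt(1 - ||x/M||²) as an extra coordinate, while queries simply get a 0 appended — which is why indexed vectors end up one dimension longer than queries:

```python
import math

def transform_base(x: list[float], max_norm: float) -> list[float]:
    """Scale by the max base norm and append the norm-completing coordinate."""
    scaled = [v / max_norm for v in x]
    extra = math.sqrt(max(0.0, 1.0 - sum(v * v for v in scaled)))
    return scaled + [extra]          # result is unit-norm, with dim + 1 entries

def transform_query(q: list[float]) -> list[float]:
    return q + [0.0]                 # queries keep their values, gain a zero
```

With this transform every base vector has unit norm, so minimizing L2 distance to the padded query is equivalent to maximizing the inner product with the original base vector.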

What's the purpose of warmup in search_disk_index?

Hello, after reading search_disk_index.cpp I find that the macro WARMUP is true by default. That means that after loading the hot data into the cache, it performs the same search process again as a "warmup", which seems pointless, because cached_beam_search contains no cache-update operation. It seems that generate_cache_list_from_sample_queries has already warmed the system up.

broken link to `CONTRIBUTING.md`

The link to CONTRIBUTING.md in the README.md is broken because the file is called CONTRIBUTING.MD (uppercase .MD), and many filesystems, as well as GitHub URLs, are case-sensitive.

I suggest the right fix would be to rename CONTRIBUTING.MD (uppercase .MD) to CONTRIBUTING.md (lowercase .md).

Disk Index and MIPS Search

For a MIPS disk index, the dimension is artificially increased by 1, so the data_dim member is one more than the inherent dimension of the data. There are places in the code where query[data_dim - 1] is accessed, which might cause access violations.

Does this need to be fixed?

Support non AVX CPU

When I tried to build DiskANN, I met

error: there are no arguments to ‘_mm128_loadu_ps’ that depend on a template parameter, so a declaration of ‘_mm128_loadu_ps’ must be available [-fpermissive]
  378 |   tmp1 = _mm128_loadu_ps(addr1);

The error is reported on https://github.com/microsoft/DiskANN/blob/main/src/distance.cpp#L405 .

From the x86 intrinsics list, I didn't see _mm128_loadu_ps, only _mm_loadu_ps. I tried manually changing it to _mm_loadu_ps, and other errors popped up.

I also checked /proc/cpuinfo, and it looks like my CPU supports neither AVX nor AVX2, which I guess is the reason.

Design decisions and discussions

Hi Team,

I am interested in DiskANN as a concept. However, I am not able to find any design-decision discussions for major parts of DiskANN. Can anyone point me to them?

I am most interested in answers to questions like these:

  1. Does the whole vector index need to be in memory to serve queries?
  2. How are updates handled? If my scenario involves frequent updates, is DiskANN still a good fit?
  3. How are deletions handled? Are they tombstoned with a flag, or is the graph reorganized (given that the index is graph-based)?
  4. How do the performance benchmarks compare with other popular ANN libraries in terms of memory footprint, graph build time, and query time?

Segmentation fault while trying to search disk indexes

Hi,
I am trying to test DiskANN on a small dataset of about 10,000 vectors (I will be dealing with billions in production). I was able to build the indices for the dataset successfully, but upon running the search command:

./tests/search_disk_index float mips ../embeddings/bing 0 1 0 ../queryfile.bin null 0 ../result 1

I get a segmentation fault.
I am attaching the build query as well:

./tests/build_disk_index float mips ../datafile.bin ../embeddings/bing 70 100 1.5 2 4 0

And the query output for searching

Search parameters: #threads: 1, beamwidth to be optimized for each L value
Reading bin file ../queryfile.bin ...Metadata: #pts = 128, #dims = 768, aligned_dim = 768...allocating aligned memory, 393216 bytes...done. Copying data... done.
 Stat(null) returned: -1
Using inner product distance function
Reading bin file ../embeddings/bing_pq_compressed.bin ...
Metadata: #pts = 10000, #dims = 100...
Reading bin file ../embeddings/bing_pq_pivots.bin ...
Metadata: #pts = 256, #dims = 769...
 Stat(../embeddings/bing_pq_pivots.bin_chunk_offsets.bin) returned: 0
Reading bin file ../embeddings/bing_pq_pivots.bin_rearrangement_perm.bin ...
Metadata: #pts = 769, #dims = 1...
Reading bin file ../embeddings/bing_pq_pivots.bin_chunk_offsets.bin ...
Metadata: #pts = 101, #dims = 1...
PQ data has 100 bytes per point.
Reading bin file ../embeddings/bing_pq_pivots.bin_centroid.bin ...
Metadata: #pts = 769, #dims = 1...
PQ Pivots: #ctrs: 256, #dims: 769, #chunks: 100
Loaded PQ centroids and in-memory compressed vectors. #points: 10000 #dim: 769 #aligned_dim: 776 #chunks: 100
 Stat(../embeddings/bing_disk.index_pq_pivots.bin) returned: -1
 Tellg: 40964096 as u64: 40964096
Disk-Index File Meta-data: # nodes per sector: 1, max node len (bytes): 3360, max node degree: 70
Setting up thread-specific contexts for nthreads: 1
allocating ctx: 0x7f5690d9a000 to thread-id:140009772226432
 Stat(../embeddings/bing_disk.index_medoids.bin) returned: -1
Loading centroid data from medoids vector data of 1 medoid(s)
 Stat(../embeddings/bing_disk.index_max_base_norm.bin) returned: 0
Reading bin file ../embeddings/bing_disk.index_max_base_norm.bin ...
Metadata: #pts = 1, #dims = 1...
Setting re-scaling factor of base vectors to 1
done..
Caching 0 BFS nodes around medoid(s)
 Stat(../embeddings/bing_sample_data.bin) returned: 0
Reading bin file ../embeddings/bing_sample_data.bin ...Metadata: #pts = 10000, #dims = 769, aligned_dim = 776...allocating aligned memory, 31040000 bytes...done. Copying data... done.
Loading the cache list into memory....done.
     L   Beamwidth             QPS    Mean Latency    99.9 Latency        Mean IOs         CPU (s)
==========================================================================================================
Segmentation fault (core dumped)

Feature Request: Fixed distance search

Hi,

This is such a great tool! I am wondering if there might be an easy way to retrieve all indices approximately within a distance R of the input, rather than the k nearest neighbors. I have an application that would require more exhaustive searching and would benefit from this greatly! Thank you for your consideration!

Best,
Han

`clang-format-4.0` package in `README.md` is outdated

The clang-format-4.0 package mentioned in README.md is no longer available in recent Debian based distributions, the latest release it is present in is Debian stretch and Ubuntu bionic.

I suggest to replace it with clang-format instead, which is a meta-package that pulls in the most recent version of clang-format available in the current distro release.

What’s the use of frozen points?

While reading the source code of DiskANN, I noticed a new concept, frozen points, which does not appear in the DiskANN paper. I wonder what frozen points are used for. Hoping for your reply!

Why does more threads result in more IOs?

The documentation says num_threads: search using the specified number of threads in parallel, one thread per query. More will result in more IOs, so find the balance depending on the bandwidth of the SSD.

Why do more threads result in more IOs? Isn't the total number of IOs fixed?
