Comments (8)
DiskANN/src/pq_flash_index.cpp
Line 35 in 4162c21
from diskann.
@harsha-simhadri : this is actually the behavior of the diskann C++ lib, and it does seem like a bug. Am I calling it wrong or something?
from diskann.
I don't seem to be calling anything incorrectly; with disk indices specifically, we just switch the cosine metric over to the L2 metric, spit something out to console about it, and move on.
At least with memory indices, we seem to be normalizing the vectors first, though I only validated up through to us setting the normalize_vecs flag when cosine is chosen (and we use L2 from that point on).
Edit: correctly -> incorrectly
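To make the consequence of that silent switch concrete, here is a minimal NumPy sketch. It is not DiskANN code; it only compares plain L2 with cosine distance on un-normalized vectors, using the same toy values as the repro script further down this thread. The helper name cosine_distance is made up for illustration.

```python
import numpy as np

# Toy data: same values as the repro script later in this thread.
query = np.array([1.0, -0.1])
search_space = np.array([
    [10.0, 0.0],  # nearly the same direction as the query, but far away
    [1.0, 0.1],   # close to the query in space, but at a wider angle
])


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Hypothetical helper: 1 - cosine similarity of two non-zero vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Distance from the query to every row under each metric.
l2_distances = np.linalg.norm(search_space - query, axis=1)
cosine_distances = np.array([cosine_distance(query, v) for v in search_space])

nearest_by_l2 = int(np.argmin(l2_distances))
nearest_by_cosine = int(np.argmin(cosine_distances))

print("nearest by L2:", nearest_by_l2)
print("nearest by cosine:", nearest_by_cosine)
```

On un-normalized data the two metrics pick different neighbours here, which is why substituting L2 for cosine without normalizing first changes results.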
from diskann.
#445 documents what currently works and what does not. It also adds Python checks on metric : vector dtype combinations before even reaching into the extension modules.
Leaving this open as a bug though because it's still not properly fixed.
from diskann.
Will investigate and get back soon.
from diskann.
I wonder if there is any update on this issue and if it has been solved yet?
Moreover, if the vectors are normalized, cosine and L2 should produce the same ranking: once every vector has length 1, both distances depend only on the angle between the vectors. Can you confirm that normalization leads to the same results in DiskANN?
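The reasoning behind that question can be checked numerically. For unit vectors a and b, ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(theta), so squared L2 is a decreasing function of cosine similarity and both give the same ordering. The sketch below verifies this with plain NumPy on random data; it does not call DiskANN, and the array sizes and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 8))  # arbitrary random data set
query = rng.normal(size=8)

# Scale every vector (and the query) to unit length.
unit_points = points / np.linalg.norm(points, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)

cos_sim = unit_points @ unit_query
squared_l2 = np.sum((unit_points - unit_query) ** 2, axis=1)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(theta)
identity_holds = bool(np.allclose(squared_l2, 2.0 - 2.0 * cos_sim))

# Ascending L2 order should match descending cosine-similarity order.
same_ranking = bool(np.array_equal(np.argsort(squared_l2), np.argsort(-cos_sim)))

print("identity holds:", identity_holds)
print("same ranking:", same_ranking)
```

This only shows that exact L2 and exact cosine agree after normalization. Whether an approximate index returns the same neighbours still depends on the index itself.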
from diskann.
Following my previous comment, normalizing the vectors would solve the problem as you can see in the following:
import glob
import os
from pathlib import Path

import diskannpy as dap
import numpy as np


def directory_is_empty(directory: str) -> bool:
    """Return True if the directory is empty; otherwise remove its files (matching *.*)."""
    dir_path = Path(directory)
    fpath = dir_path.resolve()
    empty = not any(dir_path.iterdir())
    if not empty:
        print("Found {} . Removing content".format(fpath))
        files = glob.glob('{}/*.*'.format(fpath))
        for f in files:
            os.remove(f)
    return empty


rel_dir = "bug_fix"
Path(rel_dir).mkdir(exist_ok=True)
directory_is_empty(rel_dir)

query = [1, -0.1]
search_space = [
    [10, 0],   # according to cosine distance, this is the closest
    [1, 0.1],  # according to l2, this is the closest
]

# normalizing all vectors in the search space
normalized_search_space = []
for col in search_space:
    norm = np.linalg.norm(col)
    normalized_search_space.append(np.asarray(col) / norm)

# to store back the normalized_search_space in search_space
search_space = normalized_search_space

dap.build_disk_index(
    data=np.array(search_space, dtype=np.float32),
    distance_metric="l2",
    # distance_metric="cosine",
    index_directory=rel_dir,
    graph_degree=16,
    complexity=32,
    vector_dtype=np.float32,
    search_memory_maximum=0.00003,
    build_memory_maximum=1,
    num_threads=0,
    pq_disk_bytes=0
)

index = dap.StaticDiskIndex(
    distance_metric="l2",
    vector_dtype=np.float32,
    index_directory=Path(rel_dir).resolve(),
    num_threads=16,
    num_nodes_to_cache=10
)

# search for the single nearest neighbour of the query
res = index.search(np.array(query, dtype=np.float32), 1, 2)
assert res[0].shape == (1,)
print("The results are: ", res)
assert res[0][0] == 0, "cosine distance is not being used"
@bddap could you confirm this?
Thanks in advance.
from diskann.
If you normalize the vectors before indexing them, I think l2 will behave similarly to cosine. This can serve as a workaround for some users.
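For anyone applying this workaround to a larger data set, a vectorized preprocessing step might look like the sketch below. The helper normalize_rows is hypothetical (it is not part of diskannpy), and the choice to reject zero-length vectors rather than pass them through is an assumption; the result would then be handed to the index build with the l2 metric, as in the script above.

```python
import numpy as np


def normalize_rows(vectors, eps: float = 1e-12) -> np.ndarray:
    """Hypothetical helper: scale each row to unit L2 norm, returned as float32.

    Raises ValueError for a zero-length row, since it has no direction and
    therefore no meaningful cosine distance.
    """
    arr = np.asarray(vectors, dtype=np.float32)
    if arr.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_vectors, dim)")
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    if np.any(norms < eps):
        raise ValueError("cannot normalize a zero-length vector")
    return arr / norms


# Same toy search space as in the script above.
data = normalize_rows([[10, 0], [1, 0.1]])
print(data)
print(np.linalg.norm(data, axis=1))
```

Rows that were already unit length pass through unchanged up to float32 rounding, so applying the helper twice is harmless.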
from diskann.
Related Issues (20)
- [BUG] Do not assume write access to the data folder while creating PQ-based in-mem index.
- [Question] About test_concurr_merge_insert in the diskv2 branch
- [Question] DiskANN performance compared with mmap HOT 1
- [Question] How do I test the FreshDiskANN system?Such as Insert, delete, streamingmerge and other operations HOT 2
- [Program received signal SIGILL, Illegal instruction] HOT 4
- std:bad_alloc Error when loading PQ pivots
- [BUG] Usage for filtered indices needs to be updated
- [Question] Documentation for index binary file
- [BUG] Value of query_result_dist is 0 / 0.0000 HOT 1
- [Question] Hitting Weird Error HOT 1
- [Question] Kernel died after diskannpy import
- [BUG] Cosine + StaticMemoryIndex not working
- [Question] Met error when building rust/diskann
- [Question] Is FreshedDiskAnn supported now ?
- [BUG] Low recall rate on a custom dataset
- [Question]Why we need to merge edge sets after building vamana index?
- [BUG] Distance return for inner_product metric is not expected
- [Question] Parallel index building strategy
- Add multi filter changes for BANN_save_load_one_index branch
- [Question] Why require numpy version stick to 1.25? HOT 2