Different results when doing a query with or without an HNSW index (pgvector, closed)

jpbalarini commented on June 10, 2024

Different results when doing a query with or without an HNSW index

from pgvector.

Comments (5)

ankane commented on June 10, 2024

Hi @jpbalarini, you'll see different results for queries that use an approximate index (docs). You can try increasing ef_construction (and then m if that doesn't work) to see if it improves recall.

Do you have a lot of duplicate embeddings? It's possible that could be contributing to poor recall.

ankane commented on June 10, 2024

Just noticed you're using m = 4 and ef_construction = 10. This is likely why recall is poor. Try using the defaults.
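
For illustration, here is a minimal sketch of how the index definition changes between the two settings mentioned above. It only builds the SQL text. The table name `items`, the column name `embedding`, and the `vector_l2_ops` operator class are assumptions, since the thread does not show the schema or the distance operator in use. The defaults in the signature (m = 16, ef_construction = 64) are pgvector's documented defaults.

```python
def hnsw_index_sql(table, column, opclass="vector_l2_ops", m=16, ef_construction=64):
    """Return a CREATE INDEX statement for a pgvector HNSW index.

    `table`, `column`, and `opclass` are placeholders for illustration;
    substitute the names and operator class used by the real schema.
    """
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


# Settings from the original report (low recall):
low = hnsw_index_sql("items", "embedding", m=4, ef_construction=10)

# pgvector defaults:
default = hnsw_index_sql("items", "embedding")

print(low)
print(default)
```

The old index would need to be dropped and rebuilt for new build parameters to take effect, because m and ef_construction are fixed when the index is created.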

jpbalarini commented on June 10, 2024

Thanks @ankane. I found where I took those parameters from:
https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector#tldr
No idea why they use those values.

Leaving the default values (partially) did the trick. I now get results that make some sense, but the index still doesn't return the top two values that I need (0.84 and 0.82).

Without using the index (seq scan):

id	distance
e8bec92d-6d75-43c0-a397-cc223136eb98	0.8405799269676208
e8d2ff73-fd86-4216-99c8-7dcdbaca4c5b	0.8213955163955688
f31dbe3f-026c-45ed-baeb-328f98bdbad5	0.8135669231414795
d6534c95-523d-4255-b2e7-4dc1758cf5f3	0.8135632872581482
...	...

Using the index:

id	distance
f31dbe3f-026c-45ed-baeb-328f98bdbad5	0.8135669231414795
d6534c95-523d-4255-b2e7-4dc1758cf5f3	0.8135632872581482
fa29f109-dac3-49e0-ada3-e176efeb71f2	0.8135602474212646
d09dffd7-1d7e-4e33-aaca-a66253ae9043	0.8135518431663513
...	...
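
To put a number on the gap between the two result sets above, one option is recall@k: the fraction of the exact (seq scan) top-k ids that also appear in the index top-k. The sketch below uses only the four rows shown for each query. The elided rows ("...") are not included, so the figure describes this visible slice only.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k ids that the approximate top-k also contains."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("need at least one exact result and k >= 1")
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


# Ids copied from the seq scan output above.
seq_scan_ids = [
    "e8bec92d-6d75-43c0-a397-cc223136eb98",
    "e8d2ff73-fd86-4216-99c8-7dcdbaca4c5b",
    "f31dbe3f-026c-45ed-baeb-328f98bdbad5",
    "d6534c95-523d-4255-b2e7-4dc1758cf5f3",
]

# Ids copied from the index scan output above.
index_ids = [
    "f31dbe3f-026c-45ed-baeb-328f98bdbad5",
    "d6534c95-523d-4255-b2e7-4dc1758cf5f3",
    "fa29f109-dac3-49e0-ada3-e176efeb71f2",
    "d09dffd7-1d7e-4e33-aaca-a66253ae9043",
]

print(recall_at_k(seq_scan_ids, index_ids, k=4))  # 0.5
```

Two of the four exact ids are found by the index, so recall@4 on the visible rows is 0.5, and the two missing ids are exactly the 0.84 and 0.82 rows.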

I tried increasing the ef_construction from 64 to 96, but the results were the same. I'm now trying with 128 (still building the index). Do you have something to recommend? Maybe trying to increase the m size too?
Any recommendation is useful; thanks!

ankane commented on June 10, 2024

I'd recommend calculating recall programmatically over many queries to quantify the difference between sets of parameters. The best parameters will be different for each dataset. I'd try ef_construction = 64, 128, 256, and 512 (with m = 16) and then m = 32 and 64 (with ef_construction being at least 4 * m) until you reach your target recall.
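
The sweep described above could be organized roughly as follows. This is a sketch under assumptions, not a prescribed procedure: `measure` is a hypothetical callback that would rebuild the index with the given parameters, run many queries, and return the mean recall. Reusing the 64/128/256/512 ladder for m = 32 and m = 64 is also an assumption; the comment only requires ef_construction to be at least 4 * m there.

```python
from typing import Callable, Iterator, Optional, Tuple

EF_LADDER = (64, 128, 256, 512)


def candidate_params() -> Iterator[Tuple[int, int]]:
    """Yield (m, ef_construction) pairs in the suggested order."""
    # First vary ef_construction with m = 16.
    for ef in EF_LADDER:
        yield 16, ef
    # Then raise m, keeping ef_construction >= 4 * m.
    for m in (32, 64):
        for ef in EF_LADDER:
            if ef >= 4 * m:
                yield m, ef


def first_meeting_target(
    measure: Callable[[int, int], float], target: float
) -> Optional[Tuple[int, int]]:
    """Return the first (m, ef_construction) whose measured recall reaches target."""
    for m, ef in candidate_params():
        if measure(m, ef) >= target:
            return m, ef
    return None


# Stand-in for a real measurement, for demonstration only.
def fake_measure(m: int, ef_construction: int) -> float:
    return 0.95 if ef_construction >= 256 else 0.60


print(list(candidate_params()))
print(first_meeting_target(fake_measure, target=0.9))  # (16, 256)
```

Trying the cheaper settings first keeps index build time down, since larger m and ef_construction values make builds slower.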

jpbalarini commented on June 10, 2024

Thanks Andrew for your insights! I will try that 💪🏻
