Comments (5)

ankane avatar ankane commented on June 10, 2024

Hi @jpbalarini, you'll see different results for queries that use an approximate index (docs). You can try increasing ef_construction (and then m if that doesn't work) to see if it improves recall.

Do you have a lot of duplicate embeddings? It's possible that could be contributing to poor recall.

from pgvector.

ankane avatar ankane commented on June 10, 2024

Just noticed you're using m = 4 and ef_construction = 10. This is likely why recall is poor. Try using the defaults.

jpbalarini avatar jpbalarini commented on June 10, 2024

Thanks @ankane. I found where I got those parameters:
https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector#tldr
No idea why they use those values.

Using the default values (partially) did the trick. The results now make some sense, but I'm still not getting the two closer rows that I need (0.84 and 0.82).

Without using the index (seq scan):

id distance
e8bec92d-6d75-43c0-a397-cc223136eb98 0.8405799269676208
e8d2ff73-fd86-4216-99c8-7dcdbaca4c5b 0.8213955163955688
f31dbe3f-026c-45ed-baeb-328f98bdbad5 0.8135669231414795
d6534c95-523d-4255-b2e7-4dc1758cf5f3 0.8135632872581482
... ...

Using the index:

id distance
f31dbe3f-026c-45ed-baeb-328f98bdbad5 0.8135669231414795
d6534c95-523d-4255-b2e7-4dc1758cf5f3 0.8135632872581482
fa29f109-dac3-49e0-ada3-e176efeb71f2 0.8135602474212646
d09dffd7-1d7e-4e33-aaca-a66253ae9043 0.8135518431663513
... ...

I tried increasing ef_construction from 64 to 96, but the results were the same. I'm now trying 128 (the index is still building). Do you have anything to recommend? Should I try increasing m too?
Any recommendation is useful; thanks!
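
To put a number on the gap between the two listings above, recall can be computed as the share of seq scan ids that the index also returned. The sketch below is only an illustration: `recall_at_k` is a hypothetical helper, and it uses just the four rows visible in each listing (the elided rows are left out), so the figure describes this excerpt rather than the full result set.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact (seq scan) ids that the index query also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# The four visible rows from the seq scan listing.
seq_scan = [
    "e8bec92d-6d75-43c0-a397-cc223136eb98",
    "e8d2ff73-fd86-4216-99c8-7dcdbaca4c5b",
    "f31dbe3f-026c-45ed-baeb-328f98bdbad5",
    "d6534c95-523d-4255-b2e7-4dc1758cf5f3",
]

# The four visible rows from the index listing.
with_index = [
    "f31dbe3f-026c-45ed-baeb-328f98bdbad5",
    "d6534c95-523d-4255-b2e7-4dc1758cf5f3",
    "fa29f109-dac3-49e0-ada3-e176efeb71f2",
    "d09dffd7-1d7e-4e33-aaca-a66253ae9043",
]

print(recall_at_k(seq_scan, with_index))  # 0.5: two of the four seq scan ids appear
```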

ankane avatar ankane commented on June 10, 2024

I'd recommend calculating recall programmatically over many queries to quantify the difference between sets of parameters. The best parameters will be different for each dataset. I'd try ef_construction = 64, 128, 256, and 512 (with m = 16) and then m = 32 and 64 (with ef_construction being at least 4 * m) until you reach your target recall.
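
A minimal sketch of that sweep, under stated assumptions: the table name `items`, the column name `embedding`, and the `vector_cosine_ops` operator class are placeholders (the thread does not say which distance operator is in use), and reusing the same ef_construction candidates for m = 32 and m = 64 is a guess at what the suggestion implies. The helper names are hypothetical. Each statement would be run against a fresh index, followed by measuring mean recall against seq scan results for the same queries.

```python
EF_CONSTRUCTION_VALUES = (64, 128, 256, 512)


def hnsw_parameter_grid():
    """Return (m, ef_construction) pairs in the order suggested in the comment."""
    grid = [(16, ef) for ef in EF_CONSTRUCTION_VALUES]
    for m in (32, 64):
        # Assumption: reuse the same candidates, keeping ef_construction >= 4 * m.
        grid.extend((m, ef) for ef in EF_CONSTRUCTION_VALUES if ef >= 4 * m)
    return grid


def create_index_sql(m, ef_construction, table="items", column="embedding",
                     opclass="vector_cosine_ops"):
    """Build a pgvector HNSW index statement; table, column, opclass are placeholders."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def mean_recall(exact_results, approx_results):
    """Average per-query recall; each argument is a list of id lists, one per query."""
    if len(exact_results) != len(approx_results) or not exact_results:
        raise ValueError("need the same non-zero number of queries in both lists")
    per_query = [
        len(set(exact) & set(approx)) / len(set(exact))
        for exact, approx in zip(exact_results, approx_results)
    ]
    return sum(per_query) / len(per_query)


for m, ef in hnsw_parameter_grid():
    print(create_index_sql(m, ef))
```

Stopping at the first pair that reaches the target recall keeps build time down, since larger m and ef_construction values make index builds slower.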

jpbalarini avatar jpbalarini commented on June 10, 2024

Thanks Andrew for your insights! I will try that 💪🏻
