Comments (13)
Recently I was experimenting with other hosted DB solutions. One of the DB providers suggested that we upload the vectors from a VM in the same infrastructure provider (AWS, GCP, Azure) and the same region as where the DB was hosted. I initially thought this would not impact indexing performance much, since my experiments used a push batch size of barely 100 with 768-dimensional vectors. But I was wrong: I saw an indexing speedup of up to 9x by following the suggested setup.
I know this might be a no-brainer for highly experienced cloud devs, but that may not be the case for growing AI devs like me. Adding this to the README might help fellow AI devs a lot.
Possible explanation:
A batch of 100 vectors with dimension 768 consumes 100 x 768 x 32 bits ≈ 0.3 MB.
But while uploading, these vectors are serialized to JSON, which easily blows the transmitted data up to 1.2 MB (4 times higher, possibly more), and that can be a bottleneck in cross-region bulk uploads:
8 (bits per character) x 16 (rough number of characters needed to represent a float in JSON, including delimiters) x 768 x 100 ≈ 1.2 MB
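The arithmetic above can be checked with a quick back-of-the-envelope script (it assumes float32 vectors and roughly 16 one-byte characters per serialized float, as estimated above; actual JSON overhead varies):

```python
# Compare raw float32 payload size to the estimated JSON-serialized size.
batch_size, dim = 100, 768

raw_bytes = batch_size * dim * 4    # float32 = 4 bytes per value
json_bytes = batch_size * dim * 16  # ~16 one-byte characters per value in JSON text

print(round(raw_bytes / 2**20, 2))   # ~0.29 MB of raw vector data
print(round(json_bytes / 2**20, 2))  # ~1.17 MB on the wire, 4x larger
```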
from vectordb.
Hey @TheSeriousProgrammer ,
Thanks for raising the issue.
Let's see if this solution could work for you.
Only for HNSW:
- Add an endpoint and API called `push` (this creates a list of Documents in memory for HNSW and behaves the same as `index` for the ExactNNInMemory).
- When `index` is called, we make sure that all the cached (pushed) Documents are indexed before the input Documents.
- Add a potential `build_index` that has no effect for InMemory, but does for HNSWVectorDB. This will make sure that all pushed docs are indexed properly and clears the cache.
Do you think this could work for your use case?
Btw, you do not need to add `id` to your BaseDoc; BaseDoc already has an `id` field that you can set yourself or that is randomized for you.
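As a rough stand-in for that behavior, here is a plain-Python analogue (this is not docarray code; `DocLike` is a hypothetical class that mimics the auto-generated-id behavior with `uuid`):

```python
import uuid
from dataclasses import dataclass, field

# Every document gets a randomized id unless you pass one explicitly,
# which is roughly how BaseDoc's built-in `id` field behaves.
@dataclass
class DocLike:
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

auto = DocLike()                  # id generated for you
manual = DocLike(id="my-doc-1")   # or handle it yourself
print(len(auto.id), manual.id)    # 32 my-doc-1
```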
Could you give this PR a try? #52
It would not work in the cloud because it is not released, but locally and with `serve` it should work and give a hint at whether it can solve your issues.
Sure, will give it a shot
Hello @TheSeriousProgrammer ,
Before jumping into that solution, I would like to explore a simpler alternative:
- Use version 0.0.17.
- There is a more fine-grained way to control the size of the batches passed to the vectordb for indexing.
You are telling me that you are passing 64k documents in each call, so you must be doing something like this:
```python
from more_itertools import chunked

for batch in chunked(docs, 64000):
    client.index(batch)
```
This would indeed pass 64000 documents to the client, but the client internally batches them into requests of size 1000 (which means the vectordb will try to build the index 64 times). The client will not return until all the calls are successful.
What you can do is pass the `request_size` parameter to the `index` method and adjust it until you get the best performance. There may be a limit on the size of a single request that can be passed to the vectordb, since gRPC only accepts messages up to 2 GB.
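For a rough sense of where that ceiling sits, you can reuse the earlier ~16-bytes-per-serialized-float estimate (this is only an illustration; real per-document overhead, metadata, and protocol framing will lower the practical limit):

```python
# Rough upper bound on request_size before hitting gRPC's 2 GB message cap.
dim = 768
bytes_per_doc = dim * 16                  # ~12 KB of serialized vector per doc
max_docs = (2 * 2**30) // bytes_per_doc   # 2 GB divided by per-doc payload
print(max_docs)  # ~174762 documents per request at most, ignoring overhead
```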
So you can try:
```python
from more_itertools import chunked

for batch in chunked(docs, 64000):
    client.index(batch, request_size=64000)  # edit this number to the largest value that does not fail
```
Could you give this approach a try to see if it satisfies your needs? It could allow us to avoid adding more methods and complicating the API/interface.
If you find it successful, I will add it to the README for documentation.
Thanks a lot for the help and patience.
I tried the `request_size` workaround; the requests still took a lot of time, now around 35 hours, don't know why. Will try the PR.
> I tried the `request_size` workaround; the requests still took a lot of time, now around 35 hours, don't know why. Will try the PR.

I will try to look into it.
May I ask how you generate the embeddings? Or are they already precomputed?
It's precomputed.
What are your jina and docarray versions?
Hey @TheSeriousProgrammer ,
Are you sure you are using vectordb 0.0.17?
I believe this very poor performance was due to a bug solved in that version, where the configuration passed to the db was not properly applied to the deployed server.
This ended up being problematic because I believe most of the time is spent resizing the index (the `max_elements` parameter was not properly passed).
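To illustrate why a mis-applied capacity setting is so costly, here is a toy model in plain Python (this is not vectordb or hnswlib internals; it just counts the element copies forced by repeated capacity resizes, which is what a properly passed `max_elements` avoids):

```python
def copies_when_growing(n_items, initial_capacity, growth_factor=2):
    """Count element copies caused by capacity resizes while inserting n_items."""
    capacity, size, copies = initial_capacity, 0, 0
    for _ in range(n_items):
        if size == capacity:
            copies += size                        # existing elements copied on resize
            capacity = int(capacity * growth_factor)
        size += 1
    return copies

# Preallocating for the full workload avoids all resize copies:
print(copies_when_growing(64_000, initial_capacity=64_000))  # 0
# Starting tiny forces repeated resizes and ~n extra copies overall:
print(copies_when_growing(64_000, initial_capacity=1024))    # 64512
```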
Can you please check the version?
Thanks
I believe that with the latest `vectordb` release and the latest `docarray` release there should be better performance for you.