jina-ai / vectordb
A Python vector database you just need - no more, no less.
License: Apache License 2.0
HNSWVectorDB does not seem to work because of issues in docarray.
For instance:
Is there a way to iterate over / retrieve all items in the database?
Let's say I use the database to collect entries and then at some later point I want to do some clustering on the embeddings.
There are no instructions in the README for using vectordb's Dockerfiles. Should we work on that?
Hello,
I am using Python 3.10.9 and vectordb==0.0.20 (latest as of this date), and I am having trouble restoring saved data.
I have two large files, A and B. When I index, snapshot, and restore them separately, everything works fine.
When I read and parse files A and B, index all the documents from both, and then save them together, the snapshot succeeds. However, when I try to restore the data, I get the following error:
Traceback (most recent call last):
  ...
  File "~/.local/lib/python3.10/site-packages/vectordb/db/executors/inmemory_exact_indexer.py", line 86, in restore
    self._indexer = InMemoryExactNNIndex[self._input_schema](index_file_path=snapshot_file)
  File "~/.local/lib/python3.10/site-packages/docarray/index/backends/in_memory.py", line 68, in __init__
    self._docs = DocList.__class_getitem__(
  File "~/.local/lib/python3.10/site-packages/docarray/array/doc_list/io.py", line 810, in load_binary
    return cls._load_binary_all(
  File "~/.local/lib/python3.10/site-packages/docarray/array/doc_list/io.py", line 608, in _load_binary_all
    proto.ParseFromString(d)
google.protobuf.message.DecodeError: Error parsing message
Given the tests described above, I suspect the combined index is too large and that this is what raises the error. Does anyone know how to fix this?
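If the size suspicion is right (the thread does not confirm it), one thing worth trying is to keep each snapshot small by splitting the documents into fixed-size batches and giving each batch its own database and workspace. The helper below is only a sketch of the batching step. The name `chunked`, the batch size, and the per-shard workspace layout in the comments are assumptions for illustration, not part of vectordb.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists holding at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical usage (untested against vectordb; names and sizes are assumptions):
# for shard_id, batch in enumerate(chunked(all_docs, 100_000)):
#     db = InMemoryExactNNVectorDB[MyDoc](workspace=f"./workspace_{shard_id}")
#     db.index(inputs=DocList[MyDoc](batch))
```

With this layout a query would have to be sent to every shard and the results merged by score, so it trades snapshot size for extra work at search time.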
When instantiating an InMemoryExactNNVectorDB instance using pydantic>=2, an exception is raised about the __validators__ attribute being missing. This does not happen with pydantic<2.
Environment (conda):
vectordb==0.0.20
docarray==0.40.0
pydantic==2.3.0
pydantic-extra-types==2.3.0
pydantic-settings==2.1.0
pydantic_core==2.6.3
When I add the __validators__ attribute (even with a value of None), the bug is avoided and the database works as expected. I'm not sure whether it is then running with less validation than intended, since this line results in __validators__=None, but the DB is able to store and retrieve documents just fine.
from docarray import BaseDoc
from vectordb import InMemoryExactNNVectorDB

class BrokenDoc(BaseDoc):
    text: str
    # Uncomment this to work around the bug!
    # __validators__ = None

db = InMemoryExactNNVectorDB[BrokenDoc]()
Traceback (most recent call last):
  File "vdb_repro.py", line 8, in <module>
    db = InMemoryExactNNVectorDB[BrokenDoc]()
         ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/base.py", line 40, in __class_getitem__
    class VectorDBTyped(cls): # type: ignore
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/base.py", line 42, in VectorDBTyped
    _executor_cls: Type[TypedExecutor] = cls._executor_type[item]
                                         ~~~~~~~~~~~~~~~~~~^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/executors/typed_executor.py", line 71, in __class_getitem__
    output_schema = create_output_doc_type(input_schema)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/utils/create_doc_type.py", line 18, in create_output_doc_type
    __validators__=input_doc_type.__validators__,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/pydantic/_internal/_model_construction.py", line 210, in __getattr__
    raise AttributeError(item)
AttributeError: __validators__
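The traceback shows the failure comes from reading `input_doc_type.__validators__` directly. As a rough sketch of the same idea as the manual workaround, the lookup could fall back to None when the attribute is missing. This is not the project's actual fix: `get_validators` and the two stand-in classes are hypothetical, and the metaclass only imitates the `__getattr__` behaviour seen in the traceback so that the snippet runs without pydantic installed.

```python
from typing import Any, Optional


def get_validators(doc_type: type) -> Optional[Any]:
    """Return doc_type.__validators__ if it exists, otherwise None."""
    # getattr with a default swallows the AttributeError seen in the traceback.
    return getattr(doc_type, "__validators__", None)


class _RaisingMeta(type):
    """Stand-in metaclass: unknown class attributes raise AttributeError."""

    def __getattr__(cls, item: str) -> Any:
        raise AttributeError(item)


class V2StyleDoc(metaclass=_RaisingMeta):
    """Hypothetical doc type with no __validators__ attribute."""

    text: str = ""


class V1StyleDoc:
    """Hypothetical doc type that still exposes __validators__."""

    __validators__ = {"text": []}
```

Like the `__validators__ = None` workaround, this passes None downstream, so the open question about reduced validation applies here as well.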
Thank you for your great project.
It's really good.
My question is: does vectordb search documents using only the embedding?
I mean the search algorithm itself.
class ToyDoc(BaseDoc):
    text: str = ""
    embedding: NdArray[1536]
Suppose I define the ToyDoc class like this (following your examples).
What does the variable "text" do?
Is it just for storing information related to the embedding?
Or is some search algorithm applied to "text" as well?
@JoanFM
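The thread does not include an answer, so the following only illustrates one of the two possibilities raised in the question: a search that ranks documents purely by embedding similarity and carries "text" along as stored payload. Whether vectordb behaves this way is an assumption here, and `ToyRecord`, `cosine_similarity`, and `search` are hypothetical names written in plain Python, not vectordb's API.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ToyRecord:
    text: str                   # payload only: never read by search()
    embedding: Sequence[float]  # the only field used for ranking


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query_embedding: Sequence[float],
    records: List[ToyRecord],
    limit: int = 1,
) -> List[Tuple[float, ToyRecord]]:
    """Return the `limit` records whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_embedding, r.embedding), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

In this sketch, changing a record's text never changes the ranking; the text simply comes back with the match.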
Does vectordb use poetry to manage all of its packages?
The 3.20.0 update of Jina introduced a few new positional arguments to _FunctionWithSchema, including 'is_singleton_doc', 'parameters_is_pydantic_model', and 'parameters_model'. This seems to have broken the initialization of at least InMemoryExactNNVectorDB and HNSWVectorDB. Trying to create a db produces the following error:
TypeError Traceback (most recent call last)
Cell In[3], line 6
3 from vectordb import InMemoryExactNNVectorDB, HNSWVectorDB
5 # Specify your workspace path
----> 6 db = InMemoryExactNNVectorDB[ToyDoc](workspace='./workspace_path')
8 # Index a list of documents with random embeddings
9 doc_list = [ToyDoc(text=f'toy doc {i}', embedding=np.random.rand(128)) for i in range(1000)]
File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/vectordb/db/base.py:61, in VectorDB.__init__(self, *args, **kwargs)
59 kwargs['requests'] = REQUESTS_MAP
60 kwargs['runtime_args'] = {'workspace': self._workspace}
---> 61 self._executor = self._executor_cls(*args, **kwargs)
File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/jina/serve/executors/decorators.py:61, in avoid_concurrent_lock_cls.<locals>.avoid_concurrent_lock_wrapper.<locals>.arg_wrapper(self, *args, **kwargs)
59 return f
60 else:
---> 61 return func(self, *args, **kwargs)
File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/jina/serve/helper.py:73, in store_init_kwargs.<locals>.arg_wrapper(self, *args, **kwargs)
71 self._init_kwargs_dict = tmp
72 convert_tuple_to_list(self._init_kwargs_dict)
---> 73 f = func(self, *args, **kwargs)
74 return f
...
35 else:
36 self._requests[k] = _FunctionWithSchema(self._requests[k].fn, DocList[self._input_schema],
37 DocList[self._output_schema])
TypeError: _FunctionWithSchema.__new__() missing 3 required positional arguments: 'is_singleton_doc', 'parameters_is_pydantic_model', and 'parameters_model'
I fixed this temporarily by reverting to Jina 3.19.0.
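For anyone who wants to fail early instead of hitting the TypeError, a small version guard along these lines might help. It is only a sketch: `parse_version`, `jina_is_compatible`, and the 3.20.0 threshold are taken from this report or made up for illustration, and nothing here says whether later vectordb releases lifted the restriction.

```python
from typing import Tuple

# First Jina release reported above as breaking vectordb initialization.
FIRST_BROKEN_JINA = (3, 20, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '3.19.0' into (3, 19, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def jina_is_compatible(installed: str) -> bool:
    """True if the installed Jina version predates the reported breakage."""
    return parse_version(installed) < FIRST_BROKEN_JINA
```

The installed version string can be read with `importlib.metadata.version("jina")` and passed to `jina_is_compatible` before creating the db.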
I tried to use vectordb's hosted provision from Jina AI, using the commands mentioned in the docs:
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
from vectordb import HNSWVectorDB
import time
import glob

class LogoDoc(BaseDoc):
    embedding: NdArray[768]
    id: str

db = HNSWVectorDB[LogoDoc](
    workspace="hnsw_vectordb",
    space="ip",
    max_elements=2700000,
    ef_construction=256,
    M=16,
    num_threads=8,
)

if __name__ == "__main__":
    with db.serve() as service:
        service.block()
and tried to push my vectors using the client interface.
I have a collection of 2.5M 768-dimensional vectors to store in the db, so I decided to make batched calls to the db.index method with 64k vectors per call. The call did not respond, so I reduced the batch size to 2; the code then indexed at a speed of 5 s/it, with an estimated total time of 27 hours. (I assume this happens because the tree is constructed during each index call.)
It would be nice to speed up the process by letting the user push all the documents first and then build the tree through a separate, explicit API call:
db.push_documents([doc1 , doc2, doc3, ...])
db.build_tree()
which could replace
db.index()
During the build, CRUD operations could be blocked with an is_building_tree flag, and an error such as TreeCurrentlyBuildingError() raised if a CRUD operation is attempted.
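To make the proposal concrete, here is a toy sketch of the suggested interface. The names push_documents, build_tree, is_building_tree, and TreeCurrentlyBuildingError come from the request above. Everything else is an assumption: `DeferredBuildIndex` is a hypothetical class, documents are plain (id, vector) pairs, and a brute-force distance scan stands in for the HNSW structure, so this shows the control flow only and not how vectordb would implement it.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple


class TreeCurrentlyBuildingError(RuntimeError):
    """Raised when a CRUD call arrives while the index is being built."""


class DeferredBuildIndex:
    """Buffers pushed documents and only makes them searchable after build_tree()."""

    def __init__(self) -> None:
        self._pending: Dict[str, Sequence[float]] = {}
        self._built: Dict[str, Sequence[float]] = {}
        self.is_building_tree = False

    def _check_not_building(self) -> None:
        if self.is_building_tree:
            raise TreeCurrentlyBuildingError("index build in progress")

    def push_documents(self, docs: Iterable[Tuple[str, Sequence[float]]]) -> None:
        """Store documents without touching the searchable index."""
        self._check_not_building()
        for doc_id, vector in docs:
            self._pending[doc_id] = vector

    def build_tree(self) -> None:
        """Move all pending documents into the searchable index in one pass."""
        self._check_not_building()
        self.is_building_tree = True
        try:
            # A real implementation would construct the HNSW graph here.
            self._built.update(self._pending)
            self._pending.clear()
        finally:
            self.is_building_tree = False

    def search(self, query: Sequence[float], limit: int = 1) -> List[Tuple[str, float]]:
        """Brute-force nearest neighbours over the built documents only."""
        self._check_not_building()
        scored = [(doc_id, math.dist(query, vec)) for doc_id, vec in self._built.items()]
        scored.sort(key=lambda pair: pair[1])
        return scored[:limit]
```

In a single-threaded script the flag is never observed as True from outside build_tree; in a served deployment it would need to be paired with a lock so that concurrent requests see a consistent state.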
Run tests for serving with replicas, shards, workspace, directly from the context manager, blocking, etc.