jina-ai / vectordb

A Python vector database you just need - no more, no less.

License: Apache License 2.0

Shell 5.95% Python 93.56% Dockerfile 0.49%

vectordb's Issues

Fix HNSWVectorDB implementation

HNSWVectorDB does not seem to work because of issues in docarray.

For instance:

docarray/docarray#1589

Iterate over items in database

Is there a way to iterate over or retrieve all items in the database?
Let's say I use the database to collect entries and then at some later point I want to do some clustering on the embeddings.

Instructions for using the Dockerfile are missing from the README

The README has no instructions on how to use vectordb's Dockerfiles. Should we work on that?

Allow passing `filter` at search time

Think about deploying to cloud journey

Cannot restore large index

Hello,

I am using Python 3.10.9 and vectordb==0.0.20 (latest as of this date), and I have trouble restoring saved data.

I have two large files A and B, and when I index them, snapshot them and restore them separately, everything works fine.

When I read and parse files A and B, index all the documents in both, then save them together, the snapshotting is successful. However, when trying to restore the data, I get the following error:

Traceback (most recent call last):
  ...
  File "~/.local/lib/python3.10/site-packages/vectordb/db/executors/inmemory_exact_indexer.py", line 86, in restore
    self._indexer = InMemoryExactNNIndex[self._input_schema](index_file_path=snapshot_file)
  File "~/.local/lib/python3.10/site-packages/docarray/index/backends/in_memory.py", line 68, in __init__
    self._docs = DocList.__class_getitem__(
  File "~/.local/lib/python3.10/site-packages/docarray/array/doc_list/io.py", line 810, in load_binary
    return cls._load_binary_all(
  File "~/.local/lib/python3.10/site-packages/docarray/array/doc_list/io.py", line 608, in _load_binary_all
    proto.ParseFromString(d)
google.protobuf.message.DecodeError: Error parsing message

Given the tests described above, I suspect the combined index is too large and that this is what raises the error. Does anyone know what can be done to fix this issue?
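
One possible cause, not confirmed in this thread, is protobuf's hard limit of 2 GiB on a single serialized message: the traceback shows the snapshot being decoded through `proto.ParseFromString`. The sketch below gives a rough lower bound on the snapshot size from the embeddings alone. The function names, document counts, and dimension are illustrative assumptions, not values from the report.

```python
# Protobuf cannot serialize or parse a single message of 2 GiB or more.
PROTOBUF_MAX_BYTES = 2**31 - 1


def estimate_embedding_bytes(n_docs: int, dim: int, bytes_per_value: int = 8) -> int:
    """Lower bound on snapshot size: raw embedding values only, no text or framing."""
    return n_docs * dim * bytes_per_value


def may_exceed_protobuf_limit(n_docs: int, dim: int, bytes_per_value: int = 8) -> bool:
    """True if the embeddings alone would not fit in one protobuf message."""
    return estimate_embedding_bytes(n_docs, dim, bytes_per_value) > PROTOBUF_MAX_BYTES


# Hypothetical sizes: each file alone fits, both together do not.
one_file = may_exceed_protobuf_limit(150_000, 1536)
both_files = may_exceed_protobuf_limit(300_000, 1536)
```

If the estimate lands near or above the limit, saving the two files as separate snapshots (which the report says already works) would be consistent with this explanation.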

Missing `validators` property with `pydantic>=2`

Description

When instantiating an InMemoryExactNNVectorDB instance using pydantic>=2, an exception is raised about the __validators__ attribute being missing. This does not happen with pydantic<2.

Environment

M1 MacBookPro (M1 Max arm chip)
Python 3.11.5 (conda)
Packages:
- vectordb==0.0.20
- docarray==0.40.0
- pydantic==2.3.0
- pydantic-extra-types==2.3.0
- pydantic-settings==2.1.0
- pydantic_core==2.6.3

Workaround

When I add the __validators__ property (even with a value of None), the bug is avoided and the database works as expected. I'm not clear if it's functioning with less validation than desired since this line will result in __validators__=None, but the DB is able to store and retrieve documents just fine.

Repro

from docarray import BaseDoc
from vectordb import InMemoryExactNNVectorDB

class BrokenDoc(BaseDoc):
    text: str
    # Uncomment this to work around the bug!
    # __validators__ = None

db = InMemoryExactNNVectorDB[BrokenDoc]()

Error

Traceback (most recent call last):
  File "vdb_repro.py", line 8, in <module>
    db = InMemoryExactNNVectorDB[BrokenDoc]()
         ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/base.py", line 40, in __class_getitem__
    class VectorDBTyped(cls):  # type: ignore
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/base.py", line 42, in VectorDBTyped
    _executor_cls: Type[TypedExecutor] = cls._executor_type[item]
                                         ~~~~~~~~~~~~~~~~~~^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/executors/typed_executor.py", line 71, in __class_getitem__
    output_schema = create_output_doc_type(input_schema)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/utils/create_doc_type.py", line 18, in create_output_doc_type
    __validators__=input_doc_type.__validators__,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/pydantic/_internal/_model_construction.py", line 210, in __getattr__
    raise AttributeError(item)
AttributeError: __validators__
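
The last frames show `create_output_doc_type` reading `input_doc_type.__validators__` directly. A defensive lookup with `getattr` and a default would avoid the crash. The sketch below imitates the failing behaviour with a stand-in metaclass; every name in it is hypothetical and none comes from vectordb or pydantic. Whether running without the validators is acceptable is a separate question that the report leaves open.

```python
class RaisingMeta(type):
    """Stand-in for a model metaclass whose __getattr__ rejects unknown names."""

    def __getattr__(cls, item):
        raise AttributeError(item)


class DocWithoutValidators(metaclass=RaisingMeta):
    text: str = ""


class DocWithValidators(metaclass=RaisingMeta):
    __validators__ = {"text": []}


def get_validators(doc_type):
    """Return the class-level validators if present, otherwise None."""
    # getattr with a default swallows the AttributeError raised by the metaclass.
    return getattr(doc_type, "__validators__", None)


missing = get_validators(DocWithoutValidators)
present = get_validators(DocWithValidators)
```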

Think about containerization journey

Does vectordb search the document only with embedding?

Thank you for your great project.
It's really good.

My question is "Does vectordb search the document only with embedding?"
I mean the search algorithm.

from docarray import BaseDoc
from docarray.typing import NdArray

class ToyDoc(BaseDoc):
    text: str = ""
    embedding: NdArray[1536]

If I define the ToyDoc class like this (following your examples), what does the variable "text" do?

Is it only there to store information related to the embedding?

Or is a search algorithm applied to "text" as well?
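
As a rough illustration of the distinction being asked about, the sketch below ranks records purely by cosine similarity of their embeddings and carries `text` along as a stored payload. It is a toy model in plain Python, not vectordb's implementation, and the class and function names are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyRecord:
    text: str               # payload: returned with the hit, never compared
    embedding: list[float]  # the only field used for ranking


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_embedding, records, limit=2):
    """Brute-force nearest neighbours: sort every record by similarity to the query."""
    ranked = sorted(
        records,
        key=lambda record: cosine(query_embedding, record.embedding),
        reverse=True,
    )
    return ranked[:limit]


records = [
    ToyRecord("apple", [1.0, 0.0]),
    ToyRecord("banana", [0.0, 1.0]),
    ToyRecord("cherry", [0.7, 0.7]),
]
top = search([0.9, 0.1], records)
```

In this model the query never touches `text`; changing a record's text leaves the ranking unchanged.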

ignore this issue

Using poetry

@JoanFM
Does vectordb use poetry for managing all packages?

Pass all RuntimeConfigs to Docarray Executors

How can I search for particular stored items? The search could be on anything that is stored, for example a timestamp. Also, is there a way to erase all data belonging to a particular user?

Recent Jina update breaks vectordb initialization

The 3.20.0 update of Jina introduced a few new positional arguments to _FunctionWithSchema, including 'is_singleton_doc', 'parameters_is_pydantic_model', and 'parameters_model'. This seems to have broken the initialization of at least InMemoryExactNNVectorDB and HNSWVectorDB. When trying to create a db, the following error message is generated:

 TypeError                                 Traceback (most recent call last)
 Cell In[3], line 6
       3 from vectordb import InMemoryExactNNVectorDB, HNSWVectorDB
       5 # Specify your workspace path
 ----> 6 db = InMemoryExactNNVectorDB[ToyDoc](workspace='./workspace_path')
       8 # Index a list of documents with random embeddings
       9 doc_list = [ToyDoc(text=f'toy doc {i}', embedding=np.random.rand(128)) for i in range(1000)]
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/vectordb/db/base.py:61, in VectorDB.__init__(self, *args, **kwargs)
      59 kwargs['requests'] = REQUESTS_MAP
      60 kwargs['runtime_args'] = {'workspace': self._workspace}
 ---> 61 self._executor = self._executor_cls(*args, **kwargs)
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/jina/serve/executors/decorators.py:61, in avoid_concurrent_lock_cls.<locals>.avoid_concurrent_lock_wrapper.<locals>.arg_wrapper(self, *args, **kwargs)
      59     return f
      60 else:
 ---> 61     return func(self, *args, **kwargs)
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/jina/serve/helper.py:73, in store_init_kwargs.<locals>.arg_wrapper(self, *args, **kwargs)
      71     self._init_kwargs_dict = tmp
      72 convert_tuple_to_list(self._init_kwargs_dict)
 ---> 73 f = func(self, *args, **kwargs)
      74 return f
 ...
      35     else:
      36         self._requests[k] = _FunctionWithSchema(self._requests[k].fn, DocList[self._input_schema],
      37                                                 DocList[self._output_schema])
 
 TypeError: _FunctionWithSchema.__new__() missing 3 required positional arguments: 'is_singleton_doc', 'parameters_is_pydantic_model', and 'parameters_model'

I fixed this temporarily by reverting to Jina 3.19.0.
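
Until a fix lands, a guard like the following could fail fast when an incompatible Jina is installed. It is a minimal sketch that assumes, based only on this report, that releases below 3.20.0 work. The helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# First Jina release reported as broken in this issue.
FIRST_BROKEN = (3, 20, 0)


def parse_version(text: str) -> tuple:
    """Parse a plain numeric 'X.Y.Z' release string into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_known_good(jina_version: str) -> bool:
    """True if the version is older than the first release reported as broken."""
    return parse_version(jina_version) < FIRST_BROKEN


def check_installed_jina() -> None:
    """Raise early if the installed Jina is at or above the broken release."""
    try:
        installed = version("jina")
    except PackageNotFoundError:
        return  # Jina is not installed; nothing to check.
    if not is_known_good(installed):
        raise RuntimeError(
            f"jina {installed} is installed; this report found 3.19.0 to work"
        )
```

Calling `check_installed_jina()` before creating the db would turn the `TypeError` above into a clearer message. Pre-release strings such as `3.20.0rc1` are not handled by this simple parser.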

VectorDB hosted solution takes a lot of time to push vectors

I tried to use vectordb's hosted provision from Jina AI, using the commands mentioned in the docs:

from docarray import DocList, BaseDoc
from docarray.typing import NdArray
from vectordb import HNSWVectorDB
import time
import glob

class LogoDoc(BaseDoc):
    embedding: NdArray[768]
    id: str

db = HNSWVectorDB[LogoDoc](
    workspace="hnsw_vectordb",
    space="ip",
    max_elements=2700000,
    ef_construction=256,
    M=16,
    num_threads=8,
)

if __name__ == "__main__":
    with db.serve() as service:
        service.block()

and tried to push my vectors using the client interface.

I have a collection of 2.5M 768-dimensional vectors to store in the db, so I decided to make batched calls to the db.index method with 64k vectors per call. The code did not respond, so I changed the batch size to 2; it was then able to index at 5 s/it, with an estimated total time of 27 hours. (I assume this happens because the tree is being constructed during each index call.)
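
Before changing the API, it may be worth checking how much data each call carries. The sketch below splits a sequence into batches and estimates the raw embedding payload per batch. It assumes 4 bytes per value (float32) by default and ignores serialization overhead; the helper names are illustrative and not part of vectordb.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batch_payload_mib(batch_size: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw embedding bytes in one batch, expressed in MiB."""
    return batch_size * dim * bytes_per_value / 2**20


# 64,000 vectors of dimension 768 as float32.
large_batch_mib = batch_payload_mib(64_000, 768)
small_batches = [list(batch) for batch in batched([1, 2, 3, 4, 5], 2)]
```

A single batch of 64,000 such vectors is roughly 187.5 MiB before any encoding, so a batch size somewhere between 2 and 64k may be a more practical starting point.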

It would be nice if we could speed up the process by letting the user push all the documents first and then build the tree through a separate API call:

db.push_documents([doc1 , doc2, doc3, ...])
db.build_tree()

which could replace the

db.index()

During the build process, we could block CRUD operations with an is_building_tree flag and raise an error named TreeCurrentlyBuildingError() whenever a CRUD operation is attempted.
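
The proposal could look roughly like the toy class below. Everything here is hypothetical: `push_documents`, `build_tree`, `is_building_tree`, and `TreeCurrentlyBuildingError` are the names suggested in this issue, not existing vectordb API, and the build step is a placeholder rather than real HNSW construction.

```python
import threading


class TreeCurrentlyBuildingError(RuntimeError):
    """Raised when a CRUD operation is attempted while the index is being built."""


class DeferredIndex:
    """Toy index that collects documents first and builds in one explicit step."""

    def __init__(self):
        self._pending = []
        self._indexed = []
        self._lock = threading.Lock()
        self.is_building_tree = False

    def push_documents(self, docs):
        """Queue documents without touching the index structure."""
        with self._lock:
            if self.is_building_tree:
                raise TreeCurrentlyBuildingError("index build in progress")
            self._pending.extend(docs)

    def build_tree(self):
        """Move all queued documents into the index; returns the indexed count."""
        with self._lock:
            if self.is_building_tree:
                raise TreeCurrentlyBuildingError("index build in progress")
            self.is_building_tree = True
            to_build, self._pending = self._pending, []
        try:
            # Placeholder for the expensive graph construction.
            self._indexed.extend(to_build)
        finally:
            with self._lock:
                self.is_building_tree = False
        return len(self._indexed)
```

The flag is set and cleared under a lock so that concurrent callers see a consistent state; a real implementation would also need to decide whether searches are allowed during a build.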

Test serving in integration tests

Run tests that cover serving with replicas, shards, and workspaces, serving directly from the context manager, blocking, etc.