
vectordb's Issues

Iterate over items in database

Is there a way to iterate over or retrieve all items in the database?
For example, I use the database to collect entries, and at some later point I want to run clustering on the embeddings.
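
One possible workaround, sketched under the assumption that the database does not expose an iteration API: keep a side copy of every entry as it is indexed, so the embeddings are still available later for clustering. `EntryLog` and its methods are hypothetical names, not part of vectordb.

```python
from typing import Dict, Iterator, List, Tuple


class EntryLog:
    """Hypothetical side store that mirrors what gets indexed (not a vectordb API)."""

    def __init__(self) -> None:
        self._embeddings: Dict[str, List[float]] = {}

    def record(self, doc_id: str, embedding: List[float]) -> None:
        # Call this next to every indexing call so both stores stay in sync.
        self._embeddings[doc_id] = list(embedding)

    def __iter__(self) -> Iterator[Tuple[str, List[float]]]:
        # Iterate over (id, embedding) pairs in insertion order.
        return iter(self._embeddings.items())

    def matrix(self) -> List[List[float]]:
        # One row per entry, a layout most clustering libraries accept.
        return list(self._embeddings.values())


log = EntryLog()
log.record("a", [0.0, 1.0])
log.record("b", [1.0, 0.0])
ids = [doc_id for doc_id, _ in log]
```

The cost of this approach is holding every embedding in memory twice, so it only suits collections that comfortably fit in RAM.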

Cannot restore large index

Hello,

I am using Python 3.10.9 and vectordb==0.0.20 (the latest as of this date), and I have trouble restoring saved data.

I have two large files A and B, and when I index them, snapshot them and restore them separately, everything works fine.

When I read and parse files A and B, index all the documents in both, then save them together, the snapshotting is successful. However, when trying to restore the data, I get the following error:

Traceback (most recent call last):
  ...
  File "~/.local/lib/python3.10/site-packages/vectordb/db/executors/inmemory_exact_indexer.py", line 86, in restore
    self._indexer = InMemoryExactNNIndex[self._input_schema](index_file_path=snapshot_file)
  File "~/.local/lib/python3.10/site-packages/docarray/index/backends/in_memory.py", line 68, in __init__
    self._docs = DocList.__class_getitem__(
  File "~/.local/lib/python3.10/site-packages/docarray/array/doc_list/io.py", line 810, in load_binary
    return cls._load_binary_all(
  File "~/.local/lib/python3.10/site-packages/docarray/array/doc_list/io.py", line 608, in _load_binary_all
    proto.ParseFromString(d)
google.protobuf.message.DecodeError: Error parsing message

Given the tests described above, I suspect that the combined index is too large, and that this is what raises the error. Does anyone know what can be done to fix this issue?
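
A rough size check may help test that suspicion. Protocol Buffers cannot parse a single serialized message larger than 2 GiB, so if the whole snapshot is written as one message, a combined index above that size could fail on load while each half still works. The sketch below only estimates raw embedding bytes. The document counts and dimension are placeholder values, and whether the snapshot really is a single message is an assumption.

```python
PROTOBUF_MAX_BYTES = 2**31 - 1  # upper bound for one serialized protobuf message


def raw_embedding_bytes(num_docs: int, dim: int, bytes_per_value: int = 4) -> int:
    """Lower bound on snapshot size: embeddings only, ignoring text and framing."""
    return num_docs * dim * bytes_per_value


def fits_in_one_message(num_docs: int, dim: int, bytes_per_value: int = 4) -> bool:
    return raw_embedding_bytes(num_docs, dim, bytes_per_value) <= PROTOBUF_MAX_BYTES


# Placeholder numbers, not taken from the report above.
half = fits_in_one_message(num_docs=300_000, dim=1536)
both = fits_in_one_message(num_docs=600_000, dim=1536)
```

With these placeholder numbers, `half` is `True` and `both` is `False`, which would match the pattern of each file restoring alone but not together.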

Missing `__validators__` property with `pydantic>=2`

Description

When instantiating an InMemoryExactNNVectorDB instance using pydantic>=2, an exception is raised about the __validators__ attribute being missing. This does not happen with pydantic<2.

Environment

  • M1 MacBookPro (M1 Max arm chip)
  • Python 3.11.5 (conda)
  • Packages:
    • vectordb==0.0.20
    • docarray==0.40.0
    • pydantic==2.3.0
    • pydantic-extra-types==2.3.0
    • pydantic-settings==2.1.0
    • pydantic_core==2.6.3

Workaround

When I add the __validators__ property (even with a value of None), the bug is avoided and the database works as expected. I'm not clear if it's functioning with less validation than desired since this line will result in __validators__=None, but the DB is able to store and retrieve documents just fine.

Repro

from docarray import BaseDoc
from vectordb import InMemoryExactNNVectorDB

class BrokenDoc(BaseDoc):
    text: str
    # Uncomment this to work around the bug!
    # __validators__ = None

db = InMemoryExactNNVectorDB[BrokenDoc]()

Error

Traceback (most recent call last):
  File "vdb_repro.py", line 8, in <module>
    db = InMemoryExactNNVectorDB[BrokenDoc]()
         ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/base.py", line 40, in __class_getitem__
    class VectorDBTyped(cls):  # type: ignore
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/base.py", line 42, in VectorDBTyped
    _executor_cls: Type[TypedExecutor] = cls._executor_type[item]
                                         ~~~~~~~~~~~~~~~~~~^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/executors/typed_executor.py", line 71, in __class_getitem__
    output_schema = create_output_doc_type(input_schema)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/utils/create_doc_type.py", line 18, in create_output_doc_type
    __validators__=input_doc_type.__validators__,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/pydantic/_internal/_model_construction.py", line 210, in __getattr__
    raise AttributeError(item)
AttributeError: __validators__
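
The traceback ends at a direct read of `input_doc_type.__validators__`, an attribute that the pydantic 2 model in this report does not have. A defensive lookup with `getattr` and a default would avoid the `AttributeError`. The sketch below uses stand-in classes rather than real pydantic models, and `get_validators` is a hypothetical helper, not vectordb's actual fix.

```python
from typing import Any, Optional


class V1StyleModel:
    """Stand-in for a pydantic<2 model class that carries __validators__."""
    __validators__ = {"text": ["strip_whitespace"]}


class V2StyleModel:
    """Stand-in for a pydantic>=2 model class, which has no __validators__."""


def get_validators(doc_type: type) -> Optional[Any]:
    # getattr with a default returns None instead of raising AttributeError.
    return getattr(doc_type, "__validators__", None)
```

This yields the same `None` value as the manual workaround above, without requiring every document class to declare the attribute.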

Does vectordb search the document only with embedding?

Thank you for your great project.
It's really good.

My question is: does vectordb search documents using only the embedding?
I am asking about the search algorithm.

from docarray import BaseDoc
from docarray.typing import NdArray

class ToyDoc(BaseDoc):
    text: str = ""
    embedding: NdArray[1536]

If I define the ToyDoc class like this (following your examples),
what does the variable "text" do?

Is it only there to store information related to the embedding?

Or is some search algorithm applied to "text" as well?
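
To illustrate the distinction being asked about, the sketch below ranks documents by cosine similarity of their embeddings only, while `text` rides along as stored payload. This is an assumption about how an exact nearest-neighbour index typically behaves, not a description of vectordb's internals. `PlainDoc`, `cosine`, and `search` are hypothetical stand-ins.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PlainDoc:
    """Stand-in for ToyDoc without the docarray dependency."""
    text: str
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: List[float], docs: List[PlainDoc], limit: int = 1) -> List[PlainDoc]:
    # Only doc.embedding is compared; doc.text is never inspected.
    ranked = sorted(docs, key=lambda d: cosine(query, d.embedding), reverse=True)
    return ranked[:limit]


docs = [PlainDoc("apple", [1.0, 0.0]), PlainDoc("banana", [0.0, 1.0])]
best = search([0.9, 0.1], docs)[0]
```

Under that assumption, the text is returned with each match but plays no part in choosing it.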

Recent Jina update breaks vectordb initialization

The Jina 3.20.0 update introduced new required positional arguments to _FunctionWithSchema, including 'is_singleton_doc', 'parameters_is_pydantic_model', and 'parameters_model'. This seems to have broken the initialization of at least InMemoryExactNNVectorDB and HNSWVectorDB. When trying to create a db, the following error is raised:

 TypeError                                 Traceback (most recent call last)
 Cell In[3], line 6
       3 from vectordb import InMemoryExactNNVectorDB, HNSWVectorDB
       5 # Specify your workspace path
 ----> 6 db = InMemoryExactNNVectorDB[ToyDoc](workspace='./workspace_path')
       8 # Index a list of documents with random embeddings
       9 doc_list = [ToyDoc(text=f'toy doc {i}', embedding=np.random.rand(128)) for i in range(1000)]
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/vectordb/db/base.py:61, in VectorDB.__init__(self, *args, **kwargs)
      59 kwargs['requests'] = REQUESTS_MAP
      60 kwargs['runtime_args'] = {'workspace': self._workspace}
 ---> 61 self._executor = self._executor_cls(*args, **kwargs)
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/jina/serve/executors/decorators.py:61, in avoid_concurrent_lock_cls.<locals>.avoid_concurrent_lock_wrapper.<locals>.arg_wrapper(self, *args, **kwargs)
      59     return f
      60 else:
 ---> 61     return func(self, *args, **kwargs)
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/jina/serve/helper.py:73, in store_init_kwargs.<locals>.arg_wrapper(self, *args, **kwargs)
      71     self._init_kwargs_dict = tmp
      72 convert_tuple_to_list(self._init_kwargs_dict)
 ---> 73 f = func(self, *args, **kwargs)
      74 return f
 ...
      35     else:
      36         self._requests[k] = _FunctionWithSchema(self._requests[k].fn, DocList[self._input_schema],
      37                                                 DocList[self._output_schema])
 
 TypeError: _FunctionWithSchema.__new__() missing 3 required positional arguments: 'is_singleton_doc', 'parameters_is_pydantic_model', and 'parameters_model'

I fixed it temporarily by reverting to Jina 3.19.0.
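
Until the incompatibility is resolved, a version guard can fail fast with a clearer message. The sketch assumes, based on the report above, that releases from 3.20.0 onward are affected. `parse_version` and `is_affected` are hypothetical helpers, and the parsing handles only plain `X.Y.Z` version strings.

```python
from typing import Tuple

FIRST_BROKEN = (3, 20, 0)  # assumption taken from the report above


def parse_version(version: str) -> Tuple[int, ...]:
    # Handles plain "X.Y.Z" strings only; pre-release suffixes are not supported.
    return tuple(int(part) for part in version.split(".")[:3])


def is_affected(jina_version: str) -> bool:
    return parse_version(jina_version) >= FIRST_BROKEN
```

In practice the installed version could be read with `importlib.metadata.version("jina")` and passed to `is_affected` before the database is created.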

VectorDB hosted solution takes a lot of time to push vectors

I tried to use vectordb's hosted provision from Jina AI, using the commands mentioned in the docs:

from docarray import DocList, BaseDoc
from docarray.typing import NdArray
from vectordb import HNSWVectorDB
import time
import glob

class LogoDoc(BaseDoc):
    embedding: NdArray[768]
    id: str

db = HNSWVectorDB[LogoDoc](
    workspace="hnsw_vectordb",
    space="ip",
    max_elements=2700000,
    ef_construction=256,
    M=16,
    num_threads=8,
)

if __name__ == "__main__":
    with db.serve() as service:
        service.block()

I then tried to push my vectors using the client interface.

I have a collection of 2.5M 768-dimensional vectors to store in the db, so I decided to make batched calls to the db.index method with 64k vectors per call. The code did not respond, so I changed the batch size to 2. The code was then able to index at a speed of 5 s/it, and the estimated total time was 27 hours. (I assume this happens because the tree construction runs during each index call.)
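
For reference, batched indexing along these lines can be sketched as follows. `batched` is a hypothetical helper; each yielded batch would be passed to the client's index call, which is left out here.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


batches = list(batched(range(5), 2))
```

Trying a few intermediate batch sizes between 2 and 64k this way may show where the calls stop responding.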

It would be nice if we could speed up the process by letting the user push all the documents first and then perform the tree construction through a separate, explicit API call:

db.push_documents([doc1 , doc2, doc3, ...])
db.build_tree()

which could replace the

db.index()

During the build process, we could block CRUD operations with an is_building_tree flag and raise an error named TreeCurrentlyBuildingError when a CRUD operation is attempted.
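
The proposed interface could look roughly like this. Everything here is hypothetical and mirrors the names suggested above (`push_documents`, `build_tree`, `is_building_tree`, `TreeCurrentlyBuildingError`); none of it exists in vectordb, `DeferredIndex` is an invented class name, and the build step is a placeholder.

```python
import threading
from typing import Any, List


class TreeCurrentlyBuildingError(RuntimeError):
    """Raised when a CRUD call arrives while the index is being built."""


class DeferredIndex:
    def __init__(self) -> None:
        self._pending: List[Any] = []
        self._indexed: List[Any] = []
        self._lock = threading.Lock()
        self.is_building_tree = False

    def push_documents(self, docs: List[Any]) -> None:
        # Cheap append; no index construction happens here.
        with self._lock:
            if self.is_building_tree:
                raise TreeCurrentlyBuildingError("index build in progress")
            self._pending.extend(docs)

    def build_tree(self) -> int:
        with self._lock:
            if self.is_building_tree:
                raise TreeCurrentlyBuildingError("index build in progress")
            self.is_building_tree = True
            to_build, self._pending = self._pending, []
        try:
            # Placeholder for the real one-shot index construction.
            self._indexed.extend(to_build)
            return len(self._indexed)
        finally:
            self.is_building_tree = False
```

The flag is set while holding the lock and cleared in a `finally` block, so a failed build does not leave the index permanently blocked.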
