
vectordb's Introduction

VectorDB from Jina AI logo

A Python vector database you just need - no more, no less.


vectordb is a Pythonic vector database that offers a comprehensive suite of CRUD (Create, Read, Update, Delete) operations and robust scalability options, including sharding and replication. It's readily deployable in a variety of environments, from local to on-premise and cloud. vectordb delivers exactly what you need - no more, no less. It's a testament to effective Pythonic design without over-engineering, making it a lean yet powerful solution.

vectordb capitalizes on the powerful retrieval prowess of DocArray and the scalability, reliability, and serving capabilities of Jina. Here's the magic: DocArray serves as the engine driving vector search logic, while Jina guarantees efficient and scalable index serving. This synergy culminates in a robust, yet user-friendly vector database experience - that's vectordb for you.

Install

pip install vectordb
You can use vectordb from Jina AI locally, as a service, or on Jina AI Cloud.

Getting started with vectordb locally

  1. Kick things off by defining a Document schema with the DocArray dataclass syntax:
from docarray import BaseDoc
from docarray.typing import NdArray

class ToyDoc(BaseDoc):
  text: str = ''
  embedding: NdArray[128]
  2. Opt for a pre-built database (like InMemoryExactNNVectorDB or HNSWVectorDB), and apply the schema:
from docarray import DocList
import numpy as np
from vectordb import InMemoryExactNNVectorDB, HNSWVectorDB

# Specify your workspace path
db = InMemoryExactNNVectorDB[ToyDoc](workspace='./workspace_path')

# Index a list of documents with random embeddings
doc_list = [ToyDoc(text=f'toy doc {i}', embedding=np.random.rand(128)) for i in range(1000)]
db.index(inputs=DocList[ToyDoc](doc_list))

# Perform a search query
query = ToyDoc(text='query', embedding=np.random.rand(128))
results = db.search(inputs=DocList[ToyDoc]([query]), limit=10)

# Print out the matches
for m in results[0].matches:
  print(m)

Since we issued a single query, results contains only one element. The nearest neighbour search results are conveniently stored in the .matches attribute.
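
Each result also carries a scores attribute alongside matches (see the CRUD support section below). As a minimal sketch, assuming the results object from the query above and that scores is aligned with matches, you could print each match together with its relevance score:

# Sketch: matches and their relevance scores are sorted by relevance, best first
for match, score in zip(results[0].matches, results[0].scores):
    print(match.text, score)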

Getting started with vectordb as a service

vectordb is designed to be easily served as a service, supporting gRPC, HTTP, and WebSocket communication protocols.

Server Side

On the server side, you would start the service as follows:

with db.serve(protocol='grpc', port=12345, replicas=1, shards=1) as service:
   service.block()

This command starts vectordb as a service on port 12345, using the gRPC protocol with 1 replica and 1 shard.

Client Side

On the client side, you can access the service with the following commands:

from vectordb import Client

# Instantiate a client connected to the server. In practice, replace 0.0.0.0 with the server's IP address.
client = Client[ToyDoc](address='grpc://0.0.0.0:12345')

# Perform a search query
results = client.search(inputs=DocList[ToyDoc]([query]), limit=10)

This allows you to perform a search query, receiving the results directly from the remote vectordb service.
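
Because the client exposes the same CRUD API as the local database (see the CRUD support section below), you can also index documents remotely. A minimal sketch, assuming the ToyDoc schema and the client instance from above:

from docarray import DocList
import numpy as np

# Index documents through the remote service, same call signature as the local db.index
docs = DocList[ToyDoc]([ToyDoc(text=f'remote doc {i}', embedding=np.random.rand(128)) for i in range(100)])
client.index(inputs=docs)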

Hosting vectordb on Jina AI Cloud

You can seamlessly deploy your vectordb instance to Jina AI Cloud, which ensures access to your database from any location.

Start by embedding your database instance or class into a Python file:

# example.py
from docarray import BaseDoc
from docarray.typing import NdArray
from vectordb import InMemoryExactNNVectorDB

class ToyDoc(BaseDoc):
    text: str = ''
    embedding: NdArray[128]

db = InMemoryExactNNVectorDB[ToyDoc](workspace='./vectordb') # notice how `db` is the instance that we want to serve

if __name__ == '__main__':
    # IMPORTANT: make sure to protect this part of the code using __main__ guard
    with db.serve() as service:
        service.block()

Next, follow these steps to deploy your instance:

  1. If you haven't already, sign up for a Jina AI Cloud account.

  2. Use the jc command line tool to log in to your Jina AI Cloud account:

jc login

  3. Deploy your instance:
vectordb deploy --db example:db

Connect from the client

After deployment, use the vectordb Client to access the assigned endpoint:

from vectordb import Client

# replace ID with the ID of your deployed DB, as shown in the deployment output
c = Client(address='grpcs://ID.wolf.jina.ai')

Manage your deployed instances using jcloud

You can then list, pause, resume, or delete your deployed DBs with the jcloud command:

jcloud list ID

jcloud pause ID or jcloud resume ID

jcloud remove ID

Advanced Topics

What is a vector database?

Vector databases serve as sophisticated repositories for embeddings, capturing the essence of semantic similarity among disparate objects. These databases facilitate similarity searches across a myriad of multimodal data types, paving the way for a new era of information retrieval. By providing contextual understanding and enriching generation results, vector databases greatly enhance the performance and utility of Large Language Models (LLMs). This underscores their pivotal role in the evolution of data science and machine learning applications.

CRUD support

Both the local library usage and the client-server interactions in vectordb share the same API. This provides index, search, update, and delete functionality (a usage sketch follows the list):

  • index: Accepts a DocList to index.
  • search: Takes a DocList of batched queries or a single BaseDoc as a single query. It returns either single or multiple results, each with matches and scores attributes sorted by relevance.
  • delete: Accepts a DocList of documents to remove from the index. Only the id attribute is necessary, so make sure to track the indexed IDs if you need to delete documents.
  • update: Accepts a DocList of documents to update in the index. The update operation replaces the indexed document that has the same id with the attributes and payload of the input document.
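
As a rough usage sketch of the update and delete calls (assuming the ToyDoc schema and the doc_list indexed in the local example above; the id handling here is illustrative):

from docarray import DocList
import numpy as np

# Keep track of the id of a previously indexed document
some_id = doc_list[0].id

# update: replaces the indexed document that has the same id
db.update(inputs=DocList[ToyDoc]([ToyDoc(id=some_id, text='updated text', embedding=np.random.rand(128))]))

# delete: only the id attribute is needed; the embedding is filled just to satisfy the schema
db.delete(inputs=DocList[ToyDoc]([ToyDoc(id=some_id, embedding=np.random.rand(128))]))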

Service endpoint configuration

You can serve vectordb and access it from a client with the following parameters (a serving sketch follows the list):

  • protocol: The serving protocol. It can be gRPC, HTTP, WebSocket, or a combination of them, provided as a list. Default is gRPC.
  • port: The service access port. Can be a list of ports for each provided protocol. Default is 8081.
  • workspace: The path where the VectorDB persists required data. Default is '.' (current directory).
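
For instance, a sketch of serving over both gRPC and HTTP at the same time, each protocol on its own port (values are illustrative):

# One port per protocol, matched by position
with db.serve(protocol=['grpc', 'http'], port=[12345, 12346], workspace='./workspace_path') as service:
    service.block()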

Scaling your DB

You can set two scaling parameters when serving or deploying your vector databases with vectordb (see the sketch after the list):

  • Shards: The number of data shards. This improves latency, as vectordb ensures Documents are indexed in only one of the shards. Search requests are sent to all shards and results are merged.
  • Replicas: The number of DB replicas. vectordb uses the RAFT algorithm to sync the index between replicas of each shard. This increases service availability and search throughput, as multiple replicas can respond in parallel to more search requests while allowing CRUD operations. Note: In JCloud deployments, the number of replicas is set to 1. We're working on enabling replication in the cloud.
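
Both parameters are passed to serve (or deploy) in the same way as the earlier serving example; a sketch with arbitrary values:

# Two shards, each replicated three times (replicas > 1 currently applies to local/on-premise serving)
with db.serve(protocol='grpc', port=12345, shards=2, replicas=3) as service:
    service.block()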

Vector search configuration

Here are the parameters for each VectorDB type:

InMemoryExactNNVectorDB

This database performs exhaustive search on embeddings and has limited configuration settings:

  • workspace: The folder where required data is persisted.
InMemoryExactNNVectorDB[MyDoc](workspace='./vectordb')
InMemoryExactNNVectorDB[MyDoc].serve(workspace='./vectordb')

HNSWVectorDB

This database employs the HNSW (Hierarchical Navigable Small World) algorithm from HNSWLib for Approximate Nearest Neighbor search. It provides several configuration options:

  • workspace: Specifies the directory where required data is stored and persisted.

Additionally, HNSWVectorDB offers a set of configurations that allow tuning the performance and accuracy of the Nearest Neighbor search algorithm. Detailed descriptions of these configurations can be found in the HNSWLib README (a construction sketch follows the list):

  • space: Specifies the similarity metric used for the space (options are "l2", "ip", or "cosine"). The default is "l2".
  • max_elements: Sets the initial capacity of the index, which can be increased dynamically. The default is 1024.
  • ef_construction: This parameter controls the speed/accuracy trade-off during index construction. The default is 200.
  • ef: This parameter controls the query time/accuracy trade-off. The default is 10.
  • M: This parameter defines the maximum number of outgoing connections in the graph. The default is 16.
  • allow_replace_deleted: If set to True, this allows replacement of deleted elements with newly added ones. The default is False.
  • num_threads: This sets the default number of threads to be used during index and search operations. The default is 1.
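
Putting these together, a construction sketch with illustrative values (assuming the ToyDoc schema from the earlier examples):

from vectordb import HNSWVectorDB

db = HNSWVectorDB[ToyDoc](
    workspace='./hnsw_workspace',   # where the index is persisted
    space='cosine',                 # similarity metric: 'l2', 'ip' or 'cosine'
    max_elements=100_000,           # initial index capacity
    ef_construction=200,            # build-time speed/accuracy trade-off
    ef=50,                          # query-time speed/accuracy trade-off
    M=16,                           # max outgoing graph connections
    num_threads=4,                  # threads for index/search operations
)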

Command line interface

vectordb includes a simple CLI for serving and deploying your database:

  • Serve your DB locally: vectordb serve --db example:db
  • Deploy your DB on Jina AI Cloud: vectordb deploy --db example:db

Features

  • User-friendly Interface: With vectordb, simplicity is key. Its intuitive interface is designed to accommodate users across varying levels of expertise.

  • Minimalistic Design: vectordb packs all the essentials, with no unnecessary complexity. It ensures a seamless transition from local to server and cloud deployment.

  • Full CRUD Support: From indexing and searching to updating and deleting, vectordb covers the entire spectrum of CRUD operations.

  • DB as a Service: Harness the power of gRPC, HTTP, and WebSocket protocols with vectordb. It enables you to serve your databases and perform insertion and search operations efficiently.

  • Scalability: Experience the raw power of vectordb's deployment capabilities, including robust scalability features like sharding and replication. Improve your service latency with sharding, while replication enhances availability and throughput.

  • Cloud Deployment: Deploying your service in the cloud is a breeze with Jina AI Cloud. More deployment options are coming soon!

  • Serverless Capability: vectordb can be deployed in a serverless mode in the cloud, ensuring optimal resource utilization and data availability as per your needs.

  • Multiple ANN Algorithms: vectordb offers diverse implementations of Approximate Nearest Neighbors (ANN) algorithms. Here are the current offerings, with more integrations on the horizon:

    • InMemoryExactNNVectorDB (Exact NN Search): Implements a simple exact nearest-neighbor search.
    • HNSWVectorDB (based on HNSW): Utilizes HNSWLib

Roadmap

The future of vectordb looks bright, and we have ambitious plans! Here's a sneak peek into the features we're currently developing:

  • More ANN Search Algorithms: Our goal is to support an even wider range of ANN search algorithms.
  • Enhanced Filtering Capabilities: We're working on enhancing our ANN Search solutions to support advanced filtering.
  • Customizability: We aim to make vectordb highly customizable, allowing Python developers to tailor its behavior to their specific needs with ease.
  • Expanding Serverless Capacity: We're striving to enhance the serverless capacity of vectordb in the cloud. While we currently support scaling between 0 and 1 replica, our goal is to extend this to 0 to N replicas.
  • Expanded Deployment Options: We're actively working on facilitating the deployment of vectordb across various cloud platforms, with a broad range of options.

Need help with vectordb? Interested in using it but require certain features to meet your unique needs? Don't hesitate to reach out to us. Join our Discord community to chat with us and other community members.

Contributing

The VectorDB project is backed by Jina AI and licensed under Apache-2.0. Contributions from the community are greatly appreciated! If you have an idea for a new feature or an improvement, we would love to hear from you. We're always looking for ways to make vectordb more user-friendly and effective.

vectordb's People

Contributors

0x376h, ai-naymul, deepankarm, gabe-l-hart, hanxiao, jina-bot, joanfm


vectordb's Issues

Does vectordb search the document only with embedding?

Thank you for your great project.
It's really good.

My question is "Does vectordb search the document only with embedding?"
I mean the search algorithm.

class ToyDoc(BaseDoc):
    text: str = ""
    embedding: NdArray[1536] 

If I set the ToyDoc class like this (referring to your examples), what does the variable "text" do?

Is it just for saving information related to the embedding?

Or is some search algorithm applied to "text" too?

Iterate over items in the database

Is there a way to iterate over / retrieve all items in the database?
Let's say I use the database to collect entries and then at some later point I want to do some clustering on the embeddings.

Missing `__validators__` property with `pydantic>=2`

Description

When instantiating an InMemoryExactNNVectorDB instance using pydantic>=2, an exception is raised about the __validators__ attribute being missing. This does not happen with pydantic<2.

Environment

  • M1 MacBookPro (M1 Max arm chip)
  • Python 3.11.5 (conda)
  • Packages:
    • vectordb==0.0.20
    • docarray==0.40.0
    • pydantic==2.3.0
    • pydantic-extra-types==2.3.0
    • pydantic-settings==2.1.0
    • pydantic_core==2.6.3

Workaround

When I add the __validators__ property (even with a value of None), the bug is avoided and the database works as expected. I'm not clear if it's functioning with less validation than desired since this line will result in __validators__=None, but the DB is able to store and retrieve documents just fine.

Repro

from docarray import BaseDoc
from vectordb import InMemoryExactNNVectorDB

class BrokenDoc(BaseDoc):
    text: str
    # Uncomment this to work around the bug!
    # __validators__ = None

db = InMemoryExactNNVectorDB[BrokenDoc]()

Error

Traceback (most recent call last):
  File "vdb_repro.py", line 8, in <module>
    db = InMemoryExactNNVectorDB[BrokenDoc]()
         ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/base.py", line 40, in __class_getitem__
    class VectorDBTyped(cls):  # type: ignore
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/base.py", line 42, in VectorDBTyped
    _executor_cls: Type[TypedExecutor] = cls._executor_type[item]
                                         ~~~~~~~~~~~~~~~~~~^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/db/executors/typed_executor.py", line 71, in __class_getitem__
    output_schema = create_output_doc_type(input_schema)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/vectordb/utils/create_doc_type.py", line 18, in create_output_doc_type
    __validators__=input_doc_type.__validators__,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/env/lib/python3.11/site-packages/pydantic/_internal/_model_construction.py", line 210, in __getattr__
    raise AttributeError(item)
AttributeError: __validators__

VectorDB hosted solution takes a lot of time to push vectors

I tried to make use of vectordb's hosted provision from jina ai, using commands mentioned in the docs

from docarray import DocList, BaseDoc
from docarray.typing import NdArray
from vectordb import HNSWVectorDB
import time
import glob

class LogoDoc(BaseDoc):
    embedding: NdArray[768]
    id: str

db = HNSWVectorDB[LogoDoc](
    workspace="hnsw_vectordb",
    space="ip",
    max_elements=2700000,
    ef_construction=256,
    M=16,
    num_threads=8,
)

if __name__ == "__main__":
    with db.serve() as service:
        service.block()

and tried to push my vectors using the client interface

I have a collection of 2.5M 768-dimensional vectors to be stored in the db, so I decided to make batched calls to the db.index method with 64k vectors in each call. The code didn't respond, so I tried changing the batch size to 2; the code was then able to index at a speed of 5 s/it, with an estimated total time of 27 hours. (I assume this is happening because the tree construction happens during each index call.)

It would be nice if we could speedup the process by asking the user to push all the documents at first and then perform tree construction upon another specific api call

db.push_documents([doc1 , doc2, doc3, ...])
db.build_tree()

which could replace the

db.index()

and during the build process we could easily block CRUD operations with an is_building_tree flag and throw an error named TreeCurrentlyBuildingError() when CRUD operations are attempted.

Recent Jina update breaks vectordb initialization

The 3.20.0 update on Jina introduced a few new positional arguments to _FunctionWithSchema, including 'is_singleton_doc', 'parameters_is_pydantic_model', and 'parameters_model'. This seems to have broken the initialization of at least the InMemoryExactNNVectorDB and HNSWVectorDB. When trying to create a db, the following error message is generated:

 TypeError                                 Traceback (most recent call last)
 Cell In[3], line 6
       3 from vectordb import InMemoryExactNNVectorDB, HNSWVectorDB
       5 # Specify your workspace path
 ----> 6 db = InMemoryExactNNVectorDB[ToyDoc](workspace='./workspace_path')
       8 # Index a list of documents with random embeddings
       9 doc_list = [ToyDoc(text=f'toy doc {i}', embedding=np.random.rand(128)) for i in range(1000)]
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/vectordb/db/base.py:61, in VectorDB.__init__(self, *args, **kwargs)
      59 kwargs['requests'] = REQUESTS_MAP
      60 kwargs['runtime_args'] = {'workspace': self._workspace}
 ---> 61 self._executor = self._executor_cls(*args, **kwargs)
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/jina/serve/executors/decorators.py:61, in avoid_concurrent_lock_cls..avoid_concurrent_lock_wrapper..arg_wrapper(self, *args, **kwargs)
      59     return f
      60 else:
 ---> 61     return func(self, *args, **kwargs)
 
 File ~/Documents/projects/git/vc/venv/lib/python3.11/site-packages/jina/serve/helper.py:73, in store_init_kwargs..arg_wrapper(self, *args, **kwargs)
      71     self._init_kwargs_dict = tmp
      72 convert_tuple_to_list(self._init_kwargs_dict)
 ---> 73 f = func(self, *args, **kwargs)
      74 return f
 ...
      35     else:
      36         self._requests[k] = _FunctionWithSchema(self._requests[k].fn, DocList[self._input_schema],
      37                                                 DocList[self._output_schema])
 
 TypeError: _FunctionWithSchema.__new__() missing 3 required positional arguments: 'is_singleton_doc', 'parameters_is_pydantic_model', and 'parameters_model'

I fixed this temporarily by reverting to Jina 3.19.0.

Cannot restore large index

Hello,

I am using Python 3.10.9 and vectordb==0.0.20 (the latest as of this date), and I have trouble restoring saved data.

I have two large files A and B, and when I index them, snapshot them and restore them separately, everything works fine.

When I read and parse files A and B, index all the documents in both, then save them together, the snapshotting is successful. However, when trying to restore the data, I get the following error:

Traceback (most recent call last):
  ...
  File "~/.local/lib/python3.10/site-packages/vectordb/db/executors/inmemory_exact_indexer.py", line 86, in restore
    self._indexer = InMemoryExactNNIndex[self._input_schema](index_file_path=snapshot_file)
  File "~/.local/lib/python3.10/site-packages/docarray/index/backends/in_memory.py", line 68, in __init__
    self._docs = DocList.__class_getitem__(
  File "~/.local/lib/python3.10/site-packages/docarray/array/doc_list/io.py", line 810, in load_binary
    return cls._load_binary_all(
  File "~ /.local/lib/python3.10/site-packages/docarray/array/doc_list/io.py", line 608, in _load_binary_all
    proto.ParseFromString(d)
google.protobuf.message.DecodeError: Error parsing message

Given previous tests I made, I suspect the issue is that the index is too large, hence the error. Does anyone know what can be done to fix this issue?
