epsilla-cloud / vectordb

Epsilla is a high performance Vector Database Management System. Try out hosted Epsilla at https://cloud.epsilla.com/

Home Page: https://www.epsilla.com

License: GNU General Public License v3.0

Languages: CMake 0.70%, Shell 0.79%, C++ 97.39%, Dockerfile 0.07%, Python 1.06%
Topics: ai infrastructure llms chatgpt data data-science database embeddings machine-learning rag

vectordb's Introduction

A 10x faster, cheaper, and better vector database


Epsilla is an open-source vector database. Our focus is on ensuring scalability, high performance, and cost-effectiveness of vector search. EpsillaDB bridges the gap between information retrieval and memory retention in Large Language Models.

Quick Start using Docker

1. Run Backend in Docker

docker pull epsilla/vectordb
docker run --pull=always -d -p 8888:8888 -v /data:/data epsilla/vectordb

2. Interact with Python Client

pip install pyepsilla
from pyepsilla import vectordb

client = vectordb.Client(host='localhost', port='8888')
client.load_db(db_name="MyDB", db_path="/data/epsilla")
client.use_db(db_name="MyDB")

client.create_table(
    table_name="MyTable",
    table_fields=[
        {"name": "ID", "dataType": "INT", "primaryKey": True},
        {"name": "Doc", "dataType": "STRING"},
    ],
    indices=[
      {"name": "Index", "field": "Doc"},
    ]
)

client.insert(
    table_name="MyTable",
    records=[
        {"ID": 1, "Doc": "Jupiter is the largest planet in our solar system."},
        {"ID": 2, "Doc": "Cheetahs are the fastest land animals, reaching speeds over 60 mph."},
        {"ID": 3, "Doc": "Vincent van Gogh painted the famous work \"Starry Night.\""},
        {"ID": 4, "Doc": "The Amazon River is the longest river in the world."},
        {"ID": 5, "Doc": "The Moon completes one orbit around Earth every 27 days."},
    ],
)

client.query(
    table_name="MyTable",
    query_text="Celestial bodies and their characteristics",
    limit=2
)

# Result
# {
#     'message': 'Query search successfully.',
#     'result': [
#         {'Doc': 'Jupiter is the largest planet in our solar system.', 'ID': 1},
#         {'Doc': 'The Moon completes one orbit around Earth every 27 days.', 'ID': 5}
#     ],
#     'statusCode': 200
# }
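
The result above is a plain dictionary, so pulling out the matched documents takes only a few lines. The following is a minimal sketch, assuming you already hold a response dictionary with the shape shown in the comment above. The helper name docs_from_response is hypothetical and not part of pyepsilla.

```python
def docs_from_response(response):
    """Return the 'Doc' values from a response shaped like the result above.

    Hypothetical helper for illustration; it only relies on the
    'statusCode', 'message', and 'result' keys shown in the README.
    """
    if response.get("statusCode") != 200:
        raise RuntimeError(f"Query failed: {response.get('message')}")
    return [row["Doc"] for row in response.get("result", [])]


# The sample response copied from the result shown above.
sample_response = {
    "message": "Query search successfully.",
    "result": [
        {"Doc": "Jupiter is the largest planet in our solar system.", "ID": 1},
        {"Doc": "The Moon completes one orbit around Earth every 27 days.", "ID": 5},
    ],
    "statusCode": 200,
}

docs = docs_from_response(sample_response)
for doc in docs:
    print(doc)
```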

Features:

  • High performance and production-scale similarity search for embedding vectors.

  • Full-fledged database management system with familiar database, table, and field concepts. Vector is just another field type.

  • Metadata filtering.

  • Hybrid search with a fusion of dense and sparse vectors.

  • Built-in embedding support, for a natural-language-in, natural-language-out search experience.

  • Cloud native architecture with compute storage separation, serverless, and multi-tenancy.

  • Rich ecosystem integrations including LangChain and LlamaIndex.

  • Python/JavaScript/Ruby clients, and REST API interface.

Epsilla's core is written in C++ and leverages advanced academic parallel graph traversal techniques for vector indexing, achieving 10 times faster vector search than HNSW while maintaining precision levels of over 99.9%.

Epsilla Cloud

Try our fully managed vector DBaaS at Epsilla Cloud

(Experimental) Use Epsilla as a Python library without starting a Docker image

1. Build Epsilla Python Bindings lib package

cd engine/scripts
# If on Ubuntu, run this first: bash setup-dev.sh
bash install_oatpp_modules.sh
cd ..
bash build.sh
ls -lh build/*.so

2. Run the test with the Python bindings libraries "epsilla.so" and "libvectordb_dylib.so" in the "build" folder, built in the previous step

cd engine
export PYTHONPATH=./build/
export DB_PATH=/tmp/db33
python3 test/bindings/python/test.py

Here is some sample code:

import epsilla

epsilla.load_db(db_name="db", db_path="/data/epsilla")
epsilla.use_db(db_name="db")
epsilla.create_table(
    table_name="MyTable",
    table_fields=[
        {"name": "ID", "dataType": "INT", "primaryKey": True},
        {"name": "Doc", "dataType": "STRING"},
        {"name": "EmbeddingEuclidean", "dataType": "VECTOR_FLOAT", "dimensions": 4, "metricType": "EUCLIDEAN"}
    ]
)
epsilla.insert(
    table_name="MyTable",
    records=[
        {"ID": 1, "Doc": "Berlin", "EmbeddingEuclidean": [0.05, 0.61, 0.76, 0.74]},
        {"ID": 2, "Doc": "London", "EmbeddingEuclidean": [0.19, 0.81, 0.75, 0.11]},
        {"ID": 3, "Doc": "Moscow", "EmbeddingEuclidean": [0.36, 0.55, 0.47, 0.94]}
    ]
)
(code, response) = epsilla.query(
    table_name="MyTable",
    query_field="EmbeddingEuclidean",
    response_fields=["ID", "Doc", "EmbeddingEuclidean"],
    query_vector=[0.35, 0.55, 0.47, 0.94],
    filter="ID < 6",
    limit=10,
    with_distance=True
)
print(code, response)
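
To see which record the sample query should rank first, the same search can be reproduced by brute force in plain Python. This is only a sketch for checking expectations: it assumes the EUCLIDEAN metric means ordinary L2 distance, it hard-codes the "ID < 6" filter as a parameter, and the function names are hypothetical. It says nothing about how Epsilla's index computes or reports distances internally.

```python
import math

# The three records inserted in the sample above.
records = [
    {"ID": 1, "Doc": "Berlin", "EmbeddingEuclidean": [0.05, 0.61, 0.76, 0.74]},
    {"ID": 2, "Doc": "London", "EmbeddingEuclidean": [0.19, 0.81, 0.75, 0.11]},
    {"ID": 3, "Doc": "Moscow", "EmbeddingEuclidean": [0.36, 0.55, 0.47, 0.94]},
]
query_vector = [0.35, 0.55, 0.47, 0.94]


def euclidean(a, b):
    """Ordinary L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_query(rows, vector, limit=10, max_id_exclusive=6):
    """Filter by ID, score every remaining row, and return the closest ones."""
    candidates = [row for row in rows if row["ID"] < max_id_exclusive]
    scored = [
        {
            "ID": row["ID"],
            "Doc": row["Doc"],
            "distance": euclidean(row["EmbeddingEuclidean"], vector),
        }
        for row in candidates
    ]
    scored.sort(key=lambda row: row["distance"])
    return scored[:limit]


ranked = brute_force_query(records, query_vector)
for row in ranked:
    print(row["ID"], row["Doc"], round(row["distance"], 4))
```

Under these assumptions Moscow comes first, since its vector differs from the query vector only in the first component.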

vectordb's People

Contributors

andriymulyar, eric-epsilla, jonherke, juliuslipp, richard-epsilla, ricki-epsilla, tonyyanga, topkeyboard

vectordb's Issues

Could you add an image and video content search feature to Epsilla Cloud, please?

Describe the feature

  • This new feature for Epsilla Cloud would help users create and search image and video content databases.

Motivation and use case

  • First, create a new smart search database and select the image content search add-on. Then import your content from your local drive and name the database, for yourself or your team. Finally, run a few searches, either by typing a keyword into the search bar or by uploading an image or video (reverse image search).

Additional context

  • If you have any further suggestions, please let me know. I hope this helps with the future of vector databases for personal or public use.

API to get schema

Describe the feature

  • to be filled

Motivation and use case

  • to be filled

Additional context

  • to be filled

Support int ID auto incremental

Describe the feature

  • Set auto incremental ID assignment, so that users don't need to manually manage the ID in application layer

Motivation and use case

  • From Discord customer feedback: I also find that supporting auto incremental ID might be helpful (at least for me), since I spend more of my time managing the metadata and embeddings.

Additional context

Support clear data

Describe the feature

  • Clear data support in table level and in DB level

Motivation and use case

  • Keep the schema, but clear data

Additional context

  • to be filled

Handle "already exists" errors in load_db calls gracefully, or with HTTP 409

When using Epsilla database with the docker image, we are required to use load_db method to load the database. The returned status code is HTTP 500 when the db is already loaded by the server. This could happen during normal operations, when multiple clients connect to the same db server.

It causes the clients to handle it with string matches on the error message.

if status_code != HTTPStatus.OK:
    if status_code == HTTPStatus.INTERNAL_SERVER_ERROR and (
        "Database catalog file is already loaded" in response["message"]
        or "DB already exists" in response["message"]
    ):
        self._logger.info(f'{self._db_config.db_name} already loaded.')
    else:
        raise IOError(
            f"Failed to load database {self._db_config.db_name}. "
            f"Error code: {status_code}. Error message: {response}."
        )

It is better if the server / client library returns HTTP 200. Alternatively, the right HTTP status code (HTTP 409) should be used to avoid string matches on the error message.
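
Until the server changes, a client could isolate the check in one place so that it keeps working if HTTP 409 is adopted later. The following is a minimal sketch under that assumption: the function name is_already_loaded is hypothetical, the two message fragments are copied from the snippet above, and the 409 branch reflects the proposal in this issue rather than current server behavior.

```python
from http import HTTPStatus

# Message fragments the server currently returns with HTTP 500 (see above).
ALREADY_LOADED_MARKERS = (
    "Database catalog file is already loaded",
    "DB already exists",
)


def is_already_loaded(status_code, response):
    """Return True when a load_db failure only means 'already loaded'.

    Accepts the proposed HTTP 409 as well as the current HTTP 500 plus
    message match, so callers do not need to change if the server does.
    """
    if status_code == HTTPStatus.CONFLICT:
        return True
    if isinstance(response, dict):
        message = str(response.get("message", ""))
    else:
        message = str(response)
    return status_code == HTTPStatus.INTERNAL_SERVER_ERROR and any(
        marker in message for marker in ALREADY_LOADED_MARKERS
    )


print(is_already_loaded(500, {"message": "DB already exists: MyDB"}))
print(is_already_loaded(500, {"message": "Disk full"}))
```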

It should require a simple change to this section of the code:

Status BasicMetaImpl::LoadDatabase(const std::string& db_catalog_path, const std::string& db_name) {
  if (loaded_databases_paths_.find(db_catalog_path) != loaded_databases_paths_.end()) {
    return Status(DB_UNEXPECTED_ERROR, "Database catalog file is already loaded: " + db_catalog_path);
  }
  if (databases_.find(db_name) != databases_.end()) {
    return Status(DB_UNEXPECTED_ERROR, "DB already exists: " + db_name);
  }
  if (!server::CommonUtil::IsValidName(db_name)) {
    return Status(DB_UNEXPECTED_ERROR, "DB name should start with a letter or '_' and can contain only letters, digits, and underscores.");
  }

Documentation issues

Describe the bug

I'm referring to the documentation at the following URL: https://epsilla-inc.gitbook.io/epsilladb/vector-database

  1. Missing documentation on querying existing databases
  2. Missing documentation on querying existing tables
  3. Missing documentation on querying fields of existing tables
  4. Missing documentation on filtering syntax
  5. Missing documentation on indexing (see additional context).

Additional context

There is some rudimentary documentation on indexing, but some key points are missing and some are unclear:
a) No information on how to create the table with an index on the embedding field with VECTOR_FLOAT dataType, which is not created by "model", but provided as part of the data during insert.

My use case: inserting billions of language sentences (STRINGs), with their embeddings, and query them with embedding vector later on to retrieve a sentence.

b) It is not clear if the "Embedding" name of the field in the table is a keyword or if the embedding vector column can have an arbitrary name.
c) It is also unclear whether an externally built embedding that is inserted along with the data will be indexed automatically (by its name "Embedding" or by its type "VECTOR_FLOAT").
d) Is it possible to index any dataType other than STRING? From the documentation:
"When creating tables, you can define indices to let Epsilla automatically create embeddings for the STRING fields"
And then later on:
"Then you can insert records in their raw format and let Epsilla handle the embedding"
This is followed by an example that inserts the text data together with embeddings, even though the "Embedding" column is not defined in the table (in the previous code snippet) and even though Epsilla is supposed to create the embeddings automatically.

BIGINT primary key being interpreted as a float when returning query

The table was created with the Python client as follows:

client.create_table(
    table_name='MyTable',
    table_fields=[
        {'name': 'MessageID', 'dataType': 'BIGINT', 'primaryKey': True},
        {'name': 'PixelVec' , 'dataType': 'VECTOR_FLOAT', 'dimensions': 100*100*3}
    ]
)

After calling

status, response = client.query(
    table_name='MyTable',
    query_field='PixelVec',
    query_vector=myVec,
)

response[0]['MessageID'] comes back as a float or double.
It is 1.157567399403463e+18 when it should be 1157567399403462676.
They are very close, just not quite equal. After calling int(1.157567399403463e+18) I get 1157567399403462912. Only the last 3 digits are off, but that makes it unusable.
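
The numbers reported here are what you would expect if the value passes through a 64-bit floating point number somewhere along the way. A double has a 53-bit significand, so integers above 2**53 cannot all be represented exactly. The sketch below only illustrates that limit in plain Python. It is not a diagnosis of where in Epsilla or the client the conversion happens, and the helper name survives_float64 is hypothetical.

```python
original = 1157567399403462676  # the MessageID from the report

# Converting to a double rounds to the nearest representable value.
# Near 1.16e18 representable doubles are 256 apart.
as_float = float(original)
print(int(as_float))                # 1157567399403462656
print(int(1.157567399403463e+18))   # 1157567399403462912 (value from the report)


def survives_float64(n):
    """Return True if the integer n round-trips through a double unchanged."""
    return int(float(n)) == n


print(survives_float64(2**53))     # True: still exactly representable
print(survives_float64(original))  # False: precision is lost
```

As a possible client-side workaround, assuming the schema allows it, IDs of this size could be stored in a STRING field so that they never pass through a floating point conversion.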

I wanted to use the MessageID to retrieve the actual message, but this just isn't possible with such a response.

If you need/want any more information, please let me know.

Close the database from python client

Describe the feature

  • Close the database from the Python client.

Motivation and use case

  • I have a huge file with sentences (CCMatrix.en), and I want to put each sentence with its vector in a table. However, I do not have enough RAM to put 1.3B sentences into a single database. The OOM-Killer will wreak havoc on the Docker container and some other stuff. I can have several database instances where I can put these sentences and work with one database at a time. However, I can't be sure that the next database I open will have enough RAM if I didn't close the previous one.

Additional context

  • Sure, I can start/stop the containers for each database and manage these starts/stops from the Python code, but this does not feel right.

setting oat++ paths

Hi there,
I installed oat++ from git, so I set the paths in CMakeLists.txt like this:
IMPORTED_LOCATION "/usr/local/lib/oatpp-1.3.0/liboatpp.a" INTERFACE_INCLUDE_DIRECTORIES "/usr/local/include/oatpp-1.3.0/oatpp/"

recompile with -fPIC

[ 48%] Linking CXX shared library libvectordb_dylib.so
/usr/bin/ld: /usr/local/lib/oatpp-1.3.0/liboatpp.a(Environment.cpp.o): relocation R_X86_64_TPOFF32 against symbol `_ZN5oatpp4base11Environment25m_threadLocalObjectsCountE' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: failed to set dynamic section sizes: bad value
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/vectordb_dylib.dir/build.make:533: libvectordb_dylib.so] Error 1
make[1]: *** [CMakeFiles/Makefile2:115: CMakeFiles/vectordb_dylib.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Improve insert/delete API response with detail statistics

Describe the feature

  • insert API: report how many records were inserted and how many were skipped
{
    "statusCode": 200,
    "message": "Insert data to MyTable1291 successfully. successfully inserted 3 records. 2 records skipped ...",
    "result": {
        "inserted": 3,
        "skipped": 2
    }
}
  • delete API: report how many records were deleted
{
    "statusCode": 200,
    "message": "successfully deleted 2 records.",
    "result": {
        "deleted": 2
    }
}
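
With structured counts in the result, a dispatcher that fans an insert out to several workers could sum the numbers instead of parsing message strings. The following is a minimal sketch, assuming the response format proposed above. The function name aggregate_insert_results and the two sample responses are hypothetical.

```python
def aggregate_insert_results(responses):
    """Sum the proposed 'inserted' and 'skipped' counts across responses."""
    totals = {"inserted": 0, "skipped": 0}
    for response in responses:
        result = response.get("result", {})
        totals["inserted"] += result.get("inserted", 0)
        totals["skipped"] += result.get("skipped", 0)
    return totals


# Hypothetical responses from two workers, in the proposed format.
shard_responses = [
    {"statusCode": 200, "result": {"inserted": 3, "skipped": 2}},
    {"statusCode": 200, "result": {"inserted": 5, "skipped": 0}},
]

totals = aggregate_insert_results(shard_responses)
print(totals)
```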

Motivation and use case

  • Better message for distributed dispatch aggregation, and telemetry

Additional context

  • to be filled

Upsert support

Describe the feature

  • Add a parameter in insert API to support upsert behavior
    What is upsert? Currently, when inserting records, records whose primary key already exists are skipped.
    We want to support an alternative behavior that updates the existing records instead of skipping them.
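
The difference between the two behaviors can be shown with an in-memory dictionary standing in for a table. This is a sketch of the semantics only: the function insert_records and its upsert parameter are hypothetical and do not describe the Epsilla API or its implementation.

```python
def insert_records(table, records, primary_key="ID", upsert=False):
    """Insert records into a dict keyed by primary key.

    With upsert=False, records whose key already exists are skipped
    (the current behavior described above). With upsert=True they
    replace the existing record instead.
    """
    stats = {"inserted": 0, "skipped": 0, "updated": 0}
    for record in records:
        key = record[primary_key]
        if key not in table:
            table[key] = record
            stats["inserted"] += 1
        elif upsert:
            table[key] = record
            stats["updated"] += 1
        else:
            stats["skipped"] += 1
    return stats


table = {}
first = insert_records(table, [{"ID": 1, "Doc": "old"}])
skipped = insert_records(table, [{"ID": 1, "Doc": "new"}])
doc_after_skip = table[1]["Doc"]
updated = insert_records(table, [{"ID": 1, "Doc": "new"}], upsert=True)
doc_after_upsert = table[1]["Doc"]
print(first, skipped, updated)
```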

Motivation and use case

  • Easier document update

Additional context

  • to be filled

not localhost

The server says it is listening on localhost, but it is actually listening on 0.0.0.0, not 127.0.0.1, which is confusing.
std::cout << "Server running on http://localhost:" << port << std::endl;

Batch search

Hi, are there plans for the API to support similarity search for multiple vectors in a single request? Afaict this is not currently possible, and it's not in the roadmap.

I'm currently benchmarking Epsilla and I would imagine batch queries would improve performance.
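
In the meantime, a client can approximate batch search by issuing the single-vector queries concurrently. The following is a minimal sketch, assuming a query callable that accepts a query_vector keyword argument. The name batch_query is hypothetical, and fake_query is a stand-in so the example runs without a server. This still sends one request per vector, so it does not give the savings a real server-side batch API could.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_query(query_fn, query_vectors, max_workers=4, **query_kwargs):
    """Run query_fn once per vector in parallel and keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(query_fn, query_vector=vector, **query_kwargs)
            for vector in query_vectors
        ]
        return [future.result() for future in futures]


def fake_query(query_vector, **kwargs):
    """Stand-in for a real client query; echoes the vector it received."""
    return 200, {"result": [{"echo": query_vector}]}


results = batch_query(
    fake_query,
    [[0.1, 0.2], [0.3, 0.4]],
    table_name="MyTable",
    limit=2,
)
print(results)
```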

Delete by filter expression not working as expected

Describe the bug
Delete by filter does not always work as expected.
The root cause needs a deep dive; reproduction steps are not standardized yet.

Screenshots

  • If applicable, add screenshots to help explain your problem.

Additional context

  • Add any other context about the problem here.

DB import/export

Describe the feature
Step 1: docker support

  • Add a new API to export db as a tarball and download to local:
GET /api/<DBName>/export
  • Add a new API to import db from a tarball, post as payload:
POST \
     -H "Content-Type: multipart/form-data" \
     -F "file=@/path/to/your/tarball" \
     -F "json={\"name\": \"the DB name\", \"path\": \"path/on/docker/disk\"};type=application/json" \
     /api/import

Step 2: cloud support

  • TBD

Motivation and use case

  • To support easy db migration between docker images

Additional context

  • to be filled

Re. Also add similarity search to Epsilla Cloud.

Describe the feature

  • Similarity search is another tool for finding similar results within your vector database.

Motivation and use case

  • How this feature would work: click the magnifying glass 🔍 icon in the bottom left corner of an image or video, and it should show similar results from within the vector database.

Additional context

  • If you have any questions about this or any interest, please let me know.

Support variable dimension embedding model

Describe the feature

  • Support variable dimension embedding models. If the embedding model vendor supports variable dimensions, we can let users specify a dimension smaller than or equal to the model's full dimension during index creation.

Motivation and use case

  • OpenAI text-embedding-3 supports this new feature
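
For models trained to support shortened embeddings, one documented way to get a smaller vector is to keep the leading dimensions and re-normalize to unit length. The following is a minimal client-side sketch of that step. The function name shorten_embedding is hypothetical, and this only helps for models designed for truncation, which is an assumption you would need to confirm for your model.

```python
import math


def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale to unit L2 norm."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    truncated = list(vector[:dimensions])
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated  # an all-zero vector cannot be normalized
    return [x / norm for x in truncated]


short = shorten_embedding([0.5, 0.5, 0.5, 0.5], 2)
print(short)
```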

Additional context

  • to be filled

@distance returning 0s when multiple threads search with different limit

Describe the bug
When two threads search at the same time, one with limit = 5 and the other with limit = 10, @distance is returned as 0 for the 6th through 10th elements in the second thread's results.

Screenshots

  • If applicable, add screenshots to help explain your problem.

Additional context

  • Add any other context about the problem here.

Add filter support for @distance

Describe the feature

  • In search filter condition, add support for @distance
{
    "table": "MyTable",
    "query": "This is a document",
    "filter": "@distance < 0.35",
    "limit": 3
}

Motivation and use case

  • To filter out unpromising results from the retrieval
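
Until the filter expression supports it, the same effect can be approximated on the client by dropping rows after the query returns. This is a sketch under assumptions: it expects each result row to carry its distance under the key "@distance" (as when distances are requested), and the function name filter_by_distance is hypothetical. Filtering after the fact can leave fewer rows than the requested limit, which a server-side filter might handle differently.

```python
def filter_by_distance(results, max_distance, distance_key="@distance"):
    """Keep only rows whose distance is strictly below max_distance.

    Rows without a distance value are dropped, since they cannot be
    shown to satisfy the condition.
    """
    return [
        row
        for row in results
        if row.get(distance_key, float("inf")) < max_distance
    ]


# Hypothetical result rows for illustration.
rows = [
    {"ID": 1, "@distance": 0.12},
    {"ID": 2, "@distance": 0.35},
    {"ID": 3, "@distance": 0.80},
    {"ID": 4},
]

kept = filter_by_distance(rows, 0.35)
print(kept)
```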

Additional context

  • to be filled

Statistics API

Describe the feature

  • Create a new API to report statistics, start with total number of records
GET /api/<DBName>/statistics
{
  "statusCode": 200,
  "result": [
    {
      "tableName": "<tableName>",
      "totalRecords": 234
    },
    {
      "tableName": "<tableName>",
      "totalRecords": 23456
    },
    ...
  ]
}

Motivation and use case

  • For telemetry

Additional context

  • to be filled
