
ParadeDB

Postgres for Search and Analytics

Website · Docs · Community · Blog · Changelog



ParadeDB is an Elasticsearch alternative built on Postgres. We're modernizing the features of Elasticsearch's product suite, starting with real-time search and analytics.

Status

ParadeDB is currently in Public Beta. Star and watch this repository to get notified of updates.

Roadmap

  • Search
  • Analytics
    • Accelerated analytical queries and column-oriented storage with pg_analytics
    • External object store integrations (S3/Azure/GCS/HDFS)
    • External Apache Iceberg and Delta Lake support
    • High-volume data/Kafka ingest
    • Non-Parquet file formats (Avro/ORC)
  • Self-Hosted ParadeDB
  • Cloud Database
    • Managed cloud
    • Cloud Marketplace Images
    • Web-based SQL Editor
  • Specialized Workloads
    • Support for geospatial data with PostGIS
    • Support for cron jobs with pg_cron

Get Started

To get started, please visit our documentation.

Deploying ParadeDB

ParadeDB and its extensions, pg_analytics and pg_search, are available as commercial software for installation on self-hosted Postgres deployments, and via Docker and Kubernetes as standalone images. For more information, including enterprise features and support, please contact us by email.

Extensions

You can find pre-packaged releases for all ParadeDB extensions for both Postgres 15 and Postgres 16 on Ubuntu 22.04 in the GitHub Releases. We officially support Postgres 12 and above, and you can compile the extensions for other versions of Postgres by following the instructions in the respective extension's README.

For official support on non-Debian-based systems, please contact us by email.

Docker Image

To quickly get a ParadeDB instance up and running, simply pull and run the latest Docker image:

docker run --name paradedb paradedb/paradedb

This will start a ParadeDB instance with default user postgres and password postgres. You can then connect to the database using psql:

docker exec -it paradedb psql -U postgres

To install ParadeDB locally or on-premise, we recommend using our docker-compose.yml file. Alternatively, you can pass the appropriate environment variables to the docker run command, replacing the <> with your desired values:

docker run \
  --name paradedb \
  -e POSTGRESQL_USERNAME=<user> \
  -e POSTGRESQL_PASSWORD=<password> \
  -e POSTGRESQL_DATABASE=<dbname> \
  -e POSTGRESQL_POSTGRES_PASSWORD=<superuser_password> \
  -v paradedb_data:/bitnami/postgresql \
  -p 5432:5432 \
  -d \
  paradedb/paradedb:latest

This will start a ParadeDB instance with non-root user <user> and password <password>. The <superuser_password> will be associated with the postgres superuser and is necessary for ParadeDB extensions to install properly.

The -v flag enables your ParadeDB data to persist across restarts in a Docker volume named paradedb_data. The volume needs to be writable by a user with uid = 1001, which is a security requirement of the Bitnami PostgreSQL Docker image. You can set this up with:

sudo useradd -u 1001 <user>
sudo chown <user> </path/to/paradedb_data>

You can then connect to the database using psql:

docker exec -it paradedb psql -U <user> -d <dbname> -p 5432 -W

ParadeDB collects anonymous telemetry to help us understand how many people are using the project. You can opt out of telemetry using configuration variables within Postgres (the settings take effect after a configuration reload or server restart):

ALTER SYSTEM SET paradedb.pg_search_telemetry TO 'off';
ALTER SYSTEM SET paradedb.pg_analytics_telemetry TO 'off';

Helm Chart

ParadeDB is also available for Kubernetes via our Helm chart. You can find our Helm chart in the ParadeDB Helm Chart GitHub repository or download it directly from Artifact Hub.

ParadeDB Cloud

At the moment, ParadeDB is not available as a managed cloud service. If you are interested in a ParadeDB Cloud service, please let us know by joining our waitlist.

Support

If you're missing a feature or have found a bug, please open a GitHub Issue.

To get community support, you can join the ParadeDB Community Slack.

If you need commercial support, please contact the ParadeDB team.

Contributing

We welcome community contributions, big or small, and are here to guide you along the way. To get started contributing, check our first timer issues or message us in the ParadeDB Community Slack. Once you contribute, ping us in Slack and we'll send you some ParadeDB swag!

For more information on how to contribute, please see our Contributing Guide.

This project is released with a Contributor Code of Conduct. By participating in this project, you agree to follow its terms.

Thank you for helping us make ParadeDB better for everyone ❤️.

License

ParadeDB is licensed under the GNU Affero General Public License v3.0 and as commercial software, with the exception of pg_sparse, which is licensed under the PostgreSQL License.

For commercial licensing, please contact us at [email protected].

If you are an open-source project and would like to use ParadeDB under a different license, please contact us at [email protected].

Contributors

aprilnea, aragalie, cathrach, coderjoshdk, dependabot[bot], djsavvy, dulacp, eduardojm, juleskuehn, lilit0x, maparent, mauaraujo, neilyio, philippemnoel, pratheekrebala, rebasedming, sardination, stevelauc, vladdoster, workingjubilee, yihong0618


Issues

Make deploy README tags for cloud platforms

It is possible to create a README banner which serves as a button to seamlessly deploy our product to a specific cloud platform when self-hosting. Airbyte had this, notably. I think we should do this for

  • AWS
  • GCP
  • DigitalOcean (?)
  • Heroku (?)

Potentially others. We can start with AWS and GCP as the two main ones we've heard of people using from our customer conversations.

Use ML node instead of data node for deploying models

Is your feature request related to a problem? Please describe.
Currently we are using the data node to deploy models. We have gotten a "circuit breaker out of memory" error in development and we think that this could be because we aren't using the ML node.

Describe the solution you'd like
Enable the use of ML nodes
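
A minimal sketch of what enabling this could look like, assuming opensearch-py and the ml-commons plugins.ml_commons.only_run_on_ml_node cluster setting; host and port are illustrative:

# Sketch: restrict ml-commons model execution to dedicated ML nodes.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.cluster.put_settings(
    body={
        "persistent": {
            # When true, models run on ML nodes instead of data nodes.
            "plugins.ml_commons.only_run_on_ml_node": True
        }
    }
)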


Switch from `models/_upload` to `models/_register`

Is your feature request related to a problem? Please describe.
We got a warning that _upload is deprecated and that _register should be used instead.
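
A minimal sketch of switching endpoints, assuming opensearch-py and the ml-commons register API; the model name, version, and format are illustrative:

# Sketch: call models/_register instead of the deprecated models/_upload.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.transport.perform_request(
    "POST",
    "/_plugins/_ml/models/_register",
    body={
        "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
    },
)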


ci: Prod Promotion [7/31/23]

Describe the solution you'd like

  • Documentation for new with_semantic and with_neural functions


Enable developers to disable real-time sync

Is your feature request related to a problem? Please describe.
Users should be able to choose whether real-time sync is enabled.

Describe the solution you'd like
The pgsync CLI provides a no-daemon option we can use and expose to the user


Add database setup instructions to documentation

Is your feature request related to a problem? Please describe.
We need to add instructions on how to set up the Postgres database to our documentation, i.e. editing the .conf file and enabling logical replication.

Introduce `info` method to the `Index` class

Is your feature request related to a problem? Please describe.
Currently it's hard to see the status and details of an index.

Describe the solution you'd like
The Index class should have an info method that returns an object with the following information (sketched below):

  1. Index mapping i.e. field names and types
  2. Index size i.e. number of documents
  3. Which columns are "neural columns"
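
A hypothetical sketch of what info could return, assuming an opensearch-py client and that "neural columns" are the knn_vector fields; all names are illustrative:

# Hypothetical sketch of the proposed Index.info method.
class Index:
    def __init__(self, client, name):
        self.client = client
        self.name = name

    def info(self):
        mapping = self.client.indices.get_mapping(index=self.name)
        properties = mapping[self.name]["mappings"].get("properties", {})
        return {
            # 1. Index mapping: field names and types
            "mapping": {field: spec.get("type") for field, spec in properties.items()},
            # 2. Index size: number of documents
            "num_documents": self.client.count(index=self.name)["count"],
            # 3. Neural columns (assumed to be the knn_vector fields)
            "neural_columns": [
                field for field, spec in properties.items()
                if spec.get("type") == "knn_vector"
            ],
        }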

Decouple `neural columns` from data upload

Is your feature request related to a problem? Please describe.
Currently the user must set which columns to perform neural search over as part of the add_source function. This can be confusing, and is also non-ideal because they aren't able to specify neural search columns when uploading data from memory via the upsert function.

Describe the solution you'd like
Expose a separate Index.register_neural_search_fields function and remove neural_columns from add_source.
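
A hypothetical usage sketch of the proposed API, reusing the create_index, add_source, and upsert names from this repo's examples:

# Neural columns are registered on the index itself, independently of
# how the data arrives (add_source or upsert).
index = client.create_index("my_index")

index.add_source(database, table)  # no neural_columns argument anymore
index.register_neural_search_fields(["title", "description"])

# Data upserted from memory is now also covered by neural search.
index.upsert(
    ids=[1],
    data=[{"title": "hello", "description": "world"}],
)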


Bring back mypy and flake8

Is your feature request related to a problem? Please describe.
When migrating repos, we disabled flake8 (linting) and mypy (static type checking). We should bring these back and add them to GH actions.

Running list of tests to add

Is your feature request related to a problem? Please describe.
This is a running list of bugs we have encountered and fixed. The idea is that when we have the integration testing framework set up, we will refer to this issue and write tests for each bug fix to ensure that they don't resurface!

  • Throw error if we add more sources than the DB has replication slots
  • Don't allow Debezium to use the same replication slot for two tables

Fix `model_not_deployed` bug when uploading data

Describe the bug
Sometimes, when upserting data into an index, it will fail with a model_not_deployed error.

To Reproduce
I can't reproduce it very consistently but I've seen it happen when I close and restart a container and then try to upsert data. Re-running the script to create/load the model doesn't seem to do anything. I've noticed that if I wait a few minutes and upsert data again the error goes away, which makes me think we aren't waiting on something.

Check that all columns exist before creating an index

Is your feature request related to a problem? Please describe.
Currently, if the user passes incorrect column names into the index function, no warning is raised.

Describe the solution you'd like
When creating an index, check to see that all column names exist and are valid
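
A minimal sketch of such a check; conn is assumed to be a psycopg2 connection, and the helper name is illustrative:

# Sketch: validate requested columns against information_schema before indexing.
def validate_columns(conn, table, columns):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
            (table,),
        )
        existing = {row[0] for row in cur.fetchall()}
    missing = set(columns) - existing
    if missing:
        raise ValueError(f"Columns not found in table {table!r}: {sorted(missing)}")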


Opensearch shouldn't be single node in Docker Compose

Describe the bug
Right now, when users run docker compose up, OpenSearch runs as a single node. We should enable multi-node mode, since users may be using the Docker Compose stack for prod-like use cases too.

To Reproduce
Steps to reproduce the behavior:

  1. Run docker compose up

Expected behavior
Multi-node OpenSearch

Add SSL verification to OpenSearch requests

Is your feature request related to a problem? Please describe.
Currently we don't do SSL verification, so our OpenSearch API calls give:

Unverified HTTPS request is being made to host 'core'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings

Describe the solution you'd like
Enable SSL


Implement Distributed Search

What
One of the key features of Elastic is sharding, which enables concurrent processing of search queries. We need to do the same for ParadeDB, and we are planning to use a combination of Citus's schema and Tantivy's schema features.

Why
Be able to scale horizontally to distribute load when searching

How
Unclear; the current hypothesis is Citus + Tantivy via their schema functionality.

Document Python client

Is your feature request related to a problem? Please describe.
We should introduce a new Clients section to the documentation that documents the Python client - all its classes and methods.

Add multi-threading capabilities to the `index` function

Is your feature request related to a problem? Please describe.
Currently, the index function is single-threaded: it reads n rows at a time and indexes them. This will take an unreasonably long time for tables with millions of rows.

Describe the solution you'd like
Introduce multi-threading: create multiple connections to the database and index rows in parallel (see the sketch below).
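
A rough sketch of the idea, assuming each worker opens its own database connection; batch sizes and helper names are illustrative:

# Sketch: index batches of rows in parallel with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def index_batch(offset, batch_size):
    # Each worker opens its own connection, reads batch_size rows
    # starting at offset, and bulk-indexes them.
    ...

def index_table(total_rows, batch_size=1000, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(index_batch, offset, batch_size)
            for offset in range(0, total_rows, batch_size)
        ]
        for future in futures:
            future.result()  # surface any per-batch exceptions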

Describe alternatives you've considered
N/A

Additional context
N/A

Introduce a `Concepts` section to the README

Is your feature request related to a problem? Please describe.
N/A

Describe the solution you'd like
Introduce a Concepts section to the README that goes over the basics of OpenSearch (indexes, documents, fields, and queries)


Add Intercom to docs

Is your feature request related to a problem? Please describe.
Create an Intercom account. In mint.json, add

"integrations": {
   "intercom": "appId"
}

Modify the Python SDK to allow users to customize the search index

Is your feature request related to a problem? Please describe.
Currently the Python SDK maps a table to an index 1:1. This is a problem when users want to search over data that requires two tables to be JOINed - there's no way to search across multiple indexes or put two tables into the same index.

Describe the solution you'd like
We can alter the Python SDK interface by exposing the index to the developer and allowing them to attach multiple tables to the same index.

Here's a pseudocode example.

BEFORE:

import json
import os

from retakesearch import Client, Database, Table

client = Client(api_key=os.getenv("RETAKE_API_KEY"), url=os.getenv("RETAKE_API_URL"))

database = Database(
    host=os.getenv("DATABASE_HOST"),
    port=os.getenv("DATABASE_PORT"),
    user=os.getenv("DATABASE_USER"),
    password=os.getenv("DATABASE_PASSWORD"),
)

table = Table(
    name=os.getenv("DATABASE_TABLE_NAME"),
    primary_key=os.getenv("DATABASE_TABLE_PRIMARY_KEY"),
    columns=json.loads(os.getenv("DATABASE_TABLE_COLUMNS")),
    neural_columns=json.loads(os.getenv("DATABASE_TABLE_COLUMNS")),
)

response = client.index(database, table)

AFTER:

import os

from retakesearch import Client, Database, Table

client = Client(api_key=os.getenv("RETAKE_API_KEY"), url=os.getenv("RETAKE_API_URL"))
index = client.create_index("my_index")

database = Database(...)
table1 = Table(...)
table2 = Table(...)

index.add_source(database, table1)
index.add_source(database, table2)

Describe alternatives you've considered
Eventually it would be cool to create indices from views, but I believe this is challenging to do in real time without something like Materialize.

Cannot integrate with Supabase

Describe the bug
Logical replication requires superuser privileges, which Supabase does not grant to users.

To Reproduce
Steps to reproduce the behavior:

  1. Try to connect to Supabase using add_source


Create custom error for neural search over a non-vectorized field

Is your feature request related to a problem? Please describe.
Currently if the user performs a neural search over a non-vectorized field they would get an error message that looks like this

Exception: "Failed to search documents: RequestError(400, 'search_phase_execution_exception', \"failed to create query: Field 'field_retake_embedding' is not knn_vector type.\")"

Describe the solution you'd like
We should catch this error and return a more understandable error message
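
A minimal sketch of catching and rewrapping the error, assuming opensearch-py's RequestError; the helper and message wording are illustrative:

# Sketch: turn the raw knn_vector RequestError into an actionable message.
from opensearchpy.exceptions import RequestError

def neural_search(client, index, body, field):
    try:
        return client.search(index=index, body=body)
    except RequestError as err:
        if "knn_vector" in str(err):
            raise ValueError(
                f"Field '{field}' has not been vectorized; "
                "vectorize it before running a neural search."
            ) from err
        raise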


Replace Kafka with pgsync

Is your feature request related to a problem? Please describe.
Kafka is pretty tough to work with and maintain. pgsync seems like a more lightweight alternative that will also save us tons of engineering effort. It will help us close issues like #101.


Make documentation clearer by providing an example Table

Is your feature request related to a problem? Please describe.
The documentation uses a lot of random dummy variable names like "faqs" or "column_name." We should standardize all of this and make it clearer by showing the user the example table schema that will be used for all example code blocks.

Sync Postgres `DELETE` events

Is your feature request related to a problem? Please describe.
Right now, if a row is deleted in Postgres, it does not get deleted in OpenSearch. This could result in dead search results that may frustrate users.

Describe the solution you'd like
Listen for _delete=True Kafka events and delete the corresponding document in OpenSearch by ID.
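
A minimal sketch of the consumer side, assuming kafka-python and opensearch-py; the topic, index, and field names are illustrative:

# Sketch: delete OpenSearch documents for rows deleted in Postgres.
import json

from kafka import KafkaConsumer
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
consumer = KafkaConsumer(
    "table_changes",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event.get("_delete"):
        client.delete(index="my_index", id=event["id"])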

Replace UUID w/ commithash + optional email

Is your feature request related to a problem? Please describe.
Currently, we use a UUID to get telemetry for new deploys. This is simple and works well, but doesn't tell us which version developers are running, which can make debugging with them hard. We should switch to a UUID plus commit hash, or ask developers for their emails if they're willing to share (optional, of course).

Convert our PyPi deployment workflows to run following a GitHub Release creation

Is your feature request related to a problem? Please describe.
This appears to be GitHub's recommended way to run deploys, and it is how we already do it for DockerHub and NPM.

Describe the solution you'd like
Just switch the triggers and a thing or two to run after a GitHub tag gets created.

Describe alternatives you've considered
N/A

Additional context
N/A

Create a Dockerfile with third-party Postgres Extensions

What
Currently, I only install the pg_bm25 extension. We should install all the other extensions we want there. As part of this, I would like to also structure the Dockerfile a bit better/cleaner, so it's easier to distinguish the build stages visually.

Can also check pgxman, trunk, and pgxn to find other extensions to build/list:

A running list:

Why
Have support for all extensions we need to deliver our product!

How
For each extension, we need to add it to the .json file in the conf/ folder, add its version everywhere it's referenced, and then trigger a build!

`pgsync` service blocks the API service

Describe the bug
I've observed that when pgsync is running, API calls fail until pgsync finishes.

To Reproduce
Steps to reproduce the behavior:

  1. Run add_source
  2. Try to run any other command, like Index.describe_index - the API call never gets received/responds

Expected behavior
These two services should be async. Also, we should look into increasing the number of uvicorn workers for a similar reason.

Introduce integration testing framework

Is your feature request related to a problem? Please describe.
We need integration tests so things don't break :)

Describe the solution you'd like
An integration testing framework using pytest, very similar to the one I built in our ETL repo.

What it will do (see the sketch after this list):

  1. Spin up a Dockerized Postgres fixture, and load it with some fake tables and data
  2. Spin up our docker compose stack (i.e. opensearch, kafka, and api) as another fixture
  3. Use these two fixtures to test all our core API functions
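
A rough sketch of the first two fixtures, assuming pytest and psycopg2; container names, ports, and credentials are illustrative, and readiness waits are omitted:

# Sketch: session-scoped fixtures for a throwaway Postgres and the compose stack.
import subprocess

import psycopg2
import pytest

@pytest.fixture(scope="session")
def postgres():
    subprocess.run(
        ["docker", "run", "-d", "--name", "test_pg",
         "-e", "POSTGRES_PASSWORD=postgres", "-p", "5433:5432", "postgres:15"],
        check=True,
    )
    conn = psycopg2.connect(
        host="localhost", port=5433, user="postgres", password="postgres"
    )
    # Load fake tables and data here.
    yield conn
    conn.close()
    subprocess.run(["docker", "rm", "-f", "test_pg"], check=True)

@pytest.fixture(scope="session")
def stack():
    # Spin up the opensearch/kafka/api compose stack.
    subprocess.run(["docker", "compose", "up", "-d"], check=True)
    yield
    subprocess.run(["docker", "compose", "down"], check=True)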

Types file is not being recognized

Describe the bug
Our npm library isn't getting typed correctly, something to do with our rollup/package.json configuration.

To Reproduce
Steps to reproduce the behavior:

  1. Install retake-search and import the SDK into any Nodejs environment
  2. VSCode will highlight the lack of types

Expected behavior
Our library should be typed

Model gets unloaded when docker container restarts

Describe the bug
The user will get a "model not loaded" or similar error when they stop and restart the docker compose stack. They will need to re-run vectorize to fix this.

To Reproduce

  1. Create an index
  2. Populate it with data
  3. Vectorize some of the fields
  4. Stop docker
  5. Restart docker
  6. Get the same index
  7. Perform a search on the index

Expected behavior
We should ensure that the model is always loaded


Remove Sink-specific dependencies

What

Currently the library bundles in all sink dependencies. For instance, installing retake will also install the Python clients for Pinecone, Elastic, etc. We should remove these from the library and ask users to install dependencies themselves based on what sinks they're using.

Why

Makes the library more lightweight and prevents the number of dependencies from exploding as we support more sinks

How

Remove all sink-specific dependencies and instruct users on which dependencies to install in the documentation
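
One way to do this is optional extras in the package metadata; a minimal sketch assuming setuptools, with illustrative package names:

# Sketch: move sink clients into optional extras so the base install stays light.
from setuptools import find_packages, setup

setup(
    name="retake",
    packages=find_packages(),
    install_requires=["requests"],  # core dependencies only (illustrative)
    extras_require={
        # Installed on demand, e.g. `pip install retake[elastic]`.
        "elastic": ["elasticsearch"],
        "pinecone": ["pinecone-client"],
        "opensearch": ["opensearch-py"],
    },
)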

Throw error if there aren't enough replication slots

Is your feature request related to a problem? Please describe.
There must be at least as many replication slots as there are tables connected to with Debezium. Right now, the connectors silently fail if there are too few replication slots.

Describe the solution you'd like
Detect and throw an error early if there aren't enough slots.
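
A minimal sketch of the check, assuming psycopg2 and one slot needed per connected table; the helper name is illustrative:

# Sketch: fail fast when the server lacks enough free replication slots.
def check_replication_slots(conn, tables_to_sync):
    with conn.cursor() as cur:
        cur.execute("SHOW max_replication_slots")
        max_slots = int(cur.fetchone()[0])
        cur.execute("SELECT count(*) FROM pg_replication_slots")
        used_slots = cur.fetchone()[0]
    free_slots = max_slots - used_slots
    if free_slots < len(tables_to_sync):
        raise RuntimeError(
            f"Need {len(tables_to_sync)} free replication slots, "
            f"but only {free_slots} are available."
        )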


Dynamic field inference fails on non-string types

Describe the bug
When uploading non-string data using add_source, pgsync throws an error that the field type is not recognized, but the error is not returned to the user.

To Reproduce
Steps to reproduce the behavior:

  1. Upload a table with a number column

Expected behavior
User should be asked to specify the field type.


Allow users to specify index field type

Is your feature request related to a problem? Please describe.
Allow users to specify the index field type, for instance keyword vs. text. OpenSearch by default makes all strings "text" type and users might want some fields to be "keyword" for exact matching.

Describe the solution you'd like
Modify the columns attribute of the Table class to allow users to pass in a list of dicts, where each dict specifies both the column name and the corresponding field type, e.g.

Table(
   columns = [
     {
        "name": "column1",
        "field_type": "keyword"
     }
   ]
)

Enable users to insert custom documents to an index

Is your feature request related to a problem? Please describe.
Currently the only way to add documents to an index is with the add_source function, which attaches an entire Postgres table to the index. I suspect users will have use cases where they need to add ad-hoc data that's not in a Postgres table, and we should support that.

Describe the solution you'd like
Introduce an upsert function that looks something like

index.upsert(
  ids=[1, 2, 3],
  data=[{"key1": "value1"}, {"key2": "value2", "key3": "value3"}]
)

Tables containing empty string columns fail to vectorize

Describe the bug
Tables containing empty string columns fail to vectorize.

To Reproduce
Steps to reproduce the behavior:

  1. Call addSource and vectorize on a table with text columns, some of which contain empty values

Additional context
The reason is that pgsync creates fields as empty strings when the value is NULL, when it should omit them from the document. Then, because the field contains an empty string, it can't be embedded.
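
A minimal sketch of the fix on the document-preparation side; the helper name is illustrative:

# Sketch: omit NULL/empty-string fields so they are never sent for embedding.
def prepare_document(row):
    return {key: value for key, value in row.items() if value not in (None, "")}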

Investigate other embedding models

Is your feature request related to a problem? Please describe.
N/A

Describe the solution you'd like
Currently we use the pretrained huggingface/sentence-transformers/all-MiniLM-L12-v2 model to generate embeddings - this is hard-coded into the server.

In practice we could use any other HuggingFace model and could also enable users to choose their own model (https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/).

Is there a better model we could be using? Should we enable configurability here or is it better to make an opinionated choice on behalf of the user?


Implement a way to notify the client when the Debezium snapshot is finished

Is your feature request related to a problem? Please describe.
Currently, we don't know when the initial data upload (i.e. the Debezium snapshot) is finished, so we have no way of notifying the client when they're ready to execute search queries.

Describe the solution you'd like
Read the size of the table and notify the client when the size of documents uploaded matches the size of the table.
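
A minimal sketch of that polling approach, assuming psycopg2 and opensearch-py; all names are illustrative:

# Sketch: poll until the index document count matches the table row count.
import time

def wait_for_snapshot(conn, client, table, index, poll_seconds=5):
    with conn.cursor() as cur:
        cur.execute(f"SELECT count(*) FROM {table}")  # table name is trusted here
        table_rows = cur.fetchone()[0]
    while client.count(index=index)["count"] < table_rows:
        time.sleep(poll_seconds)
    # Snapshot complete: the client can now execute search queries.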
