
ParadeDB

Postgres for Search and Analytics

Website · Docs · Community · Blog · Changelog



ParadeDB is an Elasticsearch alternative built on Postgres. We're modernizing the features of Elasticsearch's product suite, starting with real-time search and analytics.

Status

ParadeDB is currently in Public Beta. Star and watch this repository to get notified of updates.

Roadmap

  • Search
  • Analytics
    • Accelerated analytical queries and column-oriented storage with pg_analytics
    • External object store integrations (S3/Azure/GCS/HDFS)
    • External Apache Iceberg and Delta Lake support
    • High-volume data/Kafka ingest
    • Non-Parquet file formats (Avro/ORC)
  • Self-Hosted ParadeDB
  • Cloud Database
    • Managed cloud
    • Cloud Marketplace Images
    • Web-based SQL Editor
  • Specialized Workloads
    • Support for geospatial data with PostGIS
    • Support for cron jobs with pg_cron

Get Started

To get started, please visit our documentation.

Deploying ParadeDB

ParadeDB and its extensions, pg_analytics and pg_search, are available as commercial software for installation on self-hosted Postgres deployments, and via Docker and Kubernetes as standalone images. For more information, including enterprise features and support, please contact us by email.

Extensions

You can find pre-packaged releases for all ParadeDB extensions for both Postgres 15 and Postgres 16 on Ubuntu 22.04 in the GitHub Releases. We officially support Postgres 12 and above, and you can compile the extensions for other versions of Postgres by following the instructions in the respective extension's README.

For official support on non-Debian-based systems, please contact us by email.

Docker Image

To quickly get a ParadeDB instance up and running, simply pull and run the latest Docker image:

docker run --name paradedb paradedb/paradedb

This will start a ParadeDB instance with default user postgres and password postgres. You can then connect to the database using psql:

docker exec -it paradedb psql -U postgres

To install ParadeDB locally or on-premise, we recommend using our docker-compose.yml file. Alternatively, you can pass the appropriate environment variables to the docker run command, replacing the <> with your desired values:

docker run \
  --name paradedb \
  -e POSTGRESQL_USERNAME=<user> \
  -e POSTGRESQL_PASSWORD=<password> \
  -e POSTGRESQL_DATABASE=<dbname> \
  -e POSTGRESQL_POSTGRES_PASSWORD=<superuser_password> \
  -v paradedb_data:/bitnami/postgresql \
  -p 5432:5432 \
  -d \
  paradedb/paradedb:latest

This will start a ParadeDB instance with non-root user <user> and password <password>. The <superuser_password> will be associated with the postgres superuser and is necessary for ParadeDB extensions to install properly.

The -v flag enables your ParadeDB data to persist across restarts in a Docker volume named paradedb_data. The volume needs to be writable by a user with uid = 1001, which is a security requirement of the Bitnami PostgreSQL Docker image. You can set this up with:

sudo useradd -u 1001 <user>
sudo chown <user> </path/to/paradedb_data>

You can then connect to the database using psql:

docker exec -it paradedb psql -U <user> -d <dbname> -p 5432 -W

ParadeDB collects anonymous telemetry to help us understand how many people are using the project. You can opt out of telemetry using configuration variables within Postgres (the settings take effect after a configuration reload or server restart):

ALTER SYSTEM SET paradedb.pg_search_telemetry TO 'off';
ALTER SYSTEM SET paradedb.pg_analytics_telemetry TO 'off';

Helm Chart

ParadeDB is also available for Kubernetes via our Helm chart. You can find our Helm chart in the ParadeDB Helm Chart GitHub repository or download it directly from Artifact Hub.

ParadeDB Cloud

At the moment, ParadeDB is not available as a managed cloud service. If you are interested in a ParadeDB Cloud service, please let us know by joining our waitlist.

Support

If you're missing a feature or have found a bug, please open a GitHub Issue.

To get community support, you can join the ParadeDB Community Slack.

If you need commercial support, please contact the ParadeDB team.

Contributing

We welcome community contributions, big or small, and are here to guide you along the way. To get started contributing, check our first timer issues or message us in the ParadeDB Community Slack. Once you contribute, ping us in Slack and we'll send you some ParadeDB swag!

For more information on how to contribute, please see our Contributing Guide.

This project is released with a Contributor Code of Conduct. By participating in this project, you agree to follow its terms.

Thank you for helping us make ParadeDB better for everyone ❤️.

License

ParadeDB is licensed under the GNU Affero General Public License v3.0 and as commercial software, with the exception of pg_sparse, which is licensed under the PostgreSQL License.

For commercial licensing, please contact us at [email protected].

If you are an open-source project and would like to use ParadeDB under a different license, please contact us at [email protected].

Contributors

aprilnea, aragalie, cathrach, coderjoshdk, dependabot[bot], djsavvy, dulacp, eduardojm, juleskuehn, lilit0x, maparent, mauaraujo, neilyio, philippemnoel, pratheekrebala, rebasedming, sardination, stevelauc, vladdoster, workingjubilee, yihong0618


Issues

Make deploy README tags for cloud platforms

It is possible to create a README banner which serves as a button to seamlessly deploy our product to a specific cloud platform when self-hosting. Airbyte had this, notably. I think we should do this for

  • AWS
  • GCP
  • DigitalOcean (?)
  • Heroku (?)

Potentially others. We can start with AWS and GCP as the two main ones we've heard of people using from our customer conversations.

Use ML node instead of data node for deploying models

Is your feature request related to a problem? Please describe.
Currently we are using the data node to deploy models. We have gotten a "circuit breaker out of memory" error in development and we think that this could be because we aren't using the ML node.

Describe the solution you'd like
Enable the use of ML nodes
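
A minimal sketch of what enabling this could look like, assuming opensearch-py and the ml-commons plugins.ml_commons.only_run_on_ml_node cluster setting; host and port are illustrative:

# Sketch: restrict ml-commons model execution to dedicated ML nodes.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.cluster.put_settings(
    body={
        "persistent": {
            # When true, models run on ML nodes instead of data nodes.
            "plugins.ml_commons.only_run_on_ml_node": True
        }
    }
)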


Switch from `models/_upload` to `models/_register`

Is your feature request related to a problem? Please describe.
We got a warning that _upload is deprecated and that _register should be used instead.
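
A minimal sketch of switching endpoints, assuming opensearch-py and the ml-commons register API; the model name, version, and format are illustrative:

# Sketch: call models/_register instead of the deprecated models/_upload.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.transport.perform_request(
    "POST",
    "/_plugins/_ml/models/_register",
    body={
        "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
    },
)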


ci: Prod Promotion [7/31/23]

Describe the solution you'd like

  • Documentation for new with_semantic and with_neural functions


Enable developers to disable real-time sync

Is your feature request related to a problem? Please describe.
Users should be able to choose whether real-time sync is enabled.

Describe the solution you'd like
The pgsync CLI provides a no-daemon option we can use and expose to the user


Add database setup instructions to documentation

Is your feature request related to a problem? Please describe.
We need to add instructions on how to set up the Postgres database to our documentation, i.e. editing the .conf file and enabling logical replication.

Introduce `info` method to the `Index` class

Is your feature request related to a problem? Please describe.
Currently it's hard to see the status and details of an index.

Describe the solution you'd like
The Index class should have an info method that returns an object with the following information (sketched below):

  1. Index mapping i.e. field names and types
  2. Index size i.e. number of documents
  3. Which columns are "neural columns"
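
A hypothetical sketch of what info could return, assuming an opensearch-py client and that "neural columns" are the knn_vector fields; all names are illustrative:

# Hypothetical sketch of the proposed Index.info method.
class Index:
    def __init__(self, client, name):
        self.client = client
        self.name = name

    def info(self):
        mapping = self.client.indices.get_mapping(index=self.name)
        properties = mapping[self.name]["mappings"].get("properties", {})
        return {
            # 1. Index mapping: field names and types
            "mapping": {field: spec.get("type") for field, spec in properties.items()},
            # 2. Index size: number of documents
            "num_documents": self.client.count(index=self.name)["count"],
            # 3. Neural columns (assumed to be the knn_vector fields)
            "neural_columns": [
                field for field, spec in properties.items()
                if spec.get("type") == "knn_vector"
            ],
        }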

Decouple `neural columns` from data upload

Is your feature request related to a problem? Please describe.
Currently the user must set which columns to perform neural search over as part of the add_source function. This can be confusing, and is also non-ideal because they aren't able to specify neural search columns when uploading data from memory via the upsert function.

Describe the solution you'd like
Expose a separate Index.register_neural_search_fields function and remove neural_columns from add_source.
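
A hypothetical usage sketch of the proposed API, reusing the create_index, add_source, and upsert names from this repo's examples:

# Neural columns are registered on the index itself, independently of
# how the data arrives (add_source or upsert).
index = client.create_index("my_index")

index.add_source(database, table)  # no neural_columns argument anymore
index.register_neural_search_fields(["title", "description"])

# Data upserted from memory is now also covered by neural search.
index.upsert(
    ids=[1],
    data=[{"title": "hello", "description": "world"}],
)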


Bring back mypy and flake8

Is your feature request related to a problem? Please describe.
When migrating repos, we disabled flake8 (linting) and mypy (static type checking). We should bring these back and add them to GH actions.

Running list of tests to add

Is your feature request related to a problem? Please describe.
This is a running list of bugs we have encountered and fixed. The idea is that when we have the integration testing framework set up, we will refer to this issue and write tests for each bug fix to ensure that they don't resurface!

  • Throw error if we add more sources than the DB has replication slots
  • Don't allow Debezium to use the same replication slot for two tables

Fix `model_not_deployed` bug when uploading data

Describe the bug
Sometimes, when upserting data into an index, it will fail with a model_not_deployed error.

To Reproduce
I can't reproduce it very consistently but I've seen it happen when I close and restart a container and then try to upsert data. Re-running the script to create/load the model doesn't seem to do anything. I've noticed that if I wait a few minutes and upsert data again the error goes away, which makes me think we aren't waiting on something.

Check that all columns exist before creating an index

Is your feature request related to a problem? Please describe.
Currently, if the user passes incorrect column names into the index function, no warning is raised.

Describe the solution you'd like
When creating an index, check to see that all column names exist and are valid
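
A minimal sketch of such a check; conn is assumed to be a psycopg2 connection, and the helper name is illustrative:

# Sketch: validate requested columns against information_schema before indexing.
def validate_columns(conn, table, columns):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
            (table,),
        )
        existing = {row[0] for row in cur.fetchall()}
    missing = set(columns) - existing
    if missing:
        raise ValueError(f"Columns not found in table {table!r}: {sorted(missing)}")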


Opensearch shouldn't be single node in Docker Compose

Describe the bug
Right now, when users run docker compose up, OpenSearch runs as a single node. We should enable multi-node mode, since users may be using the Docker Compose stack for prod-like use cases too.

To Reproduce
Steps to reproduce the behavior:

  1. Run docker compose up

Expected behavior
Multi-node OpenSearch

Add SSL verification to OpenSearch requests

Is your feature request related to a problem? Please describe.
Currently we don't do SSL verification, so our OpenSearch API calls give:

Unverified HTTPS request is being made to host 'core'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings

Describe the solution you'd like
Enable SSL


Implement Distributed Search

What
One of the key features of Elastic is sharding, which enables concurrent processing of search queries. We need to do the same for ParadeDB, and we are planning to use a combination of Citus's schema and Tantivy's schema features.

Why
Be able to scale horizontally to distribute load when searching

How
Unclear; the current hypothesis is Citus + Tantivy via their schema functionality.

Document Python client

Is your feature request related to a problem? Please describe.
We should introduce a new Clients section to the documentation that documents the Python client - all its classes and methods.

Add multi-threading capabilities to the `index` function

Is your feature request related to a problem? Please describe.
Currently, the index function is single-threaded: it reads n rows at a time and indexes them. This will take an unreasonably long time for tables with millions of rows.

Describe the solution you'd like
Introduce multi-threading: create multiple connections to the database and index rows in parallel (see the sketch below).
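
A rough sketch of the idea, assuming each worker opens its own database connection; batch sizes and helper names are illustrative:

# Sketch: index batches of rows in parallel with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def index_batch(offset, batch_size):
    # Each worker opens its own connection, reads batch_size rows
    # starting at offset, and bulk-indexes them.
    ...

def index_table(total_rows, batch_size=1000, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(index_batch, offset, batch_size)
            for offset in range(0, total_rows, batch_size)
        ]
        for future in futures:
            future.result()  # surface any per-batch exceptions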

Describe alternatives you've considered
N/A

Additional context
N/A

Introduce a `Concepts` section to the README

Is your feature request related to a problem? Please describe.
N/A

Describe the solution you'd like
Introduce a Concepts section to the README that goes over the basics of OpenSearch (indexes, documents, fields, and queries)


Add Intercom to docs

Is your feature request related to a problem? Please describe.
Create an Intercom account. In mint.json, add

"integrations": {
   "intercom": "appId"
}

Modify the Python SDK to allow users to customize the search index

Is your feature request related to a problem? Please describe.
Currently the Python SDK maps a table to an index 1:1. This is a problem when users want to search over data that requires two tables to be JOINed - there's no way to search across multiple indexes or put two tables into the same index.

Describe the solution you'd like
We can alter the Python SDK interface by exposing the index to the developer and allowing them to attach multiple tables to the same index.

Here's a pseudocode example.

BEFORE:

import json
import os

from retakesearch import Client, Database, Table

client = Client(api_key=os.getenv("RETAKE_API_KEY"), url=os.getenv("RETAKE_API_URL"))

database = Database(
    host=os.getenv("DATABASE_HOST"),
    port=os.getenv("DATABASE_PORT"),
    user=os.getenv("DATABASE_USER"),
    password=os.getenv("DATABASE_PASSWORD"),
)

table = Table(
    name=os.getenv("DATABASE_TABLE_NAME"),
    primary_key=os.getenv("DATABASE_TABLE_PRIMARY_KEY"),
    columns=json.loads(os.getenv("DATABASE_TABLE_COLUMNS")),
    neural_columns=json.loads(os.getenv("DATABASE_TABLE_COLUMNS")),
)

response = client.index(database, table)

AFTER:

import os

from retakesearch import Client, Database, Table

client = Client(api_key=os.getenv("RETAKE_API_KEY"), url=os.getenv("RETAKE_API_URL"))
index = client.create_index("my_index")

database = Database(...)
table1 = Table(...)
table2 = Table(...)

index.add_source(database, table1)
index.add_source(database, table2)

Describe alternatives you've considered
Eventually it would be cool to create indices from views, but I believe this is challenging to do in real time without something like Materialize.

Cannot integrate with Supabase

Describe the bug
Logical replication requires superuser privileges, which Supabase does not grant to users.

To Reproduce
Steps to reproduce the behavior:

  1. Try to connect to Supabase using add_source


Create custom error for neural search over a non-vectorized field

Is your feature request related to a problem? Please describe.
Currently if the user performs a neural search over a non-vectorized field they would get an error message that looks like this

Exception: "Failed to search documents: RequestError(400, 'search_phase_execution_exception', \"failed to create query: Field 'field_retake_embedding' is not knn_vector type.\")"

Describe the solution you'd like
We should catch this error and return a more understandable error message
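
A minimal sketch of catching and rewrapping the error, assuming opensearch-py's RequestError; the helper and message wording are illustrative:

# Sketch: turn the raw knn_vector RequestError into an actionable message.
from opensearchpy.exceptions import RequestError

def neural_search(client, index, body, field):
    try:
        return client.search(index=index, body=body)
    except RequestError as err:
        if "knn_vector" in str(err):
            raise ValueError(
                f"Field '{field}' has not been vectorized; "
                "vectorize it before running a neural search."
            ) from err
        raise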


Replace Kafka with pgsync

Is your feature request related to a problem? Please describe.
Kafka is pretty tough to work with and maintain. pgsync seems like a more lightweight alternative that will also save us tons of engineering effort. It will help us close issues like #101.


Make documentation clearer by providing an example Table

Is your feature request related to a problem? Please describe.
The documentation uses a lot of random dummy variable names like "faqs" or "column_name." We should standardize all of this and make it clearer by showing the user the example table schema that will be used for all example code blocks.

Sync Postgres `DELETE` events

Is your feature request related to a problem? Please describe.
Right now, if a row is deleted in Postgres, it does not get deleted in OpenSearch. This could result in dead search results that may frustrate users.

Describe the solution you'd like
Listen for _delete=True Kafka events and delete the corresponding document in OpenSearch by ID.
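
A minimal sketch of the consumer side, assuming kafka-python and opensearch-py; the topic, index, and field names are illustrative:

# Sketch: delete OpenSearch documents for rows deleted in Postgres.
import json

from kafka import KafkaConsumer
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
consumer = KafkaConsumer(
    "table_changes",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event.get("_delete"):
        client.delete(index="my_index", id=event["id"])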

Replace UUID w/ commithash + optional email

Is your feature request related to a problem? Please describe.
Currently, we use a UUID to get telemetry for new deploys. This is simple and works well, but doesn't tell us which version developers are running, which can make debugging with them hard. We should switch to a UUID plus commit hash, or ask developers for their emails if they're willing to share (optional, of course).

Convert our PyPi deployment workflows to run following a GitHub Release creation

Is your feature request related to a problem? Please describe.
This appears to be GitHub's recommended way to run deploys, and it is how we already do it for DockerHub and NPM.

Describe the solution you'd like
Just switch the triggers and a thing or two to run after a GitHub tag gets created.

Describe alternatives you've considered
N/A

Additional context
N/A

Create a Dockerfile with third-party Postgres Extensions

What
Currently, I only install the pg_bm25 extension. We should install all the other extensions we want there. As part of this, I would like to also structure the Dockerfile a bit better/cleaner, so it's easier to distinguish the build stages visually.

Can also check pgxman, trunk, and pgxn to find other extensions to build/list:

A running list:

Why
Have support for all extensions we need to deliver our product!

How
For each extension, we need to add it to the .json file in the conf/ folder, add its version everywhere it's referenced, and then trigger a build!

`pgsync` service blocks the API service

Describe the bug
I've observed that when pgsync is running, API calls fail until pgsync finishes.

To Reproduce
Steps to reproduce the behavior:

  1. Run add_source
  2. Try to run any other command, like Index.describe_index - the API call never gets received/responds

Expected behavior
These two services should be async. Also, we should look into increasing the number of uvicorn workers for a similar reason.

Introduce integration testing framework

Is your feature request related to a problem? Please describe.
We need integration tests so things don't break :)

Describe the solution you'd like
An integration testing framework using pytest, very similar to the one I built in our ETL repo.

What it will do (see the sketch after this list):

  1. Spin up a Dockerized Postgres fixture, and load it with some fake tables and data
  2. Spin up our docker compose stack (i.e. opensearch, kafka, and api) as another fixture
  3. Use these two fixtures to test all our core API functions
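
A rough sketch of the first two fixtures, assuming pytest and psycopg2; container names, ports, and credentials are illustrative, and readiness waits are omitted:

# Sketch: session-scoped fixtures for a throwaway Postgres and the compose stack.
import subprocess

import psycopg2
import pytest

@pytest.fixture(scope="session")
def postgres():
    subprocess.run(
        ["docker", "run", "-d", "--name", "test_pg",
         "-e", "POSTGRES_PASSWORD=postgres", "-p", "5433:5432", "postgres:15"],
        check=True,
    )
    conn = psycopg2.connect(
        host="localhost", port=5433, user="postgres", password="postgres"
    )
    # Load fake tables and data here.
    yield conn
    conn.close()
    subprocess.run(["docker", "rm", "-f", "test_pg"], check=True)

@pytest.fixture(scope="session")
def stack():
    # Spin up the opensearch/kafka/api compose stack.
    subprocess.run(["docker", "compose", "up", "-d"], check=True)
    yield
    subprocess.run(["docker", "compose", "down"], check=True)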

Types file is not being recognized

Describe the bug
Our npm library isn't getting typed correctly, something to do with our rollup/package.json configuration.

To Reproduce
Steps to reproduce the behavior:

  1. Install retake-search and import the SDK into any Nodejs environment
  2. VSCode will highlight the lack of types

Expected behavior
Our library should be typed

Model gets unloaded when docker container restarts

Describe the bug
The user will get a "model not loaded" or similar error when they stop and restart the docker compose stack. They will need to re-run vectorize to fix this.

To Reproduce

  1. Create an index
  2. Populate it with data
  3. Vectorize some of the fields
  4. Stop docker
  5. Restart docker
  6. Get the same index
  7. Perform a search on the index

Expected behavior
We should ensure that the model is always loaded


Remove Sink-specific dependencies

What

Currently the library bundles in all sink dependencies. For instance, installing retake will also install the Python clients for Pinecone, Elastic, etc. We should remove these from the library and ask users to install dependencies themselves based on what sinks they're using.

Why

Makes the library more lightweight and prevents the number of dependencies from exploding as we support more sinks

How

Remove all sink-specific dependencies and instruct users on which dependencies to install in the documentation
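
One way to do this is optional extras in the package metadata; a minimal sketch assuming setuptools, with illustrative package names:

# Sketch: move sink clients into optional extras so the base install stays light.
from setuptools import find_packages, setup

setup(
    name="retake",
    packages=find_packages(),
    install_requires=["requests"],  # core dependencies only (illustrative)
    extras_require={
        # Installed on demand, e.g. `pip install retake[elastic]`.
        "elastic": ["elasticsearch"],
        "pinecone": ["pinecone-client"],
        "opensearch": ["opensearch-py"],
    },
)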

Throw error if there aren't enough replication slots

Is your feature request related to a problem? Please describe.
There must be at least as many replication slots as there are tables connected to with Debezium. Right now, the connectors silently fail if there are too few replication slots.

Describe the solution you'd like
Detect and throw an error early if there aren't enough slots.
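
A minimal sketch of the check, assuming psycopg2 and one slot needed per connected table; the helper name is illustrative:

# Sketch: fail fast when the server lacks enough free replication slots.
def check_replication_slots(conn, tables_to_sync):
    with conn.cursor() as cur:
        cur.execute("SHOW max_replication_slots")
        max_slots = int(cur.fetchone()[0])
        cur.execute("SELECT count(*) FROM pg_replication_slots")
        used_slots = cur.fetchone()[0]
    free_slots = max_slots - used_slots
    if free_slots < len(tables_to_sync):
        raise RuntimeError(
            f"Need {len(tables_to_sync)} free replication slots, "
            f"but only {free_slots} are available."
        )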


Dynamic field inference fails on non-string types

Describe the bug
When uploading non-string data using add_source, pgsync throws an error that the field type is not recognized, but the error is not returned to the user.

To Reproduce
Steps to reproduce the behavior:

  1. Upload a table with a number column

Expected behavior
User should be asked to specify the field type.


Allow users to specify index field type

Is your feature request related to a problem? Please describe.
Allow users to specify the index field type, for instance keyword vs. text. OpenSearch by default makes all strings "text" type and users might want some fields to be "keyword" for exact matching.

Describe the solution you'd like
Modify the columns attribute of the Table class to allow users to pass in a list of dicts, where each dict specifies both the column name and the corresponding field type, e.g.

Table(
   columns = [
     {
        "name": "column1",
        "field_type": "keyword"
     }
   ]
)

Enable users to insert custom documents to an index

Is your feature request related to a problem? Please describe.
Currently the only way to add documents to an index is with the add_source function, which attaches an entire Postgres table to the index. I suspect users will have use cases where they need to add ad-hoc data that's not in a Postgres table, and we should support that.

Describe the solution you'd like
Introduce an upsert function that looks something like

index.upsert(
  ids=[1, 2, 3],
  data=[{"key1": "value1"}, {"key2": "value2", "key3": "value3"}]
)

Tables containing empty string columns fail to vectorize

Describe the bug
Tables containing empty string columns fail to vectorize.

To Reproduce
Steps to reproduce the behavior:

  1. Call addSource and vectorize on a table with text columns, some of which contain empty values

Additional context
The reason is that pgsync creates fields as empty strings when the value is NULL, when it should omit them from the document. Then, because the field contains an empty string, it can't be embedded.
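
A minimal sketch of the fix on the document-preparation side; the helper name is illustrative:

# Sketch: omit NULL/empty-string fields so they are never sent for embedding.
def prepare_document(row):
    return {key: value for key, value in row.items() if value not in (None, "")}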

Investigate other embedding models

Is your feature request related to a problem? Please describe.
N/A

Describe the solution you'd like
Currently we use the pretrained huggingface/sentence-transformers/all-MiniLM-L12-v2 model to generate embeddings - this is hard-coded into the server.

In practice we could use any other HuggingFace model and could also enable users to choose their own model (https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/).

Is there a better model we could be using? Should we enable configurability here or is it better to make an opinionated choice on behalf of the user?


Implement a way to notify the client when the Debezium snapshot is finished

Is your feature request related to a problem? Please describe.
Currently, we don't know when the initial data upload (i.e. the Debezium snapshot) is finished, so we have no way of notifying the client when they're ready to execute search queries.

Describe the solution you'd like
Read the size of the table and notify the client when the size of documents uploaded matches the size of the table.
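
A minimal sketch of that polling approach, assuming psycopg2 and opensearch-py; all names are illustrative:

# Sketch: poll until the index document count matches the table row count.
import time

def wait_for_snapshot(conn, client, table, index, poll_seconds=5):
    with conn.cursor() as cur:
        cur.execute(f"SELECT count(*) FROM {table}")  # table name is trusted here
        table_rows = cur.fetchone()[0]
    while client.count(index=index)["count"] < table_rows:
        time.sleep(poll_seconds)
    # Snapshot complete: the client can now execute search queries.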
