GlareDB

About

Data exists everywhere: on your laptop, in Postgres, in Snowflake, and as files in S3. It comes in various formats such as Parquet, CSV, and JSON. Regardless, getting the insights you need will always involve multiple steps spanning several destinations.

GlareDB is designed to query your data wherever it lives using SQL that you already know.

Install

Install/update glaredb in the current directory:

curl https://glaredb.com/install.sh | sh

It may be helpful to install the binary in a location on your PATH. For example, ~/.local/bin.

If you prefer manual installation, download, extract, and run the GlareDB binary from a release on our releases page.

Getting Started

After installing, get up and running with:

Local CLI

To start a local session, run the binary:

./glaredb

Or, you can execute SQL and immediately return (try it out!):

# Query a CSV on Hugging Face
./glaredb --query "SELECT * FROM \
'https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/raw/main/prompts.csv';"

To see all options use --help:

./glaredb --help

Hybrid Execution

  1. Sign up at https://console.glaredb.com for a free fully-managed deployment of GlareDB

  2. Copy the connection string from GlareDB Cloud, for example:

    ./glaredb --cloud-url="glaredb://user:pass@host:port/deployment"
    # or
    ./glaredb
    > \open "glaredb://user:pass@host:port/deployment"

Read our announcement on Hybrid Execution for more information.

Using GlareDB in Python

  1. Install the official GlareDB Python library

    pip install glaredb
  2. Import and use glaredb.

    import glaredb
    con = glaredb.connect()
    con.sql("select 'hello world';").show()

To use Hybrid Execution, sign up at https://console.glaredb.com and use the connection string for your deployment. For example:

import glaredb
con = glaredb.connect("glaredb://user:pass@host:port/deployment")
con.sql("select 'hello hybrid exec';").show()

GlareDB works with Pandas and Polars DataFrames out of the box:

import glaredb
import polars as pl

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

con = glaredb.connect()

df = con.sql("select * from df where fruits = 'banana'").to_polars()

print(df)

Local Server

The server subcommand can be used to launch a server process for GlareDB:

./glaredb server

To see all options for running in server mode, use --help:

./glaredb server --help

When launched as a server process, GlareDB can be reached on port 6543 using a Postgres client. The following example uses psql to connect to a locally running server:

psql "host=localhost user=glaredb dbname=glaredb port=6543"

Configure the First Data Source

You can use a demo Postgres instance at pg.demo.glaredb.com. Adding this Postgres instance as a data source is as easy as running the following command:

CREATE EXTERNAL DATABASE my_pg
    FROM postgres
    OPTIONS (
        host = 'pg.demo.glaredb.com',
        port = '5432',
        user = 'demo',
        password = 'demo',
        database = 'postgres'
    );

Once the data source has been added, it can be queried using fully qualified table names:

SELECT *
FROM my_pg.public.lineitem
WHERE l_shipdate <= date '1998-12-01' - INTERVAL '90 days'
LIMIT 5;

Check out the docs to learn about all supported data sources. Many data sources can be connected to the same GlareDB instance.

Done with this data source? Remove it with the following command:

DROP DATABASE my_pg;

Supported Data Sources

Source Read INSERT INTO COPY TO Table Function External Table External Database
Databases -- -- -- -- -- --
MySQL
PostgreSQL
MariaDB (via mysql)
MongoDB
Microsoft SQL Server 🚧 🚧
Snowflake 🚧 🚧
BigQuery 🚧 🚧
Cassandra/ScyllaDB 🚧 🚧
ClickHouse 🚧 🚧
Oracle 🚧 🚧 🚧 🚧 🚧 🚧
ADBC 🚧 🚧 🚧 🚧 🚧 🚧
ODBC 🚧 🚧 🚧 🚧 🚧 🚧
Database Files -- -- -- -- -- --
SQLite 🚧
Microsoft Excel 🚧 🚧
DuckDB 🚧 🚧 🚧 🚧 🚧 🚧
File Formats -- -- -- -- -- --
Apache Arrow 🚧
Apache Parquet 🚧
CSV 🚧
JSON 🚧
BSON 🚧
Apache Avro 🚧 🚧 🚧 🚧 🚧
Apache ORC 🚧 🚧 🚧 🚧 🚧
Table Formats -- -- -- -- -- --
Lance
Delta
Iceberg 🚧 🚧

✅ = Supported ➖ = Not Applicable 🚧 = Not Yet Supported

Building from Source

Building GlareDB requires Rust/Cargo to be installed. Check out rustup for an easy way to install Rust on your system.

Running the following command will build a release binary:

just build --release

The compiled release binary can be found in target/release/glaredb.

Documentation

Browse the GlareDB documentation at docs.glaredb.com.

Contributing

Contributions welcome! Check out CONTRIBUTING.md for how to get started.

License

See LICENSE. Unless otherwise noted, this license applies to all files in this repository.

Acknowledgements

GlareDB is proudly powered by Apache DataFusion and Apache Arrow. We are grateful for the work of the Apache Software Foundation and the community around these projects.


Issues

all: Remove unused/old crates

In particular, the coretypes crate has been replaced with the lemur crate.

lemur and sqlengine are the two core crates we need to make sure to keep. diststore has an accord module that might be useful to use/revamp in the future.

sqlengine: Return column names in query result

Currently only the resulting dataframe from executing a query is returned.

I don't know how we want to return the columns yet, but all the information required is in the Scope struct.

lemur: Implement "And" and "Or" for scalar expressions

This will require and, and_scalar, or, and or_scalar implementations in the column compute module.

E.g.

pub fn and(left: &Column, right: &Column) -> Result<Column> {
    let left = left
        .try_downcast_bool()
        .ok_or_else(|| internal!("left not a bool column"))?;
    let right = right
        .try_downcast_bool()
        .ok_or_else(|| internal!("right not a bool column"))?;
    let out = arrow2::compute::boolean::and(left.0, right.0);
    Ok(out.into())
}
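
The or variant would mirror this exactly. A rough sketch, assuming the same Column/BoolColumn helpers and internal! macro as above (not checked against the current lemur code):

pub fn or(left: &Column, right: &Column) -> Result<Column> {
    let left = left
        .try_downcast_bool()
        .ok_or_else(|| internal!("left not a bool column"))?;
    let right = right
        .try_downcast_bool()
        .ok_or_else(|| internal!("right not a bool column"))?;
    // arrow2 provides the element-wise boolean `or` kernel.
    let out = arrow2::compute::boolean::or(left.0, right.0);
    Ok(out.into())
}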

sqlengine: Add rewrite rule for constant folding

For each scalar expression, we should constant fold as much as possible. This pretty much entails walking scalar expressions and folding subtrees that don't contain a Column or Aggregate into a Constant.

A method should be added to PlanExpr for this which takes self and produces a new PlanExpr. And then during plan rewriting, we can just call that method for each expression we encounter.
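
A minimal, standalone sketch of the folding idea, using a toy Expr enum rather than the real PlanExpr (which has more variants that block folding):

#[derive(Clone, Debug)]
enum Expr {
    Constant(i64),
    Column(usize),
    Add(Box<Expr>, Box<Expr>),
}

impl Expr {
    fn constant_fold(self) -> Expr {
        match self {
            Expr::Add(l, r) => match (l.constant_fold(), r.constant_fold()) {
                // Both subtrees folded down to constants, so fold the add too.
                (Expr::Constant(a), Expr::Constant(b)) => Expr::Constant(a + b),
                // A Column (or Aggregate) somewhere below prevents folding.
                (l, r) => Expr::Add(Box::new(l), Box::new(r)),
            },
            other => other,
        }
    }
}

fn main() {
    // (1 + 2) + col0  folds to  3 + col0
    let expr = Expr::Add(
        Box::new(Expr::Add(
            Box::new(Expr::Constant(1)),
            Box::new(Expr::Constant(2)),
        )),
        Box::new(Expr::Column(0)),
    );
    println!("{:?}", expr.constant_fold());
}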

sqlengine: Design initial system catalogs

The primary goal here is to get a database catalog up and running to allow for simple queries (create, insert, select, query). Adding index information should come later.

A previous version of the sqlengine crate had a very minimal proof-of-concept catalog.

sqlengine: Implement a subset of Postgres wire protocol

Implement a subset of the Postgres wire protocol (see the Postgres frontend/backend protocol documentation) to allow for execution of simple queries.

There's currently no concept of users/authentication, so when implementing AuthenticationCleartextPassword, it's sufficient to use a hardcoded username and an empty password.

Inspirations:

See also msql-srv for how this could be structured in a way to allow stubbing out the database itself. This isn't necessary to do, but just might make testing easier.
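
For reference, a minimal standalone sketch of the handshake bytes for the cleartext-password flow (message formats per the Postgres protocol docs; this assumes the startup message has already been read off the socket and skips ParameterStatus/BackendKeyData):

use std::io::Write;
use std::net::TcpStream;

// Send the minimal authentication exchange. Reading and discarding the
// frontend's PasswordMessage ('p') between the two writes is omitted here.
fn send_auth_handshake(stream: &mut TcpStream) -> std::io::Result<()> {
    // AuthenticationCleartextPassword: 'R', length 8, auth code 3.
    stream.write_all(&[b'R', 0, 0, 0, 8, 0, 0, 0, 3])?;
    // ...frontend replies with a PasswordMessage, which we accept unconditionally...
    // AuthenticationOk: 'R', length 8, auth code 0.
    stream.write_all(&[b'R', 0, 0, 0, 8, 0, 0, 0, 0])?;
    // ReadyForQuery: 'Z', length 5, transaction status 'I' (idle).
    stream.write_all(&[b'Z', 0, 0, 0, 5, b'I'])?;
    Ok(())
}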

sqlengine: Push filtering expressions past joins

The current filter pushdown rewriter only pushes a filtering expression down if the child node is a ScanSource node. We should do the same for joins.

There are two things of note here: we should be able to push down predicates in actual Filter nodes, as well as predicates in the Join node itself.

For example, this query produces the following plan.

select * from t1 inner join t2 on t1.b = t2.b where t1.a > 5
plan: Read(
    Project(
        Project {
            columns: [
                Column(
                    0,
                ),
                Column(
                    1,
                ),
                Column(
                    2,
                ),
                Column(
                    3,
                ),
            ],
            input: Filter(
                Filter {
                    predicate: Binary {
                        op: Gt,
                        left: Column(
                            0,
                        ),
                        right: Constant(
                            Int8(
                                Some(
                                    5,
                                ),
                            ),
                        ),
                    },
                    input: Join(
                        Join {
                            left: ScanSource(
                                ScanSource {
                                    table: TableReference {
                                        catalog: "catalog",
                                        schema: "schema",
                                        table: "t1",
                                    },
                                    filter: None,
                                    schema: Schema {
                                        types: [
                                            Int8,
                                            Int8,
                                        ],
                                    },
                                },
                            ),
                            right: ScanSource(
                                ScanSource {
                                    table: TableReference {
                                        catalog: "catalog",
                                        schema: "schema",
                                        table: "t2",
                                    },
                                    filter: None,
                                    schema: Schema {
                                        types: [
                                            Int8,
                                            Int8,
                                        ],
                                    },
                                },
                            ),
                            join_type: Inner,
                            on: Binary {
                                op: Eq,
                                left: Column(
                                    1,
                                ),
                                right: Column(
                                    2,
                                ),
                            },
                        },
                    ),
                },
            ),
        },
    ),
)

Note that the filter is above the join, and no filters are provided to any of the scans. We (I think) have enough context in the plan to be able to push the above filters and predicates into their respective scans.

This will require extracting sub expressions for more complex expressions that reference both sides of the join. For example, the filtering expression t1.a > 5 and t2.b > 10 would need to be split so the relevant sub expressions can be passed to the source table scans.
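
A standalone sketch of that splitting step, using a toy predicate type (the real ScalarExpr and column numbering will differ; the left_cols cutoff assumes the join output is left columns followed by right columns):

#[derive(Clone, Debug)]
enum Pred {
    And(Box<Pred>, Box<Pred>),
    Leaf { columns: Vec<usize> }, // stand-in for any non-AND comparison
}

// Collect every column index referenced by a predicate.
fn columns(pred: &Pred, out: &mut Vec<usize>) {
    match pred {
        Pred::And(l, r) => {
            columns(l, out);
            columns(r, out);
        }
        Pred::Leaf { columns: cols } => out.extend(cols),
    }
}

/// Split a conjunction into (push to left scan, push to right scan, keep at the join).
fn split(pred: Pred, left_cols: usize) -> (Vec<Pred>, Vec<Pred>, Vec<Pred>) {
    match pred {
        Pred::And(l, r) => {
            let (mut a, mut b, mut c) = split(*l, left_cols);
            let (a2, b2, c2) = split(*r, left_cols);
            a.extend(a2);
            b.extend(b2);
            c.extend(c2);
            (a, b, c)
        }
        leaf => {
            let mut cols = Vec::new();
            columns(&leaf, &mut cols);
            if cols.iter().all(|&c| c < left_cols) {
                (vec![leaf], vec![], vec![])
            } else if cols.iter().all(|&c| c >= left_cols) {
                (vec![], vec![leaf], vec![])
            } else {
                // References both sides: stays as the join predicate/filter.
                (vec![], vec![], vec![leaf])
            }
        }
    }
}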

lemur(arrow2): Add execution context for query executors

The primary use case will be for capping memory usage during query execution. Right now, we can be very coarse with how we determine how much memory an executor is using.

This will also be used to allow for disk spillage once we run out of memory. Disk spillage won't be too fancy, and we'll make use of arrow2's io utilities for writing out to files.

This will need to be done after the majority of the query executors are implemented so that we have a good understanding of what we need from the execution context. In the meantime, we should note in the code where we should be spilling.

This will require a minimal design doc prior to implementation.
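
A rough sketch of the kind of shared context executors could check before allocating; the names here are illustrative, not an existing lemur API:

use std::sync::atomic::{AtomicUsize, Ordering};

pub struct ExecContext {
    mem_limit: usize,
    mem_used: AtomicUsize,
}

impl ExecContext {
    pub fn new(mem_limit: usize) -> Self {
        Self { mem_limit, mem_used: AtomicUsize::new(0) }
    }

    /// Try to reserve `bytes`; returning false signals the executor to spill.
    pub fn try_reserve(&self, bytes: usize) -> bool {
        let prev = self.mem_used.fetch_add(bytes, Ordering::SeqCst);
        if prev + bytes > self.mem_limit {
            // Roll back the reservation so other executors aren't penalized.
            self.mem_used.fetch_sub(bytes, Ordering::SeqCst);
            return false;
        }
        true
    }

    /// Release a previously successful reservation.
    pub fn release(&self, bytes: usize) {
        self.mem_used.fetch_sub(bytes, Ordering::SeqCst);
    }
}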

lemur: Add scan_values_equal method to ReadTx trait

/// A readable source is able to read dataframes and dataframe schemas.
#[async_trait]
pub trait ReadTx: Sync + Send {
    /// Read from a source, returning a stream of dataframes.
    ///
    /// An optional filter expression can be provided.
    ///
    /// Returns `None` if the table doesn't exist.
    async fn scan(
        &self,
        table: &RelationKey,
        filter: Option<ScalarExpr>,
    ) -> Result<Option<DataFrameStream>>;

    /// Get the schema for a given table.
    ///
    /// Returns `None` if the table doesn't exist.
    async fn get_schema(&self, table: &RelationKey) -> Result<Option<Schema>>;
}

/// A writeable source is able to write dataframes to underlying tables, as well
/// as create, alter, and delete tables.
#[async_trait]
pub trait WriteTx: ReadTx + Sync + Send {
    async fn commit(self) -> Result<()>;
    async fn rollback(self) -> Result<()>;

    /// Create a table with the given schema. Errors if the table already
    /// exists.
    async fn create_table(&self, table: RelationKey, schema: Schema) -> Result<()>;

    /// Drop a table. Errors if the table doesn't exist.
    async fn drop_table(&self, table: &RelationKey) -> Result<()>;

    /// Insert data into a table. Errors if the table doesn't exist.
    async fn insert(&self, table: &RelationKey, data: DataFrame) -> Result<()>;
}

The approximate api should look something like the following:

async fn scan_values_equal(&self, table: &RelationKey, values: &[(usize, Value)]) -> Result<Option<DataFrameStream>>

values here represents a list of tuples, each pairing a column index with the value expected to be found in that column.

This can be added as a default implementation on the trait, since it can make use of the existing scan method.
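
A hypothetical default implementation, assuming ScalarExpr has Column, Constant, and Binary variants and that BinaryOp has Eq and And (names approximate; adjust to the real expression types):

async fn scan_values_equal(
    &self,
    table: &RelationKey,
    values: &[(usize, Value)],
) -> Result<Option<DataFrameStream>> {
    // Build `col_i = v_i AND col_j = v_j AND ...` and reuse the existing scan.
    let filter = values
        .iter()
        .map(|(idx, value)| ScalarExpr::Binary {
            op: BinaryOp::Eq,
            left: Box::new(ScalarExpr::Column(*idx)),
            right: Box::new(ScalarExpr::Constant(value.clone())),
        })
        .reduce(|acc, expr| ScalarExpr::Binary {
            op: BinaryOp::And,
            left: Box::new(acc),
            right: Box::new(expr),
        });
    self.scan(table, filter).await
}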

storageengine: Implement transactions on top of RocksDB

The current RocksStore has some stuff related to transactions. It currently uses an atomic u64 for generating transaction ids. These ids are not currently referenced when inserting or retrieving data from RocksDB.

This issue encompasses:

  • Introducing key representations that include transaction timestamps for MVCC.
  • Introducing/modifying the RocksTx type.
    • Holds the transaction timestamp.
    • Implements the new lemur execution interfaces.
  • Introducing "write intents" to allow for interactive transactions.

A RocksTx should be able to be instantiated with an arbitrary timestamp. Timestamps will be generated from a hybrid logical clock. This can be used both for running locally as well as running in a cluster. When running locally, the "node id" portion of the timestamp can just be 0.

When converting the HLC to bytes, order must be maintained (such that byte representations of HLCs can be compared directly).
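
A standalone sketch of an order-preserving encoding (field names and widths are illustrative): big-endian wall time, then logical counter, then node id, so lexicographic byte comparison matches the natural timestamp ordering.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct HlcTimestamp {
    pub wall_time_ms: u64,
    pub logical: u32,
    pub node_id: u32,
}

impl HlcTimestamp {
    /// Big-endian concatenation keeps byte order equal to the natural
    /// ordering of (wall_time_ms, logical, node_id).
    pub fn to_bytes(&self) -> [u8; 16] {
        let mut out = [0u8; 16];
        out[..8].copy_from_slice(&self.wall_time_ms.to_be_bytes());
        out[8..12].copy_from_slice(&self.logical.to_be_bytes());
        out[12..].copy_from_slice(&self.node_id.to_be_bytes());
        out
    }
}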

This will require a minimal design doc (particularly for key/value layouts and how those will fit in with secondary indexes).

sqlengine: Add rewrite rule to push down projections

Followup from #22

This will allow us to make use of the projection fields on table scan nodes added in the above issue.

This will inspect Project nodes, extract column indices, and place those indices in the table scan nodes.

As an example, let's say we have a table t1 with the following columns:

Col Type Index
a int 0
b int 1
c int 2

A plan for select b, b + 100 from t1 would have a Projection node with the expressions Column(1), BinaryOp{op: Add, left: Column(1), right: Constant(100)}. Pushing this down is relatively straightforward: we go through each expression and pull out every column index that we find, deduplicate, then add those indices to the projection field on the table scan node. In this case, our projection field would look like Some(vec![1]).

Once we have the indices we want to project, it's important that we rewrite the original expressions to reference the appropriate columns after projection. In this case, our expressions would be Column(0), BinaryOp{op: Add, left: Column(0), right: Constant(100)} since the table scan will only be returning a dataframe with a single column (which happens to be column b).

Note that joins and aggregates will add complexity. The above example is one of the simpler cases.
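
A minimal standalone sketch of the extract/dedupe/rewrite steps on a toy expression type (the real PlanExpr has many more variants):

use std::collections::BTreeMap;

#[derive(Clone, Debug)]
enum Expr {
    Column(usize),
    Constant(i64),
    Add(Box<Expr>, Box<Expr>),
}

// Pull out every column index referenced by an expression.
fn collect_columns(expr: &Expr, out: &mut Vec<usize>) {
    match expr {
        Expr::Column(i) => out.push(*i),
        Expr::Add(l, r) => {
            collect_columns(l, out);
            collect_columns(r, out);
        }
        Expr::Constant(_) => {}
    }
}

// Rewrite column references to their post-projection positions.
fn rewrite(expr: Expr, mapping: &BTreeMap<usize, usize>) -> Expr {
    match expr {
        Expr::Column(i) => Expr::Column(mapping[&i]),
        Expr::Add(l, r) => Expr::Add(Box::new(rewrite(*l, mapping)), Box::new(rewrite(*r, mapping))),
        other => other,
    }
}

fn main() {
    // select b, b + 100 from t1  =>  expressions Column(1), Column(1) + 100
    let exprs = vec![
        Expr::Column(1),
        Expr::Add(Box::new(Expr::Column(1)), Box::new(Expr::Constant(100))),
    ];

    // Deduplicated, sorted indices to push into the scan's projection field.
    let mut cols = Vec::new();
    for e in &exprs {
        collect_columns(e, &mut cols);
    }
    cols.sort_unstable();
    cols.dedup();

    // Map original index -> position in the projected output (here, 1 -> 0).
    let mapping: BTreeMap<usize, usize> =
        cols.iter().enumerate().map(|(pos, &c)| (c, pos)).collect();
    let rewritten: Vec<Expr> = exprs.into_iter().map(|e| rewrite(e, &mapping)).collect();
    println!("projection: {:?}, rewritten: {:?}", cols, rewritten);
}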

lemur: Implement scalar expression execution for rows

In some cases (particularly on the storage side), we don't have a full "data frame" available, and instead we're working with sets of rows. In those cases, it's inefficient to transform a row into a data frame just to be able to execute an expression on it.

This may require some changes to the traits related to execution, but I'm unsure of how much.

This issue should be completed in conjunction with building out the storage engine, as that's the primary area where we'll want to execute on individual rows.

ci: Add miri to test unsafe code

Miri is quite slow, so we need to be selective about which crates to test.

Also, miri requires recompiling everything, so we'll have to figure out the caching situation there. I don't want it overwriting the build cache that we're using for testing.

sqlengine: Plan other join types

Planning currently handles only inner and cross joins.

left = match join.join_operator {
    ast::JoinOperator::Inner(constraint) => {
        let on = self.translate_expr(scope, on_expr(constraint)?)?;
        ReadPlan::Join(Join {
            left: Box::new(left),
            right: Box::new(right),
            join_type: JoinType::Inner,
            on: on.lower_scalar()?,
        })
    }
    ast::JoinOperator::CrossJoin => ReadPlan::CrossJoin(CrossJoin {
        left: Box::new(left),
        right: Box::new(right),
    }),
    other => return Err(anyhow!("unsupported join operator: {:?}", other)),
};

We should properly plan left, right, and full outer joins as well.
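
Hypothetical additional match arms mirroring the Inner case above, assuming JoinType grows Left, Right, and Full variants (left outer shown; right and full follow the same pattern):

ast::JoinOperator::LeftOuter(constraint) => {
    let on = self.translate_expr(scope, on_expr(constraint)?)?;
    ReadPlan::Join(Join {
        left: Box::new(left),
        right: Box::new(right),
        join_type: JoinType::Left,
        on: on.lower_scalar()?,
    })
}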

I'm currently unsure what it would take to represent these joins as a CrossJoin followed by a Filter.

ci: Setup dependabot

Ideally we're as up to date as possible for the major crates we're building around (datafusion).

lemur+sqlengine: Add optional "projection" argument for scans

/// Read from a source, returning a stream of dataframes.
///
/// An optional filter expression can be provided.
///
/// Returns `None` if the table doesn't exist.
async fn scan(
    &self,
    table: &RelationKey,
    filter: Option<ScalarExpr>,
) -> Result<Option<DataFrameStream>>;

Queries often only need a subset of columns, so we should be able to pass in Option<&[usize]> (or Option<Vec<usize>>) to select which columns we want to return. Note that this uses column indices and not a ScalarExpr since we want to return a minimal amount of data (it will eventually be going over the network). An expression might produce more data than necessary, and that extra work is better off being done on the "compute" side of things.

E.g. if we were to pass in an expression for the query select a, a * 10 from t, we would have to return a data frame containing both the values for a and a * 10. Whereas passing in indices means the storage source is only returning a, and a * 10 can be computed later.

This will require adding an additional field to the scan nodes in lemur's RelationExpr and sqlengine's ReadPlan. During initial planning, this field will be initialized to None.
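
A possible updated signature (sketch only; whether it takes a slice or an owned Vec is up for discussion):

async fn scan(
    &self,
    table: &RelationKey,
    filter: Option<ScalarExpr>,
    projection: Option<Vec<usize>>,
) -> Result<Option<DataFrameStream>>;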

sqlengine: Ensure system tables exist on startup

Should be enough to call ensure_system_tables for now.

Ideally I should be able to create a test table in the client repl:

~/Code/github.com/glaredb/glaredb [2] $ cargo run --bin glaredb -- client localhost:6543
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/glaredb client 'localhost:6543'`
connected to GlareDB
glaredb> create table test (a int)
error: unexpected response: Error("missing table system.gl_internal.columns")
glaredb>

Note: In the future, we might want to look into more synchronization to ensure only one node in the cluster is creating these tables, either with an explicit "bootstrap" mode, or by relying on transactional semantics.

lemur: Implement other join types

Currently only cross joins are implemented. When lowering to lemur, all joins are converted to a cross join followed by a filter node. This is obviously not efficient.

The current cross join implementation keeps all dataframes in memory, and we'll want to do the same for these other joins as well for the time being.

Nested loop joins would probably be the most straightforward to implement (the existing cross join implementation is pretty much a nested loop join with no predicate). Merge joins should also be somewhat straightforward.

Hash joins will require a bit more design work. We'll also need to be able to detect when a join predicate is an equality predicate between two tables.
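
For reference, a standalone sketch of the nested loop shape over plain in-memory rows (lemur operates on dataframes/chunks, but the loop structure is the same):

type Row = Vec<i64>;

/// Join two in-memory relations, keeping pairs that satisfy the predicate.
fn nested_loop_join(
    left: &[Row],
    right: &[Row],
    predicate: impl Fn(&Row, &Row) -> bool,
) -> Vec<Row> {
    let mut out = Vec::new();
    for l in left {
        for r in right {
            if predicate(l, r) {
                // Output row is the left row concatenated with the right row.
                let mut joined = l.clone();
                joined.extend_from_slice(r);
                out.push(joined);
            }
        }
    }
    out
}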

ci: Fail CI on warnings

Ideally this is a separate step from the "test" step so that it's easy to see at a glance if tests pass on PRs or not.

storageengine: Implement Raft networking

Implement the RaftNetwork and RaftNetworkFactory traits for openraft.

Openraft has a lot of types that need to be implemented, but for networking, these are relatively straightforward. They provide a BasicNode implementation that we can use for the Node type in the type config. NodeId can just be a type alias to u64.

Note that we want to use the main branch of openraft (pinned to a specific commit in Cargo.toml). There are some API changes between what's in main and the docs on docs.rs. I would recommend pulling down openraft and running cargo doc locally.

dev-docs: Write up how to use nix for noobs

I like the idea of nix, and we already have a few nix files in the repo. Unfortunately, nix can be hard to pick up, which might cause a bit of chafing as new devs are onboarded.

I'm proposing that we have a section in the README (or a separate CONTRIBUTING doc) detailing how to use nix (e.g. opening a dev shell, creating a docker image, etc.). This should be aimed at someone who hasn't used nix before, so we need to make sure things like needing to enable flakes and the new nix3 commands are accounted for.

lemur: Implement `Filter` for arrow2 refactor

The current implementation is here:
https://github.com/GlareDB/glaredb/blob/main/crates/lemur/src/execute/stream/read.rs#L258-L275

We'll need to implement it for the arrow2 refactor:
https://github.com/GlareDB/glaredb/blob/arrow2/crates/lemur/src/arrow/queryexec/filter.rs

Everything will be working on Chunk, and the Chunk type has an appropriate filter method. The flow is as follows:

  • For each chunk, execute the scalar expression against that chunk.
  • Get the column result from the expression result, and down cast to a BoolColumn.
  • Call filter with the downcasted BoolColumn.
  • Return the result.

https://docs.rs/futures/latest/futures/prelude/stream/trait.StreamExt.html#method.map will likely be the method of choice to use on the input stream.
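
A hypothetical sketch of that flow; Chunk, ScalarExpr, BoolColumn, and the internal! macro are the arrow2-refactor types and are only approximated here:

use futures::stream::{Stream, StreamExt};

fn filter_stream<S>(input: S, expr: ScalarExpr) -> impl Stream<Item = Result<Chunk>>
where
    S: Stream<Item = Result<Chunk>>,
{
    input.map(move |chunk| {
        let chunk = chunk?;
        // Execute the scalar expression against the chunk.
        let evaluated = expr.evaluate(&chunk)?;
        // Downcast the result to a boolean column to use as the mask.
        let mask = evaluated
            .try_downcast_bool()
            .ok_or_else(|| internal!("predicate did not produce a bool column"))?;
        // Filter the chunk with the mask and return the result.
        chunk.filter(mask)
    })
}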
