
The BMS LakeAPI



A FastAPI plugin that allows you to expose your data lake as an API, supporting several output formats such as Parquet, CSV, JSON, Excel and more.

The LakeAPI also includes a minimal security layer for convenience (Basic Auth), but you can also bring your own.

Unlike roapi, we intentionally do not expose most SQL by default, but limit the possible queries with a config. This makes it easy for you to control what happens to your data. If you want the SQL endpoint, you can enable it. And because we built LakeAPI on the shoulders of giants like FastAPI, a lot of things like documentation and authentication are built in, and we don't have to reinvent the wheel.

To run the application with the default config, just do this:

from fastapi import FastAPI
import bmsdna.lakeapi

app = FastAPI()
bmsdna.lakeapi.init_lakeapi(app)

To adjust the config, you can do something like this:

import dataclasses
from fastapi import FastAPI
import bmsdna.lakeapi

app = FastAPI()
def_cfg = bmsdna.lakeapi.get_default_config()  # get the default startup config
cfg = dataclasses.replace(def_cfg, enable_sql_endpoint=True, data_path="tests/data")  # use dataclasses.replace to set the properties you want
sti = bmsdna.lakeapi.init_lakeapi(app, cfg, "config_test.yml")  # the first parameter is the FastAPI instance, the second the basic config, and the third the config of the tables

Installation


The PyPI package bmsdna-lakeapi can be installed like any Python package: pip install bmsdna-lakeapi

Basic Idea

Based on a YAML configuration and the data source, LakeAPI will automatically generate GET and/or POST endpoints. Calling an endpoint turns the query into a SQL statement that is executed with the engine of your choice (DuckDB or Polars). The result is then serialised into the requested format (JSON, CSV, Arrow, etc.).

This makes it super easy to distribute your data lake data to other systems. We use it internally to feed data to MS SQL Server, SQLite, Streamlit and web apps. You can host it wherever you want; we use Azure Websites, which works fine even for very large amounts of data.

OpenAPI

Of course everything works with OpenAPI and FastAPI. This means you can add other FastAPI routes and use the /docs and /redoc endpoints.

So everything will be fully documented automatically, which is really cool. πŸ”₯πŸ”₯
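
For example, you can register your own FastAPI routes on the same app and they will show up in /docs next to the generated LakeAPI endpoints (a minimal sketch; the /health route is just an illustration):

from fastapi import FastAPI
import bmsdna.lakeapi

app = FastAPI()
bmsdna.lakeapi.init_lakeapi(app)

@app.get("/health")
def health() -> dict:
    # A plain FastAPI route living next to the generated LakeAPI endpoints
    return {"status": "ok"}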

Engine

DuckDB is the default query engine. Polars is also supported, but lacks some features. The query engine can be specified at the route level and at the query level with the hidden parameter $engine="duckdb|polars". If you want Polars, install the required extra.

At the moment, DuckDB seems to have the advantage and performs best. Also, features like full-text search are only available with DuckDB.
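
For illustration, here is a request that switches the engine per query. This is a sketch: the host, the credentials and the /api/v1/{tag}/{name} route for the fruits table from the config example below are assumptions, not guaranteed by this README:

import requests

resp = requests.get(
    "http://localhost:8080/api/v1/test/fruits",
    params={"fruits": "banana", "$engine": "polars"},  # hidden engine parameter
    auth=("YOURUSERNAME", "YOURPASSWORD"),  # default Basic Auth, see below
)
print(resp.json())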

A note on Deltalake Support

Deltalake is supported WITH column mapping, thanks to deltalake2db, in both Polars and DuckDB. Deletion vectors are not supported yet.

Default Security

By default, Basic Authentication is enabled. To add a user, simply run add_lakeapi_user YOURUSERNAME --yaml-file config.yml. This will add the user to your config YAML (argon2-hashed); the generated password is printed. If you do not want this logic, you can overwrite the username_retriver of the default config.
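
If you bring your own security, a minimal sketch of swapping in your own retriever could look like this. Note that the exact signature and return type of username_retriver are not documented here; treat both as assumptions and check the library source:

import dataclasses
import bmsdna.lakeapi

# Hypothetical user store; the expected return value of the
# retriever is an assumption for illustration only.
USERS = {"alice": {"name": "alice", "passwordhash": "<argon2 hash>"}}

def my_username_retriever(username: str):
    return USERS.get(username)

def_cfg = bmsdna.lakeapi.get_default_config()
cfg = dataclasses.replace(def_cfg, username_retriver=my_username_retriever)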

Standalone Mode

If you just want to run this thing, you can run it with a web server:

Uvicorn: uvicorn bmsdna.lakeapi.standalone:app --host 0.0.0.0 --port 8080

Gunicorn: gunicorn bmsdna.lakeapi.standalone:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:80

Of course, you need to adjust your HTTP options as needed. Also, you need to pip install uvicorn or gunicorn.

You can still use environment variables for configuration.

Environment Variables

  • CONFIG_PATH: The path of the config file, defaults to config.yml. If you want to split the config, you can specify a folder, too
  • DATA_PATH: The path of the data files, defaults to data. Paths in config.yml are relative to DATA_PATH
  • ENABLE_SQL_ENDPOINT: Set this to 1 to enable the SQL Endpoint
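
For example, to run the standalone server with a custom config and data location and the SQL endpoint enabled (same Uvicorn command as above; the paths are just examples):

CONFIG_PATH=./my_config.yml DATA_PATH=./lake ENABLE_SQL_ENDPOINT=1 uvicorn bmsdna.lakeapi.standalone:app --host 0.0.0.0 --port 8080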

Config File

By default, the application relies on a config file called config.yml being present at the root of your project.

The config file looks something like this (see also our test YAML):

tables:
  - name: fruits
    tag: test
    version: 1
    api_method:
      - get
      - post
    params:
      - name: cars
        operators:
          - "="
          - in
      - name: fruits
        operators:
          - "="
          - in
    datasource:
      uri: delta/fruits
      file_type: delta

  - name: fruits_partition
    tag: test
    version: 1
    api_method:
      - get
      - post
    params:
      - name: cars
        operators:
          - "="
          - in
      - name: fruits
        operators:
          - "="
          - in
      - name: pk
        combi:
          - fruits
          - cars
      - name: combi
        combi:
          - fruits
          - cars
    datasource:
      uri: delta/fruits_partition
      file_type: delta
      select:
        - name: A
        - name: fruits
        - name: B
        - name: cars

  - name: fake_delta
    tag: test
    version: 1
    allow_get_all_pages: true
    api_method:
      - get
      - post
    params:
      - name: name
        operators:
          - "="
      - name: name1
        operators:
          - "="
    datasource:
      uri: delta/fake
      file_type: delta

  # use duckdb database as file
  - name: fruits_duck
    tag: test
    version: 1
    api_method:
      - get
      - post
    datasource:
      uri: duckdb/fruits.db
      file_type: duckdb
      table_name: fruits
    params:
      - name: fruits
        operators:
          - "="
          - in
      - name: cars
        operators:
          - "="
          - "in"

  - name: fake_delta_partition
    tag: test
    version: 1
    allow_get_all_pages: true
    api_method:
      - get
      - post
    params:
      - name: name
        operators:
          - "="
      - name: name1
        operators:
          - "="
    datasource:
      uri: delta/fake
      file_type: delta

  - name: "*" # We're lazy and want to expose all in that folder. Name MUST be * and nothing else
    tag: startest
    version: 1
    api_method:
      - post
    datasource:
      uri: startest/* # Uri MUST end with /*
      file_type: delta

  - name: fruits # But we want to overwrite this one
    tag: startest
    version: 1
    api_method:
      - get
    datasource:
      uri: startest/fruits/${IN_URI_YOU_CAN_HAVE_ENVIRONMENT_VARIABLES}
      file_type: delta

  - name: mssql_department
    tag: mssql
    api_method: get
    engine: odbc # requires the odbc extra
    datasource:
      uri: DRIVER={ODBC Driver 17 for SQL Server};SERVER=127.0.0.1,1439;ENCRYPT=yes;TrustServerCertificate=Yes;UID=sa;PWD=${MY_SQL_PWD};Database=AdventureWorks
      table_name: "HumanResources.Department"
    params:
      - GroupName

Partitioning for awesome performance

To use partitions, you can either

  • Partition by a column that you filter on (the obvious case)
  • Partition on a special column called columnname_md5_prefix_2, which means you partition on the first two characters of the hexadecimal MD5 hash of the column. If you then filter by columnname, this greatly reduces the number of files to search. The number of characters used is up to you; we found two to be useful
  • Partition on a special column called columnname_md5_mod_NRPARTITIONS, where your partition value is str(int(hashlib.md5(COLUMNNAME).hexdigest(), 16) % NRPARTITIONS). This might look a bit complicated, but it's not that hard :) You're just doing a modulo on your MD5 hash, which allows you to set the exact number of partitions. Filtering is still done correctly on columnname (see the sketch after this list)
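
For illustration, here is a sketch of computing the md5_mod partition value when writing your data with the deltalake package. The column and path names are just examples, and 16 partitions is an arbitrary choice:

import hashlib
import pandas as pd
from deltalake import write_deltalake

def md5_mod(value: str, nr_partitions: int) -> str:
    # str(int(hashlib.md5(...).hexdigest(), 16) % NRPARTITIONS), as described above
    return str(int(hashlib.md5(value.encode("utf8")).hexdigest(), 16) % nr_partitions)

df = pd.DataFrame({"productid": ["1000234", "2000017"], "price": [9.99, 19.99]})
df["productid_md5_mod_16"] = df["productid"].map(lambda v: md5_mod(v, 16))
write_deltalake("data/delta/products", df, partition_by=["productid_md5_mod_16"])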

Why partition by MD5 hash? Imagine you have a product ID where most IDs start with a 1 and some newer ones start with a 2. Most of the data will be in the first partition. If you use an MD5 hash, the data will be spread evenly across the partitions.

With this hack you can get sub-second results on very large data. πŸš€πŸš€

You need to use deltalake to use partitions, and only string (str) partition columns are supported for now.

Z-ordering can also help a lot :).

Even more features

  • Built-in paging: you can use limit/offset to control what you get (see the example after this list)
  • Full-text search using DuckDB's full-text search feature
  • jsonify_complex parameter to convert structs/lists to JSON, in case the client cannot handle structs/lists
  • Metadata endpoints to retrieve data types, string lengths and more
  • Easily expose entire folders by using a "*" wildcard in both the name and the datasource.uri config; see the example in the config above
  • Config in a delta table as JSON, by using the lakeapi.config table property (example in the tests). This allows you to keep the LakeAPI config along with your data
  • Good test coverage
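
A quick sketch of the paging parameters in action (host, route and credentials are assumptions, as in the engine example above):

import requests

page = requests.get(
    "http://localhost:8080/api/v1/test/fruits",
    params={"limit": 100, "offset": 200, "jsonify_complex": True},  # documented paging/flattening parameters
    auth=("YOURUSERNAME", "YOURPASSWORD"),
)
print(page.json())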

ODBC Support / Sqlite Support

If you want to use ODBC, install the required extra (named odbc).
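
For example, using standard pip extras syntax: pip install "bmsdna-lakeapi[odbc]"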

ODBC tables cannot be queried through the SQL endpoint for now because of security implications. We might allow this later, if needed, via an explicit flag.

ODBC is tested against MS SQL Server using ODBC Driver 17 for SQL Server.

To use SQLite, you will need to install adbc_driver_sqlite yourself (pip install adbc_driver_sqlite).

Azure Support

Azure support is tested and works. S3/GCP could work in theory but are not tested. Supported configs are taken from the object_store crate and fsspec:

Supported settings are only those supported by BOTH projects. In practice, you can currently use:

  • account_name
  • account_key
  • use_emulator (we add that one for fsspec)
  • anon (fsspec/adlfs, for anonymous access). adlfs defaults this to true; we default to false. Set it to true if you really want anonymous access.

If you want to use the Azure Default Credential, just specify account_name without any further options.

Further projects

  • lakeapi2sql: allows you to read from LakeAPI and write to MS SQL Server

Work in progress

Please note that this is a work in progress; changes will be made and things may break, especially at this early stage. We recommend only using the exports from bmsdna.lakeapi for now and not the submodules.

Limitations

  • You can currently only serve local files. We might add support for fsspec later on. You can still mount your data if you use Linux, which is what we do in production

Contribution

Feel free to contribute, report bugs or request enhancements.

Contributors

aersam, dominikpeter
Issues

MD5 hash with integer is not working

2023-06-01T13:34:33.402792616Z ^^^^^^^^^^^^^^^^^^^^^^^^
2023-06-01T13:34:33.402797816Z File "/tmp/8db629d97f7a394/antenv/lib/python3.11/site-packages/bmsdna/lakeapi/core/dataframe.py", line 158, in get_partition_filter
2023-06-01T13:34:33.402801816Z hashvl = hashlib.md5(value.encode("utf8")).hexdigest()
2023-06-01T13:34:33.402810116Z ^^^^^^^^^^^^
2023-06-01T13:34:33.402814016Z AttributeError: 'int' object has no attribute 'encode'

The value can be an integer.

hashvl = hashlib.md5(str(value).encode("utf8")).hexdigest()

This should solve the problem.

Better code structure

  • SQL execution is split between dataframe and endpoint; the split is sometimes a bit weird
  • Do we really need groupby/joins/sortby? If yes: test and document

Add additional context for Databases (ODBC)

Directly reading delta files has its limitations. I tried it for CIP and the response time is not good enough for a web application.

I think we should store the data in a traditional database with indexes. However, we could still use LakeAPI to serve the data by extending the context class.

Test different input file format

We should also test CSV, Excel and JSON inputs. Otherwise, we should not claim in the docs to be able to load different file formats :-).

Considering in memory

For optimal performance, it would still sometimes make sense to keep small data in memory.

We can discuss that. If we implement it right, we should not have issues.

Drop Datafusion

Datafusion does not support JSON reading, nor is there a way to register a pyarrow.Table instance. Therefore, we cannot enable JSON tests for Datafusion.

Also, the docs are ... not as good as those of Polars and DuckDB.

Option to always use duckdb storage backend

Use the storage backend not only for search but always, if the option is enabled. Maybe also use DuckDB's primary key option for indexing. Could help further with performance.

The primary key should also not be a problem if the DuckDB storage is not stable, as we would use it on the fly.

Enforce schema in Post request

At the moment you can input arbitrary parameters, and we just return the unfiltered dataset if the parameter does not exist in the schema. We should return an error here.

Implicit add parameters for partition columns

Filters on partitions are always fast; therefore, we should enable them by default. You could still hide those by giving them an empty operators array:

params:
  - name: partition_col
    operators: []

Combi fields do not take data type into account

This works:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": "0",
      "priceperunit": "0",
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}

This does not work:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": 0,
      "priceperunit": 0,
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}

But the output and schema are correct.

Release Version 1.0.0

What do we need to have for version 1.0.0?

  • Arrow format and get rid of IPC
  • Test DuckDB storage backend

Drop support for Avro

We do not use it, nor test it. Only Polars provides support for it; DuckDB does not.

Change versioning path

Per default, API Management wants to add the versioning at the end. Therefore, we should change the pattern from api/v1 to v1/api.

But this would mean a breaking change, and we have to be careful and also change all clients relying on the existing endpoints.

Prefix trick

Do we have the option to define a prefix partition without converting it to MD5? For example, if the key is already an MD5 hash, it doesn't really make sense to convert it again to another MD5 hash.

Ditch support for Polars

The latest release has a "bug", and there is also a need for the Polars extension because the Polars delta reader also has a bug.

Does it make sense to keep support for Polars? It seems to add unnecessary complexity for no real benefit (speed is on par with DuckDB).

Of course, we currently use Polars to serialise data into the various formats. We may have to look for solutions there.

Install dependencies takes ages

@aersam I fixed the blocking bug but somehow messed up the dependencies. They now take ages to install. Everything works, though.

Please have a look if you have some time.

It can block

I was able to block the API. We might have to revert and still not use copy into.
