
lakeapi's Issues

It can block

I was able to block the API. We might have to revert and not use COPY INTO after all.

Option to always use DuckDB storage backend

Use the DuckDB storage backend not only for search but always when the option is enabled. Maybe also use DuckDB's primary key option for indexing; that could further improve performance.

Using DuckDB's primary key should also not be a problem even if the DuckDB storage format is not stable, since we would only use it on the fly.
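
A minimal sketch of how the primary key option could be used, assuming an in-memory DuckDB instance standing in for the storage backend (table and column names are made up):

    import duckdb

    # In-memory DuckDB instance standing in for the storage backend
    con = duckdb.connect(":memory:")

    # PRIMARY KEY gives DuckDB an index it can use for point lookups
    con.execute("CREATE TABLE cached_tbl (pk VARCHAR PRIMARY KEY, payload VARCHAR)")
    con.execute("INSERT INTO cached_tbl VALUES ('a', 'x'), ('b', 'y')")

    # Lookup by primary key
    print(con.execute("SELECT payload FROM cached_tbl WHERE pk = 'a'").fetchone())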

Drop support for Avro

We neither use nor test it. Only Polars supports it; DuckDB does not.

Considering in memory

For optimal performance, it would still sometimes make sense to keep small data in memory.

We can discuss that. If we implement it right, we should not have issues.
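
A minimal sketch of such a cache, assuming Parquet files and a made-up size threshold:

    import pyarrow.parquet as pq

    SMALL_TABLE_MAX_BYTES = 50 * 1024 * 1024  # made-up threshold: 50 MiB
    _cache: dict = {}

    def load_table(path: str):
        # Serve small tables from memory; read large ones from disk each time
        if path in _cache:
            return _cache[path]
        table = pq.read_table(path)
        if table.nbytes <= SMALL_TABLE_MAX_BYTES:
            _cache[path] = table
        return table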

Release Version 1.0.0

What do we need to have for version 1.0.0?

  • Arrow format and get rid of IPC
  • Test DuckDB storage backend

Test different input file formats

We should also test CSV, Excel, and JSON inputs. Otherwise, we should not claim in the docs to be able to load different file formats :-).
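
For reference, Polars can read all three formats, so the tests could look roughly like this (file paths are placeholders; read_excel needs an extra engine dependency such as xlsx2csv):

    import polars as pl

    # One reader per input format; each returns a DataFrame for the API tests
    df_csv = pl.read_csv("tests/data/sample.csv")
    df_json = pl.read_json("tests/data/sample.json")
    df_xlsx = pl.read_excel("tests/data/sample.xlsx")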

MD5 hash with integer is not working

2023-06-01T13:34:33.402792616Z ^^^^^^^^^^^^^^^^^^^^^^^^
2023-06-01T13:34:33.402797816Z File "/tmp/8db629d97f7a394/antenv/lib/python3.11/site-packages/bmsdna/lakeapi/core/dataframe.py", line 158, in get_partition_filter
2023-06-01T13:34:33.402801816Z hashvl = hashlib.md5(value.encode("utf8")).hexdigest()
2023-06-01T13:34:33.402810116Z ^^^^^^^^^^^^
2023-06-01T13:34:33.402814016Z AttributeError: 'int' object has no attribute 'encode'

The value can be an integer.

hashvl = hashlib.md5(str(value).encode("utf8")).hexdigest()

This should solve the problem.
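
A minimal reproduction of the fix, showing that coercing to str first makes integer and string partition values hash identically:

    import hashlib

    def partition_hash(value) -> str:
        # str() first, so integer partition values no longer raise AttributeError
        return hashlib.md5(str(value).encode("utf8")).hexdigest()

    assert partition_hash(8021) == partition_hash("8021")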

Add additional context for Databases (ODBC)

Directly reading Delta files has its limitations. I tried it for CIP, and the response time is not good enough for a web application.

I think we should store the data in a traditional database with indexes. However, we could still use lakeapi to serve the data by extending the context class.
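
A rough sketch of what such an extension could look like, using pyodbc (the class name, connection string, and query are made up; the real context class interface may differ):

    import pyodbc

    class OdbcContext:
        """Hypothetical context that serves data from an indexed database."""

        def __init__(self, connection_string: str):
            self.con = pyodbc.connect(connection_string)

        def query(self, sql: str, params: tuple = ()):
            # Let the database use its indexes instead of scanning Delta files
            cursor = self.con.cursor()
            cursor.execute(sql, params)
            return cursor.fetchall()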

Better code structure

  • SQL execution is split between dataframe and endpoint; sometimes the split is a bit weird
  • Do we really need groupby/joins/sortby? If yes: test and document them

Drop DataFusion

DataFusion supports neither JSON reading nor registering a pyarrow.Table instance. Therefore, we cannot enable JSON tests for DataFusion.

Also, the docs are ... not as good as those of Polars and DuckDB.

Install dependencies takes ages

@aersam I fixed the blocking bug but somehow messed up the dependencies. They now take ages to install. Everything works, though.

Please have a look if you have some time.

Prefix trick

Do we have the option to define a prefix partition without converting it to MD5? For example, if the key is already an MD5 hash, it doesn't really make sense to hash it again.
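
One way this could work: skip hashing when the configured value already looks like an MD5 digest (a sketch; the assume_hashed option is made up):

    import hashlib
    import re

    MD5_RE = re.compile(r"^[0-9a-f]{32}$")

    def partition_prefix(value: str, assume_hashed: bool = False) -> str:
        # If the key is already an MD5 hex digest, skip rehashing it
        if assume_hashed and MD5_RE.fullmatch(value):
            return value
        return hashlib.md5(value.encode("utf8")).hexdigest()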

Change versioning path

By default, API Management wants to add the version at the end. Therefore, we should change the pattern from api/v1 to v1/api.

But this would be a breaking change, so we have to be careful and also update all clients relying on the existing endpoints.
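
If the routes are plain FastAPI routers, the change would mostly be a matter of the router prefix (a sketch; the route is illustrative):

    from fastapi import APIRouter, FastAPI

    app = FastAPI()
    router = APIRouter(prefix="/v1/api")  # was "/api/v1"

    @router.get("/tables")
    def list_tables():
        return []

    app.include_router(router)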

Enforce schema in Post request

At the moment you can pass arbitrary parameters, and we just return the unfiltered dataset if a parameter does not exist in the schema. We should return an error instead.
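
A sketch of the validation we could add, assuming we know the set of configured parameter names per endpoint:

    from fastapi import HTTPException

    def validate_params(given: dict, allowed: set) -> None:
        # Reject the request instead of silently returning the unfiltered dataset
        unknown = set(given) - allowed
        if unknown:
            raise HTTPException(
                status_code=400,
                detail="Unknown parameter(s): " + ", ".join(sorted(unknown)),
            )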

Ditch support for Polars

The latest release has a "bug", and we also need the Polars extension because the Polars Delta reader has a bug of its own.

Does it make sense to keep support for Polars? It seems to add unnecessary complexity for no real benefit (speed is on par with DuckDB).

Of course, we currently use Polars to serialise data into the various formats. We may have to look for solutions there.

Combi fields do not take data type into account

This works:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": "0",
      "priceperunit": "0",
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}

This does not work:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": 0,
      "priceperunit": 0,
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}

But the output and the schema are correct.
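
A sketch of a possible fix: coerce each combi-field value to the column's declared type before building the filter (the schema lookup is hypothetical):

    def coerce_pk_values(pk: dict, schema_types: dict) -> dict:
        # Cast each combi-field value so that 0 and "0" yield the same filter
        return {col: schema_types[col](val) for col, val in pk.items()}

    schema_types = {"supplieramount": str, "priceperunit": str}
    assert coerce_pk_values({"supplieramount": 0}, schema_types) == {"supplieramount": "0"}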

Implicitly add parameters for partition columns

Filters on partition columns are always fast, so we should enable them by default. You could still hide them by giving them an empty operators array:

params:
  - name: partition_col
    operators: []
