
lakeapi's Issues

It can block

I was able to block the API. We might have to revert and not use COPY INTO after all.

Option to always use DuckDB storage backend

Use the DuckDB storage backend not only for search but always when the option is enabled. Maybe also use DuckDB's primary key option for indexing; that could further improve performance.

Using DuckDB's primary key should also not be a problem even if the DuckDB storage format is not stable, since we would only use it on the fly.
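
A minimal sketch of how the primary key option could be used, assuming an in-memory DuckDB instance standing in for the storage backend (table and column names are made up):

    import duckdb

    # In-memory DuckDB instance standing in for the storage backend
    con = duckdb.connect(":memory:")

    # PRIMARY KEY gives DuckDB an index it can use for point lookups
    con.execute("CREATE TABLE cached_tbl (pk VARCHAR PRIMARY KEY, payload VARCHAR)")
    con.execute("INSERT INTO cached_tbl VALUES ('a', 'x'), ('b', 'y')")

    # Lookup by primary key
    print(con.execute("SELECT payload FROM cached_tbl WHERE pk = 'a'").fetchone())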

Drop support for Avro

We neither use nor test it. Only Polars supports it; DuckDB does not.

Considering in memory

For optimal performance, it would still sometimes make sense to keep small data in memory.

We can discuss that. If we implement it right, we should not have issues.
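
A minimal sketch of such a cache, assuming Parquet files and a made-up size threshold:

    import pyarrow.parquet as pq

    SMALL_TABLE_MAX_BYTES = 50 * 1024 * 1024  # made-up threshold: 50 MiB
    _cache: dict = {}

    def load_table(path: str):
        # Serve small tables from memory; read large ones from disk each time
        if path in _cache:
            return _cache[path]
        table = pq.read_table(path)
        if table.nbytes <= SMALL_TABLE_MAX_BYTES:
            _cache[path] = table
        return table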

Release Version 1.0.0

What do we need to have for version 1.0.0?

  • Arrow format and get rid of IPC
  • Test DuckDB storage backend

Test different input file formats

We should also test CSV, Excel, and JSON inputs. Otherwise, we should not claim in the docs to be able to load different file formats :-).
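
For reference, Polars can read all three formats, so the tests could look roughly like this (file paths are placeholders; read_excel needs an extra engine dependency such as xlsx2csv):

    import polars as pl

    # One reader per input format; each returns a DataFrame for the API tests
    df_csv = pl.read_csv("tests/data/sample.csv")
    df_json = pl.read_json("tests/data/sample.json")
    df_xlsx = pl.read_excel("tests/data/sample.xlsx")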

MD5 hash with integer is not working

2023-06-01T13:34:33.402792616Z ^^^^^^^^^^^^^^^^^^^^^^^^
2023-06-01T13:34:33.402797816Z File "/tmp/8db629d97f7a394/antenv/lib/python3.11/site-packages/bmsdna/lakeapi/core/dataframe.py", line 158, in get_partition_filter
2023-06-01T13:34:33.402801816Z hashvl = hashlib.md5(value.encode("utf8")).hexdigest()
2023-06-01T13:34:33.402810116Z ^^^^^^^^^^^^
2023-06-01T13:34:33.402814016Z AttributeError: 'int' object has no attribute 'encode'

The value can be an integer.

hashvl = hashlib.md5(str(value).encode("utf8")).hexdigest()

This should solve the problem.
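
A minimal reproduction of the fix, showing that coercing to str first makes integer and string partition values hash identically:

    import hashlib

    def partition_hash(value) -> str:
        # str() first, so integer partition values no longer raise AttributeError
        return hashlib.md5(str(value).encode("utf8")).hexdigest()

    assert partition_hash(8021) == partition_hash("8021")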

Add additional context for Databases (ODBC)

Directly reading Delta files has its limitations. I tried it for CIP, and the response time is not good enough for a web application.

I think we should store the data in a traditional database with indexes. However, we could still use lakeapi to serve the data by extending the context class.
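
A rough sketch of what such an extension could look like, using pyodbc (the class name, connection string, and query are made up; the real context class interface may differ):

    import pyodbc

    class OdbcContext:
        """Hypothetical context that serves data from an indexed database."""

        def __init__(self, connection_string: str):
            self.con = pyodbc.connect(connection_string)

        def query(self, sql: str, params: tuple = ()):
            # Let the database use its indexes instead of scanning Delta files
            cursor = self.con.cursor()
            cursor.execute(sql, params)
            return cursor.fetchall()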

Better code structure

  • SQL execution is split between dataframe and endpoint; sometimes the split is a bit weird
  • Do we really need groupby/joins/sortby? If yes: test and document them

Drop DataFusion

DataFusion supports neither JSON reading nor registering a pyarrow.Table instance. Therefore, we cannot enable JSON tests for DataFusion.

Also, the docs are ... not as good as those of Polars and DuckDB.

Install dependencies takes ages

@aersam I fixed the blocking bug but somehow messed up the dependencies. They now take ages to install. Everything works, though.

Please have a look if you have some time.

Prefix trick

Do we have the option to define a prefix partition without converting it to MD5? For example, if the key is already an MD5 hash, it doesn't really make sense to hash it again.
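
One way this could work: skip hashing when the configured value already looks like an MD5 digest (a sketch; the assume_hashed option is made up):

    import hashlib
    import re

    MD5_RE = re.compile(r"^[0-9a-f]{32}$")

    def partition_prefix(value: str, assume_hashed: bool = False) -> str:
        # If the key is already an MD5 hex digest, skip rehashing it
        if assume_hashed and MD5_RE.fullmatch(value):
            return value
        return hashlib.md5(value.encode("utf8")).hexdigest()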

Change versioning path

By default, API Management wants to add the version at the end. Therefore, we should change the pattern from api/v1 to v1/api.

But this would be a breaking change, so we have to be careful and also update all clients relying on the existing endpoints.
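
If the routes are plain FastAPI routers, the change would mostly be a matter of the router prefix (a sketch; the route is illustrative):

    from fastapi import APIRouter, FastAPI

    app = FastAPI()
    router = APIRouter(prefix="/v1/api")  # was "/api/v1"

    @router.get("/tables")
    def list_tables():
        return []

    app.include_router(router)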

Enforce schema in Post request

At the moment you can pass arbitrary parameters, and we just return the unfiltered dataset if a parameter does not exist in the schema. We should return an error instead.
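
A sketch of the validation we could add, assuming we know the set of configured parameter names per endpoint:

    from fastapi import HTTPException

    def validate_params(given: dict, allowed: set) -> None:
        # Reject the request instead of silently returning the unfiltered dataset
        unknown = set(given) - allowed
        if unknown:
            raise HTTPException(
                status_code=400,
                detail="Unknown parameter(s): " + ", ".join(sorted(unknown)),
            )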

Ditch support for Polars

The latest release has a "bug", and we also need the Polars extension because the Polars Delta reader has a bug of its own.

Does it make sense to keep support for Polars? It seems to add unnecessary complexity for no real benefit (speed is on par with DuckDB).

Of course, we currently use Polars to serialise data into the various formats. We may have to look for solutions there.

Combi fields do not take data type into account

This works:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": "0",
      "priceperunit": "0",
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}

This does not work:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": 0,
      "priceperunit": 0,
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}

But the output and the schema are correct.
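
A sketch of a possible fix: coerce each combi-field value to the column's declared type before building the filter (the schema lookup is hypothetical):

    def coerce_pk_values(pk: dict, schema_types: dict) -> dict:
        # Cast each combi-field value so that 0 and "0" yield the same filter
        return {col: schema_types[col](val) for col, val in pk.items()}

    schema_types = {"supplieramount": str, "priceperunit": str}
    assert coerce_pk_values({"supplieramount": 0}, schema_types) == {"supplieramount": "0"}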

Implicitly add parameters for partition columns

Filters on partition columns are always fast, so we should enable them by default. You could still hide them by giving them an empty operators array:

params:
  - name: partition_col
    operators: []
