
asavinov / prosto


Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

Home Page: https://linkedin.com/in/alexandrsavinov/

License: MIT License

Python 100.00%
business-intelligence data-preparation data-preprocessing data-processing data-science data-wrangling feature-engineering map-reduce olap pandas python spark workflow


prosto's People

Contributors

asavinov


prosto's Issues

Initial value for aggregations

Problem: currently it is not possible to set an initial value for aggregations; a default value is always used. For example, if we use sum for aggregation, the default initial value 0.0 is meaningful. But if we use product, we would want to specify 1.0 as the initial value.

Implement the ability to specify a custom initial value for aggregation operations, both rolling and normal aggregation.

Notes:

  • Consider also its relation to fillna: some value is needed when a group is empty.
  • It should work in both the API and Column-SQL
  • Create or modify unit tests for initial values
  • Update the notebooks and documentation to demonstrate the use of initial values
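As a rough sketch of the requested behavior (in plain Python/pandas, not prosto's actual API), an aggregation could accept a caller-supplied initial value instead of a hard-coded default; the initial value is also the natural result for an empty group, which ties this feature to the fillna question above:

```python
from functools import reduce

import pandas as pd

# Hypothetical illustration of the requested feature: a group aggregation
# that accepts a caller-supplied initial value instead of a fixed 0.0.
def aggregate(df, group_col, value_col, func, initial):
    """Fold `func` over each group's values, starting from `initial`.

    `initial` would also be the result for an empty group, which is
    where this feature meets the fillna question.
    """
    return {
        key: reduce(func, group[value_col], initial)
        for key, group in df.groupby(group_col)
    }

facts = pd.DataFrame({"g": ["a", "a", "b"], "x": [2.0, 3.0, 5.0]})

sums = aggregate(facts, "g", "x", lambda acc, v: acc + v, initial=0.0)
products = aggregate(facts, "g", "x", lambda acc, v: acc * v, initial=1.0)
# sums     -> {"a": 5.0, "b": 5.0}
# products -> {"a": 6.0, "b": 5.0}
```

The same `initial` parameter could then be exposed in both the API and Column-SQL, as required above.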

Validate the results of parsing in the context of each operation

The parser returns a syntactic structure. The task is to validate this structure.

The main alternatives are:

  • Validation without name resolution and bindings. This is relatively simple because we only check the presence of elements against the expected (template) structure for the operation
  • Validation with name resolution and bindings. Here we also check the query structure against the existing data schema and its names (existing columns, tables etc.). Such validation could be part of the translator or another, more complex function
  • Validation within the topology translator, where new operations can be added and columns or column paths can be augmented (say, using inheritance)

A minimum version could perform very simple checks and be executed just after the parser in order to detect simple errors. Later on, it could be integrated into the topology translator.
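The first (template-based) alternative can be sketched as follows. The operation names and element keys here are illustrative placeholders, not prosto's real vocabulary:

```python
# Minimal sketch of template-based validation without name resolution:
# each operation kind declares which syntactic elements it requires,
# and a parsed statement is checked only for their presence.
TEMPLATES = {
    "CALCULATE": {"table", "column", "function"},
    "LINK": {"table", "column", "target_table"},
    "AGGREGATE": {"table", "column", "function", "link"},
}

def validate(parsed):
    """Return a list of error messages; an empty list means the structure is valid."""
    op = parsed.get("operation")
    template = TEMPLATES.get(op)
    if template is None:
        return [f"unknown operation: {op!r}"]
    missing = template - parsed.keys()
    return [f"{op}: missing element {name!r}" for name in sorted(missing)]

errors = validate({"operation": "CALCULATE", "table": "Sales", "column": "total"})
# -> ["CALCULATE: missing element 'function'"]
```

Validation against the data schema (the second alternative) would then run the same check first and additionally resolve each name against existing tables and columns.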

Grammar and full-featured parser for Column-SQL

Currently Column-SQL is parsed by a simple function. Although it works, it has limitations, for example with respect to literals, enforcing more flexible syntactic rules etc. The goal of this item is to define a formal grammar for Column-SQL and implement a full-featured parser. Initially it should work as an alternative to the existing (simple) parser, but later it should replace it. Currently it is assumed that ANTLR will be used.

Define an ANTLR grammar for Column-SQL with corresponding unit tests where appropriate. The following syntax elements should be covered:

  • Column names with spaces and other special characters, using name delimiters like brackets: [My Column Name]
  • Short names (prefixes) for operations, like CALC instead of CALCULATE or FUNC instead of FUNCTION
  • Possibly alternative keywords like ARGS or MODEL
  • Case-insensitive keywords
  • If possible, arbitrary source code for functions and JSON for arguments, that is, simply not parsing whatever follows certain keywords. One approach might be to introduce special tags to distinguish between function definition categories
  • If possible and relevant, try to determine the role of names, for example whether a name is supposed to be a table or a column (particularly depending on the operation)
  • Alternative: either parse column paths, or treat them as complex names to be parsed by the topology translator if necessary (the topology translator can resolve inherited columns and add new operations depending on the context)
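A few of these lexical rules (bracket-delimited names, case-insensitive keywords, short keyword prefixes) can be prototyped with a tiny hand-rolled tokenizer before committing to an ANTLR grammar. The keyword table below is a hypothetical fragment, not the full Column-SQL vocabulary:

```python
import re

# Illustrative tokenizer: bracket-delimited names with spaces,
# case-insensitive keywords, and short prefixes (CALC for CALCULATE,
# FUNC for FUNCTION). The keyword set is a made-up sample.
KEYWORDS = {"CALC": "CALCULATE", "CALCULATE": "CALCULATE",
            "FUNC": "FUNCTION", "FUNCTION": "FUNCTION",
            "TABLE": "TABLE"}

TOKEN_RE = re.compile(r"\[(?P<bracketed>[^\]]+)\]|(?P<word>\w+)")

def tokenize(text):
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.group("bracketed") is not None:
            # Bracketed names keep spaces and special characters verbatim
            tokens.append(("NAME", m.group("bracketed")))
        else:
            word = m.group("word")
            canonical = KEYWORDS.get(word.upper())  # case-insensitive lookup
            tokens.append(("KEYWORD", canonical) if canonical else ("NAME", word))
    return tokens

tokens = tokenize("calc table [My Column Name] func myfunc")
# -> [("KEYWORD", "CALCULATE"), ("KEYWORD", "TABLE"),
#     ("NAME", "My Column Name"), ("KEYWORD", "FUNCTION"), ("NAME", "myfunc")]
```

In an ANTLR grammar the same rules would become lexer rules (a bracketed-identifier token and case-insensitive keyword fragments), while "do not parse after this keyword" would map to lexer modes.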

Rework the project operation by removing the need for a separate link definition and uniting project with link

Problem: currently, to define a projection it is necessary to also define a link operation; that is, in order to project a table we need to create two definitions: one project operation and one link operation. We do it for several reasons:

  • [Reuse] The project definition specifies only a link column definition, whose role is to store the parameters necessary for the projection. Instead of having everything in the projection operation, we store part of it in the column operation.
  • [Explicit column declaration] The project operation is treated as a table operation which produces only a new table and not a column. In reality it also produces a new column, and this column has no separate definition. We solved this by requiring an explicitly defined link column even though it is never executed by itself: the column is produced by the project operation. Note that the link column in the context of the project operation can never be executed on its own, because the target table does not exist when it is defined.

In this task we want to eliminate this split of responsibilities between the project and link operations. The solution is to treat project as an operation which produces two elements: one table and one column (so it is no longer a purely table operation).

The project operation should be organized differently:

  • We need to ensure that the missing link definition in the project operation does not break other operations which rely on link definitions. For example, all such operations need to get the necessary data about link columns from the data schema (catalogue of columns, tables etc.) and not from the operation definitions (which define code, that is, how data is generated)
  • It should work without the need to define a link operation (the link column will be created from the parameters of the project operation)
  • Its implementation should rely on the same code which generates a link column
  • The only difference from the link operation is that link assumes the target table already exists, so it skips one step compared to project
  • In future, we could unite these two operations: if the target table exists, it is a link; otherwise it is a project operation which creates the missing target table
  • Optionally, consider whether a missing link column name in the project operation should mean that the link is not needed. This is probably not meaningful because the whole point is to have links; if they are not needed, that is a strange situation and a use case would be required. The only benefit would be not spending time computing unnecessary columns.

The main result and benefit is a clean separation between the two operation definitions, because currently it is conceptually difficult and somewhat controversial to define projections. It will also simplify the implementation.
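Conceptually, the combined operation can be sketched in plain pandas (an illustration of the idea, not prosto's implementation): project builds the target table and then reuses the same join a standalone link operation would perform, so it yields one table and one link column in a single step:

```python
import pandas as pd

# Sketch of project as a table-column operation: it returns both the
# new target table and the link column mapping facts to it.
def project(facts, keys):
    """Return (target_table, link_column)."""
    # The target table holds the distinct key combinations
    target = facts[keys].drop_duplicates().reset_index(drop=True)
    # Linking is the same merge a standalone link operation would do;
    # project simply runs it against the table it has just created
    link = facts.merge(target.reset_index(), on=keys, how="left")["index"]
    return target, link

facts = pd.DataFrame({"city": ["Berlin", "Berlin", "Paris"], "amount": [1, 2, 3]})
groups, link = project(facts, ["city"])
# groups has one row per distinct city; link is [0, 0, 1]
```

This also illustrates the future unification mentioned above: if `target` already existed, the function would skip its first step and reduce to a pure link operation.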

Generalize aggregation on primitive (non-link) grouping columns

Problem: the aggregation operation works only with link columns for grouping. It is a column operation which adds a new aggregate column to the group table. The group table must exist, and a link from a fact table (with the data to be aggregated) must also exist. However, it cannot be applied to the traditional groupby use case, where we take a fact table and specify one of its columns as the grouping criterion. The problem is that the grouping table does not exist, and hence we cannot define a new column for it.

In this task, we want to make the aggregate operation work when the table to which the new aggregate column has to be added does not exist:

  • Since the table for the new aggregate column does not exist, it has to be created; hence this becomes a table-column operation which produces three new elements: one table, one link column and one aggregate column
  • The grouping criterion can be an attribute of the fact table with the source data to be aggregated (not a link)
  • We distinguish two cases:
    • The currently implemented use case, where the grouping criterion is an already existing link column
    • The use case to be implemented, where the grouping criterion is a list of attributes (or columns?) without an existing target table
  • We actually need to combine two sets of definition parameters:
    • define a projection (source attributes, link name, group table name)
    • define an aggregation (measure columns, link name, group table name, new aggregate column name)

It seems difficult and unnatural to combine the two operations, projection and aggregation. Therefore the groupby use case should probably indeed be implemented as two chained operations, project and aggregate, in which case this task does not need to be implemented.
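The two-operation decomposition suggested above can be sketched in plain pandas (illustrative only, not prosto's API): project builds the group table and a link to it from the grouping attribute, then aggregate folds the measure over that link:

```python
import pandas as pd

facts = pd.DataFrame({"city": ["Berlin", "Berlin", "Paris"],
                      "amount": [10.0, 20.0, 5.0]})

# Step 1, project: create the group table from the grouping attribute
# and a link column from facts to it
groups = facts[["city"]].drop_duplicates().reset_index(drop=True)
facts["group_id"] = facts.merge(groups.reset_index(), on="city", how="left")["index"]

# Step 2, aggregate: a new column in the group table, computed by
# folding the measure over the link
groups["total"] = facts.groupby("group_id")["amount"].sum()
# groups: Berlin -> 30.0, Paris -> 5.0
```

Expressed this way, classic groupby is exactly project followed by aggregate, which supports the conclusion that no combined operation is needed.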

Print graph of operations and data schema

Implement helper functions:

  • Print the current data schema, that is, the list of tables, the list of columns and their connections; for example, named print_schema
  • Print the graph of operations with dependencies. Here we could print the original operations as defined by the user, or the whole graph of operations as produced by the topology translator; for example, named print_operations
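A hypothetical sketch of the two helpers is shown below. The schema and operation representations (plain dicts) are invented for illustration; prosto's internal model will differ:

```python
# Sketch of print_schema / print_operations over an invented dict model.
def format_schema(tables):
    """Render each table with its columns; link columns show their target table."""
    lines = []
    for table, columns in tables.items():
        lines.append(f"TABLE {table}")
        for name, target in columns.items():
            lines.append(f"  {name}" + (f" -> {target}" if target else ""))
    return "\n".join(lines)

def format_operations(operations):
    """Render the operation graph: each operation with its dependencies."""
    return "\n".join(
        f"{name} <- {', '.join(deps) if deps else '(source)'}"
        for name, deps in operations.items()
    )

def print_schema(tables):
    print(format_schema(tables))

def print_operations(operations):
    print(format_operations(operations))

schema = {"Sales": {"amount": None, "city_link": "Cities"},
          "Cities": {"name": None}}
print_schema(schema)
# TABLE Sales
#   amount
#   city_link -> Cities
# TABLE Cities
#   name
```

For the operation graph, the same helper could be called once on the user-defined operations and once on the translated topology, covering both variants mentioned above.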
