
asavinov / prosto


Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

Home Page: https://linkedin.com/in/alexandrsavinov/

License: MIT License

Python 100.00%
business-intelligence data-preparation data-preprocessing data-processing data-science data-wrangling feature-engineering map-reduce olap pandas python spark workflow


prosto's People

Contributors

asavinov


prosto's Issues

Initial value for aggregations

Problem: currently it is not possible to set an initial value for aggregations; a default value is always used. For example, if we use sum for aggregation, the default initial value 0.0 is meaningful. But if we use product, we would want to specify 1.0 as the initial value.

Implement the ability to specify a custom initial value for aggregation operations, both rolling and normal aggregation.

Notes:

  • Consider also its relation to fillna: some value is needed when a group is empty.
  • It should work in both the API and Column-SQL
  • Create or modify unit tests for initial values
  • Update the notebooks and documentation to demonstrate the use of initial values
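As a rough sketch of the requested behavior (in plain Python/pandas, not prosto's actual API), an aggregation could accept a caller-supplied initial value instead of a hard-coded default; the initial value is also the natural result for an empty group, which ties this feature to the fillna question above:

```python
from functools import reduce

import pandas as pd

# Hypothetical illustration of the requested feature: a group aggregation
# that accepts a caller-supplied initial value instead of a fixed 0.0.
def aggregate(df, group_col, value_col, func, initial):
    """Fold `func` over each group's values, starting from `initial`.

    `initial` would also be the result for an empty group, which is
    where this feature meets the fillna question.
    """
    return {
        key: reduce(func, group[value_col], initial)
        for key, group in df.groupby(group_col)
    }

facts = pd.DataFrame({"g": ["a", "a", "b"], "x": [2.0, 3.0, 5.0]})

sums = aggregate(facts, "g", "x", lambda acc, v: acc + v, initial=0.0)
products = aggregate(facts, "g", "x", lambda acc, v: acc * v, initial=1.0)
# sums     -> {"a": 5.0, "b": 5.0}
# products -> {"a": 6.0, "b": 5.0}
```

The same `initial` parameter could then be exposed in both the API and Column-SQL, as required above.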

Validate the results of parsing in the context of each operation

The parser returns a syntactic structure. The task is to validate this structure.

The main alternatives are:

  • Validation without name resolution and bindings. This is relatively simple because we only check the presence of elements against the expected (template) structure for the operation
  • Validation with name resolution and bindings. Here we also check the query structure against the existing data schema and its names (existing columns, tables etc.). Such validation could be part of the translator or another, more complex function
  • Validation within the topology translator, where new operations can be added and columns or column paths can be augmented (say, using inheritance)

A minimum version could perform very simple checks and be executed just after the parser in order to detect simple errors. Later on, it could be integrated into the topology translator.
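The first (template-based) alternative can be sketched as follows. The operation names and element keys here are illustrative placeholders, not prosto's real vocabulary:

```python
# Minimal sketch of template-based validation without name resolution:
# each operation kind declares which syntactic elements it requires,
# and a parsed statement is checked only for their presence.
TEMPLATES = {
    "CALCULATE": {"table", "column", "function"},
    "LINK": {"table", "column", "target_table"},
    "AGGREGATE": {"table", "column", "function", "link"},
}

def validate(parsed):
    """Return a list of error messages; an empty list means the structure is valid."""
    op = parsed.get("operation")
    template = TEMPLATES.get(op)
    if template is None:
        return [f"unknown operation: {op!r}"]
    missing = template - parsed.keys()
    return [f"{op}: missing element {name!r}" for name in sorted(missing)]

errors = validate({"operation": "CALCULATE", "table": "Sales", "column": "total"})
# -> ["CALCULATE: missing element 'function'"]
```

Validation against the data schema (the second alternative) would then run the same check first and additionally resolve each name against existing tables and columns.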

Grammar and full-featured parser for Column-SQL

Currently Column-SQL is parsed by a simple function. Although it works, it has limitations, for example with respect to literals, enforcing more flexible syntactic rules etc. The goal of this item is to define a formal grammar for Column-SQL and implement a full-featured parser. Initially it should work as an alternative to the existing (simple) parser, but later it should replace it. Currently it is assumed that ANTLR will be used.

Define an ANTLR grammar for Column-SQL with corresponding unit tests where appropriate. The following syntax elements should be covered:

  • Column names with spaces and other special characters, using name delimiters like brackets: [My Column Name]
  • Short names (prefixes) for operations, like CALC instead of CALCULATE or FUNC instead of FUNCTION
  • Possibly alternative keywords like ARGS or MODEL
  • Case-insensitive keywords
  • If possible, arbitrary source code for functions and JSON for arguments, that is, simply not parsing whatever follows certain keywords. One approach might be to introduce special tags to distinguish between function definition categories
  • If possible and relevant, try to determine the role of names, for example whether a name is supposed to be a table or a column (particularly depending on the operation)
  • Alternative: either parse column paths, or treat them as complex names to be parsed by the topology translator if necessary (the topology translator can resolve inherited columns and add new operations depending on the context)
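A few of these lexical rules (bracket-delimited names, case-insensitive keywords, short keyword prefixes) can be prototyped with a tiny hand-rolled tokenizer before committing to an ANTLR grammar. The keyword table below is a hypothetical fragment, not the full Column-SQL vocabulary:

```python
import re

# Illustrative tokenizer: bracket-delimited names with spaces,
# case-insensitive keywords, and short prefixes (CALC for CALCULATE,
# FUNC for FUNCTION). The keyword set is a made-up sample.
KEYWORDS = {"CALC": "CALCULATE", "CALCULATE": "CALCULATE",
            "FUNC": "FUNCTION", "FUNCTION": "FUNCTION",
            "TABLE": "TABLE"}

TOKEN_RE = re.compile(r"\[(?P<bracketed>[^\]]+)\]|(?P<word>\w+)")

def tokenize(text):
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.group("bracketed") is not None:
            # Bracketed names keep spaces and special characters verbatim
            tokens.append(("NAME", m.group("bracketed")))
        else:
            word = m.group("word")
            canonical = KEYWORDS.get(word.upper())  # case-insensitive lookup
            tokens.append(("KEYWORD", canonical) if canonical else ("NAME", word))
    return tokens

tokens = tokenize("calc table [My Column Name] func myfunc")
# -> [("KEYWORD", "CALCULATE"), ("KEYWORD", "TABLE"),
#     ("NAME", "My Column Name"), ("KEYWORD", "FUNCTION"), ("NAME", "myfunc")]
```

In an ANTLR grammar the same rules would become lexer rules (a bracketed-identifier token and case-insensitive keyword fragments), while "do not parse after this keyword" would map to lexer modes.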

Rework the project operation by removing the need for a separate link definition and uniting project with link

Problem: currently, to define a projection it is necessary to also define a link operation; that is, in order to project a table we need to create two definitions: one project operation and one link operation. We do it for several reasons:

  • [Reuse] The project definition specifies only a link column definition, whose role is to store the parameters necessary for the projection. Instead of having everything in the projection operation, we store part of it in the column operation.
  • [Explicit column declaration] The project operation is treated as a table operation which produces only a new table and not a column. In reality it also produces a new column, and this column has no separate definition. We solved this by requiring an explicitly defined link column even though it is never executed by itself: the column is produced by the project operation. Note that the link column in the context of the project operation can never be executed on its own, because the target table does not exist when it is defined.

In this task we want to eliminate this split of responsibilities between the project and link operations. The solution is to treat project as an operation which produces two elements: one table and one column (so it is no longer a purely table operation).

The project operation should be organized differently:

  • We need to ensure that the missing link definition in the project operation does not break other operations which rely on link definitions. For example, all such operations need to get the necessary data about link columns from the data schema (catalogue of columns, tables etc.) and not from the operation definitions (which define code, that is, how data is generated)
  • It should work without the need to define a link operation (the link column will be created from the parameters of the project operation)
  • Its implementation should rely on the same code which generates a link column
  • The only difference from the link operation is that link assumes the target table already exists, so it skips one step compared to project
  • In future, we could unite these two operations: if the target table exists, it is a link; otherwise it is a project operation which creates the missing target table
  • Optionally, consider whether a missing link column name in the project operation should mean that the link is not needed. This is probably not meaningful because the whole point is to have links; if they are not needed, that is a strange situation and a use case would be required. The only benefit would be not spending time computing unnecessary columns.

The main result and benefit is a clean separation between the two operation definitions, because currently it is conceptually difficult and somewhat controversial to define projections. It will also simplify the implementation.
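Conceptually, the combined operation can be sketched in plain pandas (an illustration of the idea, not prosto's implementation): project builds the target table and then reuses the same join a standalone link operation would perform, so it yields one table and one link column in a single step:

```python
import pandas as pd

# Sketch of project as a table-column operation: it returns both the
# new target table and the link column mapping facts to it.
def project(facts, keys):
    """Return (target_table, link_column)."""
    # The target table holds the distinct key combinations
    target = facts[keys].drop_duplicates().reset_index(drop=True)
    # Linking is the same merge a standalone link operation would do;
    # project simply runs it against the table it has just created
    link = facts.merge(target.reset_index(), on=keys, how="left")["index"]
    return target, link

facts = pd.DataFrame({"city": ["Berlin", "Berlin", "Paris"], "amount": [1, 2, 3]})
groups, link = project(facts, ["city"])
# groups has one row per distinct city; link is [0, 0, 1]
```

This also illustrates the future unification mentioned above: if `target` already existed, the function would skip its first step and reduce to a pure link operation.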

Generalize aggregation on primitive (non-link) grouping columns

Problem: the aggregation operation works only with link columns for grouping. It is a column operation which adds a new aggregate column to the group table. The group table must exist, and a link from a fact table (with the data to be aggregated) must also exist. However, it cannot be applied to the traditional groupby use case, where we take a fact table and specify one of its columns as the grouping criterion. The problem is that the grouping table does not exist, and hence we cannot define a new column for it.

In this task, we want to make the aggregate operation work when the table to which the new aggregate column has to be added does not exist:

  • Since the table for the new aggregate column does not exist, it has to be created; hence this becomes a table-column operation which produces three new elements: one table, one link column and one aggregate column
  • The grouping criterion can be an attribute of the fact table with the source data to be aggregated (not a link)
  • We distinguish two cases:
    • The currently implemented use case, where the grouping criterion is an already existing link column
    • The use case to be implemented, where the grouping criterion is a list of attributes (or columns?) without an existing target table
  • We actually need to combine two sets of definition parameters:
    • define a projection (source attributes, link name, group table name)
    • define an aggregation (measure columns, link name, group table name, new aggregate column name)

It seems difficult and unnatural to combine the two operations, projection and aggregation. Therefore the groupby use case should probably indeed be implemented as two chained operations, project and aggregate, in which case this task does not need to be implemented.
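The two-operation decomposition suggested above can be sketched in plain pandas (illustrative only, not prosto's API): project builds the group table and a link to it from the grouping attribute, then aggregate folds the measure over that link:

```python
import pandas as pd

facts = pd.DataFrame({"city": ["Berlin", "Berlin", "Paris"],
                      "amount": [10.0, 20.0, 5.0]})

# Step 1, project: create the group table from the grouping attribute
# and a link column from facts to it
groups = facts[["city"]].drop_duplicates().reset_index(drop=True)
facts["group_id"] = facts.merge(groups.reset_index(), on="city", how="left")["index"]

# Step 2, aggregate: a new column in the group table, computed by
# folding the measure over the link
groups["total"] = facts.groupby("group_id")["amount"].sum()
# groups: Berlin -> 30.0, Paris -> 5.0
```

Expressed this way, classic groupby is exactly project followed by aggregate, which supports the conclusion that no combined operation is needed.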

Print graph of operations and data schema

Implement helper functions:

  • Print the current data schema, that is, the list of tables, the list of columns and their connections; for example, named print_schema
  • Print the graph of operations with dependencies. Here we could print the original operations as defined by the user, or the whole graph of operations as produced by the topology translator; for example, named print_operations
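A hypothetical sketch of the two helpers is shown below. The schema and operation representations (plain dicts) are invented for illustration; prosto's internal model will differ:

```python
# Sketch of print_schema / print_operations over an invented dict model.
def format_schema(tables):
    """Render each table with its columns; link columns show their target table."""
    lines = []
    for table, columns in tables.items():
        lines.append(f"TABLE {table}")
        for name, target in columns.items():
            lines.append(f"  {name}" + (f" -> {target}" if target else ""))
    return "\n".join(lines)

def format_operations(operations):
    """Render the operation graph: each operation with its dependencies."""
    return "\n".join(
        f"{name} <- {', '.join(deps) if deps else '(source)'}"
        for name, deps in operations.items()
    )

def print_schema(tables):
    print(format_schema(tables))

def print_operations(operations):
    print(format_operations(operations))

schema = {"Sales": {"amount": None, "city_link": "Cities"},
          "Cities": {"name": None}}
print_schema(schema)
# TABLE Sales
#   amount
#   city_link -> Cities
# TABLE Cities
#   name
```

For the operation graph, the same helper could be called once on the user-defined operations and once on the translated topology, covering both variants mentioned above.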
