matts966 / alphasql Goto Github PK

AlphaSQL provides Integrated Type and Schema Check and Parallelization for SQL file set mainly for BigQuery

License: Apache License 2.0

Starlark 5.02% Makefile 4.33% Dockerfile 1.33% C++ 89.31%

bigquery zetasql typechecker parallelization graphviz dot-language datawarehouse datawarehouseautomation docker workflow-automation

alphasql's People

Contributors

Stargazers

Watchers

Forkers

hotoku lukas-at-harren tryweirdier kuniyuta

alphasql's Issues

Implement function schema reader

Dependency resolution of functions

We can simply resolve dependencies of functions in the same way with tables safely.

alphacheck should ignore tables and functions

Implement model dependency resolution

Escaping identifieres makes them different

`dataset.table` and dataset.table are recognized as different identifieres.

Unifying style in queries prevent this problem for a while.

Docker images tagged latest are not latest images

External Required Tables spec by yaml file?

Implement Warning for queries without create disposition

Filter temporary tables from referenced tables list because they are local

alphadag ignore output_path if the specified path's dirname does not exists

Add warning when side effects and their targets are not in the same script

Discussed in #5

Directory whose name ends with .sql should be ignored

Add --with-tables option

To illustrate required tables and generated tables.

required tables are not in EXTERNAL_REQUIRED_TABLES if the tables are created with recursive DDL

example:

create or replace table t1 -- ← created
AS (
select *
from t1 -- ← but referenced
union all
select *
from t2
)

For now, we can use INSERT or UPDATE statements.

Implement parameter type reader

For a while, we can use UDF instead of parameterized queries.

For example

CREATE FUNCTION IF NOT EXISTS `constants.get_parameter_a`() AS (100);
SELECT * FROM `x.x` WHERE a = `constants.get_parameter_a`();

You can replace the constants.get_parameter_a before executing the query.

How to build/test AlphaSQL?

Hopefully this is the right place to pose such a question.
I was wondering on how I could build AlphaSQL myself as well as run its tests.

Currently my efforts got stuck at building the docker/dev.Dockerfile.
Unfortunately those ended up failing for me on a fresh try:

$ docker build -t alphasql-dev -f docker/dev.Dockerfile .
$ docker run -ti --entrypoint /usr/bin/make alphasql-dev build

(I tried both on v1.3.4 and master branch)

I would be happy to follow along and setup myself up to build AlphaSQL (at least on Docker), as well as update the documentation with what I learned to contribute.

Arbitrary dependency expression

Current AlphaSQL (v0.6.0) can't simply handle arbitrary dependency expression for the sake of its simplicity (only dependency between table references and create table statements). #2

For example, we can't simply insert values into some tables before selecting from them if there are not create table statements.

There are 4 ways to solve this problem.

We can use a table for dependency by creating the table in a SQL file we want to process early and selecting from it in a SQL file we want to process after. We can use functions after #4 is implemented.
We can separate dag and run them serially.
We can serialize all the queries by SQL script, but it's not efficient.
We can implement a new flag to resolve arbitrary dependencies, but it will be complicated.

We can already use 1 and it's simple, I think.
Example implementation is in ./samples/sample-arbitrary-dependency-graph-with-drop-statement.

Change binary file names

pipeline_type_checker to alphacheck

dag to alphadag

to be more clear.

The same name for table, UDF, TVF should not be allowed

The expected error is Already Exists: Routine test

some sql with `declare` does not generate any references (although it should)

Given a query 'child_table.sql`:

declare foo_var int64;

create or replace table dataset.child_table as
select * from dataset.parent_table

Running alphadag against it, produces unexpected results, not recognising dataset.parent_table as the precursor to dataset.child_table.

$ alphadag --with_tables child_table.sql
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "foo.sql"
digraph G {
0 [label="child_table.sql", shape="", type=query];
1 [label="dataset.child_table", shape=box, type=table];
0->1 ;
}
EXTERNAL REQUIRED TABLES:

Expected results:

digraph G {
0 [label="child_table.sql", shape="", type=query];
1 [label="dataset.child_table", shape=box, type=table];
2 [label="dataset.parent_table", shape=box, type=table];
0->1 ;
2->1 ;
}
EXTERNAL REQUIRED TABLES:
dataset.parent_table

I expected for dataset.parent_table to be the precursor to the table created in child_table.sql.

Creating multiple tables whose name is the same should cause error

Creating multiple tables whose name is the same not only can cause a cycle, but also cause undefined behavior. We can visualize graphs and report errors if there are cycles, but it is more desirable to report errors if multiple tables which has the same name.

Also, function names should be checked in the same way if #4 implemented.

create view statements are not recognized

Hi,

I wanted to open a discussion about "CREATE VIEW" statements, as I think they should show up in the dependency graph.

What do you think? Using alphadag with the option --with_tables. Maybe another option --with_views would be a non breaking backwards compatible change to add these statements to the graph.

Type checking after topological sorting is not enough

Type checking after topological sorting can't check all the possible problems and we can check all the possible orders by full search, but it's not efficient and almost all the problems can be checked by topological sort if the dependency resolution is correct.

Support for `_TABLE_SUFFIX`

Sometimes there are cycles in extracted graph

We can prevent cycles by making dependency only between table references and create table statements if a table is not created by multiple statements and serial execution can create required tables.
However, we are making dependency of INSERT on CREATE, UPDATE on INSERT and other DMLs on UPDATE for it is useful in preprocessing (before v0.5).
If it is needed, we should implement an option to make dependency only between table references and create table statements simply.

Add --warning-as-error option

To be more scrict.
Target warnings are as follows

UPDATE and INSERT targets are not in the same script #53
Dependency cycle