Git Product home page Git Product logo

matts966 / alphasql Goto Github PK

View Code? Open in Web Editor NEW
48.0 2.0 4.0 291.03 MB

AlphaSQL provides Integrated Type and Schema Check and Parallelization for SQL file set mainly for BigQuery

License: Apache License 2.0

Starlark 5.02% Makefile 4.33% Dockerfile 1.33% C++ 89.31%
bigquery zetasql typechecker parallelization graphviz dot-language datawarehouse datawarehouseautomation docker workflow-automation

alphasql's People

Contributors

matts966 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

alphasql's Issues

Implement parameter type reader

For a while, we can use UDF instead of parameterized queries.

For example

CREATE FUNCTION IF NOT EXISTS `constants.get_parameter_a`() AS (100);
SELECT * FROM `x.x` WHERE a = `constants.get_parameter_a`();

You can replace the constants.get_parameter_a before executing the query.

How to build/test AlphaSQL?

Hopefully this is the right place to pose such a question.
I was wondering on how I could build AlphaSQL myself as well as run its tests.

Currently my efforts got stuck at building the docker/dev.Dockerfile.
Unfortunately those ended up failing for me on a fresh try:

$ docker build -t alphasql-dev -f docker/dev.Dockerfile .
$ docker run -ti --entrypoint /usr/bin/make alphasql-dev build

(I tried both on v1.3.4 and master branch)

I would be happy to follow along and setup myself up to build AlphaSQL (at least on Docker), as well as update the documentation with what I learned to contribute.

Arbitrary dependency expression

Current AlphaSQL (v0.6.0) can't simply handle arbitrary dependency expression for the sake of its simplicity (only dependency between table references and create table statements). #2

For example, we can't simply insert values into some tables before selecting from them if there are not create table statements.

There are 4 ways to solve this problem.

  1. We can use a table for dependency by creating the table in a SQL file we want to process early and selecting from it in a SQL file we want to process after. We can use functions after #4 is implemented.
  2. We can separate dag and run them serially.
  3. We can serialize all the queries by SQL script, but it's not efficient.
  4. We can implement a new flag to resolve arbitrary dependencies, but it will be complicated.

We can already use 1 and it's simple, I think.
Example implementation is in ./samples/sample-arbitrary-dependency-graph-with-drop-statement.

some sql with `declare` does not generate any references (although it should)

Given a query 'child_table.sql`:

declare foo_var int64;

create or replace table dataset.child_table as
select * from dataset.parent_table

Running alphadag against it, produces unexpected results, not recognising dataset.parent_table as the precursor to dataset.child_table.

$ alphadag --with_tables child_table.sql
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "foo.sql"
digraph G {
0 [label="child_table.sql", shape="", type=query];
1 [label="dataset.child_table", shape=box, type=table];
0->1 ;
}
EXTERNAL REQUIRED TABLES:

Expected results:

digraph G {
0 [label="child_table.sql", shape="", type=query];
1 [label="dataset.child_table", shape=box, type=table];
2 [label="dataset.parent_table", shape=box, type=table];
0->1 ;
2->1 ;
}
EXTERNAL REQUIRED TABLES:
dataset.parent_table

I expected for dataset.parent_table to be the precursor to the table created in child_table.sql.

Creating multiple tables whose name is the same should cause error

Creating multiple tables whose name is the same not only can cause a cycle, but also cause undefined behavior. We can visualize graphs and report errors if there are cycles, but it is more desirable to report errors if multiple tables which has the same name.

Also, function names should be checked in the same way if #4 implemented.

create view statements are not recognized

Hi,

I wanted to open a discussion about "CREATE VIEW" statements, as I think they should show up in the dependency graph.

What do you think? Using alphadag with the option --with_tables. Maybe another option --with_views would be a non breaking backwards compatible change to add these statements to the graph.

Type checking after topological sorting is not enough

Type checking after topological sorting can't check all the possible problems and we can check all the possible orders by full search, but it's not efficient and almost all the problems can be checked by topological sort if the dependency resolution is correct.

Sometimes there are cycles in extracted graph

  • We can prevent cycles by making dependency only between table references and create table statements if a table is not created by multiple statements and serial execution can create required tables.
  • However, we are making dependency of INSERT on CREATE, UPDATE on INSERT and other DMLs on UPDATE for it is useful in preprocessing (before v0.5).
  • If it is needed, we should implement an option to make dependency only between table references and create table statements simply.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.