Dataform Core

Dataform Core is an open source meta-language to create SQL tables and workflows in BigQuery. Dataform Core extends SQL by providing a dependency management system, automated data quality testing, and data documentation.

Using Dataform Core, data teams can build scalable SQL data transformation pipelines following software engineering best practices, like version control and testing.

For more details, see how Dataform works.

Data collections and integrations feed into Dataform, which exports this data to BI and analytics tools.

Get started

In Google Cloud Platform

Dataform in Google Cloud Platform provides a fully managed experience to build scalable data transformation pipelines in BigQuery using SQL. It includes:

  • A cloud development environment for developing data assets with SQL and Dataform Core, with version control through GitHub, GitLab, and other Git providers.
  • A fully managed, serverless orchestration environment for data pipelines, fully integrated into Google Cloud Platform.

Follow the quickstart guide!

With the CLI

You can run Dataform locally using the Dataform CLI tool, which can be installed with the following command. Follow the CLI guide to get started.

npm i -g @dataform/cli

Useful Links

Note: this readme can also be viewed on https://dataform-co.github.io/dataform.

Example Projects

Want to report a bug or request a feature?

  • For Dataform Core / open source requests, you can open an issue in GitHub.
  • For Dataform in Google Cloud Platform, you can file a bug here, and file feature requests here.

Want to contribute?

Check out our contributors guide to get started with setting up the repo.

Issues

Add documentation for config() API

Currently there is no documentation for the config API calls such as:

${config({
  type: "table",
  schema: {
    sample: "Sample field.",
    foobar: "Foobar field"
  }
})}

Create a compilation e2e test

Create a new sample project, possibly derived from typescript/example-bigquery, that uses most of the primary features currently available, and check that it compiles.

The test should check that the project compiles without failure.

It should define a materialization, an operation, and an assertion through both SQL files and a JS file, and check that the two definition methods produce equal output for each type.

Gracefully fail compilation when encountering compilation and validation errors

Right now, errors in individual files, or validation errors at the compile stage, throw an exception that stops the entire project from compiling. This is probably overkill!

Instead we should:

  • Add a compile_errors field to the compiled graph proto that stores the file name and error message for any errors thrown during compilation of specific files (actual JS errors).
  • Wrap each file require in a try/catch and populate the error field above (sketched below)
  • Add a validation_errors field to the Materialization proto message, for validation errors.
  • Populate the validation errors field instead of throwing exceptions during compilation. This includes missing dependencies, missing where clauses in incremental tables, and so on.
  • Add tests for both compilation errors and validation errors
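
A minimal sketch of the per-file try/catch, assuming the field names from the checklist above and a projectFiles list gathered elsewhere:

interface CompileError { fileName: string; message: string; }

const projectFiles: string[] = []; // populated from the project's definitions/ directory
const compileErrors: CompileError[] = [];

for (const fileName of projectFiles) {
  try {
    require(fileName); // compile each definition file in isolation
  } catch (e) {
    // record the failure instead of aborting the whole compilation
    compileErrors.push({ fileName, message: e instanceof Error ? e.message : String(e) });
  }
}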

Show compilation error line numbers

JavaScript errors thrown during project compilation don't show the corresponding line numbers.
This is due to the way compilation runs in NodeVM.

Migrate bigquery adapter to start query jobs instead of just queries

Currently we run "interactive" style queries instead of starting query jobs.

The upside of this is that it's simpler; the downside is that we can't cancel a running query, so when a deployment is cancelled, any currently running queries will continue to execute.

We should change this, so that when Executor.cancel is called, it actually stops any query jobs that are in progress.
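
For reference, a sketch of what a job-based execution path could look like with the @google-cloud/bigquery client (the surrounding Executor wiring is assumed):

import { BigQuery } from "@google-cloud/bigquery";

// Starting a job (rather than an interactive query) returns a handle we can cancel later.
async function runCancellable(query: string) {
  const [job] = await new BigQuery().createQueryJob({ query });
  // When Executor.cancel is called, job.cancel() stops the in-flight work:
  // await job.cancel();
  const [rows] = await job.getQueryResults();
  return rows;
}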

Implement and document protected() API

Incremental materializations can be protected, which means the tables can't be re-written from scratch when using --full-refresh.

Implement the protected() API on Materialization and MaterializationContext.

This should set the protobuf protected value to true.
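
A minimal sketch of what this could look like, with the proto shape assumed:

// Method named protected() per the issue; setting the proto field marks the
// materialization as non-rebuildable under --full-refresh.
class Materialization {
  proto = { protected: false };
  protected(): this {
    this.proto.protected = true;
    return this;
  }
}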

Support ref()'ing an operation

Currently, using ref() will only look up tables produced by materializations.

In the near future it will be possible for ops.sql or calls to operate() to generate outputs. We should be able to reference an operation in the same way as we do a materialization.

  • Add a target field (of type Target) to the Operation proto, just like in the Materialization proto
  • Parse the name of the node like we do for materializations, and set the protobuf field above
  • Add a method to Operation and OperationContext to mark it as producing an output, such as hasOutput(hasOutput: boolean)
  • Change the ref() function to also look at operations and resolve the operation target to a queryable string, but only if the operation is marked as hasOutput (see the sketch below)
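
A sketch of the extended lookup, with the node shapes assumed:

interface Target { schema: string; name: string; }
interface Node { name: string; target: Target; hasOutput?: boolean; }

// ref() first checks materializations, then operations explicitly marked hasOutput.
function ref(name: string, tables: Node[], operations: Node[]): string {
  const node =
    tables.find(t => t.name === name) ||
    operations.find(o => o.name === name && o.hasOutput);
  if (!node) throw new Error(`Could not resolve "${name}"`);
  return `\`${node.target.schema}.${node.target.name}\``; // BigQuery-style quoting
}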

Incremental table select statements should only select described fields

Currently for incremental tables in all warehouse types, when we first create a table, we use ALL fields from the source query.

For example, an incremental table defined such as:

--js type("incremental");
--js where(`timestamp > (select max(timestamp) from ${self()})`);
--js descriptor(["timestamp", "action"]);

select timestamp, action, user_id
from weblogs.user_actions

Will generate a statement like the following to create the table if it doesn't exist:

create or replace table default_schema.example_incremental as
  select timestamp, action, user_id
  from weblogs.user_actions;

However, since only the fields "timestamp" and "action" are part of the descriptor, this is not the intended behaviour: "user_id" is also populated in the table. We should instead create the table with only the descriptor's fields when it doesn't exist, generating a query like:

create or replace table default_schema.example_incremental as
select timestamp, action from (
  select timestamp, action, user_id
  from weblogs.user_actions);

This outer select statement can be automatically generated, using the keys of the proto.descriptor object (which must be provided for incremental tables).

This needs to be fixed for all warehouse types, and the code probably refactored to use the same function for generating these queries.
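
A minimal sketch of that generation step, with the function name and shapes assumed:

// Wrap the source query in an outer select restricted to the descriptor's keys.
function createFromDescriptor(
  target: string,
  query: string,
  descriptor: { [field: string]: string }
): string {
  const fields = Object.keys(descriptor).join(", ");
  return `create or replace table ${target} as\nselect ${fields} from (\n${query});`;
}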

Extend the DbAdapter interface to have a method for validating statements

There should be a way to validate statements against the warehouse.

For example, in BigQuery this can be accomplished using the dryRun flag when running a query.

https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#dryRun

In Redshift, this can be accomplished using the EXPLAIN syntax. https://www.postgresql.org/docs/current/sql-explain.html

  • Add a new method to the DbAdapter such as validate(query: string) (sketched below)
  • Add a new CLI command to validate a query such as dataform query-validate
  • Add a new CLI command to validate all queries in a project such as dataform validate
  • Add an option called --validate that can be supplied to both the build and run commands and will cause the queries to be validated before they are run.
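
A sketch of the BigQuery case, assuming the adapter holds a @google-cloud/bigquery client:

import { BigQuery } from "@google-cloud/bigquery";

interface DbAdapter {
  validate(query: string): Promise<void>;
}

class BigQueryDbAdapter implements DbAdapter {
  private client = new BigQuery();
  async validate(query: string): Promise<void> {
    // A dry-run job compiles and validates the query without executing it;
    // the returned promise rejects if the statement is invalid.
    await this.client.createQueryJob({ query, dryRun: true });
  }
}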

Moved from tada-science/dataform-co#69
/cc @lewish

Add a command to override the output schema of a materialization

Currently there is no way to change the output schema of a materialization; materializations are always created in the default schema.

There is already an API called schema() that is used for something else, so we need to choose a different name. Possibly something like targetSchema() or just target({ table: table, schema: schema }).

  • Add an API to Materialization and MaterializationContext to change the target schema (sketched below)
  • Add documentation covering the new API
  • Add test cases to make sure the target is set in the underlying materialization proto
  • Add tests to make sure that ref() uses the correct target
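
A minimal sketch, assuming the targetSchema() name sticks and a simple target shape:

class Materialization {
  proto = { target: { schema: "default_schema", name: "example" } };
  targetSchema(schema: string): this {
    this.proto.target.schema = schema; // ref() should then resolve to this target
    return this;
  }
}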

Implement disabled() API for materializations

It should be possible to mark any materialization as disabled. This means they won't be run, and are removed from the execution graph during the build step.

  • Add a disabled boolean field to the Materialization protobuf message
  • Implement disabled() method on the Materialization and MaterializationContext classes that sets the proto field
  • Add disabled field to the MConfig interface so it can also be set through config() calls and pipe it through
  • Remove disabled materializations from the graph during the build step (sketched below). See ts/api/commands/build.js
  • Add tests to make sure disabled materializations are removed from the execution graph. See ts/tests/api.js
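
The build-step filtering could be as simple as the following sketch (shapes assumed):

interface Materialization { name: string; disabled?: boolean; }

// Drop disabled materializations before constructing the execution graph.
function executableMaterializations(all: Materialization[]): Materialization[] {
  return all.filter(m => !m.disabled);
}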

Pretty print run status in real-time

The command dataform run prints nothing until the run is finished, which may take several hours.
We should provide information about the status of the run as it progresses.

  • Print that the run is starting, and the number of nodes and tasks to run
  • Print a line when any node completes, showing whether it failed, succeeded, or was skipped
  • Add an option to write the ExecutedGraph JSON to a file (what it currently prints to STDOUT)

Move BigQuery partitionBy field into a bq specific message

Currently we store the partitionBy setting in the top level of the Materialization proto.

As this is BigQuery-specific, we should move it into a sub-field:

{
  bigquery: {
    partitionBy: "expression"
  }
}

We should also change the API for setting this value, similar to the proposal in #15

${bigquery({ partitionBy: "expression" })}

Add boilerplate for local UI

For a number of future features, we plan to serve a UI on localhost that can aid with a number of tasks such as:

  • Viewing the compiled output of a project (and recompiling automatically when it changes)
  • Running new deployments
  • Viewing the results of a run
  • Setting up a warehouse connection

To do all of this, we should add a new command, dataform serve, that starts up a local web server and serves a basic React app that we can add these features to.
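
A rough sketch of such a command, assuming express and a hypothetical compileProject() helper:

import express from "express";

declare function compileProject(): object; // hypothetical: recompiles and returns the graph

const app = express();
app.use(express.static("public")); // the compiled React app
app.get("/api/graph", (_req, res) => {
  res.json(compileProject()); // recompile on request so the UI always shows fresh output
});
app.listen(8080, () => console.log("Dataform UI listening on http://localhost:8080"));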

Terminology updates, rename materialization(s) to table(s) and materialize() to publish()

See terminology design doc here for complete explanation:

https://www.notion.so/dataform/Terminology-DD-d2ee07baee9548a686befa0e80069d04

  • Rename the global materialize(...) method to publish(...), but keep existing methods in place for backwards compatibility.
  • Rename CompiledGraph.materializations to CompiledGraph.tables
  • Rename Materialization proto and related classes to Table
  • Rename the existing Table protobuf to TableMetadata

Check for any other dangling references to the phrase "materialize" and update as appropriate.
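
Keeping the old global around could be as simple as an alias, sketched here:

// New name going forward; implementation elided.
function publish(name: string): void {}

// Deprecated alias retained for backwards compatibility, per the issue.
const materialize = publish;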

Support sortkey, distkey, and diststyle for Redshift materializations

It should be possible to specify distkey, sortkey, and diststyle for redshift tables.

Proposed APIs

materialize("example", {
  redshift: {
    distkey: "column1",
    diststyle: "even",
    sortkeys: ["column1", "column2"],
    sortstyle: "compound"
  }
});

This would result in a create table statement such as:

CREATE TABLE example
DISTSTYLE EVEN
DISTKEY (column1)
COMPOUND SORTKEY (column1, column2)
AS ...

  • Add new message and fields to the Materialization proto
  • Implement a generic redshift() method on Materialization and MaterializationContext to specify these settings
  • Update the RedshiftAdapter to generate the correct CREATE TABLE statement (sketched below)
  • Add tests for the API
  • Add tests to make sure the correct SQL is generated
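
A sketch of how the RedshiftAdapter could assemble those modifiers, with option names taken from the proposed API above:

interface RedshiftOptions {
  distkey?: string;
  diststyle?: string;
  sortkeys?: string[];
  sortstyle?: string;
}

function redshiftCreateTablePrefix(table: string, opts: RedshiftOptions): string {
  const parts = [`CREATE TABLE ${table}`];
  if (opts.diststyle) parts.push(`DISTSTYLE ${opts.diststyle.toUpperCase()}`);
  if (opts.distkey) parts.push(`DISTKEY (${opts.distkey})`);
  if (opts.sortkeys && opts.sortkeys.length) {
    const style = opts.sortstyle ? `${opts.sortstyle.toUpperCase()} ` : "";
    parts.push(`${style}SORTKEY (${opts.sortkeys.join(", ")})`);
  }
  return parts.join("\n") + "\nAS ";
}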

Implement Redshift adapter and runner

Implement and test the RedshiftAdapter interface for connecting to Redshift.

Implement the tables() and schema() methods on the RedshiftRunner class.

Write integration tests for Redshift.

Add CLI command to generate data warehouse profiles

Currently data warehouse profiles must be created manually.
We should provide commands to generate warehouse profiles that collect user, password, port, etc., and write the profile JSON to disk.

BigQuery - requires a project ID and the path to a service-account key file.
Redshift - requires IP/host, port, user, password, and thread count.
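
For illustration, hypothetical profile shapes built from the fields above (all values are placeholders):

const bigqueryProfile = {
  projectId: "my-gcp-project",
  keyFile: "/path/to/service-account-key.json",
};

const redshiftProfile = {
  host: "redshift.example.com",
  port: 5439,
  user: "dataform",
  password: "collected-at-prompt",
  threads: 4,
};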

Automatically escape back-ticks in SQL queries

Backticks are used in BigQuery for fully qualified table names. Because the SQL contents get parsed directly as a JS template string, the user currently has to write \` everywhere, which is quite annoying.

Now that JS blocks are implemented, we can automatically escape unescaped backticks in the non-JS SQL file contents. Any function calls that use backticks outside of JS blocks will no longer work, however (but using JS blocks is the recommended approach anyway).
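
A minimal sketch of the escaping pass (the exact regex is an assumption):

// Match an optional backslash followed by a backtick; leave already-escaped
// backticks alone and escape the rest.
function escapeBackticks(sql: string): string {
  return sql.replace(/\\?`/g, match => (match.length === 2 ? match : "\\`"));
}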

Move validation into a standalone operation

Currently validation happens in a number of different places: during compilation, but also during API calls to the core framework.

As a result, it's hard or impossible for other parts of the library to validate a graph (for example, if we want to run validation before we build or run a graph).

We should untangle the current validation code into a single method that takes a CompiledGraph proto and returns a copy with validation errors added appropriately.
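
A sketch of that single entry point, with shapes assumed:

interface CompiledGraph { validationErrors?: string[]; }

// Collect errors rather than throwing, and return an annotated copy of the graph.
function validate(graph: CompiledGraph): CompiledGraph {
  const errors: string[] = [];
  // run dependency, incremental-where-clause, and other checks here,
  // appending a message for each failure
  return { ...graph, validationErrors: errors };
}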

Implement declare() API

It should be possible to declare external tables that your project depends on but that are not generated by the dataform project. For example, if I automatically push my web logs to the BigQuery table weblogs.logs, then in my dataform project I can create a file:

definitions/external_tables.js:

declare("weblogs.logs");

And reference them in other queries, e.g. definitions/example.sql:

select * from ${ref("weblogs.logs")}

This would allow us to show external tables as part of the DAG, and would be useful for debugging, as well as simplifying queries (in BigQuery).

  • Add a new proto type Declaration
  • Add declarations to CompiledGraph proto
  • Make it possible to ref() a declaration
  • Implement a Declaration class similar to Materialization that just stores a Target derived from the provided name (see the sketch below)
  • Make sure declarations complete immediately as a no-op during execution of the graph
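
A sketch of the Declaration class, with the proto shape assumed:

interface Target { schema: string; name: string; }

class Declaration {
  proto: { target: Target };
  constructor(name: string) {
    const [schema, table] = name.split(".");
    this.proto = { target: { schema, name: table } }; // target derived from the provided name
  }
}

At execution time, declarations would complete immediately as no-ops, as noted above.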
