Dataform Core

Dataform Core is an open source meta-language to create SQL tables and workflows in BigQuery. Dataform Core extends SQL by providing a dependency management system, automated data quality testing, and data documentation.

Using Dataform Core, data teams can build scalable SQL data transformation pipelines following software engineering best practices, like version control and testing.

For more details, see how Dataform works.

Data collections and integrations feed into Dataform, which exports this data to BI and analytics tools.

Get started

In Google Cloud Platform

Dataform in Google Cloud Platform provides a fully managed experience to build scalable data transformation pipelines in BigQuery using SQL. It includes:

  • A cloud development environment for developing data assets with SQL and Dataform Core, with version control through GitHub, GitLab, and other Git providers.
  • A fully managed, serverless orchestration environment for data pipelines, fully integrated into Google Cloud Platform.

Follow the quickstart guide!

With the CLI

You can run Dataform locally using the Dataform CLI tool, which can be installed with the following command. Follow the CLI guide to get started.

npm i -g @dataform/cli

Useful Links

Note: this readme can also be viewed on https://dataform-co.github.io/dataform.

Example Projects

Want to report a bug or request a feature?

  • For Dataform Core / open source requests, you can open an issue in GitHub.
  • For Dataform in Google Cloud Platform, you can file a bug here, and file feature requests here.

Want to contribute?

Check out our contributors guide to get started with setting up the repo.

Issues

Add documentation for config() API

Currently there is no documentation for the config API calls such as:

${config({
  type: "table",
  schema: {
    sample: "Sample field.",
    foobar: "Foobar field"
  }
})}

Create a compilation e2e test

Create a new sample project, possibly derived from typescript/example-bigquery, that uses most of the primary features currently available, and check that it compiles.

The test should check that the project compiles without failure.

It should define a materialization, an operation, and an assertion through both SQL files and a JS file, and check that the two definition methods produce equal output for each type.

Gracefully fail compilation when encountering compilation and validation errors

Right now, errors in individual files, or validation errors at the compile stage, throw an exception that stops the entire project from compiling. This is probably overkill!

Instead we should:

  • Add a compile_errors field to the compiled graph proto that stores the file name and error message for any errors thrown during compilation of specific files (actual JS errors).
  • Wrap each file require in a try/catch and populate the error field above (sketched below)
  • Add a validation_errors field to the Materialization proto message, for validation errors.
  • Populate the validation errors field instead of throwing exceptions during compilation. This includes missing dependencies, missing where clauses in incremental tables, and so on.
  • Add tests for both compilation errors and validation errors
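
A minimal sketch of the per-file try/catch, assuming the field names from the checklist above and a projectFiles list gathered elsewhere:

interface CompileError { fileName: string; message: string; }

const projectFiles: string[] = []; // populated from the project's definitions/ directory
const compileErrors: CompileError[] = [];

for (const fileName of projectFiles) {
  try {
    require(fileName); // compile each definition file in isolation
  } catch (e) {
    // record the failure instead of aborting the whole compilation
    compileErrors.push({ fileName, message: e instanceof Error ? e.message : String(e) });
  }
}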

Show compilation error line numbers

JavaScript errors thrown during project compilation don't show the corresponding line numbers.
This is due to the way compilation runs in NodeVM.

Migrate bigquery adapter to start query jobs instead of just queries

Currently we run "interactive" style queries instead of starting query jobs.

The upside of this is that it's simpler; the downside is that we can't cancel a running query, so when a deployment is cancelled, any currently running queries will continue to execute.

We should change this, so that when Executor.cancel is called, it actually stops any query jobs that are in progress.
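
For reference, a sketch of what a job-based execution path could look like with the @google-cloud/bigquery client (the surrounding Executor wiring is assumed):

import { BigQuery } from "@google-cloud/bigquery";

// Starting a job (rather than an interactive query) returns a handle we can cancel later.
async function runCancellable(query: string) {
  const [job] = await new BigQuery().createQueryJob({ query });
  // When Executor.cancel is called, job.cancel() stops the in-flight work:
  // await job.cancel();
  const [rows] = await job.getQueryResults();
  return rows;
}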

Implement and document protected() API

Incremental materializations can be protected, which means the tables can't be re-written from scratch when using --full-refresh.

Implement the protected() API on Materialization and MaterializationContext.

This should set the protobuf protected value to true.
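
A minimal sketch of what this could look like, with the proto shape assumed:

// Method named protected() per the issue; setting the proto field marks the
// materialization as non-rebuildable under --full-refresh.
class Materialization {
  proto = { protected: false };
  protected(): this {
    this.proto.protected = true;
    return this;
  }
}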

Support ref()'ing an operation

Currently, using ref() will only look up tables produced by materializations.

In the near future it will be possible for ops.sql or calls to operate() to generate outputs. We should be able to reference an operation in the same way as we do a materialization.

  • Add a target field (of type Target) to the Operation proto, just like in the Materialization proto
  • Parse the name of the node like we do for materializations, and set the protobuf field above
  • Add a method to Operation and OperationContext to mark it as producing an output, such as hasOutput(hasOutput: boolean)
  • Change the ref() function to also look at operations and resolve the operation target to a queryable string, but only if the operation is marked as hasOutput (see the sketch below)
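
A sketch of the extended lookup, with the node shapes assumed:

interface Target { schema: string; name: string; }
interface Node { name: string; target: Target; hasOutput?: boolean; }

// ref() first checks materializations, then operations explicitly marked hasOutput.
function ref(name: string, tables: Node[], operations: Node[]): string {
  const node =
    tables.find(t => t.name === name) ||
    operations.find(o => o.name === name && o.hasOutput);
  if (!node) throw new Error(`Could not resolve "${name}"`);
  return `\`${node.target.schema}.${node.target.name}\``; // BigQuery-style quoting
}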

Incremental table select statements should only select described fields

Currently for incremental tables in all warehouse types, when we first create a table, we use ALL fields from the source query.

For example, an incremental table defined such as:

--js type("incremental");
--js where(`timestamp > (select max(timestamp) from ${self()})`);
--js descriptor(["timestamp", "action"]);

select timestamp, action, user_id
from weblogs.user_actions

Will generate a statement like the following to create the table if it doesn't exist:

create or replace table default_schema.example_incremental as
  select timestamp, action, user_id
  from weblogs.user_actions;

However, since only the fields "timestamp" and "action" are part of the descriptor, this is not the intended behaviour: "user_id" is also populated in the table. We should instead create the table with only the descriptor's fields when it doesn't exist, generating a query like:

create or replace table default_schema.example_incremental as
select timestamp, action from (
  select timestamp, action, user_id
  from weblogs.user_actions);

This outer select statement can be automatically generated, using the keys of the proto.descriptor object (which must be provided for incremental tables).

This needs to be fixed for all warehouse types, and the code probably refactored to use the same function for generating these queries.
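
A minimal sketch of that generation step, with the function name and shapes assumed:

// Wrap the source query in an outer select restricted to the descriptor's keys.
function createFromDescriptor(
  target: string,
  query: string,
  descriptor: { [field: string]: string }
): string {
  const fields = Object.keys(descriptor).join(", ");
  return `create or replace table ${target} as\nselect ${fields} from (\n${query});`;
}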

Extend the DbAdapter interface to have a method for validating statements

There should be a way to validate statements against the warehouse.

For example, in BigQuery this can be accomplished using the dryRun flag when running a query.

https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#dryRun

In Redshift, this can be accomplished using the EXPLAIN syntax. https://www.postgresql.org/docs/current/sql-explain.html

  • Add a new method to the DbAdapter such as validate(query: string) (sketched below)
  • Add a new CLI command to validate a query such as dataform query-validate
  • Add a new CLI command to validate all queries in a project such as dataform validate
  • Add an option called --validate that can be supplied to both the build and run commands and will cause the queries to be validated before they are run.
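
A sketch of the BigQuery case, assuming the adapter holds a @google-cloud/bigquery client:

import { BigQuery } from "@google-cloud/bigquery";

interface DbAdapter {
  validate(query: string): Promise<void>;
}

class BigQueryDbAdapter implements DbAdapter {
  private client = new BigQuery();
  async validate(query: string): Promise<void> {
    // A dry-run job compiles and validates the query without executing it;
    // the returned promise rejects if the statement is invalid.
    await this.client.createQueryJob({ query, dryRun: true });
  }
}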

Moved from tada-science/dataform-co#69
/cc @lewish

Add a command to override the output schema of a materialization

Currently there is no way to change the output schema of a materialization; materializations are always created in the default schema.

There is already an API called schema() that is used for something else, so we need to choose a different name. Possibly something like targetSchema() or just target({ table: table, schema: schema }).

  • Add an API to Materialization and MaterializationContext to change the target schema (sketched below)
  • Add documentation covering the new API
  • Add test cases to make sure the target is set in the underlying materialization proto
  • Add tests to make sure that ref() uses the correct target
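
A minimal sketch, assuming the targetSchema() name sticks and a simple target shape:

class Materialization {
  proto = { target: { schema: "default_schema", name: "example" } };
  targetSchema(schema: string): this {
    this.proto.target.schema = schema; // ref() should then resolve to this target
    return this;
  }
}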

Implement disabled() API for materializations

It should be possible to mark any materialization as disabled. This means they won't be run, and are removed from the execution graph during the build step.

  • Add a disabled boolean field to the Materialization protobuf message
  • Implement disabled() method on the Materialization and MaterializationContext classes that sets the proto field
  • Add disabled field to the MConfig interface so it can also be set through config() calls and pipe it through
  • Remove disabled materializations from the graph during the build step (sketched below). See ts/api/commands/build.js
  • Add tests to make sure disabled materializations are removed from the execution graph. See ts/tests/api.js
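
The build-step filtering could be as simple as the following sketch (shapes assumed):

interface Materialization { name: string; disabled?: boolean; }

// Drop disabled materializations before constructing the execution graph.
function executableMaterializations(all: Materialization[]): Materialization[] {
  return all.filter(m => !m.disabled);
}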

Pretty print run status in real-time

The command dataform run prints nothing until the run is finished, which may take several hours.
We should provide information about the status of the run as it progresses.

  • Print that the run is starting, and the number of nodes and tasks to run
  • Print a line when any node completes, showing whether it failed, succeeded, or was skipped
  • Add an option to write the ExecutedGraph JSON to a file (what it currently prints to STDOUT)

Move BigQuery partitionBy field into a bq specific message

Currently we store the partitionBy setting in the top level of the Materialization proto.

As this is BigQuery-specific, we should move it into a sub-field:

{
  bigquery: {
    partitionBy: "expression"
  }
}

We should also change the API for setting this value, similar to the proposal in #15

${bigquery({ partitionBy: "expression" })}

Add boilerplate for local UI

For a number of future features, we plan to serve a UI on localhost that can aid with a number of tasks such as:

  • Viewing the compiled output of a project (and recompiling automatically when it changes)
  • Running new deployments
  • Viewing the results of a run
  • Setting up a warehouse connection

To do all of this, we should add a new command, dataform serve, that starts up a local web server and serves a basic React app that we can add these features to.
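
A rough sketch of such a command, assuming express and a hypothetical compileProject() helper:

import express from "express";

declare function compileProject(): object; // hypothetical: recompiles and returns the graph

const app = express();
app.use(express.static("public")); // the compiled React app
app.get("/api/graph", (_req, res) => {
  res.json(compileProject()); // recompile on request so the UI always shows fresh output
});
app.listen(8080, () => console.log("Dataform UI listening on http://localhost:8080"));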

Terminology updates, rename materialization(s) to table(s) and materialize() to publish()

See terminology design doc here for complete explanation:

https://www.notion.so/dataform/Terminology-DD-d2ee07baee9548a686befa0e80069d04

  • Rename the global materialize(...) method to publish(...), but keep existing methods in place for backwards compatibility.
  • Rename CompiledGraph.materializations to CompiledGraph.tables
  • Rename Materialization proto and related classes to Table
  • Rename the existing Table protobuf to TableMetadata

Check for any other dangling references to the phrase "materialize" and update as appropriate.
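
Keeping the old global around could be as simple as an alias, sketched here:

// New name going forward; implementation elided.
function publish(name: string): void {}

// Deprecated alias retained for backwards compatibility, per the issue.
const materialize = publish;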

Support sortkey, distkey, and diststyle for Redshift materializations

It should be possible to specify distkey, sortkey, and diststyle for redshift tables.

Proposed APIs

materialize("example", {
  redshift: {
    distkey: "column1",
    diststyle: "even",
    sortkeys: ["column1", "column2"],
    sortstyle: "compound"
  }
});

This would result in a create table statement such as:

CREATE TABLE example
DISTSTYLE EVEN
DISTKEY (column1)
COMPOUND SORTKEY (column1, column2)
AS ...

  • Add new message and fields to the Materialization proto
  • Implement a generic redshift() method on Materialization and MaterializationContext to specify these settings
  • Update the RedshiftAdapter to generate the correct CREATE TABLE statement (sketched below)
  • Add tests for the API
  • Add tests to make sure the correct SQL is generated
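
A sketch of how the RedshiftAdapter could assemble those modifiers, with option names taken from the proposed API above:

interface RedshiftOptions {
  distkey?: string;
  diststyle?: string;
  sortkeys?: string[];
  sortstyle?: string;
}

function redshiftCreateTablePrefix(table: string, opts: RedshiftOptions): string {
  const parts = [`CREATE TABLE ${table}`];
  if (opts.diststyle) parts.push(`DISTSTYLE ${opts.diststyle.toUpperCase()}`);
  if (opts.distkey) parts.push(`DISTKEY (${opts.distkey})`);
  if (opts.sortkeys && opts.sortkeys.length) {
    const style = opts.sortstyle ? `${opts.sortstyle.toUpperCase()} ` : "";
    parts.push(`${style}SORTKEY (${opts.sortkeys.join(", ")})`);
  }
  return parts.join("\n") + "\nAS ";
}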

Implement Redshift adapter and runner

Implement and test the RedshiftAdapter interface for connecting to Redshift.

Implement the tables() and schema() methods on the RedshiftRunner class.

Write integration tests for Redshift.

Add CLI command to generate data warehouse profiles

Currently data warehouse profiles must be created manually.
We should provide commands to generate warehouse profiles that collect user, password, port, etc., and write the profile JSON to disk.

BigQuery - requires a project ID and the path to a service-account key file.
Redshift - requires IP/host, port, user, password, and thread count.
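
For illustration, hypothetical profile shapes built from the fields above (all values are placeholders):

const bigqueryProfile = {
  projectId: "my-gcp-project",
  keyFile: "/path/to/service-account-key.json",
};

const redshiftProfile = {
  host: "redshift.example.com",
  port: 5439,
  user: "dataform",
  password: "collected-at-prompt",
  threads: 4,
};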

Automatically escape back-ticks in SQL queries

Backticks are used in BigQuery for fully qualified table names. Because the SQL contents get parsed directly as a JS template string, the user currently has to write \` everywhere, which is quite annoying.

Now that JS blocks are implemented, we can automatically escape unescaped backticks in the non-JS SQL file contents. Any function calls that use backticks outside of JS blocks will no longer work, however (but using JS blocks is the recommended approach anyway).
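
A minimal sketch of the escaping pass (the exact regex is an assumption):

// Match an optional backslash followed by a backtick; leave already-escaped
// backticks alone and escape the rest.
function escapeBackticks(sql: string): string {
  return sql.replace(/\\?`/g, match => (match.length === 2 ? match : "\\`"));
}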

Move validation into a standalone operation

Currently validation happens in a number of different places: during compilation, but also during API calls to the core framework.

As a result, it's hard or impossible for other parts of the library to validate a graph (for example, if we want to run validation before we build or run a graph).

We should untangle the current validation code into a single method that takes a CompiledGraph proto and returns a copy with validation errors added appropriately.
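
A sketch of that single entry point, with shapes assumed:

interface CompiledGraph { validationErrors?: string[]; }

// Collect errors rather than throwing, and return an annotated copy of the graph.
function validate(graph: CompiledGraph): CompiledGraph {
  const errors: string[] = [];
  // run dependency, incremental-where-clause, and other checks here,
  // appending a message for each failure
  return { ...graph, validationErrors: errors };
}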

Implement declare() API

It should be possible to declare external tables that your project depends on but that are not generated by the dataform project. For example, if I automatically push my web logs to the BigQuery table weblogs.logs, then in my dataform project I can create a file:

definitions/external_tables.js:

declare("weblogs.logs");

And reference them in other queries, e.g. definitions/example.sql:

select * from ${ref("weblogs.logs")}

This would allow us to show external tables as part of the DAG, and would be useful for debugging, as well as simplifying queries (in BigQuery).

  • Add a new proto type Declaration
  • Add declarations to CompiledGraph proto
  • Make it possible to ref() a declaration
  • Implement a Declaration class similar to Materialization that just stores a Target derived from the provided name (see the sketch below)
  • Make sure declarations complete immediately as a no-op during execution of the graph
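
A sketch of the Declaration class, with the proto shape assumed:

interface Target { schema: string; name: string; }

class Declaration {
  proto: { target: Target };
  constructor(name: string) {
    const [schema, table] = name.split(".");
    this.proto = { target: { schema, name: table } }; // target derived from the provided name
  }
}

At execution time, declarations would complete immediately as no-ops, as noted above.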
