Git Product home page Git Product logo

data-diff's Introduction

Datafold

data-diff: Compare datasets fast, within or across SQL databases

data-diff-logo


Use Cases

Data Migration & Replication Testing

Compare source to target and check for discrepancies when moving data between systems:

  • Migrating to a new data warehouse (e.g., Oracle > Snowflake)
  • Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
  • Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)

Data Development Testing

Test SQL code and preview changes by comparing development/staging environment data to production:

  1. Make a change to some SQL code
  2. Run the SQL code to create a new dataset
  3. Compare the dataset with its production version or another iteration

dbt

data-diff integrates with dbt Core to seamlessly compare local development to production datasets

data-development-testing

dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration

๐Ÿ‘€ Watch 4-min demo video

Get started with data-diff & dbt

Also available in a VS Code Extension

Reach out on the dbt Slack in #tools-datafold for advice and support

How it works

When comparing the data, data-diff utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:

joindiff

  • Recommended for comparing data within the same database
  • Uses the outer join operation to diff the rows as efficiently as possible within the same database
  • Fully relies on the underlying database engine for computation
  • Requires both datasets to be queryable with a single SQL query
  • Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset

hashdiff

  • Recommended for comparing datasets across different databases
  • Can also be helpful in diffing very large tables with few expected differences within the same database
  • Employs a divide-and-conquer algorithm based on hashing and binary search
  • Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
  • Time complexity approximates COUNT(*) operation when there are few differences
  • Performance degrades when datasets have a large number of differences

More information about the algorithm and performance considerations can be found here

Get started

Validating dbt model changes between dev and prod

โšก Looking to use data-diff in dbt development? Head over to our data-diff + dbt documentation to get started!

Compare data tables between databases

๐Ÿ”€ To compare data between databases, install data-diff with specific database adapters, e.g.:

pip install data-diff 'data-diff[postgresql,snowflake]' -U

Run data-diff with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:

data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  <table> \
  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  <TABLE> \
  -k <primary key column> \
  -c <columns to compare> \
  -w <filter condition>

Run data-diff with a toml configuration file. In the following example, we compare tables between MotherDuck(hosted DuckDB) and Snowflake using the hashdiff algorithm:

## DATABASE CONNECTION ##
[database.duckdb_connection] 
  driver = "duckdb"
  # filepath = "datafold_demo.duckdb" # local duckdb file example
  # filepath = "md:" # default motherduck connection example
  filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
  database = "datafold_demo"

[database.snowflake_connection]
  driver = "snowflake"
  database = "DEV"
  user = "sung"
  password = "${SNOWFLAKE_PASSWORD}" # or "<PASSWORD_STRING>"
  # the info below is only required for snowflake
  account = "${ACCOUNT}" # by33919
  schema = "DEVELOPMENT"
  warehouse = "DEMO"
  role = "DEMO_ROLE"

## RUN PARAMETERS ##
[run.default]
  verbose = true

## EXAMPLE DATA DIFF JOB ##
[run.demo_xdb_diff]
  # Source 1 ("left")
  1.database = "duckdb_connection"
  1.table = "development.raw_orders"

  # Source 2 ("right")
  2.database = "snowflake_connection"
  2.table = "RAW_ORDERS" # note that snowflake table names are case-sensitive

  verbose = false
# export relevant environment variables, example below
export motherduck_token=<MOTHERDUCK_TOKEN>

# run the configured data-diff job
data-diff --conf datadiff.toml \
  --run demo_xdb_diff \
  -k "id" \
  -c status

# output example
- 1, completed
+ 1, returned

Check out documentation for the full command reference.

Supported databases

Database Status Connection string
PostgreSQL >=10 ๐ŸŸข postgresql://<user>:<password>@<host>:5432/<database>
MySQL ๐ŸŸข mysql://<user>:<password>@<hostname>:5432/<database>
Snowflake ๐ŸŸข "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"
BigQuery ๐ŸŸข bigquery://<project>/<dataset>
Redshift ๐ŸŸข redshift://<username>:<password>@<hostname>:5439/<database>
DuckDB ๐ŸŸข duckdb://<dbname>@<filepath>
MotherDuck ๐ŸŸข duckdb://<dbname>@<filepath>
Oracle ๐ŸŸก oracle://<username>:<password>@<hostname>/servive_or_sid
Presto ๐ŸŸก presto://<username>:<password>@<hostname>:8080/<database>
Databricks ๐ŸŸก databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>
Trino ๐ŸŸก trino://<username>:<password>@<hostname>:8080/<database>
Clickhouse ๐ŸŸก clickhouse://<username>:<password>@<hostname>:9000/<database>
Vertica ๐ŸŸก vertica://<username>:<password>@<hostname>:5433/<database>
ElasticSearch ๐Ÿ“
Planetscale ๐Ÿ“
Pinot ๐Ÿ“
Druid ๐Ÿ“
Kafka ๐Ÿ“
SQLite ๐Ÿ“
  • ๐ŸŸข: Implemented and thoroughly tested.
  • ๐ŸŸก: Implemented, but not thoroughly tested yet.
  • โณ: Implementation in progress.
  • ๐Ÿ“: Implementation planned. Contributions welcome.

Your database not listed here?


Contributors

We thank everyone who contributed so far!


Analytics


License

This project is licensed under the terms of the MIT License.

data-diff's People

Contributors

attsun1031 avatar bjoernhaeuser avatar cfernhout avatar daniel-leicht avatar danieldiamond avatar danthelion avatar dave-connors-3 avatar dbeatty10 avatar dlawin avatar erezsh avatar franloza avatar glebmezh avatar ivan-toriya avatar jardayn avatar jaypeedevlin avatar kylemcnair avatar leoebfolsom avatar mattdelac avatar nicolasaldecoa avatar nolar avatar pik94 avatar roderickjdunn avatar sar009 avatar sebaap avatar sirupsen avatar stefankeidel avatar sungchun12 avatar vvkh avatar williebsweet avatar yecnj avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.