
dbt-audit-helper

Useful macros when performing data audits

Contents

Installation instructions

New to dbt packages? Read more about them here.

  1. Include this package in your packages.yml file — check here for the latest version number.
  2. Run dbt deps to install the package.
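
A minimal packages.yml looks like this (the version shown is only a placeholder; check hub.getdbt.com for the latest release):

```yaml
packages:
  - package: dbt-labs/audit_helper
    version: 0.12.0  # placeholder -- use the latest version from the dbt package hub
```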

Compare Data Outputs

compare_and_classify_query_results (source)

Generates a row-by-row comparison of two queries, as well as summary stats of added, removed, identical and modified records. This prevents you from having to query your comparison tables multiple times to get raw data and summary data.

Output

order_id  order_date  status     dbt_audit_in_a  dbt_audit_in_b  dbt_audit_row_status  dbt_audit_num_rows_in_status  dbt_audit_sample_number
1         2024-01-01  completed  True            True            identical             1                             1
2         2024-01-02  completed  True            False           modified              2                             1
2         2024-01-02  returned   False           True            modified              2                             1
3         2024-01-03  completed  True            False           modified              2                             2
3         2024-01-03  completed  False           True            modified              2                             2
4         2024-01-04  completed  False           True            added                 1                             1

Note that there are four rows with the modified status, but dbt_audit_num_rows_in_status says 2. This is because each primary key is counted only once: a modified record appears as two rows, its version from a and its version from b.
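
Conceptually, the per-status count behaves like a distinct count of primary keys within each row status. An illustrative expression (not the macro's actual SQL):

```sql
count(distinct order_id) over (partition by dbt_audit_row_status)
    as dbt_audit_num_rows_in_status
```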

Arguments

  • a_query and b_query: The queries you want to compare.
  • primary_key_columns (required): A list of primary key column(s) used to join the queries together for comparison.
  • columns (required): The columns present in the two queries you want to compare.
  • sample_limit: Number of sample records to return per status. Defaults to 20.

Usage

{% set old_query %}
  select
    id as order_id,
    amount,
    customer_id
  from old_database.old_schema.fct_orders
{% endset %}

{% set new_query %}
  select
    order_id,
    amount,
    customer_id
  from {{ ref('fct_orders') }}
{% endset %}

{{ 
  audit_helper.compare_and_classify_query_results(
    old_query, 
    new_query, 
    primary_key_columns=['order_id'], 
    columns=['order_id', 'amount', 'customer_id']
  )
}}

compare_and_classify_relation_rows (source)

A wrapper to compare_and_classify_query_results, except it takes two Relations (instead of two queries).

Each relation must have the same columns with the same names, but they do not have to be in the same order.

Arguments

  • a_relation and b_relation: The relations you want to compare.
  • primary_key_columns (required): A list of primary key column(s) used to join the queries together for comparison.
  • columns (optional): The columns present in the two queries you want to compare. Build long lists with a few exclusions with dbt_utils.get_filtered_columns_in_relation, or pass None and the macro will find all intersecting columns automatically.
  • sample_limit: Number of sample records to return per status. Defaults to 20.

Usage

{% set old_relation = adapter.get_relation(
      database = "old_database",
      schema = "old_schema",
      identifier = "fct_orders"
) -%}

{% set dbt_relation = ref('fct_orders') %}

{{ audit_helper.compare_and_classify_relation_rows(
    a_relation = old_relation,
    b_relation = dbt_relation,
    primary_key_columns = ["order_id"],
    columns = None
) }}

quick_are_queries_identical (source)

On supported adapters (currently Snowflake and BigQuery), this macro takes a hash of all rows in each of the two queries and compares the results.

This can be calculated relatively quickly compared to other macros in this package and can efficiently provide reassurance that a refactor introduced no changes.
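
As a rough sketch of the idea (not the macro's actual SQL; the table names are placeholders), on Snowflake a whole-table hash comparison can be written with HASH_AGG:

```sql
select
    (select hash_agg(*) from old_database.old_schema.dim_product)
    = (select hash_agg(*) from analytics.dim_product)
      as are_tables_identical
```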

Output

are_tables_identical
true

Arguments

  • a_query and b_query: The queries you want to compare.
  • columns (required): The columns present in the two queries you want to compare.

Usage

{% set old_query %}
    select * from old_database.old_schema.dim_product
{% endset %}

{% set new_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{{ audit_helper.quick_are_queries_identical(
    a_query = old_query,
    b_query = new_query,
    columns = ['order_id', 'amount', 'customer_id']
) }}

quick_are_relations_identical (source)

A wrapper to quick_are_queries_identical, except it takes two Relations (instead of two queries).

Each relation must have the same columns with the same names, but they do not have to be in the same order. Build long lists with a few exclusions with dbt_utils.get_filtered_columns_in_relation, or pass None and the macro will find all intersecting columns automatically.

Usage

{% set old_relation = adapter.get_relation(
      database = "old_database",
      schema = "old_schema",
      identifier = "fct_orders"
) -%}

{% set dbt_relation = ref('fct_orders') %}

{{ audit_helper.quick_are_relations_identical(
    a_relation = old_relation,
    b_relation = dbt_relation,
    columns = None
) }}

compare_row_counts (source)

This macro does a simple comparison of the row counts in two relations.

Output

Calling this macro on two different relations will return a very simple table comparing the row counts in each relation.

relation_name                                total_records
target_database.target_schema.my_a_relation  34,231
target_database.target_schema.my_b_relation  24,789

Arguments

  • a_relation and b_relation: The Relations you want to compare.

Usage

{% set old_relation = adapter.get_relation(
      database = "old_database",
      schema = "old_schema",
      identifier = "fct_orders"
) -%}

{% set dbt_relation = ref('fct_orders') %}

{{ audit_helper.compare_row_counts(
    a_relation = old_relation,
    b_relation = dbt_relation
) }}

Compare Columns

compare_which_query_columns_differ (source)

This macro generates SQL that can be used to detect which columns returned by two queries contain any value level changes.

It does not return the magnitude of the change, only whether or not a difference has occurred. Only records that exist in both queries (as determined by the primary key) are considered.

Output

The generated query returns whether or not each column has any differences:

column_name  has_difference
order_id     False
customer_id  False
order_date   True
status       False
amount       True

Arguments

  • a_query and b_query: The queries to compare
  • primary_key_columns (required): A list of primary key column(s) used to join the queries together for comparison.
  • columns (required): The columns present in the two queries you want to compare.
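
Usage

A usage sketch following the same pattern as the other query-based macros in this package (the queries and column names are illustrative):

```sql
{% set old_query %}
  select
    id as order_id,
    amount,
    customer_id
  from old_database.old_schema.fct_orders
{% endset %}

{% set new_query %}
  select
    order_id,
    amount,
    customer_id
  from {{ ref('fct_orders') }}
{% endset %}

{{ audit_helper.compare_which_query_columns_differ(
    a_query = old_query,
    b_query = new_query,
    primary_key_columns = ['order_id'],
    columns = ['order_id', 'amount', 'customer_id']
) }}
```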

compare_which_relation_columns_differ (source)

A wrapper to compare_which_query_columns_differ, except it takes two Relations (instead of two queries).

Each relation must have the same columns with the same names, but they do not have to be in the same order. Build long lists with a few exclusions with dbt_utils.get_filtered_columns_in_relation, or pass None and the macro will find all intersecting columns automatically.

Usage

{% set old_relation = adapter.get_relation(
      database = "old_database",
      schema = "old_schema",
      identifier = "fct_orders"
) -%}

{% set dbt_relation = ref('fct_orders') %}

{{ audit_helper.compare_which_relation_columns_differ(
    a_relation = old_relation,
    b_relation = dbt_relation,
    primary_key_columns = ["order_id"],
    columns = None
) }}

Alternatively, build the column list with dbt_utils.get_filtered_columns_in_relation to exclude a few columns:

{% set old_relation = adapter.get_relation(
      database = "old_database",
      schema = "old_schema",
      identifier = "fct_orders"
) -%}

{% set dbt_relation = ref('fct_orders') %}

{% set columns = dbt_utils.get_filtered_columns_in_relation(old_relation, exclude=["loaded_at"]) %}

{{ audit_helper.compare_which_relation_columns_differ(
    a_relation = old_relation,
    b_relation = dbt_relation,
    primary_key_columns = ["order_id"],
    columns = columns
) }}

compare_column_values (source)

This macro generates SQL that can be used to compare a column's values across two queries. This macro is useful when you've used the compare_which_query_columns_differ macro to identify a column with differing values and want to understand how many discrepancies are caused by that column.

Output

The generated query returns a summary of the count of rows where the column's values:

  • match perfectly
  • differ
  • are null in a or b or both
  • are missing from a or b
match_status                  count   percent_of_total
✅: perfect match             37,721  79.03
✅: both are null             5,789   12.13
🤷: missing from a            5       0.01
🤷: missing from b            20      0.04
🤷: value is null in a only   59      0.12
🤷: value is null in b only   73      0.15
❌: values do not match       4,064   8.51

Arguments

  • a_query and b_query: The queries you want to compare.
  • primary_key: The primary key of the model. Used to sort unmatched results for row-by-row validation. Must be a unique key (unique and never null) in both tables, otherwise the join won't work as expected.
  • column_to_compare: The column you want to compare.
  • emojis (optional): Boolean argument that defaults to true and displays ✅, 🤷 and ❌ for easier visual scanning. If you don't want to include emojis in the output, set it to false.
  • a_relation_name and b_relation_name (optional): Names of the queries you want displayed in the output. Default is a and b.

Usage

{% set old_query %}
    select * from old_database.old_schema.dim_product
    where is_latest
{% endset %}

{% set new_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{{ audit_helper.compare_column_values(
    a_query = old_query,
    b_query = new_query,
    primary_key = "product_id",
    column_to_compare = "status"
) }}

compare_all_columns (source)

Similar to compare_column_values, except it can be used to compare all columns' values across two relations. This macro is useful when you've used the compare_queries macro and found that a significant number of your records don't match and want to understand how many discrepancies are caused by each column.

Output

By default, the generated query returns a summary of the count of rows where each column's values:

  • match perfectly
  • differ
  • are null in a or b or both
  • are missing from a or b
column_name   perfect_match  null_in_a  null_in_b  missing_from_a  missing_from_b  conflicting_values
order_id      10             0          0          0               0               0
order_date    2              0          0          0               0               8
order_status  6              4          4          0               0               0

Setting the summarize argument to false lets you check the match status of a specific column value of a specific row:

primary_key  column_name   perfect_match  null_in_a  null_in_b  missing_from_a  missing_from_b  conflicting_values
1            order_id      true           false      false      false           false           false
1            order_date    false          false      false      false           false           true
1            order_status  false          true       true       false           false           false
...          ...           ...            ...        ...        ...             ...             ...

Arguments

  • a_relation and b_relation: The relations you want to compare. Any two relations that have the same columns can be used.
  • primary_key: The primary key of the model (or concatenated sql to create the primary key). Used to sort unmatched results for row-by-row validation. Must be a unique key (unique and never null) in both tables, otherwise the join won't work as expected.
  • exclude_columns (optional): Any columns you wish to exclude from the validation.
  • summarize (optional): Allows you to switch between a summary or detailed view of the compared data. Defaults to true.

Usage

{% set old_relation = adapter.get_relation(
      database = "old_database",
      schema = "old_schema",
      identifier = "fct_orders"
) -%}

{% set dbt_relation = ref('fct_orders') %}

{{ audit_helper.compare_all_columns(
    a_relation = old_relation,
    b_relation = dbt_relation,
    primary_key = "order_id"
) }}

compare_relation_columns (source)

This macro generates SQL that can be used to compare the schema (ordinal position and data types of columns) of two relations. This is especially useful when:

  • Comparing a new version of a relation with an old one, to make sure that the structure is the same
  • Helping figure out why a union of two relations won't work (often because the data types are different)

Output

column_name  a_ordinal_position  b_ordinal_position  a_data_type        b_data_type        has_ordinal_position_match  has_data_type_match  in_a_only  in_b_only  in_both
order_id     1                   1                   integer            integer            True                        True                 False      False      True
customer_id  2                   2                   integer            integer            True                        True                 False      False      True
order_date   3                   3                   timestamp          date               True                        False                False      False      True
status       4                   5                   character varying  character varying  False                       True                 False      False      True
amount       5                   4                   bigint             bigint             False                       True                 False      False      True

Note: For adapters other than BigQuery, Postgres, Redshift, and Snowflake, the ordinal position is inferred based on the response from dbt Core's adapter.get_columns_in_relation(), as opposed to being loaded from the information schema.

Arguments

  • a_relation and b_relation: The relations you want to compare.

Usage

{% set old_relation = adapter.get_relation(
      database = "old_database",
      schema = "old_schema",
      identifier = "fct_orders"
) -%}

{% set dbt_relation = ref('fct_orders') %}

{{ audit_helper.compare_relation_columns(
    a_relation=old_relation,
    b_relation=dbt_relation
) }}

Advanced Usage

Print Output To Logs

You may want to print the output of the query generated by an audit helper macro to your logs (instead of previewing the results).

To do so, store the results of the query with run_query and print them to the logs.

For example, using the compare_column_values macro:

{% set old_query %}
    select * from old_database.old_schema.dim_product
    where is_latest
{% endset %}

{% set new_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{% set audit_query = audit_helper.compare_column_values(
    a_query = old_query,
    b_query = new_query,
    primary_key = "product_id",
    column_to_compare = "status"
) %}

{% set audit_results = run_query(audit_query) %}

{% if execute %}
{% do audit_results.print_table() %}
{% endif %}

The .print_table() function is not compatible with dbt Cloud, so an adjustment needs to be made in order to print the results. Add the following code to a new macro file:

{% macro print_audit_output() %}
{%- set columns_to_compare=adapter.get_columns_in_relation(ref('fct_orders'))  -%}

{% set old_etl_relation_query %}
    select * from public.dim_product
{% endset %}

{% set new_etl_relation_query %}
    select * from {{ ref('fct_orders') }}
{% endset %}

{% if execute %}
    {% for column in columns_to_compare %}
        {{ log('Comparing column "' ~ column.name ~'"', info=True) }}
        {% set audit_query = audit_helper.compare_column_values(
                a_query=old_etl_relation_query,
                b_query=new_etl_relation_query,
                primary_key="order_id",
                column_to_compare=column.name
        ) %}

        {% set audit_results = run_query(audit_query) %}

        {% do log(audit_results.column_names, info=True) %}
            {% for row in audit_results.rows %}
                  {% do log(row.values(), info=True) %}
            {% endfor %}
    {% endfor %}
{% endif %}

{% endmacro %}

To run the macro, execute dbt run-operation print_audit_output() in the command bar.

Use Output For Custom Singular Test

If desired, you can use the audit helper macros to add a dbt test to your project to protect against unwanted changes to your data outputs.

For example, using the compare_all_columns macro, you could set up a test that will fail if any column values do not match.

Users can configure what exactly constitutes a value match or failure. If there is a test failure, results can be inspected in the warehouse. The primary key and the column name can be included in the test output that gets written to the warehouse. This enables the user to join test results to relevant tables in your dev or prod schema to investigate the error.

Note: this test should only be used on (and will only work on) models that have a primary key that is reliably unique and not_null. Generic dbt tests should be used to ensure the model being tested meets the requirements of unique and not_null.
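
For example, the prerequisite generic tests can be declared in a schema .yml file (the model and column names match the stg_customers example used in this section):

```yaml
models:
  - name: stg_customers
    columns:
      - name: id
        tests:
          - unique
          - not_null
```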

To create a test for the stg_customers model, create a custom test in the tests subdirectory of your dbt project that looks like this:

{{ 
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'), -- in a test, this ref will compile as your dev or PR schema.
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), -- you can explicitly write a relation to select your production schema, or any other db/schema/table you'd like to use for comparison testing.
    exclude_columns=['updated_at'], 
    primary_key='id'
  ) 
}}
where not perfect_match

The where not perfect_match statement is an example of a filter you can apply to define what constitutes a test failure. The test will fail if any rows don't meet the requirement of a perfect match. Failures would include:

  • If the primary key exists in both relations, but one model has a null value in a column.
  • If a primary key is missing from one relation.
  • If the primary key exists in both relations, but the value conflicts.

If you'd like the test to only fail when there are conflicting values, you could configure it like this:

{{ 
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'), 
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'),
    primary_key='id'
  ) 
}}
where conflicting_values

If you want to create test results that include columns from the model itself for easier inspection, that can be written into the test:

{{ 
  audit_helper.compare_all_columns(
    a_relation=ref('stg_customers'),
    b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), 
    exclude_columns=['updated_at'], 
    primary_key='id'
  ) 
}}
left join {{ ref('stg_customers') }} using(id)

This structure also allows for the test to group or filter by any attribute in the model or in the macro's output as part of the test, for example:

with base_test_cte as (
  {{ 
    audit_helper.compare_all_columns(
      a_relation=ref('stg_customers'),
      b_relation=api.Relation.create(database='dbt_db', schema='analytics_prod', identifier='stg_customers'), 
      exclude_columns=['updated_at'], 
      primary_key='id'
    ) 
  }}
  left join {{ ref('stg_customers') }} using(id)
  where conflicting_values
)
select
  status, -- assume there's a "status" column in stg_customers
  count(distinct case when conflicting_values then id end) as conflicting_values
from base_test_cte
group by 1

You can write a compare_all_columns test on an individual table, and the test will run as part of a full test suite run: dbt test --select stg_customers.

If you want to store results in the warehouse for further analysis, add the --store-failures flag.

Legacy Macros

compare_queries (source)

Tip

Consider compare_and_classify_query_results instead

This macro generates SQL that can be used to do a row-by-row comparison of two queries. It is particularly useful when you want to check that a refactored model (or a model you are moving over from a legacy system) is identical to the original. compare_queries provides flexibility when:

  • You need to filter out records from one of the relations.
  • You need to rename or recast some columns to get them to match up.
  • You only want to compare a small number of columns, so it's easier to write the columns you want to compare, rather than the columns you want to exclude.

Output

By default, the generated query returns a summary of the count of rows that are unique to a, unique to b, and identical:

in_a   in_b   count  percent_of_total
True   True   6870   99.74
True   False  9      0.13
False  True   9      0.13

Setting the summarize argument to false lets you check which rows do not match between relations:

order_id  order_date  status     in_a   in_b
1         2018-01-01  completed  True   False
1         2018-01-01  returned   False  True
2         2018-01-02  completed  True   False
2         2018-01-02  returned   False  True

Arguments

  • a_query and b_query: The queries you want to compare.
  • primary_key (optional): The primary key of the model (or concatenated sql to create the primary key). Used to sort unmatched results for row-by-row validation.
  • summarize (optional): Allows you to switch between a summary or detailed view of the compared data. Accepts true or false values. Defaults to true.
  • limit (optional): Allows you to limit the number of rows returned when summarize = False. Defaults to None (no limit).

Usage

{% set old_query %}
  select
    id as order_id,
    amount,
    customer_id
  from old_database.old_schema.fct_orders
{% endset %}

{% set new_query %}
  select
    order_id,
    amount,
    customer_id
  from {{ ref('fct_orders') }}
{% endset %}

{{ audit_helper.compare_queries(
    a_query = old_query,
    b_query = new_query,
    primary_key = "order_id"
) }}

compare_relations (source)

Tip

Consider compare_and_classify_relation_rows instead

A wrapper to compare_queries, except it takes two Relations (instead of two queries).

Each relation must have the same columns with the same names, but they do not have to be in the same order. Use exclude_columns if some columns only exist in one relation.

Arguments

  • a_relation and b_relation: The relations you want to compare.
  • primary_key (optional): The primary key of the model (or concatenated sql to create the primary key). Used to sort unmatched results for row-by-row validation.
  • exclude_columns (optional): Any columns you wish to exclude from the validation.
  • summarize (optional): Allows you to switch between a summary or detailed view of the compared data. Accepts true or false values. Defaults to true.
  • limit (optional): Allows you to limit the number of rows returned when summarize = False. Defaults to None (no limit).

Usage

{% set old_relation = adapter.get_relation(
      database = "old_database",
      schema = "old_schema",
      identifier = "fct_orders"
) -%}

{% set dbt_relation = ref('fct_orders') %}

{{ audit_helper.compare_relations(
    a_relation = old_relation,
    b_relation = dbt_relation,
    exclude_columns = ["loaded_at"],
    primary_key = "order_id"
) }}

Internal Macros

Macros prefixed with an _ (such as those in the utils/ subdirectory) are for private use. They are not documented or contracted and can change without notice.


dbt-audit-helper's Issues

add context to readme to clearly explain differences between macros

Describe the bug

Currently, the macros have names such that it's difficult to differentiate between them / understand the purpose of each.

compare_column_values - compares a single column's values
compare_all_columns - does the same thing as compare_column_values but for all columns and only gives the count of records rather than percentages
compare_relation_columns - compares the metadata of all columns
I think compare_column_values_verbose is the non-summarized view of compare_column_values (essentially the same as setting summarize=false for the other macros). Why is this its own macro then?

We should add additional examples to the README showing how to use compare_all_columns - both as a test and as a one off query - as well as example outputs.

Additional context

Are you interested in contributing the fix?

create compare_all_columns macro that can be used in dbt test suite

Describe the feature

I would like to build a macro that can be used repeatedly to create dbt tests that test whether all column values match between two versions of dbt code.

This code should abstract away code that would be used for all such tests.

The idea is that I can run a single dbt test on a single table and say "ok, all those columns match up"; and like any dbt test, a single table is created in the warehouse for inspecting the results. (Writing the results to the log/console is great, but doesn't allow for programmatic/automatic checks.) Importantly, this lets dbt_audit_helper become part of a test suite in our production jobs and PR CI/CD.

Going forward, I would like to allow the test to receive arguments excluding columns that are expected to change, but this will provide useful functionality in the general case, and lead to expected test failures when a PR causes a data change.

Describe alternatives you've considered

  • Writing more verbose testing code
  • Running macros manually to test all columns and inspect results visually

Additional context

I will be building for Snowflake as this is the context in which we work--not sure if there will be db-specific needs.

Who will this benefit?

Anyone who wants to very rigorously test their changes as part of a dbt test suite.

Are you interested in contributing this feature?

Yup!

dbt is not working

Describe the feature

When I run the "dbt run" it gives me an error

Describe alternatives you've considered

I have upgraded all the versions for py, dbt, and sqlserver

below is the error I am getting.

Found 2 models, 4 tests, 0 snapshots, 0 analyses, 161 macros, 0 operations, 0 seed files, 0 sources, 0 exposures

ERROR: Database Error
('08001', '[08001] [Microsoft][ODBC SQL Server Driver][DBNETLIB]SQL Server does not exist or access denied. (17) (SQLDriverConnect); [08001] [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionOpen (Connect()). (53); [08001] [Microsoft][ODBC SQL Server Driver]Invalid connection string attribute (0)')

Can anyone please help me with this.

BigQuery compare_queries should materialize temporary tables for large queries

Describe the feature


Right now, when you use compare_queries in BQ you may hit BQ limits for view definition. The work around is to materialize those queries as tables, then use the compare_relations macro.

Instead, compare_queries should materialize the queries as temporary tables in BQ, perform the comparison, then drop the tables.

Describe alternatives you've considered


Additional context


This is specific to BigQuery.

Who will this benefit?


Are you interested in contributing this feature?

Compare_Queries macro not working on Exasol

Describe the feature

On Exasol the compare_queries macro has two errors, preventing its usage:

  1. Exasol does not support a statement like
    select *, some_column from tablename
    In this case * has to be prefixed with an alias, even if only one table is used in the query, like this:
    select tn.*, some_column from tablename tn
  2. final is a reserved word in Exasol; therefore, using it as the name of a CTE does not work.

Describe alternatives you've considered

Do not use the macro and write comparisons manually.

Additional context

See feature description, for us this is Exasol only, however there may be other databases where one or both of the described problems exist.

Who will this benefit?

Everyone using audit helper and dbt on Exasol.

Are you interested in contributing this feature?

I am willing to contribute, I already fixed the macro locally so it is working for me.
However, to have a permanent fix and also for version upgrades it would be great to include the change in the code.
But I am not familiar with github collaboration across companies, so I would need some hints, if I can just create a branch and open a PR here.

Here is the fixed code for the macro, if that helps:

{% macro compare_queries(a_query, b_query, primary_key=None, summarize=true) -%}
  {{ return(adapter.dispatch('compare_queries', 'audit_helper')(a_query, b_query, primary_key, summarize)) }}
{%- endmacro %}

{% macro default__compare_queries(a_query, b_query, primary_key=None, summarize=true) %}

with a as (

    {{ a_query }}

),

b as (

    {{ b_query }}

),

a_intersect_b as (

    select * from a
    {{ dbt.intersect() }}
    select * from b

),

a_except_b as (

    select * from a
    {{ dbt.except() }}
    select * from b

),

b_except_a as (

    select * from b
    {{ dbt.except() }}
    select * from a

),

all_records as (

    select
        a_intersect_b.*,
        true as in_a,
        true as in_b
    from a_intersect_b

    union all

    select
        a_except_b.*,
        true as in_a,
        false as in_b
    from a_except_b

    union all

    select
        b_except_a.*,
        false as in_a,
        true as in_b
    from b_except_a

),

{%- if summarize %}

summary_stats as (

    select

        in_a,
        in_b,
        count(*) as count

    from all_records
    group by 1, 2

),

finalresult as (

    select

        summary_stats.*,
        round(100.0 * count / sum(count) over (), 2) as percent_of_total

    from summary_stats
    order by in_a desc, in_b desc

)

{%- else %}

finalresult as (
    
    select * from all_records
    where not (in_a and in_b)
    order by {{ primary_key ~ ", " if primary_key is not none }} in_a desc, in_b desc

)

{%- endif %}

select * from finalresult

{% endmacro %}

Package tidy up

Repo checklist:

Urgent:

  • A require-dbt-version parameter in the dbt_project.yml:
require-dbt-version: [">=0.17.0", "<0.18.0"]
  • The dbt_project.yml is in version 2 format (upgrading docs)
  • Any sample dbt_project.yml snippets in the README.md are in v2 syntax (especially for packages that use vars)
  • Release is of format major.minor.patch, e.g. 1.0.0
  • A license (example)
  • If this package relies on dbt-utils, make sure the range allows v0.4.x
packages:
  - package: fishtown-analytics/dbt_utils
    version: ["???", "<0.5.0"]

Integration tests

If this package has integration tests:

  • The dbt_project.yml of the integration tests is in v2 format
  • CircleCI uses python v3.6.3
  • The dbt_project.yml does not have a require-dbt-version config (it should be inherited from the package)

Quick wins:

  • Issue and PR templates (examples — make sure you update for the right package name)
  • A codeowner (example)
  • If required, a level of support header in the README

    ⚠️ This package is currently not actively maintained.

More in-depth improvements

  • Installation instructions (see segment), including:
    • Link to packages docs
    • Variables that need to be included in the dbt_project.yml file
    • Listed database support
  • A link to our package contribution guide (Might formalize this as a CONTRIBUTING.md file)
  • Models are documented
  • Macros are documented
  • Integration tests
  • Move model logic intro model files, rather than entirely in adapter macros

Requires admin rights / Claire

  • Repo is named dbt-<package_name>
  • The hub URL listed on the repo


  • A Zap to the #package-releases channel in dbt Slack

Wrong example query in documentation for `compare_column_values`

Describe the bug

The example query showing how to use compare_column_values is wrong and will give an error if copied as-is.
It is missing a

{{ audit_query }}

https://github.com/dbt-labs/dbt-audit-helper#compare_column_values-source

Steps to reproduce

Copy paste the example query from https://github.com/dbt-labs/dbt-audit-helper#compare_column_values-source

Expected results

Test will run without errors.

Actual results

Copy-pasting and tweaking the example query gives the output of

Database Error in test <test name> (<path to test>.sql)
Syntax error: Unexpected ")" at [21:5]

Screenshots and log output

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

<output goes here>

The operating system you're using:

The output of python --version:

Additional context

Are you interested in contributing the fix?

This is fairly straightforward. The entire fix can be done by replacing the code snippet with:

{# in dbt Develop #}

{% set old_etl_relation_query %}
    select * from public.dim_product
    where is_latest
{% endset %}

{% set new_etl_relation_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{% set audit_query = audit_helper.compare_column_values(
    a_query=old_etl_relation_query,
    b_query=new_etl_relation_query,
    primary_key="product_id",
    column_to_compare="status"
) %}

{{ audit_query }} 

{% set audit_results = run_query(audit_query) %}

{% if execute %}
{% do audit_results.print_table() %}
{% endif %}

Add tox-based Integration Testing support for this package

Description

dbt Labs is running integration tests for popular open source packages as part of continuous testing and release of dbt.

Add support for testing this package using tox, as described in Standardizing integration tests for popular dbt packages.

Acceptance Criteria

  1. Tests reside in the /integration_tests directory.
  2. Supported adapters are defined in supported_adapters.env at the project root. The variable SUPPORTED_ADAPTERS lists the supported adapters, separated by commas. (Note: Only adapters running in dbt Cloud are currently supported.)
  3. tox is used for running tests.
  4. Each adapter test is defined in the tox.ini file using the naming pattern testenv:dbt_integration_<ADAPTER>.
  5. The profiles.yml for integration tests uses specific naming conventions for the target and environment variables. Examples can be found in dbt-package-testing.

Who will this benefit?

This change will enable:

  • dbt Labs to test this package, ensuring that dbt Cloud customers can confidently receive continuous dbt updates ("Versionless")
  • dbt Labs to proactively identify any incompatibilities or functional regressions before they are released in final versions of dbt Core OSS.
  • Package maintainers to consistently test changes in the same manner as dbt Labs will test new releases.

Final Note

dbt Labs will commence running tests only after the new tests have been released and are available in the latest package version on the hub.

README.md is ineffective for dbt cloud

Describe the bug

The section compare_column_values -> Advanced usage - dbt Cloud contains a code snippet for replacing the .print_table() function, but the code snippet does not work in dbt Cloud.

Steps to reproduce

Use the code snippet

{# in dbt Develop #}

{% set old_etl_relation_query %}
    select * from public.dim_product
    where is_latest
{% endset %}

{% set new_etl_relation_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{% set audit_query = audit_helper.compare_column_values(
    a_query=old_etl_relation_query,
    b_query=new_etl_relation_query,
    primary_key="product_id",
    column_to_compare="status"
) %}

{% set audit_results = run_query(audit_query) %}

{% do log(audit_results.column_names, info=True) %}
{% for row in audit_results.rows %}
    {% do log(row.values(), info=True) %}
{% endfor %}

As suggested by the README, the last four lines include the proposed code change to make this compatible with dbt Cloud.

The actual expected results can be obtained by instead directly invoking the macro:

{# in dbt Develop #}

{% set old_etl_relation_query %}
    select * from public.dim_product
    where is_latest
{% endset %}

{% set new_etl_relation_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{{ audit_helper.compare_column_values(
    a_query=old_etl_relation_query,
    b_query=new_etl_relation_query,
    primary_key="product_id",
    column_to_compare="status"
) }}

version dependency with dbt_utils conflict

Describe the bug

Installing or upgrading this package together with dbt-utils produces a version conflict.

Steps to reproduce

packages:
  - package: dbt-labs/audit_helper
    version: [">=0.5.0", "<0.6.0"]

  - package: dbt-labs/dbt_utils
    version: [">=0.9.0", "<0.10.0"]

Expected results

dbt-audit-helper working with latest dbt-utils

Actual results

dbt-utils upgrade is failing

Screenshots and log output

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.2.1 - Up to date!

Plugins:
  - bigquery: 1.2.0 - Up to date!

The operating system you're using:
using Fedora Linux 36

The output of python --version:
Python 3.10.6

Are you interested in contributing the fix?

Yes, can create a PR with a version bump

Version error for package dbt-lab--dbt_utils

Describe the bug

I get a version error for the package dbt-labs/dbt_utils while attempting to install dbt-audit-helper.

Steps to reproduce

  1. Add dbt-labs/audit_helper to packages.yml
  2. Run dbt deps

Expected results

The audit_helper package is installed.

Actual results

Get the following error:

23:41:31  Running with dbt=1.3.1
23:41:33  Encountered an error:
Version error for package dbt-labs/dbt_utils: Could not find a satisfactory version from options: ['>=0.8.0', '<1.0.0', '>=0.8.0', '<2.0.0', '>=0.9.0', '<2.0.0', '>=0.8.0', '<0.9.0']


Screenshots and log output

System information

The contents of your packages.yml file:

packages:
  - package: fivetran/stripe
    version: [">=0.7.3"]
  - package: fivetran/hubspot
    version: [">=0.5.0", "<0.8.0"]
  - package: elementary-data/elementary
    version: [">=0.5.0", "<0.8.0"]
  - package: dbt-labs/dbt_utils
    version: [">=0.8.0", "<1.0.0"]

  # - package: dbt-labs/audit_helper
  #   version: [">=0.6.0", "<0.9.0"]

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

dbt=1.3.1

The operating system you're using:
macOS Monterey

The output of python --version:
Python 3.10.3

Additional context

Are you interested in contributing the fix?

add column name to `compare_column_values` output

Describe the feature

I would like the compare_column_values macro to include a column called column_name. This would make it easier (possible?) to incorporate this package into a dbt test suite that tests all the columns in a model (or multiple models). I brought this up in Slack (here's a link in case there's any discussion).

Describe alternatives you've considered

  • Paid tools exist.
  • Not sure there is an alternative other than testing every column individually.

Additional context

I am building this for our Snowflake warehouse, but I think it would work across warehouses.

Who will this benefit?

This is for AEs and other data stakeholders who want to run a rigorous test suite that detects data changes when a PR is submitted (or, more generally, data changes between different versions of code).

Are you interested in contributing this feature?

Yes--I'm currently working on a branch on a fork and will open a PR shortly (happy to take direction/guidance and adjust accordingly).

compare_column_values() does not work on dbt Cloud

Describe the bug

The examples for compare_column_values (source) use print_table() but according to Fishtown Analytics support, print_table() does not work in dbt Cloud. There should at minimum be a disclaimer in the README, especially for the Advanced Usage, as the SQL is generated but not the table, forcing the user to copy paste and run each compiled SQL manually.

Steps to reproduce

Follow the instructions for usage or advanced usage of audit_helper.compare_column_values() in dbt Cloud and execute with dbt run-operation

In my case, this:

{% macro audit_my_table() %}
{%- set columns_to_compare=adapter.get_columns_in_relation(ref('my_new_table'))  -%}

{% set old_etl_relation_query %}
    select * from my_old_table
{% endset %}

{% set new_etl_relation_query %}
    select * from {{ ref('my_new_table') }}
{% endset %}

{% if execute %}
    {% for column in columns_to_compare %}
        {{ log('Comparing column "' ~ column.name ~'"', info=True) }}
        {% set audit_query = audit_helper.compare_column_values(
                a_query=old_etl_relation_query,
                b_query=new_etl_relation_query,
                primary_key="my_pk",
                column_to_compare=column.name
        ) %}

        {% set audit_results = run_query(audit_query) %}
        {% do audit_results.print_table() %}
        {{ log(audit_query, info=True) }}
    {% endfor %}
{% endif %}

{% endmacro %}

When I dbt run-operation this macro, nothing happens at all in the logs, and there's some circumstantial evidence that it disrupts my dbt Cloud Develop pod, causing basic queries of my_new_table (and possibly others) to fail. (The latter turned out to be due to non-breaking spaces.)

If I remove the print_table() line, the macro compiles all the queries and prints each one in the logs successfully, and if I copy-paste any of those queries from the logs into a statement tab and run it, it works fine and I see the expected table in the results tab.

Expected results

A markdown formatted table based on a successfully compiled SQL query in the logs similar to the ones in the README.

Actual results

The macro will run "successfully" and compares every column but does not print any of the markdown tables in the logs.

System information

packages:

  • package: fishtown-analytics/dbt_utils
    version: 0.6.4
  • package: fishtown-analytics/codegen
    version: 0.3.1
  • package: fishtown-analytics/audit_helper
    version: 0.3.0
  • package: gitlabhq/snowflake_spend
    version: 1.2.0

dbt_date includes dbt_utils

  • package: calogica/dbt_date
    version: [">=0.3.0"]

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

0.19.1

The operating system you're using:
N/A

The output of python --version:
N/A

Additional context

There is a support ticket

Are you interested in contributing the fix?

Yes.

Using with dbt_utils 0.3.0

With my deps as follow

packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.3.0
  - package: fishtown-analytics/audit_helper
    version: 0.0.5

I'm getting this error:

 dkc exec dbt dbt deps
Running with dbt=0.16.0
Encountered an error:
Version error for package fishtown-analytics/dbt_utils: Could not find a satisfactory version from options: ['=0.3.0', '>=0.1.24', '<0.3.0']

Any plans to support dbt_utils >= 0.3.0 ?

Thanks!

dbt_utils version dependency?

I'm curious to try out this package, but it has a package version specification for dbt_utils locked at <0.2, and of course we are on 0.2.1. Is this because of a real incompatibility (or inability to test) with the latest version? I'm hoping it's just an oversight in the spec in the packages.yml file ;) Thanks! Anthony

README provides guidance that doesn't work

Describe the bug

According to our README.md:

https://github.com/dbt-labs/dbt-audit-helper#compare_relations-source

This query is particularly useful when you want to check that a refactored model, or a model that you are moving over from a legacy system, match up.
Usage: The query is best used in dbt Develop so you can interactively check results

{# in dbt Develop #}

{% set old_etl_relation=adapter.get_relation(
      database=target.database,
      schema="old_etl_schema",
      identifier="fct_orders"
) -%}

{% set dbt_relation=ref('fct_orders') %}

{{ audit_helper.compare_relations(
    a_relation=old_etl_relation,
    b_relation=dbt_relation,
    exclude_columns=["loaded_at"],
    primary_key="order_id"
) }}

This actually does not work as of this writing.

Steps to reproduce

Try the README.md recommended steps to use compare_relations in dbt Cloud IDE:

(screenshot of the compare_relations example failing in the dbt Cloud IDE)

The error is:

Compilation Error in model compare_rels (models/compare_rels.sql)
  Macro get_filtered_columns_in_relation expected a Relation but received the value: None
  
  > in macro _is_relation (macros/jinja_helpers/_is_relation.sql)
  > called by macro default__get_filtered_columns_in_relation (macros/sql/get_filtered_columns_in_relation.sql)
  > called by macro get_filtered_columns_in_relation (macros/sql/get_filtered_columns_in_relation.sql)
  > called by macro compare_relations (macros/compare_relations.sql)
  > called by model compare_rels (models/compare_rels.sql)

Expected results

No compilation error - and we should be able to preview (and see the results of compare_relation) as the docs suggest.

Actual results

Error as above.

Screenshots and log output

System information

packages:
  - package: dbt-labs/audit_helper
    version: 0.9.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

dbt v1.5.0-latest

The operating system you're using:
dbt Cloud

The output of python --version:

Additional context

We can actually compare relations in the IDE; we just need to use compare_queries instead, I think:

(screenshot of compare_queries working in the dbt Cloud IDE)

Are you interested in contributing the fix?

I think this is just a docs fix?

compare_queries does not work on Dremio

Describe the bug

The compare_queries macro uses "count" as a field name in the summary_stats CTE without escaping it, which is not allowed in Dremio SQL.

Steps to reproduce

In Dremio environment, create the following model:

-- File regression_test.sql
{% set old_query %}
select 1 as a
union all
select 2 as a
{% endset %}


{% set new_query %}
select 1 as a
union all
select 2 as a
{% endset %}


{{
    audit_helper.compare_queries(
        a_query=old_query,
        b_query=new_query,
        primary_key="a"
    )
}}

Run audit model: dbt run -m regression_test

Expected results

Model runs.

Actual results

Model run fails, see log below.

Additional context

I can go into the compiled model file in the "target" directory and manually quote "count" in the summary_stats CTE and at the end where it is used. If I then copy and paste the SQL into the Dremio console, it runs and works as intended.
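As a sketch of that manual fix (the CTE and column names follow the macro's structure as described above; exact internals may differ), the alias just needs quoting:

```sql
-- sketch: "count" is reserved in Dremio SQL, so the alias must be quoted
summary_stats as (

    select
        in_a,
        in_b,
        count(*) as "count"

    from all_records
    group by 1, 2

),
```

The quoted name would likewise need quoting wherever it is referenced later in the query (e.g. in the percent_of_total calculation).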

Screenshots and log output

17:35:50  1 of 1 START sql view model regression_test .................................... [RUN]
17:35:50  1 of 1 ERROR creating sql view model regression_test ........................... [ERROR in 0.39s]
17:35:50  
17:35:50  Finished running 1 view model in 0 hours 0 minutes and 2.60 seconds (2.60s).
17:35:50  
17:35:50  Completed with 1 error and 0 warnings:
17:35:50  
17:35:50  Runtime Error in model regression_test (models/regression/regression_test.sql)
17:35:50    ERROR: Encountered "as count" at line 102, column 18.
17:35:50    Was expecting one of:
17:35:50        ")" ...
17:35:50        "ORDER" ...
17:35:50        "LIMIT" ...
17:35:50        "OFFSET" ...
17:35:50        "FETCH" ...
17:35:50        "FROM" ...
17:35:50        "," ...
17:35:50        "AS" <IDENTIFIER> ...
<skipped long list of options, clearly Dremio stumbles on "as count">
17:35:50  
17:35:50  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

System information

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: dbt-labs/audit_helper
    version: 0.9.0

Which database are you using dbt with?
Dremio.

The output of dbt --version:

Core:
  - installed: 1.5.8
  - latest:    1.6.6 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - dremio: 1.5.0 - Up to date!

The operating system you're using:
Debian inside DevContainer via VSCode.

The output of python --version:
Python 3.11.6

Are you interested in contributing the fix?

PR provided.

Advanced Usage example for compare_column_values doesn't work with dbt 0.18

Describe the bug

See above!

Steps to reproduce

Dummy code based on our project

{%- set columns_to_compare=adapter.get_columns_in_relation(ref('fct_orders'))  -%}

{% set old_etl_relation_query %}
    select * from data.prod_schema.fct_orders
{% endset %}

{% set new_etl_relation_query %}
    select * from data.dev_schema.fct_orders
{% endset %}

{% if execute %}
    {% for column in columns_to_compare %}
        {{ log('Comparing column "' ~ column.name ~'"', info=True) }}

        {% set audit_query = audit_helper.compare_column_values(
            a_query=old_etl_relation_query,
            b_query=new_etl_relation_query,
            primary_key="FINAL_ORDER_ID",
            column_to_compare=column.name
        ) %}

        {% set audit_results = run_query(audit_query) %}
        {% do audit_results.print_table() %}
        {{ log("", info=True) }}

    {% endfor %}
{% endif %}

Expected results

Can run the audit query

Actual results

Screenshots and log output

Encountered an error:
Runtime Error
Compilation Error in model test (models/marts/test.sql)
'audit_helper' is undefined

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres

  • redshift

  • bigquery

  • snowflake

  • other (specify: ____________)

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - bigquery: 0.18.1
  - snowflake: 0.18.1
  - redshift: 0.18.1
  - postgres: 0.18.1

The operating system you're using:

The output of python --version:

Additional context

Are you interested in contributing the fix?

Yes, let me know what the problem is with the syntax here!

No Macro named 'get_columns_in_relation_sql' found

Hi There,

We are trying to compare two tables but unfortunately, we are getting below error.

I have sent an email to [email protected] 7 days ago but still waiting for the issue to be resolved.

Encountered an error:
Compilation Error
In dispatch: No macro named 'get_columns_in_relation_sql' found
Searched for: 'audit_helper.sqlserver__get_columns_in_relation_sql', 'audit_helper.default__get_columns_in_relation_sql'

Remove "to-do" section from bottom of README

Describe the feature

Remove this section from the bottom of the README:

image

Additional context

This has been present since the original commit for this repo, but would be better tracked in a GitHub issue instead.

#63 preserves the record of the proposed feature, so the "to-do" can be deleted.

Easier toggle between summary and diff when comparing relations and queries

Describe the feature

I've found the compare_queries and compare_relations very useful. It would be great to add an optional parameter when calling the compare_queries and compare_relations macros which would allow for an easy switch between the summary output and the full diff output so that we don't need to manually switch by commenting/uncommenting.

It's a small change and I would be happy to contribute it myself if that's something that is possible?

Who will this benefit?

Anyone who sees that the two relations aren't exactly the same and wants to see the rows that differ between the two relations.

Are you interested in contributing this feature?

Yes. My suggested change is small. I'm thinking something like this (where diff is the optional parameter and defaults to false to maintain backwards compatibility):

{% if diff %}

compared_queries as (
    select
        *
    from all_records
    where not (in_a and in_b)
    order by {{ primary_key ~ ", " if primary_key is not none }} in_a desc, in_b desc
)

{% else %}

summary_stats as (
    select
        in_a,
        in_b,
        count(*) as count
    from all_records
    group by 1, 2
),

compared_queries as (
    select
        *,
        round(100.0 * count / sum(count) over (), 2) as percent_of_total
    from summary_stats
    order by in_a desc, in_b desc
)

{% endif %}

Required version of dbt for 'dbt_utils': ['>=0.20.0', '<0.21.0'] but latest stable version of dbt is 0.19.2

Describe the bug

In dbt-audit-helper, the dbt-utils version requirement is [>=0.7.0, <0.8.0]. However, that version of dbt_utils does not support any version of dbt below 0.20.0, even though the latest stable version of dbt is 0.19.2.

Steps to reproduce

Install audit_helper, create a model, and compile it.

# package.yml
packages:
    - package: dbt-labs/dbt_utils
      version: 0.7.0
    - package: dbt-labs/audit_helper
      version: 0.4.0
dbt deps
dbt compile -m path.to.my.audit_helper.model

Expected results

Compiled query in target/

Actual results

Running with dbt=0.18.1
Encountered an error:
Runtime Error
  Failed to read package: Runtime Error
    This version of dbt is not supported with the 'dbt_utils' package.
      Installed version of dbt: =0.18.1
      Required version of dbt for 'dbt_utils': ['>=0.20.0', '<0.21.0']
    Check the requirements for the 'dbt_utils' package, or run dbt again with --no-version-check

Running with --no-version-check throws a different error

Compilation Error in model <my_models> (path/to/my/model.sql)
  In adapter.dispatch, got a string packages argument ("dbt_utils"), but packages should be None or a list.

System information

# package.yml
packages:
    - package: dbt-labs/dbt_utils
      version: 0.7.0
    - package: dbt-labs/audit_helper
      version: 0.4.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.19.2

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 0.18.1
  - snowflake: 0.18.1
  - redshift: 0.18.1
  - postgres: 0.18.1

The operating system you're using: MacOs 11.4

The output of python --version: python 3.8.3

Are you interested in contributing the fix?

For sure.

`compare_all_columns` fails if table has too many columns in BigQuery

Describe the bug

When trying to apply audit_helper.compare_all_columns to a table with 196 columns, the test fails with

18:07:33    Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
18:07:33    compiled SQL at target/run/****/src/tests/****_unit_test.sql

Note that the compared tables have a single row each, as I'm trying to use audit_helper for a SQL unit test.

This is more of a BigQuery limitation than a bug in the library. However, it may be possible to circumvent it if we can somehow paginate over the columns.
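As a rough workaround (a sketch, not part of the package: it assumes per-column queries are acceptable, and that old_query and new_query are defined as in the other examples in this README), one can fall back to compare_column_values in a loop, trading one enormous query for many small ones:

```sql
{# sketch: one query per column keeps each statement well below
   BigQuery's query-planning complexity limits #}
{%- set columns_to_compare = adapter.get_columns_in_relation(ref('my_model')) -%}

{% if execute %}
    {% for column in columns_to_compare %}
        {% set audit_query = audit_helper.compare_column_values(
            a_query=old_query,
            b_query=new_query,
            primary_key="id",
            column_to_compare=column.name
        ) %}
        {% do run_query(audit_query).print_table() %}
    {% endfor %}
{% endif %}
```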

Steps to reproduce

  1. Create a table with 196 columns. Their types probably don't matter.
  2. Create a second table with the same schema.
  3. Try to run audit_helper.compare_all_columns comparing them.

Expected results

I'd like there to be a report showing which columns don't match among the rows.

Actual results

The query fails because it's too complex

Screenshots and log output

N/A (sensitive information)

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.0
  - package: dbt-labs/audit_helper
    version: 0.6.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.2.3
  - latest:    1.3.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 1.2.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:
Linux 5.15.0-56-generic #62~20.04.1-Ubuntu SMP Tue Nov 22 21:24:20 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

The output of python --version:
Python 3.7.13

Additional context

Are you interested in contributing the fix?

I may contribute with guidance, though I haven't signed any agreement

compare_which_columns_differ: does not exist

Describe the bug

The compare_which_columns_differ macro doesn't exist.

There is, however, a compare_which_relation_columns_differ macro, but it doesn't accept exclude_columns.

Screenshots and log output

https://github.com/dbt-labs/dbt-audit-helper/blob/main/macros/compare_which_relation_columns_differ.sql

System information

The contents of your packages.yml file:

  • package: dbt-labs/audit_helper
    version: 0.12.0

Which database are you using dbt with?

  • snowflake

Fix ambiguous column name warning

When using this package, I got an ambiguous column name warning message from audit_helper/macros/compare_relation_columns.sql line 24 "full outer join b_cols using (column_name)"

I believe this can be fixed by joining with 'on' rather than 'using' - "full outer join b_cols on a_cols.column_name = b_cols.column_name"
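A sketch of the suggested change (the coalesce is an assumption on my part, to keep a single non-null column_name in the output once the join no longer merges the two columns):

```sql
-- before: ambiguous, since column_name exists in both CTEs
-- full outer join b_cols using (column_name)

-- after: explicit join condition with a disambiguated output column
select
    coalesce(a_cols.column_name, b_cols.column_name) as column_name
from a_cols
full outer join b_cols
    on a_cols.column_name = b_cols.column_name
```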

'compare_column_values' not able to create output table

Describe the bug

When executing dbt run on the audit file, the run fails with an error during table creation.

Steps to reproduce

Expected results

I expected to see, how the change in my attribution model will affect production once merged.

Actual results

The library does not finish the job.

Screenshots and log output

14:33:15  Database Error in model audit_attribution (#####)
14:33:15    syntax error at or near ")"
14:33:15    LINE 25:   );
14:33:15               ^
14:33:15  
14:33:15  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.4.1
  - latest:    1.5.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - redshift: 1.4.0 - Up to date!
  - postgres: 1.4.1 - Update available!

The operating system you're using:

The output of python --version:

Additional context

Are you interested in contributing the fix?

enable 'quick' macro for databricks

Describe the feature

quick_are_queries_identical is currently enabled for BigQuery and Snowflake only. Enabling it for Databricks takes only a minor change: using the Snowflake SQL and replacing hash_agg with xxhash64.

Describe alternatives you've considered

none

Additional context

none

Who will this benefit?

users of dbt-databricks adapter

Are you interested in contributing this feature?

Since this is a quick one, I already have a PR ready, but I cannot push (I probably need to be added as a contributor?).

compare_query_columns to compare multiple columns across two queries

Describe the feature

I would love the output from compare_all_columns to be applicable to multiple columns across two queries. It's like compare_column_values, but supporting multiple columns, and thus using the output format from compare_all_columns.

Happy to consider alternative naming for macro and args, but below should give clear indication on what I would like.

Describe alternatives you've considered

Alternative is to use all of these individually.

Additional context

I see that compare_all_columns depends on using column definitions defined in the relations, so that means we should just pass them along.

Example summary

{% set old_query %}
    select * from old_database.old_schema.dim_product
    where is_latest
{% endset %}

{% set new_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{{ audit_helper.compare_query_columns(
    a_query = old_query,
    b_query = new_query,
    primary_key = "product_id",
    columns = ["status", "amount", "some_other_column"], 
) }}

Summary output

| column_name | perfect_match | null_in_a | null_in_b | missing_from_a | missing_from_b | conflicting_values |
|---|---|---|---|---|---|---|
| status | 10 | 0 | 0 | 0 | 0 | 0 |
| amount | 2 | 0 | 0 | 0 | 0 | 8 |
| some_other_column | 6 | 4 | 4 | 0 | 0 | 0 |

Example verbose

{% set old_query %}
    select * from old_database.old_schema.dim_product
    where is_latest
{% endset %}

{% set new_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{{ audit_helper.compare_query_columns(
    a_query = old_query,
    b_query = new_query,
    primary_key = "product_id",
    columns = ["status", "amount", "some_other_column"], 
   summarize=False,
) }}

Detailed response

| primary_key | column_name | perfect_match | null_in_a | null_in_b | missing_from_a | missing_from_b | conflicting_values |
|---|---|---|---|---|---|---|---|
| 1 | status | true | false | false | false | false | false |
| 1 | amount | false | false | false | false | false | true |
| 1 | some_other_column | false | true | true | false | false | false |
| ... | ... | ... | ... | ... | ... | ... | ... |

Who will this benefit?

This is useful for comparing source and target queries in detail in one go, without relations defined and without comparing all columns, or repeating many times for a single specific column.

Are you interested in contributing this feature?

Yes, I would consider contributing only if it is deemed useful and likely to be merged into the lib.

Great lib, and of course it's simple enough for me to do in my own codebase, but also happy to enrich this lib for others if there is a general desire.

compare_queries can be wrong due to approximation by BigQuery

Describe the bug

I used compare_queries and the generated view reported that there are some gaps between the target tables.
However, according to my research, many records existed in both a_except_b and b_except_a.

My assumption is that intersect distinct and except distinct deal with records approximately.

Steps to reproduce

I just used macros like that. I masked the project ID and tables.

{% set old_table_query %}
  select *
  from `xxxxxxxx.old.users`
{% endset %}

{% set new_table_query %}
  select *
  from `xxxxxxxx.new.users`
{% endset %}

{{ audit_helper.compare_queries(
    a_query=old_table_query,
    b_query=new_table_query,
    primary_key="id"
) }}

Expected results

All records should be matched.

Actual results

There are some gaps.

Screenshots and log output


System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

$ dbt --version
installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - bigquery: 0.18.1
  - snowflake: 0.18.1
  - redshift: 0.18.1
  - postgres: 0.18.1

The operating system you're using:
Mac OS 10.15.5

The output of python --version:
Python 3.7.7

Are you interested in contributing the fix?

Yes, I have an idea to solve this.

Add support for spark

File "/usr/local/lib/python3.9/dist-packages/dbt/parser/manifest.py", line 526, in macro_depends_on
possible_macro_calls = statically_extract_macro_calls(
File "/usr/local/lib/python3.9/dist-packages/dbt/clients/jinja_static.py", line 43, in statically_extract_macro_calls
ad_macro_calls = statically_parse_adapter_dispatch(
File "/usr/local/lib/python3.9/dist-packages/dbt/clients/jinja_static.py", line 147, in statically_parse_adapter_dispatch
macro = db_wrapper.dispatch(func_name, macro_namespace=macro_namespace).macro
File "/usr/local/lib/python3.9/dist-packages/dbt/context/providers.py", line 186, in dispatch
raise CompilationException(msg)
dbt.exceptions.CompilationException: Compilation Error
In dispatch: No macro named 'get_columns_in_relation_sql' found
Searched for: 'project.spark__get_columns_in_relation_sql', 'project.default__get_columns_in_relation_sql', 'audit_helper.spark__get_columns_in_relation_sql', 'audit_helper.default__get_columns_in_relation_sql'
11:23:16 Encountered an error:
Compilation Error
In dispatch: No macro named 'get_columns_in_relation_sql' found
Searched for: 'project.spark__get_columns_in_relation_sql', 'project.default__get_columns_in_relation_sql', 'audit_helper.spark__get_columns_in_relation_sql', 'audit_helper.default__get_columns_in_relation_sql'

Duplicate rows

Currently, these queries don't take the presence of duplicate rows into account.

There's a question of whether this is actually OK; perhaps uniqueness should be tested separately anyway. But if that's the case, then the assumption should be called out more explicitly in the docs.
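If duplicates stay out of scope for the main comparison, one way to surface them separately (a sketch, not an official recommendation) is to diff per-key row counts:

```sql
-- keys whose duplication count differs between a and b
select id, count(*) as n from a group by id
except
select id, count(*) as n from b group by id
```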

`compare_all_columns` has a tough time with null values

Describe the bug

Currently, the compare_all_columns macro does a poor job of distinguishing between missing data and null values.

We must check whether the primary key exists in both tables to determine whether a value is null or simply missing.

Steps to reproduce

Incorrect handling of missing rows leading to wrong perfect_match value

  • PK 12345 exists in Table A but not in Table B
  • first_name for that row in Table A is NULL

Incorrect comparison of null to not-null leading to a wrong conflicting_values value

  • Comparing any not-null value to a null value produced conflicting_values = false when it should be true.
    Note: Should it really be a conflicting_value if one of the columns is null and the other is not? [Debate about what a null value really is redacted.] Yes! Because it's not a perfect match.

Expected results

Incorrect handling of missing rows leading to wrong perfect_match value

  • This should instead be categorized as perfect_match = false

Incorrect comparison of null to not-null leading to a wrong conflicting_values value

  • This should instead be categorized as conflicting_values = true

Actual results

The opposite of what's expected.
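A null-safe comparison would address both cases. A sketch, assuming a full outer join on the primary key (is distinct from is standard SQL, though support varies by warehouse; table and column names are illustrative):

```sql
select
    coalesce(a.id, b.id) as id,
    -- null vs 'x' -> true; null vs null -> false
    a.first_name is distinct from b.first_name as conflicting_values,
    -- distinguish a genuinely null value from a missing row
    a.id is null as missing_in_a,
    b.id is null as missing_in_b
from a
full outer join b on a.id = b.id
```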

Screenshots and log output

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/audit_helper
    version: 0.7.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.4.1
  - latest:    1.4.5 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - snowflake: 1.4.1 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:

The output of python --version:

Python 3.9.6

Additional context

Are you interested in contributing the fix?

Yes!

Support Concatenate SQL in unique_key config

Describe the feature

I would like to be able to use concat SQL syntax to combine multiple columns that act as a surrogate key for a table, similar to how you can for the unique key in an incremental model or snapshot; something like:
unique_key= "col_a||'-'||col_b"

Describe alternatives you've considered

Currently, if I want to use audit_helper macros on tables that do not have a unique key, I must use the surrogate key macro (or something similar) to add this column onto the object so it can be referenced.

Additional context

This feature is not database specific and would match how we handle the unique_key configuration in other parts of dbt for example snapshots & incremental models.

Who will this benefit?

This would let me work more easily with the audit_helper package by extending its use cases to tables with no primary key. It saves me time as a developer since I do not have to materialize the surrogate key on an object in the warehouse.
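The current workaround can be sketched as follows, building the key inside each query with dbt_utils.generate_surrogate_key (table and column names are illustrative):

```sql
{% set old_query %}
    select
        {{ dbt_utils.generate_surrogate_key(['col_a', 'col_b']) }} as surrogate_key,
        *
    from old_schema.orders
{% endset %}
-- ...build the new query the same way, then pass:
-- primary_key="surrogate_key"
```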

Are you interested in contributing this feature?

compare_column_values not working with Athena/Presto

Describe the bug

Function compare_column_values not working on Athena/Presto.

Steps to reproduce

Create test in Presto environment:

{% set old_etl_relation_query %}
    select
        'a' AS col_a,
        'b' AS col_b
{% endset %}

{% set new_etl_relation_query %}
    select
        'a' AS col_a,
        'b' AS col_b
{% endset %}

{{ audit_helper.compare_column_values(
    a_query=old_etl_relation_query,
    b_query=new_etl_relation_query,
    primary_key="col_a",
    column_to_compare="col_b"
)
}}

Log output

COLUMN_NOT_FOUND: line 59:10: Column 'column_name' cannot be resolved or requester is not authorized to access requested resources

System information

dbt-labs/audit_helper version: 0.9.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: Presto)

Additional context

Presto does not support grouping by aliases.
Line 53 of compare_column_values.sql: group by column_name should be removed in order to work with Presto
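A Presto-compatible form would group by the expression (or ordinal position) rather than the alias. A sketch of the suggested change, where column_to_compare is the macro's argument and a_query stands in for the compared query:

```sql
-- before (fails on Presto):
--   group by column_name
-- after:
select
    {{ column_to_compare }} as column_name,
    count(*) as n
from a_query
group by {{ column_to_compare }}  -- or: group by 1
```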

Are you interested in contributing the fix?

I can try making a PR

macros - drop table issue (redshift table)

Describe the bug

I have a dbt macro that drops a table if it exists. However, the subsequent code that is supposed to recreate the table fails, since it seems to think that the table still exists. When I run the model again (dbt run +table_name+), the table gets created and the data is populated.

The same issue arises when I run it again (run #3), and it succeeds when I rerun!

Wondering if there is a delay in committing on the redshift cluster where I have a table

Expected results

A successful run.

Actual results

Failure to create the table and load data:
Completed with 1 error and 0 warnings:
Invalid unique_key value provided:

Add a parameter to summarize by a dimension

Describe the feature

Would love to have an option to group audit_helper results by a dimension! (Screenshot of the resulting view omitted.)

Describe alternatives you've considered

I've built this myself by taking the audit_helper results and transforming them to show results by month.

Additional context

This is helpful because it lets you recognize patterns and dive more specifically into where discrepancies occur. For example, I could identify where old migrations happened, and worked with a team to identify months with explosive order volumes that didn't always come in with the right data. This helped us quickly verify and eliminate a lot of things from our extensive validations.

Here is the code that I created to help with this (saved over my local audit_helper compare_queries code):

{% macro compare_queries(a_query, b_query, primary_key=None, summarize=true, summarize_by=None) -%}
  {{ return(adapter.dispatch('compare_queries', 'audit_helper')(a_query, b_query, primary_key, summarize, summarize_by)) }}
{%- endmacro %}

{% macro default__compare_queries(a_query, b_query, primary_key=None, summarize=true, summarize_by=None) %}

with a as (

    {{ a_query }}

),

b as (

    {{ b_query }}

),

a_intersect_b as (

    select * from a
    {{ dbt.intersect() }}
    select * from b

),

a_except_b as (

    select * from a
    {{ dbt.except() }}
    select * from b

),

b_except_a as (

    select * from b
    {{ dbt.except() }}
    select * from a

),

all_records as (

    select
        *,
        true as in_a,
        true as in_b
    from a_intersect_b

    union all

    select
        *,
        true as in_a,
        false as in_b
    from a_except_b

    union all

    select
        *,
        false as in_a,
        true as in_b
    from b_except_a

),

{%- if summarize %}

    {%- if summarize_by is none %}

    summary_stats as (

        select

            in_a,
            in_b,
            count(*) as count

        from all_records
        group by 1, 2

    ),

    final as (

        select

            *,
            round(100.0 * count / sum(count) over (), 2) as percent_of_total

        from summary_stats
        order by in_a desc, in_b desc
    
    )

    {%- else %}

    summary_stats as (

        select

            in_a,
            in_b,
            {{ summarize_by }} as audit_group,
            count(*) as count

        from all_records
        group by 1, 2, 3

    ),

    counts as (

        select

            *,
            round(100.0 * count / sum(count) over (partition by audit_group), 2) as percent_of_total

        from summary_stats
        order by audit_group, in_a desc, in_b desc
    
    ),

    final as (
        select
            audit_group,
            coalesce(
                min(case when in_a and in_b then percent_of_total || '% (' || count || ')' end),
                '--'
            ) as perfect_match,
            coalesce(
                min(case when in_a and not in_b then percent_of_total || '% (' || count || ')' end),
                '--' 
            ) as only_a,
            coalesce(
                min(case when not in_a and in_b then percent_of_total || '% (' || count || ')' end),
                '--'
            ) as only_b
        from counts
        group by 1
    )

    {%- endif %}

{%- else %}

final as (
    
    select * from all_records
    where not (in_a and in_b)
    order by {{ primary_key ~ ", " if primary_key is not none }} in_a desc, in_b desc

)

{%- endif %}

select * from final

{% endmacro %}

and here is the usage:

-- summarize_by is the new parameter that will count over
-- whatever dimension you desire!
{{ audit_helper.compare_queries(
    a_query=old_query,
    b_query=new_query,
    primary_key='order_id, order_item_id, fulfillment_id',
    summarize=true,
    summarize_by="date_trunc('month', order_timestamp_utc)"
) }}

There are probably some cleanup tasks needed to make it friendly across data warehouses; for example, I'm not sure piped concatenations (||) work on all platforms.
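One likely cleanup (an assumption on my part, not something I've tested on every adapter) is to replace the piped concatenations with dbt's cross-database concat macro:

```sql
-- instead of: percent_of_total || '% (' || count || ')'
-- (numeric columns may need casting to text on some warehouses)
{{ dbt.concat(["percent_of_total", "'% ('", "count", "')'"]) }}
```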

Who will this benefit?

Anyone performing migration work who needs to validate old data sets against new ones! It cuts down time considerably when you can identify patterns in time-based results. For example:

  • If the inconsistency is seen across the board, it's probably logic or a join.
  • If the inconsistency isn't seen across the board, it's probably something that happened with the data at that point in time, or is due to a time-based join.

Are you interested in contributing this feature?

Yes! But I submitted this because I'm afraid of finding out I don't have time 😢

get_columns_in_relation_sql returns an error during parse phase on Snowflake when using a created Relation

Offending code:

{% set old_etl_relation=adapter.get_relation(
      database=target.database,
      schema="ANALYTICS",
      identifier="DIM_ACCOUNTS"
) -%}
{{ audit_helper.get_columns_in_relation_sql(old_etl_relation) }}

Result:

Compilation Error in macro snowflake__get_columns_in_relation_sql (macros/compare_relation_columns.sql)
  'None' has no attribute 'information_schema'
  > in macro adapter_macro (macros/adapters/common.sql)
  > called by macro get_columns_in_relation_sql (macros/compare_relation_columns.sql)
  > called by macro compare_relation_columns (macros/compare_relation_columns.sql)
  > called by macro snowflake__get_columns_in_relation_sql (macros/compare_relation_columns.sql)

Workaround:
Wrap everything in an {% if execute %} statement

Investigation:
I believe this is because the macro calls {{ relation.information_schema('columns') }}, but during the parse phase I haven't yet created the relation.

Likely happening on BQ and Postgres too
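The workaround, applied to the offending code above, looks like this:

```sql
{% if execute %}
    {% set old_etl_relation = adapter.get_relation(
        database=target.database,
        schema="ANALYTICS",
        identifier="DIM_ACCOUNTS"
    ) %}
    {{ audit_helper.get_columns_in_relation_sql(old_etl_relation) }}
{% endif %}
```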

compare_all_columns not working on Presto

Describe the bug

Function compare_all_columns does not work on Presto.

Steps to reproduce

Create test in Presto environment:

{{ audit_helper.compare_all_columns(
    a_relation=ref('test__old'),
    b_relation=ref('test__new'),
    primary_key="col_a"
)}}

Models to test (same query for each one):

select
 'a' AS col_a,
 'b' AS col_b

Screenshots and log output

line 29:7: mismatched input 'with'. Expecting: '(', 'SELECT', 'TABLE', 'VALUES'

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: Presto)

The output of dbt --version:
dbt-labs/audit_helper version: 0.9.0

Additional context

Lines 45-51 generate WITH main as ((WITH ...; a double parenthesis followed by WITH does not seem to be allowed.
It works with a single parenthesis.

Are you interested in contributing the fix?

I am going to suggest a PR

compare_relations adapter.get_relation doesn't work as in the README

Describe the bug

The compare_relations macro seems to have an issue with the adapter.get_relation value on the get_filtered_columns_in_relation operation. It seems to work with an {% if execute %} wrapper, though.

Steps to reproduce

Copy the code from the README for compare_relations and use it as-is, swapping the adapter.get_relation call to point at an accessible table. Try to run the macro: you will get an error about an empty value.

Expected results

A summarized comparison between the two datasets. Using Benoit's suggestion from an internal channel, the macro works fine like this:

{% set dbt_relation=ref('orders') %}

{% if execute %}
    {%- set old_etl_relation = adapter.get_relation(
        database="ANALYTICS",
        schema="DBT_COHMS",
        identifier="FCT_ORDERS") -%}

    {{ audit_helper.compare_relations(
        a_relation=old_etl_relation,
        b_relation=dbt_relation,
        primary_key="order_id"
    ) }}
{% endif %}

Actual results

With the snippet in the README, the macro errors out; Benoit's version runs successfully. (Screenshots omitted.)

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:
1.6.0 - dbt Cloud

The operating system you're using:
Mac / Arc Browser

The output of python --version:

Additional context

It's fine if things now need to be wrapped in if execute, but the README should reflect that!

Are you interested in contributing the fix?

Would love to, but don't have time to solution if needed 🥲

Macro to check if two schemas contain the same relations

Describe the feature

compare_schemas: given two schemas, determine their relative complements (AKA, set differences) similar to compare_relations.

🚧 Further refinement would be necessary to determine which type(s) of atomic units should be compared.

The simplest approach would be to compare only relation names (and not relation types, column names, row values, etc).

in_a in_b count percent_of_total
True True 7 0.30
True False 10 0.43
False True 6 0.26

Setting a summarize argument to false could let you check which relations do not match between schemas:

relation_name in_a in_b
raw_orders True False
raw_customers False True
stg_orders True False
stg_customers False True

Describe alternatives you've considered

A legit option is to close this as wontfix!

If we do choose to implement it, there are many different ways that two schemas could be compared, depending on how deeply one wants to go. Anything beyond simple comparison of relation names within the schema could be deferred to more granular macros like compare_relations (or completely different packages, like dbt-expectations).
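A name-only comparison could be sketched against information_schema (hypothetical; metadata views and column names differ slightly per warehouse):

```sql
with a as (
    select table_name
    from information_schema.tables
    where table_schema = 'schema_a'
),

b as (
    select table_name
    from information_schema.tables
    where table_schema = 'schema_b'
)

select
    coalesce(a.table_name, b.table_name) as relation_name,
    a.table_name is not null as in_a,
    b.table_name is not null as in_b
from a
full outer join b on a.table_name = b.table_name
```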

Additional context

The initial commit for this repo had the following to-dos:

  • Macro to check if two models have the same structure
  • Macro to check if two schemas contain the same relations
  • Extend check_equality macro to handle edge cases

There is now only one remaining.

The main motivation for creating this issue is so that I can feel better about cleaning up the README by removing this from the bottom 😎


Who will this benefit?

🤷

Are you interested in contributing this feature?

No answer.
