
tsql-utils's Issues

enable `web/` macros (e.g. `dbt_utils.get_url_parameter()`)

dbt_utils.get_url_parameter() doesn't work currently (neither does get_url_host or get_url_path).

I think it has to do with dbt_utils.split_part:

{% macro sqlserver__split_part(string_text, delimiter_text, part_number) %}
LTRIM(CAST(('<X>'+REPLACE({{ string_text }},{{ delimiter_text }} ,'</X><X>')+'</X>') AS XML).value('(/X)[{{ part_number }}]', 'VARCHAR(128)'))
{% endmacro %}

dbt-utils's integration test error message is:

('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]XML parsing: line 1, character 56, semicolon expected (9411) (SQLMoreResults)')

The "semicolon expected" complaint suggests that the bare & characters in the URL values are being read as unterminated XML entity references. Here is the SQL generated by the integration tests:

with data as (
    select * from "dbt-msft-serverless-db"."test"."data_urls"
)

select
    nullif(
        LTRIM(CAST(('<X>'+REPLACE(
            LTRIM(CAST(('<X>'+REPLACE(url,'utm_medium=' ,'</X><X>')+'</X>') AS XML).value('(/X)[2]', 'VARCHAR(128)'))
        ,'&' ,'</X><X>')+'</X>') AS XML).value('(/X)[1]', 'VARCHAR(128)'))
    ,'') as actual,
    medium as expected

from data

union all

select
    nullif(
        LTRIM(CAST(('<X>'+REPLACE(
            LTRIM(CAST(('<X>'+REPLACE(url,'utm_source=' ,'</X><X>')+'</X>') AS XML).value('(/X)[2]', 'VARCHAR(128)'))
        ,'&' ,'</X><X>')+'</X>') AS XML).value('(/X)[1]', 'VARCHAR(128)'))
    ,'') as actual,
    source as expected

from data
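
If the root cause really is XML escaping, one possible fix is to escape the input (and the delimiter) before the XML cast; .value() unescapes entities on the way back out, so the extracted text comes back unchanged. A sketch, handling only &; a full fix would treat < and > the same way:

{% macro sqlserver__split_part(string_text, delimiter_text, part_number) %}
    {#- escape bare ampersands in both the string and the delimiter,
        so URL query strings parse as valid XML -#}
    LTRIM(CAST(('<X>' + REPLACE(
        REPLACE({{ string_text }}, '&', '&amp;'),
        REPLACE({{ delimiter_text }}, '&', '&amp;'),
        '</X><X>') + '</X>') AS XML).value('(/X)[{{ part_number }}]', 'VARCHAR(128)'))
{% endmacro %}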

test for `insert_by_period` is failing... don't know why

@davidclarance maybe you can help me with this? happy to give more context later -- just jotting this down for now

test's compiled SQL

-- setup

with a as (

    select * from "dbt-msft-serverless-db"."test"."test_insert_by_period"

),

b as (

    select * from "dbt-msft-serverless-db"."test"."expected_insert_by_period"

),

a_minus_b as (

    select "id", "created_at" from a
    except
    select "id", "created_at" from b

),

b_minus_a as (

    select "id", "created_at" from b
    except
    select "id", "created_at" from a

),

unioned as (

    select 'a_minus_b' as which_diff, * from a_minus_b
    union all
    select 'b_minus_a' as which_diff, * from b_minus_a

)

select * from unioned

results

which_diff    id    created_at
b_minus_a     3     2018-02-02
b_minus_a     4     2018-03-02
b_minus_a     5     2018-04-02
b_minus_a     6     2018-05-02

Reading this: ids 3-6 are present in the expected table but missing from the actual output, which suggests only the first period is actually being inserted.

decide API for dbt-utils macros that work without issue in TSQL

Consider dbt_utils.star(), which works in TSQL without modification. But because it isn't currently implemented in tsql-utils, you have to call it as dbt_utils.star() while calling other macros via tsql_utils, which is confusingly inconsistent.

The simplest solution IMHO involves dbt-labs/dbt-core#2923: if a user calls tsql_utils.star, dbt would first look for a star macro in the tsql-utils package and, failing that, fall back to dbt-utils and find dbt_utils.star. However, that's a medium- or long-term fix.

In the short term, @jtcohen6, do you think it makes the most sense to have 100% dbt-utils macro coverage in tsql-utils, with macros that need no modification simply dispatched back to dbt-utils, like below? Otherwise, I think it might be challenging for users to have to remember to call dbt-utils for some macros and tsql-utils for others.

{% macro sqlserver__star(from, relation_alias=False, except=[]) %}
    {% do return( dbt_utils.star(from, relation_alias, except) ) %}
{% endmacro %}

{% macro synapse__star(from, relation_alias=False, except=[]) %}
    {% do return( tsql_utils.sqlserver__star(from, relation_alias, except) ) %}
{% endmacro %}

Add testing models for the macros

I can open a PR for this with samples from dbt-utils.

Options:

  1. We provide a project that installs dbt-utils and tsql-utils in the correct way, then runs the tests from dbt-utils. This may already be implemented by the integration-tests code I can see floating around, but that seems to be mostly config?
  2. We copy appropriate tests into this repo.

Port test_mutually_exclusive_ranges() to TSQL

My priority here is not the macro itself, but the fact that dbt-utils's integration tests use it to validate OTHER macros.

calc CTE

These two coalesces are the tricky parts; they need to be converted into NULLIFs or CASE WHENs. Also, false needs to be replaced with 0, because TSQL has no true/false boolean literals. Here's a gist of the compiled SQL that's erroring out. If you log into the test db (dbt-msft-serverless-db), you can debug it.

My interpretation of the logic for the first coalesce statement below (though I'm not sure in what scenario lower_bound < upper_bound could be NULL):

  • if lower_bound < upper_bound is not null:
    • return the result of the comparison: true or false (or in TSQL's case: 1 or 0)
  • else if lower_bound < upper_bound is NULL:
    • return false (i.e. 0)

current error

The error I'm getting is Incorrect syntax near '<'. (102):

Database Error in test dbt_utils_mutually_exclusive_ranges_data_test_mutually_exclusive_ranges_no_gaps_not_allowed__lower_bound__upper_bound (models/schema_tests/schema.yml)
  ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near '<'. (102) (SQLExecDirectW)")
  compiled SQL at target/compiled/dbt_utils_integration_tests/models/schema_tests/schema.yml/schema_test/dbt_utils_mutually_exclusive_ranges_data_test_mutually_exclusive_ranges_no_gaps_f27ec0771ee181e29da8b3adeddba768.sql
-- For each record: lower_bound should be < upper_bound.
-- Coalesce it to return an error on the null case (implicit assumption
-- these columns are not_null)
coalesce(
    lower_bound < upper_bound,
    false
) as lower_bound_less_than_upper_bound,

-- For each record: upper_bound {{ allow_gaps_operator }} the next lower_bound.
-- Coalesce it to handle null cases for the last record.
coalesce(
    upper_bound {{ allow_gaps_operator }} next_lower_bound,
    is_last_record,
    false
) as upper_bound_{{ allow_gaps_operator_in_words }}_next_lower_bound
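
Following that interpretation, here is a possible TSQL rewrite of both coalesces (a sketch, assuming is_last_record is already the 1/0 column built in the window_functions CTE below):

-- first coalesce: the ELSE branch absorbs both "comparison is false"
-- and "comparison is NULL", which matches coalesce(..., false)
case
    when lower_bound < upper_bound then 1
    else 0
end as lower_bound_less_than_upper_bound,

-- second coalesce: fall back to is_last_record only when the comparison
-- cannot be evaluated (i.e. next_lower_bound is null)
case
    when upper_bound {{ allow_gaps_operator }} next_lower_bound then 1
    when next_lower_bound is not null then 0
    else coalesce(is_last_record, 0)
end as upper_bound_{{ allow_gaps_operator_in_words }}_next_lower_bound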

window_functions CTE

This piece I was already able to convert:

row_number() over (
    {{ partition_clause }}
    order by {{ lower_bound_column }} desc
) = 1 as is_last_record

becomes

case when
    row_number() over (
        {{ partition_clause }}
        order by {{ lower_bound_column }} desc
    ) = 1
    then 1 else 0
end as is_last_record

sqlserver__type_string is too short

Hi Guys,

First of all, thank you for creating this package. Your efforts are much appreciated.

I noticed the type_string adapter looks like this:

{%- macro sqlserver__type_string() -%}

VARCHAR(900)

{%- endmacro -%}

Is there anything that speaks against using VARCHAR(8000)? By limiting it to 900, there could be instances where strings are truncated.

That's particularly bad since type_string is used in surrogate_key. For example, I use hashes of URLs as keys and this would lead to duplicate surrogate keys, because some URLs are longer than 900 characters.
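
In the meantime, the macro can be shadowed from your own project; by default, dispatch searches the root project before the package, so a local override like this should win (a sketch):

{%- macro sqlserver__type_string() -%}
    VARCHAR(8000)
{%- endmacro -%}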

Compilation error when following "Dev environment set-up" instructions in CONTRIBUTING.md

I am having trouble running integration tests for this package in my development environment. I can follow the "Dev environment set-up" instructions up to step 5, but then I encounter a Compilation Error when I try to run dbt seed or related dbt commands like run or test. The error specifies "No macro named 'except' found":

❯ dbt deps --target sqlserver
22:21:02  [WARNING]: Deprecated functionality
The `source-paths` config has been renamed to `model-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
22:21:02  [WARNING]: Deprecated functionality
The `data-paths` config has been renamed to `seed-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
22:21:02  Running with dbt=1.1.1
22:21:02  Installing ../../
22:21:02    Installed from <local @ ../../>
22:21:02  Installing ../../dbt-utils
22:21:02    Installed from <local @ ../../dbt-utils>
22:21:02  Installing ../../dbt-utils/integration_tests
22:21:02    Installed from <local @ ../../dbt-utils/integration_tests>
22:21:02  Installing ../
22:21:02    Installed from <local @ ../>

❯ dbt seed --target sqlserver --full-refresh
22:21:21  [WARNING]: Deprecated functionality
The `source-paths` config has been renamed to `model-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
22:21:21  [WARNING]: Deprecated functionality
The `data-paths` config has been renamed to `seed-paths`. Please update your
`dbt_project.yml` configuration to reflect this change.
22:21:21  Running with dbt=1.1.1
22:21:21  Partial parse save file not found. Starting full parse.
22:21:22  Encountered an error:
Compilation Error
  In dispatch: No macro named 'except' found
      Searched for: 'tsql_utils_dbt_utils_integration_tests.sqlserver__except', 'tsql_utils_dbt_utils_integration_tests.default__except', 'dbt.sqlserver__except', 'dbt.default__except'

Has anyone experienced an issue like this before or have any thoughts on how I can resolve this?

Also potentially related to this issue, I noticed that my dbt-sqlserver version is different from the 0.21.0rc1 version specified in .circleci/config.yml. I'm thinking it would be best for my dev environment to match the CI environment, but it looks like the current version of dbt-utils requires dbt-core >= 1.0.0.

`dbt_date.get_date_dimension` not working with SQL Server 2019

The dbt_date.get_date_dimension macro is not working with SQL Server. Tested with the latest package versions (as of posting) and with versions matching the dbt_date commit pinned in the tsql-utils repo.

dbt model

{{ dbt_date.get_date_dimension(
    "2020-01-01",
    "2024-01-01",
)}}

Error message

('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near the keyword 'with'. (156) (SQLMoreResults); [42000] [Microsoft][ODBC Driver 17 for SQL Server]
[SQL Server]Incorrect syntax near the keyword 'with'. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be termina
ted with a semicolon. (319); [42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near ')'. (102)")

SQL Server version

Microsoft SQL Server 2019 (RTM-GDR) (KB5029377) - 15.0.2104.1 (X64) 
	Aug 16 2023 00:09:21 
	Copyright (C) 2019 Microsoft Corporation
	Developer Edition (64-bit) on Windows Server 2019 Datacenter 10.0 <X64> (Build 17763: ) (Hypervisor)

dbt_project.yml dispatch configuration

dispatch:
  - macro_namespace: dbt_utils
    search_order: ['tsql_utils', 'dbt_utils']
  - macro_namespace: dbt_date
    search_order: ['tsql_utils', 'dbt_date']

Intended packages.yml

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: dbt-msft/tsql_utils
    version: 0.10.0
  - package: calogica/dbt_date
    version: 0.10.0

packages.yml testing dbt_date pinned commit

packages:
  - package: dbt-labs/dbt_utils
    version: [">=0.9.0", "<1.0.0"]
  - package: dbt-msft/tsql_utils
    version: 0.10.0
  - package: calogica/dbt_date
    version: 0.6.0

`sqlserver__dateadd()` does not work on Synapse

For some reason sqlserver__dateadd works on Azure SQL, but on Synapse it is off by a day:

actual                         expected
1900-01-02 00:00:00.0000000    1900-01-01 00:00:00.000

{% macro sqlserver__dateadd(datepart, interval, from_date_or_timestamp) %}
dateadd(
{{ datepart }},
{{ interval }},
cast({{ from_date_or_timestamp }} as datetime)
)
{% endmacro %}
{% macro synapse__dateadd(datepart, interval, from_date_or_timestamp) %}
{% do return( tsql_utils.sqlserver__dateadd(datepart, interval, from_date_or_timestamp)) %}
{% endmacro %}

with data as (

    select * from "dbtsynapseci"."test"."data_dateadd"

)

select
    case
        when datepart = 'hour' then cast(dateadd(hour, interval_length, cast(from_time as datetime)) as datetime2)
        when datepart = 'day' then cast(dateadd(day, interval_length, cast(from_time as datetime)) as datetime2)
        when datepart = 'month' then cast(dateadd(month, interval_length, cast(from_time as datetime)) as datetime2)
        when datepart = 'year' then cast(dateadd(year, interval_length, cast(from_time as datetime)) as datetime2)
        else null
    end as actual,
    result as expected

from data

Multi-field surrogate key failing due to using default dbt_utils concat macro

Hi,

I noticed while using surrogate_key that the compiled SQL was syntactically incorrect due to the operator used to concatenate the list of fields I am passing to the macro. After some triage, I realized that it was using the default dbt_utils.concat() macro, which uses '||' to concatenate.

In order to resolve this, I added the following to my dbt_project.yml:

dispatch:
  - macro_namespace: dbt_utils
    search_order: ['tsql_utils', 'dbt_utils']

vars:
  dbt_utils_dispatch_list: ['tsql_utils']
  dbt_date_dispatch_list: ['tsql_utils']
  audit_helper_dispatch_list: ['tsql_utils']
  dbt_expectations_dispatch_list: ['tsql_utils']

This solved the issue, but I am not confident that this is the correct solution.

testing macros defined using CTEs

As I work on #59, I've had to disable ~17 tests in integration_tests/dbt_utils/dbt_project.yml for the same reason: with the pivot to the test block, the code under test gets injected as a subquery into the result of get_test_sql(). The ~17 or so macros being tested all use CTEs, which aren't allowed inside subqueries in TSQL. Previously this was only an issue for bespoke tests, where the solution was to tell users that CTEs in bespoke tests were not supported.

However, this issue is a bigger blocker, because now:

  • we don't have the ability to test dbt-utils macros in tsql, without first rewriting them to not use CTEs, and
  • even existing sqlserver__ ports of dbt-utils macros that were working would now need to be refactored to not use CTEs in order to be tested

Both of the above are possible, but not ideal. Here's a commit of me doing it once; I'm not looking forward to how this will look for a macro like get_date_spine.
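
For a sense of the kind of rewrite involved (a generic sketch, not the actual commit): a macro whose rendered SQL looks like

with base as (
    select 1 as id
)
select * from base

has to be flattened into a derived table before the test block can wrap it in a subquery:

select * from (
    select 1 as id
) as base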

Your input is much appreciated as always, @jtcohen6! It's a bummer that I didn't consider this scenario when initially testing dbt-sqlserver with dbt-core==0.20.0.

Warning: the `concat` macro is now provided in dbt Core with dbt_core 1.3.0+

I'm using the tsql_utils.surrogate_key macro and when running or compiling dbt I get this warning:
Warning: the concat macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use concat (no prefix) instead.

`date_spine` isn't working with `dbt_utils 1.1.0`

Hello!

I'm trying to run a date_spine to populate my Calendar Dimension, but it crashes with a Compilation Error:

23:29:43  Running with dbt=1.4.6
23:29:44  Encountered an error:
Compilation Error in model DimCalendar (models/03_dimensions/DimCalendar.sql)
  'dict object' has no attribute 'datediff'
  
  > in macro sqlserver__date_spine (macros/dbt_utils/datetime/date_spine.sql)
  > called by macro date_spine (macros/sql/date_spine.sql)
  > called by model DimCalendar (models/03_dimensions/DimCalendar.sql)

Version info:

dbt --version
Core:
  - installed: 1.4.6
  - latest:    1.5.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - sqlserver: 1.4.3 - Up to date!

packages:
  - package: dbt-labs/dbt_utils
    version: ["1.1.0"]
  - package: calogica/dbt_date
    version: ["0.7.2"]
  - package: dbt-msft/tsql_utils
    version: ["0.9.0"]

After a regression to dbt_utils 0.9.0, date_spine worked, but emitted some warnings:

> Executing task: dbt compile --model DimCalendar
23:35:07  Running with dbt=1.4.6
23:35:07  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- tests
23:35:07  Found 65 models, 0 tests, 0 snapshots, 1 analysis, 695 macros, 0 operations, 10 seed files, 199 sources, 0 exposures, 0 metrics
23:35:07  
23:35:08  Concurrency: 4 threads (target='dev')
23:35:08  
23:35:08  Warning: the `datediff` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `datediff` (no prefix) instead. The dbt_sazi.DimCalendar model triggered this warning.
23:35:08  Warning: the `dateadd` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `dateadd` (no prefix) instead. The dbt_sazi.DimCalendar model triggered this warning.
23:35:08  Done.

How to follow the original dbt package and why a submodule

Hi,

I am in the process of creating TSQL shims for the great packages dbt-date and dbt-expectations (dbt-date-tsql and dbt-expectations-tsql) and I have followed a similar approach to tsql-utils, creating a submodule linking to the original package.

I am actually wondering: why do dbt shims follow this approach, and what are the advantages? Is this to track a specific version of the original package? But in that case, shouldn't .gitmodules pin a specific branch or commit of that package?

And a second question: once the packages are ready, would it make sense to move them under the dbt-msft umbrella?

Cheers,

dbt_utils expression_is_true does not work without `--store-failures`

I have dbt_utils and tsql_utils installed.

On one of my model's columns, I have the following test:

- name: MyColumn
  tests:
    - not_null
    - accepted_values:
        values: ["Opt1", "Opt2", "Opt3", "Opt4"]
    - dbt_utils.expression_is_true:
        expression: "!= 'Opt1'"
        config:
          where: "OtherColumn is null or OtherColumn = 0"

When I run dbt test, I get the following error:
('42000', '[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Create View or Function failed because no column name was specified for column 1. (4511) (SQLMoreResults)')

The compiled SQL is

select
    1
from (select * from MyTable where OtherColumn is null or OtherColumn = 0) dbt_subquery

where not(MyColumn != 'Opt1')

which, when run, is created as a view. SQL Server does not support unnamed columns in views, so this fails.
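
Aliasing the literal should be enough to satisfy SQL Server here; presumably the compiled test would need to look something like this (a sketch):

select
    1 as failures
from (select * from MyTable where OtherColumn is null or OtherColumn = 0) dbt_subquery
where not(MyColumn != 'Opt1')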

For reference, the following dispatch configuration is in my dbt_project.yml

dispatch:
  - macro_namespace: dbt_utils
    search_order: ["tsql_utils", "dbt_utils"]

Is this a bug, or am I doing something wrong? It seems there is no expression_is_true macro in tsql-utils for dbt_utils, but I may be understanding the dispatch flow incorrectly. Thanks!

get_date_dimension not working

It seems the compiled query contains a nested CTE expression, which causes the failure.

{{ dbt_date.get_date_dimension('2000-01-01', '2050-12-31') }}
Incorrect syntax near 'with'.

with base_dates as (    
with date_spine as (...

packages.yml

packages:
  - package: calogica/dbt_date
    version: 0.4.1
  - package: dbt-labs/dbt_utils
    version: 0.7.4
  - package: dbt-msft/tsql_utils
    version: 0.8.1

dispatch defined in dbt_project.yml

dispatch:
  - macro_namespace: dbt_utils
    search_order: ['tsql_utils', 'dbt_utils']
  - macro_namespace: dbt_date
    search_order: ['tsql_utils', 'dbt_date']

Convert surrogate keys from hash to int for Power BI

Hey Guys,

yes... ANOTHER surrogate key issue from me... But I think you guys are the right people to nerd around with for this type of thing :)

Problem Description

I am visualizing the models I create with dbt in Power BI, and I imagine many of you are doing the same.

One of the best practices in Power BI is to keep columns small and reduce their uniqueness. Unfortunately hash surrogate keys don't fit that best practice at all.

You can't use the binary column type for relationships and instead have to use the string type. So all your keys are 32 characters long and highly unique.

I analysed my model with Dax Studio and found that a very large portion of the model size is due to surrogate keys.

So one negative impact that is definitely happening is a bigger model size and RAM consumption of the Power BI datasets.

Additionally, there are some claims that it also affects the performance if you create relationships on large text columns.

Here is an article explaining the performance impact, and here is a Reddit discussion.

To be honest, I am skeptical about that. Since relationships are pre-defined, it would be silly if Power BI didn't optimize away the strings... (but I don't know for sure).

Is this even a problem that needs solving?

So my first question is: How do you handle the problem? Or do you think it's not important to handle at all?

Does the simplicity of dbt's hash keys and the productivity increase of dbt outweigh the downsides when using Power BI for you?

Do you have any idea on how to benchmark whether hash keys have an impact on query performance for imported Power BI datasets?

Possible solution

I have also come up with a prototype to solve this issue. I have not implemented it in production yet (and I am still undecided whether I actually should), but it seems to work well enough to get your opinion here.

Dimension table example

The model below has a test_int_key column that is set to null when the model is built.

It also has a post_hook that updates that column with an int value corresponding to the hash key.

{{ 
    config(
        materialized='table', 
        as_columnstore = false,
        post_hook=["
            {{ gen_int_surrogate_keys(this, 'test_key', 'test_int_key')}}
        "]
    )
}}

with src_test as (
    
    select * from {{ref('src_test')}}
    
),

dim_test as (
    select
        {{ dbt_utils.surrogate_key(["test_id"]) }} as test_key,
        cast(null as int) as test_int_key, -- Will be replaced by final int key in macro
        status,
        foo,
        bar
    from src_test
)
select * from dim_test

Macro that maintains a table with int keys for each hash key

This macro creates a table for the primary key of the dimension with an identity property for the int_key and another column for the hash_key.

When the post_hook runs this macro, it merges all new hash keys into the table, which generates new rows with a new unique int_key for them (due to the identity column property).

Then the macro updates the int_key column of the dimension table with int keys that correspond to the hash keys.

{%- macro gen_int_surrogate_keys(this,
                                 hash_key_col_name,
                                 int_key_col_name) -%}
    {% 
        set int_table = this.schema + '.int_keys_' + this.table
    %}

    -- Create table to store int keys for hash keys if it doesn't exist
    if object_id ('{{ int_table }}', 'U') is null
    begin
        create table {{ int_table }} (
            int_key int identity(1,1) not null,
            hash_key varbinary(8000)
        )
        {% set idx_name = 'int_keys_' + this.table + '__index_on_hash_key' %}
        create nonclustered index {{ idx_name }}
            on {{ int_table }} (hash_key)
    end;

    -- Merge new hash keys that are not in int_table yet
    with hash_key_data as (
        select
            {{ hash_key_col_name }} as hash_key
        from {{ this }}
    )
    merge {{ int_table }} target_tbl
    using hash_key_data src_tbl
        on target_tbl.hash_key = src_tbl.hash_key
    when not matched by target
    then insert (hash_key) values (src_tbl.hash_key);

    -- Update orig table's int_key column with int keys
    update 
        {{ this }}
    set 
        {{ this }}.{{ int_key_col_name }} = int_key.int_key 
    from {{ int_table }} int_key
    where 
        {{ this }}.{{ hash_key_col_name }} = int_key.hash_key
{%- endmacro -%}

Fact Table View

I generate table models for my facts and dimensions. And then I create views with renamed columns that I actually import in Power BI.

The underlying dimension tables contain the int_key columns. The fact tables are not regenerated to include the int key columns.

Instead I join to the dimensions in the report views on the hash key, and then retrieve the int key from the dimension:

with fact_test as (
    
    select * from {{ref('fact_test')}}
    
),

dim_test as (
    
    select * from {{ref('dim_test')}}
    
),

final as (
    select 
        -- Use the int key as the key instead of the hash key
        dim_test.test_int_key as TestKey,
        spend as Spend,
        impressions as Impressions,
        clicks as Clicks,
        conversions as Conversions,
    from fact_test

    -- Join to the dimension table using the hash key.
    -- This way we don't have to recreate the actual
    -- fact tables with the new int keys;
    -- we do it dynamically in the report views.
    left join dim_test
    on fact_test.test_key = dim_test.test_key
)

select * from final

What do you think?

This increases model build time and probably makes the Power BI refresh a bit longer due to the join in the report view.

But the upside is that the model size and RAM consumption would be lower. It also could potentially improve query performance.

Do you think this is an issue worth bothering with? Or is the upside too small?

And what do you think of my idea to solve it? Do you perhaps have improvement suggestions or other ideas of how this could be solved better?

And do you think there is a big enough need for this that we should integrate the solution into tsql_utils?

The insert_by_period materialization for TSQL

The insert_by_period materialization is very useful when populating large amounts of historical data. It enables the efficient use of indexes and allows work to be chunked into smaller pieces.

Currently, this materialization is disabled for TSQL.

Having started transitioning our work to dbt recently, I have a working implementation of the insert_by_period materialization that's currently running in production. There are a few additional features I've implemented to allow some developer flexibility. I think it is now stable enough to contribute externally.

The implementation is based on Alex Higg's implementation for Snowflake.

cc: @alittlesliceoftom

Cannot get dbt-utils to work

I have a bare-bones install of dbt-sqlserver, with the following packages:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.1
  - package: dbt-msft/tsql_utils
    version: 0.8.1

and this dispatch config in dbt_project.yml:

dispatch:
  - macro_namespace: dbt_utils
    search_order: ['tsql_utils', 'dbt_utils']

When I run dbt test --no-version-check I get the error:

19:08:58  Running with dbt=1.1.2
19:08:58  Unable to do partial parsing because a project dependency has been added
19:08:58  Unable to do partial parsing because a project config has changed
19:08:59  Encountered an error:
Compilation Error
  In dispatch: No macro named 'any_value' found
      Searched for: 'Arca.sqlserver__any_value', 'Arca.default__any_value', 'dbt.sqlserver__any_value', 'dbt.default__any_value'

Not sure what to do to fix this.
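
If the adapter version in play simply lacks the cross-database any_value macro, one way to unblock this locally is to define it in your own project (a sketch; min() is a serviceable stand-in, since any_value only needs to return some value from each group):

{% macro sqlserver__any_value(expression) -%}
    min({{ expression }})
{%- endmacro %}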

CI Failing due to an Incompatible Seed on Synapse

The CI is failing during the dbt seed stage due to an ill-formatted file (for Synapse); the file is formatted fine for the Azure SQL runs.

This means that the tests aren't running for Synapse.

The data is not stored in this repo, but in dbt_utils (https://github.com/fishtown-analytics/dbt-utils/blob/master/integration_tests/data/etc/data_people.csv)

A local fix could be to let dbt seed fail gracefully without raising an error to the CI and preventing the tests from running. Or we could identify why this file does not work and fix that. Alternatively, we could --exclude the file's load, but we would then also have to exclude its downstream tests, which quickly gets messy.

Error while using the generate_surrogate_key macro with one field. The concat function requires 2 to 254 arguments

Current Behavior

Currently we need to pass a minimum of two fields when calling the dbt_utils.generate_surrogate_key macro from a model.

with listing_hosts as (
    select
        xxxxxxx
    from
       xxxx
),

dim_host as (
    select
        yyyyyy
    from
       yyyy
)
select
    {{ dbt_utils.generate_surrogate_key([
        'host_id', 'host_id'
    ]) }} as host_sid,
    {{ cols_host }}
from dim_host

Expected behavior

The macro should generate a surrogate key even if I pass only one field as input.

with listing_hosts as (
    select
        xxxxxxx
    from
       xxxx
),

dim_host as (
    select
        yyyyyy
    from
       yyyy
)
select
    {{ dbt_utils.generate_surrogate_key([
        'host_id'
    ]) }} as host_sid,
    {{ cols_host }}
from dim_host

Error message while passing only one field:

14:52:03  Unhandled error while executing target\run\xxxx\xx\xxxx\xxx\xxxxx.sql
('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Incorrect syntax near the keyword 'from'. (156) (SQLMoreResults); [42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Incorrect syntax near the keyword 'order'. (156); [42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]The concat function requires 2 to 254 arguments. (189)")

add support for date_spine

In this dbt Slack thread, someone shared a dbt-utils-esque version of the date_spine macro.

We should bring this into this package. It might even work for Synapse as well; I don't see any recursive CTEs...

steps to implement macro

  1. create a datetime folder inside of tsql-utils/macros/dbt_utils/
  2. put the macro in a file called date_spine.sql in the newly created datetime folder
  3. add a macro to the end of the file to make it also work for synapse
    {% macro synapse__date_spine(datepart, start_date, end_date) -%}
        {% do return( tsql_utils.sqlserver__date_spine(datepart, start_date, end_date)) %}
    {%- endmacro %}
  4. delete these lines:

     datetime:
       test_date_spine: &disabled # BROKEN DUE TO MODEL DEF
         +enabled: false

example macro

{% macro sqlserver__date_spine_sql(start_date, end_date, datepart) %}
with
l0 as (
    select c
    from (select 1 union all select 1) as d(c)
),
l1 as (
    select
        1 as c
    from l0 as a
    cross join l0 as b
),
l2 as (
    select 1 as c
    from l1 as a
    cross join l1 as b
),
l3 as (
    select 1 as c
    from l2 as a
    cross join l2 as b
),
l4 as (
    select 1 as c
    from l3 as a
    cross join l3 as b
),
l5 as (
    select 1 as c
    from l4 as a
    cross join l4 as b
),
nums as (
    select row_number() over (order by (select null)) as rownum
      from l5
),
rawdata as (
    select top ({{ dbt_utils.datediff(start_date, end_date, datepart) }}) rownum - 1 as n
    from nums
    order by rownum
),
all_periods as (
    select (
        {{
            dbt_utils.dateadd(
                datepart,
                'n',
                start_date
            )
        }}
    ) as date_{{datepart}}
    from rawdata
),
filtered as (
    select *
    from all_periods
    where date_{{datepart}} <= {{ end_date }}
)
select * from filtered
{% endmacro %}
{% macro sqlserver__date_spine(datepart, start_date, end_date) -%}
    {% set date_spine_query %}
        {{sqlserver__date_spine_sql(start_date, end_date, datepart)}} order by 1
    {% endset %}
    {% set results = run_query(date_spine_query) %}
    {% if execute %}
    {% set results_list = results.columns[0].values() %}
    {% else %}
    {% set results_list = [] %}
    {% endif %}
    {%- for date_field in results_list %}
        select '{{ date_field }}' as date_{{datepart}} {{ 'union all ' if not loop.last else '' }}
    {% endfor -%}
{% endmacro %}
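
Once ported, a model would call it through the dbt_utils namespace as usual; for example:

{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2020-01-01' as date)",
    end_date="cast('2021-01-01' as date)"
) }}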

`dbt_utils.hash` failing for synapse

I don't think this is an error with the macro itself, but something weird going on with equality comparison. For example, why aren't the values below equal in Synapse? The only difference is casing, which smells like a case-sensitive collation on the Synapse test database.

actual                            expected
D41D8CD98F00B204E9800998ECF8427E  d41d8cd98f00b204e9800998ecf8427e
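
If it really is just casing, a workaround could be to normalize the hash in a Synapse-specific override (a sketch, assuming the adapter dispatches synapse__ before sqlserver__):

{% macro synapse__hash(field) %}
    lower(convert(varchar(50), hashbytes('md5', {{ field }}), 2))
{% endmacro %}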

BUG: dbt-utils drop_old_relations doesn't take into account model aliases

Hi all, first contribution to a project so please let me know if I've done anything wrong! 😄

Only a small change but a key one for the functionality of this macro, I think.

Currently, when collecting the model names to check for their presence in the target DB, the macro only looks at node.name. This means aliased models are not checked correctly, and models still in use get dropped.

A simple suggested fix is to swap current_models.append(node.name) for current_models.append(node.name if node.config.alias == None else node.config.alias); see the sketch below.
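
In macro form, the swap would look roughly like this (a sketch of the relevant loop body only, not the full drop_old_relations macro):

{# before: aliased models are missed #}
{% do current_models.append(node.name) %}

{# after: prefer the configured alias when one is set #}
{% do current_models.append(node.config.alias if node.config.alias is not none else node.name) %}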

Changes to dispatch in dbt v0.20

Hey T-SQL team!

Given that this package exists alongside dispatch, but doesn't dispatch macros itself, I think this will only require a change to the README. Once users and packages have upgraded to the new dispatch syntax in dbt v0.20, they'll want to specify a first-order project config in dbt_project.yml:

 dispatch:
   - macro_namespace: dbt_utils
     search_order: ['tsql_utils', 'dbt_utils']
   - macro_namespace: dbt_date
     search_order: ['tsql_utils', 'dbt_date']
   - macro_namespace: dbt_expectations
     search_order: ['tsql_utils', 'dbt_expectations']

Instead of:

 vars:
   dbt_utils_dispatch_list: ['tsql_utils']
   dbt_date_dispatch_list: ['tsql_utils']
   dbt_expectations_dispatch_list: ['tsql_utils']

Two crucial differences in the new syntax:

  • Users must always specify the name of the dispatching package itself within the search order list
  • Users should only specify namespaces for packages that they actually have installed. If someone is only using dbt_utils, they should only specify that one.

In the meantime, there is backwards compatibility for the old syntax, so users on old versions of dbt, or using old versions of packages, can still specify those vars.

communicate what macros are not supported by tsql-utils

Currently, the only place to see which macros are not supported is integration_tests/dbt_utils/dbt_project.yml (see below), which is not intuitive; and while the YAML is precise, the models and tests in the integration tests don't necessarily correspond 1:1 with individual macros.

Maybe something in the README.md? Or perhaps one issue per macro? However, I know we'll never support dbt_utils.group_by(); is that worth mentioning?

currently unsupported

After #17, the macros we could still port are:
test_split_part() (not working on Synapse?)
test_date_spine()
insert_by_period materialization
generate_series()
get_column_values()
get_relations_by_pattern()
get_relations_by_prefix_and_union()
get_url_host()
get_url_parameter()
get_url_path()

models:
  dbt_utils_integration_tests:
    +enabled: true
    cross_db_utils:
      test_dateadd: &disabled
        +enabled: false
      test_datediff: *disabled
      test_hash: *disabled
      test_last_day: *disabled
      test_split_part:
        +enabled: "{{ target.name != 'synapse' }}"
    datetime:
      test_date_spine: *disabled
    materializations: *disabled
    schema_tests:
      data_test_mutually_exclusive_ranges_no_gaps: *disabled
      data_test_mutually_exclusive_ranges_with_gaps: *disabled
      data_test_not_constant: *disabled
      data_test_relationships_where_table_2: *disabled
      data_test_unique_where: *disabled
      data_test_not_null_where: *disabled
      # the following work but only when
      # the dbt-utils submodule macros are overwritten
      data_test_at_least_one: *disabled
      data_people: *disabled
      data_test_expression_is_true: *disabled
      data_test_not_constant: *disabled
    sql:
      test_generate_series: *disabled
      test_get_column_values: *disabled
      test_get_relations_by_pattern: *disabled
      test_get_relations_by_prefix_and_union: *disabled
      test_groupby: *disabled
      get_query_results_as_dict: *disabled
      test_surrogate_key: *disabled
      test_union: *disabled
    web:
      test_url_host: *disabled
      test_url_path: *disabled
      test_urls: *disabled

dbt compiler generates code with a nested CTE, causing a compilation failure

While using dbt_utils.deduplicate inside a CTE, the dbt compiler generates code containing a nested CTE, which then throws a compilation error at run time:

Incorrect syntax near the keyword 'with'

('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
Incorrect syntax near the keyword 'with'. (156)
(SQLMoreResults); [42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
Incorrect syntax near the keyword 'with'. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.
(319); [42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Incorrect syntax near 'natural'. (102)")

How to reproduce: I am creating a model that references another model, stg_dim_listings.

My Code:

{% set cols = dbtplyr.get_column_names(ref('stg_dim_listings')) %}
{% set cols_host = dbtplyr.starts_with('host_', cols) | join(', ') %}
with listing_hosts as (
    select
        {{ cols_host }}
    from
        {{ ref('stg_dim_listings') }}
),
dim_host as (
    {{ dbt_utils.deduplicate(
        relation='listing_hosts',
        partition_by='host_id',
        order_by='host_listings_count desc',
       )
    }}
)
select
    {{ dbt_utils.generate_surrogate_key([
        'host_id'
    ]) }} as host_sid,
    {{ cols_host }}
from dim_host

The dbt-generated code from the compiled folder:

with listing_hosts as (
    select
        host_id, host_url, host_name, host_location, host_about, host_response_time, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, host_since, host_response_rate, host_acceptance_rate
    from
        "azuresql_dbt_db"."staging"."stg_dim_listings"
),

dim_host as (
    with row_numbered as (
        select
            _inner.*,
            row_number() over (
                partition by host_id
                order by host_listings_count desc
            ) as rn
        from listing_hosts as _inner
    )
select
 distinct data.*
 from listing_hosts as data
   natural join row_numbered
    where row_numbered.rn = 1
)
select convert(varchar(50), hashbytes('md5', concat(coalesce(cast(host_id as VARCHAR(MAX)), '_dbt_utils_surrogate_key_null_'), '')), 2)
 as host_sid, host_id, host_url, host_name, host_location, host_about, host_response_time, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, host_since, host_response_rate, host_acceptance_rate
from dim_host

From my understanding, a nested WITH is not allowed in SQL Server. Kindly correct me if I am wrong in my analysis.
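
Until the macro can emit subquery-friendly SQL, one workaround is to hand-roll the row_number() dedup as sibling CTEs instead of calling dbt_utils.deduplicate inside one; a sketch of the same logic:

with listing_hosts as (
    select {{ cols_host }} from {{ ref('stg_dim_listings') }}
),

row_numbered as (
    select
        listing_hosts.*,
        row_number() over (
            partition by host_id
            order by host_listings_count desc
        ) as rn
    from listing_hosts
),

dim_host as (
    select * from row_numbered where rn = 1
)

select
    {{ dbt_utils.generate_surrogate_key(['host_id']) }} as host_sid,
    {{ cols_host }}
from dim_host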

Consider using varbinary in hash() instead of varchar

Hi Guys,

First of all, thank you for creating this package. Your efforts are much appreciated.

I noticed the hash adapter converts the md5 bytes into a varchar:

{% macro sqlserver__hash(field) %}
    convert(varchar(50), hashbytes('md5', {{field}}), 2)
{% endmacro %}

If we leave the hash as varbinary, it uses only 16 bytes of data, but as varchar it uses 32. I suspect it's also slower to join on a 32-character varchar than on a varbinary.

Would it be possible to take out that conversion?

I have been using varbinary in my own adapter for several months without problems. The only issue is that Power BI doesn't support relationships on varbinary columns.

To solve that problem I created a separate macro that does the conversion when it's needed:

{% macro cast_hash_to_str(col) -%}
  convert(varchar(32), {{ col }}, 2)
{%- endmacro %}

And in my final model that is consumed by Power BI, I just do something like...

final as (
    select
        {{ cast_hash_to_str('url_key') }} as UrlKey,
        ...
    from whatever
)

This way the hashes use less space in the DB and join faster, and when necessary, such as in Power BI, they are converted to strings.

Date_spine macro not able to pass full date range to SQL Server

I get an error when I try to create a date dimension based on the date spine macro using a range of 1900-01-01 to 2999-01-01:

Error: ('42000', '[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Some part of your SQL statement is nested too deeply. Rewrite the query or break it up into smaller queries. (103003) (SQLExecDirectW)')

Note that the date spine macro is apparently able to generate dates from 1900-01-01 through 1933-06-21 using the script below; starting from 1933-06-22 it fails.

Copying the script from the compiled folder and pasting it directly into SQL Server Management Studio causes no problem. The issue appears to be in passing the compiled script to SQL Server.

Steps to reproduce:

with

Base_Date as (
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('1900-01-01' as date)",
        end_date="cast('2999-12-31' as date)"
        )
    }}
)
select * from Base_Date

Tested on:
SQL Server 14.0.3445.2
dbt-core: 1.1.2
dbt-synapse: 1.1.0
dbt-sqlserver: 1.1.0
