dbt-ml-preprocessing

A package for dbt which enables standardization of data sets. You can use it to build a feature store in your data warehouse, without using external libraries like Spark MLlib or Python's scikit-learn.

The package contains a set of macros that mirror the functionality of the scikit-learn preprocessing module. Originally they were developed as part of the 2019 Medium article Feature Engineering in Snowflake.

They have been tested on Snowflake, Redshift, BigQuery, SQL Server and PostgreSQL 13.2. The test case expectations were built using scikit-learn (see *.py in integration_tests/data/sql), so you can expect behavioural parity with it.

โš ๏ธ There are now several better alternatives to this package. If you're using Snowflake, they now offer the snowflake-ml-python package which is fully supported and much more comprehensive. Within dbt, the Python models feature allows Snowflake, BigQuery and Databricks users to use scikit-learn directly

The macros are:

scikit-learn function   macro name            Snowflake  BigQuery  Redshift  MSSQL  PostgreSQL
KBinsDiscretizer        k_bins_discretizer    Y          Y         Y         Y      Y
LabelEncoder            label_encoder         Y          Y         Y         Y      Y
MaxAbsScaler            max_abs_scaler        Y          Y         Y         Y      Y
MinMaxScaler            min_max_scaler        Y          Y         Y         Y      Y
Normalizer              normalizer            Y          Y         Y         Y      Y
OneHotEncoder           one_hot_encoder       Y          Y         Y         Y      Y
QuantileTransformer     quantile_transformer  Y          Y         N         N      Y
RobustScaler            robust_scaler         Y          Y         Y         Y      Y
StandardScaler          standard_scaler       Y          Y         Y         N      Y

* 2D charts taken from scikit-learn.org, GIFs are my own
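To illustrate the parity claim, here is a minimal pure-Python sketch of the transformation that min_max_scaler mirrors, the same formula scikit-learn's MinMaxScaler applies and the macro expresses in SQL. This is my own illustration, not the package's code:

```python
def min_max_scale(values):
    """Rescale values to [0, 1]: (x - min) / (max - min),
    the formula behind both sklearn's MinMaxScaler and the macro's SQL."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([1.0, 3.0, 5.0]))  # [0.0, 0.5, 1.0]
```

In the warehouse, the macro computes the min and max with window or aggregate functions over the source table rather than in Python.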

Installation

To use this in your dbt project, create or modify packages.yml to include:

packages:
  - package: "omnata-labs/dbt_ml_preprocessing"
    version: [">=1.0.2"]

(replace the version number with the latest release)

Then run: dbt deps to import the package.

dbt 1.0.0 compatibility

dbt-ml-preprocessing version 1.2.0 is the first version to support (and require) dbt 1.0.0.

If you are not ready to upgrade to dbt 1.0.0, please use dbt-ml-preprocessing version 1.0.2.

Usage

To read the macro documentation and see examples, simply generate your docs, and you'll see macro documentation in the Projects tree under dbt_ml_preprocessing:

docs screenshot

dbt-ml-preprocessing's People

Contributors

comaradotcom, dataders, jamesweakley

dbt-ml-preprocessing's Issues

OHE error handling: casting string as boolean?

I'm working to make the one_hot_encoder macro compatible with Azure SQL and Azure Synapse. The compiler takes issue with line 93 because there is no boolean datatype in T-SQL. My question is: what does this cast resolve to in other data warehouses, NULL?

{%- if handle_unknown=='error' %}
case
when {{ source_column }} = '{{ category }}' then true
when {{ source_column }} in ('{{ category_values | join("','") }}') then false
else cast('Error: unknown value found and handle_unknown parameter was "error"' as boolean)
end as is_{{ source_column }}_{{ no_whitespace_column_name }}
{% endif %}
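As I read the snippet, the intended semantics of handle_unknown='error' are: the matched category yields true, any other known category yields false, and an unknown value deliberately fails at query time via an invalid cast. A Python sketch of that logic (hypothetical function name, raising where the SQL forces a cast error):

```python
def one_hot(value, category, known_categories):
    """Mimic the CASE expression: flag for `category`, fail on unknown input."""
    if value == category:
        return True
    if value in known_categories:
        return False
    # The SQL provokes a runtime error via an invalid string-to-boolean cast;
    # in Python we raise instead.
    raise ValueError('unknown value found and handle_unknown parameter was "error"')

print(one_hot("orange", "orange", ["orange", "yellow"]))  # True
print(one_hot("yellow", "orange", ["orange", "yellow"]))  # False
```

On engines without a boolean type, the same "fail loudly" effect could presumably be achieved with a different invalid cast (e.g. casting the error string to a numeric type), but that is an assumption, not something the macro currently does.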

cc: @comaraDOTcom

Support ml-driven preprocessing (e.g. PCA) with SQL

I'd be interested to see an extension of this project to incorporate more ML-driven preprocessing into the pipelines. Thinking things such as PCA, K-means, Bayesian classifiers. Is this a direction you've already considered? Where do you see this fitting into the spectrum from "preprocessing" to "inference"?

What are the prospects for this repo?

Hi,
I just tried to use this package and noticed a couple of bugs, issues that have been open unresolved for several years, unaddressed open pull requests, and a last commit (as of today) 18 months ago.
Do you plan on maintaining this repo or should the community fork it to keep it going?
Thank you!

Narrow the scope of one_hot_encoder (OHE)

In reviewing #4, I echo @comaraDOTcom's sentiment when they say they'd be "more inclined to use the output of the macro as a CTE and select a subset of cols in a subsequent CTE." To that end, I'd like to propose two macros (names debatable):

  • one_hot_encoder_wrapper: the existing functionality with include_columns and exclude_columns, and
  • one_hot_encoder which takes only the source_table, source_column, category_values, and handle_unknown params.

To me, the benefits would be:

  • more direct alignment with sklearn.preprocessing.OneHotEncoder and pandas.get_dummies()
  • enable smaller code footprint for adapters that require dispatching
  • better complement existing package functionality such as dbt_utils.star(), and
  • dbt models that leverage one_hot_encoder will look like other dbt models, instead of a single macro call with no SELECT statement.

Example usage

Suppose a table, fruits that:

  • has 3 columns: id, species, and color; and,
  • the color column has two values: orange and yellow

goal compiled SQL

SELECT
id,
species,
is_color_orange,
is_color_yellow
FROM database.fruits

possible uses

SELECT
id,
species,
{{ dbt_ml_preprocessing.one_hot_encoder(ref('fruits'), 'color') }}
FROM {{ ref('fruits') }}

alternatively if one would like to include or exclude certain columns from the source table, they could do so like this

SELECT
{{ dbt_utils.star(from=ref('fruits'), except=['color']) }},
{{ dbt_ml_preprocessing.one_hot_encoder(ref('fruits'), 'color') }}
FROM {{ ref('fruits') }}

include_columns argument breaks one_hot_encoder macro

I have been working with the one_hot_encoder macro on a feature store and I came across a bug when using it.

Bug

The bug I encountered: when the include_columns argument was not '*', one_hot_encoder inserted blanks for the column names in the select statement, and the code would not compile because of the stray comma left behind.

For example, if the code was this:

gender as ( {{ dbt_ml_preprocessing.one_hot_encoder( source_table = ref('my_table'), source_column = 'GENDER', include_columns = ['ID']) }} )

The compiled SQL file would provide an "unexpected ',' error" because the line would look like:

select , case WHEN GENDER = 'MALE' then true... (and the rest of the compiled sql)

When it should be:

select GENDER, case WHEN GENDER = 'MALE' then true... (and the rest of the compiled sql)

I looked at the code for the one_hot_encoder and found the problem in lines 102-104 (in the default__one_hot_encoder macro).

When include_columns is set to a list of columns, each "column" in "for column in col_list" is already a plain string taken from that list. For example, if include_columns = ['ID'], column is the string 'ID', so column.name evaluates 'ID'.name; strings have no name attribute, and Jinja renders the undefined attribute as an empty string instead of raising an error.

It works fine when include_columns = "*" because then "column" is a Column object such as SnowflakeColumn(column='name', dtype='VARCHAR', char_size=16777216, numeric_precision=None, numeric_scale=None), so column.name returns the actual name.
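The mismatch can be reproduced outside dbt. Using SimpleNamespace as a rough stand-in for dbt's Column objects (an illustration, not dbt's actual classes):

```python
from types import SimpleNamespace

# Stand-in for the Column objects col_list holds when include_columns='*'
col_objects = [SimpleNamespace(name="ID"), SimpleNamespace(name="GENDER")]
# What col_list holds when include_columns=['ID']: plain strings
col_strings = ["ID"]

# `column.name` works on the objects...
print([c.name for c in col_objects])  # ['ID', 'GENDER']

# ...but a string has no `.name`; Jinja renders the missing attribute as
# an empty string (mimicked here with getattr's default), yielding `select , ...`
print([getattr(c, "name", "") for c in col_strings])  # ['']
```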

Quick Solution

I have come up with a quick solution to this bug, and it is probably not the most elegant or robust. In the original code, under line 55, I added a new variable called 'flag' and set it to 'not_inc', like this: {% set flag = 'not_inc' %}. This variable tells the macro whether the include_columns argument is "*" or not. Then under line 60 of the original code, I set flag = 'inc', because reaching that branch means include_columns was set by the programmer: {% set flag = 'inc' %}.

Then on line 70 of the original code, I passed that variable through the adapter.dispatch call, so the new line was:

{{ adapter.dispatch('one_hot_encoder',packages=['dbt_ml_preprocessing'])(source_table, source_column, category_values, handle_unknown, col_list, flag) }}

as opposed to

{{ adapter.dispatch('one_hot_encoder',packages=['dbt_ml_preprocessing'])(source_table, source_column, category_values, handle_unknown, col_list) }}

Finally, for the default__one_hot_encoder macro on line 73 of the original code, I changed the logic up a bit to catch the instances where the include_columns argument was not "*". I replaced lines 76-78 of the original code with:

{%- if flag == 'not_inc' -%}
    {% for column in col_list %}
        {{ column.name }},
    {%- endfor -%}
{%- elif flag == 'inc' -%}
    {% for column in col_list %}
        {{ column }},
    {%- endfor -%}
{%- endif -%}

This logic makes sure that if include_columns is something like ['ID'], the macro does not call column.name on the string 'ID', but takes the column name directly from column. Otherwise, when include_columns="*", it proceeds as before.

This solution may not be robust, so I was hoping to find out why this bug occurs and maybe find a better fix for it.
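One simpler alternative to the flag approach would be to normalize col_list to plain names once, before dispatching, so the inner macro only ever sees one shape. In Python terms (the Jinja equivalent would branch on whether each entry is a string; SimpleNamespace again stands in for dbt's Column objects):

```python
from types import SimpleNamespace

def to_names(col_list):
    """Normalize a mixed list of Column-like objects and strings to plain names."""
    return [c if isinstance(c, str) else c.name for c in col_list]

print(to_names(["ID"]))                              # ['ID']
print(to_names([SimpleNamespace(name="ID"),
                SimpleNamespace(name="GENDER")]))    # ['ID', 'GENDER']
```

This avoids threading an extra flag parameter through every adapter's implementation of the macro, which matters for a package that dispatches per warehouse.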
