rb_quality_plugin's People

Contributors

aivu, alexzlue, rbriski, wjduenow

Forkers

sherrli

rb_quality_plugin's Issues

Send Notification Feature

This feature will add functionality to notify teams when a data quality check fails.

  • Add unit tests for this feature to the YAML loading tests
  • Add the notification function to the Operator
  • An email will only be sent if email addresses are provided AND the test fails
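
The send condition in the last bullet can be sketched as a small predicate. This is only an illustration: the function name `should_send_notification` and its parameters are hypothetical and not part of the plugin.

```python
from typing import Optional, Sequence


def should_send_notification(emails: Optional[Sequence[str]], check_passed: bool) -> bool:
    """Return True only when recipients were provided AND the check failed.

    Hypothetical helper; the operator's real attribute names may differ.
    """
    # An empty list or None means nobody asked to be notified.
    has_recipients = bool(emails)
    return has_recipients and not check_passed
```

Keeping the decision in one pure function makes it easy to cover in the unit tests without sending real email.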

yaml dq operator

Create a DQ operator that will process a DQ configuration YAML file. This operator will be called and instantiated from the utility function described in issue #14.

The DQ task will take a format similar to this:

dq_task = DQYAMLOperator(
    task_id='dq_task',
    dq_config_path='/path/to/yaml/config',
    dq_check_args={'check_arg': 'arg_val'},
    dag=dag
)

Edit (5/11): Add the YAML implementation to the current DQ operators rather than creating new operators.

Filtered Message Writer

We may have a policy that all DQ checks be logged to a table. At the same time, there may be checks that we also want sent to Slack, Teams, or email. We can do this with the MultiWriter. However, it may be the case that we always log to a table but send to the secondary location only under more constrained circumstances: as alerts rather than logs.

I see a need for three new wrapper message writers. Each would take a child, another MessageWriter, as an argument:

  • Filter would take as a second argument a function func, and only call send_message on its child if func(message) returned True
  • CheckPassedFilter would only call send_message on its child if the data quality check passed
  • CheckFailedFilter would only call send_message on its child if the data quality check failed

We would need to extend the default expected values in message to include a passed attribute, which would allow us to define CheckPassedFilter and CheckFailedFilter by using Filter in this way:

class CheckPassedFilter(Filter):
  def __init__(self, child, *args, **kwargs):
    # Forward only messages whose check passed.
    super().__init__(child, lambda message: message['passed'], *args, **kwargs)

class CheckFailedFilter(Filter):
  def __init__(self, child, *args, **kwargs):
    # Forward only messages whose check failed.
    super().__init__(child, lambda message: not message['passed'], *args, **kwargs)
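
For context, the Filter base class these subclasses rely on might look like the sketch below. The `MessageWriter` base and the `ListWriter` test double are assumptions made for illustration; the plugin's actual writer interface may differ.

```python
class MessageWriter:
    """Hypothetical minimal base class; the real interface may differ."""

    def send_message(self, message):
        raise NotImplementedError


class ListWriter(MessageWriter):
    """Illustrative test double that records every message it receives."""

    def __init__(self):
        self.messages = []

    def send_message(self, message):
        self.messages.append(message)


class Filter(MessageWriter):
    """Forward a message to `child` only when `func(message)` is truthy."""

    def __init__(self, child, func):
        self.child = child
        self.func = func

    def send_message(self, message):
        if self.func(message):
            self.child.send_message(message)


# Only failed checks reach the alert writer.
alerts = ListWriter()
failed_only = Filter(alerts, lambda message: not message["passed"])
failed_only.send_message({"task_id": "row_count", "passed": True})
failed_only.send_message({"task_id": "null_check", "passed": False})
```

Because Filter is itself a MessageWriter, it can be placed inside a MultiWriter next to an unfiltered table writer, which gives the "always log, sometimes alert" behavior described above.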

DataQualityThresholdCheckOperator

This operator will take a data quality SQL check, verify the result against a minimum and maximum threshold, and return the result to XCom.

  • Unit test files
  • Operator constructor
  • Operator execute() method
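
The comparison at the heart of execute() could be sketched as follows. The function name, the inclusive bounds, and the shape of the returned dictionary are all assumptions; the dictionary merely stands in for whatever value the operator would return for XCom.

```python
def check_threshold(result, min_threshold=None, max_threshold=None):
    """Compare a query result against optional inclusive bounds.

    Hypothetical helper: a bound left as None is treated as "no limit".
    """
    above_min = min_threshold is None or result >= min_threshold
    below_max = max_threshold is None or result <= max_threshold
    return {
        "result": result,
        "min_threshold": min_threshold,
        "max_threshold": max_threshold,
        "within_threshold": above_min and below_max,
    }
```

Separating the comparison from the database call keeps the unit tests free of any connection setup.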

include utility functions to add dq tasks dynamically

Utility functions need to be added to create DQ tests dynamically.

This function will take a list of YAML paths and create a task for each DQ test.
These YAML configurations may also include parameters to configure parts of the DQ check (suggested by Drew Ellingson).

Alternatively, the function may take a directory path and walk through it to find all YAML tests in that directory.
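
The directory-walking variant could start from a helper like the one below. This is a sketch using only the standard library; the name `find_yaml_configs` and the accepted extensions (`.yaml`, `.yml`) are assumptions.

```python
import os


def find_yaml_configs(directory):
    """Return a sorted list of YAML file paths found under `directory`.

    Hypothetical helper: recurses into subdirectories via os.walk.
    """
    paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith((".yaml", ".yml")):
                paths.append(os.path.join(root, name))
    # Sorting gives a stable task order regardless of filesystem ordering.
    return sorted(paths)
```

The resulting list can then be handed to the same code path that accepts an explicit list of YAML paths.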

Quarantine Operator

Initially by @mtsadler here

We need a paradigm for quarantining only "bad" rows (the rows that cause quality check to fail).

In other words:

  • SQL moves the "bad" rows into a quarantine table
  • SQL moves the "not bad" rows into the target table
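
One way to picture the two statements is a small builder that derives both from a single "bad row" condition. Everything here is an assumption for illustration: the function name, the table names in the usage note, and the choice of `IS NOT TRUE` so that rows where the condition evaluates to NULL land in the target table rather than in neither.

```python
def build_quarantine_sql(source, target, quarantine, bad_row_condition):
    """Build the two INSERT statements that split `source` by one condition.

    Hypothetical helper; identifiers are interpolated as-is, so they must
    come from trusted configuration, not user input.
    """
    quarantine_sql = (
        f"INSERT INTO {quarantine} SELECT * FROM {source} "
        f"WHERE ({bad_row_condition})"
    )
    # IS NOT TRUE also matches rows where the condition is NULL,
    # so every source row ends up in exactly one of the two tables.
    target_sql = (
        f"INSERT INTO {target} SELECT * FROM {source} "
        f"WHERE ({bad_row_condition}) IS NOT TRUE"
    )
    return quarantine_sql, target_sql
```

For example, `build_quarantine_sql("staging.orders", "prod.orders", "quarantine.orders", "amount < 0")` yields one statement per destination table.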

DQ Operators for Bigquery using BQL

I've noticed two issues with the current dq operators:

  1. Running a DataQualityThresholdCheckOperator task through Cloud Composer, I get a deprecation warning that the parameter 'bql' is being used rather than 'sql'. We should be passing any queries into BigQuery using the 'sql' parameter. Relevant task instance log

  2. Also, yesterday's PR sets 'use_legacy_sql=True' as the default. It should be False. We want to use standard SQL, which allows for DML, subselects, etc. 'use_legacy_sql=True' is the default behavior of the BigQuery hook / API, so we have to override it.

Dynamically generate multiple dq checks from one config file

I think there will be instances where we want to use one DQ YAML file to generate multiple DQ check tasks.

E.g., YAML:

task_id: dq_check_{source}_for_recs
min_threshold: 1
sql: |-
    select count(1) from dataset.{source}

triggered from a dag by something like:

SOURCE_TABLES = ['source_1', 'source_2']

ingress_list = [
    (yaml_file_path, {'SOURCE': source}) for source in SOURCE_TABLES
]

ingress_checks = dq_check_tools.create_dq_checks_from_list(dag, ingress_list)

Currently, running this produces the error: The key (task_dq_check_{SOURCE}_1) has to be made of alphanumeric characters, dashes, dots and underscores exclusively

Looking through the helper tool and operator, I don't think there's currently any parameter replacement happening for the task_id field.
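
A possible fix is to apply the same substitution to every string field of the loaded config, task_id included. The sketch below assumes the config is already a plain dictionary and uses `str.format`; the function name `render_config` is hypothetical and not part of the helper tool.

```python
def render_config(config, params):
    """Return a copy of `config` with {placeholders} filled in all string values.

    Hypothetical helper: non-string values (e.g. thresholds) pass through unchanged.
    """
    rendered = {}
    for key, value in config.items():
        if isinstance(value, str):
            rendered[key] = value.format(**params)
        else:
            rendered[key] = value
    return rendered


config = {
    "task_id": "dq_check_{source}_for_recs",
    "min_threshold": 1,
    "sql": "select count(1) from dataset.{source}",
}
rendered = render_config(config, {"source": "source_1"})
```

Note that `str.format` is case-sensitive, so the placeholder in the YAML and the key passed from the DAG must match in case, and SQL containing literal braces would need them doubled.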
