great-expectations / great_expectations

Always know what to expect from your data.

Home Page: https://docs.greatexpectations.io/

License: Apache License 2.0

Languages: Python 99.13%, Jupyter Notebook 0.11%, CSS 0.04%, Lua 0.03%, Dockerfile 0.01%, Shell 0.04%, Jinja 0.52%, JavaScript 0.12%
Topics: pipeline-tests, dataquality, datacleaning, datacleaner, data-science, data-profiling, pipeline, pipeline-testing, cleandata, dataunittest

great_expectations's Introduction


About GX OSS

GX OSS is a data quality platform designed by and for data engineers. It helps you surface issues quickly and clearly while also making it easier to collaborate with nontechnical stakeholders.

Its powerful technical tools start with Expectations: expressive and extensible unit tests for your data. As you create and run tests, your test definitions and results are automatically rendered in human-readable plain-language Data Docs.

Expectations and Data Docs create verifiability and clarity throughout your data quality process. That means you can spend less time translating your work for others, and more time achieving real mutual understanding across your entire organization.

Data science and data engineering teams use GX OSS to:

  • Validate data they ingest from other teams or vendors.
  • Test data for correctness post-transformation.
  • Proactively prevent low-quality data from moving downstream and becoming visible in data products and applications.
  • Streamline knowledge capture from subject-matter experts and make implicit knowledge explicit.
  • Develop rich, shared documentation of their data.

Learn more about how data teams are using GX OSS in case studies from Great Expectations.

See Down with pipeline debt for an introduction to our pipeline data quality testing philosophy.

Our upcoming 1.0 release

We’re planning a ton of work to take GX OSS to the next level as we move to 1.0!

Our biggest goal is to improve the user and contributor experiences by streamlining the API, based on the feedback we’ve received from the community (thank you!) over the years.

Learn more about our plans for 1.0 and how we’ll be making this transition in our blog post.

Get started

GX recommends deploying GX OSS within a virtual environment. For more information about getting started with GX OSS, see Get started with Great Expectations.

  1. Run the following command in an empty base directory inside a Python virtual environment to install GX OSS:

    pip install great_expectations
  2. Run the following Python code to import the great_expectations module and create a Data Context:

    import great_expectations as gx
    
    context = gx.get_context()
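
From there, a minimal next step might look like the following sketch, assuming the fluent pandas datasource API available in recent releases; the file path and column name are placeholders:

    import great_expectations as gx

    context = gx.get_context()

    # Read a CSV with the default pandas datasource and run one expectation.
    validator = context.sources.pandas_default.read_csv("./data/sample.csv")
    result = validator.expect_column_values_to_not_be_null("vendor_id")
    print(result.success)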

Get support

Contribute

We deeply value the contributions and engagement of our community. We’re temporarily pausing the acceptance of new pull requests (PRs). We’re going to be updating the API and codebase frequently and significantly over the next few months—we don’t want contributors to spend time and effort only to find that we’ve just implemented a breaking change for their work.

Hold onto your fantastic ideas and PRs until after the 1.0 release, when we will be excited to resume accepting them. We appreciate your understanding and support as we make the final push toward this exciting milestone. Watch for updates in our Slack community, and thank you for being a crucial part of our journey!

Code of conduct

Everyone interacting in GX OSS project codebases, Discourse forums, Slack channels, and email communications is expected to adhere to the GX Community Code of Conduct.

great_expectations's People

Contributors

abegong, alexsherstinsky, anhollis, anthonyburdi, austiezr, ayirplm, aylr, billdirks, cdkini, cselig, dependabot[bot], derekma73, donaldheppner, eugmandel, jcampbell, joshua-stauffer, kenwade4, kilo59, kwcanuck, kyleaton, nathanfarmer, petermoyer, rachel-reverie, roblim, shinnnyshinshin, spbail, szecsip, talagluck, trangpham, tyler-hoffman


great_expectations's Issues

Feature/unit test refactor

Unit tests have been refactored and converted to work in Python 3. See the commit comments for specific details.

Add more thorough unit tests for...

Add more thorough unit tests for:

  • expect_table_row_count_to_be_between
  • expect_table_row_count_to_equal
  • expect_column_values_to_be_dateutil_parseable
  • expect_column_values_to_be_valid_json
  • expect_column_stdev_to_be_between
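
For illustration, a pytest-style sketch of what a more thorough test might look like, using a simplified stand-in for the first expectation (the helper below is hypothetical, not the library function):

    import pandas as pd

    def expect_table_row_count_to_be_between(df, min_value, max_value):
        # Simplified stand-in: report the observed count alongside success.
        n = len(df)
        return {"success": min_value <= n <= max_value, "true_value": n}

    def test_row_count_between():
        df = pd.DataFrame({"x": [1, 2, 3]})
        assert expect_table_row_count_to_be_between(df, 1, 10)["success"]
        assert not expect_table_row_count_to_be_between(df, 5, 10)["success"]
        # Boundary cases are inclusive on both ends.
        assert expect_table_row_count_to_be_between(df, 3, 3)["success"]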

What additional logic should we pack into Expectation decorators?

What are all the generic parameters that Expectations should accept?

All Expectations

  • output_format
  • include_kwargs
  • catch_exceptions
  • exclude_null_values?

For column_map_expectations

  • mostly

For column_aggregate_expectations

  • confidence_threshold

What other logic can we include?

  • Input validation

  • Output validation

    • Is JSON serializable
    • Has expected fields, etc.
  • Docstring propagation...

  • Create and append the Expectation to the dataset

  • Logic for de-duplication/updating Expectations
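
For concreteness, here is a minimal sketch of how a column_map_expectation decorator could bundle several of these behaviors. It assumes self is a pandas-DataFrame-like dataset, and every name is illustrative rather than the actual GX implementation:

    import json
    from functools import wraps

    def column_map_expectation(func):
        """Illustrative decorator bundling several of the behaviors above."""
        @wraps(func)  # docstring propagation
        def wrapper(self, column, mostly=1.0, catch_exceptions=False,
                    include_kwargs=False, **kwargs):
            try:
                series = self[column].dropna()  # exclude null values
                passed = series.map(lambda value: func(self, value, **kwargs))
                success = bool(passed.mean() >= mostly) if len(passed) else True
                result = {"success": success}
                if include_kwargs:  # echo inputs back for reproducibility
                    result["kwargs"] = {"column": column, "mostly": mostly, **kwargs}
            except Exception as exc:
                if not catch_exceptions:
                    raise
                result = {"success": False, "raised_exception": repr(exc)}
            json.dumps(result)  # output validation: must be JSON-serializable
            return result
        return wrapper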

Should it be easy to simultaneously create many expectations?

Consider the case of something like the following:

for column in df.columns:
    df.expect_column_mean_to_be_between(column, min_value, max_value)

Currently this works, but we would need to wrap each expectation call in print() to see any output at all, and even then we could not tell which column an expectation was about unless we also printed the dictionary returned by the expectation. Is this a useful pattern?
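
One way to make the pattern more useful (illustrative only, assuming the result-dict shape discussed in these issues) is to collect labeled results rather than printing inside the loop:

    # Key each result by column so the output shows which column an
    # expectation was about. min_value/max_value are placeholders.
    results = {
        column: df.expect_column_mean_to_be_between(column, min_value, max_value)
        for column in df.columns
    }
    for column, result in results.items():
        print(column, result["success"])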

Proposal: Use a WeightedPartitions object for distributional expectations

{
  "partitions" : [0.0, 0.1, 0.3, 0.6, 1.0],
  "weights" : [0.4, 0.05, 0.25, 0.3, 0.0]
}

Partitions specifies the lower bound for each partition.
Weights specifies the total mass within each partition. (lower_bound <= value < upper_bound)

  • The number of entries in the partitions and weights lists must be equal.
  • For convenience, partitions are always sorted in ascending order.
  • Weights must sum to exactly 1.0.

Note: Is there a JSON-serializable version of inf and -inf?
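
A small validator makes the invariants above concrete (a sketch; the function name is hypothetical, and a comment addresses the inf question):

    import math

    def validate_weighted_partitions(obj):
        partitions, weights = obj["partitions"], obj["weights"]
        assert len(partitions) == len(weights)   # equal numbers of entries
        assert partitions == sorted(partitions)  # ascending order
        assert math.isclose(sum(weights), 1.0)   # total mass is exactly 1.0
        # On the inf question: strict JSON has no inf/-inf tokens; Python's
        # json.dumps(float("inf")) emits the non-standard literal Infinity.
        return True

    validate_weighted_partitions({
        "partitions": [0.0, 0.1, 0.3, 0.6, 1.0],
        "weights": [0.4, 0.05, 0.25, 0.3, 0.0],
    })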

Putting weights and partitions together into a single object has several advantages:

  • Can be passed/returned through true_value and other parameters
  • Simpler to test
  • Less prone to accidental separation in exploratory workflows

Using PDF instead of CDF has some advantages, too

  • Unified representation for categorical and continuous data
  • More user-friendly graphs
  • Still information-complete for calculating CDFs
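
The last point is easy to demonstrate: a CDF is just the running total of the partition weights.

    from itertools import accumulate

    weights = [0.4, 0.05, 0.25, 0.3, 0.0]
    cdf = list(accumulate(weights))  # ≈ [0.4, 0.45, 0.7, 1.0, 1.0], up to rounding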

Create issues from this issue. :)

Notes from July 10th call:

Custom expectations:

  • Standardize on column_map_expectation and column_aggregate_expectation
  • Drop column_elementwise_expectation for now. (If we discover we need it, we can add it back.)
  • Add better worked examples in the docs.
  • We also need to add a prototyping syntax for expectations that doesn't require subclassing and decorators. Something along the lines of:
dataset.expect_function_to_be_elementwise_true('column', function)
   => assert (df['column'].apply(function) == True).all()
dataset.expect_function_to_be_true('column', function)
   => assert function(df['column']) == True
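
A rough, hypothetical implementation of these helpers over a plain pandas DataFrame (names and result shape are illustrative, not released API):

    import pandas as pd

    def expect_function_to_be_elementwise_true(df, column, function):
        passed = df[column].apply(function)  # one boolean per element
        return {"success": bool(passed.all()),
                "exception_list": df[column][~passed].tolist()}

    def expect_function_to_be_true(df, column, function):
        return {"success": bool(function(df[column]))}  # one boolean per column

    df = pd.DataFrame({"age": [12, 30, 45]})
    print(expect_function_to_be_elementwise_true(df, "age", lambda x: x > 0))
    print(expect_function_to_be_true(df, "age", lambda s: s.mean() > 10))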

Output formats

  • Make it clear in the docs: output_format is categorical, not strictly ordered. This makes the output API more flexible and extensible.
  • Bring the docs up to date (e.g., true_value for aggregate_column_expectations).
  • Change include_lineage to include_kwargs. Also make it clear that expectations have only kwargs, no args.
  • Think about including row_index_list as a return value. (This gets complicated in some non-pandas systems.)
  • What about error messages and handling in expectations?
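
On the first point, a hypothetical illustration of output_format as a set of named, categorical shapes (the dict contents are assumptions, not the released API):

    # Each format is an alternative shape, not a rung on a verbosity ladder.
    def format_result(result, output_format="BASIC"):
        if output_format == "BOOLEAN_ONLY":
            return result["success"]
        if output_format == "BASIC":
            return {"success": result["success"], "kwargs": result["kwargs"]}
        if output_format == "COMPLETE":
            return result  # may carry row_index_list, costly outside pandas
        raise ValueError(f"unknown output_format: {output_format}")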

How should we implement distributional expectations within the new expectation decorators?

Distributional expectations are different from all the other @column_aggregate_expectations:

They need to accept a confidence_threshold argument, similar to mostly for column_map_expectations. Unlike mostly, confidence_threshold isn't optional.

In addition to a true_value, they should also return a confidence_value:

{
  "success": boolean,
  "true_value": partitioned_weights,
  "confidence_value": float on [0, 1]
}

The difference isn't fundamentally because these are expectations about distributions. The difference is because these are statistical assumptions.

How should we capture this in our expectations?

Option 1: Create a @column_statistical_expectation
Option 2: Add parameters to the existing distributional expectations to give them the extra fields and behavior they need.

I lean towards (1). Jerry-rigging extra fields and parameters in (2) seems like it could get sticky pretty fast. And statistical expectations are a pattern that I expect to use more in the future.

@jcampbell, @dgmiller Thoughts?
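
To make Option 1 concrete, a minimal sketch of what such a decorator might look like (all names illustrative; success is assumed to mean confidence_value >= confidence_threshold):

    from functools import wraps

    def column_statistical_expectation(func):
        @wraps(func)
        def wrapper(self, column, confidence_threshold, **kwargs):
            # Unlike `mostly`, confidence_threshold is a required argument.
            true_value, confidence_value = func(self, self[column], **kwargs)
            return {
                "success": confidence_value >= confidence_threshold,
                "true_value": true_value,              # e.g. partitioned weights
                "confidence_value": confidence_value,  # float on [0, 1]
            }
        return wrapper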
