great-expectations / great_expectations

Always know what to expect from your data.

Home Page: https://docs.greatexpectations.io/

License: Apache License 2.0

Languages: Python 99.13%, Jupyter Notebook 0.11%, CSS 0.04%, Lua 0.03%, Dockerfile 0.01%, Shell 0.04%, Jinja 0.52%, JavaScript 0.12%
Topics: pipeline-tests, dataquality, datacleaning, datacleaner, data-science, data-profiling, pipeline, pipeline-testing, cleandata, dataunittest

great_expectations's Introduction


About GX OSS

GX OSS is a data quality platform designed by and for data engineers. It helps you surface issues quickly and clearly while also making it easier to collaborate with nontechnical stakeholders.

Its powerful technical tools start with Expectations: expressive and extensible unit tests for your data. As you create and run tests, your test definitions and results are automatically rendered in human-readable plain-language Data Docs.

Expectations and Data Docs create verifiability and clarity throughout your data quality process. That means you can spend less time translating your work for others, and more time achieving real mutual understanding across your entire organization.

Data science and data engineering teams use GX OSS to:

  • Validate data they ingest from other teams or vendors.
  • Test data for correctness post-transformation.
  • Proactively prevent low-quality data from moving downstream and becoming visible in data products and applications.
  • Streamline knowledge capture from subject-matter experts and make implicit knowledge explicit.
  • Develop rich, shared documentation of their data.

Learn more about how data teams are using GX OSS in case studies from Great Expectations.

See Down with pipeline debt for an introduction to our pipeline data quality testing philosophy.

Our upcoming 1.0 release

We’re planning a ton of work to take GX OSS to the next level as we move to 1.0!

Our biggest goal is to improve the user and contributor experiences by streamlining the API, based on the feedback we’ve received from the community (thank you!) over the years.

Learn more about our plans for 1.0 and how we’ll be making this transition in our blog post.

Get started

GX recommends deploying GX OSS within a virtual environment. For more information about getting started with GX OSS, see Get started with Great Expectations.

  1. Run the following command in an empty base directory inside a Python virtual environment to install GX OSS:

    pip install great_expectations
  2. Run the following Python code to import the great_expectations module and create a Data Context:

    import great_expectations as gx
    
    context = gx.get_context()
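
From there, a minimal next step might look like the following sketch, assuming the fluent pandas datasource API available in recent releases; the file path and column name are placeholders:

    import great_expectations as gx

    context = gx.get_context()

    # Read a CSV with the default pandas datasource and run one expectation.
    validator = context.sources.pandas_default.read_csv("./data/sample.csv")
    result = validator.expect_column_values_to_not_be_null("vendor_id")
    print(result.success)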

Get support

Contribute

We deeply value the contributions and engagement of our community. We’re temporarily pausing the acceptance of new pull requests (PRs). We’re going to be updating the API and codebase frequently and significantly over the next few months—we don’t want contributors to spend time and effort only to find that we’ve just implemented a breaking change for their work.

Hold onto your fantastic ideas and PRs until after the 1.0 release, when we will be excited to resume accepting them. We appreciate your understanding and support as we make the final push toward this exciting milestone. Watch for updates in our Slack community, and thank you for being a crucial part of our journey!

Code of conduct

Everyone interacting in GX OSS project codebases, Discourse forums, Slack channels, and email communications is expected to adhere to the GX Community Code of Conduct.

great_expectations's People

Contributors

abegong, alexsherstinsky, anhollis, anthonyburdi, austiezr, ayirplm, aylr, billdirks, cdkini, cselig, dependabot[bot], derekma73, donaldheppner, eugmandel, jcampbell, joshua-stauffer, kenwade4, kilo59, kwcanuck, kyleaton, nathanfarmer, petermoyer, rachel-reverie, roblim, shinnnyshinshin, spbail, szecsip, talagluck, trangpham, tyler-hoffman


great_expectations's Issues

Feature/unit test refactor

Unit tests have been refactored and converted to work in Python 3. See the commit comments for specific details.

Add more thorough unit tests for...

Add more thorough unit tests for:

  • expect_table_row_count_to_be_between
  • expect_table_row_count_to_equal
  • expect_column_values_to_be_dateutil_parseable
  • expect_column_values_to_be_valid_json
  • expect_column_stdev_to_be_between
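
For illustration, a pytest-style sketch of what a more thorough test might look like, using a simplified stand-in for the first expectation (the helper below is hypothetical, not the library function):

    import pandas as pd

    def expect_table_row_count_to_be_between(df, min_value, max_value):
        # Simplified stand-in: report the observed count alongside success.
        n = len(df)
        return {"success": min_value <= n <= max_value, "true_value": n}

    def test_row_count_between():
        df = pd.DataFrame({"x": [1, 2, 3]})
        assert expect_table_row_count_to_be_between(df, 1, 10)["success"]
        assert not expect_table_row_count_to_be_between(df, 5, 10)["success"]
        # Boundary cases are inclusive on both ends.
        assert expect_table_row_count_to_be_between(df, 3, 3)["success"]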

What additional logic should we pack into Expectation decorators?

What are all the generic parameters that Expectations should accept?

All Expectations

  • output_format
  • include_kwargs
  • catch_exceptions
  • exclude_null_values?

For column_map_expectations

  • mostly

For column_aggregate_expectations

  • confidence_threshold

What other logic can we include?

  • Input validation

  • Output validation

    • Is JSON serializable
    • Has expected fields, etc.
  • Docstring propagation...

  • Create and append the Expectation to the dataset

  • Logic for de-duplication/updating Expectations
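
For concreteness, here is a minimal sketch of how a column_map_expectation decorator could bundle several of these behaviors. It assumes self is a pandas-DataFrame-like dataset, and every name is illustrative rather than the actual GX implementation:

    import json
    from functools import wraps

    def column_map_expectation(func):
        """Illustrative decorator bundling several of the behaviors above."""
        @wraps(func)  # docstring propagation
        def wrapper(self, column, mostly=1.0, catch_exceptions=False,
                    include_kwargs=False, **kwargs):
            try:
                series = self[column].dropna()  # exclude null values
                passed = series.map(lambda value: func(self, value, **kwargs))
                success = bool(passed.mean() >= mostly) if len(passed) else True
                result = {"success": success}
                if include_kwargs:  # echo inputs back for reproducibility
                    result["kwargs"] = {"column": column, "mostly": mostly, **kwargs}
            except Exception as exc:
                if not catch_exceptions:
                    raise
                result = {"success": False, "raised_exception": repr(exc)}
            json.dumps(result)  # output validation: must be JSON-serializable
            return result
        return wrapper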

Should it be easy to simultaneously create many expectations?

Consider the case of something like the following:

for column in df.columns:
    df.expect_column_mean_to_be_between(column, min_value, max_value)

Currently this works, but we would need to wrap each expectation call in print() to see any output at all, and even then we could not tell which column an expectation was about unless we also printed the dictionary returned by the expectation. Is this a useful pattern?
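
One way to make the pattern more useful (illustrative only, assuming the result-dict shape discussed in these issues) is to collect labeled results rather than printing inside the loop:

    # Key each result by column so the output shows which column an
    # expectation was about. min_value/max_value are placeholders.
    results = {
        column: df.expect_column_mean_to_be_between(column, min_value, max_value)
        for column in df.columns
    }
    for column, result in results.items():
        print(column, result["success"])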

Proposal: Use a WeightedPartitions object for distributional expectations

{
  "partitions" : [0.0, 0.1, 0.3, 0.6, 1.0],
  "weights" : [0.4, 0.05, 0.25, 0.3, 0.0]
}

Partitions specifies the lower bound for each partition.
Weights specifies the total mass within each partition. (lower_bound <= value < upper_bound)

  • The number of entries in the partitions and weights lists must be equal.
  • For convenience, partitions are always sorted in ascending order.
  • Weights must sum to exactly 1.0.

Note: Is there a JSON-serializable version of inf and -inf?
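
A small validator makes the invariants above concrete (a sketch; the function name is hypothetical, and a comment addresses the inf question):

    import math

    def validate_weighted_partitions(obj):
        partitions, weights = obj["partitions"], obj["weights"]
        assert len(partitions) == len(weights)   # equal numbers of entries
        assert partitions == sorted(partitions)  # ascending order
        assert math.isclose(sum(weights), 1.0)   # total mass is exactly 1.0
        # On the inf question: strict JSON has no inf/-inf tokens; Python's
        # json.dumps(float("inf")) emits the non-standard literal Infinity.
        return True

    validate_weighted_partitions({
        "partitions": [0.0, 0.1, 0.3, 0.6, 1.0],
        "weights": [0.4, 0.05, 0.25, 0.3, 0.0],
    })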

Putting weights and partitions together into a single object has several advantages:

  • Can be passed/returned through true_value and other parameters
  • Simpler to test
  • Less prone to accidental separation in exploratory workflows

Using PDF instead of CDF has some advantages, too

  • Unified representation for categorical and continuous data
  • More user-friendly graphs
  • Still information-complete for calculating CDFs
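
The last point is easy to demonstrate: a CDF is just the running total of the partition weights.

    from itertools import accumulate

    weights = [0.4, 0.05, 0.25, 0.3, 0.0]
    cdf = list(accumulate(weights))  # ≈ [0.4, 0.45, 0.7, 1.0, 1.0], up to rounding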

Create issues from this issue. :)

Notes from July 10th call:

Custom expectations:

  • Standardize on column_map_expectation and column_aggregate_expectation
  • Drop column_elementwise_expectation for now. (If we discover we need it, we can add it back.)
  • Add better worked examples in the docs.
  • We also need to add a prototyping syntax for expectations that doesn't require subclassing and decorators. Something along the lines of:
dataset.expect_function_to_be_elementwise_true('column', function)
   => assert (df['column'].apply(function) == True).all()
dataset.expect_function_to_be_true('column', function)
   => assert function(df['column']) == True
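
A rough, hypothetical implementation of these helpers over a plain pandas DataFrame (names and result shape are illustrative, not released API):

    import pandas as pd

    def expect_function_to_be_elementwise_true(df, column, function):
        passed = df[column].apply(function)  # one boolean per element
        return {"success": bool(passed.all()),
                "exception_list": df[column][~passed].tolist()}

    def expect_function_to_be_true(df, column, function):
        return {"success": bool(function(df[column]))}  # one boolean per column

    df = pd.DataFrame({"age": [12, 30, 45]})
    print(expect_function_to_be_elementwise_true(df, "age", lambda x: x > 0))
    print(expect_function_to_be_true(df, "age", lambda s: s.mean() > 10))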

Output formats

  • Make it clear in the docs: output_format is categorical, not strictly ordered. This makes the output API more flexible and extensible.
  • Bring the docs up to date (e.g., true_value for aggregate_column_expectations).
  • Change include_lineage to include_kwargs. Also make it clear that expectations have only kwargs, no args.
  • Think about including row_index_list as a return value. (This gets complicated in some non-pandas systems.)
  • What about error messages and handling in expectations?
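
On the first point, a hypothetical illustration of output_format as a set of named, categorical shapes (the dict contents are assumptions, not the released API):

    # Each format is an alternative shape, not a rung on a verbosity ladder.
    def format_result(result, output_format="BASIC"):
        if output_format == "BOOLEAN_ONLY":
            return result["success"]
        if output_format == "BASIC":
            return {"success": result["success"], "kwargs": result["kwargs"]}
        if output_format == "COMPLETE":
            return result  # may carry row_index_list, costly outside pandas
        raise ValueError(f"unknown output_format: {output_format}")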

How should we implement distributional expectations within the new expectation decorators?

Distributional expectations are different from all the other @column_aggregate_expectations:

They need to accept a confidence_threshold argument, similar to mostly for column_map_expectations. Unlike mostly, confidence_threshold isn't optional.

In addition to a true_value, they should also return a confidence_value:

{
  "success": boolean,
  "true_value": partitioned_weights,
  "confidence_value": float on [0, 1]
}

The difference isn't fundamentally because these are expectations about distributions. The difference is because these are statistical assumptions.

How should we capture this in our expectations?

Option 1: Create a @column_statistical_expectation
Option 2: Add parameters to the existing distributional expectations to give them the extra fields and behavior they need.

I lean towards (1). Jerry-rigging extra fields and parameters in (2) seems like it could get sticky pretty fast. And statistical expectations are a pattern that I expect to use more in the future.

@jcampbell, @dgmiller Thoughts?
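
To make Option 1 concrete, a minimal sketch of what such a decorator might look like (all names illustrative; success is assumed to mean confidence_value >= confidence_threshold):

    from functools import wraps

    def column_statistical_expectation(func):
        @wraps(func)
        def wrapper(self, column, confidence_threshold, **kwargs):
            # Unlike `mostly`, confidence_threshold is a required argument.
            true_value, confidence_value = func(self, self[column], **kwargs)
            return {
                "success": confidence_value >= confidence_threshold,
                "true_value": true_value,              # e.g. partitioned weights
                "confidence_value": confidence_value,  # float on [0, 1]
            }
        return wrapper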
