datajudge's People

Contributors

0xbe7a, claudiovanoniqc, connorleggettqc, damianbarabonkovqc, dependabot[bot], github-actions[bot], ivergara, jak-ket, janjaguschqc, jonashaag, kklein, mariusmerkleqc, pavelzw, quant-ranger[bot], xhochy, yyyasin19


datajudge's Issues

`AggregateNumericRangeEquality`: Mismatch between implementation and documentation

The doc string of the corresponding add_*_constraint method claims the following:

Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

Hence if we have, for a given key, n rows (in other words, a multiset of cardinality n) and a start_value of k, I would expect a range to be complete if exactly the following rows exist:

(key, k)
(key, k+1)
...
(key, k+n-1)

Yet, the implementation checks the following:

def missing_from_range(values, start=0):
    return set(range(start, max(values) + start)) - set(values)

On the one hand, it revolves around the maximal encountered value instead of the cardinality of the set. On the other hand, the start value is added to said maximum.

It is easy to come up with an example where both outlined behaviours diverge. Assume start_value to equal k and the observed rows to correspond to this:

(key, k)
(key, k+1)
(key, k+2)

According to the former definition - as described in the doc string - this would be a legitimate key.
According to the latter definition, we would expect

(key, k)
(key, k+1)
(key, k+2)
...
(key, 2k+1)

which would flag the current key as a failure for some k.

We do not notice this diverging behaviour in our tests since our tests only use start_value=1.

What is the intended behaviour?
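
For reference, a minimal sketch of a check that follows the doc string's semantics (assuming values is the collection of aggregated values for a given key):

def missing_from_range_docstring(values, start=1):
    # Doc string semantics: the aggregated values should equal the
    # integers from start to start + cardinality - 1.
    expected = set(range(start, start + len(values)))
    return expected - set(values)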

Bug: Raw condition string does not work for snowflake columns

When using the following setup (just an example, probably transferable to other cases as well)

req = WithinRequirement.from_table(...)
req.add_date_between_constraint(column_name, "1900-01-01", "2100-01-01")

the following SQL is produced (roughly):

SELECT schema.table.column_name
FROM schema.table
WHERE (column_name >= 1900-01-01) and (column_name <= 2100-01-01)

The missing ' around the dates throws an SQL compilation error, at least for Snowflake ❄️.

This is caused at a pretty general location in the code:

column = ref.get_column(engine)
new_condition = Condition(
    conditions=[
        Condition(raw_string=f"{column} >= {lower_bound}"),  # quotes missing here
        Condition(raw_string=f"{column} <= {upper_bound}"),
    ],
    reduction_operator="and",
)

I'm somewhat surprised this works for other Conditions or databases, I thought that most SQL dialects require one to wrap string arguments in single quotes.

Should we just make this wrapping dependent on the type of the bounds?
I also think that always using single quotes should not break anything, at least in the few test cases I tried.

Let me know, I can draft a quick fix :)
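
A sketch of the type-dependent quoting idea (quote_if_string is a hypothetical helper; column, lower_bound and upper_bound as in the snippet above):

def quote_if_string(bound):
    # Wrap string-like bounds (e.g. dates passed as "1900-01-01") in
    # single quotes so the raw SQL compiles on Snowflake.
    return f"'{bound}'" if isinstance(bound, str) else str(bound)

new_condition = Condition(
    conditions=[
        Condition(raw_string=f"{column} >= {quote_if_string(lower_bound)}"),
        Condition(raw_string=f"{column} <= {quote_if_string(upper_bound)}"),
    ],
    reduction_operator="and",
)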

Improvements in `VarCharRegex` assertion message

The constraint VarCharRegex shows all examples that do not match the given regex pattern. This is fine if the set is small, but quite cumbersome if the collection is big.

The questions would be how many we show and how those examples are selected. In principle, I don't see a problem with randomly picking N (something between 10 and 20, probably) from the set of counterexamples.

Furthermore, it'd be nice to add to the message the total number of non-matching entries.
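
A minimal sketch of such an assertion message (function name and sample size are placeholders):

import random

def format_counterexamples(counterexamples, n_show=10):
    # Show a random sample of the violations plus the total count.
    sample = random.sample(list(counterexamples), min(n_show, len(counterexamples)))
    return f"{len(counterexamples)} values do not match the pattern, e.g. {sample}"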

Add: Quantiles for date spans (or other ranges)

For constraints such as NRowsMaxGain the user can specify a date_range_deviation, where datajudge calculates the date range covered by a dataset (e.g. 150 "days of data" are in the first table and 200 in the second table, therefore the table is allowed to grow by 33% (200/150 ≈ 1.33) in terms of the number of rows).

To make this more robust, a requested feature was to allow using quantiles instead of max and min dates.
Currently, our date range is calculated as follows (for Snowflake and MSSQL):

sa.func.datediff(
    sa.text("day"),
    sa.func.min(column),
    sa.func.max(column),
)

The new implementation would, for example, use the 99th percentile date instead. This would remove outliers that were generated e.g. by faulty user input (the current date is 2022-06-01 but the user enters 2021-01-06, stretching the covered date range).

Issue: A database-agnostic implementation of this feature is not so straightforward.
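
As a rough illustration of the quantile variant (PostgreSQL/Snowflake-style percentile_cont syntax; column is a sqlalchemy column as in the snippet above; explicitly not database-agnostic):

import sqlalchemy as sa

# Use the 1st and 99th percentile dates instead of min/max so that a
# few outlier rows no longer dominate the computed date span.
lower = sa.func.percentile_cont(0.01).within_group(column)
upper = sa.func.percentile_cont(0.99).within_group(column)
date_span = sa.func.datediff(sa.text("day"), lower, upper)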

Improve on error handling when column is not found

When a column specified in a constraint is not found, an unmanaged exception is shown that doesn't help in understanding what the problem is:

______ test_constraint[UniquesSubset::X.X.X] ______

constraint = <dbcheck.constraints.uniques.UniquesSubset object at 0x7f9be45be730>
engine = Engine(postgresql://XXXX:***@XXXXXXXXXXXX:5432/health)

    @pytest.mark.parametrize("constraint", get_constraints(SCHEMA, OLD_SCHEMA), ids=idfn)
    def test_constraint(constraint, engine):
>       test_result = constraint.test(engine)

test_data.py:59: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
X/lib/python3.8/site-packages/dbcheck/constraints/base.py:178: in test
    value_factual = self.get_factual_value(engine)
X/lib/python3.8/site-packages/dbcheck/constraints/base.py:110: in get_factual_value
    factual_value, factual_selections = self.retrieve(engine, self.ref)
X/lib/python3.8/site-packages/dbcheck/constraints/uniques.py:94: in retrieve
    uniques, selection = db_access.get_uniques(engine, ref)
X/lib/python3.8/site-packages/dbcheck/db_access.py:377: in get_uniques
    selection = ref.get_selection(engine).alias()
X/lib/python3.8/site-packages/dbcheck/db_access.py:150: in get_selection
    [table.c[column_name] for column_name in self.columns]
X/lib/python3.8/site-packages/dbcheck/db_access.py:150: in <listcomp>
    [table.c[column_name] for column_name in self.columns]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <sqlalchemy.sql.base.ImmutableColumnCollection object at 0x7f9be365e5e0>
key = 'l'

    def __getitem__(self, key):
        try:
>           return self._index[key]
E           KeyError: 'l'

KeyError: 'l' is not very useful in this case, l being the name of the column that was not found.
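
A sketch of a friendlier lookup (hypothetical helper; table is a sqlalchemy Table):

def get_column_checked(table, column_name):
    # Translate the bare KeyError into an actionable message.
    try:
        return table.c[column_name]
    except KeyError:
        raise ValueError(
            f"Column '{column_name}' not found in table '{table.name}'. "
            f"Available columns: {sorted(table.c.keys())}"
        ) from None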

SQL constraints (e.g. primary key) checking

Snowflake doesn't support primary keys. (You may define primary keys and it remembers them as metadata, but it doesn't check them for integrity, e.g. uniqueness.)

It would be helpful to be able to check primary keys, and possibly other SQL-native constraints, with the help of datajudge.

We're likely to build this for a project so I wonder what datajudge folks think about the feature.

Provide a constraint limiting the number of missing values in a column

As of now, this can be done via

between_requirement = BetweenRequirement.from_tables(
    db_name1=db_name,
    db_name2=db_name,
    schema_name1=schema_name,
    schema_name2=schema_name,
    table_name1="mytable",
    table_name2="mytable",
)
between_requirement.add_n_rows_max_loss_constraint(
    constant_max_relative_loss=0.30,
    condition2=Condition(raw_string=f"{mycolumn} IS NOT NULL"),
)

Yet, it seems clear that this isn't perfectly user-friendly.

Moreover, a comparison of missingness between tables could be useful.

The null_absence_constraint method exists but currently doesn't support a tolerance parameter.
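
A sketch of what a dedicated API could look like (method name and parameter are hypothetical):

# Hypothetical convenience method on a WithinRequirement:
within_requirement.add_max_null_fraction_constraint(
    column="mycolumn",
    max_null_fraction=0.05,  # allow at most 5% missing values
)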

Output: Round numbers for cleaner output

We recently had the following output in our project

E   AssertionError: <schema1>.<table> has 0.21356959387452668 of #rows of <schema2>.<table>.
It was only allowed to increase 0.2102716914986854.   

I know that the bug causing 0.21 to be reported instead of 1.21 has been fixed, but I'd like to think about ways we could make the user output more readable.

Naive idea:

Allow the user to set the rounding, e.g. when the user allows an increase of 0.1, they maybe don't want to see 0.1000023423020234 as output. (Hard to get right for all cases.)

Another solution

Easy: just show the numbers up to the first digit at which they differ! This would print

E   AssertionError: <schema1>.<table> has 0.213 of #rows of <schema2>.<table>.
It was only allowed to increase 0.21.   

here

and we could even just convert to percentages

E   AssertionError: <schema1>.<table> has 21.3% of #rows of <schema2>.<table>.
It was only allowed to increase 21%.
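
A sketch of the "first differing digit" idea (helper name hypothetical):

def round_to_first_difference(a, b, max_digits=12):
    # Round both numbers to one decimal place past the first digit at
    # which they differ, e.g. (0.21356..., 0.21027...) -> (0.2136, 0.2103).
    for digits in range(1, max_digits):
        if round(a, digits) != round(b, digits):
            return round(a, digits + 1), round(b, digits + 1)
    return a, b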

Allow users to access actual outcome of tests

Hey,

this is an issue I have encountered using datajudge within some projects and I'd like to discuss whether this is a general use case.

Status Quo

Currently, the test() method of a Constraint returns a TestResult object.
Example: NumericMean

def test(self, engine: sa.engine.Engine) -> TestResult:
    # retrieve values ...
    result = deviation <= self.max_absolute_deviation
    return TestResult(result, assertion_text)

For our pytest integration, we assert the outcome of this TestResult

@pytest.mark.parametrize(
    "constraint", all_constraints, ids=Constraint.get_description
)
def test_constraint(constraint, datajudge_engine):
    # ...
    test_result = constraint.test(datajudge_engine)
    assert test_result.outcome, test_result.failure_message

This is completely fine since the goal of a test is to check whether something meets a certain requirement.
In the case of a constraint failing the user gets feedback on what went wrong in the form of the TestResult.failure_message field.

Missing feature

In general, though, datajudge might not only be used in combination with pytest to evaluate if something meets a certain requirement but also to validate the results by plotting, comparing, or evaluating them.

One concrete example is the use of statistical requirements/constraints, such as the KolmogorovSmirnov2Sample constraint where from a user perspective it might not only be important to validate whether the test result is significant but also to report back how significant it is.
A user might display these values in a dashboard tracking database changes over time, or use them to initially set some values for constraint specification.

Example: Measure how many values are currently missing in a data dump and then specify the constraint for the future based on that value.

This would, additionally, allow the user to inspect the outcomes of tests that were successful and answer the question "Okay, but how close are we to the threshold?"

Needed changes

Since the project architecture is pretty clean (🚀), implementing this feature would just require adding a new field to the TestResult object, e.g. values, which allows the user to access the internally measured value for a certain constraint.

The change would be fully backward-compatible and could initially be just integrated for constraints where it makes sense the most, e.g. statistical constraints.
We could then leave a note in the documentation that the results are available by accessing the field with a certain name.

For statistical constraints, this field would contain the test statistic and/or the p-value. For numeric constraints, it would contain the actually measured mean, number of rows, missing columns, etc.
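
A sketch of the proposed extension (the values field as suggested above; the rest of TestResult is simplified):

from dataclasses import dataclass
from typing import Optional

@dataclass
class TestResult:
    outcome: bool
    failure_message: str
    # Proposed addition: raw measured values, e.g. {"p_value": 0.03} for
    # statistical constraints or {"n_rows_factual": 42} for row counts.
    values: Optional[dict] = None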

Example Usage

This could result in the following code on the user side.

results = []
for year in range(2019, 2023):
    req = BetweenRequirement.from_tables(year, year + 1)  # simplified
    req.add_statistical_test_constraint(...)
    req.add_n_rows_max_gain_constraint(0.1)
    stat_result = req[0].test(engine)
    gain_result = req[1].test(engine)

    # Users can track the change in data
    results.append((year, stat_result.values["p_value"], gain_result.values["n_rows_factual"]))

It might seem like a niche use case at first, but I think it's an important part of a framework geared toward data validation. :)
Happy to hear what you think!

DateMin constraint including tolerance

Hey,

I have a situation where I want to validate that almost all dates in a column are sensible, i.e., between 1900 and 2050.
Initially, I thought about the DateMin constraint; unfortunately, it does not currently support specifying a tolerance.
Since only ~99% of values fulfill this specification, I can't use the regular DateMin constraint.

Is there another way to achieve this with the current suite, or should we/I refactor DateMin to allow for tolerance?
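
For illustration, a hypothetical tolerance parameter could look like this (max_violation_fraction does not exist today):

requirement.add_date_min_constraint(
    "date_column",
    min_value="1900-01-01",
    max_violation_fraction=0.01,  # hypothetical: tolerate 1% of violating rows
)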

ci: cancel runs on new commits

Our CI currently lets runners run on old commits even though a newer one has arrived.
This can be solved pretty easily by adding the following concurrency group to the ci.yaml file:

name: Workflow X

on: # ...

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs: # ...

This can save resources (and make CI faster if you have a limited number of runners).

Improve handling of non-existing backend

If a backend that doesn't exist is given as an argument to pytest, it'll try to execute all the tests, but they'll all fail since there is no connection string defined.

This handling should be improved by failing early instead of waiting until all the tests that are bound to fail have run.
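
A sketch of such an early check in the test suite's conftest.py (the --backend option name and the CONNECTION_STRINGS mapping are assumptions about the test setup):

import pytest

# Hypothetical mapping of known backends to connection strings.
CONNECTION_STRINGS = {"postgres": "...", "mssql": "...", "snowflake": "..."}

def pytest_configure(config):
    # Fail fast on an unknown backend instead of letting every test error out.
    backend = config.getoption("--backend")
    if backend not in CONNECTION_STRINGS:
        raise pytest.UsageError(f"Unknown backend: {backend!r}")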

Isolation of snowflake tables

Status quo:

  • We have a single snowflake database in use for CI.
  • Our integration tests set up and tear down tables before and after running the tests, respectively. Across executions of the tests, the tables are always created with the same table name and within the same schema, e.g. dbo and int_table1[0].
  • Whenever CI is triggered simultaneously by more than one event within a timespan of ~10 minutes, at least one of the CI runs will almost surely fail (false negative) due to interference between the runs. After all, they rely on the same database and the same tables.

Solution ideas:

  1. Give tables (pseudo-)unique names, e.g. int_table1_{timestamp} instead of int_table1 (see the sketch after the footnotes). Even though our test suite comes with a teardown, we might need to set up a regular cleanup of the snowflake database since the integration test teardown itself might occasionally fail.
  2. Make use of locking[1] for the tables at hand. This would make a CI test execution wait for all previously started ones to finish.
  3. Implement a mechanism to only create/write to tables if they don't already exist. This would rely on the assumption that a table is immutable/doesn't change over time. This assumption should in principle hold as of now but might not always. I imagine that we might need to adapt/get rid of the teardown behaviour.

Any thoughts, opinions and suggestions are welcome!

[0] https://github.com/Quantco/datajudge/blob/main/tests/integration/conftest.py#L62-L68
[1] https://docs.snowflake.com/en/sql-reference/transactions.html#resource-locking
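
A minimal sketch of idea 1:

import uuid

# A per-run suffix keeps concurrent CI runs from touching the same tables.
run_suffix = uuid.uuid4().hex[:8]
table_name = f"int_table1_{run_suffix}"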

Extend test examples in documentation

As of now, we have two examples in the documentation. One[0] is very simple; the other[1] is a slightly more involved illustration of how to express expectations between data and reference values as well as other data.

We might look into extending our set of examples with the following:

  • An example expressing expectations against date columns. Some of our date constraints are fairly complex and might profit a lot from a comprehensive example.
  • An example focusing on the execution of tests. This example might even reuse requirements and constraints from another example and focus on
    • Subselection of tests via pytest's -k parameter.
    • Generation of html report.
    • Marking[2] of tests into different classes.
    • Parametrization of tests, e.g. with database names passed to pytest via CLI parameters.

[0] https://datajudge.readthedocs.io/en/latest/examples/example.html
[1] https://datajudge.readthedocs.io/en/latest/examples/example_twitch.html
[2] https://docs.pytest.org/en/7.1.x/example/markers.html

Improve number reporting in `add_categorical_bound_constraint`

I'm getting constraint failure messages like

6.288887928086567e-05% (20 out of 31802125)

and

1.0073226238812658% (320350 out of 31802125)

In both cases, using fewer decimal places would be nice I think. At the very least we shouldn't be showing scientific notation + % IMO.

For the counts, I think it would be nice to display them like Python does, i.e. 320_350 out of 31_802_125.
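
A sketch of such formatting (function name hypothetical):

def format_share(count, total):
    # Avoid scientific notation and group digits like Python literals, e.g.
    # format_share(320350, 31802125) -> "1.0073% (320_350 out of 31_802_125)".
    return f"{100 * count / total:.4f}% ({count:_} out of {total:_})"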

Add README TOC

What do you think about adding a TOC to the README? It's pretty long already.

Note that GitHub itself has a Markdown TOC but it's kind of hidden visually: https://github.blog/changelog/2021-04-13-table-of-contents-support-in-markdown-files/

What I'm suggesting is to add something that looks like this: https://github.com/thlorenz/doctoc

It would be autogenerated by something like https://github.com/marketplace/actions/toc-generator or https://github.com/ekalinin/github-markdown-toc#toc-generation-with-github-actions.

Snowflake: Integers are returned as `decimal.Decimal`

CI suddenly started failing [0]. It seems that an unpinned dependency on snowflake-sqlalchemy has changed [1] the outcome of the integration tests.

We create some tables with columns of type sqlalchemy.Integer [2]. Eventually we fetch rows from these tables and compare them to Python integers. Yet, the returned values now come as decimal.Decimal and therefore the standard comparison for equality fails.

I can recreate the CI errors locally by simply updating my conda environment.
When manually downgrading snowflake-sqlalchemy to version 1.3.4, the error vanishes.

[0] https://github.com/Quantco/datajudge/runs/7519131413?check_suite_focus=true
[1] https://github.com/snowflakedb/snowflake-sqlalchemy/releases/tag/v1.4.0
[2] https://docs.sqlalchemy.org/en/14/core/type_basics.html#sqlalchemy.types.Integer

Optionally use percentiles in `db_access::get_date_growth_rate`

As of now the date growth is defined by new_date_range / old_date_range - 1, each range being defined by the difference between maximum and minimum therein.

Yet, using minima and maxima is very sensitive to outliers. Instead, one might favour the more robust approach of using the pth and (100-p)th percentiles.

This could be an optional feature of db_access.get_date_growth_rate.

Suggestion by Stefan Sorg.

Declarative approach to define requirements

The current approach of defining constraints against a data source is fairly 'imperative'. It has been suggested that a more 'declarative' approach would be more natural.

In light of limiting complexity around data tests as much as possible, one could look at a specification as more of a 'configuration' - merely expressing expectations in a simple fashion and not relying on mutations, control flow and loops.

It could be possible to provide an additional API to express constraints: at creation time of a Requirement object. A user could then voluntarily opt in to look at Requirement objects in an immutable fashion. This could but wouldn't need to happen in a wrapper around Requirement.
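
To make this concrete, a purely hypothetical sketch of such a creation-time specification (none of these constraint classes exist as such):

requirement = WithinRequirement.from_table(
    db_name,
    schema_name,
    table_name,
    constraints=[  # hypothetical creation-time specification
        UniquenessConstraint(columns=["id"]),
        DateMinConstraint(column="created_at", min_value="1900-01-01"),
    ],
)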

For now, we are not sure about how to evaluate the upside of such an alternative, more high-level API. Hence, we decided to stay alert and let the idea sink in without a concrete next step.

cc @jonashaag @kklein

Harmonize and reassess format of date constraint parameters

An example of such a parameter would be add_date_min_constraint and its min_value parameter.

First, it should be clarified, what currently works. Exemplary questions to ask:

  • Can timestamps be used, as opposed to dates?
  • Can the parameters strictly only be strings or do datetime objects with appropriate __str__ methods work as well?
  • Is the currently suggested format, e.g. "'20121230'", the only one working?

Second, it should be asked how it should work. Some questions to ask:

  • What are the respective ISO standards?
  • Which formats do which databases support?

Separating creation of query from execution of it

Currently, we have helper functions that build the query and then immediately execute it, returning both the retrieved data and the query itself.

I suggest that those functions only return the statement. The function would be called at initialization time, the statement stored in an instance attribute of the constraint, and retrieve would then execute it.

My rationale is the following use case: I want to create a constraint and obtain the SQL query that said constraint will produce, without having to actually perform the test (or, more exactly, the retrieval). This is particularly interesting for explorative use of this library and when the execution of the query could take a long time, e.g. more than 10 minutes.

This has no priority; it's just a suggested idea, which would also better delineate the responsibility of the retrieve method.

  • AggregateNumericRangeEquality
  • Uniques
  • NRows
  • ...
  • Add selection as abstractmethod in Constraint base class.
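
A sketch of the suggested split (method names hypothetical):

class Constraint:
    def __init__(self, ref):
        self.ref = ref
        # Build the statement eagerly and store it; it can now be
        # inspected without touching the database.
        self.statement = self.build_statement()

    def build_statement(self):
        # Return a sqlalchemy selectable; no execution happens here.
        raise NotImplementedError

    def retrieve(self, engine):
        # Execution is deferred until here.
        with engine.connect() as connection:
            return connection.execute(self.statement).fetchall()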

Extend CI.

  • Test whether the package builds correctly, e.g. whether flit install finishes successfully.
  • Evaluate test coverage.

`KolmogorovSmirnov2Sample`: `Condition`s are ignored

The KolmogorovSmirnov2Sample constraint can be initialized with DataReferences containing Conditions. This is apparent by the parameters in add_ks_2sample_constraint:

https://github.com/Quantco/datajudge/blob/main/src/datajudge/requirements.py#L1276-L1277

Yet, since get_ks_2sample in db_access operates on DataSource objects - instead of DataReference objects - the Conditions attached to DataReferences are ignored.

While a Condition was indeed used in one integration test, it was used for both DataReferences at the same time. Since the same test also succeeds without a Condition, it did not become apparent that Conditions are ignored:

https://github.com/Quantco/datajudge/blob/main/tests/integration/test_integration.py#L1768-L1789

Resolving this issue is one of the many things we should get for free when tackling #29 and only ever accessing the underlying data via DataReference.get_selection - which takes care of these things under the hood.

Confusing error message in RowMatchingEquality

I'm comparing two tables that have 0% identical rows, and I'm getting:

AssertionError: 1.000000 > 0.01 of rows matched between OLD_TABLE and NEW_TABLE

Shouldn't this be "NOT matched"?

Re-implement KS test in sqlalchemy expression language API

Hey,

to track this task, I'd like to open a new issue for the future.
Currently, the Kolmogorov-Smirnov test is implemented in pure SQL, which was already a big win (cf. #28).

To gain even more independence from database internals and remove some of the clunky implementation details (e.g. string replacements in MSSQL table references to fit the SQL template), it would be helpful to convert the SQL query into the sqlalchemy expression language.

This shouldn't be too hard, since:

  • the expression language API is used throughout the project, so there is a lot of reference material
  • the test consists mostly of with-query steps and commonly available functions
  • this is completely separated from the rest of the project (hence the good first issue label) and consists of "just" translating an SQL query into sqlalchemy expressions

Thanks in advance for whoever might be participating (even if it's just future me).
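
For the with-query steps, the expression language's CTE support should map fairly directly; a tiny sketch (assuming a sqlalchemy Table object table with a column col):

import sqlalchemy as sa

# WITH filtered AS (SELECT col FROM table WHERE col IS NOT NULL)
# SELECT count(*) FROM filtered
filtered = sa.select(table.c.col).where(table.c.col.is_not(None)).cte("filtered")
statement = sa.select(sa.func.count()).select_from(filtered)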

Deprecation warning

Seen in CI while doing integration testing:

Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings. Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)

Provide `Constraint`s on percentiles for numeric columns

E.g. in order to express that the median of a column should lie within $(13.37 \cdot 0.9, 13.37 \cdot 1.1)$, one could write

my_within_requirement.add_numeric_quantile_constraint(
    k=.5,
    expected_kth_percentile=13.37,
    relative_tolerance=.1
)

Way to provide custom name for `get_description()`

Currently it's not obvious how to select a single constraint when using pytest -k. The pytest test names are taken from get_description(), which may produce something like ...table1|table2. Problems with that:

  • Test names are not necessarily unique, and pytest seems to append 0, 1, 2, ... to the names.
  • pytest -k doesn't allow specifying things like table1|table2 because of the |.

Idea: Provide a way to add your own name to each of the requirements/constraints?
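
Sketched with a hypothetical parameter:

requirement.add_uniques_equality_constraint(
    ["col"],
    expected_uniques,
    name="my_uniques_check",  # hypothetical: used as the pytest test id
)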

Question/Feature Request: Check distribution of unique values in column

Scenario: We have a bunch of tables that each contain the same column C, which has unique values in [0, 10]. Each table has a different distribution of those values, and we want to check that the distributions are as we expect, with some tolerance.

Example:

  • Table 1
    • 1: [0.4, 0.5] (at least 40% and at most 50% of values in C are = 1)
  • Table 2
    • 1: [0, 0.01] (at most 1% of values in C are = 1)
    • 2: [0.6, 1] (at least 60% of values in C are = 2)
  • All tables:
    • 0: [0, 0] (no values are 0)
    • 3-10: [0, 0.05] (at most 5% of values in C are >= 3)

How would you achieve something like this nicely in datajudge?

Thanks a lot!
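
One starting point might be add_categorical_bound_constraint (mentioned in an issue above); a sketch for Table 2, with the exact signature assumed:

req = WithinRequirement.from_table(db_name, schema_name, "table2")
req.add_categorical_bound_constraint(
    columns=["C"],
    distribution={1: (0, 0.01), 2: (0.6, 1)},  # value -> (min share, max share)
)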

Fault tolerant error handling on invalid Condition()

One can see it as a user error that Condition(raw_string="col is not NULL") fails if col does not exist in the table. However, the general concept of datajudge seems to be to handle errors fault-tolerantly and to report a list of errors at the end. Thus it might be nice to also try-catch plain query-related errors and report those at the end as well.

It may be a conscious decision to handle only value errors fault-tolerantly and to fail fast on structural problems. So feel free to reject this issue. I don't have long user experience with datajudge yet; I am just helping a team use it.

Performing regex matching on db engine itself

Currently, the VarcharRegexConstraint requests all the values in a column and transfers them to the application to perform the matching.

PostgreSQL and MSSQL Server do support executing (limited) regex pattern matching in the engine itself.

It'd be very nice to see how we could make use of that, or maybe to create a simpler version using the db while keeping the existing one for cases where the db cannot handle the given pattern.
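
For PostgreSQL, a sketch of pushing the match into the database (assuming a sqlalchemy column column and a regex string pattern):

import sqlalchemy as sa

# "~" is PostgreSQL's regex-match operator; count violations server-side
# instead of transferring every value to the application.
n_violations = sa.select(sa.func.count()).where(sa.not_(column.op("~")(pattern)))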

Snowflake column name case-sensitive

I am trying to specify constraints for a Snowflake table whose schema includes a column named "ALTER". After adding an add_uniques_equality_constraint, testing the code results in a KeyError: 'alter'.

Steps to Reproduce

  1. Create a Snowflake table with the following schema:
create or replace TABLE EXAMPLE (
	JAHR FLOAT,
	"ALTER" FLOAT
);
  2. Add an add_uniques_equality_constraint constraint:
requirement.add_uniques_equality_constraint(
    ["ALTER"], list(range(0, 100 + 1))
)
  3. Test the constraints

Expected Behavior

The code should run without errors.

Actual Behavior

Running the code results in a KeyError: 'alter' error.

Root Cause

The issue is caused by the Snowflake capitalization fix (https://github.com/Quantco/datajudge/blob/main/src/datajudge/db_access.py#L346). This fix forces the column name to be lowercase, which interferes with the case-sensitive "ALTER" column name in the table schema.

Validation of tolerance parameters

For example, max_relative_deviation or max_absolute_deviation are typically expected to be positive. Yet, we currently do hardly any input validation in the respective Constraint constructors. Validating this input could help users find sneaky typos/bugs.
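
A sketch of such a guard in a Constraint constructor:

if max_relative_deviation < 0:
    raise ValueError(
        f"max_relative_deviation must not be negative, got {max_relative_deviation}."
    )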

Provide 2-sample Anderson Darling test as constraint

The recently implemented Kolmogorov-Smirnov test can be useful when trying to assess whether two samples are drawn from the same underlying distribution. While it might already satisfy a user's need, some argue (e.g. [0]) that other kinds of tests might be more adequate in certain circumstances. A prominent alternative is the Anderson-Darling test [1].

Once our Kolmogorov-Smirnov test query logic is implemented via the sqlalchemy expression language API - see #29 - it should be very easy to share a lot of the underlying query logic between both tests.

[0] https://www.researchgate.net/publication/267205556_Power_Comparisons_of_Shapiro-Wilk_Kolmogorov-Smirnov_Lilliefors_and_Anderson-Darling_Tests
[1] https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test#Non-parametric_k-sample_tests

Refine `AssertionError` message for `NRowsMinGain` and `NRowsMaxGain` constraints

When the constraint NRowsMinGain is not met, the following exemplary AssertionError is thrown:

AssertionError: TAB_NEW has 0.30231844350740233 of #rows of TAB_OLD. It was supposed to increase at least 0.5.

I find this message misleading, as TAB_NEW actually has 130% of the #rows of TAB_OLD, i.e. a 30% increase in #rows.

As a minimal refinement, I would suggest replacing the string "of #rows of" with "relative gain in #rows compared to" in the assertion_text of NRowsMinGain.

This would change the message to

AssertionError: TAB_NEW has 0.30231844350740233 relative gain in #rows compared to TAB_OLD. It was supposed to increase at least 0.5.

Additionally, one could express the numbers as percentages by multiplying relative_gain as well as self.min_relative_gain by 100, adding "%" to the string and omitting "relative". That would give us:

AssertionError: TAB_NEW has 30.231844350740233 % gain in #rows compared to TAB_OLD. It was supposed to increase at least 50 %.

Similarly, it would probably make sense to change the error message in NRowsMaxGain.
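
Putting the percentage suggestion into code (a sketch; variable names assumed from context):

assertion_text = (
    f"{ref2} has {relative_gain * 100:.2f} % gain in #rows compared to {ref}. "
    f"It was supposed to increase at least {self.min_relative_gain * 100:.2f} %."
)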
