Comments (19)
Ok, I get it. So there would only be the Check class with factory methods injecting the fn argument. Not that bad...
I have to think a moment what this leads to. We get fewer new classes but instead one rather huge class (pandas style, so to say...).
I am very used to class hierarchies, that's why I came up with the other approach at first.
For the user not that different after all.
Either way, I think we should separate this from the core code of the Check class. Maybe have something like a standard_checks module with either the factory functions or the subclasses. Both only use the public interface of Check.
from pandera.
@mastersplinter @chr1st1ank now that #98 is merged, I think there are a few entangled issues that make sense to raise here:
- The `Check` class is responsible for too much; pieces of logic need to be delegated elsewhere:
  - need to move error formatting logic out into separate helpers (perhaps in a `utils.py` module)
  - need to delegate the preparation of the `check_obj` object to the schema component consuming the `Check` object
- Define a `CheckBase` class with core check logic. `HypothesisBase` should subclass `CheckBase`.
- `Check` subclasses `CheckBase` and serves as the entrypoint class for built-in checks via the classmethod approach.
- `Hypothesis` should subclass `HypothesisBase` and serves as the entrypoint class for built-in hypotheses via the classmethod approach.
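As a sketch, the plan above could translate into a hierarchy like this. This is a hypothetical skeleton: the method bodies and the `greater_than` example are placeholders for illustration, not pandera's actual code.

```python
# Hypothetical skeleton of the proposed hierarchy; bodies are
# placeholders, not pandera's actual implementation.
class CheckBase:
    """Core check logic: wraps a check function and applies it."""
    def __init__(self, fn, error=None):
        self.fn = fn
        self.error = error

    def __call__(self, check_obj):
        return self.fn(check_obj)


class Check(CheckBase):
    """Entrypoint for built-in checks via classmethods."""
    @classmethod
    def greater_than(cls, min_value):
        # illustrative built-in, defined through the public constructor
        return cls(fn=lambda s: s > min_value,
                   error="greater_than(%s)" % min_value)


class HypothesisBase(CheckBase):
    """Would hold hypothesis-specific logic distinguishing hypotheses
    from plain checks."""


class Hypothesis(HypothesisBase):
    """Entrypoint for built-in hypotheses via classmethods."""


check = Check.greater_than(0)
print(check(5))  # True
```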
Fair enough, so the class methods it is...
I am just glad there is a plan we agree on. Every change of mind would have been a lot of refactoring with all these built-in tests we want to have.
> @chr1st1ank re: #164, these built-in checks should not take `**check_params`, since these should always be `element_wise=False`, `groups=None`, `groupby=None`. See #75 and #165 for supporting built-in checks for column comparison and group comparison checks. The only parameter besides `error` that makes sense to pass in is `n_failure_cases`.
Then I'll leave all additional parameters out for the moment. It is easier to add something to the interface later than to remove something people have started using.
> Re: testing, I think the tests you wrote look good to me, I think with the nature of this library the test-to-package code ratio will (or should) always be a little on a higher side :)
I assume you are right. But at least I will narrow down the test cases to only use the Check class and not the Schema classes anymore.
Thanks for your input, guys!
This is definitely a good idea, especially if we want to import schemas from an external format like yaml.
I have drafted some already to combine them with the yaml reader which I would also put to discussion in the yaml topic (#91 ).
Looks like this:
```python
import pandas as pd
import pandera


class ValueRange(pandera.Check):
    """Check whether values are within a certain range."""

    def __init__(self, min_value=None, max_value=None):
        """Create a new ValueRange check object.

        :param min_value: Allowed minimum value. Should be a type comparable to
            the type of the pandas series to be validated (e.g. a numerical type
            for float or int and a datetime for datetime).
        :param max_value: Allowed maximum value. Should be a type comparable to
            the type of the pandas series to be validated (e.g. a numerical type
            for float or int and a datetime for datetime).
        """
        super().__init__(fn=self.check)
        self.min_value = min_value
        self.max_value = max_value

    def check(self, series: pd.Series) -> pd.Series:
        """Compare the values of the series to the predefined limits.

        :returns: pd.Series with the comparison result as True or False
        """
        if self.min_value is not None:
            bool_series = series >= self.min_value
        else:
            bool_series = pd.Series(data=True, index=series.index)
        if self.max_value is not None:
            return bool_series & (series <= self.max_value)
        return bool_series


class StringMatch(pandera.Check):
    """Check if strings in a pandas.Series match a given regular expression."""

    def __init__(self, regex: str):
        """Create a new StringMatch object based on the given regex.

        :param regex: Regular expression which must be matched
        """
        super().__init__(fn=self.match)
        self.regex = regex

    def match(self, series: pd.Series) -> pd.Series:
        """Check if all strings in the series match the regular expression.

        :returns: pd.Series with the comparison result as True or False
        """
        return series.str.match(self.regex)


class StringLength(pandera.Check):
    """Check if the length of strings is within a specified range."""

    def __init__(self, min_len: int = None, max_len: int = None):
        """Create a new StringLength object with a given range.

        :param min_len: Minimum length of strings (default: no minimum)
        :param max_len: Maximum length of strings (default: no maximum)
        """
        super().__init__(fn=self.check_string_length)
        self.min_len = min_len
        self.max_len = max_len

    def check_string_length(self, series: pd.Series) -> pd.Series:
        """Check if all strings have an acceptable length.

        :returns: pd.Series with the validation result as True or False
        """
        if self.min_len is not None:
            bool_series = series.str.len() >= self.min_len
        else:
            bool_series = pd.Series(data=True, index=series.index)
        if self.max_len is not None:
            return bool_series & (series.str.len() <= self.max_len)
        return bool_series
```
Here is how they are used then:
```python
schema = pandera.SeriesSchema(
    pandas_dtype=series.dtype.name, nullable=True,
    checks=[checks.ValueRange(min_value=min_val, max_value=max_val)]
)

schema = pandera.SeriesSchema(
    pandas_dtype=pandera.String,
    nullable=False,
    checks=[checks.StringMatch(regex=pattern)]
)
```
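To see the drafted check logic in isolation, here is a minimal standalone sketch (pandas only, no pandera) that mirrors the `ValueRange.check` predicate; the function name is hypothetical:

```python
import pandas as pd


def value_range(series, min_value=None, max_value=None):
    # Mirrors ValueRange.check: elementwise True/False, fully vectorized
    result = pd.Series(True, index=series.index)
    if min_value is not None:
        result &= series >= min_value
    if max_value is not None:
        result &= series <= max_value
    return result


s = pd.Series([1, 5, 10, 20])
print(value_range(s, min_value=5, max_value=10).tolist())  # [False, True, True, False]
```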
I think we shouldn't overdo it. For example, the regex check covers all the other string checks you listed above, if the user is willing to express them as regex.
What's your feeling about these checks?
Edit: To make more transparent what I am up to I created a draft pull request which is referenced below.
One API design decision I'd like to discuss for this issue is how to implement built-in `Check` and `Hypothesis` functions. I'll propose two options, knowing there are probably more:
- For each built-in check/hypothesis, add a new class method to the corresponding class definition, much like the two sample ttest and one sample ttest.
- Subclass the `Check`/`Hypothesis` classes to implement built-in checks/hypotheses.

Note that this issue is about how `pandera` implements built-in checks, not how users can extend `Check` and `Hypothesis` for customized checks.
I'm leaning toward option (1), for the following reasons:
- prevents proliferation of classes, which tends to yield a lot of boilerplate class definition code
- encourages simplicity in check function implementation (complex check functions should generally not be built-in methods) and extension (simply supply a function to `Check`)
- using `Check`/`Hypothesis` as the entrypoint to the built-in methods makes the API easier to use; it minimizes the number of high-level concepts that users have to learn (schemas, checks, hypotheses)
An example of (1), besides the two Hypothesis methods linked above, would be:
```python
class Check(object):
    ...
    @classmethod
    def range(cls, min, max):
        return cls(
            fn=lambda s: (min <= s) & (s <= max),
            error="failed range check between %s and %s" % (min, max)
        )
```
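The lambda passed to `cls` above is an ordinary vectorized pandas expression; a quick standalone illustration (the bounds and function name here are made up for the example):

```python
import pandas as pd

# Same predicate shape as in the Check.range classmethod, standalone
def range_fn(s, lo=2, hi=4):
    return (lo <= s) & (s <= hi)


s = pd.Series([1, 2, 3, 4, 5])
print(range_fn(s).tolist())  # [False, True, True, True, False]
```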
For option (2), the subclassed equivalent would be:
```python
class Range(Check):
    def __init__(self, min, max):
        self.min = min
        self.max = max
        super(Range, self).__init__(
            fn=self.range,
            error="failed range check between %s and %s" % (min, max)
        )

    def range(self, series):
        return (self.min <= series) & (series <= self.max)
```
Of course, the differences in the two examples might seem trivial and I think this boils down to taste... indeed, I originally designed the Check
class for flexibility and extensibility via functions, not via a method
that subclasses are supposed to implement.
My rationale behind this is that in pandera
, classes should implement the key high-level concepts that map onto pandas
data structure elements, whereas the specific checks that validate those data are either defined by users, or common checks should be extremely simple and have a clear entrypoint.
Thoughts @mastersplinter @chr1st1ank?
I like the class method approach for brevity.
I like @chr1st1ank's suggestion of having the factory methods in a module to keep the Check itself tidy (it's already a bit big)
@cosmicBboy So we'd have this hierarchy?
```
CheckBase
├── Check
└── HypothesisBase
    └── Hypothesis
```
What're you thinking needs to be in `HypothesisBase` rather than just having something like this?:
```
CheckBase
└── Check
    └── Hypothesis
```
I was thinking that `HypothesisBase` would define core logic and methods that distinguish a `Check` from a `Hypothesis`, e.g. `is_one_sample_test`, `hypothesis_check`, and `relationships`, and the `Hypothesis` class would be the built-in hypothesis factory. This would make the relationship `CheckBase` -> `Check` isomorphic to `HypothesisBase` -> `Hypothesis`.
However, I'm not wedded to this structure, as @chr1st1ank's proposal of built-in checks being subclasses is also a reasonable path. I think once the code is cleaned up a little bit, (#96, #99) it might be easier to make a decision
Ok, let's list what the classes currently seem to do:
- Check:
  - Prepare a dataframe or series for a test (e.g. groupby)
  - Apply the test function
  - Find violations
  - Report violations
- Hypothesis:
  - A little additional dataframe/series preparation
  - Create Hypothesis objects with special built-in test functions
What we are trying now is to add built-in test functions also for the general Check class. There I also like the functional approach suggested by @cosmicBboy. Note that this is already done for the Hypothesis checks. The inheritance hierarchy we discuss here seems to point in the opposite direction, however.
My suggestion is to stick to the current rather flat class structure because more inheritance levels seem to complicate things further:
Let's try to design all built-in checks in a manner that they can only use the public API of the Check class to create concrete checks. The Hypothesis subclass seems necessary to change the behaviour of some methods of the Check class, because this can't be done with the init function. In other cases we only add a classmethod generating a specialised variant of Check:
- Check
  - classmethod smaller_than()
  - classmethod string_matches()
  - ...
- Hypothesis
  - classmethod one_sample_ttest()
  - classmethod two_sample_ttest()
  - ...
As we are already planning to move functionality out of the Check class to other places, there might not be too much code left to fill additional inheritance levels anyway.
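The flat structure above could look roughly like this as code — a hypothetical outline only, with illustrative factory bodies, not pandera's actual source:

```python
# Hypothetical sketch of the flat structure: built-ins are classmethods
# that only use the public Check constructor.
class Check:
    def __init__(self, fn, error=None):
        self.fn = fn
        self.error = error

    @classmethod
    def smaller_than(cls, max_value):
        return cls(fn=lambda s: s < max_value,
                   error="smaller_than(%s)" % max_value)

    @classmethod
    def string_matches(cls, regex):
        # relies on the pandas .str accessor of the series passed in
        return cls(fn=lambda s: s.str.match(regex),
                   error="string_matches(%r)" % regex)


class Hypothesis(Check):
    """Single subclass that overrides some Check behaviour; built-in
    hypotheses (one_sample_ttest, ...) would live here as classmethods."""
```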
So, I guess after the nice refactoring of #153 and as we need it for other topics like the yaml i/o and schema inference, this topic here becomes interesting again.
Maybe I can find some time for this next weekend to write a couple of checks including tests. But before that let's nail down the architecture.
Based on what's written above we would go for factory methods? The usage would be for example:
```python
from pandera import Check

my_range_check = Check.create_range_check(min=5, max=10)
```
Yes, I think the class (factory) methods are the way to go here.
One thing I wanted to discuss about design and naming is that I think it would be good to:
- define a minimal set of "primitive" checks that we can use to construct other, more specialized checks, e.g. no trailing/leading whitespace check can use the regex matching check
- name these built-in checks in an intuitive and concise way; with the example above, I'd propose `Check.in_range` instead of `Check.create_range_check`, since the `Check` namespace already tells the user that we want to define checks, and `create_` is superfluous given that we're initializing a check already.
- use pandas' vectorized, C-optimized functionality as much as possible
- keep things simple by only applying checks to a single column (don't support built-in groupby checks yet)
Jumping off of the proposal in the issue description, here's a proposal draft for the first iteration of built-in checks.
For dtypes that support the pandas comparison operators `>`, `>=`, `<`, `<=`, `==`, `!=`:
- `Check.greater_than`: greater than scalar x
  - Alternatives: `Check.greater_than_or_equals`: greater than or equal to scalar x; alternatively, the `greater_than` method can have an `inclusive=bool` flag that makes the provided scalar inclusive.
- `Check.equals`: equal to scalar x
- `Check.not_equals`: not equal to scalar x
- `Check.less_than`: less than scalar x (for `less_than_or_equals` see the note on alternatives for the `greater_than` method)
- `Check.in_range`: is in range x - y (inclusive or exclusive)
- `Check.is_monotonic`: is sorted, ascending or descending, with a `strict=bool` flag for strict/non-strict monotonicity

For dtypes that support checks for set membership:
- `Check.isin`: is in list-like/set s
- `Check.notin`: is not in list-like/set s

For datetime dtypes:
- `Check.date_format`: check for date format "<some_format>"
- (comparison operators and range checks also apply for datetime dtype)

For string dtype:
- `Check.str_matches`: column matches "regex" pattern
- `Check.str_has_no_trailing_whitespace`: no trailing whitespace
- `Check.str_has_no_leading_whitespace`: no leading whitespace
- `Check.str_contains`: column contains "string"
- `Check.str_startswith`: column starts with "string"
- `Check.str_endswith`: column ends with "string"
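As a sketch of how thin these primitives can be on top of pandas' vectorized `.str` accessor (illustrative functions, not pandera's implementation — the names merely follow the proposal above):

```python
import pandas as pd


def str_matches(series, pattern):
    return series.str.match(pattern)


def str_has_no_trailing_whitespace(series):
    # expressed via the regex primitive: no whitespace before end-of-string
    return ~series.str.contains(r"\s$")


def str_startswith(series, prefix):
    return series.str.startswith(prefix)


s = pd.Series(["abc", "abc ", "xyz"])
print(str_has_no_trailing_whitespace(s).tolist())  # [True, False, True]
print(str_startswith(s, "ab").tolist())            # [True, True, False]
```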
One thing I'm wondering is if we should have a `functional_checks` module that implements the core check logic as pure functions, where the `Check.*` class methods just delegate to those functions, so that the core logic of checks is easier to test and `Check.*` is simply the interface to those functions.
For example:
```python
# checks.py
import functional_checks
from functools import partial

class Check():
    ...
    @classmethod
    def in_range(cls, min_value, max_value, min_inclusive=True, max_inclusive=True):
        return cls(
            fn=partial(
                functional_checks.in_range,
                min_value, max_value, min_inclusive, max_inclusive
            ),
            groups=None,
            groupby=None,
            element_wise=False,
            error="range check %s, %s" % (min_value, max_value)
        )
```

```python
# functional_checks.py
def in_range(min_value, max_value, min_inclusive, max_inclusive, series):
    min_check = series >= min_value if min_inclusive else series > min_value
    max_check = series <= max_value if max_inclusive else series < max_value
    return min_check & max_check
```
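A quick sanity check of the functional version (reproduced here so the snippet is self-contained; note the max comparison must use `<=`/`<`), showing how `functools.partial` turns it into the series-to-boolean-series shape that the `fn` argument expects:

```python
import pandas as pd
from functools import partial


def in_range(min_value, max_value, min_inclusive, max_inclusive, series):
    min_check = series >= min_value if min_inclusive else series > min_value
    max_check = series <= max_value if max_inclusive else series < max_value
    return min_check & max_check


# partial binds the parameters, leaving a function of the series alone
fn = partial(in_range, 5, 10, True, True)
print(fn(pd.Series([4, 5, 10, 11])).tolist())  # [False, True, True, False]
```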
Please add your thoughts @chr1st1ank @mastersplinter.
This design still doesn't feel clear enough for me.
- The `create_` prefix was to highlight that the factory functions within the Check class are not ordinary methods. This is because they are not just in the checks module, but within the Check class. But I agree that the prefix is a bit awkward.
- Clearly pure functions are easy to test. But a Check object behaves like a pure function already. If we have both the classmethod and the external function, we have to test both anyway.
- Another idea: what if we provided the check functions as higher-order functions in the pandera.checks module? Example, using your in_range:
```python
def in_range(min_value, max_value, min_inclusive=True, max_inclusive=True):
    def check_fn(series):
        min_check = series >= min_value if min_inclusive else series > min_value
        max_check = series <= max_value if max_inclusive else series < max_value
        return min_check & max_check
    return check_fn
```
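The closure variant can be exercised directly; this self-contained sketch (with default inclusive flags added for convenience) shows that the outer call fixes the bounds and returns a vectorized `check_fn`, playing the same role as `functools.partial` in the `functional_checks` sketch:

```python
import pandas as pd


def in_range(min_value, max_value, min_inclusive=True, max_inclusive=True):
    def check_fn(series):
        min_check = series >= min_value if min_inclusive else series > min_value
        max_check = series <= max_value if max_inclusive else series < max_value
        return min_check & max_check
    return check_fn


check_fn = in_range(5, 10)  # bounds are captured in the closure
print(check_fn(pd.Series([4, 7])).tolist())  # [False, True]
```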
Then the user would have to pass them as the `fn` argument into the Check constructor like so:
```python
import pandera
from pandera import Column, Check, DataFrameSchema, checks

schema = DataFrameSchema(
    {
        "column1": Column(pandera.Int, Check(checks.in_range(min_value=5, max_value=10)))
    }
)
```
> But a Check object behaves like a pure function already. If we have both, the classmethod and the external function we have to test both anyway.

Good point, I like the higher-order functions idea. To refine the classmethods approach a little further, I think the following two options would get at the spirit of what you're saying and make the syntax a little more concise (the `Check(checks.*)` syntax seems a little clunky to me):
Option 1:
This is exactly as you put it, but returning a Check object directly
```python
def in_range(min_value, max_value, min_inclusive, max_inclusive):
    def check_fn(series):
        min_check = series >= min_value if min_inclusive else series > min_value
        max_check = series <= max_value if max_inclusive else series < max_value
        return min_check & max_check
    return Check(check_fn, error="range check %s, %s" % (min_value, max_value))
Option 2:
Do the exact same thing as in option 1, except within the factory method:
```python
class Check():
    ...
    @classmethod
    def in_range(cls, min_value, max_value, min_inclusive=True, max_inclusive=True):
        def check_fn(series):
            min_check = series >= min_value if min_inclusive else series > min_value
            max_check = series <= max_value if max_inclusive else series < max_value
            return min_check & max_check
        return cls(check_fn,
                   error="range check %s, %s" % (min_value, max_value))
```
The problem with either of these approaches is that we lose the reusability of check_fn
since it only exists within the scope of the higher order function (in option 1) or the class method (in option 2), but this may not be a big deal.
For me option 1 sounds better, mainly because I would expect to find many checks in a module called `checks`, but not within a class called `Check`.
But the point about the function not being testable on its own is bothering me. You can see that I tried the implementation for greater_than for a start in PR #164 .
The amount of test code for such an easy thing is tremendous. I thought for a moment about whether it would be less if we put the actual check function separately as you suggested, but that doesn't help. There is so much test code because we always need a series to check if we want vectorized operations (which we definitely want).
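One way to keep the test volume down is a table-driven style: a single loop over (data, bounds, expected) tuples exercises the vectorized path. A hypothetical sketch with a simplified `in_range`:

```python
import pandas as pd


def in_range(min_value, max_value):
    # simplified inclusive range check for illustration
    return lambda s: (s >= min_value) & (s <= max_value)


# Table-driven cases: (data, min, max, expected elementwise result)
CASES = [
    ([1, 5, 9], 0, 10, [True, True, True]),
    ([1, 5, 9], 2, 8, [False, True, False]),
    ([], 0, 1, []),  # empty series should produce an empty result
]

for data, lo, hi, expected in CASES:
    result = in_range(lo, hi)(pd.Series(data, dtype="float64"))
    assert result.tolist() == expected
print("all cases passed")
```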
Ideas welcome, of course...
Would you sign off the general approach as started in the pull request?
What are your thoughts, @mastersplinter? To me, 1 and 2 are logically equivalent; the difference is a matter of user expectations and API design. I'm still partial to the classmethod approach, I think it's a reasonable way of organizing the built-in checks, and it's already a pattern that's established with the built-in Hypothesis checks:
- two_sample_ttest: https://github.com/pandera-dev/pandera/blob/master/pandera/hypotheses.py#L204
- one_sample_ttest: https://github.com/pandera-dev/pandera/blob/master/pandera/hypotheses.py#L314
I'm open to option 1 if it makes more sense from a usability perspective, but the difference seems very small, so I can't really decide and just default to sticking to the classmethod pattern. If we do move on option 1, though, we should pull out those two Hypothesis methods into their own functions in the `hypotheses` namespace.
@chr1st1ank re: #164, these built-in checks should not take `**check_params`, since these should always be `element_wise=False`, `groups=None`, `groupby=None`. See #75 and #165 for supporting built-in checks for column comparison and group comparison checks.
The only parameter besides `error` that makes sense to pass in is `n_failure_cases`.
Re: testing, I think the tests you wrote look good to me, I think with the nature of this library the test-to-package code ratio will (or should) always be a little on a higher side :)
Purely out of convention I think the class method approach is more consistent across the codebase.
Great, thanks for discussing and contributing @chr1st1ank! After this is done I'll push out the ~~0.2.8~~ 0.3.0 release: https://github.com/pandera-dev/pandera/milestone/3
Edit: Thinking about this some more, I think #164 is a significant enough enhancement to warrant a minor version bump. I'll cut a 0.2.8 release in the next few days since I think there have been a good number of bug-fixes since 0.2.7.
fixed by #164
> For datetime dtypes
> `Check.date_format`: check for date format "<some_format>" (comparison operators and range checks also apply for datetime dtype)

Hello Nils! Why was it decided for this issue to be closed without implementing `Check.date_format` in checks.py?
Vladimir