
Comments (19)

chr1st1ank commented on July 24, 2024

Ok, I get it. So there would only be the Check class, with factory methods injecting the fn argument. Not bad...
I have to think for a moment about where this leads. We get fewer new classes, but instead one rather large class (pandas style, so to say...).
I am very used to class hierarchies, which is why I came up with the other approach at first.
For the user it is not that different after all.

I think that either way we should separate this from the core code of the Check class. Maybe have something like a standard_checks module with either the factory functions or the subclasses. Both would only use the public interface of Check.

cosmicBboy commented on July 24, 2024

@mastersplinter @chr1st1ank now that #98 is merged, I think there are a few entangled issues that make sense to raise here:

  1. The Check class is responsible for too much; pieces of logic need to be delegated elsewhere (see the skeleton sketch after this list):
     1. Define a CheckBase class with core check logic.
     2. HypothesisBase should subclass CheckBase.
     3. Check should subclass CheckBase and serve as the entrypoint class for built-in checks via the classmethod approach.
     4. Hypothesis should subclass HypothesisBase and serve as the entrypoint class for built-in hypotheses via the classmethod approach.
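
A rough skeleton of that structure might look like the following (a sketch only; the attribute and method names here are illustrative assumptions, not pandera's actual implementation):

class CheckBase:
    """Core check logic: wrap a check function and run it against the data."""

    def __init__(self, fn, error=None, n_failure_cases=None):
        self.fn = fn
        self.error = error
        self.n_failure_cases = n_failure_cases

    def __call__(self, series):
        # apply the wrapped check function to a pandas Series
        return self.fn(series)


class HypothesisBase(CheckBase):
    """Hypothesis-specific logic (e.g. relationships between samples) would live here."""


class Check(CheckBase):
    """Entrypoint for built-in vectorized checks via classmethods."""

    @classmethod
    def greater_than(cls, min_value):
        return cls(fn=lambda s: s > min_value,
                   error="greater_than(%s) check failed" % min_value)


class Hypothesis(HypothesisBase):
    """Entrypoint for built-in hypotheses (e.g. one/two sample ttests) via classmethods."""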

chr1st1ank commented on July 24, 2024

Fair enough, so class methods it is...
I am just glad there is a plan we agree on. Any later change of mind would mean a lot of refactoring with all these built-in tests we want to have.

> @chr1st1ank re: #164, these built-in checks should not take **check_params, since these should always be element_wise=False, groups=None, groupby=None. See #75 and #165 for supporting built-in checks for column comparison and group comparison checks.

> The only parameter besides error that makes sense to pass in is n_failure_cases.

Then I'll leave all additional parameters out for the moment. It is easier to add something to the interface later than to remove something people have started using.

> Re: testing, I think the tests you wrote look good to me, I think with the nature of this library the test-to-package code ratio will (or should) always be a little on the higher side :)

I assume you are right. But at least I will narrow down the test cases to only use the Check class and not the Schema classes anymore.

Thanks for your input, guys!

chr1st1ank commented on July 24, 2024

This is definitely a good idea, especially if we want to import schemas from an external format like yaml.

I have already drafted some checks to combine with the yaml reader, which I would also put up for discussion in the yaml topic (#91).
They look like this:

import pandas as pd
import pandera


class ValueRange(pandera.Check):
    """Check whether values are within a certain range."""
    def __init__(self, min_value=None, max_value=None):
        """Create a new ValueRange check object.

        :param min_value: Allowed minimum value. Should be a type comparable to
            the type of the pandas series to be validated (e.g. a numerical type
            for float or int and a datetime for datetime).
        :param max_value: Allowed maximum value. Should be a type comparable to
            the type of the pandas series to be validated (e.g. a numerical type
            for float or int and a datetime for datetime).
        """
        super().__init__(fn=self.check)
        self.min_value = min_value
        self.max_value = max_value

    def check(self, series: pd.Series) -> pd.Series:
        """Compare the values of the series to the predefined limits.

        :returns pd.Series with the comparison result as True or False
        """
        if self.min_value is not None:
            bool_series = series >= self.min_value
        else:
            bool_series = pd.Series(data=True, index=series.index)

        if self.max_value is not None:
            return bool_series & (series <= self.max_value)

        return bool_series


class StringMatch(pandera.Check):
    """Check if strings in a pandas.Series match a given regular expression."""
    def __init__(self, regex: str):
        """Create a new StringMatch object based on the given regex.

        :param regex: Regular expression which must be matched
        """
        super().__init__(fn=self.match)
        self.regex = regex

    def match(self, series: pd.Series) -> pd.Series:
        """Check if all strings in the series match the regular expression.

        :returns pd.Series with the comparison result as True or False
        """
        return series.str.match(self.regex)


class StringLength(pandera.Check):
    """Check if the length of strings is within a specified range"""

    def __init__(self, min_len: int = None, max_len: int = None):
        """Create a new StringLength object with a given range

        :param min_len: Minimum length of strings (default: no minimum)
        :param max_len: Maximum length of strings (default: no maximum)
        """
        super().__init__(fn=self.check_string_length)
        self.min_len = min_len
        self.max_len = max_len

    def check_string_length(self, series: pd.Series) -> pd.Series:
        """Check if all strings have an acceptable length

        :returns pd.Series with the validation result as True or False
        """
        if self.min_len is not None:
            bool_series = series.str.len() >= self.min_len
        else:
            bool_series = pd.Series(data=True, index=series.index)

        if self.max_len is not None:
            return bool_series & (series.str.len() <= self.max_len)
        return bool_series

Here is how they are used then:

    schema = pandera.SeriesSchema(
        pandas_dtype=series.dtype.name, nullable=True,
        checks=[checks.ValueRange(min_value=min_val, max_value=max_val)]
    )

    schema = pandera.SeriesSchema(
        pandas_dtype=pandera.String,
        nullable=False,
        checks=[checks.StringMatch(regex=pattern)]
    )

I think we shouldn't overdo it. For example, the regex check covers all the other string checks you listed above - if the user is willing to express them as regex.
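
For instance, since pd.Series.str.match matches from the beginning of the string, a startswith-style check can be expressed as (hypothetical pattern):

# equivalent to checking that every string starts with "abc"
checks.StringMatch(regex=r"^abc")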

What's your feeling about these checks?

Edit: To make more transparent what I am up to I created a draft pull request which is referenced below.

cosmicBboy commented on July 24, 2024

One API design decision I'd like to discuss for this issue is how to implement built-in Check and Hypothesis functions. I'll propose two options, knowing there are probably more:

  1. For each built-in check/hypothesis, add a new class method to the corresponding class definition, much like the two sample ttest and one sample ttest.
  2. subclass the Check/Hypothesis classes to implement built-in checks/hypotheses.

Note that this issue is about how pandera implements built-in checks, not how users can extend the Check and Hypothesis classes for customized checks.

I'm leaning toward option (1), for the following reasons:

  • prevents proliferation of classes, which tends to yield a lot of boilerplate class definition code
  • encourages simplicity in check function implementation (complex check functions should generally not be built-in methods) and extension (simply supply a function to Check)
  • Check/Hypothesis as an entrypoint to the built-in methods makes the API easier to use; it minimizes the number of high-level concepts that users have to learn (schemas, checks, hypotheses)

An example of (1), besides the two Hypothesis methods linked above, would be:

class Check(object):
    ...

    @classmethod
    def range(cls, min, max):
        return cls(
            fn=lambda s: (min <= s) & (s <= max),
            error="failed range check between %s and %s" % (min, max)
        )

For option (2), the subclassed equivalent would be:

class Range(Check):

    def __init__(self, min, max):
        self.min = min
        self.max = max
        super(Range, self).__init__(
            fn=self.range,
            error="failed range check between %s and %s" % (min, max)
        )

    def range(self, series):
        return (self.min <= series) & (series <= self.max)

Of course, the differences in the two examples might seem trivial and I think this boils down to taste... indeed, I originally designed the Check class for flexibility and extensibility via functions, not via a method that subclasses are supposed to implement.

My rationale is that in pandera, classes should implement the key high-level concepts that map onto pandas data structures, whereas the specific checks that validate those data are either defined by users or, for common built-in checks, should be extremely simple and have a clear entrypoint.

Thoughts @mastersplinter @chr1st1ank?

mastersplinter commented on July 24, 2024

I like the class method approach for brevity.

I like @chr1st1ank's suggestion of having the factory methods in a module to keep the Check class itself tidy (it's already a bit big).

mastersplinter commented on July 24, 2024

@cosmicBboy So we'd have this hierarchy?

  • CheckBase
    • Check
    • HypothesisBase
      • Hypothesis

What're you thinking needs to be in HypothesisBase rather than just having something like this?:

  • CheckBase
    • Check
    • Hypothesis

cosmicBboy commented on July 24, 2024

I was thinking that HypothesisBase would define core logic and methods that distinguish a Check from a Hypothesis, e.g. is_one_sample_test, hypothesis_check, and relationships, and the Hypothesis class would be the built-in hypothesis factory. This would make the relationship between CheckBase -> Check isomorphic to HypothesisBase -> Hypothesis.

However, I'm not wedded to this structure, as @chr1st1ank's proposal of built-in checks being subclasses is also a reasonable path. I think once the code is cleaned up a little bit (#96, #99), it might be easier to make a decision.

chr1st1ank commented on July 24, 2024

Ok, let's list what the classes currently seem to do:

  • Check:
    • Prepare a dataframe or series for a test (e.g. groupby)
    • Apply the test function
    • Find violations
    • Report violations
  • Hypothesis:
    • A little additional dataframe/series preparation
    • Create Hypothesis objects with special built-in test functions

What we are trying to do now is to add built-in test functions for the general Check class as well. There I also like the functional approach suggested by @cosmicBboy. Note that this is already done for the Hypothesis checks. The inheritance hierarchy we are discussing here seems to point in the opposite direction, however.

My suggestion is to stick to the current, rather flat class structure, because more inheritance levels seem to complicate things further:
Let's try to design all built-in checks so that they use only the public API of the Check class to create concrete checks. The Hypothesis subclass seems necessary because it changes the behaviour of some methods of the Check class, which can't be done from the init function. In other cases we would only add a classmethod generating a specialised variant of Check:

  • Check
    • classmethod smaller_than()
    • classmethod string_matches()
    • ...
    • Hypothesis
      - classmethod one_sample_ttest()
      - classmethod two_sample_ttest()
      - ...

As we are already planning to move functionality out of the Check class to other places, there might not be too much code left to fill additional inheritance levels anyway.

chr1st1ank commented on July 24, 2024

So, I guess after the nice refactoring in #153, and since we need it for other topics like the yaml I/O and schema inference, this topic becomes interesting again.
Maybe I can find some time for this next weekend to write a couple of checks including tests. But before that let's nail down the architecture.

Based on what's written above we would go for factory methods? The usage would be for example:

from pandera import Check
my_range_check = Check.create_range_check(min=5, max=10)

cosmicBboy commented on July 24, 2024

Yes, I think the class (factory) methods are the way to go here.

One thing I wanted to discuss about design and naming is that I think it would be good to:

  • define a minimal set of "primitive" checks that we can use to construct other, more specialized checks, e.g. no trailing/leading whitespace check can use the regex matching check
  • name these built-in checks in an intuitive and concise way; with the example above, I'd propose Check.in_range instead of Check.create_range_check, since the Check namespace already tells the user that we want to define checks, and create_ is superfluous given that we're initializing a check already.
  • use pandas' vectorized, C-optimized functionality as much as possible
  • keep things simple by only applying checks to a single column (don't support built-in groupby checks yet)

Jumping off of the proposal in the issue description, here's a proposal draft for the first iteration of built-in checks (a short sketch of a few of these follows the lists below).

For dtypes that support pandas comparison operators >, >=, <, <=, ==, !=

  • Check.greater_than: greater than scalar x
    • Alternatives:
      • Check.greater_than_or_equals: greater than or equal to scalar x
      • alternatively, the greater_than method can have an inclusive=bool flag
        that makes the provided scalar inclusive.
  • Check.equals: equal to scalar x
  • Check.not_equals: not equal to scalar x
  • Check.less_than: less than scalar x (for less_than_or_equals see note on Alternatives with greater_than method)
  • Check.in_range: is in range x - y (inclusive or exclusive)
  • Check.is_monotonic: is sorted, ascending or descending, with strict=bool flag
    for strict/non-strict monotonicity

For dtypes that support checks for set membership

  • Check.isin: is in list-like/set s
  • Check.notin: is not in list-like/set s

For datetime dtypes

  • Check.date_format: check for date format "<some_format>"
  • (comparison operators and range checks also apply for datetime dtype)

For string dtype

  • Check.str_matches: column matches "regex" pattern
  • Check.str_has_no_trailing_whitespace: no trailing whitespace
  • Check.str_has_no_leading_whitespace: no leading whitespace
  • Check.str_contains: column contains "string"
  • Check.str_startswith: column starts with "string"
  • Check.str_endswith: column ends with "string"
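
To make the naming above concrete, a rough sketch of a few of the proposed checks as classmethods could look like this (an illustration only, assuming the existing Check(fn=..., error=...) constructor; the error messages and signatures are placeholders):

import pandas as pd


class Check:

    ...

    @classmethod
    def greater_than(cls, min_value, inclusive=False):
        # vectorized comparison against a scalar, with the inclusive flag proposed above
        if inclusive:
            return cls(fn=lambda s: s >= min_value,
                       error="greater_than_or_equals(%s)" % min_value)
        return cls(fn=lambda s: s > min_value,
                   error="greater_than(%s)" % min_value)

    @classmethod
    def isin(cls, allowed_values):
        # set membership via pandas' vectorized Series.isin
        return cls(fn=lambda s: s.isin(allowed_values),
                   error="isin(%s)" % (allowed_values,))

    @classmethod
    def date_format(cls, fmt):
        # strings must parse under the given format; unparseable values become NaT
        return cls(fn=lambda s: pd.to_datetime(s, format=fmt, errors="coerce").notna(),
                   error="date_format(%s)" % fmt)

    @classmethod
    def str_startswith(cls, string):
        # vectorized string prefix check via Series.str.startswith
        return cls(fn=lambda s: s.str.startswith(string),
                   error="str_startswith(%s)" % string)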

One thing I'm wondering is if we should have a functional_checks module that implements the core check logic as pure functions, with the Check.* class methods just delegating to those functions, so that the core logic of the checks is easier to test and Check.* is simply the interface to those functions.

For example:

# checks.py

import functional_checks

from functools import partial

class Check():

    ...

    @classmethod
    def in_range(cls, min_value, max_value, min_inclusive=True, max_inclusive=True):
        return cls(
            fn=partial(
                functional_checks.in_range,
                min_value, max_value, min_inclusive, max_inclusive
            ),
            groups=None,
            groupby=None,
            element_wise=False,
            error="range check %s, %s" % (min_value, max_value)
        )

# functional_checks.py

def in_range(min_value, max_value, min_inclusive, max_inclusive, series):
    min_check = series >= min_value if min_inclusive else series > min_value
    max_check = series <= max_value if max_inclusive else series < max_value
    return min_check & max_check

Please add your thoughts @chr1st1ank @mastersplinter.

chr1st1ank commented on July 24, 2024

This design still doesn't feel clear enough for me.

  1. The create_ prefix was to highlight that the factory functions within the Check class are not ordinary methods. This is because they are not just in the checks module, but within the Check class. But I agree that the prefix is a bit awkward.

  2. Clearly pure functions are easy to test. But a Check object behaves like a pure function already. If we have both the classmethod and the external function, we have to test both anyway.

  3. Another idea: What if we provided the check functions as higher order functions in the pandera.checks module? Example, using your in_range:

def in_range(min_value, max_value, min_inclusive=True, max_inclusive=True):
    def check_fn(series):
        min_check = series >= min_value if min_inclusive else series > min_value
        max_check = series <= max_value if max_inclusive else series < max_value
        return min_check & max_check
    return check_fn

Then the user would have to pass them as the fn argument to the Check constructor themselves, like so:

import pandera
from pandera import Column, Check, DataFrameSchema, checks

schema = DataFrameSchema(
    {
        "column1": Column(pandera.Int, Check(checks.in_range(min_value=5, max_value=10)))
     }
)

cosmicBboy commented on July 24, 2024

> But a Check object behaves like a pure function already. If we have both the classmethod and the external function, we have to test both anyway.

Good point, I like the higher order functions idea. To refine the classmethods approach a little further, I think the following two options would get at the spirit of what you're saying while making the syntax a little more concise (the Check(checks.*) syntax seems a little clunky to me):

Option 1:

This is exactly as you put it, but returning a Check object directly

def in_range(min_value, max_value, min_inclusive, max_inclusive):
    def check_fn(series):
        min_check = series >= min_value if min_inclusive else series > min_value
        max_check = series <= max_value if max_inclusive else series < max_value
        return min_check & max_check
    return Check(check_fn, error="range check %s, %s" % (min_value, max_value))

Option 2:

Do the exact same thing as in option 1, except within the factory method:

class Check():

    ...

    @classmethod
    def in_range(cls, min_value, max_value, min_inclusive=True, max_inclusive=True):
        def check_fn(series):
            min_check = series >= min_value if min_inclusive else series > min_value
            max_check = series <= max_value if max_inclusive else series < max_value
            return min_check & max_check
        return cls(check_fn,
                   error="range check %s, %s" % (min_value, max_value))

The problem with either of these approaches is that we lose the reusability of check_fn since it only exists within the scope of the higher order function (in option 1) or the class method (in option 2), but this may not be a big deal.
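
One way around that, combining the functional_checks idea from above with option 1 (just a sketch, hypothetical module layout), would be to keep the higher order function at module level and let the classmethod delegate to it, so check_fn stays testable on its own:

# functional_checks.py (sketch): reusable and testable on its own

def in_range(min_value, max_value, min_inclusive=True, max_inclusive=True):
    def check_fn(series):
        min_check = series >= min_value if min_inclusive else series > min_value
        max_check = series <= max_value if max_inclusive else series < max_value
        return min_check & max_check
    return check_fn


# checks.py (sketch): the classmethod just delegates

import functional_checks

class Check:

    ...

    @classmethod
    def in_range(cls, min_value, max_value, min_inclusive=True, max_inclusive=True):
        return cls(functional_checks.in_range(min_value, max_value,
                                              min_inclusive, max_inclusive),
                   error="range check %s, %s" % (min_value, max_value))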

chr1st1ank commented on July 24, 2024

For me option 1 sounds better. Mainly because I would expect to find many checks in a module called checks, but not within a class called Check.

But the point about the function not being testable on its own is bothering me. You can see that I tried the implementation of greater_than as a start in PR #164.
The amount of test code for such a simple thing is tremendous. I wondered for a moment whether it would be less if we put the actual check function into a separate function as you suggested, but that doesn't help. There is so much test code because we always need a series to check against if we want vectorized operations (which we definitely want).
Ideas welcome, of course...
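
One thing that might cut the boilerplate down a bit (just a sketch with hypothetical test data, assuming pytest is used) is parametrizing over the input series and expected results instead of writing a separate test function per case:

import pandas as pd
import pytest


# hypothetical raw check function, i.e. what a built-in Check.greater_than would wrap
def greater_than(min_value):
    return lambda series: series > min_value


@pytest.mark.parametrize("data, min_value, expected", [
    ([1, 2, 3], 0, [True, True, True]),
    ([1, 2, 3], 2, [False, False, True]),
    ([1.5, -1.0], 0.0, [True, False]),
])
def test_greater_than(data, min_value, expected):
    result = greater_than(min_value)(pd.Series(data))
    assert result.tolist() == expected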

Would you sign off on the general approach as started in the pull request?

cosmicBboy commented on July 24, 2024

What are your thoughts @mastersplinter? To me 1 and 2 are logically equivalent; the difference is a matter of user expectations and API design. I'm still partial to the classmethod approach, I think it's a reasonable way of organizing the built-in checks, and it's already a pattern that's established with the built-in Hypothesis checks:

I'm open to option 1 if it makes more sense from a usability perspective, but the difference seems very small, and since I can't really decide I'll just default to sticking with the classmethod pattern. If we do move on option 1, though, we should pull those two Hypothesis methods out into their own functions in the hypotheses namespace.

@chr1st1ank re: #164, these built-in checks should not take **check_params, since these should always be element_wise=False, groups=None, groupby=None. See #75 and #165 for supporting built-in checks for column comparison and group comparison checks.

The only parameter besides error that makes sense to pass in is n_failure_cases.

Re: testing, I think the tests you wrote look good to me, I think with the nature of this library the test-to-package code ratio will (or should) always be a little on the higher side :)

mastersplinter commented on July 24, 2024

Purely out of convention I think the class method approach is more consistent across the codebase.

cosmicBboy commented on July 24, 2024

Great, thanks for discussing and contributing @chr1st1ank! After this is done I'll push out the ~~0.2.8~~ 0.3.0 release: https://github.com/pandera-dev/pandera/milestone/3

Edit: Thinking about this some more, I think #164 is a significant enough enhancement to warrant a minor version bump. I'll cut a 0.2.8 release in the next few days since I think there have been a good number of bug-fixes since 0.2.7.

cosmicBboy commented on July 24, 2024

fixed by #164

vovavili commented on July 24, 2024

> For datetime dtypes
>
>   • Check.date_format: check for date format "<some_format>"
>   • (comparison operators and range checks also apply for datetime dtype)

Hello Nils! Why was it decided to close this issue without implementing Check.date_format in checks.py?

Vladimir
