insight's Introduction

🧿 insight


A Python package to quickly understand, assess, and compare pandas Series and DataFrames.

The package centres on easy-to-use metrics and intelligent plotting functions. The metrics can also be configured from YAML, which makes it simple to build benchmarking and assessment scripts.


Installation

pip install insight

Usage

Metrics

At the core of insight are the metric classes, which can be evaluated on one series, two series, one DataFrame, or two DataFrames.

>>> import insight.metrics as m
>>> metric = m.EarthMoversDistance()
>>> metric(df['A'], df['B'])
0.14
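Metrics that operate on a single column are called with one series. A minimal sketch, assuming the Mean metric referenced later on this page (the exact class name and behaviour are illustrative, not confirmed API):

>>> single_column_metric = m.Mean()
>>> single_column_metric(df['A'])  # returns a single float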

Plotting

The package provides various plotting functions that let you easily explore any series, DataFrame, or collection of DataFrames.

>>> import insight.plotting as p
>>> p.plot_dataset([df1, df2])

Migrations

insight writes results to the Postgres database configured via environment variables. To run migrations against it, simply run:

insight-migrations

insight's People

Contributors

akanksha1304, dependabot[bot], hamishteagle, hebruwu, jamied157, marqueewinq, nialldevlin1, pre-commit-ci[bot], simonhkswan, tomcarter23


insight's Issues

Add typing stubs

mypy gives an error: Cannot find implementation or library stub for module named "insight" [import]

Need to add those!
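The usual fix for this mypy error is the PEP 561 route: ship a py.typed marker so type checkers pick up the package's inline annotations. A minimal sketch, assuming insight is packaged with setuptools (the real build configuration may differ, e.g. it may be pyproject-based):

    # 1. Add an empty marker file at insight/py.typed.
    # 2. Make sure the marker ships with the package, e.g. with setuptools:
    from setuptools import find_packages, setup

    setup(
        name="insight",
        packages=find_packages(),
        # PEP 561 marker: tells mypy the package provides inline type hints
        package_data={"insight": ["py.typed"]},
    )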

Additional metrics (+tests)

For the tests of the new metrics, it would be good to follow the Arrange-Act-Assert pattern (worth looking this approach up too):

  • Arrange
  • Act
  • Assert

Let's have a look at pytest fixtures too.
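A minimal sketch of what such a test could look like, assuming the EarthMoversDistance call signature shown in the README and that identical distributions are at zero distance; the fixture and test names are illustrative, not the actual test suite:

    import pandas as pd
    import pytest

    from insight.metrics import EarthMoversDistance


    @pytest.fixture
    def identical_series():
        # Arrange: two series drawn from the same (identical) distribution
        sr = pd.Series(["a", "b", "b", "c"])
        return sr, sr.copy()


    def test_emd_is_zero_for_identical_series(identical_series):
        sr_a, sr_b = identical_series

        # Act
        result = EarthMoversDistance()(sr_a, sr_b)

        # Assert: identical distributions should be at zero distance
        assert result == pytest.approx(0.0)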

TypeError for columns with dtype=object that could be inferred as numeric dtype

Error:

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Suppose I have the following column (it's from the NOAA dataset): ['718270', '718090', '718090', '710680', '475840'] (in the original dataset the column also contains some NaNs)

The original dtype of this column is 'object' (because each item is a string). In KullbackLeiblerDivergence we have the following code:

    def _compute_metric(self, sr_a: pd.Series, sr_b: pd.Series):
        (p, q) = zipped_hist((sr_a, sr_b), check=self.check)
        ...

In zipped_hist we have:

    joint = pd.concat(data)
    is_continuous = check.continuous(pd.Series(joint))

    if is_continuous:
        np.histogram(...)

The error is later caused by the dtype=object column being passed into np.histogram. It happens because we use the check.continuous method to determine whether the column is continuous, and inside that method we use check.infer_dtype to change the dtype (to int in this case). But we never convert the original column, so np.histogram gets the series with dtype=object.

Is this a bug? If so, would this be a reasonable fix?

    joint = pd.concat(data)
    is_continuous = check.continuous(pd.Series(joint))
    joint = check.infer_dtype(joint)
    data = [check.infer_dtype(series) for series in data]

Optimization list for metrics

  • In classes like KendallTauCorrelation, series are converted to integers when they are of datetime type; this conversion is expensive. The conversion to categorical variables in KendallTauCorrelation is also repeated for sr_a and sr_b, which could be abstracted away.
  • In Mean._compute_metric, subtracting and then adding an array of zeros is redundant; it could be simplified to np.nanmean(sr.values) (see the sketch after this list).
  • The StandardDeviation class performs a sort before trimming outliers, which may not be necessary if a significant number of values are removed.
  • In EarthMoversDistance, the operations on pandas Series could be performed more efficiently. We also use a dict to count unique values, even though pandas has built-in functions for that, so we are possibly losing efficiency there as well.
  • We can use np.nan_to_num in EarthMoversDistanceBinned to handle NaNs more explicitly.
  • In several metrics, the checks for empty series or other special conditions that return a default value could be moved earlier to be more efficient (not critical).
  • The check_column_types method is quite similar across different classes; we might want to abstract it (not critical).
  • Overall, there could be less code duplication (not critical).
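A minimal sketch of the simplification suggested for Mean, assuming the metric ultimately reduces to a NaN-aware mean of the column values (the real class carries more structure, e.g. type checks):

    import numpy as np
    import pandas as pd

    def mean_of_series(sr: pd.Series) -> float:
        # np.nanmean skips NaNs directly, so there is no need to subtract
        # and re-add an array of zeros before averaging
        return float(np.nanmean(sr.values))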


Error in bin centre computation for EMD

bin_centers = bin_edges[:-1] + np.diff(bin_edges) / 2

This computation of the bin centers is incorrect. It needs to be:

bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

Also, if we want to handle the bin_edges = None case, we could have something like this:

def _compute_metric(self, x: pd.Series, y: pd.Series) -> float:
    (p, q), bin_edges = utils.zipped_hist((x, y), bin_edges=self.bin_edges, ret_bins=True)

    if bin_edges is None:
        bin_centers = np.arange(len(p))
    else:
        bin_centers = (np.array(bin_edges[:-1]) + np.array(bin_edges[1:])) / 2

    return wasserstein_distance(bin_centers, bin_centers, u_weights=p, v_weights=q)

If we change it to this, there's no need to have a separate EMD implementation (it behaves the same as emd_samples from pyemd).

Cramer's V Produces NaNs on COMPAS

The current version of Cramer's V produces NaNs and RuntimeWarnings when run on categorical column pairs from the COMPAS dataset or when used for generating heatmaps in Fairlens.

It seems to work when used on the UCI Adult and German Credit datasets, so it might be a more subtle issue.

Here is the specific RuntimeWarning:

(screenshot of the RuntimeWarning)
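This is not insight's actual implementation, but one plausible way a textbook Cramer's V can produce a NaN plus a RuntimeWarning is the degenerate case where one of the two columns has only a single observed category, so min(r - 1, k - 1) is zero and the final division is 0/0. A minimal, self-contained sketch of that case (the real cause on COMPAS may well be different):

    import numpy as np
    import pandas as pd

    def cramers_v(sr_a: pd.Series, sr_b: pd.Series) -> float:
        # Textbook (uncorrected) Cramer's V from the contingency table;
        # NOT insight's implementation, just an illustration.
        table = pd.crosstab(sr_a, sr_b).to_numpy().astype(float)
        n = table.sum()
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
        chi2 = ((table - expected) ** 2 / expected).sum()
        r, k = table.shape
        # If either column has only one observed category, min(r - 1, k - 1) == 0
        # and the division below is 0/0 -> NaN plus a RuntimeWarning.
        return float(np.sqrt((chi2 / n) / np.float64(min(r - 1, k - 1))))

    sr_a = pd.Series(["x"] * 10)        # degenerate: a single category
    sr_b = pd.Series(["a", "b"] * 5)
    print(cramers_v(sr_a, sr_b))        # nan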

Port metrics matrix, adapter, etc. to the insight repo

Port the following metrics to the insight repo:

  • DiffColumnMetricAdapter
  • ColumnVector
  • DataFrameVector
  • TwoDataFrameVector
  • ColumnMetricVector
  • ColumnComparisonVector
  • RollingColumnMetricVector
  • ChainColumnVector
  • DataFrameMatrix
  • TwoDataFrameMatrix
  • TwoColumnMetricMatrix
  • DiffMetricMatrix

Complicated matrix metric return type

The return type here looks pretty complicated. I suppose it's because sometimes we return a dataframe of values and sometimes a dataframe of p_values as well, right?

It would be nice if we could keep the return types as just a single DataFrame.

Maybe using something like this could help?

>>> import pandas as pd
>>> import numpy as np
>>> values = np.array([1, 1.5, 2])
>>> p_values = np.array([0.5, 0.1, 0.01])

>>> print(values, values.dtype)
[1.  1.5 2. ] float64

>>> print(p_values, p_values.dtype)
[0.5  0.1  0.01] float64

>>> metric = np.dtype([
...     ('value', 'f4'),
...     ('p_value', 'f4')
... ])
>>> combined = np.array(list(zip(values, p_values)), dtype=metric)

>>> print(combined, combined.dtype)
[(1. , 0.5 ) (1.5, 0.1 ) (2. , 0.01)] [('value', '<f4'), ('p_value', '<f4')]

>>> print(combined['value'])
[1.  1.5 2. ]

>>> print(combined['p_value'])
[0.5  0.1  0.01]

Originally posted by @simonhkswan in #32 (comment)
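As a follow-up: a structured array like this converts directly into a single DataFrame, which would keep the return type as one DataFrame as suggested above. A small sketch continuing the example (pd is assumed to be the pandas import from that snippet):

>>> df_result = pd.DataFrame(combined)   # one DataFrame with both fields as columns
>>> list(df_result.columns)
['value', 'p_value']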

Configure metrics with YAML

The idea has two steps:

  1. Add Metric.to_dict() and Metric.from_dict(d: Dict[str, Any]).
  2. Convert Python dictionaries to/from YAML.

from abc import ABC
from typing import Any, Dict


class Metric1(ABC):
    def __init__(self, a, b):
        self._a = a
        self._b = b

    @property
    def a(self) -> float:
        return self._a

    def to_dict(self) -> Dict[str, Any]:
        raise NotImplementedError

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Metric1":
        raise NotImplementedError


class Metric2(Metric1):
    def __init__(self, a):
        super().__init__(a, None)

    @property
    def a(self) -> float:
        return self._a

    @property
    def b(self) -> float:
        return self._a

    def to_dict(self) -> Dict[str, Any]:
        raise NotImplementedError

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Metric1":
        raise NotImplementedError

A good test of the functionality:

# for any metric
my_metric = Metric(**config)

d = my_metric.to_dict()
my_metric2 = Metric.from_dict(d)

assert my_metric(df) == my_metric2(df)

If I am given a dictionary, which function do I call to recreate the right metric?

d = load_dict(...)

EarthMoversDistance.from_dict(d)
# or do I call
KolmogorovSmirnovDistance.from_dict(d, check=check)  # this one needs an extra Check argument
# or...
TwoColumnMetric.from_dict(d)

Given a config dictionary such as:

{
  "metric_name": "EarthMoversDistance",
  "other_parameter": 0.3
}

one option is a registry on the base class:

class Metric(ABC):
    
    _registry: Dict[str, Type["Metric"]] = dict()

    @classmethod
    def _register_class(cls):
        ...
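A minimal sketch of how the registry and the YAML layer could fit together, assuming PyYAML is available; the registration mechanism, the "metric_name" key, and the toy EarthMoversDistance subclass are illustrative assumptions, not insight's actual API:

    from abc import ABC
    from typing import Any, Dict, Type

    import yaml


    class Metric(ABC):
        _registry: Dict[str, Type["Metric"]] = {}

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            # register every concrete subclass under its class name
            Metric._registry[cls.__name__] = cls

        @classmethod
        def from_dict(cls, d: Dict[str, Any]) -> "Metric":
            params = dict(d)
            name = params.pop("metric_name")
            # dispatch to the right subclass based on "metric_name"
            return cls._registry[name](**params)


    class EarthMoversDistance(Metric):
        def __init__(self, other_parameter: float = 0.0):
            self.other_parameter = other_parameter


    config = yaml.safe_load("metric_name: EarthMoversDistance\nother_parameter: 0.3\n")
    metric = Metric.from_dict(config)
    print(type(metric).__name__, metric.other_parameter)  # EarthMoversDistance 0.3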

Be able to effectively plot visualizations of different metrics

Say you have a Metric and you only know its abstract type (e.g. Column, TwoColumn, etc.). You also have two DataFrames, df_a and df_b.

Can you write some functions to evaluate the metric on the DataFrames and then provide useful plots of the results?

class AvgColumnMetric(DataFrameMetric):
    """Adapter class: averages a column metric over all columns of a dataframe."""

    def __init__(self, metric: ColumnMetric):
        self._metric = metric
        self.name = f"avg_{self._metric.name}"

    def call(self, df: pd.DataFrame) -> float:
        # evaluate the wrapped column metric on every column and average the results
        values = [self._metric(df[col]) for col in df.columns]
        return np.mean(values)

Column: one value per column, with a colour for each dataframe, shown as a bar chart.

TwoColumn:

  • comparing df_a[col] to df_b[col] for each column of the two dataframes -> a bar chart in one colour, plus a plot per column showing the two distributions
  • comparing df_a[col1] to df_a[col2] and df_b[col1] to df_b[col2] -> a confusion matrix for each dataframe (two in total)
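A minimal sketch of the Column case, assuming a one-column metric with the call signature used in the README; the function name and plotting layout are an illustration, not insight.plotting's actual API:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd


    def plot_column_metric(metric, df_a: pd.DataFrame, df_b: pd.DataFrame, name: str = "metric"):
        """Bar chart: one value per column, one colour per dataframe."""
        columns = [col for col in df_a.columns if col in df_b.columns]
        values_a = [metric(df_a[col]) for col in columns]
        values_b = [metric(df_b[col]) for col in columns]

        x = np.arange(len(columns))
        width = 0.4
        fig, ax = plt.subplots()
        ax.bar(x - width / 2, values_a, width, label="df_a")
        ax.bar(x + width / 2, values_b, width, label="df_b")
        ax.set_xticks(x)
        ax.set_xticklabels(columns)
        ax.set_ylabel(name)
        ax.legend()
        return fig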
