insight's Introduction

🧿 insight


A Python package to quickly understand, assess, and compare pandas Series and DataFrames.

The package centres on easy-to-use metrics and intelligent plotting functions. The metrics can also be configured from YAML, which makes it simple to build benchmarking and assessment scripts.


Installation

pip install insight

Usage

Metrics

At the core of insight are the metric classes, which can be evaluated on one series, two series, one DataFrame, or two DataFrames.

>>> import insight.metrics as m
>>> metric = m.EarthMoversDistance()
>>> metric(df['A'], df['B'])
0.14
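Metrics that operate on a single column are called with one series. A minimal sketch, assuming the Mean metric referenced later on this page (the exact class name and behaviour are illustrative, not confirmed API):

>>> single_column_metric = m.Mean()
>>> single_column_metric(df['A'])  # returns a single float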

Plotting

The package provides various plotting functions that let you easily explore any series, DataFrame, or collection of DataFrames.

>>> import insight.plotting as p
>>> p.plot_dataset([df1, df2])

Migrations

insight writes results to the Postgres database configured via environment variables. To run migrations against it, simply run:

insight-migrations

insight's People

Contributors

akanksha1304, dependabot[bot], hamishteagle, hebruwu, jamied157, marqueewinq, nialldevlin1, pre-commit-ci[bot], simonhkswan, tomcarter23


insight's Issues

Add typing stubs

mypy gives an error: Cannot find implementation or library stub for module named "insight" [import]

Need to add those!
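The usual fix for this mypy error is the PEP 561 route: ship a py.typed marker so type checkers pick up the package's inline annotations. A minimal sketch, assuming insight is packaged with setuptools (the real build configuration may differ, e.g. it may be pyproject-based):

    # 1. Add an empty marker file at insight/py.typed.
    # 2. Make sure the marker ships with the package, e.g. with setuptools:
    from setuptools import find_packages, setup

    setup(
        name="insight",
        packages=find_packages(),
        # PEP 561 marker: tells mypy the package provides inline type hints
        package_data={"insight": ["py.typed"]},
    )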

Additional metrics (+tests)

For the tests of the new metrics, it would be good to follow the Arrange-Act-Assert pattern (worth looking this approach up too):

  • Arrange
  • Act
  • Assert

Let's have a look at pytest fixtures too.
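A minimal sketch of what such a test could look like, assuming the EarthMoversDistance call signature shown in the README and that identical distributions are at zero distance; the fixture and test names are illustrative, not the actual test suite:

    import pandas as pd
    import pytest

    from insight.metrics import EarthMoversDistance


    @pytest.fixture
    def identical_series():
        # Arrange: two series drawn from the same (identical) distribution
        sr = pd.Series(["a", "b", "b", "c"])
        return sr, sr.copy()


    def test_emd_is_zero_for_identical_series(identical_series):
        sr_a, sr_b = identical_series

        # Act
        result = EarthMoversDistance()(sr_a, sr_b)

        # Assert: identical distributions should be at zero distance
        assert result == pytest.approx(0.0)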

TypeError for columns with dtype=object that could be inferred as numeric dtype

Error:

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Suppose I have the following column (it's from the NOAA dataset): ['718270', '718090', '718090', '710680', '475840'] (in the original dataset the column also contains some NaNs)

The original dtype of this column is 'object' (because each item is a string). In KullbackLeiblerDivergence we have the following code:

    def _compute_metric(self, sr_a: pd.Series, sr_b: pd.Series):
        (p, q) = zipped_hist((sr_a, sr_b), check=self.check)
        ...

In zipped_hist we have:

    joint = pd.concat(data)
    is_continuous = check.continuous(pd.Series(joint))

    if is_continuous:
        np.histogram(...)

The error is later caused by the dtype=object column being passed into np.histogram. It happens because we use the check.continuous method to determine whether the column is continuous, and inside that method we use check.infer_dtype to change the dtype (to int in this case). But we never convert the original column, so np.histogram gets the series with dtype=object.

Is this a bug? If so, would this be a reasonable fix?

    joint = pd.concat(data)
    is_continuous = check.continuous(pd.Series(joint))
    joint = check.infer_dtype(joint)
    data = [check.infer_dtype(series) for series in data]

Optimization list for metrics

  • In classes like KendallTauCorrelation, series are converted to integers when they are of datetime type; this conversion is expensive. The conversion to categorical variables in KendallTauCorrelation is also repeated for sr_a and sr_b, which could be abstracted away.
  • In Mean._compute_metric, subtracting and then adding an array of zeros is redundant; it could be simplified to np.nanmean(sr.values) (see the sketch after this list).
  • The StandardDeviation class performs a sort before trimming outliers, which may not be necessary if a significant number of values are removed.
  • In EarthMoversDistance, the operations on pandas Series could be performed more efficiently. We also use a dict to count unique values, even though pandas has built-in functions for that, so we are possibly losing efficiency there as well.
  • We can use np.nan_to_num in EarthMoversDistanceBinned to handle NaNs more explicitly.
  • In several metrics, the checks for empty series or other special conditions that return a default value could be moved earlier to be more efficient (not critical).
  • The check_column_types method is quite similar across different classes; we might want to abstract it (not critical).
  • Overall, there could be less code duplication (not critical).
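A minimal sketch of the simplification suggested for Mean, assuming the metric ultimately reduces to a NaN-aware mean of the column values (the real class carries more structure, e.g. type checks):

    import numpy as np
    import pandas as pd

    def mean_of_series(sr: pd.Series) -> float:
        # np.nanmean skips NaNs directly, so there is no need to subtract
        # and re-add an array of zeros before averaging
        return float(np.nanmean(sr.values))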


Error in bin centre computation for EMD

bin_centers = bin_edges[:-1] + np.diff(bin_edges) / 2

This computation of the bin centers is incorrect. It needs to be:

bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

Also, if we want to handle the bin_edges = None case, we could have something like this:

def _compute_metric(self, x: pd.Series, y: pd.Series) -> float:
    (p, q), bin_edges = utils.zipped_hist((x, y), bin_edges=self.bin_edges, ret_bins=True)

    if bin_edges is None:
        bin_centers = np.arange(len(p))
    else:
        bin_centers = (np.array(bin_edges[:-1]) + np.array(bin_edges[1:])) / 2

    return wasserstein_distance(bin_centers, bin_centers, u_weights=p, v_weights=q)

If we change it to this, there's no need to have a separate EMD implementation (it behaves the same as emd_samples from pyemd).

Cramer's V Produces NaNs on COMPAS

The current version of Cramer's V produces NaNs and RuntimeWarnings when run on categorical column pairs from the COMPAS dataset or when used for generating heatmaps in Fairlens.

It seems to work when used on the UCI Adult and German Credit datasets, so it might be a more subtle issue.

Here is the specific RuntimeWarning:

(screenshot of the RuntimeWarning)
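This is not insight's actual implementation, but one plausible way a textbook Cramer's V can produce a NaN plus a RuntimeWarning is the degenerate case where one of the two columns has only a single observed category, so min(r - 1, k - 1) is zero and the final division is 0/0. A minimal, self-contained sketch of that case (the real cause on COMPAS may well be different):

    import numpy as np
    import pandas as pd

    def cramers_v(sr_a: pd.Series, sr_b: pd.Series) -> float:
        # Textbook (uncorrected) Cramer's V from the contingency table;
        # NOT insight's implementation, just an illustration.
        table = pd.crosstab(sr_a, sr_b).to_numpy().astype(float)
        n = table.sum()
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
        chi2 = ((table - expected) ** 2 / expected).sum()
        r, k = table.shape
        # If either column has only one observed category, min(r - 1, k - 1) == 0
        # and the division below is 0/0 -> NaN plus a RuntimeWarning.
        return float(np.sqrt((chi2 / n) / np.float64(min(r - 1, k - 1))))

    sr_a = pd.Series(["x"] * 10)        # degenerate: a single category
    sr_b = pd.Series(["a", "b"] * 5)
    print(cramers_v(sr_a, sr_b))        # nan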

Port metrics matrix, adapter, etc. to the insight repo

Port the following metrics to the insight repo:

  • DiffColumnMetricAdapter
  • ColumnVector
  • DataFrameVector
  • TwoDataFrameVector
  • ColumnMetricVector
  • ColumnComparisonVector
  • RollingColumnMetricVector
  • ChainColumnVector
  • DataFrameMatrix
  • TwoDataFrameMatrix
  • TwoColumnMetricMatrix
  • DiffMetricMatrix

Complicated matrix metric return type

The return type here looks pretty complicated. I suppose it's because sometimes we return a dataframe of values and sometimes a dataframe of p_values as well, right?

It would be nice if we could keep the return types as just a single DataFrame.

Maybe using something like this could help?

>>> import pandas as pd
>>> import numpy as np
>>> values = np.array([1, 1.5, 2])
>>> p_values = np.array([0.5, 0.1, 0.01])

>>> print(values, values.dtype)
[1.  1.5 2. ] float64

>>> print(p_values, p_values.dtype)
[0.5  0.1  0.01] float64

>>> metric = np.dtype([
...     ('value', 'f4'),
...     ('p_value', 'f4')
... ])
>>> combined = np.array(list(zip(values, p_values)), dtype=metric)

>>> print(combined, combined.dtype)
[(1. , 0.5 ) (1.5, 0.1 ) (2. , 0.01)] [('value', '<f4'), ('p_value', '<f4')]

>>> print(combined['value'])
[1.  1.5 2. ]

>>> print(combined['p_value'])
[0.5  0.1  0.01]

Originally posted by @simonhkswan in #32 (comment)
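As a follow-up: a structured array like this converts directly into a single DataFrame, which would keep the return type as one DataFrame as suggested above. A small sketch continuing the example (pd is assumed to be the pandas import from that snippet):

>>> df_result = pd.DataFrame(combined)   # one DataFrame with both fields as columns
>>> list(df_result.columns)
['value', 'p_value']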

Configure metrics with YAML

The idea has two steps:

  1. Add Metric.to_dict() and Metric.from_dict(d: Dict[str, Any]).
  2. Convert Python dictionaries to/from YAML.

from abc import ABC
from typing import Any, Dict


class Metric1(ABC):
    def __init__(self, a, b):
        self._a = a
        self._b = b

    @property
    def a(self) -> float:
        return self._a

    def to_dict(self) -> Dict[str, Any]:
        raise NotImplementedError

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Metric1":
        raise NotImplementedError


class Metric2(Metric1):
    def __init__(self, a):
        super().__init__(a, None)

    @property
    def a(self) -> float:
        return self._a

    @property
    def b(self) -> float:
        return self._a

    def to_dict(self) -> Dict[str, Any]:
        raise NotImplementedError

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Metric1":
        raise NotImplementedError

A good test of the functionality:

# for any metric
my_metric = Metric(**config)

d = my_metric.to_dict()
my_metric2 = Metric.from_dict(d)

assert my_metric(df) == my_metric2(df)

If I am given a dictionary, which function do I call to recreate the right metric?

d = load_dict(...)

EarthMoversDistance.from_dict(d)
# or do I call
KolmogorovSmirnovDistance.from_dict(d, check=check)  # this one needs an extra Check argument
# or...
TwoColumnMetric.from_dict(d)

Given a config dictionary such as:

{
  "metric_name": "EarthMoversDistance",
  "other_parameter": 0.3
}

one option is a registry on the base class:

class Metric(ABC):
    
    _registry: Dict[str, Type["Metric"]] = dict()

    @classmethod
    def _register_class(cls):
        ...
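A minimal sketch of how the registry and the YAML layer could fit together, assuming PyYAML is available; the registration mechanism, the "metric_name" key, and the toy EarthMoversDistance subclass are illustrative assumptions, not insight's actual API:

    from abc import ABC
    from typing import Any, Dict, Type

    import yaml


    class Metric(ABC):
        _registry: Dict[str, Type["Metric"]] = {}

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            # register every concrete subclass under its class name
            Metric._registry[cls.__name__] = cls

        @classmethod
        def from_dict(cls, d: Dict[str, Any]) -> "Metric":
            params = dict(d)
            name = params.pop("metric_name")
            # dispatch to the right subclass based on "metric_name"
            return cls._registry[name](**params)


    class EarthMoversDistance(Metric):
        def __init__(self, other_parameter: float = 0.0):
            self.other_parameter = other_parameter


    config = yaml.safe_load("metric_name: EarthMoversDistance\nother_parameter: 0.3\n")
    metric = Metric.from_dict(config)
    print(type(metric).__name__, metric.other_parameter)  # EarthMoversDistance 0.3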

Be able to effectively plot visualizations of different metrics

Say you have a Metric and you only know its abstract type (e.g. Column, TwoColumn, etc.). You also have two DataFrames, df_a and df_b.

Can you write some functions to evaluate the metric on the DataFrames and then provide useful plots of the results?

class AvgColumnMetric(DataFrameMetric):
    """Adapter class: averages a column metric over all columns of a dataframe."""

    def __init__(self, metric: ColumnMetric):
        self._metric = metric
        self.name = f"avg_{self._metric.name}"

    def call(self, df: pd.DataFrame) -> float:
        # evaluate the wrapped column metric on every column and average the results
        values = [self._metric(df[col]) for col in df.columns]
        return np.mean(values)

Column: one value per column, with a colour for each dataframe, shown as a bar chart.

TwoColumn:

  • comparing df_a[col] to df_b[col] for each column of the two dataframes -> a bar chart in one colour, plus a plot per column showing the two distributions
  • comparing df_a[col1] to df_a[col2] and df_b[col1] to df_b[col2] -> a confusion matrix for each dataframe (two in total)
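A minimal sketch of the Column case, assuming a one-column metric with the call signature used in the README; the function name and plotting layout are an illustration, not insight.plotting's actual API:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd


    def plot_column_metric(metric, df_a: pd.DataFrame, df_b: pd.DataFrame, name: str = "metric"):
        """Bar chart: one value per column, one colour per dataframe."""
        columns = [col for col in df_a.columns if col in df_b.columns]
        values_a = [metric(df_a[col]) for col in columns]
        values_b = [metric(df_b[col]) for col in columns]

        x = np.arange(len(columns))
        width = 0.4
        fig, ax = plt.subplots()
        ax.bar(x - width / 2, values_a, width, label="df_a")
        ax.bar(x + width / 2, values_b, width, label="df_b")
        ax.set_xticks(x)
        ax.set_xticklabels(columns)
        ax.set_ylabel(name)
        ax.legend()
        return fig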
