synthesized-io / insight
🧿 Metrics & Monitoring of Datasets
License: BSD 3-Clause "New" or "Revised" License
bin_centers = bin_edges[:-1] + np.diff(bin_edges) / 2
This computation of bin centers is incorrect. It needs to be this:
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
Also, if we want to handle the bin_edges = None case, we could have something like this:
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

def _compute_metric(self, x: pd.Series, y: pd.Series) -> float:
    (p, q), bin_edges = utils.zipped_hist((x, y), bin_edges=self.bin_edges, ret_bins=True)
    if bin_edges is None:
        # no edges available, so fall back to integer bin positions
        bin_centers = np.arange(len(p))
    else:
        bin_centers = (np.array(bin_edges[:-1]) + np.array(bin_edges[1:])) / 2
    return wasserstein_distance(bin_centers, bin_centers, u_weights=p, v_weights=q)
If we change it to this, there's no need to have a separate EMD (it behaves the same as emd_samples from pyemd).
The return type here looks pretty complicated. I suppose it's because sometimes we are returning a dataframe of values and sometimes a dataframe of p_values as well, right?
It would be nice if we could keep the return type as just a single DataFrame.
Maybe using something like this could help?
>>> import pandas as pd
>>> import numpy as np
>>> values = np.array([1, 1.5, 2])
>>> p_values = np.array([0.5, 0.1, 0.01])
>>> print(values, values.dtype)
[1. 1.5 2. ] float64
>>> print(p_values, p_values.dtype)
[0.5 0.1 0.01] float64
>>> metric = np.dtype([
... ('value', 'f4'),
... ('p_value', 'f4')
... ])
>>> combined = np.array([a for a in zip(values, p_values)], dtype=metric)
>>> print(combined, combined.dtype)
[(1. , 0.5 ) (1.5, 0.1 ) (2. , 0.01)] [('value', '<f4'), ('p_value', '<f4')]
>>> print(combined['value'])
[1. 1.5 2. ]
>>> print(combined['p_value'])
[0.5 0.1 0.01]
Originally posted by @simonhkswan in #32 (comment)
The idea has two steps: Metric.to_dict() and Metric.from_dict(d: Dict[str, Any]).
from __future__ import annotations

from abc import ABC
from typing import Any, Dict


class Metric1(ABC):
    def __init__(self, a, b):
        self._a = a
        self._b = b

    @property
    def a(self) -> float:
        return self._a

    def to_dict(self) -> Dict[str, Any]:
        raise NotImplementedError

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> Metric1:
        raise NotImplementedError


class Metric2(Metric1):
    def __init__(self, a):
        super().__init__(a, None)

    @property
    def a(self) -> float:
        return self._a

    @property
    def b(self) -> float:
        # in this subclass, b collapses onto a
        return self._a

    def to_dict(self) -> Dict[str, Any]:
        raise NotImplementedError

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> Metric2:
        raise NotImplementedError
A good test of the functionality:
# for any metric
my_metric = Metric(**config)
d = my_metric.to_dict()
my_metric2 = Metric.from_dict(d)
assert my_metric(df) == my_metric2(df)
If I am given a dictionary, which function do I call to recreate the right metric?
d = load_dict(...)
EarthMoversDistance.from_dict(d)
# or do i call
KolmogorovSmirnovDistance.from_dict(d, check=check)
# or...
TwoColumnMetric.from_dict(d)
{
"metric_name": "EarthMoversDistance",
"other_parameter": 0.3
}
class Metric(ABC):
    _registry: Dict[str, Type["Metric"]] = dict()

    @classmethod
    def _register_class(cls):
        ...
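A fuller sketch of how this registry could answer the dispatch question above (the __init_subclass__ hook, the metric_name convention, and the EarthMoversDistance parameter here are assumptions for illustration, not the package's actual API):

from __future__ import annotations

from abc import ABC
from typing import Any, Dict, Type


class Metric(ABC):
    # maps serialized class names to classes, filled automatically
    _registry: Dict[str, Type[Metric]] = dict()

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        cls._register_class()

    @classmethod
    def _register_class(cls) -> None:
        Metric._registry[cls.__name__] = cls

    def to_dict(self) -> Dict[str, Any]:
        # assumed convention: carry the class name alongside the parameters
        return {"metric_name": type(self).__name__}

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> Metric:
        # generic entry point: look up the concrete class by name,
        # then pass the remaining keys as constructor arguments
        d = dict(d)
        target = Metric._registry[d.pop("metric_name")]
        return target(**d)


class EarthMoversDistance(Metric):
    def __init__(self, other_parameter: float = 0.0):
        self.other_parameter = other_parameter  # hypothetical parameter

    def to_dict(self) -> Dict[str, Any]:
        return {**super().to_dict(), "other_parameter": self.other_parameter}


metric = Metric.from_dict({"metric_name": "EarthMoversDistance", "other_parameter": 0.3})
assert isinstance(metric, EarthMoversDistance)

With this, calling from_dict on the base class (or any abstract subclass) resolves to the same registered concrete class, so the caller doesn't need to know the type in advance.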
Some of the metrics in the insight package are returning infinity instead of NaN.
Will share more details here shortly.
Add: mypy
mypy currently gives an error: Cannot find implementation or library stub for module named "insight" [import]
We need to add the missing stubs!
Looks like .DS_Store isn't in the .gitignore file.
numpy.typing only works with Python 3.8+ and NumPy 1.21+, hence changes are needed to match those requirements.
Provide an architectural overview of how metrics work in the current setup of synthesized insight.
Miro Board: https://miro.com/app/board/uXjVOrtAnvg=/
The current version of Cramer's V produces NaNs and RuntimeWarnings when run on categorical column pairs from the COMPAS dataset or when used for generating heatmaps in Fairlens. It seems to work when used on the UCI Adult and German Credit datasets, so it might be a more subtle issue.
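For context, here is a minimal sketch of a bias-corrected Cramer's V (cramers_v_corrected is a hypothetical helper, not necessarily the package's implementation). Assuming the real code has this shape, a pair of columns with many categories relative to the number of rows is one way to drive the corrected denominator to zero or below, which makes the division and sqrt emit RuntimeWarnings and return NaN:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_corrected(sr_a: pd.Series, sr_b: pd.Series) -> float:
    confusion = pd.crosstab(sr_a, sr_b)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    phi2 = chi2 / n
    r, k = confusion.shape
    # Bergsma-Wicher bias correction; when categories are numerous
    # relative to n, min(kcorr - 1, rcorr - 1) can reach zero or go
    # negative, and the division / sqrt below then emits
    # RuntimeWarnings and can yield NaN
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))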
Here is the specific RuntimeWarning:
Say you have a Metric and you only know its abstract type (i.e. Column, TwoColumn, etc.). You also have two dataframes: df_a, df_b.
Can you write some functions to evaluate the metric on the dataframes and then provide useful plots of the results?
class AvgColumnMetric(DataFrameMetric):
    """Adapter class: lifts a ColumnMetric to a DataFrameMetric by averaging."""

    def __init__(self, metric: ColumnMetric):
        self._metric = metric
        self.name = f"avg_{self._metric.name}"

    def __call__(self, df: pd.DataFrame) -> float:
        # evaluate the wrapped metric on each column, then average
        values = [self._metric(df[col]) for col in df.columns]
        return float(np.mean(values))
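Hypothetical usage (Mean is just an assumed ColumnMetric for illustration):

avg_mean = AvgColumnMetric(Mean())
value = avg_mean(df_a)  # mean of the per-column means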
Column -> a value for each column, a colour for each dataframe, as a bar chart (see the sketch below)
TwoColumn -> df_b[col1] to df_b[col2]
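A minimal sketch of the bar-chart case, assuming a column metric is simply callable on a series (plot_column_metric and its signature are assumptions, not existing API):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_column_metric(metric, df_a: pd.DataFrame, df_b: pd.DataFrame) -> None:
    """Evaluate the metric on every shared column and draw a grouped
    bar chart: one bar per column, one colour per dataframe."""
    cols = [c for c in df_a.columns if c in df_b.columns]
    values_a = [metric(df_a[c]) for c in cols]
    values_b = [metric(df_b[c]) for c in cols]

    x = np.arange(len(cols))
    width = 0.4
    plt.bar(x - width / 2, values_a, width, label="df_a")
    plt.bar(x + width / 2, values_b, width, label="df_b")
    plt.xticks(x, cols, rotation=45)
    plt.ylabel(getattr(metric, "name", "metric"))
    plt.legend()
    plt.tight_layout()
    plt.show()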
The KT (Kendall's Tau) correlation metric is causing evaluation scripts to run very slowly.
When comparing a column of float-type data with a column of int-type data, the EarthMoversDistance metric returns the maximum distance.
Changes are required so that the SDK's implementation of the Check class can work without any issues. Also, infer_dtype needs to be accessible through the SDK's check implementation.
For the tests of the new metrics, it would be good to follow this approach (you can search for this approach too).
Let's have a look at pytest.fixture too.
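A minimal sketch of what a fixture-based metric test could look like (dummy_distance is a stand-in for a real metric class, and the seed and size are arbitrary):

import numpy as np
import pandas as pd
import pytest


def dummy_distance(sr_a: pd.Series, sr_b: pd.Series) -> float:
    # stand-in for a real two-column metric; swap in the actual class
    return float(abs(sr_a.mean() - sr_b.mean()))


@pytest.fixture
def continuous_series() -> pd.Series:
    # deterministic sample data shared across metric tests
    rng = np.random.default_rng(seed=42)
    return pd.Series(rng.normal(size=1000))


def test_distance_is_zero_on_identical_series(continuous_series):
    assert dummy_distance(continuous_series, continuous_series) == pytest.approx(0.0)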
Error:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Suppose I have the following column (it's from the NOAA dataset): ['718270', '718090', '718090', '710680', '475840'] (in the original dataset the column also contains some NaNs).
The original dtype of this column is 'object' (because each item is a string). In KullbackLeiblerDivergence we have the following code:
def _compute_metric(self, sr_a: pd.Series, sr_b: pd.Series):
    (p, q) = zipped_hist((sr_a, sr_b), check=self.check)
    ...
In zipped_hist we have:
joint = pd.concat(data)
is_continuous = check.continuous(pd.Series(joint))
if is_continuous:
    np.histogram(...)
The error later is caused by the dtype=object column passed into the np.histogram function. It happens because we use check.continuous method to determine whether the column is continuous, and inside that method we use check.infer_dtype to change the dtype (to int in this case). But we never convert the original column, so np.histogram gets the series with dtype=object.
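A minimal standalone reproduction of this failure mode (my own sketch, outside the package):

import numpy as np
import pandas as pd

sr = pd.Series(['718270', '718090', '718090', '710680', '475840'])  # dtype=object
np.histogram(sr)  # TypeError: ufunc 'isfinite' not supported for the input types ...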
Is this a bug? If so, would this be a solution?
joint = pd.concat(data)
is_continuous = check.continuous(pd.Series(joint))
# convert the original dtype=object data before np.histogram sees it
joint = check.infer_dtype(joint)
data = [check.infer_dtype(series) for series in data]
Remove metrics and related files from the synthesized repo.
CI to be set up in GitHub Actions.
We need to have some form of documentation and notebook examples of how this repo works.
In classes like KendallTauCorrelation, there's a conversion of series to integer if they are of datetime type. This conversion is expensive.
Conversion to categorical vars in KendallTauCorrelation is repeated for sr_a and sr_b, which could be abstracted away.
In Mean._compute_metric, subtracting and then adding an array of zeros is redundant. It could be simplified to np.nanmean(sr.values) (see the sketch after this list).
The StandardDeviation class performs a sort before trimming for outliers, which may not be necessary if a significant number of values are removed.
In EarthMoversDistance, we can perform the operations we need on pd.Series more efficiently. Also, we use a dict to count unique values, but pandas has functions for that, so we are possibly losing efficiency here as well.
We can use np.nan_to_num in EarthMoversDistanceBinned to handle NaNs more explicitly.
In several metrics, there's a check for empty series or specific conditions that returns a default value. We could move these checks up to be more efficient (not critical).
The check_column_types method is quite similar across different classes; we might want to abstract that (not critical).
Overall, it seems like there could be less code duplication (not critical).
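A quick sketch of two of the simplifications above (the function names are placeholders, not the package's method names):

import numpy as np
import pandas as pd


def mean(sr: pd.Series) -> float:
    # the zero-array subtraction/addition adds nothing; np.nanmean
    # alone gives the same result on numeric input
    return float(np.nanmean(sr.values))


def unique_counts(sr: pd.Series) -> pd.Series:
    # pandas' vectorised value_counts replaces a hand-rolled dict loop
    return sr.value_counts()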
Port the following metrics to the insight repo:
DiffColumnMetricAdapter
ColumnVector
DataFrameVector
TwoDataFrameVector
ColumnMetricVector
ColumnComparisonVector
RollingColumnMetricVector
ChainColumnVector
DataFrameMatrix
TwoDataFrameMatrix
TwoColumnMetricMatrix
DiffMetricMatrix