datalib's Issues
TST Use scikit-learn general tests
scikit-learn has some tests that run across multiple modules, checking docstring quality and other desirable properties, such as:
sklearn.tests.test_docstring_parameters
sklearn.tests.test_public_functions
sklearn.tests.test_docstrings
These are a few examples, but there may be others.
I'd like to use these same tests on our code. How should we do this? Is it possible to "import" tests from other libraries? Is it good practice? The other option is to copy the code literally. It looks like imb-learn copies it, but I would say this could be automated (?).
Either way, we should implement the tests at some point.
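To make the kind of check concrete, here is a minimal sketch of a docstring-parameter test. It is not scikit-learn's implementation: the helper `undocumented_parameters` and the two toy functions are hypothetical, and it assumes numpydoc-style `name : type` entries.

```python
import inspect
import re


def undocumented_parameters(func):
    """Return the parameters of `func` that have no `name : type` docstring entry."""
    doc = inspect.getdoc(func) or ""
    missing = []
    for name in inspect.signature(func).parameters:
        # numpydoc lists each parameter at the start of a line as "name : type".
        if not re.search(rf"^\s*{re.escape(name)}\s*:", doc, flags=re.MULTILINE):
            missing.append(name)
    return missing


def fully_documented(y_true, y_pred):
    """Toy function (hypothetical).

    Parameters
    ----------
    y_true : array-like
        True targets.
    y_pred : array-like
        Predicted scores.
    """


def partially_documented(data, a):
    """Toy function (hypothetical).

    Parameters
    ----------
    data : array-like
        Input data.
    """
```

A test could loop over every public function in the package and assert that the returned list is empty.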
CI CodeCov with GitHub Actions
I think we should add codecov for tests during PR checks.
FEA Create Ranked Probability Score
FEA Discrete lift curve + scores implementation
What I'm calling "discrete_lift_curve": I simply sort by score, split the samples into score groups, and see how the average target of each group behaves. Ideally, groups with lower scores should also have a lower average y (both in the case of predict_proba for classification and in the case of predict for regression).
I also have some statistics that I can get from this curve, such as the mean difference between consecutive bins. These can be interpreted as scores.
I have this Python script that implements the metric, but I'm using pandas. Probably ChatGPT can help us change the pandas to numpy.
Also, this implementation uses "mean" as the aggregation for the "groupby", but I think we should let users define what they want (sometimes a quantile will be more valuable).
```python
import numpy as np
import pandas as pd
from sklearn.utils.validation import _check_sample_weight


def weighted_qcut(values, weights, q, **kwargs):
    """Return weighted quantile cuts from a given series, values."""
    # https://stackoverflow.com/questions/45528029/python-how-to-create-weighted-quantiles-in-pandas
    if pd.api.types.is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    # Cumulative weight of the samples, ordered by increasing value.
    order = pd.Series(weights).iloc[values.argsort()].cumsum()
    # Bin the normalized cumulative weight, then restore the original order.
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()


def discrete_lift_curve(y_true, y_pred, bins=10, sample_weight=None):
    y_true = np.asarray(y_true)
    sample_weight = _check_sample_weight(sample_weight, y_true)
    bins_scores = weighted_qcut(y_pred, sample_weight, bins, labels=False).to_numpy()
    # Weighted average of y_true inside each score bin.
    values = np.array([
        np.average(y_true[bins_scores == q], weights=sample_weight[bins_scores == q])
        for q in range(bins)
    ])
    return np.arange(bins), values


def last_bin_of_discrete_lift_curve(y_true, y_pred, bins=10, sample_weight=None):
    _, values = discrete_lift_curve(y_true, y_pred, bins=bins, sample_weight=sample_weight)
    return values[-1]


def mean_diff_of_discrete_lift_curve(y_true, y_pred, bins=10, sample_weight=None):
    _, values = discrete_lift_curve(y_true, y_pred, bins=bins, sample_weight=sample_weight)
    # Mean difference between consecutive bins.
    return pd.Series(values).diff().mean()
```
Should this be in this library?
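As a starting point for the two follow-ups mentioned above (dropping pandas and letting the user pick the aggregation), here is a rough numpy-only sketch. The name `discrete_lift_curve_np` and the `agg` parameter are assumptions made for illustration, not an agreed API; `agg` is any callable taking `(y, w)` for one bin and returning a number. Empty bins are returned as NaN, which is a choice made here and not part of the script above.

```python
import numpy as np


def discrete_lift_curve_np(y_true, y_pred, bins=10, sample_weight=None, agg=None):
    """Numpy-only sketch: aggregate y_true inside weighted quantile bins of y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones_like(y_true)
    else:
        sample_weight = np.asarray(sample_weight, dtype=float)
    if agg is None:
        # Default aggregation: weighted mean, as in the pandas version.
        def agg(y, w):
            return np.average(y, weights=w)

    # Sort by score and compute the normalized cumulative weight.
    order = np.argsort(y_pred, kind="stable")
    y_sorted = y_true[order]
    w_sorted = sample_weight[order]
    frac = np.cumsum(w_sorted) / np.sum(w_sorted)

    # Right-closed bins (edges[i], edges[i + 1]], mirroring pd.cut's default.
    edges = np.linspace(0, 1, bins + 1)
    bin_index = np.clip(np.searchsorted(edges, frac, side="left") - 1, 0, bins - 1)

    values = np.full(bins, np.nan)
    for b in range(bins):
        mask = bin_index == b
        if mask.any():
            values[b] = agg(y_sorted[mask], w_sorted[mask])
    return np.arange(bins), values
```

For example, `agg=lambda y, w: np.quantile(y, 0.9)` would report an unweighted 90th percentile per bin in place of the mean.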
MAINT Make automatic validation for all public functions
In the past few months, scikit-learn started using a decorator to validate function parameters. We worked on our latest features before that implementation existed, so it is fine to leave them as they are for now, but we should update them someday.
Functions on the main branch that do not have the decorator but should (because they contain some `if`-based checks):
- metrics.delinquency_curve
- metrics.cap_curve
Original issue from sklearn explaining how to use it:
scikit-learn/scikit-learn#24862
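To illustrate the idea only, here is a simplified stand-in for such a decorator. It is not scikit-learn's implementation and does not call it (scikit-learn's own decorator lives in a private module); `validate_params_sketch` and `toy_curve` are hypothetical names, and the constraints are reduced to plain type checks.

```python
import functools
import inspect


def validate_params_sketch(constraints):
    """Check argument types before calling the function.

    `constraints` maps a parameter name to a tuple of allowed types.
    """
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, allowed in constraints.items():
                value = bound.arguments[name]
                if not isinstance(value, allowed):
                    raise TypeError(
                        f"The {name!r} parameter of {func.__name__} must be one of "
                        f"{allowed}, got {type(value).__name__} instead."
                    )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@validate_params_sketch({"bins": (int,), "sample_weight": (list, tuple, type(None))})
def toy_curve(y_true, bins=10, sample_weight=None):
    # Hypothetical function body: the validation replaces hand-written `if` checks.
    return bins
```

The benefit is that the per-function `if` checks collapse into one declarative mapping that can also be tested generically.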
FEA Create KS score/curve/Display
Creating the issue for documentation.
The idea is to apply the Kolmogorov-Smirnov (KS) statistic as a way to evaluate binary (and also multiclass) classification, as in this post.
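For reference, a minimal sketch of the binary case, assuming the usual definition of the KS score as the maximum distance between the empirical score distributions of the two classes. The function name `ks_score` is hypothetical, and sample weights and the multiclass case are left out.

```python
import numpy as np


def ks_score(y_true, y_score):
    """Max absolute gap between the score CDFs of the positive and negative class."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)

    pos = np.sort(y_score[y_true])
    neg = np.sort(y_score[~y_true])
    thresholds = np.unique(y_score)

    # Empirical CDF of each class evaluated at every observed score.
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / pos.size
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / neg.size
    return float(np.max(np.abs(cdf_pos - cdf_neg)))
```

The two CDF arrays, plotted against `thresholds`, would give the corresponding KS curve.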
MAINT Use scikit-learn-contrib template
Once we have time to understand it, we should switch from our current template to scikit-learn-contrib's.
ENH Add `sample_weight` to `delinquency_curve`
Right now, `delinquency_curve` is missing an important parameter: `sample_weight`.
It should not be that hard to implement using logic similar to `cap_curve`'s:
datalib/datalib/metrics/_ranking.py
Lines 60 to 61 in fcff6c2
We can discuss this more if needed.
TST CAP curve needs tests for sample_weight
datalib/datalib/metrics/tests/test_ranking.py
Lines 155 to 158 in 9f175dd
Right now, this curve has no tests for sample_weight, so incorrect behaviour could go unnoticed.
I suggest using tests similar to the ones implemented in #22.
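The tests in #22 are not reproduced here. As one possible shape for such checks, this sketch tests two common invariances: unit weights should match no weights, and integer weights should match repeating the samples. The helper `check_sample_weight_invariance` and the toy metric are hypothetical. It assumes the metric returns a scalar, so a curve would first need to be reduced to a scalar summary such as its area.

```python
import numpy as np


def check_sample_weight_invariance(metric, y_true, y_score, weights):
    """Assert two basic sample_weight properties for a scalar metric."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    weights = np.asarray(weights)  # positive integers, used as repetition counts

    # 1. Weights of one must behave like sample_weight=None.
    unweighted = metric(y_true, y_score)
    unit = metric(y_true, y_score, sample_weight=np.ones(len(y_true)))
    np.testing.assert_allclose(unit, unweighted)

    # 2. Integer weights must behave like repeating each sample.
    weighted = metric(y_true, y_score, sample_weight=weights)
    repeated = metric(np.repeat(y_true, weights), np.repeat(y_score, weights))
    np.testing.assert_allclose(weighted, repeated)


def toy_weighted_error(y_true, y_score, sample_weight=None):
    # Hypothetical stand-in for a real metric: weighted mean absolute error.
    return np.average(np.abs(y_true - y_score), weights=sample_weight)
```

A metric that silently ignores `sample_weight` fails the second assertion, which is the kind of bug these tests are meant to catch.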
MAINT Create organization and rename project
Name should be scikit-credit? :)
MAINT flake8: do the E402 ignore rule for examples folder
flake8's E402 rule raises an error whenever an import appears after other code. For the time being, this forces us to adapt how the examples are written so that all imports sit at the top. Once E402 is ignored for the examples folder, we should undo that workaround there.
Example:
datalib/examples/delinquency/plot_delinquency_curve.py
Lines 1 to 18 in 9f96f2d
See scikit-learn:
https://github.com/scikit-learn/scikit-learn/blob/616db5c03259336a83cf4b45588699c1647e43b6/setup.cfg#L70-L74
DOC Use Sphinx to create documentation
DOC Create a contributing guide
We need to write a page in the Sphinx docs that walks users through the process of opening a pull request. We can link to other resources, but it is important to have this inside our own documentation as well.