datalib's Introduction

datalib

datalib's People

Contributors

dihanster, pibieta, jeancmaia, vitaliset

datalib's Issues

TST Use scikit-learn general tests

scikit-learn has some tests that run across multiple modules, checking docstring quality and other desirable properties, such as:

These are a few examples, but there may be others.

I'd like to run these same tests on our code. How should we do this? Is it possible to "import" tests from other libraries, and is that good practice? The other option is to copy the code verbatim. It looks like imb-learn copies it, though the copying might be automated (?).

Either way, we should implement the tests at some point.
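
One middle ground between importing and copying is a small package-wide test written in the same spirit. The sketch below is an assumption about what such a check could look like, not scikit-learn's own test: the helper name `public_functions_missing_docstrings` is hypothetical, and it only verifies that public functions have a docstring at all, using the standard library.

```python
import importlib
import inspect
import pkgutil


def public_functions_missing_docstrings(package_name):
    """Return sorted dotted names of public functions without a docstring."""
    package = importlib.import_module(package_name)
    modules = [package]
    # Packages expose __path__; plain modules do not, so fall back to [].
    paths = getattr(package, "__path__", [])
    for info in pkgutil.walk_packages(paths, prefix=package_name + "."):
        # Skip private modules such as pkg._utils or pkg.__main__.
        if any(part.startswith("_") for part in info.name.split(".")):
            continue
        modules.append(importlib.import_module(info.name))

    missing = set()
    for module in modules:
        for name, obj in inspect.getmembers(module, inspect.isfunction):
            # Only consider public functions defined in this very module.
            if name.startswith("_") or obj.__module__ != module.__name__:
                continue
            if not inspect.getdoc(obj):
                missing.add(f"{module.__name__}.{name}")
    return sorted(missing)
```

In a pytest suite this could be used as `assert public_functions_missing_docstrings("datalib") == []`, assuming the package is importable under that name.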

FEA Create Ranked Probability Score

Build a Ranked Probability Score (RPS) metric. It is very useful in the context of dataset shift with several timestamps: if a model that uses X to predict which timestamp a sample belongs to performs well, the distribution of X differs across timestamps, which lets us identify dataset shift.
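
As a starting point for discussion, here is a minimal NumPy sketch of the metric. It assumes classes are ordinal and encoded as integers 0..K-1, that the columns of `y_proba` follow that order, and that the score is normalised by K-1 (one common convention; others omit the normalisation). The function name and signature are hypothetical, not an existing datalib API.

```python
import numpy as np


def ranked_probability_score(y_true, y_proba):
    """Mean RPS over samples; 0 is a perfect forecast, 1 is the worst case."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba, dtype=float)
    n_samples, n_classes = y_proba.shape

    # One-hot encode the observed class of each sample.
    observed = np.zeros_like(y_proba)
    observed[np.arange(n_samples), y_true] = 1.0

    # RPS compares cumulative distributions, so errors on distant
    # classes are penalised more than errors on neighbouring ones.
    cum_diff = np.cumsum(y_proba, axis=1) - np.cumsum(observed, axis=1)
    per_sample = np.sum(cum_diff ** 2, axis=1) / (n_classes - 1)
    return float(np.mean(per_sample))
```

For the dataset-shift use case, `y_true` would be the timestamp index of each sample and `y_proba` the output of the classifier trained to predict it.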

FEA Discrete lift curve + scores implementation

What I'm calling "discrete_lift_curve": I simply sort by score, split the samples into groups, and look at how the average of y behaves in each score group. Ideally, groups with lower scores should also have a lower average y (both with predict_proba for classification and with predict for regression).

I can also derive some statistics from this curve, such as the mean difference between consecutive bins. These can be interpreted as scores.

I have this Python script that implements the metric, but it uses pandas. ChatGPT can probably help us convert the pandas code to NumPy.
Also, this implementation uses "mean" as the aggregation for the "groupby", but I think we should let users choose the aggregation they want (sometimes a quantile will be more valuable).

import numpy as np
import pandas as pd

from pandas.api.types import is_integer
from sklearn.utils.validation import _check_sample_weight


def weighted_qcut(values, weights, q, **kwargs):
    """Return weighted quantile cuts from a given series, values."""
    # https://stackoverflow.com/questions/45528029/python-how-to-create-weighted-quantiles-in-pandas
    if is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    # Cumulative weight of the samples, ordered by increasing value.
    order = pd.Series(weights).iloc[np.argsort(np.asarray(values))].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    # Restore the original sample order.
    return bins.sort_index()


def discrete_lift_curve(y_true, y_pred, bins=10, sample_weight=None):
    """Weighted mean of y_true inside each of `bins` score quantile groups."""
    y_true = np.asarray(y_true)
    sample_weight = _check_sample_weight(sample_weight, y_true)
    bin_ids = weighted_qcut(y_pred, sample_weight, bins, labels=False).to_numpy()
    means = np.array([
        np.average(y_true[bin_ids == q], weights=sample_weight[bin_ids == q])
        for q in range(bins)
    ])
    return np.arange(bins), means


def last_bin_of_discrete_lift_curve(y_true, y_pred, bins=10, sample_weight=None):
    _, values = discrete_lift_curve(y_true, y_pred, bins=bins, sample_weight=sample_weight)
    return values[-1]


def mean_diff_of_discrete_lift_curve(y_true, y_pred, bins=10, sample_weight=None):
    _, values = discrete_lift_curve(y_true, y_pred, bins=bins, sample_weight=sample_weight)
    # Mean difference between consecutive bins.
    return np.diff(values).mean()
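
To make the two open points concrete (dropping pandas and letting the user pick the aggregation), here is a rough NumPy-only sketch. The name `discrete_lift_curve_np` and the `agg` parameter are hypothetical; `agg` is assumed to be a callable taking the y values and the weights of one bin. The binning is meant to mirror the weighted quantile cut above but has not been checked against it on edge cases such as ties or empty bins.

```python
import numpy as np


def discrete_lift_curve_np(y_true, y_pred, bins=10, sample_weight=None, agg=None):
    """Aggregate y_true inside weighted quantile groups of y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones_like(y_true)
    else:
        sample_weight = np.asarray(sample_weight, dtype=float)
    if agg is None:
        # Default: weighted mean, as in the pandas version.
        def agg(y, w):
            return np.average(y, weights=w)

    # Sort by score and compute each sample's cumulative share of weight.
    order = np.argsort(y_pred, kind="stable")
    y_sorted, w_sorted = y_true[order], sample_weight[order]
    cum_share = np.cumsum(w_sorted) / np.sum(w_sorted)

    # Right-closed bins (edge[i], edge[i + 1]]; clip guards float round-off.
    edges = np.linspace(0, 1, bins + 1)
    bin_ids = np.clip(np.searchsorted(edges, cum_share, side="left") - 1, 0, bins - 1)

    values = np.array([
        agg(y_sorted[bin_ids == b], w_sorted[bin_ids == b]) for b in range(bins)
    ])
    return np.arange(bins), values
```

A per-bin median would then be `discrete_lift_curve_np(y_true, y_pred, agg=lambda y, w: np.quantile(y, 0.5))`, which ignores the weights inside each bin.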

Should this be in this library?

MAINT Make automatic validation for all public functions

In the past few months, scikit-learn started using a decorator to validate function parameters. Our latest features were written before that was introduced, so it is fine to leave them as they are for a while, but we should update them eventually.

Functions on the main branch that lack the decorator and should have it (because they contain some if-based validation logic):

  • metrics.delinquency_curve
  • metrics.cap_curve

Original issue from sklearn explaining how to use it:
scikit-learn/scikit-learn#24862
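
To illustrate the general idea only, here is a self-contained sketch of a type-checking decorator. It is not scikit-learn's implementation and does not use its API: `check_param_types` and `toy_curve` are hypothetical names, and the constraints are plain type tuples for simplicity.

```python
import functools
import inspect
from numbers import Integral, Real


def check_param_types(constraints):
    """Hypothetical decorator: validate named arguments before the call."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map positional and keyword arguments to parameter names.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, allowed in constraints.items():
                value = bound.arguments[name]
                if not isinstance(value, allowed):
                    raise TypeError(
                        f"The {name!r} parameter of {func.__name__} must be an "
                        f"instance of {allowed}, got {value!r} instead."
                    )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@check_param_types({"n_points": Integral, "threshold": (Real, type(None))})
def toy_curve(scores, n_points=10, threshold=None):
    # The body no longer needs its own if-based type checks.
    return n_points
```

With this in place, `toy_curve([0.1, 0.9], n_points="5")` raises a `TypeError` before the function body runs, which is the kind of if logic the decorator is meant to replace.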

MAINT flake8: do the E402 ignore rule for examples folder

Because of flake8 E402, we get an error if imports appear after other code. For the time being, this forces us to place the imports of each example before its documentation docstring. Once flake8 is configured to ignore E402 inside the examples folder, we should revert this workaround and move the imports back below the docstring.

Example:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from datalib import DeliquencyDisplay
"""
==================
Delinquency curves
==================
The delinquency analysis is key to understanding the default rate weighted
by approval rates. This example demonstrates how to leverage delinquency curves
for a binary classifier.
"""
X, y = make_classification(
    n_samples=100, n_features=5, n_informative=2, random_state=42
)

See scikit-learn:
https://github.com/scikit-learn/scikit-learn/blob/616db5c03259336a83cf4b45588699c1647e43b6/setup.cfg#L70-L74
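
A possible configuration, assuming the flake8 settings live in setup.cfg and the examples sit in a top-level examples/ folder (both assumptions about this repository), would use flake8's per-file-ignores option:

```ini
[flake8]
# Allow module-level imports below the docstring in gallery examples only.
per-file-ignores =
    examples/*: E402
```

The rest of the code base would still be checked for E402.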

DOC Create a contributing guide

We need to add a page to the Sphinx docs that walks users through the process of opening a pull request. We can link to external resources, but it is important to have this inside our own documentation as well.
