Skippa

SciKIt-learn Pre-processing Pipeline in PAndas

Read more in the introductory blog post on Towards Data Science.

Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.

Skippa helps you create a pre-processing and modeling pipeline that is built on scikit-learn transformers but keeps the pandas DataFrame format throughout all pre-processing steps. This makes it much easier to define a series of transformation steps that refer to columns in the intermediate DataFrame.

So it is basically the same idea as sklearn-pandas, but with a different (and hopefully better) way of achieving it.

Installation

pip install skippa

Optional, if you want to use the gradio app functionality:

pip install "skippa[gradio]"

Basic usage

Import Skippa class and columns helper function

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

from skippa import Skippa, columns

Get some data

df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])

Define your pipeline:

pipe = (
    Skippa()
        .select(columns(['x', 'x2', 'y', 'z']))
        .cast(columns(['x', 'x2']), 'category')
        .impute(columns(dtype_include='number'), strategy='median')
        .impute(columns(dtype_include='category'), strategy='most_frequent')
        .scale(columns(dtype_include='number'), type='standard')
        .onehot(columns(['x', 'x2']))
        .model(LogisticRegression())
)

and use it for fitting / predicting like this:

pipe.fit(X=df, y=y)

predictions = pipe.predict_proba(df)
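
For comparison, here is a rough sketch of what a similar pipeline might look like in plain scikit-learn. It is not Skippa's implementation: the mapping of each Skippa step onto `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` is an assumption, and the names `numeric`, `categorical`, `preprocess`, and `sk_pipe` are made up for illustration. Note that here the columns have to be routed through a `ColumnTransformer` up front and the intermediate result is an array rather than a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy data as in the Skippa example above.
df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7],
})
y = np.array([0, 0, 1])

# Numeric columns: median imputation followed by standard scaling.
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: most-frequent imputation followed by one-hot encoding.
# handle_unknown='ignore' is a choice made for this sketch.
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Columns not listed here ('q' and 'date') are dropped by default,
# which plays the role of the .select() step.
preprocess = ColumnTransformer([
    ('num', numeric, ['y', 'z']),
    ('cat', categorical, ['x', 'x2']),
])

sk_pipe = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression()),
])

sk_pipe.fit(df, y)
proba = sk_pipe.predict_proba(df)  # one row per sample, one column per class
```

The column lists have to be hard-coded per branch in this version, whereas the Skippa pipeline selects columns by name or dtype at each step.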

If you want details on your model, use:

model = pipe.get_model()
print(model.coef_)
print(model.intercept_)

(De)serialization

You can also save and load your model pipelines (for deployment). N.B. dill is used for serialization and deserialization because joblib and pickle don't provide enough support.

pipe.save('./models/my_skippa_model_pipeline.dill')

...

my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
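
The README does not say which limitation of pickle motivated the choice of dill, so the following is only one plausible illustration. The standard-library pickle stores functions by reference to an importable name, which means it cannot serialize a lambda; dill is designed to handle a broader range of objects, including lambdas. The helper `can_pickle` is a hypothetical name introduced for this sketch and uses only the standard library.

```python
import pickle


def can_pickle(obj) -> bool:
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Plain data and importable functions are pickled without trouble.
plain_ok = can_pickle({'strategy': 'median'})
builtin_ok = can_pickle(len)

# A lambda has no importable name, so pickle cannot store a reference to it.
lambda_ok = can_pickle(lambda value: value * 2)

print(plain_ok, builtin_ok, lambda_ok)
```

A pipeline that holds user-supplied functions of this kind is the sort of object where plain pickle would fail.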

See the ./examples directory for more examples.

To Do

  • Support pandas assign for creating new columns based on existing columns
  • Support cast / astype transformer
  • Support for .apply transformer: wrapper around pandas.DataFrame.apply
  • Check how GridSearch (or other param search) works with Skippa
  • Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
  • Support PCA transformer
  • Facilitate random seed in Skippa object that is dispatched to all downstream operations
  • fit-transform does lazy evaluation, so casting to category and then selecting category columns doesn't work; each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe
  • Investigate if Skippa can directly extend sklearn's Pipeline, using the __getitem__ trick
  • Use sklearn's new dataframe output setting
  • Validation of pipeline steps
  • Input validation in transformers
  • Transformer for replacing values (pandas .replace)
  • Support arbitrary transformer (if column-preserving)
  • Eliminate the need to call columns explicitly
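
The lazy-evaluation item in the list above can be illustrated with plain pandas. This is a minimal sketch of the underlying problem, not of Skippa's internals: a dtype-based column selection gives different results depending on whether it is evaluated against the original dataframe or against the output of a preceding cast step. The variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [1, 16, 1000]})

# Selection evaluated against the original dataframe: 'x' has not been
# cast yet, so no column has the category dtype.
selected_on_original = df.select_dtypes(include='category').columns.tolist()

# The same selection evaluated against the output of the cast step.
df_cast = df.astype({'x': 'category'})
selected_on_intermediate = df_cast.select_dtypes(include='category').columns.tolist()

print(selected_on_original)      # []
print(selected_on_intermediate)  # ['x']
```

A step that selects columns by dtype therefore needs to see the state produced by the previous step, which is what the To Do item asks for.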

Credits
