Oversampling for Imbalanced Learning based on K-Means and SMOTE

K-Means SMOTE is an oversampling method for class-imbalanced data. It aids classification by generating minority class samples in safe and crucial areas of the input space. The method avoids the generation of noise and effectively overcomes imbalances between and within classes.

This project is a Python implementation of k-means SMOTE. It is compatible with imbalanced-learn, a scikit-learn-contrib project.

Installation

Dependencies

The implementation is tested under Python 3.6 and works with the latest release of the imbalanced-learn framework:

  • imbalanced-learn (>=0.4.0, <0.5)
  • numpy (>=1.13, <1.16)
  • scikit-learn (>=0.19.0, <0.21)

Installation

PyPI

pip install kmeans-smote

From Source

Clone this repository and install it with pip, which runs setup.py. The following commands get a copy from GitHub and install all dependencies:

git clone https://github.com/felix-last/kmeans_smote.git
cd kmeans_smote
pip install .

Documentation

Find the API documentation at https://kmeans_smote.readthedocs.io. As this project follows the imbalanced-learn API, the imbalanced-learn documentation might also prove helpful.

Example Usage

import numpy as np
from imblearn.datasets import fetch_datasets
from kmeans_smote import KMeansSMOTE

datasets = fetch_datasets(filter_data=['oil'])
X, y = datasets['oil']['data'], datasets['oil']['target']

for label, count in zip(*np.unique(y, return_counts=True)):
    print('Class {} has {} instances'.format(label, count))

kmeans_smote = KMeansSMOTE(
    kmeans_args={
        'n_clusters': 100
    },
    smote_args={
        'k_neighbors': 10
    }
)
X_resampled, y_resampled = kmeans_smote.fit_sample(X, y)

for label, count in zip(*np.unique(y_resampled, return_counts=True)):
    print('Class {} has {} instances after oversampling'.format(label, count))

Expected Output:

Class -1 has 896 instances
Class 1 has 41 instances
Class -1 has 896 instances after oversampling
Class 1 has 896 instances after oversampling

Take a look at imbalanced-learn pipelines for efficient usage with cross-validation.
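
Pipelines matter for cross-validation because an oversampler should only ever see the training part of each split, while the held-out part keeps its original, imbalanced distribution. The NumPy-only sketch below illustrates that idea by hand; it does not use imbalanced-learn or this package. Both `duplicate_minority` (a stand-in oversampler that merely duplicates random minority rows, not k-means SMOTE) and `resampled_folds` are hypothetical names introduced for this illustration.

```python
import numpy as np


def duplicate_minority(X, y, seed=0):
    """Stand-in oversampler (random duplication), NOT k-means SMOTE."""
    rng = np.random.RandomState(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[counts.argmin()]
    n_missing = counts.max() - counts.min()
    # Draw existing minority rows (with replacement) until classes are balanced.
    picks = rng.choice(np.flatnonzero(y == minority), size=n_missing, replace=True)
    return np.vstack([X, X[picks]]), np.concatenate([y, y[picks]])


def resampled_folds(X, y, n_splits=3, seed=0):
    """Yield (X_train, y_train, X_test, y_test); only the training part is oversampled."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_train, y_train = duplicate_minority(X[train_idx], y[train_idx], seed)
        # The test fold is returned untouched, so scores reflect the real class balance.
        yield X_train, y_train, X[test_idx], y[test_idx]


# Toy data: 24 majority rows (class 0) and 6 minority rows (class 1).
X = np.arange(60, dtype=float).reshape(30, 2)
y = np.array([0] * 24 + [1] * 6)

for X_train, y_train, X_test, y_test in resampled_folds(X, y):
    print(np.unique(y_train, return_counts=True), len(y_test))
```

An imbalanced-learn pipeline takes care of this bookkeeping automatically: samplers in the pipeline are applied while fitting and skipped while predicting.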

About

K-means SMOTE works in three steps:

  1. Cluster the entire input space using k-means [1].
  2. Distribute the number of samples to generate across clusters:
    1. Filter out clusters which have a high number of majority class samples.
    2. Assign more synthetic samples to clusters where minority class samples are sparsely distributed.
  3. Oversample each filtered cluster using SMOTE [2].
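
The three steps above can be sketched in plain NumPy as follows. This is a simplified illustration, not the code of this package. The helper names (`simple_kmeans`, `sampling_weights`, `interpolate`, `kmeans_smote_sketch`) are hypothetical, and so are the details of step 2: the imbalance-ratio filter with a threshold of 1.0 and the sparsity measure (mean pairwise minority distance raised to the number of features, divided by the minority count) are assumptions modeled on the description in the paper cited below. The sketch handles binary problems only.

```python
import numpy as np


def simple_kmeans(X, n_clusters, n_iter=20, seed=0):
    """Step 1: plain Lloyd's k-means; returns one cluster label per row."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its points.
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


def sampling_weights(X, y, labels, minority_label, imbalance_threshold=1.0):
    """Step 2: filter clusters, then weight the survivors by minority sparsity."""
    n_features = X.shape[1]
    sparsity = {}
    for k in np.unique(labels):
        in_cluster = labels == k
        minority = X[in_cluster & (y == minority_label)]
        n_min = len(minority)
        n_maj = in_cluster.sum() - n_min
        # 2.1: skip clusters dominated by the majority class
        # (and clusters with fewer than two minority points to interpolate).
        if n_min < 2 or (n_maj + 1) / (n_min + 1) >= imbalance_threshold:
            continue
        # 2.2: sparser minority regions get a larger share of new samples.
        pairwise = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
        mean_dist = pairwise[np.triu_indices(n_min, k=1)].mean()
        sparsity[k] = mean_dist ** n_features / n_min
    total = sum(sparsity.values())
    return {k: s / total for k, s in sparsity.items()}


def interpolate(minority, n_new, rng, k_neighbors=2):
    """Step 3: SMOTE-style interpolation between a point and a near neighbor."""
    pairwise = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(pairwise, np.inf)  # a point is not its own neighbor
    k = min(k_neighbors, len(minority) - 1)
    neighbors = np.argsort(pairwise, axis=1)[:, :k]
    base = rng.randint(len(minority), size=n_new)
    picked = neighbors[base, rng.randint(k, size=n_new)]
    gap = rng.uniform(size=(n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])


def kmeans_smote_sketch(X, y, minority_label, n_clusters=2, seed=0):
    """Return synthetic minority points that balance a binary data set."""
    rng = np.random.RandomState(seed)
    labels = simple_kmeans(X, n_clusters, seed=seed)
    weights = sampling_weights(X, y, labels, minority_label)
    n_missing = int((y != minority_label).sum() - (y == minority_label).sum())
    new_points = [
        interpolate(X[(labels == k) & (y == minority_label)],
                    int(round(w * n_missing)), rng)
        for k, w in weights.items()
    ]
    return np.vstack(new_points) if new_points else np.empty((0, X.shape[1]))


# Toy data: a large majority blob and a small, well-separated minority blob.
data_rng = np.random.RandomState(42)
majority = data_rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2))
minority = data_rng.normal(loc=[5.0, 5.0], scale=0.5, size=(8, 2))
X = np.vstack([majority, minority])
y = np.array([0] * 40 + [1] * 8)

X_new = kmeans_smote_sketch(X, y, minority_label=1)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two minority points of the same filtered cluster, no samples are generated inside majority-dominated regions, which is how the method avoids creating noise.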

Contributing

Please feel free to submit an issue if things work differently than expected. Pull requests are also welcome - just make sure that tests are green by running pytest before submitting.

Citation

If you use k-means SMOTE in a scientific publication, we would appreciate citations to the following paper:

@article{kmeans_smote,
    title = {Oversampling for Imbalanced Learning Based on K-Means and SMOTE},
    author = {Last, Felix and Douzas, Georgios and Bacao, Fernando},
    year = {2017},
    archivePrefix = "arXiv",
    eprint = "1711.00837",
    primaryClass = "cs.LG"
}

References

[1] MacQueen, J. “Some Methods for Classification and Analysis of Multivariate Observations.” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.

[2] Chawla, Nitesh V., et al. “SMOTE: Synthetic Minority Over-Sampling Technique.” Journal of Artificial Intelligence Research, vol. 16, Jan. 2002, pp. 321-357, doi:10.1613/jair.953.
