pollution-select-feature-selection's Introduction

Background

Pollution Select is a feature selection method based on ideas from Boruta and other iterative selection methods. It finds features that consistently meet a desired performance criterion and are more important than random noise under Monte Carlo cross-validation.

Algorithm

  • As input, Pollution Select receives a model, a performance evaluation function and a performance threshold.
  • For each of n_iter iterations:
    • Generate k + 2 polluted features: select k random features and shuffle each one to decorrelate it from the target, then create two additional noise features by drawing from random distributions.
    • Train the model on a polluted training set with d + k + 2 features (the d original features plus the k + 2 polluted ones) and check that the desired performance threshold is met on the test set; otherwise, skip the iteration.
    • Compare the importance of each original feature to that of every polluted feature. Assign a feature a score of 1 for the iteration if its importance is greater than that of every polluted feature.
    • Update the overall importance of each feature as cumulative_score / n_iterations.
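
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the package's implementation: the function name pollution_select_sketch, the test_fraction and seed parameters, the choice of a normal and a uniform distribution for the two noise features, and the decision to count skipped iterations in the denominator are all hypothetical. The model is assumed to expose fit, predict and feature_importances_, as scikit-learn tree ensembles do.

```python
import numpy as np


def pollution_select_sketch(model, X, y, performance_function,
                            performance_threshold, n_iter=50, k=2,
                            test_fraction=0.25, seed=0):
    """Hypothetical sketch of the iteration described above (requires k <= d)."""
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    cumulative_score = np.zeros(d)

    for _ in range(n_iter):
        # Pollute: copy k random columns and shuffle each one independently,
        # which breaks any relationship with the target.
        cols = rng.choice(d, size=k, replace=False)
        shuffled = np.column_stack([rng.permutation(X[:, c]) for c in cols])
        # Two pure-noise columns (the distributions are an assumption).
        noise = np.column_stack(
            [rng.normal(size=n_samples), rng.uniform(size=n_samples)]
        )
        X_polluted = np.hstack([X, shuffled, noise])  # d + k + 2 columns

        # Monte Carlo cross-validation: a fresh random train/test split.
        order = rng.permutation(n_samples)
        n_test = max(1, int(test_fraction * n_samples))
        test_idx, train_idx = order[:n_test], order[n_test:]

        model.fit(X_polluted[train_idx], y[train_idx])
        preds = model.predict(X_polluted[test_idx])
        if performance_function(y[test_idx], preds) < performance_threshold:
            continue  # threshold not met: skip this iteration

        importances = np.asarray(model.feature_importances_)
        # Score 1 when an original feature beats every polluted feature.
        cumulative_score += importances[:d] > importances[d:].max()

    # Assumption: skipped iterations still count in the denominator.
    return cumulative_score / n_iter
```

Passing a RandomForestClassifier as the model and an accuracy function as performance_function should give per-feature scores between 0 and 1, where values near 1 mean the feature beat the polluted features in almost every iteration.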

Install

The simplest way to install right now is to clone this repo and then do a local install:

git clone https://github.com/ZainNasrullah/feature-selection-experiments.git
cd feature-selection-experiments
pip install .

Quick Start

Simple example without dropping any features:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from pollution_select import PollutionSelect

iris = load_iris()
X = iris.data
y = iris.target
# Surround the four iris features with one uniform-noise column on each side
X_noise = np.concatenate(
    (np.random.rand(150, 1), X, np.random.rand(150, 1)), axis=1
)

def acc(y, preds):
    return np.mean(y == preds)

selector = PollutionSelect(
    RandomForestClassifier(),
    performance_function=acc,
    performance_threshold=0.7,
)

X_transform = selector.fit_transform(X_noise, y)
print(selector.feature_importances_)

More complex example with feature dropping:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from pollution_select import PollutionSelect

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5
)

def acc(y, preds):
    return np.mean(y == preds)

selector = PollutionSelect(
    RandomForestClassifier(),
    n_iter=100,
    pollute_type="random_k",
    drop_features=True,
    performance_threshold=0.7,
    performance_function=acc,
    min_features=4,
)

selector.fit(X, y)

print(selector.retained_features_)
print(selector.dropped_features_)
print(selector.feature_importances_)

selector.plot_test_scores_by_iters()
selector.plot_test_scores_by_n_features()
