Git Product home page Git Product logo

kmeans-feature-importance's Introduction

kmeans-feature-importance

kmeans_interp is a wrapper around sklearn.cluster.KMeans which adds the property feature_importances_ that will act as a cluster-based feature weighting technique. Features are weighted using either of the two methods: wcss_min or unsup2sup.

Methodology

WCSS Minimizer

wcss_min method explanation

The method is a direct analysis of each centroid's sub-optimal position. K-Means aim is to minimize the Within-Cluster Sum of Squares and consequently the Between-Cluster Sum of Squares, and assuming that the distance metric used is euclidean. The euclidean distance from a cluster centroid C_j and point p_i:

distance_eq

And the WCSS for one cluster C_j that has p_m points is (Excuse the usage of i differently):

wcss_eq

Then we will try to find the feature d_i that was responsible for the highest amount of WCSS (The sum of squares of each data point distance to its cluster centroid) minimization through finding the maximum absolute centroid dimensional movement.

Unsupervised 2 Supervised

unsup2sup method explanation

Another interpretation approach is to convert the unsupervised classification problem into a supervised classification settings using an easily interpretable model such as tree-based models (We will be using a Random Forest Classifier). The steps to do this is as follows:

  1. Change the cluster labels into One-vs-All for each label
  2. Train a classifier to discriminate between each cluster and all other clusters
  3. Extract the feature importances from the model (We will be using sklearn.ensemble.RandomForestClassifier)

Usage

  1. Clone the repository git clone https://github.com/YousefGh/kmeans-feature-importance.git
  2. Move kmeans_interp folder to your project directory
  3. Follow the instructions below

You can instantiate KMeansInterp in the same way you sklearn.cluster.KMeans is instantiated, but you will need to provide the feature names to ordered_feature_names parameter, which should have the order of X features.

from kmeans_interp.kmeans_feature_imp import KMeansInterp

X = pd.DataFrame(...) # DataFrame is an example to deliver the idea of features order

kms = KMeansInterp(
	n_clusters=5,
	ordered_feature_names=X.columns.tolist(), 
	feature_importance_method='wcss_min', # or 'unsup2sup'
).fit(X.values)

# A dictionary where the key [0] is the cluster label, and [:10] will refer to the first 10 most important features
kms.feature_importances_[0][:10] # Features here are words
# [('film', 0.39589216529770005),
#  ('award', 0.1605575985825074),
#  ('actor', 0.12619074083837967),
#  ('oscar', 0.1178746877093894),
#  ('star', 0.1048044246433086),
#  ('actress', 0.0805780582173732),
#  ('movie', 0.07849181814402928),
#  ('director', 0.07750076034520005),
#  ('year', 0.05714139742209183),
#  ('won', 0.05598607819724065)]

The method was applied on a natural language processing (NLP) example which could be considered as an unsupervised cluster based keyword extraction technique

Example Output

I have chosen to apply the interpretation technique on an NLP problem since we can easily relate to the feature importances (words) which could be considered as a corpus-based keyword extraction technique where our aim is to cluster similar documents together using K-Means, and then apply the methods above. The dataset I have used can be found here Kaggle BBC-News. This dataset presents a classification problem but we will be using the categories as a final comparison

Category Cluster Distribution

Category Cluster Feature Comparison min_wcss

cat_distrib

politics tech business entertainment min_wcss feature importance

Sports min_wcss feature importance

Category Cluster Feature Comparison unsup2sup

cat_distrib

politics tech business entertainment min_wcss feature importance

Sports min_wcss feature importance

Requirements

scikit-learn~=0.24.2
numpy

References

  1. Y. Liu, Z. Li, H. Xiong, X. Gao and J. Wu, "Understanding of Internal Clustering Validation Measures," 2010 IEEE International Conference on Data Mining, 2010, pp. 911-916, doi: 10.1109/ICDM.2010.35.

  2. Kriegel, HP., Schubert, E. & Zimek, A. The (black) art of runtime evaluation: Are we comparing algorithms or implementations?. Knowl Inf Syst 52, 341–378 (2017). https://doi.org/10.1007/s10115-016-1004-2

  3. Ng, A., & Piech, C. (2021). CS221. Retrieved 18 July 2021, from https://stanford.edu/~cpiech/cs221/handouts/kmeans.html

  4. Ismaili, Oumaima & Lemaire, Vincent & Cornuéjols, Antoine. (2014). A Supervised Methodology to Measure the Variables Contribution to a Clustering. 159-166. 10.1007/978-3-319-12637-1_20.

kmeans-feature-importance's People

Contributors

yousefgh avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.