nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data

License: MIT License

Python 100.00%

python clustering-algorithm k-modes k-prototypes scikit-learn

kmodes's Introduction

kmodes

Description

Python implementations of the k-modes and k-prototypes clustering algorithms. Relies on numpy for a lot of the heavy lifting.

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.
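To make "number of matching categories" concrete, here is a minimal pure-Python sketch of a simple matching dissimilarity and of a per-column mode. The function names `matching_dissim` and `column_modes` and the toy records are illustrative assumptions, not necessarily how the library implements them.

```python
from collections import Counter


def matching_dissim(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each column (the cluster 'mode')."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]


# Toy cluster of three records with three categorical attributes each.
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = column_modes(cluster)
print(mode)
print(matching_dissim(cluster[1], mode))
```

A record is then assigned to the cluster whose mode it differs from on the fewest attributes.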

Implemented are:

k-modes [HUANG97] [HUANG98]
k-modes with initialization based on density [CAO09]
k-prototypes [HUANG97]

The code is modeled after the clustering algorithms in scikit-learn and has the same familiar interface.

I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.

Enjoy!

Installation

kmodes can be installed using `pip`:

pip install kmodes

To upgrade to the latest version (recommended), run it like this:

pip install --upgrade kmodes

kmodes can also conveniently be installed with conda from the conda-forge channel:

conda install -c conda-forge kmodes

Alternatively, you can build the latest development version from source:

git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install

Usage

import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)

The examples directory showcases simple use cases of both k-modes ('soybean.py') and k-prototypes ('stocks.py').

Parallel execution

The k-modes and k-prototypes implementations both offer support for multiprocessing via the joblib library, similar to e.g. scikit-learn's implementation of k-means, using the n_jobs parameter. It generally does not make sense to set more jobs than there are processor cores available on your system.

This potentially speeds up any execution with more than one initialization try, n_init > 1, which may be helpful to reduce the execution time for larger problems. Note that it depends on your problem whether multiprocessing actually helps, so be sure to try that out first. You can check out the examples for some benchmarks.

FAQ

Q: I'm seeing errors such as "TypeError: '<' not supported between instances of 'str' and 'float'" when using the kprototypes algorithm.

A: One or more of your numerical feature columns have string values in them. Make sure that all columns have consistent data types.

Q: How does k-prototypes know which of my features are numerical and which are categorical?

A: You tell it which column indices are categorical using the categorical argument. All others are assumed numerical. E.g., clusters = KPrototypes().fit_predict(X, categorical=[1, 2])
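As a rough illustration of what the categorical argument implies, the sketch below splits rows into a numerical and a categorical part by column index. The helper `split_columns` is hypothetical and is not the library's internal function; the sample rows are adapted from the stocks example quoted further down this page.

```python
def split_columns(rows, categorical):
    """Split each row into (numerical part, categorical part) by column index."""
    cat_idx = sorted(set(categorical))
    num_idx = [j for j in range(len(rows[0])) if j not in cat_idx]
    # Every column not listed as categorical must be convertible to float.
    numerical = [[float(row[j]) for j in num_idx] for row in rows]
    categoricals = [[row[j] for j in cat_idx] for row in rows]
    return numerical, categoricals


rows = [
    [738.5, "tech", "USA"],
    [369.5, "nrg", "USA"],
    [282.1, "tel", "CN"],
]
num, cat = split_columns(rows, categorical=[1, 2])
print(num)
print(cat)
```

If a column holding strings is left out of the categorical list, the `float()` conversion in this sketch raises a ValueError, which is one plausible reading of the "could not convert string to float" reports.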

Q: I'm getting the following error, what gives? "ModuleNotFoundError: No module named 'kmodes.kmodes'; 'kmodes' is not a package".

A: Make sure your working file is not called 'kmodes.py', because it might overrule the kmodes package.
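One way to check which file Python would actually load for a module name is the standard library's `importlib.util.find_spec`. The helper name `module_origin` is an assumption made for this sketch.

```python
import importlib.util


def module_origin(name):
    """Return the file Python would load for `name`, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# If this prints a path to your own script rather than to site-packages,
# a local file is shadowing the installed package.
print(module_origin("kmodes"))
```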

Q: I'm getting the following error: "ValueError: Clustering algorithm could not initialize. Consider assigning the initial clusters manually."

A: This is a feature, not a bug. kmodes is telling you that it can't make sense of the data you are presenting it. At least, not with the parameters you are setting the algorithm with. It is up to you, the data scientist, to figure out why. Some hints to possible solutions:

Run with fewer clusters as the data might not support a large number of clusters
Explore and visualize your data, checking for weird distributions, outliers, etc.
Clean and normalize the data
Increase the ratio of rows to columns

Q: I'm getting the following error: "ValueError: Input contains NaN, infinity, or a value too large for dtype('float64')."

A: Following scikit-learn, the k-modes algorithm does not accept np.NaN values in the X matrix. Users are advised to fill in the missing data in a way that makes sense for the problem at hand.
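One possible way to fill in missing categorical values is to replace them with an explicit placeholder category. The sketch below assumes that a placeholder is meaningful for your problem; `fill_missing` and the "missing" label are illustrative choices, not part of kmodes.

```python
import math


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(rows, placeholder="missing"):
    """Replace missing entries with an explicit placeholder category."""
    return [[placeholder if is_missing(v) else v for v in row] for row in rows]


rows = [["a", float("nan")], [None, "y"], ["a", "y"]]
filled = fill_missing(rows)
print(filled)
```

Whether a placeholder, the column mode, or dropping the rows is appropriate depends on how many values are missing and why.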

Q: How would you like your library to be cited?

A: Something along these lines would do nicely:

@Misc{devos2015,
  author = {Nelis J. de Vos},
  title = {kmodes categorical clustering library},
  howpublished = {\url{https://github.com/nicodv/kmodes}},
  year = {2015--2024}
}

References

CAO09: Cao, F., Liang, J, Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228., 2009.
HUANG97: Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.
HUANG98: Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.

kmodes's Issues

init : 'Huang'

If init value for method Huang is greater than 4 the following error is thrown:

 IndexError: index 0 is out of bounds for axis 0 with size 0

What happens when gamma is set to 0?

My understanding is that setting gamma to 0 is like running K-Means on the data because the categorical similarity will always be eliminated. However, the results are different from sklearn's K-Means. Am I missing something?
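For reference, the k-prototypes dissimilarity in [HUANG97] adds the squared Euclidean distance on the numerical attributes to gamma times the number of categorical mismatches. The sketch below is a plain-Python rendering of that formula with a hypothetical function name; it is not the library's code.

```python
def mixed_dissim(x_num, x_cat, c_num, c_cat, gamma):
    """Squared Euclidean distance on numerical parts plus gamma times mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, c_num))
    categorical = sum(a != b for a, b in zip(x_cat, c_cat))
    return numeric + gamma * categorical


point = ([1.0, 2.0], ["tech", "USA"])
centroid = ([0.0, 0.0], ["fin", "USA"])
print(mixed_dissim(*point, *centroid, gamma=0.5))  # 5.0 + 0.5 * 1 = 5.5
print(mixed_dissim(*point, *centroid, gamma=0.0))  # categorical term vanishes: 5.0
```

With gamma set to 0 the categorical term drops out of this distance, so any remaining difference from another k-means implementation would have to come from elsewhere, for example from initialization. That explanation is an assumption and is not confirmed in this thread.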

Very Slow with greater than 10k obs

It does not go ahead of this...

Init: initializing centroids
Init: initializing clusters
Starting iterations...

Even in gcs instance

Speed issues

Apologies for the vague issue, I don't have much time. I noticed that I got a speedup by converting my cluster centroids to integers. Granted, my categories are integers and not strings, but you may be able to use np.unique() to swap out strings for integers.

To test this try generating a large matrix of integer classes (for me this was 70,000 rows by 20 columns with each column having 4 classes i.e. numbers 1 to 4) and running the following:

_labels_cost(X, np.uint8(km.cluster_centroids_))

vs.

_labels_cost(X, km.cluster_centroids_)

pip Installation Errors

When I tried installing kmodes via pip, it would fail to install. I've attached the error log. To install it properly, I built it directly from source. I'm assuming the pypi version is not updated with the latest which is why it's failing to install?
error.txt

Release 0.7

Decision tree

Hello

Is it useful to have a decision tree generator to classify a new item after the model is trained?

Thank you

Using 'predict' method always returns the same cluster no matter the input

I am probably doing something wrong.

I have categorical data, in total ~1600 points. I split it into train and test sets, then cluster it. It works well. But when I try to assign the test data to clusters, it always returns 0.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

km = kmodes.KModes(n_clusters=50, init='Huang', num_runs=5, verbose=1)
clusters = km.fit_predict(X_train)

The above works fine and the results are great, however the below code returns all zeroes:

test_data_clusters = km.predict(X_test)

Am I misunderstanding the predict function and doing something wrong?

Custom dissimilarity measure

Is there a way to use a custom function for dissimilarity measure?

Converting categorical variables to int / float?

Hi, I am trying to apply your k-prototypes algorithm on my dataset. I have a list of categorical and numerical attributes. Following the soybean examples, I got the following error:
Traceback (most recent call last):

File "", line 2, in
kmodes_huang.cluster(cas_orc_join_nonus_cat, init_method='Huang')

File "/Users/212448740/Desktop/python_ml/kmodes.py", line 75, in cluster
self._perform_clustering(x, _args, verbose=0, *_kwargs)

File "/Users/212448740/Desktop/python_ml/kmodes.py", line 108, in _perform_clustering
self.init_centroids(x)

File "/Users/212448740/Desktop/python_ml/kmodes.py", line 190, in init_centroids
self.centroids[ik, iattr] = random.choice(choices)

ValueError: could not convert string to float

Does that mean I have to convert all the categorical attributes into 0, 1, 2, etc. for the function to work?

Sherry

Determining the optimal number of clusters

Hi, I've been using kmodes (https://www.rdocumentation.org/packages/klaR/versions/0.6-12/topics/kmodes) from klaR, an R package, to cluster my data set. I wanted to try using kmodes in Python to see if I get similar results. However, I don't see how I can determine the optimal number of clusters in the Python version of kmodes.

In the klaR package, I can use the $withindiff function to get the within-cluster simple-matching distance for each cluster. This allows me to calculate the sum of error for k = 2, 3, 4, etc. and select the optimal number of clusters based on the largest difference in the sum of error between consecutive values of k.

In the kmodes for python, how do you determine the optimal k?
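The heuristic described above (pick the k with the largest drop in total error) can be sketched as follows. The cost values are made up for illustration; in practice they could be collected by fitting one model per k and reading its `cost_` attribute, which appears in examples elsewhere on this page. `largest_drop_k` is a hypothetical helper.

```python
def largest_drop_k(costs_by_k):
    """Return the k whose cost improves most over the previous k."""
    ks = sorted(costs_by_k)
    drops = {k: costs_by_k[prev] - costs_by_k[k] for prev, k in zip(ks, ks[1:])}
    return max(drops, key=drops.get)


# Hypothetical total costs for k = 2..5 (not measured on any real data set).
costs = {2: 100.0, 3: 55.0, 4: 50.0, 5: 48.0}
best_k = largest_drop_k(costs)
print(best_k)
```

This is only one of several heuristics; because the total cost keeps decreasing as k grows, the result should be treated as a starting point rather than a definitive answer.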

Write better documentation, use Read the Docs

Add dissimilarity measure from Ng et al. (2007)

The dissimilarity measure supposedly improves convergence. Well-cited paper, too.

http://ieeexplore.ieee.org/abstract/document/4069266/

Better handling of NaNs

np.unique is used for encoding data values to integers. However, numpy currently treats every np.NaN as a unique value, creating many categories. (See: numpy/numpy#2111)

Solution is to check with np.isnan so that we can just ignore all NaNs when encoding. The encoder then simply assigns all NaNs to the -1 (i.e. "I don't know this value") category.
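The proposed encoding can be sketched in plain Python as below: distinct non-NaN values receive integer codes in sorted order, and every NaN is mapped to -1. The function name `encode_column` is an assumption, and the sketch uses `math.isnan` in place of np.isnan to stay dependency-free.

```python
import math


def encode_column(values):
    """Map each distinct non-NaN value to an integer code; NaN becomes -1."""
    def is_nan(v):
        return isinstance(v, float) and math.isnan(v)

    known = sorted({v for v in values if not is_nan(v)})
    codes = {v: i for i, v in enumerate(known)}
    return [-1 if is_nan(v) else codes[v] for v in values], known


encoded, categories = encode_column(["b", float("nan"), "a", "b", float("nan")])
print(encoded)
print(categories)
```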

Support feature weights

Need to add a parameter or input vector specifying the weight of each categorical variable, rather than treating them all the same in the dissimilarity calculation as it does now.
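A weighted variant of the simple matching dissimilarity could look like the sketch below. It is an illustration of the request, not an existing kmodes feature; the name `weighted_matching_dissim` and the weights are hypothetical.

```python
def weighted_matching_dissim(a, b, weights):
    """Sum the weights of the attributes on which two records differ."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("records and weights must have the same length")
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


a = ("tech", "USA", "large")
b = ("tech", "CN", "small")
print(weighted_matching_dissim(a, b, weights=[1.0, 2.0, 0.5]))  # 2.0 + 0.5
print(weighted_matching_dissim(a, b, weights=[1.0, 1.0, 1.0]))  # unweighted count
```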

kmodes dependency SciPy installation fails on Windows 7

Hello,

I am trying to install kmodes on my Windows 7 machine. I tried using pip install kmodes and the installation fails when building wheels for scikit-learn and scipy. The error output is huge and below is the last line

----------------------------------------
Rolling back uninstall of scikit-learn
Command "C:\Users\kkhitalishvili\AppData\Local\Continuum\Anaconda3\python.exe -u -c "import 
setuptools, tokenize;__file__='C:\\Users\\KKHITA~1\\AppData\\Local\\Temp\\pip-build-6i8wjh5v\\scikit-
learn\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', 
'\n');f.close();exec(compile(code, __file__, 'exec'))" install --record 
C:\Users\KKHITA~1\AppData\Local\Temp\pip-wrgaskfk-record\install-record.txt --single-version-
externally-managed --compile" failed with error code 1 in C:\Users\KKHITA~1\AppData\Local\Temp\pip-
build-6i8wjh5v\scikit-learn\

I also tried building from source and got the error below. It seems it has to do with the lapack and blas installations. I do not have either. Are they necessary dependencies to install kmodes?

$ python setup.py install
C:\Users\kkhitalishvili\AppData\Local\Continuum\Anaconda3\lib\distutils\dist.py:261: UserWarning: Unknown distribution option: 'summary'
  warnings.warn(msg)
running install
running bdist_egg
running egg_info
creating kmodes.egg-info
writing kmodes.egg-info\PKG-INFO
writing dependency_links to kmodes.egg-info\dependency_links.txt
writing requirements to kmodes.egg-info\requires.txt
writing top-level names to kmodes.egg-info\top_level.txt
writing manifest file 'kmodes.egg-info\SOURCES.txt'
reading manifest file 'kmodes.egg-info\SOURCES.txt'
writing manifest file 'kmodes.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
creating build
creating build\lib
creating build\lib\kmodes
copying kmodes\kmodes.py -> build\lib\kmodes
copying kmodes\kprototypes.py -> build\lib\kmodes
copying kmodes\__init__.py -> build\lib\kmodes
creating build\lib\kmodes\util
copying kmodes\util\dissim.py -> build\lib\kmodes\util
copying kmodes\util\__init__.py -> build\lib\kmodes\util
creating build\bdist.win-amd64
creating build\bdist.win-amd64\egg
creating build\bdist.win-amd64\egg\kmodes
copying build\lib\kmodes\kmodes.py -> build\bdist.win-amd64\egg\kmodes
copying build\lib\kmodes\kprototypes.py -> build\bdist.win-amd64\egg\kmodes
creating build\bdist.win-amd64\egg\kmodes\util
copying build\lib\kmodes\util\dissim.py -> build\bdist.win-amd64\egg\kmodes\util
copying build\lib\kmodes\util\__init__.py -> build\bdist.win-amd64\egg\kmodes\util
copying build\lib\kmodes\__init__.py -> build\bdist.win-amd64\egg\kmodes
byte-compiling build\bdist.win-amd64\egg\kmodes\kmodes.py to kmodes.cpython-36.pyc
byte-compiling build\bdist.win-amd64\egg\kmodes\kprototypes.py to kprototypes.cpython-36.pyc
byte-compiling build\bdist.win-amd64\egg\kmodes\util\dissim.py to dissim.cpython-36.pyc
byte-compiling build\bdist.win-amd64\egg\kmodes\util\__init__.py to __init__.cpython-36.pyc
byte-compiling build\bdist.win-amd64\egg\kmodes\__init__.py to __init__.cpython-36.pyc
creating build\bdist.win-amd64\egg\EGG-INFO
copying kmodes.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying kmodes.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying kmodes.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying kmodes.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying kmodes.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating dist
creating 'dist\kmodes-0.7-py3.6.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing kmodes-0.7-py3.6.egg
Copying kmodes-0.7-py3.6.egg to c:\users\kkhitalishvili\appdata\local\continuum\anaconda3\lib\site-packages
Adding kmodes 0.7 to easy-install.pth file

Installed c:\users\kkhitalishvili\appdata\local\continuum\anaconda3\lib\site-packages\kmodes-0.7-py3.6.egg
Processing dependencies for kmodes==0.7
Searching for scipy==0.19.0
Reading https://pypi.python.org/simple/scipy/
Downloading https://pypi.python.org/packages/e5/93/9a8290e7eb5d4f7cb53b9a7ad7b92b9827ecceaddfd04c2a83f195d8767d/scipy-0.19.0.zip#md5=91b8396231eec780222a57703d3ec550
Best match: scipy 0.19.0
Processing scipy-0.19.0.zip
Writing C:\Users\KKHITA~1\AppData\Local\Temp\easy_install-6lv_v07_\scipy-0.19.0\setup.cfg
Running scipy-0.19.0\setup.py -q bdist_egg --dist-dir C:\Users\KKHITA~1\AppData\Local\Temp\easy_install-6lv_v07_\scipy-0.19.0\egg-dist-tmp-ifbntko2
C:\Users\KKHITA~1\AppData\Local\Temp\easy_install-6lv_v07_\scipy-0.19.0\setup.py:323: UserWarning: Unrecognized setuptools command, proceeding with generating Cython sources and expanding templates
  warnings.warn("Unrecognized setuptools command, proceeding with "
C:\Users\kkhitalishvili\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\distutils\system_info.py:572: UserWarning:
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  self.calc_info()
C:\Users\kkhitalishvili\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\distutils\system_info.py:572: UserWarning:
    Lapack (http://www.netlib.org/lapack/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [lapack]) or by setting
    the LAPACK environment variable.
  self.calc_info()
C:\Users\kkhitalishvili\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\distutils\system_info.py:572: UserWarning:
    Lapack (http://www.netlib.org/lapack/) sources not found.
    Directories to search for the sources can be specified in the
    numpy/distutils/site.cfg file (section [lapack_src]) or by setting
    the LAPACK_SRC environment variable.
  self.calc_info()
Running from scipy source directory.
non-existing path in 'scipy\\integrate': 'quadpack.h'
error: no lapack/blas resources found

After building from source I am able to import kmodes module, but it does not have any of the functions meaning kmodes.KModes(...) fails.

Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import kmodes
>>> kmodes.KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'kmodes' has no attribute 'KModes'
>>>

Running kprototype with all numerical features?

Passing an empty list, i.e. [], as the categorical argument raises an error. Any workarounds?

Tests failing for Python 3.6

Seeing some weird stuff in the tests for Python 3.6 for:

test_kmodes_cao_soybean
test_kmodes_huang_soybean
test_kmodes_init_soybean

test_kprotoypes_huang_stocks also fails, but that is because clusters numbers are shuffled differently. Make this test more robust.

KPrototypes fit_predict error: "could not convert string to float"

I want to apply KPrototypes to my dataset, but it seems that the code can't convert it into numpy arrays?
km = kprototypes.KPrototypes(n_clusters=10, init='Cao', verbose=2)
train=pd.read_csv('/home/lemma/train.csv')
train['clusters_KModes'] = km.fit_predict(train1,categorical=[1])
ValueError: could not convert string to float: MJ

Converting to object dtype to match the given example was not successful either:
km = kprototypes.KPrototypes(n_clusters=10, init='Cao', verbose=2)
train=pd.read_csv('/home/lemma/train.csv')
train1=train1.values.astype(object)
train['clusters_KModes'] = km.fit_predict(train1,categorical=[1])
ValueError: could not convert string to float: MJ

Keep getting an Error - please help

Hi,
I'm trying to write a program to cluster some user click data. I keep getting this error and I can't seem to figure out how to fix it. I have not done any ML in the past, so I don't fully understand what effect each variable here has. Please help.

This is the script I used
`import numpy as np
from kmodes import kprototypes

syms = np.genfromtxt('mlog1day_top100.csv', dtype=str, delimiter=',')[:, 0]
X = np.genfromtxt('mlog1day_top100.csv', dtype=object, delimiter=',')[:, 1:3]

X[:, 0] = X[:, 0].astype(float)
X[:, 1] = X[:, 1].astype(float)
X[:, 2] = X[:, 2].astype(str)
X[:, 3] = X[:, 3].astype(str)

kproto = kprototypes.KPrototypes(n_clusters=4, init='Cao', verbose=2)
clusters = kproto.fit_predict(X, categorical=[1, 2])

print(kproto.cluster_centroids_)

print(kproto.cost_)
print(kproto.n_iter_)

for s, c in zip(syms, clusters):
    print("Symbol: {}, cluster:{}".format(s, c))`

This is what the input data looks like:
5 comma-separated columns
`
64870551, 3223,5, /offerId/759152/uSource/MGS,GoShop

80729030,889,3, /offerId/729119/categoryId/0/usource/STROFDEF,OfferDetail

66257932,129,5, /offerId/759120/categoryId/424/subCategoryId/0/uSource/CPTS,OfferDetail

85135356,3014,1, /uSource/hnmnu/product/vacations,Travel home page

74661688,396,1, /offerid/358634/uSource/PTMTILE,OfferDetail

84572627,3898,2, /product/hotels,Travel home page

75163088,138,2, /offerid/759057,OfferDetail
`

This is the error that I keep getting:

Initialization method and algorithm are deterministic. Setting n_init to 1.
Traceback (most recent call last):
  File "stocks.py", line 18, in <module>
    clusters = kproto.fit_predict(X, categorical=[1, 2])
  File "/Users/rweliwatta/anaconda/lib/python2.7/site-packages/kmodes/kmodes.py", line 345, in fit_predict
    return self.fit(X, **kwargs).labels_
  File "/Users/rweliwatta/anaconda/lib/python2.7/site-packages/kmodes/kprototypes.py", line 353, in fit
    self.verbose)
  File "/Users/rweliwatta/anaconda/lib/python2.7/site-packages/kmodes/kprototypes.py", line 134, in k_prototypes
    "All columns are categorical, use k-modes instead of k-prototypes."
AssertionError: All columns are categorical, use k-modes instead of k-prototypes.

Any help is really appreciated.

Thank you,
Roshen

TypeError: unhashable type

When I run the code below:

clustermodel=kprototypes.KPrototypes(n_clusters=6, init='Cao', verbose=2)
clustermodel.fit(sd1,categorical=[0,1])

it fails with "TypeError: unhashable type", but KModes works fine. Could you please tell me how I can solve this problem? Thank you very much!

Centroids output format

Based on the results of running kprototypes on the stocks.csv file included in the examples, I have concluded that kprototypes.cluster_centroids_ represents the centroids in the following format:
[array([cluster 0 centroid coordinates in numerical space],
[cluster 1 centroid coordinates in numerical space], ...),
array([cluster 0 centroid coordinates in categorical space],
[cluster 1 centroid coordinates in categorical space], ...)]
where the i-th cluster centroid coordinates in either numerical or categorical space is of the form
[x_i,0, x_i,1, ...]
where
x_i,0 is the coordinate for the first (i.e. left-most) column of (categorical or numerical) data
x_i,1 is the coordinate for the second (i.e. second left-most) column of (categorical or numerical) data
...
and where the j-th cluster centroid coordinate values in categorical space are elements of the set
{0, 1, 2, ...}
where
a value of 0 represents the category value whose name is first in alphabetical order
a value of 1 represents the category value whose name is second in alphabetical order
...
i.e. the numerical values represent the mode (i.e. most frequently occurring) categorical value for the cluster, and where the numerical values shown are chosen by putting the category names in alphabetical order and representing the first name by 0, the second name by 1, etc.

Can you please tell me if my conclusions are correct? If there is documentation that describes all this, I apologize for this long-winded question, and I would greatly appreciate a pointer to that documentation.

Thank you very much.

Online clustering

Hi,

thanks a lot for the work here!
Do you plan on adding support for online training? Similar to scikit's partial_fit()?

Best,
Olana

Performance considerations

I am genuinely curious why the k-prototypes implementation takes 2 hours to complete for a certain dataset, whereas clustering using sklearn's k-means only takes about a minute. I've scrolled through Huang's paper, which states that k-prototypes is an extension of k-means to function properly in the presence of categorical data.

Is sklearn's k-means so well-optimized, is the computational complexity of Huang's algorithm higher than he lets on, or is there something else at play? I'd love a discussion on this.

AttributeError on import

Hi,
I have no trouble importing kmodes. However, when I run the example from the README:

import numpy as np
from kmodes import kmodes

# random categorical data
data = np.random.choice(20, (100, 10))

km = kmodes.KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

I get the following error:
AttributeError: module 'kmodes.kmodes' has no attribute 'KModes'

Do you know why this is happening? I have cloned the most recent master branch and am building from this.

Cluster centroid

Hello
After fitting the data, is there any way to get the values of the cluster centroids?
Thank you

What do centroids mean and how are they computed?

Here is my code:

import numpy as np
from kmodes import kmodes

x = np.genfromtxt('custom.csv', dtype=str, delimiter=',')

kmodes_huang = kmodes.KModes(n_clusters=2, init='Huang', verbose=1)
kmodes_huang.fit(x)

# Print cluster centroids of the trained model.
print(kmodes_huang.cluster_centroids_)
# Print training statistics
print('Final training cost: {}'.format(kmodes_huang.cost_))
print('Training iterations: {}'.format(kmodes_huang.n_iter_))

Here is output:


Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 0, cost: 10.0
[[2 0]
 [6 2]]
Final training cost: 10.0
Training iterations: 1

Here is file with data (I had to change extension to .txt)
custom.txt

How should I interpret centroids? Would it be better to output medoids instead? Are you implicitly embedding everything in R^n and using the Euclidean metric to find centroids?

K-prototypes can get stuck in initialization

K-prototypes can get stuck in initialization if it never reaches the point where there are no more empty clusters.

Set a maximum to how long it can initialize and warn the user of the problem.

Wrong variable assignment in kprototypes.py

In line 118, shouldn't "cl_attr_freq, membship = move_point_num"
be
"cl_attr_sum, membship = move_point_num" as it did in line 92?

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

Question about the k-Prototype

Dear Nico
Thank you very much for this Python code!
I have a question about the k-prototypes algorithm implemented in this work.
I would like to know whether it is possible to assign weights to each family of variables (for example 0.7 for numerical data and 0.3 for categorical data) in order to give more importance to one family than to the other.
Thank you for any response.

Don't freeze version of packages

Please, don't freeze versions of packages

'numpy==1.12.1',
'scikit-learn==0.18.1',
'scipy==0.19.0'

Categorical Data

Hi Nico,

thank you very much for your work! I really appreciate your implementation and would like to use it for determining prototypes. Unfortunately I found it hard to follow the parameters of KPrototypes. For a future version I would wish for a documentation explaining what KMeans offers.

As a practical question: Is it possible to cluster on categorical data? Given the stocks example, how could I cluster the data based on the country and print out the chosen centroids?

Modifying the code to the following:

syms = np.genfromtxt('stocks.csv', dtype=str, delimiter=',')[:, -1]    # changed index
X = np.genfromtxt('stocks.csv', dtype=object, delimiter=',')[:, :-1]   # changed index
kproto = kprototypes.KPrototypes(n_clusters=6, init='Huang', verbose=2)
clusters = kproto.fit_predict(X, categorical=[1, 2])

builds correctly syms and X:

syms = ['USA' 'USA' 'USA' 'USA' 'USA' 'USA' 'CN' 'USA' 'USA' 'USA' 'USA' 'NL']
X = [['AAPL' '738.5' 'tech']
 ['XOM' '369.5' 'nrg']
 ['GOOGL' '368.2' 'tech']
 ['MSFT' '346.7' 'tech']
 ['BRK-A' '343.5' 'fin']
 ['WFC' '282.4' 'fin']
 ['CHL' '282.1' 'tel']
 ['JNJ' '279.7' 'cons']
 ['WMT' '257.2' 'cons']
 ['VZ' '205.2' 'tel']
 ['ORCL' '192.1' 'tech']
 ['RDS-A' '195.7' 'nrg']]

but throws this error:

Traceback (most recent call last):
  File "stocks2.py", line 17, in <module>
    clusters = kproto.fit_predict(X, categorical=[1, 2])
  File "build/bdist.linux-x86_64/egg/kmodes/kmodes.py", line 325, in fit_predict
  File "build/bdist.linux-x86_64/egg/kmodes/kprototypes.py", line 339, in fit
  File "build/bdist.linux-x86_64/egg/kmodes/kprototypes.py", line 149, in k_prototypes
  File "build/bdist.linux-x86_64/egg/kmodes/kprototypes.py", line 42, in _split_num_cat
ValueError: could not convert string to float: RDS-A

Is there a parameter to make KPrototypes work on categorical data instead of float data?

Release 0.8

Centroids = unique values in data, if the latter is smaller than n_clusters

Scan the data for the number of unique values, and if it is smaller than or equal to n_clusters simply set the unique values as the centroids.

Scikit-learn's KMeans does this too.
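The proposed shortcut can be sketched as follows: collect the distinct rows, and if there are no more of them than n_clusters, use them directly as centroids. `unique_rows_as_centroids` is a hypothetical helper written for this illustration, not code from the library.

```python
def unique_rows_as_centroids(rows, n_clusters):
    """Return the distinct rows if there are at most n_clusters of them, else None."""
    distinct = []
    seen = set()
    for row in rows:
        key = tuple(row)
        if key not in seen:  # keep first occurrence, preserve input order
            seen.add(key)
            distinct.append(list(row))
    return distinct if len(distinct) <= n_clusters else None


rows = [["a", "x"], ["b", "y"], ["a", "x"]]
print(unique_rows_as_centroids(rows, n_clusters=2))
print(unique_rows_as_centroids(rows, n_clusters=1))
```

Returning None here signals that the normal initialization should run; how the library would handle fewer distinct rows than n_clusters is left open in this sketch.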

Add util with cluster performance metrics

See Bai et al., 2011 for details:

Category utility function
Adjusted rand index (WON'T DO: sklearn.metrics already implements this)
Set matching error (WON'T DO: seems obscure)

Many missing values will distort clustering

Since NaNs are currently encoded as their own category (-1), they could potentially distort the clustering if there are too many of them. Is it worth it to have the dissimilarity function ignore -1 values?

Note: this could slow down the algorithm significantly, so make sure to check for any -1 values first.
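A dissimilarity that skips the -1 category might look like the sketch below. The function name and the convention that a position is skipped when either side is -1 are assumptions made for illustration.

```python
def matching_dissim_ignore_missing(a, b, missing=-1):
    """Count mismatches, skipping positions where either value is `missing`."""
    return sum(
        x != y
        for x, y in zip(a, b)
        if x != missing and y != missing
    )


record = [0, -1, 2, 1]
centroid = [0, 3, 1, -1]
# Cheap pre-check, as suggested in the note above: only take the slower
# path when the data actually contains the missing-value code.
has_missing = any(v == -1 for v in record + centroid)
print(has_missing)
print(matching_dissim_ignore_missing(record, centroid))
```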

Not an issue. So much as a doubt.

When I pass some data to k-means and set the number of clusters to 6 (knowing that the best fit is 3 clusters), k-means actually clusters it into only 3 clusters, overriding my number of clusters.

Do you know how and why this happens?

And also, why doesn't k-prototypes, though based on k-means, work the same way?

PackageNotFoundError: Package not found: Conda could not find '

Hello,
I am able to install other packages using conda install xx but not with kmodes. Is it because this package is not available to be installed via anaconda?
Thanks!

Jaccard dissimilarity measure

Are we adding a Jaccard dissimilarity measure?
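For reference, the Jaccard dissimilarity between two sets is one minus the size of their intersection divided by the size of their union. The sketch below is a generic stdlib implementation with a hypothetical name; how such a measure would plug into kmodes is not specified here.

```python
def jaccard_dissim(a, b):
    """1 - |A & B| / |A | B| for two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(set_a & set_b) / len(union)


print(jaccard_dissim({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared of 4 total
```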

Mixed datatypes

Is there any way to separate categorical from quantitative features?

thank you

Clustering measurement/ clustering error

I have both categorical and numerical data to cluster and found that k-prototypes fits my case. I am able to fit and predict on my data, but I cannot find a good way to identify an "optimal"* number of clusters, because I am unable to extract the "cost" for every centroid / data point.

Looking into the "k_prototypes" function, I found that it returns only the "best" centroids, not all centroids. So I tried training the model, predicting on the data, and finding the cost for every single data point. However, the "predict" function returns only the labels, not both labels and cost. Does the "cost" returned by "_labels_cost" help in this case?

After studying the source code, would it be good to:

return "all_centroids" and "all_costs" from the "k_prototypes" function, so that we can get the cost per centroid?
return "cost" from the "predict" function, so that we can measure the distance for prediction / classification?
return the average cost rather than the total cost? When trying to find an optimal number of clusters, the cost necessarily gets smaller as the number of clusters grows, so the total cost alone does not indicate whether a given cost is good or not.

*In my current situation, the lowest cost is the optimal result.

Kprototypes clustering algorithm is not working.

After your latest commit, the algorithm is not working. It does not even work for the example. Please check it.

Untitled.html.zip

Keep getting NotImplementedError

Hey there,

First time user here!

I have been trying to run standard KModes on a numpy array which I have exported from a Pandas dataframe.

Here's a sample extract from the data which fails:

dt[:][0:10]

array([[ 60,   1,   0,   1,  65,   0,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   1],
       [ 60,   1,   0,   1,  68,   0,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   1],
       [ 60,   1,   0,   1,  68,   0,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   1],
       [ 60,   1,   0,   1,  65,   0,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   1],
       [ 60,   1,   0,   1, 123,   0,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   1],
       [243,   1,   1,   2,  68, 143,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   0],
       [243,   1,   1,   2, 118, 143,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   0],
       [243,   1,   1,   2,   8, 143,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   0],
       [243,   1,   1,   2, 206, 143,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   0],
       [243,   1,   1,   2,  65, 143,   0,   0,   0,   1,   0,   1,   0,
          0,   0,   2,   1,   0]], dtype=uint8)

It seems that if I reduce the number of samples to <= 6 then it runs (but obviously it's kind of useless).

km = kmodes.KModes(n_clusters=4, init='huang', n_init=100, verbose=1)

dt_sample = dt[:][0:6]

clusters = km.fit_predict(dt_sample)

gives:

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/0, moves: 0, cost: 0.0

But,

km = kmodes.KModes(n_clusters=4, init='huang', n_init=5, verbose=1)

dt_sample = dt[:][0:7]

clusters = km.fit_predict(dt_sample)

or with more rows of training data, gives the following error:

Init: initializing centroids
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-136-65e122904892> in <module>()
      3 dt_sample = dt[:][0:7]
      4 
----> 5 clusters = km.fit_predict(dt_sample)

/home/qzhao/anaconda/anaconda2/envs/py27/lib/python2.7/site-packages/kmodes/kmodes.pyc in fit_predict(self, X, y, **kwargs)
    372         predict(X).
    373         """
--> 374         return self.fit(X, **kwargs).labels_
    375 
    376     def predict(self, X, **kwargs):

/home/qzhao/anaconda/anaconda2/envs/py27/lib/python2.7/site-packages/kmodes/kmodes.pyc in fit(self, X, y, **kwargs)
    363                                                self.init,
    364                                                self.n_init,
--> 365                                                self.verbose)
    366         return self
    367 

/home/qzhao/anaconda/anaconda2/envs/py27/lib/python2.7/site-packages/kmodes/kmodes.pyc in k_modes(X, n_clusters, max_iter, dissim, init, n_init, verbose)
    217             centroids = np.asarray(init, dtype=np.uint8)
    218         else:
--> 219             raise NotImplementedError
    220 
    221         if verbose:

NotImplementedError:

Any ideas what's going on?

kmodes package dependencies

Is it possible to install kmodes with the following latest package versions?
numpy 1.13.1
scipy 0.19.1
scikit-learn 0.19.0

Add scatter plot of results to examples

Can you suggest an easy way to plot the input data and the fitted clusters?

Thank you

kprototype gets stuck in loops.

This is the data I am passing

Below is my cluster function.

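import numpy as np
from kmodes import kprototypes

def kp_cluster(df, cat_list, n_c):
    # Convert the DataFrame to an array and cluster it with k-prototypes
    X = np.array(df)
    kproto = kprototypes.KPrototypes(n_clusters=n_c, init='Cao', verbose=1, n_init=10)
    clusters = kproto.fit_predict(X, categorical=cat_list)
    return clusters, kproto.cost_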

Here is the cluster call -

labels, cost = kp_cluster(data, cat_list, n)

where cat_list = [1] and n = 6.

It goes on printing

Init: initializing centroids
Init: initializing clusters
Init: initializing centroids
Init: initializing clusters
Init: initializing centroids
Init: initializing clusters
Init: initializing centroids
Init: initializing clusters
Init: initializing centroids
Init: initializing clusters

over and over until it crashes my IPython notebook.

Can't build package: AttributeError: module 'kmodes' has no attribute 'version'

Hello,
I also tried installing the package with the setup script: python setup.py install. However, I get this error message:
Traceback (most recent call last):
  File "setup.py", line 6, in <module>
    VERSION = kmodes.version
AttributeError: module 'kmodes' has no attribute 'version'

'random_state' parameter for Kmodes

KModes returns different results each time for the same data set. I don't see a 'random_state' parameter to resolve this. Is there any way to make KModes give consistent results?
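
Without a `random_state` parameter, one thing to try is seeding NumPy's global random generator before each fit. Whether that makes KModes reproducible depends on the library drawing its random numbers from `np.random`, which is an assumption here and not confirmed on this page. The sketch below only demonstrates the seeding behaviour itself, with a hypothetical helper `pick_initial_rows` standing in for a random initialization.

```python
import numpy as np


def pick_initial_rows(n_points, n_clusters):
    """Hypothetical stand-in for a random initialization: pick distinct row indices."""
    return np.random.choice(n_points, n_clusters, replace=False)


np.random.seed(42)
first = pick_initial_rows(100, 4)

np.random.seed(42)
second = pick_initial_rows(100, 4)

# Re-seeding with the same value reproduces the same draw.
same = bool(np.array_equal(first, second))
print(same)
```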

Manual centroid initialization for k prototypes clustering

If the dataset has only one numerical variable, manual initialization of centroids never works, because the condition checks always fail. In this case, if init is the initialization vector, init[0] is a 1D vector, so shape[1] raises an error.
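
The shape problem can be reproduced with NumPy alone. The sketch below uses made-up centroid values and the hypothetical names `num_init_1d` and `num_init_2d`. It shows why `shape[1]` fails on a 1D vector and how reshaping to one column gives the (n_clusters, 1) layout. Whether passing the reshaped array gets past the checks in kmodes is not verified here.

```python
import numpy as np

# Hypothetical initial centroids for 3 clusters with ONE numerical variable.
num_init_1d = np.array([0.5, 2.0, 7.5])

try:
    num_init_1d.shape[1]
    failed = False
except IndexError:
    # A 1D array has no second dimension, which matches the reported error.
    failed = True

# Reshape to (n_clusters, 1) so that shape[1] is the number of numerical variables.
num_init_2d = num_init_1d.reshape(-1, 1)
print(failed, num_init_2d.shape)
```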