Comments (15)
Hi @KristofMorva, nice to see you're interested.
First up, a big disclaimer: I'm pretty sure I broke at least KModes and the original API of this package for my own purposes. The fork is not production-ready at all, but last time I checked (which is a while ago), it worked (meaning K-Prototypes worked).
With that out of the way: you don't need to do magic stuff to install the package, just make sure you have a compiler toolchain installed. I'll give an example below:
- Make sure you have a Python 3 environment with Cython and numpy installed. I would not recommend a global install, so I'd use a virtual environment. If you really want a global install you can skip this step.
mkdir kmodes_test && cd kmodes_test
python3 -m venv env
source env/bin/activate
- Clone my fork and enter the repository
git clone git@github.com:rphes/kmodes && cd kmodes
- Install the package
pip install -e .
The -e flag for pip installs in editable mode, so no files are moved to the site-packages directory; only a link is placed there so Python knows where to find the package.
You can now import the package using
import cluster
# or perhaps better:
from cluster import KPrototypes
(I think I renamed it)
Usage is as it was, but your data can now be either a tuple of (num, cat), a dict of {'num': num, 'cat': cat} or a pandas dataframe with categorical datatypes in it.
Check it out and let me know if it worked!
from kmodes.
@NinjaTuna, I just released v0.8, where some of the slowness of k-prototypes has been alleviated (#45).
What version are you basing your benchmark on?
from kmodes.
I am running 0.8, installed using pip in Python 3. To be complete: we are talking about a dataset of 100k records, with 14 scalar features and 6 categorical ones. I am happy to further provide whatever information is helpful.
from kmodes.
The k-prototypes implementation in this package is heavily based on the (pretty optimized) k-modes implementation. My focus was initially more on readability than optimization, so I'm sure there is room for improvement on the k-means side of k-prototypes.
The scikit-learn version of k-means is likely highly optimized. For example, I see that the squared norms of data points are pre-computed. I have no intentions currently to implement further speed optimizations for k-prototypes, but PRs are of course welcome.
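To illustrate the kind of optimization meant by pre-computing squared norms, here is a minimal NumPy sketch. It is not scikit-learn's actual code and not what this package does; the function name `squared_distances` and the toy data are made up for the example. The idea is that ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2, and the ||x||^2 term never changes between iterations, so it can be computed once.

```python
import numpy as np

def squared_distances(X, centers, x_sq_norms=None):
    """Squared Euclidean distances between rows of X and rows of centers.

    Hypothetical helper for illustration only.
    """
    if x_sq_norms is None:
        x_sq_norms = np.einsum("ij,ij->i", X, X)
    c_sq_norms = np.einsum("ij,ij->i", centers, centers)
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
    d = x_sq_norms[:, None] - 2.0 * (X @ centers.T) + c_sq_norms[None, :]
    # Rounding can leave tiny negative values; clip them to zero.
    return np.maximum(d, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))           # toy data: 14 numerical features
centers = X[:5].copy()                    # toy centroids
x_sq_norms = np.einsum("ij,ij->i", X, X)  # computed once, reused every iteration
dist = squared_distances(X, centers, x_sq_norms)
```

The matrix product does most of the work here, which is why this form tends to be faster than looping over points and centroids.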
Theoretically, I see no reason why k-prototypes would be much slower than k-means. There might be some minor overhead because you have to do both k-modes and k-means updates, but of course you only do the former on the categorical and the latter on the numerical features so the computational complexity wouldn't change.
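As a rough sketch of why the per-point cost stays in the same order: in the usual k-prototypes formulation, the dissimilarity to a prototype is a squared Euclidean term over the numerical features plus a weighted mismatch count over the categorical ones. The function below is an illustration under that assumption, not this package's code; `kprototypes_dissim` and the toy values are hypothetical.

```python
import numpy as np

def kprototypes_dissim(x_num, x_cat, c_num, c_cat, gamma):
    """Mixed dissimilarity between one point and one prototype (illustrative)."""
    # k-means part: squared Euclidean distance on the numerical features.
    num_part = np.sum((x_num - c_num) ** 2)
    # k-modes part: number of mismatching categorical features.
    cat_part = np.sum(x_cat != c_cat)
    # gamma weighs the categorical part against the numerical part.
    return num_part + gamma * cat_part

x_num = np.array([1.0, 2.0])
x_cat = np.array(["a", "b", "c"])
c_num = np.array([0.0, 2.0])
c_cat = np.array(["a", "x", "c"])
cost = kprototypes_dissim(x_num, x_cat, c_num, c_cat, gamma=0.5)
```

Each feature is touched once, either by the numerical term or by the categorical term, so the work per point and prototype is linear in the total number of features.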
from kmodes.
Good Morning, thank you for your work.
Is this a parallelized implementation, or do you know of a parallel implementation of k-prototypes?
I need to run the algorithm on 2 million rows, so I have serious doubts about the execution time.
I found this article on a MapReduce implementation: https://www.researchgate.net/publication/282853243_Parallel_K-prototypes_for_Clustering_Big_Data
What do you think about it?
Thank you
from kmodes.
This is not a parallel implementation, and making it parallel would require lots of work, unfortunately.
With 2M rows, you'll be in a good position to profile the code, find bottlenecks, and report back here. :)
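For anyone who wants to try this, a minimal profiling sketch using only the standard library could look like the following. `profile_call` is a hypothetical helper and `slow_sum` is just a stand-in for whatever clustering call you want to measure.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()

def slow_sum(n):
    # Stand-in workload; replace with the clustering call to profile.
    return sum(i * i for i in range(n))

result, report = profile_call(slow_sum, 100_000)
print(report)
```

Profiling on a sample of the rows first is usually enough to see which functions dominate the run time.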
Finally, just an idea, but perhaps you can convert your problem to an all-numeric one, so you can use scikit-learn's k-means.
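One possible way to do such a conversion is one-hot encoding of the categorical columns. The sketch below uses NumPy only; `one_hot` and the toy arrays are made up for the example, and whether this encoding suits your data is a separate question, since the 0/1 columns and the numerical columns need comparable scales for the distances to make sense.

```python
import numpy as np

def one_hot(column):
    """One-hot encode a 1-D array of category labels (illustrative helper)."""
    categories, codes = np.unique(column, return_inverse=True)
    codes = codes.reshape(-1)
    encoded = np.zeros((len(column), len(categories)))
    # Set a 1 in the column that corresponds to each row's category.
    encoded[np.arange(len(column)), codes] = 1.0
    return encoded, categories

num = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # numerical features
cat = np.array(["red", "blue", "red"])                   # one categorical feature
cat_encoded, categories = one_hot(cat)
# All-numeric matrix that a k-means implementation could take as input.
X_numeric = np.hstack([num, cat_encoded])
```

The resulting `X_numeric` can then be passed to a k-means implementation such as `sklearn.cluster.KMeans`.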
from kmodes.
I'm working on an implementation using Cython, similar to how sklearn achieves its performance. Currently we're at about 100x speedup for K-prototypes, which makes it more usable for bigger datasets, but not yet quite as fast as sklearn's K-means. Parallel execution is also a possibility, but not my priority right now.
@nicodv, if you're interested, we can discuss incorporating my changes into this repository. I have to note, though, that moving stuff to Cython has not been quite friendly to your original codebase. K-modes is not implemented as of yet.
Moreover, I would like to discuss getting this work merged into scikit-learn, as I believe it would be a valuable asset to their clustering suite.
from kmodes.
@rphes , that is awesome! Would love to see what you did. I understand it might require a painful rewrite of the original code base, but let's have a look and we'll see.
I once reached out to the scikit-learn guys. Because the API of the k-prototypes class is different from that of other clustering algorithms (the categorical=... argument), they did not seem too keen to include it in the core package. Now that there is https://github.com/scikit-learn-contrib, that seems like a great place to move, when ready.
from kmodes.
Great! That's good to hear. You can check out my fork. It is still very much a work in progress. The repo you link seems like a good starting point, but I think we can modify the API or convince the people at sklearn so that it might be included after all. The need for a categorical parameter seems a technicality which can be solved or worked around.
from kmodes.
Hi, sorry for the delayed answer.
It's complicated to transform the data into numeric values, because this unbalances the dataset.
In the end the run took one night, which is OK :)
from kmodes.
Hey @rphes that speedup certainly sounds promising, may I ask for some doc about how exactly to use your fork to test it out (how to install it as a global package and/or import it after a cython build)?
from kmodes.
@rphes thank you very much!
The only thing I had to change in k_proto was line 93, as in case of tuples X.shape[0] raises an error, so I've rewritten it to Xcat.shape[0] (as Xcat and Xnum should have the same shape on axis 0). But besides this, it's working perfectly, and more than 10 times faster!
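For context, a plain Python tuple has no shape attribute, which is why X.shape[0] fails when the data is passed as (num, cat). A hedged sketch of a more general fix is below; `n_samples` is a hypothetical helper and not code from the fork, whose actual line 93 I am not reproducing here.

```python
import numpy as np

def n_samples(X):
    """Number of rows for either an array or a (Xnum, Xcat) tuple (illustrative)."""
    if isinstance(X, tuple):
        Xnum, Xcat = X
        # Both parts describe the same records, so the row counts must match.
        if Xnum.shape[0] != Xcat.shape[0]:
            raise ValueError("Xnum and Xcat must have the same number of rows")
        return Xcat.shape[0]
    return X.shape[0]

Xnum = np.zeros((4, 2))
Xcat = np.array([["a"], ["b"], ["a"], ["c"]])
count = n_samples((Xnum, Xcat))
```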
I think it would be worth integrating the contents of the two repositories, so the efforts can be more focused.
from kmodes.
Also, I have stumbled upon this new paper, which might help in performance (I have only read the abstract): http://www.mdpi.com/2073-8994/9/4/58/pdf
from kmodes.
Closing this, as @rphes 's work on parallel initialization went in a while ago.
For folks looking for speedups, I think looking into GPU acceleration is your best bet: #168
from kmodes.
@nicodv Is there any parameter that can speed up K-Prototypes clustering? I am trying to get clusters from 3L (300,000) data points, and it is taking a long time. Please advise.
Thanks
from kmodes.