jasonlaska / spherecluster

Clustering routines for the unit sphere

Home Page: https://medium.com/@jaska_at_clara/simple-datetime-disambiguation-fd2374ce664a

License: MIT License

Language: Python (100.00%)
Topics: sphericalclustering, scikit-learn, k-means, circular-statistics, von-mises-fisher, spherical-k-means, clustering-algorithm, sampling, directional-statistics

spherecluster's Introduction

Clustering on the unit hypersphere in scikit-learn

Mixture of von Mises Fisher

Algorithms

This package implements the three algorithms outlined in "Clustering on the Unit Hypersphere using von Mises-Fisher Distributions", Banerjee et al., JMLR 2005, for scikit-learn.

  1. Spherical K-means (spkmeans)

    Spherical K-means differs from conventional K-means in that it projects the estimated cluster centroids onto the unit sphere at the end of each maximization step (i.e., it normalizes the centroids); a minimal sketch of this update follows the algorithm list.

  2. Mixture of von Mises Fisher distributions (movMF)

    Much like the Gaussian distribution is parameterized by mean and variance, the von Mises Fisher distribution has a mean direction $\mu$ and a concentration parameter $\kappa$. Each point $x_i$ drawn from the vMF distribution lives on the surface of the unit hypersphere $\mathbb{S}^{N-1}$ (i.e., $\|x_i\|_2 = 1$) as does the mean direction $\|\mu\|_2 = 1$. Larger $\kappa$ leads to a more concentrated cluster of points.

    If we model our data as a mixture of von Mises Fisher distributions, we have an additional weight parameter $\alpha$ for each distribution in the mixture. The movMF algorithms estimate the mixture parameters via expectation-maximization (EM) enabling us to cluster data accordingly.

    • soft-movMF

      Estimates the real-valued posterior on each example for each class. This enables a soft clustering in the sense that we have a probability of cluster membership for each data point.

    • hard-movMF

      Sets the posterior on each example to be 1 for a single class and 0 for all others by selecting the location of the max value in the estimated soft posterior.

    Beyond estimating cluster centroids, these algorithms also jointly estimate the weights of each cluster and the concentration parameters. We provide an option to pass in (and override) weight estimates if they are known in advance.

    Label assignment is achieved by computing the argmax of the posterior for each example.
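
As promised above, here is a minimal sketch of the spherical k-means update described in item 1. This is not the package's implementation: the helper name spkmeans_step is ours, and it assumes the rows of X and centers are already l2-normalized and that no cluster ends up empty.

import numpy as np
from sklearn.preprocessing import normalize

def spkmeans_step(X, centers):
    """One illustrative Lloyd iteration of spherical k-means."""
    # Assignment step: on unit vectors, cosine similarity is just a dot product
    labels = np.argmax(X @ centers.T, axis=1)
    # Update step: per-cluster mean, then projected back onto the unit sphere
    means = np.vstack([X[labels == k].mean(axis=0) for k in range(centers.shape[0])])
    return normalize(means), labels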

Relationship between spkmeans and movMF

Spherical k-means is a special case of both movMF algorithms.

  • If for each cluster we enforce all of the weights to be equal, $\alpha_i = 1/n_\text{clusters}$, and all concentrations to be equal and infinite, $\kappa_i \rightarrow \infty$, then soft-movMF behaves as spkmeans (see the sketch after these bullets).

  • Similarly, if for each cluster we enforce all of the weights to be equal and all concentrations to be equal (with any value), then hard-movMF behaves as spkmeans.
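
A quick sketch of why the first limit holds (notation as above; $c_p(\kappa)$ denotes the vMF normalizing constant, which this README does not spell out): the soft posterior responsibility of cluster $h$ for a point $x$ is

$p(h \mid x) = \dfrac{\alpha_h\, c_p(\kappa_h)\, e^{\kappa_h \mu_h^\top x}}{\sum_l \alpha_l\, c_p(\kappa_l)\, e^{\kappa_l \mu_l^\top x}}.$

With all weights equal and all concentrations equal to a single $\kappa$, this is a softmax of $\kappa\, \mu_h^\top x$; letting $\kappa \rightarrow \infty$ places all mass on $\arg\max_h \mu_h^\top x$, which is exactly the cosine-similarity assignment used by spkmeans.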

Other goodies

  • A utility for sampling from a multivariate von Mises Fisher distribution in spherecluster/util.py.
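
For example (a hedged sketch: the argument order mu, kappa, num_samples is assumed from spherecluster/util.py and may differ in your version):

import numpy as np
from spherecluster.util import sample_vMF

mu = np.array([1.0, 0.0, 0.0])     # mean direction; must be unit-norm
kappa = 50                         # concentration: larger values give a tighter spread around mu
samples = sample_vMF(mu, kappa, 500)   # 500 draws; each row lies on the unit sphere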

Installation

Clone this repo and run

python setup.py install

or via PyPI

pip install spherecluster

The package requires that numpy and scipy are installed independently first.

Usage

Both SphericalKMeans and VonMisesFisherMixture are standard sklearn estimators and mirror the parameter names of sklearn.cluster.KMeans.

# Find K clusters from data matrix X (n_examples x n_features)

# spherical k-means
from spherecluster import SphericalKMeans
skm = SphericalKMeans(n_clusters=K)
skm.fit(X)

# skm.cluster_centers_
# skm.labels_
# skm.inertia_

# movMF-soft
from spherecluster import VonMisesFisherMixture
vmf_soft = VonMisesFisherMixture(n_clusters=K, posterior_type='soft')
vmf_soft.fit(X)

# vmf_soft.cluster_centers_
# vmf_soft.labels_
# vmf_soft.weights_
# vmf_soft.concentrations_
# vmf_soft.inertia_

# movMF-hard
from spherecluster import VonMisesFisherMixture
vmf_hard = VonMisesFisherMixture(n_clusters=K, posterior_type='hard')
vmf_hard.fit(X)

# vmf_hard.cluster_centers_
# vmf_hard.labels_
# vmf_hard.weights_
# vmf_hard.concentrations_
# vmf_hard.inertia_

The full set of parameters for the VonMisesFisherMixture class can be found in the class docstring; see help(VonMisesFisherMixture).

Notes:

  • X can be a dense numpy.array or a sparse scipy.sparse.csr_matrix (a short input-preparation sketch follows these notes)

  • VonMisesFisherMixture has been tested successfully on sparse documents with dimension n_features = 43256. When n_features is very large, the algorithm may encounter numerical instability; this is likely due to the scaling factor of the log-vMF distribution.

  • cluster_centers_ in VonMisesFisherMixture are dense vectors in the current implementation

  • Mixture weights can be manually controlled (overridden) instead of learned.
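
Here is a minimal input-preparation sketch for the notes above (dense and sparse cases; the random data is purely illustrative):

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from spherecluster import SphericalKMeans

# Dense input: l2-normalize each row so every point lies on the unit sphere
X_dense = normalize(np.random.randn(100, 16))

# Sparse input: normalize() accepts and returns CSR matrices
X_sparse = normalize(sp.random(100, 1000, density=0.05, format="csr"))

skm = SphericalKMeans(n_clusters=5).fit(X_dense)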

Testing

From the base directory, run:

python -m pytest spherecluster/tests/

Examples

Small mix

We reproduce the "small mix" example from Section 6.3 of the paper in examples/small_mix.py. We've adjusted the parameters such that one distribution in the mixture has a much lower concentration than the other, to distinguish movMF performance from (spherical) k-means, which does not estimate weight or concentration parameters. We also provide a 3D version of this example in examples/small_mix_3d.py for fun.

Running these scripts will spit out some additional performance metrics for each algorithm.

Small mix 2d

Small mix 3d

It is clear from the figures that the movMF algorithms do a better job by taking advantage of the concentration estimate.

Document clustering

We also reproduce this scikit-learn tf-idf (with optional LSA) + k-means demo in examples/document_clustering.py. The results differ on each run; here's a chart comparing the algorithms' performance for a sample run (a condensed pipeline sketch appears below):

Document clustering

Spherical k-means, which is a simple, low-cost modification to the standard k-means algorithm, performs quite well on this example.
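
For reference, a rough, condensed sketch of that kind of pipeline (this is not the examples/document_clustering.py script itself; the dataset and parameters here are illustrative):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from spherecluster import SphericalKMeans

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data
X = TfidfVectorizer(max_features=10000, stop_words="english").fit_transform(docs)

# Optional LSA step; re-normalize afterwards so rows stay on the unit sphere
lsa = make_pipeline(TruncatedSVD(n_components=100), Normalizer(copy=False))
X_lsa = lsa.fit_transform(X)

labels = SphericalKMeans(n_clusters=20).fit(X_lsa).labels_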

References

  • A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. "Clustering on the Unit Hypersphere using von Mises-Fisher Distributions." Journal of Machine Learning Research, 6:1345-1382, 2005.

  • A. T. A. Wood. "Simulation of the von Mises Fisher distribution." Communications in Statistics - Simulation and Computation, 23(1), 1994.

Attribution

See also

spherecluster's People

Contributors: cv3d, jasonlaska

spherecluster's Issues

AttributeError: 'SphericalKMeans' object has no attribute '_check_fit_data'

I've been getting a
AttributeError: 'SphericalKMeans' object has no attribute '_check_fit_data'

error every time I try to run a SphericalKMeans fit. Diving a little further, it looks like the _check_fit_data method only exists in two locations in the repository...

Line 318 in spherical_kmeans.py (the line that throws the error in this case).

Line 772 in von_mises_fisher_mixture.py where it seems to be a defined method.

Based on what I can see from the imports, it looks like the _check_fit_data doesn't actually exist in the context of spherical_kmeans.py, so the error kind of makes sense.

Could this be the result of some accidental deletions? I went through the commit history and couldn't find anything that immediately seemed like the issue. Or is there something very obvious that I'm missing... wouldn't be the first time :)

Also as an FYI, I'm running Python 3.6.4.

ImportError: cannot import name '_k_means'

Hi. I installed spherecluster using pip install spherecluster successfully on Ubuntu 18.04. But when I call from spherecluster import SphericalKMeans, I get an ImportError.

Traceback (most recent call last):
  File "LM/vectors_cluster.py", line 9, in <module>
    from spherecluster import SphericalKMeans
  File "/path_to_anaconda/lib/python3.6/site-packages/spherecluster/__init__.py", line 2, in <module>
    from .spherical_kmeans import SphericalKMeans
  File "/path_to_anaconda/lib/python3.6/site-packages/spherecluster/spherical_kmeans.py", line 16, in <module>
    from sklearn.cluster import _k_means
ImportError: cannot import name '_k_means'

Here is my environmental information:

Package Version
numpy 1.14.3
scipy 1.1.0
scikit-learn 0.22.2.post1
pytest 3.5.1
nose 1.3.7
joblib 0.14.1
spherecluster 0.1.7

If anyone can help me, I would really appreciate it!

Question about sample_vMF

Hi,
@jasonlaska Thanks for your codes! I learned a lot from the codes!

Recently I ran into a problem: when I used 'VonMisesFisherMixture' to estimate a distribution over sequence data and then used 'sample_vMF' to produce some pseudo samples, I found that all pseudo samples have the same trend as the real ones, but the Y values are always smaller than in the real samples.

Later, I created a list of 10 sequences, all with the value of [0.95 , 0.9, 0.85, 0.8, ..., 0.05, 0]. In my opinion, when I produced pseudo samples from vmf distribution, I should get the same sequence. However, I got [3.82291703e-01, 3.62182865e-01, 3.42056000e-01, 3.21941382e-01, 3.01821386e-01, 2.81705993e-01, 2.61568550e-01, 2.41452144e-01, 2.21317166e-01, 2.01214795e-01, 1.81081612e-01, 1.60967944e-01, 1.40855783e-01, 1.20740353e-01, 1.00600080e-01, 8.04905383e-02, 6.03600387e-02, 4.02586200e-02, 2.01175857e-02, -8.79339154e-06], which is still a straight line, but each value becomes smaller (that is, 0.95 -> 0.382). Why is that? How to solve this problem?

Thank you!

How to use spherecluster in Jupyter notebook

Hi Jason,
I am trying to use the 'spherecluster' package in a Jupyter notebook, but I encounter the following message:
ModuleNotFoundError: No module named 'spherecluster'
after the command "import spherecluster".
I have installed the package through the Windows command window without a problem.
Could you give me some insight into what I have done wrong?
Thank you.
Dmitry

installation: no module spherical kmeans

I'm using Anaconda, installing in an environment with Python 3.4.
I tried installing via pip and via python setup.py install in this environment, and also in a global Python 2.7 environment, each time uninstalling everything and trying again. No luck. Everywhere I get a "no module named spherical_kmeans" message. What can be done here? Thanks a lot!


Mistake in sample_vMF

In the function _sample_weight, b is wrongly calculated as b = dim / (np.sqrt(4. * kappa ** 2 + dim ** 2) + 2 * kappa).
The reference material (eq 4 in Wood 1994) has it as b = (-2 * kappa + sqrt(4 * kappa ** 2 + dim ** 2)) / dim
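
For reference, rationalizing the denominator of the first expression shows how the two forms relate (writing $d$ for dim):

$\frac{d}{\sqrt{4\kappa^2 + d^2} + 2\kappa} = \frac{d\,(\sqrt{4\kappa^2 + d^2} - 2\kappa)}{(4\kappa^2 + d^2) - 4\kappa^2} = \frac{-2\kappa + \sqrt{4\kappa^2 + d^2}}{d}.$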

Cannot import spherecluster with scikit-learn 1.0.2: sklearn.cluster.k_means_ has been renamed

The traceback is

Traceback (most recent call last):
  File "//python3.9/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in <module>
    import spherecluster
  File "/snap/pycharm-community/267/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "//python3.9/site-packages/spherecluster/__init__.py", line 2, in <module>
    from .spherical_kmeans import SphericalKMeans
  File "/snap/pycharm-community/267/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "//python3.9/site-packages/spherecluster/spherical_kmeans.py", line 7, in <module>
    from sklearn.cluster.k_means_ import (
ModuleNotFoundError: No module named 'sklearn.cluster.k_means_'

The problem is that 'sklearn.cluster.k_means_' has been renamed to 'sklearn.cluster._kmeans' in some intermediate scikit-learn version.
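
A hedged sketch of the kind of guarded import that could paper over the rename for the spherical_kmeans.py import (module paths are taken from the traceback and the note above; exact availability varies across scikit-learn versions, and the package may need further changes beyond the import):

try:
    # newer scikit-learn: private k-means module was renamed with a leading underscore
    from sklearn.cluster import _kmeans as _k_means
except ImportError:
    # older scikit-learn
    from sklearn.cluster import k_means_ as _k_means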

ValueError: Data l2-norm must be 1, found 0.0

Thank you for your great source code.
While using the soft von_mises_fisher_mixture, I got this error. The error only happens with 1000 short documents; it runs fine with a smaller amount. Below is the full error log.

Could you please show me how to fix it? Thank you so much

File "mvmf_document_clustering.py", line 65, in <module>
    vmf_soft.fit(X)
  File "C:\Users\phuocphan\miniconda3\envs\Py36\lib\site-packages\spherecluster\von_mises_fisher_mixture.py", line 826, in fit
    X = self._check_fit_data(X)
  File "C:\Users\phuocphan\miniconda3\envs\Py36\lib\site-packages\spherecluster\von_mises_fisher_mixture.py", line 789, in _check_fit_data
    raise ValueError("Data l2-norm must be 1, found {}".format(n))
ValueError: Data l2-norm must be 1, found 0.0

pip package not up to date?

Hi

First of all, thank you for sharing this package!

I'm installing spherecluster with pip and had to manually edit spherical_kmeans.py to fix the import of _k_means (I changed it to from sklearn.cluster import _k_means_fast as _k_means).

I can see this change has already been made in the repo.

Maybe the pip package isn't up to date?

Best
Mehdi

SphericalKMeans does not converge

Hi,
for some reason SphericalKMeans doesn't find any valid centroids on my data. The same happens with randomly generated data, which clearly has clusters.

Running the following code always leads to the cluster centers [1. 1. 1.]:

from scipy.stats import vonmises
import numpy as np
from spherecluster import SphericalKMeans

ang = vonmises.rvs(25, loc=1, size=100)
ang = np.hstack((ang, vonmises.rvs(25, loc=3, size=100)))
ang = np.hstack((ang, vonmises.rvs(25, loc=5, size=100)))

skm = SphericalKMeans(n_clusters=3)
skm.fit(ang.reshape((-1, 1)))
print skm.cluster_centers_

Python 2.7.12, spherecluster 0.1.2

Am I missing something?

Spherical K-Means is producing different results each run even when fixing `random_state` at an integer

It seems to me that fixing the random_state parameter at an integer when calling the SphericalKMeans constructor still seeds a numpy.random.RandomState pseudo-random number generator, and thus does not make Spherical K-Means completely deterministic. The subsequent call to _k_init provides it with a random_state of type RandomState instance instead of int, which preserves the randomness as per the routine's documentation:

    random_state : int, RandomState instance
        The generator used to initialize the centers. Use an int to make the
        randomness deterministic.

As a result, I get different (though close) results across runs.

I want the algorithm to be deterministic for the sake of research, am I missing something in that regard?

P.S. I am initialising the algorithm with k-means++.

black dependency breaks python3.5 install

black is only available for python3.6 and higher; its presence in requirements.txt makes installation of this package on python3.5 (under, for me, Ubuntu 16.04 LTS) fail.

Since black doesn't seem to be used by the package (just in the dev process?) maybe it can be removed from requirements.txt? Doing so makes local installation work fine for me.

Initialization is using euclidean distance

centers = _init_centroids(
    X, n_clusters, init, random_state=random_state, x_squared_norms=x_squared_norms
)
if verbose:
    print("Initialization complete")

I might be getting this wrong, but the code here seems to be using the initialization function from sklearn. This could cause issues, since the kmeans++ initialization in sklearn is based on Euclidean distance. It should be replaced with cosine distance.
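
Worth weighing here: for l2-normalized vectors the two notions of distance are monotonically related,

$\|x - y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 - 2\,x^\top y = 2\,(1 - x^\top y)$ when $\|x\|_2 = \|y\|_2 = 1$,

so on unit-norm data the nearest center under Euclidean distance is also the nearest center under cosine distance.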

Add `normalize=True` parameter

Add a normalize=True parameter (optional for the user) that normalizes the data, to both classes, so that check_estimator can be applied in tests.

Error in _sample_orthonormal_to

In line 55 in spherecluster/util.py, in the _sample_orthonormal_to function, it reads:

proj_mu_v = mu * np.dot(mu, v) / np.linalg.norm(mu)

Shouldn't it instead be:

proj_mu_v = mu * np.dot(mu, v) / np.linalg.norm(mu)**2

If norm(mu)=1 then it doesn't make a difference, but otherwise they are quite different.
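
For context, the orthogonal projection of $v$ onto $\mu$ is

$\mathrm{proj}_{\mu}(v) = \frac{\mu^\top v}{\|\mu\|_2^2}\,\mu,$

which matches the second expression above; the two coincide only when $\|\mu\|_2 = 1$, as the report notes.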

Strict Matplotlib requirement prevents installation on python 3.6

Hello,

Thanks for making this package, it is really useful!

I am using Python 3.6 on OS X 10.12.6 and I tried to install the package through pip. The installation fails while trying to install the matplotlib version specified in the requirements.txt file. This is, I think, a known issue with matplotlib that was fixed in later versions (see issues like matplotlib/matplotlib#3889). I have successfully installed matplotlib version 2.0.2 in my environment.

I think that there are 3 ways of solving this issue.

  1. Bump the version of matplotlib this package depends on
  2. Remove the strict dependency (==) requirement
  3. Remove the matplotlib dependency altogether. From what I understand, matplotlib is only used in the examples and is thus not needed for the packaged version of the library (similar to seaborn and tabulate)

VMF scaling denominator was inf

Hi
I tried to run the soft version, and this error appears: 'VMF scaling denominator was inf'. What could be the reasons for that?
Another question: is there any way to estimate the number of clusters?

Thanks

Source install fails due to exceptions in setup.py

Hello,

I stumbled over the following issue.

When installing a package that depends on spherecluster, and spherecluster is installed from a source distribution (.tgz), it fails with the exception numpy is required during installation raised from setup.py#12, even when the package correctly lists numpy (and scipy) as dependencies.

How to reproduce:

  1. Make a toy setup.py:

from setuptools import setup

setup(
    name="spherecluster_test",
    version="1.0.0",
    install_requires=["numpy", "scipy", "spherecluster"]
)

  2. Build it into a wheel:

python setup.py bdist_wheel

  3. Try to install it with the --no-binary option to force the spherecluster source distribution:

pip install --no-binary "spherecluster" dist/spherecluster_test-1.0.0-py3-none-any.whl

Processing ./dist/spherecluster_test-1.0.0-py3-none-any.whl
Collecting scipy (from spherecluster-test==1.0.0)
  Using cached https://files.pythonhosted.org/packages/a8/0b/f163da98d3a01b3e0ef1cab8dd2123c34aee2bafbb1c5bffa354cc8a1730/scipy-1.1.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting numpy (from spherecluster-test==1.0.0)
  Using cached https://files.pythonhosted.org/packages/16/21/2e88568c134cc3c8d22af290865e2abbd86efa58a1358ffcb19b6c74f9a3/numpy-1.15.3-cp36-cp36m-manylinux1_x86_64.whl
Collecting spherecluster (from spherecluster-test==1.0.0)
  Using cached https://files.pythonhosted.org/packages/27/27/614b9e568e9a9a8d46938310b7caf092657343bf037b9fae416baf611d06/spherecluster-0.1.6.tar.gz
    Complete output from command python setup.py egg_info:
    numpy is required during installation

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-gtqjhbjz/spherecluster/

I suggest removing the following lines from setup.py#L9-L19.

try:
    import numpy  # NOQA
except ImportError:
    print('numpy is required during installation')
    sys.exit(1)

try:
    import scipy  # NOQA
except ImportError:
    print('scipy is required during installation')
    sys.exit(1)

TypeError: Expected sequence or array-like, got <class 'NoneType'>

    self.clus.fit(self.data)
  File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\spherecluster\spherical_kmeans.py", line 363, in fit
    return_n_iter=True,
  File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\spherecluster\spherical_kmeans.py", line 189, in spherical_k_means
    random_state=random_state,
  File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\spherecluster\spherical_kmeans.py", line 39, in _spherical_kmeans_single_lloyd
    sample_weight = _check_sample_weight(X, sample_weight)
  File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\sklearn\utils\validation.py", line 1215, in _check_sample_weight
    n_samples = _num_samples(X)
  File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\sklearn\utils\validation.py", line 147, in _num_samples
    raise TypeError(message)
TypeError: Expected sequence or array-like, got <class 'NoneType'>

Hi, the parameters of _check_sample_weight(sample_weight, X, dtype=None) as defined by sklearn are (sample_weight, X), but spherical_kmeans.py (line 39) calls this function as _check_sample_weight(X, sample_weight). Does the order of the parameters lead to this error?

Using Spherical clustering for Mini-Batch K-Means

Thank you for sharing this great package,
I wanted to experiment with K-Means on a data set big enough to require a Mini-Batch version of K-Means.
Do you have any direction for me to follow to extend your implementation to Mini-Batch?

a question about updating centroids

Hi, I have a question about how centroids are updated in your code:

        # computation of the means
        if sp.issparse(X):
            centers = _k_means._centers_sparse(X, labels, n_clusters,
                                               distances)
        else:
            centers = _k_means._centers_dense(X, labels, n_clusters, distances)

        # l2-normalize centers (this is the main contibution here)
        centers = normalize(centers)

When using cosine similarity for clustering, if you just normalize the centers calculated with _k_means._centers_XXX, which were designed to update centers under Euclidean distance, won't the resulting centers point in different directions from what they should?
I hope I've described my question clearly; looking forward to your reply. Thanks!
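
For what it's worth, on unit-norm data the two updates point the same way:

$\arg\max_{\|\mu\|_2 = 1} \sum_{i \in C_k} \mu^\top x_i = \frac{\sum_{i \in C_k} x_i}{\big\|\sum_{i \in C_k} x_i\big\|_2},$

i.e., the l2-normalized Euclidean mean of a cluster is exactly the unit direction that maximizes the summed cosine similarity (assuming the cluster's sum is nonzero), which is why normalizing the Euclidean centroid yields a valid spherical centroid.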
