lilykos / clusterix

Visual exploration of clustered data.

Python 8.96% HTML 12.27% JavaScript 74.94% CSS 3.83%

clustering visualization visual-analytics tf-idf decomposition plot

clusterix's Introduction

Clusterix: A visual analytics approach to data clustering

Clusterix is a web-based visual analytics tool that supports clustering tasks while keeping the analyst at the center of the workflow. Clusterix provides the facilities to:

load and preview CSV data files;
create a 2D projection of the dataset;
select any combination of fields to be used for projection/clustering;
select and run one or more clustering algorithms (K-Means, Agglomerative Clustering, Mean Shift) with varying parameters;
view and interact with the results in a browser environment;
save time and use an iterative approach;
modify the parameters or input data to correct the clustering output.

Such an iterative, visual analytics approach allows users to quickly determine the best clustering algorithm and parameters for their data, and to correct any errors in the clustering output. Clusterix has been applied to the clustering of heterogeneous data sets.

Usage

First you need to install the requirements:

pip install -r requirements.txt

To run the project:

python manage.py runserver

This command runs Clusterix at http://127.0.0.1:5000, where you can use the interface to upload data files and select the algorithms and options that you want.

Features

File input (CSV only currently)

Data Preview
Field selection
Text Features (Vectorizers, stemming, stopwords, etc)

Vectorizers

Count Vectorizer
Tf-Idf Vectorizer
Hashing Vectorizer
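
As a rough illustration of how the three vectorizers differ, here is a minimal sketch that assumes they correspond to the scikit-learn classes of the same names (the README does not state this explicitly). The sample documents and the `n_features` value are made up for the example.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Hypothetical sample documents, not taken from Clusterix.
docs = [
    "red wine from France",
    "white wine from Italy",
    "dark beer from Belgium",
]

# Count Vectorizer: raw term counts over a learned vocabulary.
count_matrix = CountVectorizer(stop_words="english").fit_transform(docs)

# Tf-Idf Vectorizer: counts reweighted so terms shared by many documents weigh less.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Hashing Vectorizer: no stored vocabulary, terms are hashed into a fixed number of columns.
hashed_matrix = HashingVectorizer(n_features=2**8).transform(docs)

print(count_matrix.shape, tfidf_matrix.shape, hashed_matrix.shape)
```

The count and Tf-Idf matrices share the same vocabulary-sized width, while the hashed matrix always has the width requested through `n_features`.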

Decomposition

PCA
SVD
MDS
t-SNE
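
The four decomposition options all reduce a feature matrix to 2D coordinates for plotting. The sketch below assumes the scikit-learn implementations (`PCA`, `TruncatedSVD`, `MDS`, `TSNE`), which may or may not match what Clusterix calls internally; the random data and parameter values are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import MDS, TSNE

# Hypothetical feature matrix: 30 samples with 6 numeric features.
rng = np.random.RandomState(0)
features = rng.rand(30, 6)

# One reducer per README entry, each configured for a 2D projection.
reducers = {
    "PCA": PCA(n_components=2),
    "SVD": TruncatedSVD(n_components=2),
    "MDS": MDS(n_components=2, random_state=0),
    # perplexity must be smaller than the number of samples
    "t-SNE": TSNE(n_components=2, perplexity=5, random_state=0),
}

# Each projection is an array of shape (n_samples, 2).
projections = {
    name: reducer.fit_transform(features) for name, reducer in reducers.items()
}

for name, coords in projections.items():
    print(name, coords.shape)
```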

Algorithms

K-Means
Agglomerative Clustering
Mean Shift
DBSCAN
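
To make the differences between the listed algorithms concrete, the following sketch runs each of them on the same 2D points. It assumes the scikit-learn estimators of the same names; the synthetic data and every parameter value (`n_clusters`, `eps`, `min_samples`) are illustrative choices, not Clusterix defaults.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans, MeanShift
from sklearn.datasets import make_blobs

# Hypothetical 2D points, standing in for a saved projection.
points, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

models = {
    # K-Means and Agglomerative Clustering need the number of clusters up front.
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative Clustering": AgglomerativeClustering(n_clusters=3),
    # Mean Shift and DBSCAN infer the number of clusters from the data density.
    "Mean Shift": MeanShift(),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
}

# fit_predict returns one integer label per input point.
labels_by_model = {
    name: model.fit_predict(points) for name, model in models.items()
}

for name, labels in labels_by_model.items():
    print(name, sorted(set(labels)))
```

DBSCAN uses the label -1 for points it treats as noise, so its label set can differ in kind from the other three.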

Plot Features

Scatterplot visualizations
Full text/column search for the nodes
Brushing and zoom for targeted inspection
Various clustering metrics (TF-IDF, etc)

Instructions

Clusterix works iteratively, so there are certain steps that need to be followed:

Upload a data file. The necessary information/preprocessing will happen and the options will be shown.
First you need to get a projection of the data, so use all the text and field options to tune your decomposition.
The decomposition model and the coordinates are saved, so that you can iterate through clustering models really fast.
In case you need to try a new decomposition, create a new projection.
Use brushing to get TF-IDF (if applicable) and a zoomed area for browsing.
The Search function uses SQLite syntax, so every time you want to write a query, imagine that it starts like this: SELECT * FROM dataframe WHERE...
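
The search step can be pictured with a small standalone sketch. It uses Python's built-in `sqlite3` module and a table named `dataframe`, as in the instruction above; the column names, the rows, and the `search` helper are hypothetical and only illustrate how a typed condition completes the `SELECT * FROM dataframe WHERE` prefix.

```python
import sqlite3

# Hypothetical rows standing in for an uploaded CSV file.
rows = [
    ("Barolo", 14.2, 0),
    ("Barbera", 12.8, 1),
    ("Riesling", 11.5, 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataframe (name TEXT, alcohol REAL, cluster INTEGER)")
conn.executemany("INSERT INTO dataframe VALUES (?, ?, ?)", rows)


def search(where_clause):
    """Run the user's condition as the WHERE part of a fixed SELECT.

    Illustrative only: concatenating raw text into SQL is unsafe for
    untrusted input.
    """
    query = "SELECT * FROM dataframe WHERE " + where_clause
    return conn.execute(query).fetchall()


# The user types only the condition, not the SELECT prefix.
matches = search("alcohol > 13 AND name LIKE '%Bar%'")
print(matches)
```

Any expression that is valid after `WHERE` in SQLite, such as comparisons, `LIKE` patterns, or conditions combined with `AND`/`OR`, fits this pattern.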

Screenshots

Wine Data


(2017-10-12 08:31:54) - INFO	127.0.0.1 - - [12/Oct/2017 08:31:54] "POST /get_clustering_results HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/brray/Documents/Exxon/Demo/clusterix-master/src/routes/projections.py", line 17, in get_clustering_results
    return jsonify(**{'results': cluster_data(attrs)})
  File "/Users/brray/Documents/Exxon/Demo/clusterix-master/src/config.py", line 43, in wrapper
    return func(*args)
  File "/Users/brray/Documents/Exxon/Demo/clusterix-master/src/cluster/process.py", line 38, in cluster_data
    df['clx_cluster'] = labels
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/frame.py", line 2429, in __setitem__
    self._set_item(key, value)
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/frame.py", line 2495, in _set_item
    value = self._sanitize_column(key, value)
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/frame.py", line 2666, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/Users/brray/anaconda2/anaconda/envs/py3k/lib/python3.5/site-packages/pandas/core/series.py", line 2879, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

[out_emails_labled.csv.zip](https://github.com/Lilykos/clusterix/files/1379398/out_emails_labled.csv.zip)
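
The traceback ends in pandas rejecting `df['clx_cluster'] = labels` because `labels` and `df` have different lengths. The actual cause inside Clusterix is not shown, so the sketch below only reproduces the same class of error under one assumed scenario: rows are dropped before clustering, and the labels are then assigned back to the full frame. The `text` column and the data are hypothetical.

```python
import pandas as pd

# Hypothetical input with one missing value.
df = pd.DataFrame({"text": ["first email", None, "third email"]})

# Assumed preprocessing step: rows with missing text are dropped before clustering.
clean = df.dropna(subset=["text"])

# One label per cleaned row, so two labels for a three-row frame.
labels = [0, 1]

# Assigning the shorter list to the full frame raises the ValueError.
try:
    df["clx_cluster"] = labels
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Possible guard: align the labels on the index of the rows that were clustered.
# Rows that were dropped receive NaN instead of a label.
df["clx_cluster"] = pd.Series(labels, index=clean.index)

print(error_message)
print(df)
```

Checking `len(labels) == len(df)` before the assignment would surface the mismatch earlier with a clearer message.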