
hw3-clustering

Project 5

Implementation of KMeans and Silhouette Scoring

Assignment

Overview

The goal of this assignment is to implement the k-means clustering algorithm and the silhouette scoring metric.

For a refresher on the k-means algorithm, you can check out this overview or the wiki.
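As a concrete anchor for that refresher, here is a minimal sketch of a single Lloyd's iteration (assign each point to its nearest centroid, then recompute centroids as cluster means). The function name is illustrative, not part of the required API:

```python
import numpy as np

def lloyd_iteration(X, centroids):
    """One iteration of Lloyd's algorithm: assign points to the nearest
    centroid, then recompute each centroid as the mean of its points."""
    # squared Euclidean distance from every point to every centroid: (n, k)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # empty clusters keep their old centroid
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids
```

Repeating this step until the centroids stop moving (or a max-iteration cap is hit) is the whole fitting loop.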

For a refresher on the silhouette scoring metric, you can check out this overview, which gives a broad view of the intuition behind the function (though it uses the sklearn implementation, while you will need to build your own from scratch), or the wiki.
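The quantity being scored per point is s = (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster and b is the mean distance to the nearest other cluster. A minimal per-sample sketch of that formula (illustrative only; note sklearn assigns 0 to points in singleton clusters, which this sketch does not handle):

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """Silhouette score for a single point i: s = (b - a) / max(a, b)."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # mean distance to the other members of i's cluster
    # (sklearn defines s = 0 for singleton clusters; not handled here)
    a = d[own].sum() / max(own.sum() - 1, 1)
    # smallest mean distance to any other cluster
    b = min(d[labels == c].mean() for c in set(labels) - {labels[i]})
    return (b - a) / max(a, b)
```

Scores near 1 mean the point sits well inside its cluster; scores near -1 mean it is probably assigned to the wrong cluster.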

Unlike previous assignments, where we gave you a particular bioinformatics use case and problem, we will keep this assignment abstract and operate on matrices of arbitrary sizes (i.e. some random number of observations and some random number of features). The reasoning is that it is sometimes best to consider the scope of the inputs and outputs to an algorithm without bringing in any domain-specific logic or naming.

Instead of providing you with a fixed input to the KMeans algorithm, I've provided some code to generate a clustered 2D matrix with known labels. Use this as your input and validate your results visually with the plotting functions I've provided.

Consider the scope of the inputs and how the different parameters of the input data could break the bounds of your implementation.

  • should your model still run if the provided k=0?
  • should your model still run if the number of observations < the provided k?
  • can your model handle very high k?
  • can your model handle very high dimensionality? what about only a single dimension?

API

This implementation must support a fixed and predefined API. Please consider which methods of your implementation should be public and which private, and denote them using the appropriate underscored style.

class SomeClass:

  def public_method(self):
    """
    A method intended for users of the class.
    """

  def _private_method(self):
    """
    An internal helper; the leading underscore marks it private by convention.
    """

Your KMeans algorithm will be implemented using a scikit-learn style API (i.e. init/fit/predict) which will require 3 steps to run.

The intuition behind splitting up the fit/predict methods is that you may want to cluster novel data using the centroids calculated from some initial training set.

Your Silhouette scoring metric will be implemented using a scikit-learn style metric API (i.e. init/score) which will require 2 steps to run. For this, we will also ask you to write a test that compares your Silhouette scoring function with the one in sklearn (sklearn.metrics.silhouette_score).
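One way to structure that comparison is to keep a brute-force reference alongside a pytest-style assertion against sklearn. A sketch, assuming sklearn is installed; the function names here are illustrative, not the required API:

```python
import numpy as np

def naive_silhouette(X, labels):
    """Brute-force per-sample silhouette scores, usable as a reference."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    scores = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        # mean distance to the rest of i's own cluster
        a = D[i, own].sum() / max(own.sum() - 1, 1)
        # mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        scores[i] = (b - a) / max(a, b)
    return scores

def test_matches_sklearn():
    # hypothetical pytest-style test comparing against sklearn's metric
    from sklearn.metrics import silhouette_score
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
    labels = np.array([0] * 20 + [1] * 20)
    assert np.isclose(naive_silhouette(X, labels).mean(),
                      silhouette_score(X, labels))
```

In your actual test you would replace naive_silhouette with a call to your own Silhouette class.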

When you are writing your code, try to put some checks in place that anticipate end-user misuse of your code. Should the code produce a result if, for example, the methods are run out of order? What if the data that is provided is invalid somehow?

Try to anticipate some of these sorts of errors in your code so that it's more robust.

The intuition behind splitting this up will not be immediately apparent for this silhouette scoring implementation, but in the real world there are usually multiple scoring metrics to calculate, which would be subclassed from some shared Metric class.

Building up your models with this in mind will allow for easy integration of subclasses with superclasses down the line if you ever end up implementing them.
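A minimal sketch of what such a hierarchy might look like (the Metric name and interface are assumptions for illustration, not part of the required API, and the score body here is a placeholder):

```python
from abc import ABC, abstractmethod

import numpy as np

class Metric(ABC):
    """Hypothetical shared base class for scoring metrics."""

    @abstractmethod
    def score(self, X: np.ndarray, labels: np.ndarray) -> np.ndarray:
        """Return a per-observation score for a clustering of X."""

class Silhouette(Metric):
    def score(self, X, labels):
        # placeholder: a real implementation computes per-sample
        # silhouette values here
        return np.zeros(len(X))
```

With this structure, a future DaviesBouldin or CalinskiHarabasz metric would just be another Metric subclass.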

While these methods are the only ones we are requiring, we highly recommend you create more private methods for specific subtasks within your implementation.

Example Usage

# instantiation of model with some params
kmeans = KMeans(*args, **kwargs)
silhouette = Silhouette(*args, **kwargs)

# fit the model using some input data
kmeans.fit(input_data)

# predict the labels on some input data
labels = kmeans.predict(input_data)
scores = silhouette.score(labels)

Tasks

  • Note: we will check that you can run your code on a medium-size dataset

[ TODO ] Complete the KMeans class with your implementation of the algorithm

[X] complete the fit method
[X] complete the predict method
[X] complete the get_error method
[X] complete the get_centroid method
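The assignment does not pin down exactly what get_error should return; one plausible reading is the k-means objective, i.e. the mean squared distance of each point to its assigned centroid. A sketch under that assumption (the definition here is a guess, not the assignment's spec):

```python
import numpy as np

def get_error(X, labels, centroids):
    """Mean squared distance of each point to its assigned centroid.

    This is one plausible reading of get_error (the k-means objective);
    check the assignment's own docstring for the intended definition.
    """
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum(axis=1).mean())
```

Tracking this value across iterations is also a convenient convergence check: stop when it no longer decreases meaningfully.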

[ TODO ] Complete the Silhouette class with your implementation of the metric

[X] complete the score method

[ TODO ] Unit Testing
[X] KMeans Class
[X] Silhouette Class -- make sure to test against sklearn

[ TODO ] Packaging
[X] pip installable module
[X] github actions (install + pytest)

For those who are particularly interested: try to implement the k-means++ initialization algorithm. This is the method sklearn uses to initialize the centroids to good first guesses, and it dramatically increases the speed of convergence.

We won't give extra points for this, but it is a good exercise.
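For reference, the k-means++ seeding rule is: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A sketch of that rule (function name illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """k-means++ seeding: spread the initial centroids out by sampling
    points proportionally to their squared distance from the nearest
    centroid already chosen."""
    rng = rng or np.random.default_rng()
    # first centroid: a uniformly random observation
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min(
            ((X[:, None, :] - np.array(centroids)[None]) ** 2).sum(-1),
            axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Because far-away points are heavily favored, the seeds tend to land in distinct clusters, which is why convergence speeds up.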

Getting Started

To get started, you will need to fork this repo onto your own GitHub account. Work on your codebase from your own repo and commit changes. We have listed the minimum (and maximum) python module requirements in requirements.txt; keep in mind that beyond these you will also need sklearn to implement the Silhouette scoring test.

It will save you a lot of work to use scipy.spatial's functions for distance computation, for example, rather than writing your own, although of course you are welcome to do so.
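For example, scipy.spatial.distance.cdist computes the full observation-by-centroid distance matrix in one call, which covers the main distance computation k-means needs (this snippet assumes scipy is installed):

```python
import numpy as np
from scipy.spatial.distance import cdist

# pairwise Euclidean distances: rows are observations, columns centroids
X = np.random.default_rng(0).normal(size=(5, 3))
centroids = X[:2]                 # pretend the first two rows are centroids
D = cdist(X, centroids)           # shape (5, 2)
labels = D.argmin(axis=1)         # nearest centroid per observation
```

cdist also supports other metrics via its metric argument if you want to experiment beyond Euclidean distance.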

Using Utility Functions

We've built up some utility functions for you to use to test and visualize your implementation of K-Means and Silhouette scoring. There are 3 functions that you can use for these tests and you can read more about how to use them in their docstrings + usage below:

from cluster import (
  make_clusters, 
  plot_clusters, 
  plot_multipanel)

"""
Cluster Generation
"""

# here we are making tightly clustered data by lowering the scale param
t_clusters, t_labels = make_clusters(scale=0.3) 

# here we are making loosely clustered data by increasing the scale param
l_clusters, l_labels = make_clusters(scale=2) 

# here we are making many clusters by adjusting the `k` param
m_clusters, m_labels = make_clusters(k=10)

# here we are directly controlling the dimensionality of our data:
#   1000 observations
#   200 features
#   3 clusters
d_clusters, d_labels = make_clusters(n=1000, m=200, k=3)


"""
Cluster Visualization
"""
# show the visualization for some clusters and their associated labels
plot_clusters(t_clusters, t_labels)

# show a multipanel visualization of clusters with true labels, predicted labels, and silhouette scores
# you will need to calculate these predicted labels with your kmeans implementation and the scores 
# with your silhouette score implementation
plot_multipanel(t_clusters, t_labels, pred_labels, scores)

Example Visualizations

Here are some examples of visualizations for some tightly and loosely clustered data.

I've also included an example output for the multipanel visualization.

Tightly Clustered Data

(figure: tightly-clustered-data)

Loosely Clustered Data

(figure: loosely-clustered-data)

Multipanel Visualization

Note that the cluster labels themselves may change after clustering (i.e. expected cluster 0 is not guaranteed to be named cluster 0 after clustering).

(figure: multipanel-data)

Grading (Total: 10)

  • KMeans Implementation (3)
  • Silhouette Implementation (2)
  • Correct API (1)
  • Unit Tests (2)
  • Pip Installable / Actions (2)

Contributors

ggaviles, alexj-lee
