
parallel-kmeans's Introduction

CS6068 Parallel Computing Final Project

Parallel K-Means

Background

K-Means clustering is one of the most widely used unsupervised learning algorithms because of its simplicity and versatility. It is simple because it repeats a small set of steps until the results converge to an acceptable point.

I am a machine learning enthusiast and have used this algorithm in a few of my projects. I have always considered K-Means a simple but versatile clustering algorithm with a decent time complexity.

Given a dataset of n data points, the time complexity is O(nkt), where k is the number of clusters and t is the number of iterations needed for convergence. This counts distance computations; each one costs time proportional to the number of attributes.
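To make the O(nkt) figure concrete, here is a minimal sketch that counts distance computations for one hypothetical workload. The function name and the sample values of n, k and t are illustrative assumptions, not measurements from this project.

```python
def distance_computations(n, k, t):
    """Number of point-to-centroid distances evaluated by K-Means.

    n: data points, k: clusters, t: iterations until convergence.
    """
    return n * k * t


# Hypothetical workload: one million points, 10 clusters, 50 iterations.
print(distance_computations(1_000_000, 10, 50))  # 500000000
```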

The problem is that the algorithm becomes slow when the dataset is very large, the number of clusters is relatively high, or both. Using data parallelism to calculate the distances may improve the runtime.

Objectives

The sequential K-Means algorithm repeats the same set of steps until it reaches an acceptable convergence. Initially, k cluster centroids are chosen from the data points, either randomly or using some criterion. Improving the method for choosing the initial centroids is not part of this project.

K-Means Algorithm Steps:

  1. Find the distance of each data point to all centroids.
  2. Assign each data point to the cluster with the closest centroid.
  3. Update the centroid values.

Repeat until there is no difference between the results of consecutive iterations.
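The steps above can be sketched in plain Python as follows. This is a minimal illustration using only the standard library, not the code in pkmeans.py. The function names, the random choice of initial centroids, the `max_iter` cap and the handling of empty clusters are all assumptions made for the example.

```python
import math
import random


def assign_clusters(points, centroids):
    """Steps 1 and 2: label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]  # step 1
        labels.append(distances.index(min(distances)))    # step 2
    return labels


def update_centroids(points, labels, centroids):
    """Step 3: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, k, max_iter=100, seed=0):
    """Repeat steps 1-3 until the assignments stop changing."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = assign_clusters(points, centroids)
        if new_labels == labels:  # no difference between consecutive iterations
            break
        labels = new_labels
        centroids = update_centroids(points, labels, centroids)
    return centroids, labels
```

A call such as `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` returns the final centroids and one cluster label per point.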

Of the steps above, calculating the distance from each data point to each centroid takes the most time. The same operation is performed on every data point, although the centroids change at each iteration.
The main goal of this project is to improve the run time for large datasets with more than 3 attributes by parallelizing the distance calculation for each data point.
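One way to parallelize the distance calculation is to split the data points into chunks and let a pool of workers find the nearest centroid for each chunk independently. The sketch below assumes `concurrent.futures` from the standard library. The function names, the chunking scheme and the `executor_cls` parameter are hypothetical, and pkmeans.py may use a different mechanism.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def nearest_centroid_chunk(chunk, centroids):
    """Index of the closest centroid for every point in one chunk."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in chunk
    ]


def split_into_chunks(points, n_chunks):
    """Split points into at most n_chunks contiguous slices of similar size."""
    size = max(1, math.ceil(len(points) / n_chunks))
    return [points[i:i + size] for i in range(0, len(points), size)]


def parallel_assign(points, centroids, n_workers=4, executor_cls=ThreadPoolExecutor):
    """Data-parallel assignment: each worker handles one chunk of points."""
    if not points:
        return []
    chunks = split_into_chunks(points, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        # map() yields results in chunk order, so labels stay aligned with points.
        results = pool.map(nearest_centroid_chunk, chunks, [centroids] * len(chunks))
        return [label for chunk_labels in results for label in chunk_labels]
```

With the default thread pool on a standard CPython build, the global interpreter lock keeps this pure-Python loop from running on several cores at once, so the example mainly shows the structure. Passing `executor_cls=ProcessPoolExecutor` runs the chunks in separate processes instead; on platforms that spawn new processes, the call should then sit under an `if __name__ == "__main__":` guard.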

Files:

pkmeans.py

Contains the code for the sequential and parallel implementations of the K-Means algorithm.

Data

The datasets used are stored in the data folder. Make sure to download the code together with the data folder so that it executes properly.

PK-Means - Report & PK-Means - Presentation

Explanation of the project and how it was executed.

Contributors

nithyasribabu
