Assignment 2: Hadoop MapReduce for Big Data Management Systems

Authors:

  • Emmanouil Dellatollas (8200037)
  • Panagiotis Alexios Spanakis (8200158)

1. Introduction

The goal of this assignment is to implement the K-means algorithm as a MapReduce program. The program is written in Python and runs on a Hadoop cluster over a dataset of two-dimensional points. The dataset is stored in the file data.txt, which is generated by generator.py. It is a text file with one point per line: each line contains the x coordinate and the y coordinate of the point, separated by a comma.

The file used to generate the dataset is the following:

generator.py

This file generates the dataset (data.txt). It reads the centers.txt file and generates the dataset accordingly.

(To customize the cluster centers, edit the centers.txt file.)

Run it with the following command:

python generator.py
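
As an illustration of what such a generator could look like, here is a minimal sketch. It is not the code in generator.py: it assumes that centers.txt uses the same "x,y" line format as data.txt and that points are drawn from a Gaussian around each center. The function names and the points_per_center, spread and seed parameters are hypothetical.

```python
import random


def parse_centers(lines):
    """Parse 'x,y' lines into (x, y) float tuples, skipping blank lines."""
    centers = []
    for line in lines:
        line = line.strip()
        if line:
            x, y = line.split(",")
            centers.append((float(x), float(y)))
    return centers


def generate_points(centers, points_per_center, spread=1.0, seed=None):
    """Draw points_per_center Gaussian samples around every center.

    The Gaussian noise model and the spread parameter are assumptions
    made for this sketch, not something the assignment specifies.
    """
    rng = random.Random(seed)
    points = []
    for cx, cy in centers:
        for _ in range(points_per_center):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    rng.shuffle(points)  # avoid writing the clusters in contiguous blocks
    return points


def format_points(points):
    """Render points as 'x,y' lines, the format described for data.txt."""
    return [f"{x},{y}" for x, y in points]
```

Writing the result of format_points to a file, one entry per line, would give a dataset in the format described above.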

2. K-means algorithm Implementation 1

The files used for this implementation of the K-means algorithm are located in the mrjob folder and are the following:

k_means_job.py

This file is used to configure the Hadoop job.

k_means_run.py

This file runs the K-means algorithm on the Hadoop cluster. Run it with the following command:

python k_means_run.py data.txt -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar

Note: you need to provide the correct path to the Hadoop streaming JAR file in the command and change the centers inside the k_means_run.py file.

3. K-means algorithm Implementation 2

mapper.py

The mapper reads the dataset and, for each point, emits the closest center.
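
The assignment step can be sketched as follows. This is an illustration rather than the contents of mapper.py: it assumes Euclidean distance, identifies each center by its index, and emits tab-separated key/value lines, which is the default record format of Hadoop Streaming. The names closest_center and map_lines are hypothetical.

```python
import math


def closest_center(point, centers):
    """Return the index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def map_lines(lines, centers):
    """Yield 'center_index<TAB>x,y' records, one per input point.

    In a real streaming mapper, lines would come from sys.stdin and each
    record would be printed to stdout; the centers would be loaded from
    a file shipped with the job (assumed here, not shown).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        x, y = (float(v) for v in line.split(","))
        yield f"{closest_center((x, y), centers)}\t{x},{y}"
```

Keying each record by the center index means that all points assigned to the same cluster reach the same reducer call.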

reducer.py

The reducer calculates the new centers of the clusters.
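
A minimal sketch of the update step, assuming the input records have the hypothetical form "center_index<TAB>x,y" and arrive sorted by key, as Hadoop Streaming delivers them to a reducer. It is not the code in reducer.py; the name reduce_lines is made up for this example.

```python
from itertools import groupby


def reduce_lines(lines):
    """Yield (key, (mean_x, mean_y)) for each run of records sharing a key.

    The input must already be grouped by key; groupby only merges
    adjacent records, which matches the sorted reducer input.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda record: record[0]):
        sum_x = sum_y = 0.0
        count = 0
        for _, value in group:
            x, y = (float(v) for v in value.split(","))
            sum_x += x
            sum_y += y
            count += 1
        # The new center is the mean of the points assigned to the old one.
        yield key, (sum_x / count, sum_y / count)
```

A cluster that receives no points produces no output here, so a driver built on this sketch would have to decide what to do with such a center, for example keep its previous position.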

run.sh or run.ps1

The run file runs the full K-means algorithm on the Hadoop cluster. To customize the cluster centers, edit the centers.txt file.

./run.sh

or

./run.ps1

Attention: you need to provide the correct path to the Hadoop JAR file in the run file.
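
K-means is iterative: the assign and update steps are repeated, with the output centers of one pass used as the input centers of the next. The sketch below simulates that loop locally in plain Python, without Hadoop. The stopping rule (largest center shift at most tol, or max_iter passes) is an assumption made for this example; the actual criterion in the run file may differ. All function and parameter names are hypothetical.

```python
import math


def assign(points, centers):
    """Group points by the index of their nearest center (the map step)."""
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        clusters[i].append(p)
    return clusters


def update(clusters, old_centers):
    """Replace each center by the mean of its points (the reduce step)."""
    new_centers = []
    for members, old in zip(clusters, old_centers):
        if members:
            n = len(members)
            new_centers.append((sum(x for x, _ in members) / n,
                                sum(y for _, y in members) / n))
        else:
            new_centers.append(old)  # keep a center that attracted no points
    return new_centers


def k_means(points, centers, tol=1e-6, max_iter=100):
    """Repeat assign/update until no center moves more than tol."""
    for iteration in range(1, max_iter + 1):
        new_centers = update(assign(points, centers), centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iteration
```

On the cluster, each pass of this loop corresponds to one MapReduce job, so the number of iterations directly determines how many jobs are launched.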
