Assignment 2: Hadoop MapReduce for Big Data Management Systems

Authors:

Emmanouil Dellatollas (8200037)
Panagiotis Alexios Spanakis (8200158)

1. Introduction

The goal of this assignment is to implement the K-means algorithm as a MapReduce program. The program is written in Python and runs on a Hadoop cluster. The input is a dataset of two-dimensional points, stored in data.txt and generated by generator.py. The dataset is a text file with one point per line: each line holds the x coordinate and the y coordinate of a point, in that order, separated by a comma.
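To make the input format concrete, here is a minimal sketch of how one line of data.txt could be parsed. The function name parse_point is hypothetical and is not taken from the repository.

```python
def parse_point(line):
    """Parse one 'x,y' line into a tuple of two floats.

    Hypothetical helper: assumes exactly two comma-separated numbers per line.
    """
    x_str, y_str = line.strip().split(",")
    return float(x_str), float(y_str)
```

For example, parse_point("1.5,-2.0") would yield the pair (1.5, -2.0).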

The file used to generate the dataset is the following:

generator.py

This script generates the dataset (data.txt). It reads the cluster centers from the centers.txt file and generates the points around them.

(To customize the cluster centers, edit the centers.txt file.)

Run it with the following command:

python generator.py
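The README does not say how generator.py distributes points around each center. As a rough sketch only, assuming Gaussian noise around every center, a generator could look like the following. The function name generate_points and the parameters points_per_center, spread and seed are all assumptions made for illustration.

```python
import random


def generate_points(centers, points_per_center=100, spread=1.0, seed=None):
    """Return 'x,y' lines scattered around each center.

    Assumption: points follow a Gaussian distribution around their center;
    the actual generator.py may use a different distribution.
    """
    rng = random.Random(seed)
    lines = []
    for cx, cy in centers:
        for _ in range(points_per_center):
            x = rng.gauss(cx, spread)
            y = rng.gauss(cy, spread)
            lines.append(f"{x},{y}")
    return lines
```

Writing the returned lines to a file, one per line, would produce a dataset in the format described in the introduction.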

2. K-means algorithm Implementation 1

The files used for this implementation of the K-means algorithm are located in the mrjob folder:

k_means_job.py

This file is used to configure the Hadoop job.

k_means_run.py

This file runs the K-means algorithm on the Hadoop cluster. Run it with the following command:

python k_means_run.py data.txt -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar

Note: you need to provide the correct path to the Hadoop streaming JAR file in the command, and change the centers inside the k_means_run.py file.

3. K-means algorithm Implementation 2

mapper.py

The mapper reads the dataset and emits the closest center for each point.
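The assignment step can be sketched as follows. This is not the repository's mapper.py: the function names, the use of the center index as the key, and the tab-separated output format are assumptions chosen to match the usual Hadoop Streaming convention.

```python
def closest_center(point, centers):
    """Return the index of the center nearest to point (squared Euclidean distance)."""
    px, py = point
    return min(
        range(len(centers)),
        key=lambda i: (px - centers[i][0]) ** 2 + (py - centers[i][1]) ** 2,
    )


def map_line(line, centers):
    """Turn one 'x,y' input line into a 'center_index<TAB>x,y' output line."""
    x, y = (float(v) for v in line.strip().split(","))
    idx = closest_center((x, y), centers)
    return f"{idx}\t{x},{y}"
```

In a streaming job, a mapper of this kind would typically apply map_line to every line read from standard input and print the result.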

reducer.py

The reducer calculates the new centers of the clusters.
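A minimal sketch of the update step, assuming the reducer receives key-sorted lines of the form 'center_index<TAB>x,y' and that each new center is the mean of the points assigned to it. The function name reduce_lines and the input format are assumptions, not the repository's reducer.py.

```python
from itertools import groupby


def reduce_lines(lines):
    """Yield (center_index, (mean_x, mean_y)) for key-sorted 'index<TAB>x,y' lines."""
    parsed = (line.strip().split("\t") for line in lines if line.strip())
    # groupby relies on the input being sorted by key, as Hadoop guarantees
    # for the lines delivered to a single reducer.
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        sum_x = sum_y = 0.0
        count = 0
        for _, value in group:
            x, y = value.split(",")
            sum_x += float(x)
            sum_y += float(y)
            count += 1
        yield int(key), (sum_x / count, sum_y / count)
```

Because only a running sum and a count are kept per key, the points of a cluster never need to be held in memory at once.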

run.sh or run.ps1

The run script executes the full K-means algorithm on the Hadoop cluster. You can customize the centers of the clusters by editing the centers.txt file.

./run.sh

./run.ps1

Attention: you need to provide the correct path to the Hadoop JAR file in the run script.
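The README does not describe how the run script decides when to stop. As an illustration only, the sketch below shows the overall K-means loop in plain Python, run locally rather than on Hadoop: assign each point to its closest center, recompute the centers, and repeat until they stop moving. The function name kmeans and the parameters max_iters and tol are assumptions.

```python
def kmeans(points, centers, max_iters=20, tol=1e-4):
    """Run K-means locally; return the final list of (x, y) centers."""
    centers = list(centers)
    for _ in range(max_iters):
        # Assignment step (the mapper's role): accumulate sums per closest center.
        sums = {}
        for x, y in points:
            idx = min(
                range(len(centers)),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
            )
            sx, sy, n = sums.get(idx, (0.0, 0.0, 0))
            sums[idx] = (sx + x, sy + y, n + 1)
        # Update step (the reducer's role): each new center is the mean of its points.
        # A center with no assigned points keeps its previous position.
        new_centers = list(centers)
        for idx, (sx, sy, n) in sums.items():
            new_centers[idx] = (sx / n, sy / n)
        # Stop once no center has moved by more than tol.
        shift = max(
            abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(centers, new_centers)
        )
        centers = new_centers
        if shift < tol:
            break
    return centers
```

In the Hadoop version, each pass of this loop would correspond to one MapReduce job, with the updated centers fed into the next pass.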
