- Emmanouil Dellatollas (8200037)
- Panagiotis Alexios Spanakis (8200158)
The goal of this assignment is to implement the K-means algorithm as a MapReduce program.
The program is written in Python and executed on a Hadoop cluster, on a dataset of points in 2 dimensions.
The dataset is provided in the file data.txt and is generated by generator.py.
The dataset is a text file with one point per line.
Each line contains two numbers separated by a comma.
The first number is the x coordinate of the point and the second number is the y coordinate of the point.
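As an illustration of the format described above, a line of the dataset could be parsed as follows. This is a minimal sketch: the function name `parse_point` and the sample lines are assumptions made for the example and are not taken from the project files.

```python
def parse_point(line):
    """Parse one 'x,y' line into a tuple of floats (x, y)."""
    x_str, y_str = line.strip().split(",")
    return float(x_str), float(y_str)


# Hypothetical sample lines in the same format as data.txt.
sample_lines = ["1.5,2.0\n", "-3.25,4\n"]
points = [parse_point(line) for line in sample_lines]
print(points)
```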
The dataset (data.txt) is generated by the generator.py file. It reads the centers.txt file and generates the dataset accordingly.
(To customize the centers of the clusters, you can change the centers.txt file.)
Run it with the following command:
python generator.py
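The contents of generator.py are not shown here, so the following is only a sketch of how such a generator might work, not the actual script. It assumes that each center is an (x, y) pair and that points are scattered around each center with Gaussian noise. The names `generate_points` and `format_points` and the parameters `points_per_center`, `spread` and `seed` are hypothetical.

```python
import random


def generate_points(centers, points_per_center=100, spread=1.0, seed=0):
    """Scatter points around each center with Gaussian noise (assumed model)."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    points = []
    for cx, cy in centers:
        for _ in range(points_per_center):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return points


def format_points(points):
    """Format points as 'x,y' lines, matching the dataset format."""
    return [f"{x},{y}" for x, y in points]


# Hypothetical centers; the real ones would come from centers.txt.
centers = [(0.0, 0.0), (10.0, 10.0)]
lines = format_points(generate_points(centers, points_per_center=3))
print("\n".join(lines))
```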
The files used for this implementation of the K-means algorithm are located in the mrjob folder and are the following:
This file is used to configure the Hadoop job.
The k_means_run.py file is used to run the K-means algorithm on the Hadoop cluster. Run it with the following command:
python k_means_run.py data.txt -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar
Note: you need to provide the correct path to the Hadoop streaming JAR file in the command and change the centers inside the k_means_run.py file.
The mapper file is used to read the dataset and emit the closest center of each point.
The reducer file is used to calculate the new centers of the clusters.
The run file is used to run the full k-means algorithm on the Hadoop cluster.
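The three roles described above (mapper, reducer, run file) can be sketched in plain Python, without Hadoop, to show how one K-means iteration maps onto MapReduce. This is an illustrative sketch under stated assumptions, not the project's code: the function names, the Euclidean distance, the `max_iterations` and `tolerance` parameters and the sample data are all assumptions.

```python
import math
from collections import defaultdict


def closest_center(point, centers):
    """Return the index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def map_phase(points, centers):
    """Mapper role: emit (closest center index, point) for every point."""
    for point in points:
        yield closest_center(point, centers), point


def reduce_phase(pairs):
    """Reducer role: average the points assigned to each center index."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    return {
        index: (
            sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts),
        )
        for index, pts in groups.items()
    }


def k_means(points, centers, max_iterations=20, tolerance=1e-6):
    """Run file role: repeat map and reduce until the centers stop moving."""
    centers = list(centers)
    for _ in range(max_iterations):
        new_centers = reduce_phase(map_phase(points, centers))
        # A center with no assigned points keeps its previous position.
        updated = [new_centers.get(i, c) for i, c in enumerate(centers)]
        shift = max(math.dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift < tolerance:
            break
    return centers


# Hypothetical data: two obvious clusters and two initial guesses.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centers = k_means(points, [(1.0, 1.0), (9.0, 9.0)])
print(final_centers)
```

On a cluster, the map and reduce phases would run as separate Hadoop tasks, and the loop would resubmit the job with the updated centers on each iteration.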
You can customize the centers of the clusters by changing the centers.txt file.
./run.sh
or
./run.ps1
Attention: you need to provide the correct path to the Hadoop streaming JAR file in the run file.