The project description can be found here.
Connection Setup

Open the SSH config file:

```bash
vim ~/.ssh/config
```

- Add the SSH connection details as follows:

```
Host <host>
    HostName <host name>
    User <user>
    Port <port>

Host *
    AddKeysToAgent yes
    IdentityFile <path to private key>
```
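For example, with hypothetical values filled in (the alias, hostname, user, and key path below are placeholders; substitute your own):

```
Host node0
    HostName node0.example-cluster.org
    User alice
    Port 22

Host *
    AddKeysToAgent yes
    IdentityFile ~/.ssh/id_rsa
```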
Connect to the node and switch to the root user:

```bash
ssh <host>
sudo su
```
To list all the running Java processes:

```bash
jps
```
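With HDFS and the standalone Spark master running on this node, the output will look roughly like the following (the PIDs are illustrative, and the exact set of processes depends on the node's role):

```
2481 NameNode
2769 SecondaryNameNode
3104 Master
4120 Jps
```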
To view the current status of the HDFS cluster:

```bash
ssh -L 9870:10.10.1.1:9870 <user>@<host>
```

- On a browser, go to the URL `localhost:9870/dfshealth.html`
To view currently running Spark job(s):

```bash
ssh -L 4040:10.10.1.1:4040 <user>@<host>
```

- On a browser, go to the URL `localhost:4040`
To view the history of past job(s) on the Spark History Server:

```bash
ssh -L 18080:10.10.1.1:18080 <user>@<host>
```

- On a browser, go to the URL `localhost:18080`
To run part 2:

```bash
sudo su
spark-submit --master spark://10.10.1.1:7077 part2/sort.py
```

OR run the entire codebase (see "To run the entire codebase" below).
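For context, a minimal sketch of what a Spark sort job like part2/sort.py might contain is below; the HDFS input path, column names, and output path are hypothetical stand-ins, not the project's actual values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sort").getOrCreate()

# Hypothetical CSV input on HDFS; the path and schema are placeholders.
df = spark.read.csv("hdfs://10.10.1.1:9000/export.csv", header=True)

# Sort by two hypothetical columns and write the result back to HDFS.
(df.sort("country_code", "timestamp")
   .write.csv("hdfs://10.10.1.1:9000/export_sorted", header=True))

spark.stop()
```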
To run part 3:

```bash
sudo su
spark-submit --master spark://10.10.1.1:7077 part3/pagerank.py --iterations <num> --partitions <num> --persist <"Memory_Only"/"Memory_And_Disk"> --out_partitions <num>
```

OR run the entire codebase (see below).
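A minimal sketch of how pagerank.py's flags might be parsed and used is below. The flag names match the invocation above; the input path, edge-list format, and damping constants are assumptions for illustration only:

```python
import argparse

from pyspark import StorageLevel
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--iterations", type=int, default=10)
parser.add_argument("--partitions", type=int, default=100)
parser.add_argument("--persist", choices=["Memory_Only", "Memory_And_Disk"],
                    default="Memory_Only")
parser.add_argument("--out_partitions", type=int, default=1)
args = parser.parse_args()

level = (StorageLevel.MEMORY_ONLY if args.persist == "Memory_Only"
         else StorageLevel.MEMORY_AND_DISK)

spark = SparkSession.builder.appName("PageRank").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: one "<src>\t<dst>" edge per line on HDFS.
lines = sc.textFile("hdfs://10.10.1.1:9000/edges.txt", args.partitions)
links = (lines.map(lambda line: tuple(line.split("\t")[:2]))
              .groupByKey()
              .persist(level))

# Start every page at rank 1.0.
ranks = links.mapValues(lambda _: 1.0)

for _ in range(args.iterations):
    # Each page splits its rank evenly across its outgoing links.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    # Standard damping factor of 0.85 (an assumption here).
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

ranks.coalesce(args.out_partitions) \
     .saveAsTextFile("hdfs://10.10.1.1:9000/pagerank_output")
spark.stop()
```

An example invocation with concrete (illustrative) values:

```bash
spark-submit --master spark://10.10.1.1:7077 part3/pagerank.py --iterations 10 --partitions 100 --persist Memory_Only --out_partitions 1
```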
To run the entire codebase:

```bash
sudo su
chmod 777 run.sh
./run.sh
```
To clean up:

```bash
sudo su
chmod 777 clean_up.sh
./clean_up.sh
```
To run an individual script from the scripts directory:

```bash
sudo su
cd scripts
chmod 777 <filename>.sh
./<filename>.sh
```
Monitoring

- We use Ganglia to acquire the CPU, network I/O, and memory usage of each node in the cluster.
- For other metrics related to the job, we use Spark's History Server.