This repository contains the Docker image for Cloud Assignment 2. You can pull the image from Docker Hub with the command shown below.
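The exact image name depends on the Docker Hub account it was pushed under; using the placeholder names from the build-and-push steps later in this README:

docker pull your-dockerhub-username/cloud-assignment-2:version-tag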
Setup AWS Cluster for Training the Model

Setup of EMR on AWS
1. Go to EMR -> Create Cluster
2. Give the cluster a name
3. Select Amazon EMR release -> emr-6.15.0
4. Click Cluster Configuration -> Add an instance group to add one more instance, creating a 4-node cluster
5. Security configuration and EC2 key pair -> Add an Amazon EC2 key pair for SSH access to the cluster
6. Select Role
   - Amazon EMR service role: EMR_DefaultRole
   - EC2 instance profile for Amazon EMR: EMR_EC2_DefaultRole
7. Create Cluster

Connect to EMR Instance
Once the cluster is created, go to the security group of the cluster's EC2 instances and open port 22 to your IP address.
ssh -i "CS643-Cloud.pem" [email protected]
This guide will walk you through the steps to install Apache Spark on Ubuntu using the standard package manager, apt.
- Ubuntu operating system
- sudo privileges
Ensure that your package list is up-to-date:
sudo apt update
Apache Spark requires Java. Install OpenJDK 8 or later:
sudo apt install openjdk-8-jdk
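You can confirm the Java installation before continuing:

java -version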
Visit the Apache Spark Downloads page and copy the link to the latest pre-built version. Replace <version> in the commands below with the actual version number.
wget https://archive.apache.org/dist/spark/spark-<version>/spark-<version>-bin-hadoop2.7.tgz
Extract the downloaded archive:
tar -xvzf spark-<version>-bin-hadoop2.7.tgz
Move the extracted Spark directory to the /opt directory (you may need sudo):
sudo mv spark-<version>-bin-hadoop2.7 /opt/spark
Add Spark's binaries to the PATH and set the SPARK_HOME variable. Open your shell configuration file (e.g., ~/.bashrc or ~/.zshrc) and add the following lines:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3
Source the updated configuration:
source ~/.bashrc
Run the following command to check if Spark is installed successfully:
spark-shell
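spark-shell opens a Scala REPL and prints the Spark version on startup. Since this project's jobs are written in PySpark, you can also verify from Python; a minimal sketch, assuming the pyspark package is importable (e.g., via pip install pyspark, or with $SPARK_HOME/python on your PYTHONPATH):

```python
# verify_spark.py - start a local Spark session and print its version
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify").master("local[*]").getOrCreate()
print("Spark version:", spark.version)
spark.stop()
```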
To properly configure your environment for Apache Spark and Hadoop, add the following lines to your shell configuration file. Depending on your shell, this file may be .bash_profile, .zshrc, or another relevant file; adjust the paths below to match where Spark and Hadoop are installed on your system. Open the file in a text editor and add the following lines:
export SPARK_HOME=/usr/local/opt/apache-spark/libexec
export HADOOP_HOME=/usr/local/opt/hadoop
This guide will walk you through the steps to install the AWS Command Line Interface (AWS CLI) on Ubuntu.
- Ubuntu operating system
- sudo privileges
Ensure that your package list is up-to-date:
sudo apt update
Install the AWS CLI using the package manager:
sudo apt install awscli
After the installation is complete, you can verify it by checking the AWS CLI version:
aws --version
To use AWS CLI, you need to configure it with your AWS credentials. Run the following command and follow the prompts:
aws configure
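aws configure prompts for four values; the access keys come from your AWS account, and the region and output format shown here are only examples:

AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: <your-region>
Default output format [None]: json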
Exit from the EMR instance and, from your local machine, copy the training script to the cluster with the command below:
scp -i CS643-Cloud.pem ~/Desktop/ProgrammingAssignment2-main/training.py [email protected]:~/trainingModel
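Note that scp copies into ~/trainingModel only if that directory already exists on the instance; otherwise it creates a regular file named trainingModel. If needed, create the directory first from an SSH session:

mkdir -p ~/trainingModel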
Reconnect to the server using the SSH command.
Navigate to your project folder and create a virtual environment (replace "venv" with your preferred name):
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.2 training.py
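The --packages org.apache.hadoop:hadoop-aws:3.2.2 flag pulls in the S3A connector so Spark can read s3a:// paths. As a rough, hypothetical sketch of the kind of read this enables inside training.py (the bucket and file names are placeholders, not the assignment's actual paths):

```python
# Sketch only: reading a training dataset from S3 via the s3a connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("training").getOrCreate()

# Placeholder S3 location; credentials come from the instance role or aws configure.
df = spark.read.csv("s3a://your-bucket/TrainingDataset.csv", header=True, inferSchema=True)
df.printSchema()
```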
Setup of Prediction Application on EC2 Instance on AWS

- Go to EC2
- Click on Launch Instance
- Follow the steps in the launch wizard to configure and launch the instance
Once the instance is running, connect to it over SSH:
ssh -i your-key.pem ec2-user@your-instance-ip
Python Environment Setup

This document provides instructions on setting up the Python environment for this project.

Install Python

Download and install the latest version of Python from python.org.

Install virtualenv

If you don't have virtualenv installed, run the following command:

pip install virtualenv

Create a Virtual Environment

Navigate to your project folder and create a virtual environment (replace "venv" with your preferred name):

python -m venv venv

Activate the Virtual Environment

source venv/bin/activate

Install Project Dependencies

pip install -r requirements.txt
Execute the following command to run the prediction application:
spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.2 predict.py
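predict.py is expected to load whatever model training.py saved and score the test data. A minimal, hypothetical sketch of that pattern with Spark ML, assuming the model was saved as a fitted Pipeline (all paths and the prediction column are placeholders):

```python
# Sketch only: load a saved Spark ML pipeline and score a test dataset.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("predict").getOrCreate()

model = PipelineModel.load("s3a://your-bucket/model")  # placeholder model path
test_df = spark.read.csv("s3a://your-bucket/TestDataset.csv",
                         header=True, inferSchema=True)  # placeholder test data
model.transform(test_df).select("prediction").show()
```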
Docker Image for Cloud Assignment 2

This Dockerfile sets up an environment for Cloud Assignment 2, including Python, Java, Spark, and Hadoop.
To build the Docker image, navigate to the directory containing the Dockerfile and run:
docker build -t cloud-assignment-2 .
After building the image, you can run a Docker container with the following command:
docker run -it cloud-assignment-2
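If the code inside the container needs files from your machine (for example, a test dataset), one option is a bind mount; the host and container paths here are only examples:

docker run -it -v "$(pwd)/data:/data" cloud-assignment-2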
To build the Docker image, use the following command in the directory containing your Dockerfile:
docker build -t your-dockerhub-username/cloud-assignment-2:latest .
Log in to Docker Hub so that you can push the image:
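docker login

You will be prompted for your Docker Hub username and password (or an access token).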
docker tag your-dockerhub-username/cloud-assignment-2:latest your-dockerhub-username/cloud-assignment-2:version-tag
Replace version-tag with the desired version or tag for your Docker image.
docker push your-dockerhub-username/cloud-assignment-2:version-tag