Big Data Systems projects

A set of big data projects consisting of Java Spark Submit, Spark Streaming and Spark ML applications. All projects run in Docker containers. The data processed with the Apache Spark platform is uploaded to HDFS, which has one namenode and one datanode. Hadoop, Spark and Cassandra DB also run inside Docker containers. There is one master node and two worker nodes.

Dataset

The dataset contains data on New York City taxi cab trips. It is a subset of the full data and covers three months of 2014. The dataset can be found at the following link.

The data is stored in a CSV file with the following columns:

vendor_id: A code indicating the TPEP provider that provided the record. Values are: 1= Creative Mobile Technologies, LLC and 2= VeriFone Inc.
pickup_datetime: The date and time when the meter was engaged.
dropoff_datetime: The date and time when the meter was disengaged.
passenger_count: The number of passengers in the vehicle. This is a driver-entered value.
trip_distance: The elapsed trip distance in miles reported by the taximeter.
pickup_longitude: The longitude where the meter was engaged.
pickup_latitude: The latitude where the meter was engaged.
rate_code: The final rate code in effect at the end of the trip. Values are: 1= Standard rate, 2= JFK, 3= Newark, 4= Nassau or Westchester, 5= Negotiated fare and 6= Group ride.
store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Values are: Y= store and forward trip and N= not a store and forward trip.
dropoff_longitude: The longitude where the meter was disengaged.
dropoff_latitude: The latitude where the meter was disengaged.
payment_type: A numeric code signifying how the passenger paid for the trip. Values are: 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip.
fare_amount: The time-and-distance fare calculated by the meter.
surcharge: Miscellaneous extras and surcharges.
mta_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
tip_amount: Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.
tolls_amount: The total amount of all tolls paid in trip.
total_amount: The total amount charged to passengers. Does not include cash tips.
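
Before loading the file, it can help to confirm that it has the 18 columns listed above. The following is a minimal sketch, assuming the CSV starts with a header row and that the header contains no quoted commas. The function names `count_columns` and `has_expected_columns` are hypothetical helpers, not part of the project.

```shell
# Hypothetical helper: print the number of comma-separated fields
# in the first line of the CSV file given as $1.
count_columns() {
  head -n 1 "$1" | awk -F',' '{ print NF }'
}

# Hypothetical helper: succeed only if the first line has 18 fields,
# matching the column list above (assumes a plain, unquoted header).
has_expected_columns() {
  [ "$(count_columns "$1")" -eq 18 ]
}
```

For example, `has_expected_columns big-data/data.csv && echo "column count OK"` prints the message only when the header matches.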

Preparing the projects

Clone or download the project.
Create a directory called 'big-data' in the root directory of the project. Download the dataset from link. Move the downloaded dataset file to the newly created 'big-data' directory and rename it to 'data.csv'.
Run the following Docker commands to start the required Hadoop and Spark containers and put the data on HDFS.

docker-compose up -d
docker exec -it namenode hdfs dfs -mkdir /big-data
docker exec -it namenode hdfs dfs -put /big-data/data.csv /big-data/data.csv
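
The preparation steps can also be wrapped in a small script that fails early when the dataset is missing. This is a sketch under assumptions, not the project's own tooling: `require_dataset` and `upload_dataset` are hypothetical names, the script is assumed to run from the project root, and the `-it` flags are dropped because the script is non-interactive. The `-p` and `-f` options are added so that the commands can be re-run, and the final `-ls` lists the uploaded file.

```shell
# Hypothetical preflight check: fail if the dataset file ($1) is missing or empty.
require_dataset() {
  if [ ! -s "$1" ]; then
    echo "dataset not found or empty: $1" >&2
    return 1
  fi
}

# Hypothetical wrapper around the commands above.
# Assumes the HDFS namenode container is named 'namenode', as in the README.
upload_dataset() {
  require_dataset "big-data/data.csv" || return 1
  docker-compose up -d
  # -p: do not fail if the directory already exists
  docker exec namenode hdfs dfs -mkdir -p /big-data
  # -f: overwrite the file if it was uploaded before
  docker exec namenode hdfs dfs -put -f /big-data/data.csv /big-data/data.csv
  # List the directory to confirm the upload
  docker exec namenode hdfs dfs -ls /big-data
}
```

Calling `upload_dataset` from the project root runs all steps in order and stops before touching Docker if 'big-data/data.csv' is absent.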

Running the projects

Each project is started by running the appropriate docker-compose file. Project start commands are:

Spark submit project

docker-compose -f docker-compose-1.yaml up --build

Spark streaming project

docker-compose -f docker-compose-2.yaml up --build

Spark ML project (batch and streaming are started separately)

docker-compose -f docker-compose-3-batch.yaml up --build
docker-compose -f docker-compose-3-streaming.yaml up --build

Each project can be stopped by executing the appropriate docker command.

docker-compose -f <docker-compose-file-name> down
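
To avoid retyping file names, the mapping from project to compose file can be captured in a helper. This is only an illustrative sketch: the short project names (`submit`, `streaming`, `ml-batch`, `ml-streaming`) and the functions `compose_file_for` and `stop_project` are hypothetical, while the compose file names are the ones listed above.

```shell
# Hypothetical helper: map a short project name to its compose file.
compose_file_for() {
  case "$1" in
    submit)       echo "docker-compose-1.yaml" ;;
    streaming)    echo "docker-compose-2.yaml" ;;
    ml-batch)     echo "docker-compose-3-batch.yaml" ;;
    ml-streaming) echo "docker-compose-3-streaming.yaml" ;;
    *)            echo "unknown project: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: stop the containers of the named project.
stop_project() {
  file="$(compose_file_for "$1")" || return 1
  docker-compose -f "$file" down
}
```

For example, `stop_project streaming` would run `docker-compose -f docker-compose-2.yaml down`.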

drstojanovic / big-data-systems
