DataTalks Club DE Zoomcamp Capstone Project by Amey Kokane

This repository contains my code for the DataTalks Club DE Zoomcamp Capstone Project. The final dashboard is available here. If you have stumbled upon this repo in the future and cannot access the dashboard link, my Google Cloud free credits have expired. I have included a screenshot of the dashboard below, so please scroll down if you are interested.

Kudos to DataTalks Club Team for putting this DE Zoomcamp together!!

Objective:

Commuting in Chicago during office hours is painfully slow. Most of the time, traffic slows down because of a crash on a roadway. I have always wondered how many crashes happen in and around Chicagoland. The City of Chicago's open data portal lets users download datasets for personal analysis. The goal of this exercise is therefore to create a data pipeline that extracts data from several Chicago traffic crash datasets, transforms them into a single stable dataset, and loads it into a data warehouse on a regular cadence for use by data scientists and analysts.

Data Sources:

Traffic Crashes - Crashes: Crash data shows information about each traffic crash on city streets within the City of Chicago limits and under the jurisdiction of the Chicago Police Department (CPD). Data are shown as is from the electronic crash reporting system (E-Crash) at CPD, excluding any personally identifiable information.
Traffic Crashes - Vehicles: This dataset contains information about vehicles (or units, as they are identified in crash reports) involved in a traffic crash.
Traffic Crashes - People: This dataset contains information about people involved in a crash and whether any injuries were sustained.
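
The three datasets describe the same crashes from different angles, so they can be related through a shared crash identifier. The sketch below only illustrates that relationship; it is not the pipeline's actual transformation. It assumes the shared key is a column named CRASH_RECORD_ID (please verify against the portal's current schema), uses made-up rows instead of real data, and sticks to the Python standard library so it runs without dependencies. The function name summarize_crashes is hypothetical.

```python
from collections import Counter

# Assumed join key shared by all three datasets (verify against the portal schema).
KEY = "CRASH_RECORD_ID"


def summarize_crashes(crashes, vehicles, people):
    """Attach per-crash vehicle and people counts to each crash row."""
    vehicle_counts = Counter(row[KEY] for row in vehicles)
    people_counts = Counter(row[KEY] for row in people)
    return [
        {
            **crash,
            # Counter returns 0 for crashes with no matching rows.
            "vehicle_count": vehicle_counts[crash[KEY]],
            "people_count": people_counts[crash[KEY]],
        }
        for crash in crashes
    ]


# Made-up sample rows, shaped like csv.DictReader output.
crashes = [{KEY: "a1"}, {KEY: "b2"}]
vehicles = [{KEY: "a1"}, {KEY: "a1"}, {KEY: "b2"}]
people = [{KEY: "a1"}, {KEY: "a1"}, {KEY: "a1"}]

summary = summarize_crashes(crashes, vehicles, people)
for row in summary:
    print(row)
```

In the real pipeline the same idea would more likely be expressed as pandas or Spark SQL joins on the shared key, since those are the tools listed in the tech stack.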

Tools & Tech Stack Used:

Infrastructure as Code --> Terraform
Cloud Platform --> Google Cloud
Data Lake --> Google Cloud Storage
Data Warehouse --> Google BigQuery
Data Transformation:
  a. Pre Load --> Python Pandas Library and Python Pyarrow Library
  b. Post Load Batch Processing --> Apache Spark and Google DataProc
Workflow Orchestration --> Airflow
Containerization --> Docker and Docker Compose
Data Visualization Tool --> Google Data Studio

Data Pipeline Architecture:

Final Analytical Dashboard:

Step-by-Step Guide:

Provision Cloud Infrastructure
  a. Create a Google Cloud Platform account.
  b. Create a new project.
  c. Configure Identity and Access Management (IAM) for a service account. You will need to assign this account the BigQuery Admin, Storage Admin, Storage Object Admin, Viewer, DataProc Admin, and DataProc Service Agent privileges.
  d. Download the JSON credentials and save them to the ~/google/credentials folder.
  e. Don't forget to enable the Compute Engine API for GCP and DataProc.
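
Before moving on, it can help to confirm that the downloaded key file is where the later steps expect it and that it looks like a service account key. The following is a minimal sketch under stated assumptions: the function name check_credentials is hypothetical, the list of required fields is my assumption about what a service account key contains, and the file name google_credentials.json is a placeholder, since only the folder is specified above.

```python
import json
from pathlib import Path

# Fields assumed to be present in a service account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}


def check_credentials(path):
    """Return a list of problems found in the key file (empty if it looks fine)."""
    path = Path(path).expanduser()
    if not path.is_file():
        return [f"file not found: {path}"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [
        f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - data.keys())
    ]
    if "type" in data and data["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems


if __name__ == "__main__":
    # Placeholder file name; adjust to whatever you saved the key as.
    print(check_credentials("~/google/credentials/google_credentials.json"))
```

An empty list means the file passed these basic checks; it does not prove that the roles listed in step c were actually granted.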
Create a folder named dtc_capstone_proj and open a bash shell in this folder. Clone the repo using `git clone https://github.com/AmeyKokane/DTC_Capstone_Project`
Go to the 01_Terraform folder in your bash shell and run the code below to provision the infrastructure.

terraform init
terraform plan
terraform apply

We need to copy the Spark SQL file stored in 03_Apache_Spark to a folder in the GCS bucket. The DataProc task in our Airflow DAG uses this file. Use the code below to copy it from the local folder to the GCS bucket.

cd 03_Apache_Spark
gsutil cp spark_sql_dataproc_v2.py gs://ENTER-YOUR-GCP-BUCKET_NAME/dataproc/spark_sql_dataproc_v3.py
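
If you would rather script this copy step, the sketch below wraps the same gsutil cp call in Python. It is only an illustration: the helper names build_copy_command and upload are hypothetical, the bucket name is the same placeholder as above, and the default dry_run=True means nothing is executed until you opt in. It assumes gsutil is installed and authenticated, and that the script is run from the repository root.

```python
import subprocess


def build_copy_command(local_file, bucket, remote_name, folder="dataproc"):
    """Build the gsutil command that copies a local file into a bucket folder."""
    destination = f"gs://{bucket}/{folder}/{remote_name}"
    return ["gsutil", "cp", local_file, destination]


def upload(local_file, bucket, remote_name, cwd="03_Apache_Spark", dry_run=True):
    """Print the command, and run it from cwd only when dry_run is False."""
    command = build_copy_command(local_file, bucket, remote_name)
    print(" ".join(command))
    if not dry_run:
        # check=True raises CalledProcessError if gsutil exits with an error.
        subprocess.run(command, check=True, cwd=cwd)
    return command


# Dry run with the placeholder bucket name; replace it with your own bucket.
command = upload(
    "spark_sql_dataproc_v2.py",
    "ENTER-YOUR-GCP-BUCKET_NAME",
    "spark_sql_dataproc_v3.py",
)
```

Passing the command as a list rather than a single string avoids shell quoting issues if a bucket or file name ever contains unusual characters.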

Go to the 02_Airflow folder and build the Docker image using the Dockerfile. Then run the Docker container. In your web browser, go to localhost:8080 to access the Airflow webserver. Run the DAG and voila!
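
The containers can take a little while to start, so it may be useful to check that Airflow is up before opening the browser. The sketch below assumes the Airflow version in the Docker image exposes a /health endpoint on the webserver that returns JSON with "metadatabase" and "scheduler" entries, each carrying a "status" field; if your version differs, adjust accordingly. The function names is_healthy and fetch_health are hypothetical, and the sample payload is made up.

```python
import json
import urllib.request


def is_healthy(payload):
    """Return True when both the metadatabase and the scheduler report healthy."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def fetch_health(base_url="http://localhost:8080", timeout=5):
    """Fetch and decode the health payload from a running webserver."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)


# Made-up payload so the check can be tried without a running container.
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy"},
}
print(is_healthy(sample))
```

Once the containers are running, is_healthy(fetch_health()) performs the same check against the live webserver.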

