
dtc_capstone_project's Introduction

DataTalks Club DE Zoomcamp Capstone Project by Amey Kokane

This repository contains my code for the DataTalks Club DE Zoomcamp Capstone Project. The final dashboard is available here. If you have stumbled upon this repo in the future and cannot access the dashboard link, it means my Google Cloud free credits have expired. I have included a screenshot of the dashboard below, so please scroll down if you are interested.

Kudos to DataTalks Club Team for putting this DE Zoomcamp together!!

Objective:

Commuting in Chicago during office hours is painfully slow. Most of the time, traffic slows down because of a crash on a roadway. I have always wondered how many crashes happen in and around Chicagoland. The City of Chicago's open data portal lets users download datasets for personal analysis. The goal of this exercise is therefore to create a data pipeline that extracts data from several Chicago traffic crash datasets, transforms them into a single stable dataset, and loads that dataset into a data warehouse on a regular cadence for use by data scientists and analysts.

Data Sources:

  1. Traffic Crashes - Crashes: Crash data shows information about each traffic crash on city streets within the City of Chicago limits and under the jurisdiction of the Chicago Police Department (CPD). Data are shown as is from the electronic crash reporting system (E-Crash) at CPD, excluding any personally identifiable information.

  2. Traffic Crashes - Vehicles: This dataset contains information about vehicles (or units, as they are identified in crash reports) involved in a traffic crash.

  3. Traffic Crashes - People: This dataset contains information about people involved in a crash and whether any injuries were sustained.
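To illustrate how the three datasets can be combined into one, here is a minimal sketch in plain Python. It assumes that all three datasets share a CRASH_RECORD_ID column; the sample rows, the function name `combine_crash_data`, and the `vehicle_count` and `people_count` fields are hypothetical and chosen only for illustration. The actual transformation in this repo may differ.

```python
from collections import Counter


def combine_crash_data(crashes, vehicles, people, key="CRASH_RECORD_ID"):
    """Attach per-crash vehicle and people counts to each crash row.

    Assumption: every row in all three datasets carries the same key column.
    """
    vehicle_counts = Counter(row[key] for row in vehicles)
    people_counts = Counter(row[key] for row in people)

    combined = []
    for crash in crashes:
        row = dict(crash)  # copy so the input rows are not mutated
        # Counter returns 0 for crash IDs with no matching rows
        row["vehicle_count"] = vehicle_counts[crash[key]]
        row["people_count"] = people_counts[crash[key]]
        combined.append(row)
    return combined


# Hypothetical sample rows, not real crash data
crashes = [{"CRASH_RECORD_ID": "a1"}, {"CRASH_RECORD_ID": "b2"}]
vehicles = [
    {"CRASH_RECORD_ID": "a1"},
    {"CRASH_RECORD_ID": "a1"},
    {"CRASH_RECORD_ID": "b2"},
]
people = [{"CRASH_RECORD_ID": "a1"}]

combined = combine_crash_data(crashes, vehicles, people)
for row in combined:
    print(row)
```

Keeping one output row per crash keeps the result at the same grain as the Crashes dataset, which is one reasonable reading of a "single stable dataset".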

Tools & Tech Stack Used:

  1. Infrastructure as Code --> Terraform
  2. Cloud Platform --> Google Cloud
  3. Data Lake --> Google Cloud Storage
  4. Data Warehouse --> Google BigQuery
  5. Data Transformation:
     a. Pre Load --> Python Pandas Library and Python Pyarrow Library
     b. Post Load Batch Processing --> Apache Spark and Google DataProc
  6. Workflow Orchestration --> Airflow
  7. Containerization --> Docker and Docker Compose
  8. Data Visualization Tool --> Google Data Studio
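As a rough picture of how these tools might fit together, the sketch below models the pipeline as a small dependency graph and derives a valid execution order, which is the core idea behind a workflow orchestrator. The stage names and their ordering are my assumptions, inferred from the tool list above; they are not the task names used in the repo's Airflow DAG.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the set of stages that must finish before it can start.
# Stage names are hypothetical and only mirror the tool list above.
PIPELINE = {
    "download_csv": set(),
    "convert_to_parquet": {"download_csv"},          # pre-load: Pandas / Pyarrow
    "upload_to_gcs": {"convert_to_parquet"},         # data lake
    "load_to_bigquery": {"upload_to_gcs"},           # data warehouse
    "run_spark_on_dataproc": {"load_to_bigquery"},   # post-load batch processing
}


def execution_order(pipeline):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
print(" -> ".join(order))
```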

Data Pipeline Architecture:

[Image: data pipeline architecture diagram]

Final Analytical Dashboard:

[Image: screenshot of the final analytical dashboard]

Step-by-Step Guide:

  1. Provision Cloud Infrastructure
     a. Create a Google Cloud Platform account.
     b. Create a new project.
     c. Configure Identity and Access Management (IAM) for a service account. You will need to assign this account the BigQuery Admin, Storage Admin, Storage Object Admin, Viewer, DataProc Admin, and DataProc Service Agent privileges.
     d. Download the JSON credentials and save them to the ~/google/credentials folder.
     e. Don't forget to enable the Compute Engine API for GCP and DataProc.
  2. Create a folder named dtc_capstone_proj and open a bash shell in this folder. Clone the repo using `git clone https://github.com/AmeyKokane/DTC_Capstone_Project`
  3. In your bash shell, go to the 01_Terraform folder and run the code below to provision the infrastructure.
terraform init
terraform plan
terraform apply
  4. Save the Spark SQL file located in 03_Apache_Spark to a folder in the GCS bucket. This file will be used by the DataProc task in our Airflow DAG. Use the code below to copy this file from the local folder to the GCS bucket.
cd 03_Apache_Spark
gsutil cp spark_sql_dataproc_v2.py gs://ENTER-YOUR-GCP-BUCKET_NAME/dataproc/spark_sql_dataproc_v3.py
  5. Go to the 02_Airflow folder and build the Docker image using the Dockerfile. Then run the Docker container. In your web browser, go to localhost:8080 to access the Airflow webserver. Run the DAG and voila!
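If you prefer to script the Terraform and gsutil steps instead of typing them, the following is a minimal sketch of one way to do it. The function names `build_setup_commands` and `run_setup`, the `dry_run` flag, and the example bucket name are hypothetical; the folder names, file names, and commands are copied from the steps above. By default it only prints the commands, so nothing is provisioned unless you pass `dry_run=False`.

```python
import shlex
import subprocess
from pathlib import Path


def build_setup_commands(bucket_name, repo_root="."):
    """Return (working_dir, argv) pairs for the Terraform and gsutil steps."""
    root = Path(repo_root)
    terraform_dir = root / "01_Terraform"
    spark_dir = root / "03_Apache_Spark"
    # Destination object name is taken from the guide as written.
    destination = f"gs://{bucket_name}/dataproc/spark_sql_dataproc_v3.py"
    return [
        (terraform_dir, ["terraform", "init"]),
        (terraform_dir, ["terraform", "plan"]),
        (terraform_dir, ["terraform", "apply"]),
        (spark_dir, ["gsutil", "cp", "spark_sql_dataproc_v2.py", destination]),
    ]


def run_setup(bucket_name, repo_root=".", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for workdir, argv in build_setup_commands(bucket_name, repo_root):
        print(f"[{workdir}] {shlex.join(argv)}")
        if not dry_run:
            # check=True stops at the first failing command
            subprocess.run(argv, cwd=workdir, check=True)


# Dry run with a placeholder bucket name; replace it with your own bucket.
run_setup("my-example-bucket")
```

Passing each command as an argument list, rather than one shell string, avoids quoting problems if the bucket name or paths contain unusual characters.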

Future Development Roadmap:
