data-engineering-zoomcamp-2024's Introduction

Problem (imagined, of course)

As part of the City of Helsinki's effort to encourage sustainable transportation and reduce dependency on cars, the city is promoting bike-share solutions for last-mile trips. To monitor the adoption and success of these solutions in Helsinki, the bureau of transportation needs a dashboard that summarizes bike-share usage trends, based on trip data for the years 2016-2020, before the COVID-19 lockdown.

To give a better understanding of adoption, the dashboard must include:

  • A graph showing the total number of trips per year.
  • A map showing the popularity of each station: how many trips were made from each station.
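
As a rough illustration of the two aggregates the dashboard needs, here is a minimal plain-Python sketch. The sample rows and the `departure` timestamp field and its format are assumptions made for illustration; only `departure_id` is named in this README, and the real pipeline computes these figures in the dashboard layer, not with this code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; field names and values are assumptions.
trips = [
    {"departure": "2016-05-01 08:15:00", "departure_id": "004"},
    {"departure": "2016-07-12 17:40:00", "departure_id": "004"},
    {"departure": "2017-06-03 09:05:00", "departure_id": "021"},
]


def trips_per_year(rows):
    """Count trips by the calendar year of the departure timestamp."""
    return Counter(
        datetime.strptime(r["departure"], "%Y-%m-%d %H:%M:%S").year for r in rows
    )


def trips_per_station(rows):
    """Count trips by departure station id (the map's popularity metric)."""
    return Counter(r["departure_id"] for r in rows)


yearly = trips_per_year(trips)
by_station = trips_per_station(trips)
print(dict(yearly))
print(dict(by_station))
```

The first counter feeds the year-to-year graph and the second feeds the station map.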

Results

Dashboard

  • The pipeline takes the raw data from a CSV in a tar.gz file.
  • Using Spark, it reads and cleans the data, filling in missing departure_id values based on station coordinates.
  • It saves the cleaned data as Parquet files to GCS, partitioned by year.
  • It also saves the data to a BigQuery table.
    • The table is partitioned by year to reduce the cost of per-year queries.
    • The table is clustered on departure_id to facilitate the map dashboard, which counts trips by departure station.
  • Looker Studio is used to create the dashboard, as shown in the image.
  • You can view the dashboard from here.
    • Unfortunately, I couldn't find an easy way to automate the dashboard creation.
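
The cleaning and partitioning steps above can be sketched in plain Python. This is not the Spark code from the pipeline script, which is not reproduced here; it only illustrates one plausible way to recover a missing departure_id from coordinates and to group rows by year. The field names `departure`, `departure_latitude` and `departure_longitude`, the rounding precision, and the sample rows are all assumptions.

```python
from collections import defaultdict


def coord_key(row, precision=5):
    """Rounded (latitude, longitude) pair used to identify a station."""
    return (
        round(row["departure_latitude"], precision),
        round(row["departure_longitude"], precision),
    )


def fill_missing_ids(rows, precision=5):
    """Fill missing departure_id from rows that share the same coordinates."""
    lookup = {}
    for r in rows:
        if r["departure_id"] is not None:
            lookup.setdefault(coord_key(r, precision), r["departure_id"])
    cleaned = []
    for r in rows:
        if r["departure_id"] is None:
            # Stays None when no row with a known id has these coordinates.
            r = {**r, "departure_id": lookup.get(coord_key(r, precision))}
        cleaned.append(r)
    return cleaned


def partition_by_year(rows):
    """Group rows by departure year, mirroring the year partitioning."""
    parts = defaultdict(list)
    for r in rows:
        parts[int(r["departure"][:4])].append(r)
    return dict(parts)


# Hypothetical sample rows.
rows = [
    {"departure": "2016-05-01 08:15:00", "departure_id": "004",
     "departure_latitude": 60.16986, "departure_longitude": 24.93833},
    {"departure": "2017-06-03 09:05:00", "departure_id": None,
     "departure_latitude": 60.16986, "departure_longitude": 24.93833},
    {"departure": "2017-06-04 10:00:00", "departure_id": None,
     "departure_latitude": 60.2, "departure_longitude": 25.0},
]

cleaned = fill_missing_ids(rows)
partitions = partition_by_year(cleaned)
print([r["departure_id"] for r in cleaned])
print({year: len(part) for year, part in partitions.items()})
```

Rows whose coordinates match no known station keep a missing id in this sketch; how the actual script handles them is not stated here.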

Requirements

  • Google Cloud account
  • gcloud CLI
  • Terraform

Steps

  • The Helsinki City Bikes data set from Kaggle is available at this URL. The pipeline will load it automatically.

  • In the Google Cloud console, create a new project.

  • Initialize your shell, the gcloud tool, and Terraform:

# Set the project ID and region in the shell
export GCS_PROJECT=<your-gcp-project-id>;
export GCS_REGION=<your-gcp-project-region>;

# Log in to Google Cloud using the gcloud tool
gcloud auth login;

# Set the current project to the one created
gcloud config set project "$GCS_PROJECT";

# Set the project and location for Terraform in the current shell
export TF_VAR_project="$GCS_PROJECT";
export TF_VAR_location="$GCS_REGION";
  • Initialize the Terraform working directory:
terraform init
  • Review the planned infrastructure changes. Note that Terraform commands must run in the same shell you used to log in to Google Cloud and export the TF_VAR_ variables, so that they have the credentials and settings needed to access the cloud.
terraform plan
  • Create the new infrastructure:
terraform apply
  • Run the pipeline script on Dataproc:
# Copy the pipeline script to GCS
export SPARK_JOB="gs://$GCS_PROJECT-datalake/process_bikes-trips.py";
gsutil cp ./process_bikes-trips.py "$SPARK_JOB";

# Submit the script as a PySpark job, with the BigQuery connector jars
gcloud dataproc jobs submit pyspark "$SPARK_JOB" \
  --cluster="$GCS_PROJECT-dataproc" \
  --region="$GCS_REGION" \
  --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar,gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.30.0.jar;
  • When done, take down the infrastructure.
terraform destroy
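
The job submission step derives both the bucket name and the cluster name from the project ID (in the shell, `$GCS_PROJECT-datalake` expands to the project ID followed by `-datalake`, because `-` cannot be part of a variable name). A small Python sketch of that naming convention follows. The function `build_submit_command` and the values `my-project` and `europe-north1` are hypothetical placeholders; the sketch only assembles and prints the command and does not run it.

```python
import shlex

# Jar locations copied from the submit command in the steps above.
JARS = [
    "gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar",
    "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.30.0.jar",
]


def build_submit_command(project, region, script="process_bikes-trips.py"):
    """Assemble the gcloud arguments used to submit the PySpark job."""
    spark_job = f"gs://{project}-datalake/{script}"
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", spark_job,
        f"--cluster={project}-dataproc",
        f"--region={region}",
        "--jars=" + ",".join(JARS),
    ]


# Placeholder project ID and region, for illustration only.
command = build_submit_command("my-project", "europe-north1")
print(shlex.join(command))
```

Printing the command first is a cheap way to check that the bucket and cluster names match what Terraform created before submitting anything.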

Resources
