formula1 project

A data engineering project using Azure Data Factory, SQL, Azure Data Lake, Databricks, HDInsight and CI/CD, built on Formula 1 racing datasets


Goals (learn / gain / acquire)

  • How to create notebooks, dashboards, clusters, cluster pools and jobs in Azure Databricks
  • How to transform and analyze data using Spark SQL in Azure Databricks
  • How to create Azure Data Factory pipelines to execute Databricks notebooks
  • The skills around Azure Databricks and Data Factory required to pass the Azure Data Engineer Associate certification exam (DP-203)
  • A comprehensive understanding of Unity Catalog and the data governance capabilities it offers
  • Professional-level data engineering skills in Azure Databricks, Delta Lake, Spark Core, Azure Data Lake Gen2 and Azure Data Factory (ADF)
  • How to ingest and transform data using PySpark in Azure Databricks
  • Data Lake and Lakehouse architectures
  • How to implement a Lakehouse architecture using Delta Lake
  • How to create Azure Data Factory triggers to schedule pipelines, as well as monitor them
  • How to connect to Azure Databricks from Power BI to create reports
  • How to implement a data governance solution using a Unity Catalog-enabled Databricks workspace

Description

  • A Formula 1 season takes place once a year, over roughly 20 races
  • Each race happens over a weekend
  • There are various race circuits
  • Roughly ten teams participate in each season; in Formula 1 they are also called constructors
  • Each team has two drivers, and each driver is assigned a specific car
  • Each team also has reserve drivers, who participate only when a main driver is unavailable
  • A race weekend spans from Friday to Sunday
  • There are two practice sessions on Friday and a final practice session on Saturday morning; these don't count towards any points
  • A qualifying session takes place on Saturday afternoon, run over three stages
  • The qualifying results decide each driver's grid position: the higher a driver qualifies, the further forward he starts the race, which is a massive advantage
  • Unlike qualifying, which is decided over a single lap, the race runs over multiple laps, roughly between 50 and 70 depending on the length of the circuit
  • During the race, drivers make pit stops to change tires or to replace damaged parts
  • The race results decide the drivers' and constructors' standings: the driver at the top of the standings at the end of the season is the Drivers' Champion, and the team leading the constructors' standings is the Constructors' Champion

Formula1 data source & datasets

  • Ergast is a third-party developer API that makes data available for all Formula 1 races from 1950 onwards
  • The Ergast website hosts the terms and conditions of use and the API documentation; for those who don't want to use the API, it also offers a manual query interface and downloads of the data in CSV format
  • The API may be used for personal, non-commercial applications and services, including educational and research purposes, so it is suitable for this project
  • main page > Database Images > CSV Database Tables
    • f1db_csv.zip: a zip file containing all the race data from 1950 onwards, in CSV format (see the download sketch after this list)
  • For this project, the files were converted into different formats to practice handling different file types
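
As a quick, hedged illustration, the full CSV dump can be fetched and unpacked with a few lines of Python; the download URL is an assumption based on the path published on the Ergast site and may change:

```python
import io
import zipfile

import requests

# Full CSV database dump as published on the Ergast site
# (assumption: this path was current at the time of writing and may change).
ERGAST_CSV_URL = "http://ergast.com/downloads/f1db_csv.zip"

response = requests.get(ERGAST_CSV_URL, timeout=60)
response.raise_for_status()

# Unpack every CSV table into a local folder for inspection.
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    archive.extractall("f1db_csv")
    print(archive.namelist())
```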

Data Files

Entity Relationship Diagram


Prepare the data for the project

Project Requirements

Data Ingestion Requirements

  • Ingest all eight files into the data lake
  • Ingested data must have the schema applied
  • Ingested data must have audit columns
  • Ingested data must be stored in columnar format (i.e., Parquet)
  • Must be able to analyze the ingested data via SQL
  • Ingestion logic must be able to handle incremental loads (see the sketch after this list)
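
A minimal PySpark sketch of what these requirements imply, using the circuits file as an example. The schema fields, mount path and database name (f1_processed) are assumptions, not the project's exact code; `spark` is the SparkSession that Databricks notebooks provide:

```python
from pyspark.sql.functions import current_timestamp, lit
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructField, StructType)

# Explicit schema instead of inferSchema, per the requirements
# (field names follow the Ergast circuits table and are assumptions).
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
    StructField("alt", IntegerType(), True),
    StructField("url", StringType(), True),
])

circuits_df = (spark.read
               .option("header", True)
               .schema(circuits_schema)
               .csv("/mnt/formula1dl/raw/circuits.csv"))  # mount path assumed

# Audit columns required on every ingested table.
audited_df = (circuits_df
              .withColumn("ingestion_date", current_timestamp())
              .withColumn("data_source", lit("Ergast")))

# Columnar storage (Parquet); saving as a table makes it queryable via SQL.
# The f1_processed database is assumed to exist already; incremental files
# would instead be appended or written with a partition overwrite.
audited_df.write.mode("overwrite").format("parquet").saveAsTable("f1_processed.circuits")
```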

Data Transformation Requirements

  • Join the key information required for reporting to create a new table
  • Join the key information required for analysis to create a new table
  • Transformed tables must have audit columns
  • Must be able to analyze the transformed data via SQL
  • Transformed data must be stored in columnar format (i.e., Parquet)
  • Transformation logic must be able to handle incremental loads (see the sketch after this list)
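
A hedged sketch of the reporting join; the table and column names (f1_processed.*, f1_presentation.race_results) are assumptions about the layer layout rather than the project's exact code:

```python
from pyspark.sql.functions import current_timestamp

# Read the ingested tables (names are assumptions).
races_df = spark.read.table("f1_processed.races")
results_df = spark.read.table("f1_processed.results")
drivers_df = spark.read.table("f1_processed.drivers")
constructors_df = spark.read.table("f1_processed.constructors")

# Join the key information required for reporting into one table.
race_results_df = (results_df
    .join(races_df, results_df.raceId == races_df.raceId)
    .join(drivers_df, results_df.driverId == drivers_df.driverId)
    .join(constructors_df, results_df.constructorId == constructors_df.constructorId)
    .select(races_df.year.alias("race_year"),
            races_df.name.alias("race_name"),
            drivers_df.name.alias("driver_name"),
            constructors_df.name.alias("team"),
            results_df.position,
            results_df.points)
    .withColumn("created_date", current_timestamp()))  # audit column

# Columnar output in the presentation layer, queryable via SQL.
race_results_df.write.mode("overwrite").format("parquet").saveAsTable("f1_presentation.race_results")
```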

Reporting Requirements

  • Driver Standings
  • Constructor Standings
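
Both standings can be derived from the hypothetical race_results table sketched above by totalling points and wins per season; constructor standings would simply group by team instead of driver:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

race_results_df = spark.read.table("f1_presentation.race_results")

# Total points and wins per driver per season (a win is position == 1).
driver_standings_df = (race_results_df
    .groupBy("race_year", "driver_name")
    .agg(F.sum("points").alias("total_points"),
         F.count(F.when(F.col("position") == 1, True)).alias("wins")))

# Rank drivers within each season by points, breaking ties on wins.
driver_rank = Window.partitionBy("race_year").orderBy(
    F.col("total_points").desc(), F.col("wins").desc())
final_df = driver_standings_df.withColumn("rank", F.rank().over(driver_rank))

final_df.write.mode("overwrite").format("parquet").saveAsTable("f1_presentation.driver_standings")
```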

Analysis Requirements

  • Dominant Drivers
  • Dominant Teams
  • Visualize the outputs
  • Create Databricks Dashboards
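
The analysis itself is plain Spark SQL over the presentation layer. A hedged example of a "dominant drivers" query, with the table and column names following the sketches above; the resulting DataFrame can be charted in the notebook and pinned to a Databricks dashboard:

```python
# Dominant drivers by average points per race; the 50-race floor filters out
# drivers with too few starts for the average to mean much (threshold illustrative).
dominant_drivers_df = spark.sql("""
    SELECT driver_name,
           COUNT(*)    AS total_races,
           SUM(points) AS total_points,
           AVG(points) AS avg_points
      FROM f1_presentation.race_results
     GROUP BY driver_name
    HAVING COUNT(*) >= 50
     ORDER BY avg_points DESC
""")

# display() renders a sortable table in Databricks notebooks and feeds the
# built-in chart and dashboard options.
display(dominant_drivers_df)
```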

Scheduling Requirements

  • Pipelines to run every Sunday at 10 PM
  • Ability to monitor pipelines
  • Ability to re-run failed pipelines
  • Ability to set up alerts on failures

Other Non-Functional Requirements

  • Ability to delete individual records
  • Ability to see history and time travel
  • Ability to roll back to a previous version
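
These three requirements are what point the project at Delta Lake: plain Parquet supports none of them, while Delta tables support all three out of the box. A minimal sketch, assuming the ingested tables are stored in Delta format (table name and predicate are illustrative):

```python
from delta.tables import DeltaTable

# Point-delete an individual record.
results_table = DeltaTable.forName(spark, "f1_processed.results")
results_table.delete("resultId = 12345")  # predicate value is illustrative

# See the full change history of the table.
spark.sql("DESCRIBE HISTORY f1_processed.results").show(truncate=False)

# Time travel: query the table as it was at an earlier version.
previous_df = spark.sql("SELECT * FROM f1_processed.results VERSION AS OF 0")

# Roll back the table to that version.
spark.sql("RESTORE TABLE f1_processed.results TO VERSION AS OF 0")
```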

Solution Architecture

Azure Databricks Modern Analytics Architecture

The solution used in this project is based on the "Modern analytics architecture with Azure Databricks" from the Azure Architecture Center …

Databricks Architecture

… and from Databricks Architecture Solutions.

Solution Architecture

  • Ergast API: the Ergast API serves as the primary data source for Formula 1 racing data
  • ADLS raw layer: data from the Ergast API lands in the Azure Data Lake Storage (ADLS) raw layer, which acts as the initial storage layer for the raw data
  • Ingest (Databricks Spark): Databricks Spark ingests the data from the ADLS raw layer, performing the necessary transformations and cleaning operations
  • ADLS ingested layer: the ingested data is stored in the ADLS ingested layer, an intermediate storage layer for further processing
  • Transform (Databricks Spark): Databricks Spark performs additional transformations and aggregations on the data in the ADLS ingested layer
  • ADLS presentation layer: the transformed and processed data is stored in the ADLS presentation layer, in a structured and optimized format for analysis
  • Analysis (Databricks): the data in the ADLS presentation layer is analyzed in Databricks
  • Power BI: the analyzed data from the ADLS presentation layer is visualized in Power BI for intuitive, interactive data exploration

Stack

Project Structure

  • data_________: contains sample raw data from the Ergast API
  • set-up_______: notebooks to mount the ADLS storage containers (raw, ingested, presentation) in Databricks; a hedged mount sketch follows this list
  • raw__________: contains a SQL file to create the raw tables using Spark SQL
  • ingestion____: contains notebooks to ingest all the data files from the raw layer to the ingested layer; handles incremental data for the results, pit stops, lap times and qualifying files
  • transform____: contains notebooks to transform the data from the ingested layer to the presentation layer, performing the transformations needed to set up the analysis
  • analysis_____: contains SQL files for finding the dominant drivers and teams and for preparing the results for visualization
  • includes_____: notebooks containing helper functions used in the transformations
  • utils________: contains a SQL file to drop all databases when re-running an incremental load
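
The set-up notebooks boil down to mounting each container with dbutils.fs.mount. A hedged sketch using a service principal; the secret scope, key names and storage account name (formula1dl) are assumptions:

```python
# Credentials pulled from a Databricks secret scope (scope/key names assumed).
client_id = dbutils.secrets.get(scope="formula1-scope", key="client-id")
tenant_id = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

# OAuth configuration for ADLS Gen2 access via a service principal.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the raw container; repeat for ingested and presentation.
dbutils.fs.mount(
    source="abfss://raw@formula1dl.dfs.core.windows.net/",  # account name assumed
    mount_point="/mnt/formula1dl/raw",
    extra_configs=configs,
)
```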
