Data Engineer project using Azure Data Factory, SQL, Data Lake, Databricks, HDInsight, CI/CD based on Formula1 Racing datasets
- How to create notebooks, dashboards, clusters, cluster pools and jobs in Azure Databricks
- How to transform and analyse data using Spark SQL in Azure Databricks
- How to create Azure Data Factory pipelines to execute Databricks notebooks
- The skills required around Azure Databricks and Data Factory to pass the Azure Data Engineer Associate certification exam DP203
- A comprehensive understanding about Unity Catalog and the data governance capabilities offered by Unity Catalog
- Acquire professional level data engineering skills in Azure Databricks, Delta Lake, Spark Core, Azure Data Lake Gen2 and Azure Data Factory (ADF)
- How to ingest and transform data using PySpark in Azure Databricks
- About Data Lake architecture and Lakehouse Architecture.
- How to implement a Lakehouse architecture using Delta Lake
- How to create Azure Data Factory triggers to schedule pipelines as well as monitor them
- How to connect to Azure Databricks from PowerBI to create reports
- To implement a data governance solution using Unity Catalog enabled Databricks workspace
- Formula1 season happens once a year, roughly over 20 races
- Each race happens over a weekend
- There are various race circuits
- There are roughly ten teams that participate in each season and they are also called constructors in Formula1
- Each team has two drivers and each driver is assigned a specific car
- There are also reserve drivers in each team, but only to participate in races
- Raceway can spans from Friday to Sunday.
- There are two practice sessions on a Friday and a final practice session on Saturday morning. They don't count towards any points or achievements
- There is a qualifying session on Saturday afternoon and it happens over three different stages.
- The qualifying results decide the grid position of the driver as to where he'll start the race.
- The qualifying results decide the grid position of the driver as to where he'll start the race. The higher the driver qualifies the forward, he gets to start the race, which is a massive advantage
- Unlike qualifying sessions, which are decided over a single lap, races span multiple laps, roughly between 50 and 70 laps, depending on the length of the circuit
- Also during the race, drivers make pit stops to change the tires or to replace the damaged car
- Based on the race results. Drivers and constructors standings are decided. Whichever driver is on the top of the standings at the end of the season is the drivers champion. And similarly, the team that's leading the constructors standings becomes the Constructors champion
- There is the third party developer API called Ergast that makes the data available for all races from 1950 onwards.
- So let's visit the website and explore it a bit further. In order to get to the documentation, visit the Ergast website. And from here you can look at things like terms and conditions of use and the API documentation itself. And also if we didn't want to use the API, we can look at the manual interface or even download the data in a CSV format.
- You can use this API for personal non-commercial applications and services, including educational and research purposes. So we are okay to use this data from this API or from this website for our educational purposes.
- main page > Database Images> CSV Database Tables
- f1db_csv.zip : the zip file contains all the data in a CSV format. So that includes all the race data starting from 1950 onwards up to now.
- Files converted in different formats to deal with different file types
- The structure of the database is shown in an Entity Relationship Diagram and explained in the Database User Guide
- Ingest All 8 files into the data lake
- Ingested data must have the schema applied
- Ingested data must have audit columns
- Ingested data must be stored in columnar format (i.e., Parquet)
- Must be able to analyze the ingested data via SQL
- Ingestion logic must be able to handle incremental load
- Join the key information required for reporting to create a new table
- Join the key information required for Analysis to create a new table
- Transformed tables must have audit columns
- Must be able to analyze the transformed data via SQL
- Transformed data must be stored in columnar format (i.e., Parquet)
- Transformation logic must be able to handle incremental load
- Driver Standings
- Constructor Standings
- Dominant Drivers
- Dominant Teams
- Visualize the outputs
- Create Databricks Dashboards
- Pipeline to run every Sunday 10 PM
- Ability to monitor pipelines
- Ability to re-run failed pipelines
- Ability to set-up alerts on failures
- Ability to delete individual records
- Ability to see history and time travel
- Ability to roll back to a previous version
The solution used in this project is based on the Modern analytics architecture with Azure Databricks from the Azure Architecture Center ...
.. and from Databricks Architecture Solutions
- Ergast API : The Ergast API serves as the primary data source for Formula 1 racing data
- ADLS RAW layer : The data from the Ergast API is ingested into the Azure Data Lake Storage (ADLS) RAW layer, which acts as the initial storage layer for the raw data
- Ingest (Databricks Spark) : Using Databricks Spark, the data is ingested from the ADLS RAW layer, performing necessary data transformations and cleaning operations
- ADLS ingest layer : The transformed data is stored in the ADLS ingest layer, serving as an intermediate storage layer for further processing
- Transform (Databricks Spark) : Databricks Spark is used again to perform additional data transformations and aggregations on the data stored in the ADLS ingest layer
- ADLS presentation layer : The transformed and processed data is stored in the ADLS presentation layer, providing a structured and optimized format for analysis
- Analysis (Databricks) : Leveraging the power of Databricks, the data in the ADLS presentation layer is analyzed using various analytical techniques and algorithms
- ADLS presentation layer to Power BI : The analyzed data from the ADLS presentation layer is visualized and presented using Power BI for intuitive and interactive data exploration
- Python
- PySpark
- Azure Databricks
- Azure Data Factory (ADF)
- Azure Data Lake Storage Gen2 (ADLS)
- Delta Lake
- data_________: contains sample raw data from Ergast API.
- set-up_______: notebooks to mount ADLS storages (raw, ingested, presentaton) in Databricks
- raw__________: contains SQL file to create ingested tables using Spark SQL
- ingestion____: contains notebooks to ingest all the data files from raw layer to ingested layer. Handles the incremental data for files results, pitstopes, laptimes and qualifying.
- transform____: contains notebooks to transform the data from ingested layer to presentation layer. Notebook performs transformations to setup for analysis
- analysis_____: contains SQL files for finding the dominant drivers and teams and to prepare the results for visualization
- includes_____: includes notebooks containing helper functions used in transformations
- utils________: contains SQL file to drop all databases for incremental load