
Premier League Data Pipeline

Warning

After a year and some change of building this project, it's time for me to archive it. I now use these tools in my current position, so learning them on my own and spending my own money on the Football API and Google Cloud services no longer makes sense. I'm switching my focus to learning Golang!

Overview

This repository contains a personal project designed to enhance my skills in Data Engineering. It focuses on developing data pipelines that extract, transform, and load data from various sources into diverse databases. Additionally, it involves creating a dashboard with visualizations using Streamlit.

Important

Many architectural choices and decisions in this project are intentionally not the most efficient; they were made for the sake of practicing and learning.

Infrastructure

Tools & Services

cloud streamlit terraform docker prefect dbt

Databases

firestore postgres bigquery

Code Quality

pre-commit

Security Linter: bandit
Code Formatting: ruff-format
Type Checking: mypy
Code Linting: ruff
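
The four tools above can be wired together through pre-commit with a config along these lines (an illustrative sketch — the `rev` pins are placeholders, not the project's actual versions):

```yaml
# .pre-commit-config.yaml (illustrative; revs are placeholders)
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.0
    hooks:
      - id: ruff          # code linting
      - id: ruff-format   # code formatting
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.0
    hooks:
      - id: mypy          # type checking
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.5
    hooks:
      - id: bandit        # security linting
```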

Data and CI/CD Pipelines

Data Pipelines

Data Pipeline 1

Orchestrated with Prefect, a Python file is run to extract stock data for Manchester United.

  1. Data from the Financial Modeling Prep API is extracted with Python using the /quote endpoint.
  2. The data is loaded directly into a PostgreSQL database hosted on Cloud SQL with no transformations.
  3. Once the data is loaded into PostgreSQL, Datastream replicates the data into BigQuery. Datastream checks for staleness every 15 minutes.
  4. dbt is used to transform the data in BigQuery and create a view with transformed data.
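
Steps 1–2 can be sketched roughly as follows (a minimal illustration, not the project's actual code — the fields pulled from the /quote payload are assumptions, and loading into Cloud SQL would be a plain INSERT with no transformations):

```python
import json
import urllib.request

# Hypothetical extract step for Pipeline 1; MANU is Manchester United's NYSE ticker.
QUOTE_URL = "https://financialmodelingprep.com/api/v3/quote/MANU?apikey={key}"

def parse_quote(payload: list[dict]) -> tuple:
    """Flatten the first /quote record into a row ready for an INSERT."""
    quote = payload[0]
    return (quote["symbol"], quote["price"], quote["volume"], quote["timestamp"])

def extract_quote(api_key: str) -> tuple:
    """Call the /quote endpoint (step 1) and return one untransformed row (step 2)."""
    with urllib.request.urlopen(QUOTE_URL.format(key=api_key), timeout=10) as resp:
        return parse_quote(json.load(resp))
```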

Data Pipeline 2

Orchestrated with Prefect, Python files are run that perform a full ETL process.

  1. Data is extracted from multiple API sources:
    • Data from the Football Data API is extracted to retrieve information on the current standings, team statistics, top scorers, squads, fixtures, and the current round. The following endpoints are used:
      • /standings
      • /teams
      • /top_scorers
      • /squads
      • /fixtures/current_round
      • /fixtures
    • Data from the NewsAPI is extracted to retrieve news article links with filters set to the Premier League from Sky Sports, The Guardian, and 90min. The following endpoints are used:
      • /everything
    • Data from a self-built API written in Golang is extracted to retrieve information on teams' stadiums. The following endpoints are used:
      • /stadiums
    • Data from the YouTube API is extracted to retrieve the latest highlights from NBC Sports YouTube channel.
  2. Python performs any necessary transformations, such as converting data types or checking for NULL values.
  3. The majority of the data is then loaded into BigQuery in their respective tables. Fixture data is loaded into Firestore as documents categorized by round number.
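
The transformations in step 2 can be sketched like this (illustrative only — the column names and cleaning rules are assumptions, not the project's real schema):

```python
def transform_standings(rows: list[dict]) -> list[dict]:
    """Convert data types and drop records containing NULL values (step 2)."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # skip incomplete records rather than loading NULLs
        cleaned.append({
            "team": str(row["team"]),
            "rank": int(row["rank"]),      # assumes the API returns numbers as strings
            "points": int(row["points"]),
        })
    return cleaned
```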

Data Pipeline 3

  1. Daily exports of the standings and top scorers data in BigQuery are sent to a Cloud Storage bucket using Cloud Scheduler, to be used in another project.
    • The other project is a [CLI](https://github.com/digitalghost-dev/pl-cli/) tool written in Golang.
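
The scheduled export can be pictured as a `bq extract` job along these lines (a sketch — the dataset, table, and bucket names are made up, not the project's real ones):

```python
def export_command(dataset: str, table: str, bucket: str, stamp: str) -> str:
    """Build the bq CLI call a scheduled export job might run (hypothetical names)."""
    return (
        f"bq extract --destination_format CSV "
        f"{dataset}.{table} gs://{bucket}/{table}_{stamp}.csv"
    )

# e.g. export_command("premier_league", "standings", "pl-exports", "2024-01-01")
```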

Pipeline Diagram

data-pipeline-flowchart

CI/CD Pipeline

The CI/CD pipeline is focused on building the Streamlit app into a Docker container that is then pushed to Artifact Registry and deployed to Cloud Run as a Service. Different architectures are built for different machine types and pushed to Docker Hub.

  1. The repository code is checked out and a Docker image containing the updated streamlit_app.py file is built.
  2. The newly built Docker image will be pushed to Artifact Registry.
  3. The Docker image is then deployed to Cloud Run as a Service.

Pipeline Diagram

cicd_pipeline


Security

  • Syft and Grype work together to scan the Streamlit Docker image. Syft creates an SBOM and Grype scans the SBOM for vulnerabilities. The results are sent to the repository's Security tab.
  • Snyk is also used to scan the repository for vulnerabilities in the Python packages.


Issues

[2.14.0] - Team squads

Add a new tab called Squads that displays the current squad for each team in the league.

[2.11.4] - Remove dependence on secrets.toml file

Remove the need for a .streamlit/secrets.toml file for the dashboard.
Instead, use the following authentication methods:

  • Local Development: Authenticate with import google.auth and set credentials with:
credentials, project = google.auth.default()
  • Local Docker Build: Authenticate with mounted service account keys.
  • Cloud Run: Authentication should "just work". This has been verified with a test Streamlit Docker image deployed to Cloud Run without mounting a .streamlit/secrets.toml file when deploying.
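
One way to make all three environments converge on google.auth.default() is to branch on ambient markers, sketched below (the environment-variable checks are an assumption about how the app might decide, not the project's actual code; K_SERVICE is set automatically on Cloud Run):

```python
def auth_strategy(environ: dict[str, str]) -> str:
    """Decide which of the three credential paths above applies."""
    if "K_SERVICE" in environ:
        return "cloud-run"     # ambient service identity; google.auth.default() just works
    if "GOOGLE_APPLICATION_CREDENTIALS" in environ:
        return "mounted-key"   # local Docker build with a mounted service account key
    return "adc"               # local dev: gcloud auth application-default login

# In every case the app itself can simply call:
# credentials, project = google.auth.default()
```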

[2.10.0] - News Tab

Add a new tab called News that displays the latest news from the Premier League.

[2.9.0] - Tab Names

Change tab names from Standings to Standings & Overview and Statistics to Top Teams & Scorers.

[2.11.4] - Fix ser.iloc for pandas

With pandas 2.1.3, retrieving an item from a dataframe with chained indexing in the following manner is deprecated:

team_goals = teams_df_average_goals.iloc[0][6]

Change all occurrences to follow the new method:

team_goals = teams_df_average_goals.iloc[0, 6]

[2.11.1] - Remove toast() function

The toast() function was added in v2.9.0 to let the app load all the data before allowing the user to navigate the dashboard. Setting a timer of 3 seconds allowed the dashboard to fully load. With a more optimized fixtures import, this is no longer necessary.

[2.10.1] - Fix IndexError

Fix IndexError from news table when there aren't enough rows in the table by using a try/except block.
