nlarki / fantasy-league-pipeline
Creation of a Fantasy Premier League data pipeline for analysis of both team & player performance. Technologies include dbt, Prefect, Terraform & Docker.

License: MIT License

Fantasy Premier League data ingestion and analysis ⚽

Overview

The core premise of this project is to showcase what I have learned whilst partaking in the Data Talks Club Data Engineering course. I will be utilising multiple tools to create an effective pipeline that ingests the sourced FPL data, transforms it, and surfaces it in a finalised visual dashboard, which you can view here!


What is Fantasy Premier League?

Fantasy Premier League is an online game that casts you in the role of a fantasy manager of Premier League players. You must pick a squad of 15 players from the current Premier League season, who score points for your team based on their performances for their clubs in PL matches.

Problem description


The project aims to extract multiple years of FPL data for analysis so that we can take a deeper look into the individual stats of players and teams across the 2016 to 2023 seasons. A small sketch of what fetching this data can look like follows the list of insights below.

Key insights to be extracted:

  • Who are the most in-form goal scorers?
  • Who are the most in-form assist providers?
  • Which players influence their teams the most?
  • Which players have the highest points totals?
  • Who are the most expensive players?
  • How many goals are scored per season?
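
For illustration, here is a minimal sketch of what fetching FPL player data can look like, assuming the public bootstrap-static API endpoint (whether this repo pulls from the live API or from an archived historical dataset is not confirmed here):

# Sketch only: fetch current-season player data from the public FPL API.
# The repo's 2016-2023 historical data may come from an archived dataset instead.
import requests
import pandas as pd

URL = "https://fantasy.premierleague.com/api/bootstrap-static/"

payload = requests.get(URL, timeout=30).json()

# "elements" holds one row per player: goals, assists, points, cost, etc.
players = pd.DataFrame(payload["elements"])
print(players[["web_name", "goals_scored", "assists", "total_points", "now_cost"]].head())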

Technologies

I will use the technologies below to help with the creation of the project:

  • Cloud: GCP
    • Data Lake: GCS
    • Data warehouse: BigQuery
  • Terraform: Infrastructure as Code (IaC) - creates the project configuration for GCP, bypassing the cloud GUI.
  • Workflow orchestration: Prefect (Docker)
  • Transforming data: dbt
  • Data Visualisation: SAS Visual Analytics

Architecture visualised:

Dashboard examples

The dashboard gives the user a high-level analysis of both players and teams across several Premier League seasons. You can view the dashboard here

Home page for visualisation:


Overview analysis of all seasons:


Individual team analysis:


How to run the project

  1. Clone the repo and install the necessary packages:
pip install -r requirements.txt
  2. Next, you will want to set up your Google Cloud environment:
export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login
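
As a quick sanity check that the credentials are picked up, you can list your buckets with the Python client (a sketch assuming the google-cloud-storage package is installed, e.g. via requirements.txt):

# Verify that application-default credentials are working by listing buckets.
from google.cloud import storage

client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS / ADC
print([bucket.name for bucket in client.list_buckets()])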
  3. Set up the infrastructure of the project using Terraform
  • If you do not have Terraform installed, you can install it here and then add it to your PATH
  • Once downloaded, run the following commands:
cd terraform/
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"
  4. Run the Python code in the Prefect folder
  • After installing the required Python packages, Prefect should be installed
  • You can start the Prefect server and access its UI using the command below:
prefect orion start
  • Access the UI at: http://127.0.0.1:4200/
  • You will then want to swap out the blocks so that they are registered to your credentials for GCS and BigQuery. This can be done in the Blocks options
  • You can keep the blocks under the same names as in the code or change them. If you do change them, make sure to update the code to reference the new block names
  • Go back to the terminal and run:
cd flows/
python etl_gcs_player.py
  • The data will then be stored both in your GCS bucket and in BigQuery
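
For orientation, here is a stripped-down sketch of the general shape such a flow takes in Prefect 2.x; the block names "gcs-bucket" and "gcp-credentials" and the target table are illustrative assumptions, not necessarily what etl_gcs_player.py registers:

# Sketch of a Prefect 2.x ETL flow; block and table names are placeholders,
# not necessarily those used in this repo's flows.
import pandas as pd
from prefect import flow, task
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

@task(retries=3)
def fetch(url: str) -> pd.DataFrame:
    # Pull the raw season data into a DataFrame
    return pd.read_csv(url)

@task
def write_gcs(df: pd.DataFrame, path: str) -> None:
    # Write locally as parquet, then upload via the GCS bucket block
    df.to_parquet(path)
    GcsBucket.load("gcs-bucket").upload_from_path(from_path=path, to_path=path)

@task
def write_bq(df: pd.DataFrame) -> None:
    # Append into BigQuery using the registered credentials block
    creds = GcpCredentials.load("gcp-credentials")
    df.to_gbq(
        destination_table="fpl.players",
        credentials=creds.get_credentials_from_service_account(),
        if_exists="append",
    )

@flow
def etl_player_flow(url: str) -> None:
    df = fetch(url)
    write_gcs(df, "players.parquet")
    write_bq(df)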
  • If you want to run the process in Docker, you can run the commands below:
cd Prefect/
docker image build -t <docker-username>/fantasy:fpl .
docker image push <docker-username>/fantasy:fpl
  • docker_deploy.py will load the flows into the deployment area of Prefect so that they can then be run directly from your container:
cd flows/
python docker_deploy.py
  • The following command will start an agent that listens for flow runs:
prefect agent start
  • Run the containerized flow from the CLI:
prefect deployment run etl-parent-flow/docker_player_flow --param yr=[16,17,18,19,20,21,22] --param yrs=[17,18,19,20,21,22,23]
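
For reference, a docker_deploy.py along these lines matches the Prefect 2.x deployment API; the DockerContainer block name and the flow import are assumptions inferred from the commands above:

# Sketch of a Docker-based Prefect 2.x deployment script; the block name
# and the flow import are assumptions, not taken from the repo.
from prefect.deployments import Deployment
from prefect.infrastructure.docker import DockerContainer
from etl_gcs_player import etl_parent_flow  # assumed flow entry point

docker_block = DockerContainer.load("fpl-docker")  # a pre-registered block

deployment = Deployment.build_from_flow(
    flow=etl_parent_flow,
    name="docker_player_flow",
    infrastructure=docker_block,
)

if __name__ == "__main__":
    deployment.apply()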
  5. Running the dbt flow
  • Create a dbt account and log in to dbt Cloud here
  • Once logged in, clone the repo for use
  • In the CLI at the bottom, run the following command:
dbt run
  • This will run all the models and create our final dataset, "final_players"
  • final_players will then be placed within the schema chosen when setting up the project in dbt.
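
If you only want to build final_players and the models it depends on, dbt's standard node selection syntax can narrow the run (generic dbt CLI usage, not specific to this repo):

dbt run --select +final_players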
  6. How the lineage should look once run: (lineage graph image)

  7. Visualisation choices

  • You can now take the final_players dataset and use it within Looker or another data visualisation tool, such as SAS VA, which I used.
