
The Full Stack 7-Steps MLOps Framework

🔥 LIVE DEMO 🔥 | WEB APP - FORECASTING | WEB APP - MONITORING


This repository is a 7-lesson course that will walk you step-by-step through how to design, implement, and deploy an ML system using good MLOps practices. During the course, you will build a production-ready model that forecasts energy consumption for the next 24 hours across multiple consumer types in Denmark.

This course targets mid/advanced machine learning engineers who want to level up their skills by building their own end-to-end projects.

Following the documentation and the Medium articles, you can reproduce and understand every piece of the code!

At the end of the course, you will know how to build everything from the diagram below.

Don't worry if something doesn't make sense to you. I will explain everything in detail in the Medium series.

You can safely use this code as you like, as long as you respect the terms and agreement of the MIT License.

Table of Contents

  1. What You Will Learn
  2. Lessons & Tutorials
  3. Data
  4. Code Structure
  5. Set Up Additional Tools
  6. Usage
  7. Installation & Usage for Development
  8. Licensing & Contributing

🤔 What You Will Learn

At the end of this 7-lesson course, you will know how to:

  • design a batch-serving architecture
  • use Hopsworks as a feature store
  • design a feature engineering pipeline that reads data from an API
  • build a training pipeline with hyperparameter tuning
  • use W&B as an ML Platform to track your experiments, models, and metadata
  • implement a batch prediction pipeline
  • use Poetry to build your own Python packages
  • deploy your own private PyPi server
  • orchestrate everything with Airflow
  • use the predictions to code a web app using FastAPI and Streamlit
  • use Docker to containerize your code
  • use Great Expectations to ensure data validation and integrity
  • monitor the performance of the predictions over time
  • deploy everything to GCP
  • build a CI/CD pipeline using GitHub Actions

If that sounds like a lot, don't worry. After you complete this course, you will understand everything I said before. Most importantly, you will know WHY I used all these tools and how they work together as a system.

🤌 Lessons & Tutorials

The course consists of 7 lessons hosted on Medium's Towards Data Science publication. To get the most out of this course, you should also run the code while you read the articles.

👇 Access the step-by-step lessons on Medium 👇

  1. Batch Serving. Feature Stores. Feature Engineering Pipelines.
  2. Training Pipelines. ML Platforms. Hyperparameter Tuning.
  3. Batch Prediction Pipeline. Package Python Modules with Poetry.
  4. Private PyPi Server. Orchestrate Everything with Airflow.
  5. Build Your Own App with FastAPI and Streamlit.
  6. Data Validation and Integrity using GE. Monitor Model Performance.
  7. Deploy Everything on GCP. Build a CI/CD Pipeline using GitHub Actions.

📊 Data

We used an open API that provides hourly energy consumption values for all the energy consumer types within Denmark.

They provide an intuitive interface where you can easily query and visualize the data. You can access the data here.
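
If you want a quick look at the raw data before running any code, you can query the API directly from the terminal. This is only a rough sketch: the dataset name and query parameters below are assumptions on my side, so check the provider's API documentation for the exact endpoint.

# Hypothetical example: fetch 24 hourly consumption records from the open energy API.
# The dataset name (ConsumptionDE35Hour) and the parameters are assumptions -- verify them in the API docs.
curl "https://api.energidataservice.dk/dataset/ConsumptionDE35Hour?start=2023-04-14T00:00&end=2023-04-15T00:00&limit=24"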

The data has 4 main attributes:

  • Hour UTC: the UTC datetime when the data point was observed.
  • Price Area: Denmark is divided into two price areas: DK1 and DK2, divided by the Great Belt. DK1 is west of the Great Belt, and DK2 is east of the Great Belt.
  • Consumer Type: The consumer type is the Industry Code DE35, owned and maintained by Danish Energy.
  • Total Consumption: total electricity consumption in kWh.

Note: The observations have a lag of 15 days! But for our demo use case, that is not a problem, as we can simulate the same steps as we would in real time.

The data points have an hourly resolution. For example: "2023-04-15 21:00Z", "2023-04-15 20:00Z", "2023-04-15 19:00Z", etc.

We will model the data as multiple time series. Each unique (price area, consumer type) tuple represents its own time series.

Thus, we will build a model that independently forecasts the energy consumption for the next 24 hours for every time series.

Check out our live demo to better understand how the data looks.

🧬 Code Structure

The code is split into two main components: the pipeline and the web app.

The pipeline consists of 3 modules:

  • feature-pipeline
  • training-pipeline
  • batch-prediction-pipeline

The web app also consists of 3 modules:

  • app-api
  • app-frontend
  • app-monitoring

Also, we have the following folders:

  • airflow : Airflow files | Orchestration
  • .github : GitHub Actions files | CI/CD
  • deploy : Build & Deploy


To follow the structure in its natural flow, read the folders in the following order:

  1. feature-pipeline
  2. training-pipeline
  3. batch-prediction-pipeline
  4. airflow
  5. app-api
  6. app-frontend & app-monitoring
  7. .github

Read the Medium articles listed in the Lessons & Tutorials section for the whole experience.

🔧 Set Up Additional Tools

The code is tested only on Ubuntu 20.04 and 22.04 using Python 3.9.

If you have problems during the setup, please leave us an issue, and we will respond to you and update the README for future readers.

Also, if you have any questions, you can contact me directly on LinkedIn.

Poetry

Install Python system dependencies:

sudo apt-get install -y python3-distutils

Download and install Poetry:

curl -sSL https://install.python-poetry.org | python3 -

Open the ~/.bashrc file to add Poetry to your PATH:

nano ~/.bashrc

Add the following line to ~/.bashrc:

export PATH=~/.local/bin:$PATH

Check if Poetry is installed:

source ~/.bashrc
poetry --version

Official Poetry installation instructions.

Docker


Install Docker on Ubuntu.
Install Docker on Mac.
Install Docker on Windows.

Configure Credentials for the Private PyPi Server

We will run the private PyPi server using Docker later on, but it already expects the credentials to be configured.

Create credentials using passlib:

# Install dependencies.
sudo apt install -y apache2-utils
pip install passlib

# Create the credentials under the energy-forecasting name.
mkdir ~/.htpasswd
htpasswd -sc ~/.htpasswd/htpasswd.txt energy-forecasting

Set poetry to use the credentials:

poetry config repositories.my-pypi http://localhost
poetry config http-basic.my-pypi energy-forecasting <password>

Check that the credentials are set correctly in your poetry auth.toml file:

cat ~/.config/pypoetry/auth.toml

Hopsworks

You will use Hopsworks as your serverless feature store. Thus, you have to create an account and a project on Hopsworks. We will show you how to configure the code to use your Hopsworks project later.

I explained in this lesson how to create an API Key on Hopsworks. But long story short, you can go to your Hopsworks account settings and create the API Key from there.

If you want everything to work with the default settings, use the following naming conventions:

  • create a project called energy_consumption

Click here to start with Hopsworks.

Weights & Biases

You will use Weights & Biases as your serverless ML platform. Thus, you must create an account and a project on Weights & Biases. We will show you how to configure the code to use your W&B project later.

I explained in this lesson how to create an API Key on W&B. But long story short, you can go to your W&B user settings and create the API Key from there.

If you want everything to work with the default settings, use the following naming conventions:

  • create an entity called teaching-mlops
  • create a project called energy_consumption

Click here to start with Weights & Biases.
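
As an optional sanity check of your API Key (my suggestion, not a required course step), you can log in to W&B from the terminal before wiring the key into the project's .env files:

# Optional: verify your W&B API Key locally.
pip install wandb
wandb login <your-wandb-api-key>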

GCP

First, you must install the gcloud GCP CLI on your machine.

Follow this tutorial to install it.

If you only want to run the code locally, go straight to the "Storage" section.

As before, you have to create an account and a project on GCP. Using solely the bucket as storage will be free of charge.

At the time of writing, GCS is free up to 5GB.

If you want everything to work with the default settings, use the following naming conventions:

  • create a project called energy_consumption
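
After creating the project, point the gcloud CLI at it so that later commands target the right project. <your-project-id> is a placeholder for your actual GCP project ID:

gcloud auth login
gcloud config set project <your-project-id>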

Storage

At this step, you have to do 5 things:

  • create a project
  • create a bucket
  • create a service account that has admin permissions to the newly created bucket
  • create a service account that has read-only permissions to the newly created bucket
  • download a JSON key for each of the newly created service accounts.

Docs for creating a bucket on GCP.
Docs for creating a service account on GCP.
Docs for creating a JSON key for a GCP service account.

Your bucket admin service account should have the following role assigned: Storage Object Admin
Your bucket read-only service account should have the following role assigned: Storage Object Viewer

Reminder: At the time of writing, GCP storage is free up to 5GB.

If you want everything to work with the default settings, use the following naming conventions:

  • create a bucket called hourly-batch-predictions
  • rename your downloaded admin JSON service key to admin-buckets.json
  • rename your downloaded read-only JSON service key to read-buckets.json
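
If you prefer the command line over the GCP console, here is a rough sketch of the bucket, service account, role, and key steps using gcloud and gsutil. The service account names (buckets-admin, buckets-reader) and <your-project-id> are placeholders of mine, not course conventions, so adapt them as needed:

# Create the bucket.
gsutil mb -p <your-project-id> gs://hourly-batch-predictions

# Create the admin and read-only service accounts (placeholder names).
gcloud iam service-accounts create buckets-admin --project <your-project-id>
gcloud iam service-accounts create buckets-reader --project <your-project-id>

# Grant the roles on the bucket.
gsutil iam ch serviceAccount:buckets-admin@<your-project-id>.iam.gserviceaccount.com:roles/storage.objectAdmin gs://hourly-batch-predictions
gsutil iam ch serviceAccount:buckets-reader@<your-project-id>.iam.gserviceaccount.com:roles/storage.objectViewer gs://hourly-batch-predictions

# Download the JSON keys under the names expected by the code.
gcloud iam service-accounts keys create admin-buckets.json --iam-account buckets-admin@<your-project-id>.iam.gserviceaccount.com
gcloud iam service-accounts keys create read-buckets.json --iam-account buckets-reader@<your-project-id>.iam.gserviceaccount.com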

Check out our Medium article for more step-by-step instructions.

Deployment

This step is required only if you want to deploy the code on GCP VMs and build the CI/CD pipeline with GitHub Actions.

Note that this step might incur some costs on GCP. It won't be much: while developing this course, I spent only ~$20, and it will probably be less for you.

Also, you can get some free credits if you have a new GCP account (I had $300). Just be sure to delete the resources after you finish the course.

See this document for detailed instructions.

🔎 Usage

The code is tested only on Ubuntu 20.04 and 22.04 using Python 3.9.

If you have problems during the usage instructions, please leave us an issue, and we will respond to you and update the README for future readers.

Also, if you have any questions, you can contact me directly on LinkedIn.

The Pipeline

Run

You will run the pipeline using Airflow. Don't be scared. Docker makes everything very simple to set up.

NOTE: We also hooked the private PyPi server into the same docker-compose.yaml file as Airflow. Thus, everything will start with one command.

# Move to the airflow directory.
cd airflow

# Make expected directories and environment variables
mkdir -p ./logs ./plugins
sudo chmod 777 ./logs ./plugins

# It will be used by Airflow to identify your user.
echo -e "AIRFLOW_UID=$(id -u)" > .env
# This shows where our project root directory is located.
echo "ML_PIPELINE_ROOT_DIR=/opt/airflow/dags" >> .env

Now move to the DAGS directory:

cd ./dags

# Make a copy of the env default file.
cp .env.default .env
# Open the .env file and complete the WANDB_API_KEY and FS_API_KEY credentials 

# Create the folder where the program expects its GCP credentials.
mkdir -p credentials/gcp/energy_consumption
# Copy the GCP service credentials that give you admin access to GCS. 
cp -r /path/to/admin/gcs/credentials/admin-buckets.json credentials/gcp/energy_consumption
# NOTE that if you want everything to work out of the box, your JSON file should be called admin-buckets.json.
# Otherwise, you have to manually configure the GOOGLE_CLOUD_SERVICE_ACCOUNT_JSON_PATH variable from the .env file. 

# Initialize the Airflow database
docker compose up airflow-init

# Start up all services
# Note: You should set up the private PyPi server credentials before running this command.
docker compose --env-file .env up --build -d

Read the official Airflow installation using Docker, but NOTE that we modified their official docker-compose.yaml file.

Wait a while for the containers to build and run. Afterwards, access 127.0.0.1:8080 to log in to Airflow.
Use the following default credentials to log in:

  • username: airflow
  • password: airflow
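
If the login page does not load, a quick way to check whether the webserver is healthy is Airflow's health endpoint (just a sanity check, not a required step):

# Should return a JSON payload with the metadatabase and scheduler status.
curl http://127.0.0.1:8080/health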

Before starting the pipeline DAG, you must deploy the modules to the private PyPi server. Go back to the root folder of the energy-forecasting repository and run the following to build and deploy the pipeline modules to your private PyPi server:

# Set the experimental installer of Poetry to False. For us, it crashed when it was on True.
poetry config experimental.new-installer false
# Build & deploy the pipeline modules.
sh deploy/ml-pipeline.sh

Airflow will know how to install the packages from the private PyPi server.

One final step is to configure the parameters used to run the pipeline. Go to the Admin tab, then hit Variables. There you can click on the blue + button to add a new variable. These are the three parameters you can configure with our suggested values:

  • ml_pipeline_days_export = 30
  • ml_pipeline_feature_group_version = 5
  • ml_pipeline_should_run_hyperparameter_tuning = False
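
Alternatively, you can set the same variables from the Airflow CLI inside the webserver container instead of clicking through the UI (the container ID placeholder is the same as in the backfill section below):

docker exec -it <container-id-of-airflow-airflow-webserver> sh
airflow variables set ml_pipeline_days_export 30
airflow variables set ml_pipeline_feature_group_version 5
airflow variables set ml_pipeline_should_run_hyperparameter_tuning False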

Now, go to the DAGS/All section and search for the ml_pipeline DAG. Toggle the activation button. It should automatically start in a few seconds. Also, you can manually run it by hitting the play button from the top-right side of the ml_pipeline window.

That is it. You can run the entire pipeline with a single button if all the credentials are set up correctly. How cool is that?

Here is what the DAG should look like 👇

Clean Up

docker compose down --volumes --rmi all

Backfill Using Airflow

Find your airflow-webserver docker container ID:

docker ps

Start a shell inside the airflow-webserver container and run airflow dags backfill as follows (in this example, we did a backfill between 2023/04/11 00:00:00 and 2023/04/13 23:59:59):

docker exec -it <container-id-of-airflow-airflow-webserver> sh
airflow dags backfill --start-date "2023/04/11 00:00:00" --end-date "2023/04/13 23:59:59" ml_pipeline

If you want to clear the tasks and run them again, run these commands:

docker exec -it <container-id-of-airflow-airflow-webserver> sh
airflow tasks clear --start-date "2023/04/11 00:00:00" --end-date "2023/04/13 23:59:59" ml_pipeline

Run Private PyPi Server Separately

The private PyPi server is already hooked to the airflow docker compose file. But if you want to run it separately for whatever reason, you can run this command instead:

docker run -p 80:8080 -v ~/.htpasswd:/data/.htpasswd pypiserver/pypiserver:latest run -P .htpasswd/htpasswd.txt --overwrite
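
To check that the standalone server is reachable, you can list its package index. This is just my quick sanity check; by default, pypiserver asks for credentials only on uploads, so this read should work without authentication:

# Should return the simple package index page (possibly empty at first).
curl http://localhost/simple/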

The Web App

Here, everything is a lot simpler. This time, we need to set up only a few credentials.

Copy the bucket read-only GCP credentials to the root directory of your energy-forecasting project:

# Create the folder where the program expects its GCP credentials.
mkdir -p credentials/gcp/energy_consumption
# Copy the GCP service credentials that give you read-only access to GCS. 
cp -r /path/to/admin/gcs/credentials/read-buckets.json credentials/gcp/energy_consumption
# NOTE that if you want everything to work out of the box, your JSON file should be called read-buckets.json.
# Otherwise, you have to manually configure the APP_API_GCP_SERVICE_ACCOUNT_JSON_PATH variable from the .env file of the API.

Go to the API folder and make a copy of the .env.default file:

cd ./app-api
cp .env.default .env

NOTE: You shouldn't change anything else if you respect all the naming conventions suggested in this README.

That is it!

Go back to the root directory of your energy-forecasting project and run the following docker command, which will build and run all the docker containers of the web app:

docker compose -f deploy/app-docker-compose.yml --project-directory . up --build

If you want to run it in development mode, run the following command:

docker compose -f deploy/app-docker-compose.yml -f deploy/app-docker-compose.local.yml --project-directory . up --build

Now you can see the apps running at:

πŸ§‘β€πŸ’» Installation & Usage for Development

All the modules support Poetry. Thus, the installation is straightforward.

NOTE: Just ensure you have installed Python 3.9, not Python 3.8 or Python 3.10.

The Pipeline

We support Docker to run the whole pipeline. Check out the Usage section if you only want to run it as a whole.

If Poetry is not using Python 3.9, you can follow the next steps:

  1. Install Python 3.9 on your machine.
  2. cd into the project, for example: cd ./feature-pipeline
  3. Run which python3.9 to find where Python 3.9 is located.
  4. Run poetry env use /path/to/python3.9 (a quick verification sketch follows this list).
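
To double-check that Poetry picked up the right interpreter, a quick verification could look like this:

# Both commands should point to a Python 3.9 environment.
poetry env info --path
poetry run python --version
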
Set Up the ML_PIPELINE_ROOT_DIR Variable

!!! Before installing every module individually, one key step is to set the ML_PIPELINE_ROOT_DIR variable to the root directory of the energy-forecasting project:

gedit ~/.bashrc
export ML_PIPELINE_ROOT_DIR=/path/to/root/directory/energy-forecasting/repository

Another option is to prefix every Python script with the ML_PIPELINE_ROOT_DIR variable. For example:

ML_PIPELINE_ROOT_DIR=/path/to/root/directory/energy-forecasting/repository python -m feature_pipeline.pipeline

Deploy the Code to GCP

Check out this section.

Set Up CI/CD with GitHub Actions

Check out this section.


See here how to install every project individually:

The Web App

We support Docker to run the web app. Check out the Usage section if you only want to run it as a whole.

See here how to install every project individually:

You can also run the whole web app in development mode using Docker:

docker compose -f deploy/app-docker-compose.yml -f deploy/app-docker-compose.local.yml --project-directory . up --build

πŸ† Licensing & Contributing

The code is under the MIT License. Thus, as long as you distribute the License along with the code, feel free to share, clone, or change it as you like.

Also, if you find any bugs or missing pieces in the documentation, I encourage you to add an issue on GitHub. I will respond to you and adapt the code and docs for future readers.

Furthermore, you can contact me directly on LinkedIn if you have any questions.
