
πŸͺ Enter the Dimension of Cloud Computing! πŸš€

In the previous unit, you packaged πŸ“¦ the notebook of the WagonCab Data Science team and updated the code with chunk-processing so that the model could be trained on the full TaxiFare dataset despite running on a "small" local machine.

☁️ In this unit, you will learn how to dispatch work to a pool of cloud resources instead of using your local machine.

πŸ’ͺ As you can now (in theory) access machines with as much RAM as you need, we'll consider that you don't need any "chunk-by-chunk" logic anymore!

🎯 Today, you will refactor the previous unit's codebase so as to:

  • Fetch all your environment variables from a single .env file instead of updating params.py
  • Load the raw data from Le Wagon's Big Query all at once into memory (no chunks)
  • Cache a local CSV copy to avoid querying it twice
  • Process the data
  • Upload the processed data to your own Big Query table
  • Download the processed data (all at once)
  • Cache a local CSV copy to avoid querying it twice
  • Train your model on this processed data
  • Store the model weights in your own Google Cloud Storage (GCS) bucket

Then, you'll provision a Virtual Machine (VM) and run this whole workflow on it!

Congratulations, you will have grown from a Data Scientist into a full ML Engineer! You can now sell your big GPU laptop and buy a lightweight computer like real ML practitioners 😝



1️⃣ New taxifare package setup

❓Instructions (expand me)

Project Structure

πŸ‘‰ From now on, you will start each new challenge with the solution of the previous challenge

πŸ‘‰ Each new challenge will bring in an additional set of features

Here are the main files of interest:

.
β”œβ”€β”€ .env                            # βš™οΈ Single source of all config variables
β”œβ”€β”€ .envrc                          # 🎬 .env automatic loader (used by direnv)
β”œβ”€β”€ Makefile                        # New commands "run_train", "run_process", etc..
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ setup.py
β”œβ”€β”€ taxifare
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ interface
β”‚   β”‚   β”œβ”€β”€ main_local.py           # πŸšͺ (OLD) entry point
β”‚   β”‚   └── main.py                 # πŸšͺ (NEW) entry point: No more chunks πŸ˜‡ - Just process(), train()
β”‚   β”œβ”€β”€ ml_logic
β”‚   β”‚   β”œβ”€β”€ data.py                 # (UPDATED) Loading and storing data from/to Big Query!
β”‚   β”‚   β”œβ”€β”€ registry.py             # (UPDATED) Loading and storing model weights from/to Cloud Storage!
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ params.py                   # Simply loads all .env variables into Python objects
β”‚   └── utils.py
└── tests

βš™οΈ .env.sample

This file is a template designed to help you create a .env file for each challenge. The .env.sample file contains the variables required by the code and expected in the .env file. 🚨 Keep in mind that the .env file should never be tracked with Git to avoid exposing its content, so we have added it to your .gitignore.
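
For orientation, here is the kind of content you should expect to find in it; the variable names below are gathered from the rest of this README, and the values are placeholders (your actual .env.sample is the reference):

# .env.sample (illustrative excerpt)
DATA_SIZE=1k
CHUNK_SIZE=200
MODEL_TARGET=local                 # or "gcs"
GCP_PROJECT=<your GCP project ID>
GCP_REGION=<your GCP region>
BUCKET_NAME=<your globally unique bucket name>
BQ_DATASET=taxifare
BQ_REGION=<your Big Query region>
INSTANCE=<your VM instance name>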

πŸšͺ main.py

Bye bye taxifare.interface.main_local module, you served us well ❀️

Long live taxifare.interface.main, our new package entry point ⭐️ to:

  • preprocess: preprocess the data and store data_processed
  • train: train on processed data and store model weights
  • evaluate: evaluate the performance of the latest trained model on new data
  • pred: make a prediction on a DataFrame with a specific version of the trained model

🚨 One main change in the code of the package is that we chose to delegate some of its work to dedicated modules in order to limit the size of the main.py file. The main changes concern:

  • The project configuration: Single source of truth is .env

    • .envrc tells direnv to load the .env file as environment variables
    • params.py then loads all these variables into Python, and should not be edited manually anymore
  • registry.py: the code evolved to store the trained model either locally or - spoiler alert - in the cloud

    • Notice the new env variable MODEL_TARGET (local or gcs)
  • data.py now exposes 2 refactored functions that we'll use heavily in main.py (sketched below)

    • get_data_with_cache() (get data from Big Query, or from a cached CSV if one exists)
    • load_data_to_bq() (upload data to BQ)
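
To make this concrete, here is a minimal, hypothetical sketch of what these two helpers could look like; the exact signatures, column handling and docstrings of your data.py are the source of truth:

# taxifare/ml_logic/data.py: illustrative sketch only, not the official solution
from pathlib import Path
import pandas as pd
from google.cloud import bigquery

def get_data_with_cache(gcp_project: str, query: str, cache_path: Path) -> pd.DataFrame:
    """Return data from the local CSV cache if it exists, otherwise query Big Query and cache the result."""
    if cache_path.is_file():
        return pd.read_csv(cache_path)
    client = bigquery.Client(project=gcp_project)
    df = client.query(query).result().to_dataframe()
    df.to_csv(cache_path, index=False)  # cache locally to avoid querying BQ twice
    return df

def load_data_to_bq(data: pd.DataFrame, gcp_project: str, bq_dataset: str, table: str, truncate: bool) -> None:
    """Upload a DataFrame to <gcp_project>.<bq_dataset>.<table>, truncating or appending."""
    write_mode = "WRITE_TRUNCATE" if truncate else "WRITE_APPEND"
    job_config = bigquery.LoadJobConfig(write_disposition=write_mode)
    client = bigquery.Client(project=gcp_project)
    job = client.load_table_from_dataframe(data, f"{gcp_project}.{bq_dataset}.{table}", job_config=job_config)
    job.result()  # wait for the load job to finish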

Setup

Install taxifare version 0.0.7

πŸ’» Install the new package version

make reinstall_package # always check what make does in the Makefile

πŸ§ͺ Check the package version

pip list | grep taxifare
# taxifare               0.0.7

Setup direnv & .env

Our goal is to be able to configure the behavior of our package πŸ“¦ depending on the value of the variables defined in a .env project configuration file.

πŸ’» In order to do so, we will install the direnv shell extension. Its job is to locate the nearest .env file in the parent directory structure of the project and load its content into the environment.

# MacOS
brew install direnv

# Ubuntu (Linux or Windows WSL2)
sudo apt update
sudo apt install -y direnv

Once direnv is installed, we need to tell zsh to load direnv whenever the shell starts

code ~/.zshrc

The list of plugins is located at the beginning of the file and should look like this once you add direnv:

plugins=(...direnv)

Start a new zsh window in order to load direnv

πŸ’» At this point, direnv is still not able to load anything, as there is no .env file, so let's create one:

  • Duplicate the .env.sample file and rename the duplicate to .env
  • Enable the project configuration with direnv allow . (the . stands for current directory)

πŸ§ͺ Check that direnv is able to read the environment variables from the .env file:

echo $DATA_SIZE
# 1k --> Let's keep it small!

From now on, every time you need to update the behavior of the project:

  1. Edit .env, save it
  2. Then
direnv reload . # to reload your env variables 🚨🚨

☝️ You will forget that. Prove us wrong 😝

# OK, so for this unit, always keep the data size values small (good practice for dev purposes)
DATA_SIZE=1k
CHUNK_SIZE=200
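
These values then reach your Python code through params.py (see the project structure above). Here is a minimal sketch of how that typically works, assuming plain os.environ access; the real file may add more variables, defaults or validation:

# taxifare/params.py: illustrative sketch
import os

DATA_SIZE = os.environ.get("DATA_SIZE")                # e.g. "1k"
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "200"))
MODEL_TARGET = os.environ.get("MODEL_TARGET", "local")
GCP_PROJECT = os.environ.get("GCP_PROJECT")
BUCKET_NAME = os.environ.get("BUCKET_NAME")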

2️⃣ GCP Setup

❓Instructions (expand me)

Google Cloud Platform will allow you to allocate and use remote resources in the cloud. You can interact with it via:

  • 🌐 console.cloud.google.com
  • πŸ’» Command Line Tools
    • gcloud
    • bq (big query - SQL)
    • gsutil (cloud storage - buckets)

a) gcloud CLI

  • Find the gcloud command that lists your own GCP project ID.
  • πŸ“ Fill in the GCP_PROJECT variable in the .env project configuration with the ID of your GCP project
  • πŸ§ͺ Run the tests with make test_gcp_project
πŸ’‘ Hint

You can use the -h or the --help (more details) flags in order to get contextual help on the gcloud commands or sub-commands; use gcloud billing -h to get the gcloud billing sub-command's help, or gcloud billing --help for more detailed help.

πŸ‘‰ Pressing q is usually the way to exit help mode if the command did not terminate itself (Ctrl + C also works)

Also note that running gcloud without arguments lists all the available sub-commands by group.
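
If you want to sanity-check what you found, two likely candidates are shown below (treat this as a hint, not necessarily the expected answer):

gcloud projects list          # list the projects your account can access, with their IDs
gcloud config list project    # show the project your gcloud CLI is currently configured to use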

b) Cloud Storage (GCS) and the gsutil CLI

The second CLI tool that you will use often allows you to deal with files stored within buckets on Cloud Storage.

We'll use it to store large & unstructured data such as model weights :)

πŸ’» Create a bucket in your GCP account using gsutil

  • Make sure to create the bucket in the region where you are located (use GCP_REGION in the .env)
  • Also fill in the BUCKET_NAME variable with a name of your choice (it must be globally unique and lower case!)

e.g.

BUCKET_NAME=taxifare_<user.github_nickname>
  • direnv reload . ;)

Tip: The CLI can interpolate .env variables by prefixing them with a $ sign (e.g. $GCP_REGION)

🎁 Solution
gsutil ls                                    # list buckets

# make bucket
gsutil mb \
    -l $GCP_REGION \
    -p $GCP_PROJECT \
    gs://$BUCKET_NAME

gsutil rm -r gs://$BUCKET_NAME               # delete bucket

You can also use the Cloud Storage console in order to create a bucket or list the existing buckets and their content.

Do you see how much slower the GCP console (web interface) is compared to the command line?

πŸ§ͺ Run the tests with make test_gcp_bucket

c) Big Query and the bq CLI

Big Query is a data warehouse used to store structured data that can be queried rapidly.

πŸ’‘ To be more precise, Big Query is an online, massively parallel analytical database (as opposed to a transactional database)

  • Data is stored by columns (as opposed to rows, as in PostgreSQL for instance)
  • It's optimized for large transformations such as GROUP BY, JOIN, WHERE, etc.
  • But it's not optimized for frequent row-by-row inserts/deletes

WagonCab actually uses a managed PostgreSQL instance (e.g. Google Cloud SQL) as its main production database, on which its Django app stores and reads hundreds of thousands of individual transactions per day!

Every night, WagonCab launches a "database replication" job that applies the daily diffs of the "main" PostgreSQL database to the "replica" Big Query warehouse. Why?

  • Because you don't want to run analytical queries directly against your production database! That could slow down your users.
  • Because analysis is faster/cheaper on columnar databases
  • Because you also want to integrate other data in your warehouse to JOIN them (e.g marketing data from Google Ads...)

πŸ‘‰ Back to our business:

πŸ’» Let's create our own dataset where we'll store & query preprocessed data !

  • Using bq and the following env variables, create a new dataset called taxifare on your own GCP_PROJECT
BQ_DATASET=taxifare
BQ_REGION=...
GCP_PROJECT=...
  • Then add 3 new tables processed_1k, processed_200k, processed_all
πŸ’‘ Hints

Although the bq command is part of the Google Cloud SDK that you installed on your machine, it does not seem to follow the same help pattern as the gcloud and gsutil commands.

Try running bq without arguments to list the available sub-commands.

What you are looking for is probably in the mk (make) section.

🎁 Solution
bq mk \
    --project_id $GCP_PROJECT \
    --data_location $BQ_REGION \
    $BQ_DATASET

bq mk --location=$BQ_REGION $BQ_DATASET.processed_1k
bq mk --location=$BQ_REGION $BQ_DATASET.processed_200k
bq mk --location=$BQ_REGION $BQ_DATASET.processed_all

bq show
bq show $BQ_DATASET
bq show $BQ_DATASET.processed_1k

πŸ§ͺ Run the tests with make test_big_query

🎁 Look at the make reset_all_files directive --> It resets all local files (CSVs, models, ...) and the data in your BQ tables and buckets, but preserves the local folder structure, the BQ table schemas, and the GCS buckets.

Very useful to reset the state of your challenge if you are uncertain and want to debug it yourself!

πŸ‘‰ You can safely run make reset_all_files now; it will remove the files from unit 01 and make things clearer

πŸ‘‰ Run make show_sources_all to see that you're starting from a blank state!

βœ… When you are all set, track your results on Kitt with make test_kitt (don't wait, this takes > 1min)

3️⃣ βš™οΈ Train locally, with data on the cloud !

❓Instructions (expand me)

🎯 Your goal is to fill in taxifare.interface.main so that you can run each of its 4 routes, one by one

if __name__ == '__main__':
    # preprocess()
    # train()
    # evaluate()
    # pred()

To do so, you can either:

  • πŸ₯΅ Uncomment the routes above, one after the other, and run python -m taxifare.interface.main from your Terminal

  • πŸ˜‡ Smarter: use each of the following make commands that we created for you below

πŸ’‘ Make sure to read each function's docstring carefully
πŸ’‘ Don't try to complete the routes in parallel; fix them one after the other
πŸ’‘ Take time to read the tracebacks carefully, and add breakpoint() to your code or to the test itself (you are 'engineers' now)!

Preprocess

πŸ’‘ Feel free to refer back to main_local.py when needed! Some of the syntax can be re-used

# Call your preprocess()
make run_preprocess
# Then test this route, but with all combinations of states (.env, cached_csv or not)
make test_preprocess

Train

πŸ’‘ Be sure to understand what happens when MODEL_TARGET = 'gcs' vs 'local' (a hedged illustration follows the commands below)
πŸ’‘ We advise you to set verbose=0 on model training to shorten your logs!

make run_train
make test_train
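
As a hedged illustration of that switch (assuming params.py exposes MODEL_TARGET and BUCKET_NAME; your registry.py almost certainly names and organizes things differently):

# Illustrative sketch of the MODEL_TARGET branching in registry.py (not the official solution)
import os
from google.cloud import storage
from taxifare.params import MODEL_TARGET, BUCKET_NAME  # assumed to be loaded from .env

def upload_model_weights(local_path: str) -> None:
    """Keep the model on disk; if MODEL_TARGET == 'gcs', also push it to your GCS bucket."""
    if MODEL_TARGET == "local":
        return  # nothing more to do: the model already sits on the local disk
    if MODEL_TARGET == "gcs":
        client = storage.Client()
        blob = client.bucket(BUCKET_NAME).blob(f"models/{os.path.basename(local_path)}")
        blob.upload_from_filename(local_path)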

Evaluate

Be sure to understand what happens when MODEL_TARGET = 'gcs' vs 'local'

make run_evaluate
make test_evaluate

Pred

This one is easy

make run_pred
make test_pred

βœ… When you are all set, track your results on Kitt with make test_kitt

🏁 Congrats on the heavy refactoring! You now have a very robust package that can be deployed in the cloud to be used with DATA_SIZE='all' πŸ’ͺ

4️⃣ Train in the Cloud with Virtual Machines

❓Instructions (expand me)

Enable the Compute Engine Service

In GCP, many services are not enabled by default. The service to activate in order to use virtual machines is Compute Engine.

❓How do you enable a GCP service?

Find the gcloud command to enable a service.

πŸ’‘ Hints

Enabling an API
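
If you prefer the command line, enabling the Compute Engine API typically looks like this (a hedged example; double-check the exact service name with the list command first):

gcloud services list --available | grep -i "compute engine"   # find the service name (compute.googleapis.com)
gcloud services enable compute.googleapis.com                 # enable the Compute Engine API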

Create your First Virtual Machine

The taxifare package is ready to train on a machine in the cloud. Let's create our first Virtual Machine instance!

❓Create a Virtual Machine

Head over to the GCP console, specifically the Compute Engine page. The console will allow you to easily explore the available options. Make sure to create an Ubuntu instance (read the how-to below and have a look at the hint after it).

πŸ—Ί How to configure your VM instance

Let's explore the options available. The top right of the interface gives you a monthly estimate of the cost for the selected parameters if the VM remains online all the time.

The default options should be enough for what we want to do now, except for one: we want to choose the operating system that the VM instance will be running.

Go to the "Boot disk" section, click on "CHANGE" at the bottom, change the operating system to Ubuntu, and select the latest Ubuntu xx.xx LTS x86/64 (Long Term Support) version.

Ubuntu is the Linux distro that will resemble the configuration on your machine the most, following the Le Wagon setup. Whether you are on a Mac, using Windows WSL2 or on native Linux, selecting this option will allow you to play with a remote machine using the commands you are already familiar with.

πŸ’‘ Hint

In the future, when you know exactly what type of VM you want to create, you will be able to use the gcloud compute instances command if you want to do everything from the command line; for example:

INSTANCE=taxi-instance
IMAGE_PROJECT=ubuntu-os-cloud
IMAGE_FAMILY=ubuntu-2204-lts

gcloud compute instances create $INSTANCE --image-project=$IMAGE_PROJECT --image-family=$IMAGE_FAMILY

πŸ’» Fill in the INSTANCE variable in the .env project configuration

Setup your VM

You have access to virtually unlimited computing power at your fingertips, ready to help with training runs or any other task you might think of.

❓How do you connect to the VM?

The GCP console allows you to connect to the VM instance through a web interface:

(look for the SSH button next to your instance on the Compute Engine page, which opens a browser-based terminal)

You can disconnect by typing exit or closing the window.

A nice alternative is to connect to the virtual machine right from your command line 🀩


All you need to do is run gcloud compute ssh on a running instance, and run exit when you want to disconnect πŸŽ‰

INSTANCE=taxi-instance

gcloud compute ssh $INSTANCE
πŸ’‘ Error 22

If you encounter a port 22: Connection refused error, just wait a little more for the VM instance to complete its startup.

Just run pwd or hostname if you ever wonder on which machine you are running your commands.

❓How do you setup the VM to run your python code?

Let's run a light version of the Le Wagon setup.

πŸ’» Connect to your VM instance and run the commands of the following sections

βš™οΈ zsh and omz (expand me)

The zsh shell and its Oh My Zsh framework are the CLI configuration you are already familiar with. When prompted, make sure to accept making zsh the default shell.

sudo apt update
sudo apt install -y zsh
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

πŸ‘‰ Now the CLI of the remote machine starts to look a little more like the CLI of your local machine

βš™οΈ pyenv and pyenv-virtualenv (expand me)

Clone the pyenv and pyenv-virtualenv repos on the VM:

git clone https://github.com/pyenv/pyenv.git ~/.pyenv
git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv

Open ~/.zshrc in a Terminal code editor:

nano ~/.zshrc

Add pyenv, ssh-agent and direnv to the list of zsh plugins on the line with plugins=(git) in ~/.zshrc: in the end, you should have plugins=(git pyenv ssh-agent direnv). Then, exit and save (Ctrl + X, Y, Enter).

Make sure that the modifications were indeed saved:

cat ~/.zshrc | grep "plugins="

Add the pyenv initialization script to your ~/.zprofile:

cat << EOF >> ~/.zprofile
export PYENV_ROOT="\$HOME/.pyenv"
export PATH="\$PYENV_ROOT/bin:\$PATH"
eval "\$(pyenv init --path)"
EOF

πŸ‘‰ Now we are ready to install Python

βš™οΈ Python (expand me)

Add dependencies required to build Python:

sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
python3-dev

ℹ️ If a window pops up asking you which services to restart, just press Enter.


Now we need to start a new user session so that the updates in ~/.zshrc and ~/.zprofile are taken into account. Run the command below πŸ‘‡:

zsh --login

Install the same Python version that you use for the bootcamp, and create a taxifare-env virtual env. This can take a while and look like it is stuck, but it is not:

# e.g. with 3.10.6
pyenv install 3.10.6
pyenv global 3.10.6
pyenv virtualenv 3.10.6 taxifare-env
pyenv global taxifare-env

βš™οΈ git authentication with GitHub (expand me)

Copy your private key πŸ”‘ to the VM in order to allow it to access your GitHub account.

⚠️ Run this single command on your machine, not in the VM ⚠️

INSTANCE=taxi-instance

# scp stands for secure copy (cp)
gcloud compute scp ~/.ssh/id_ed25519 $USER@$INSTANCE:~/.ssh/

⚠️ Then, resume running commands in the VM ⚠️

Register the key you just copied after starting ssh-agent:

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

Enter your passphrase if asked to.

πŸ‘‰ You are now able to interact with your GitHub account from the virtual machine

βš™οΈ Python code authentication to GCP (expand me)

The code of your package needs to be able to access your Big Query data warehouse.

To do so, we will login to your account using the command below πŸ‘‡

gcloud auth application-default login

❗️ Note: In a full production environment we would create a service account applying the least-privilege principle for the VM, but this is the easiest approach for development.

Let's verify that your Python code can now access your GCP resources. First, install some packages:

pip install -U pip
pip install google-cloud-storage

Then, run Python code from the CLI. This should list your GCP buckets:

python -c "from google.cloud import storage; \
    buckets = storage.Client().list_buckets(); \
    [print(b.name) for b in buckets]"

Your VM is now fully operational with:

  • A Python virtual env (taxifare-env) to run your code
  • The credentials to connect to your GitHub account
  • The credentials to connect to your GCP account

The only thing that is missing is the code of your project!

πŸ§ͺ Let's run a few tests inside your VM Terminal before we install it:

  • Default shell is /usr/bin/zsh
    echo $SHELL
  • Python version is 3.10.6
    python --version
  • Active GCP project is the same as $GCP_PROJECT in your .env file
    gcloud config list project

Your VM is now a data science beast πŸ”₯

Train in the Cloud

Let's run your first training in the cloud!

❓How do you setup and run your project on the virtual machine?

πŸ’» Clone your package, install its requirements

πŸ’‘ Hint

You can copy your code to the VM by cloning your GitHub project with this syntax:

git clone git@github.com:<user.github_nickname>/{{local_path_to("07-ML-Ops/02-Cloud-training/01-Cloud-training")}}

Enter the directory of today's taxifare package (adapt the command):

cd <path/to/the/package/model/dir>

Create directories to save the model and its parameters/metrics:

make reset_local_files

Create a .env file with all required parameters to use your package:

cp .env.sample .env

Fill in the content of the .env file (complete the missing values, change any values that are specific to your virtual machine):

nano .env

Install direnv to load your .env:

sudo apt update
sudo apt install -y direnv

ℹ️ If a window pops up to ask you which services to restart, just press Enter.

Reconnect (simulate a user reboot) so that direnv works:

zsh --login

Allow your .envrc:

direnv allow .

Install the taxifare package (and all its dependencies)!

pip install .

πŸ”₯ Run the preprocessing and the training in the cloud πŸ”₯!

make run_all


Getting a "Project not set" error from GCP services? You can add a GCLOUD_PROJECT environment variable, set to the same value as your GCP_PROJECT
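
For example, on the VM (assuming direnv is already loading your .env):

echo "GCLOUD_PROJECT=$GCP_PROJECT" >> .env   # append it next to your other variables
direnv reload .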

πŸ§ͺ Track your progress on Kitt to conclude (from your VM)

make test_kitt

πŸ‹πŸ½β€β™‚οΈ Go Big: re-run everything with DATA_SIZE = 'all' CHUNK_SIZE=100k chunks for instance πŸ‹πŸ½β€β™‚οΈ!

🏁 Switch OFF your VM to finish πŸŒ’

You can easily start and stop a VM instance from the GCP console, which allows you to see which instances are running.


πŸ’‘ Hint

A faster way to start and stop your virtual machine is to use the command line. The commands still take some time to complete, but you do not have to navigate through the GCP console interface.

Have a look at the gcloud compute instances command in order to start, stop, or list your instances:

INSTANCE=taxi-instance

gcloud compute instances stop $INSTANCE
gcloud compute instances list
gcloud compute instances start $INSTANCE

🚨 Computing power does not grow on trees 🌳, do not forget to switch the VM off whenever you stop using it! πŸ’Έ


🏁 Remember: Switch OFF your VM with gcloud compute instances stop $INSTANCE
