
πŸͺ Enter the Dimension of Cloud Computing! πŸš€

In the previous unit, you packaged πŸ“¦ the notebook of the WagonCab Data Science team and updated the code with chunk-processing so that the model could be trained on the full TaxiFare dataset despite running on a "small" local machine.

☁️ In this unit, you will learn how to dispatch work to a pool of cloud resources instead of using your local machine.

πŸ’ͺ As you can now (in theory) access machines with as much RAM as you need, we'll consider that you don't need any "chunk-by-chunk" logic anymore!

🎯 Today, you will refactor the previous unit's codebase so as to:

  • Fetch all your environment variables from a single .env file instead of updating params.py
  • Load the raw data from Le Wagon's Big Query all at once into memory (no chunks)
  • Cache a local CSV copy to avoid querying it twice
  • Process the data
  • Upload the processed data to your own Big Query table
  • Download the processed data (all at once)
  • Cache a local CSV copy to avoid querying it twice
  • Train your model on this processed data
  • Store the model weights in your own Google Cloud Storage (GCS) bucket

Then, you'll provision a Virtual Machine (VM) and run this whole workflow on it!

Congratulations, you will have grown from a Data Scientist into a full ML Engineer! You can now sell your big GPU laptop and buy a lightweight computer like real ML practitioners 😝



1️⃣ New taxifare package setup

❓Instructions (expand me)

Project Structure

πŸ‘‰ From now on, you will start each new challenge with the solution of the previous challenge

πŸ‘‰ Each new challenge will bring in an additional set of features

Here are the main files of interest:

.
β”œβ”€β”€ .env                            # βš™οΈ Single source of all config variables
β”œβ”€β”€ .envrc                          # 🎬 .env automatic loader (used by direnv)
β”œβ”€β”€ Makefile                        # New commands "run_train", "run_process", etc..
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ setup.py
β”œβ”€β”€ taxifare
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ interface
β”‚   β”‚   β”œβ”€β”€ main_local.py           # πŸšͺ (OLD) entry point
β”‚   β”‚   └── main.py                 # πŸšͺ (NEW) entry point: No more chunks πŸ˜‡ - Just process(), train()
β”‚   β”œβ”€β”€ ml_logic
β”‚   β”‚   β”œβ”€β”€ data.py                 # (UPDATED) Loading and storing data from/to Big Query!
β”‚   β”‚   β”œβ”€β”€ registry.py             # (UPDATED) Loading and storing model weights from/to Cloud Storage!
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ params.py                   # Simply loads all .env variables into Python objects
β”‚   └── utils.py
└── tests

βš™οΈ .env.sample

This file is a template designed to help you create a .env file for each challenge. The .env.sample file contains the variables required by the code and expected in the .env file. 🚨 Keep in mind that the .env file should never be tracked with Git to avoid exposing its content, so we have added it to your .gitignore.
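
For orientation, here is the kind of content you should expect to find in it; the variable names below are gathered from the rest of this README, and the values are placeholders (your actual .env.sample is the reference):

# .env.sample (illustrative excerpt)
DATA_SIZE=1k
CHUNK_SIZE=200
MODEL_TARGET=local                 # or "gcs"
GCP_PROJECT=<your GCP project ID>
GCP_REGION=<your GCP region>
BUCKET_NAME=<your globally unique bucket name>
BQ_DATASET=taxifare
BQ_REGION=<your Big Query region>
INSTANCE=<your VM instance name>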

πŸšͺ main.py

Bye bye taxifare.interface.main_local module, you served us well ❀️

Long live taxifare.interface.main, our new package entry point ⭐️ to:

  • preprocess: preprocess the data and store data_processed
  • train: train on processed data and store model weights
  • evaluate: evaluate the performance of the latest trained model on new data
  • pred: make a prediction on a DataFrame with a specific version of the trained model

🚨 One main change in the code of the package is that we chose to delegate some of its work to dedicated modules in order to limit the size of the main.py file. The main changes concern:

  • The project configuration: Single source of truth is .env

    • .envrc tells direnv to load the .env file as environment variables
    • params.py then loads all these variables into Python, and should not be edited manually anymore
  • registry.py: the code evolved to store the trained model either locally or - spoiler alert - in the cloud

    • Notice the new env variable MODEL_TARGET (local or gcs)
  • data.py now exposes 2 refactored functions that we'll use heavily in main.py (sketched below)

    • get_data_with_cache() (get data from Big Query, or from a cached CSV if one exists)
    • load_data_to_bq() (upload data to BQ)
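
To make this concrete, here is a minimal, hypothetical sketch of what these two helpers could look like; the exact signatures, column handling and docstrings of your data.py are the source of truth:

# taxifare/ml_logic/data.py: illustrative sketch only, not the official solution
from pathlib import Path
import pandas as pd
from google.cloud import bigquery

def get_data_with_cache(gcp_project: str, query: str, cache_path: Path) -> pd.DataFrame:
    """Return data from the local CSV cache if it exists, otherwise query Big Query and cache the result."""
    if cache_path.is_file():
        return pd.read_csv(cache_path)
    client = bigquery.Client(project=gcp_project)
    df = client.query(query).result().to_dataframe()
    df.to_csv(cache_path, index=False)  # cache locally to avoid querying BQ twice
    return df

def load_data_to_bq(data: pd.DataFrame, gcp_project: str, bq_dataset: str, table: str, truncate: bool) -> None:
    """Upload a DataFrame to <gcp_project>.<bq_dataset>.<table>, truncating or appending."""
    write_mode = "WRITE_TRUNCATE" if truncate else "WRITE_APPEND"
    job_config = bigquery.LoadJobConfig(write_disposition=write_mode)
    client = bigquery.Client(project=gcp_project)
    job = client.load_table_from_dataframe(data, f"{gcp_project}.{bq_dataset}.{table}", job_config=job_config)
    job.result()  # wait for the load job to finish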

Setup

Install taxifare version 0.0.7

πŸ’» Install the new package version

make reinstall_package # always check what make does in the Makefile

πŸ§ͺ Check the package version

pip list | grep taxifare
# taxifare               0.0.7

Setup direnv & .env

Our goal is to be able to configure the behavior of our package πŸ“¦ depending on the value of the variables defined in a .env project configuration file.

πŸ’» In order to do so, we will install the direnv shell extension. Its job is to locate the nearest .env file in the parent directory structure of the project and load its content into the environment.

# MacOS
brew install direnv

# Ubuntu (Linux or Windows WSL2)
sudo apt update
sudo apt install -y direnv

Once direnv is installed, we need to tell zsh to load direnv whenever the shell starts

code ~/.zshrc

The list of plugins is located at the beginning of the file and should look like this once you add direnv:

plugins=(...direnv)

Start a new zsh window in order to load direnv

πŸ’» At this point, direnv is still not able to load anything, as there is no .env file, so let's create one:

  • Duplicate the .env.sample file and rename the duplicate to .env
  • Enable the project configuration with direnv allow . (the . stands for current directory)

πŸ§ͺ Check that direnv is able to read the environment variables from the .env file:

echo $DATA_SIZE
# 1k --> Let's keep it small!

From now on, every time you need to update the behavior of the project:

  1. Edit .env, save it
  2. Then
direnv reload . # to reload your env variables 🚨🚨

☝️ You will forget that. Prove us wrong 😝

# OK, so for this unit, always keep the data size values small (good practice for dev purposes)
DATA_SIZE=1k
CHUNK_SIZE=200
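
These values then reach your Python code through params.py (see the project structure above). Here is a minimal sketch of how that typically works, assuming plain os.environ access; the real file may add more variables, defaults or validation:

# taxifare/params.py: illustrative sketch
import os

DATA_SIZE = os.environ.get("DATA_SIZE")                # e.g. "1k"
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "200"))
MODEL_TARGET = os.environ.get("MODEL_TARGET", "local")
GCP_PROJECT = os.environ.get("GCP_PROJECT")
BUCKET_NAME = os.environ.get("BUCKET_NAME")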

2️⃣ GCP Setup

❓Instructions (expand me)

Google Cloud Platform will allow you to allocate and use remote resources in the cloud. You can interact with it via:

  • 🌐 console.cloud.google.com
  • πŸ’» Command Line Tools
    • gcloud
    • bq (big query - SQL)
    • gsutil (cloud storage - buckets)

a) gcloud CLI

  • Find the gcloud command that lists your own GCP project ID.
  • πŸ“ Fill in the GCP_PROJECT variable in the .env project configuration with the ID of your GCP project
  • πŸ§ͺ Run the tests with make test_gcp_project
πŸ’‘ Hint

You can use the -h or the --help (more details) flags in order to get contextual help on the gcloud commands or sub-commands; use gcloud billing -h to get the gcloud billing sub-command's help, or gcloud billing --help for more detailed help.

πŸ‘‰ Pressing q is usually the way to exit help mode if the command did not terminate itself (Ctrl + C also works)

Also note that running gcloud without arguments lists all the available sub-commands by group.
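
If you want to sanity-check what you found, two likely candidates are shown below (treat this as a hint, not necessarily the expected answer):

gcloud projects list          # list the projects your account can access, with their IDs
gcloud config list project    # show the project your gcloud CLI is currently configured to use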

b) Cloud Storage (GCS) and the gsutil CLI

The second CLI tool that you will use often allows you to deal with files stored within buckets on Cloud Storage.

We'll use it to store large & unstructured data such as model weights :)

πŸ’» Create a bucket in your GCP account using gsutil

  • Make sure to create the bucket in the region where you are located (use GCP_REGION in the .env)
  • Also fill in the BUCKET_NAME variable with a name of your choice (it must be globally unique and lower case!)

e.g.

BUCKET_NAME=taxifare_<user.github_nickname>
  • direnv reload . ;)

Tip: The CLI can interpolate .env variables by prefixing them with a $ sign (e.g. $GCP_REGION)

🎁 Solution
gsutil ls                                    # list buckets

# make bucket
gsutil mb \
    -l $GCP_REGION \
    -p $GCP_PROJECT \
    gs://$BUCKET_NAME

gsutil rm -r gs://$BUCKET_NAME               # delete bucket

You can also use the Cloud Storage console in order to create a bucket or list the existing buckets and their content.

Do you see how much slower the GCP console (web interface) is compared to the command line?

πŸ§ͺ Run the tests with make test_gcp_bucket

c) Big Query and the bq CLI

Big Query is a data warehouse used to store structured data that can be queried rapidly.

πŸ’‘ To be more precise, Big Query is an online, massively parallel analytical database (as opposed to a transactional database)

  • Data is stored by columns (as opposed to rows, as in PostgreSQL for instance)
  • It's optimized for large transformations such as GROUP BY, JOIN, WHERE, etc.
  • But it's not optimized for frequent row-by-row inserts/deletes

WagonCab actually uses a managed PostgreSQL instance (e.g. Google Cloud SQL) as its main production database, on which its Django app stores and reads hundreds of thousands of individual transactions per day!

Every night, WagonCab launches a "database replication" job that applies the daily diffs of the "main" PostgreSQL database to the "replica" Big Query warehouse. Why?

  • Because you don't want to run analytical queries directly against your production database! That could slow down your users.
  • Because analysis is faster/cheaper on columnar databases
  • Because you also want to integrate other data in your warehouse to JOIN them (e.g marketing data from Google Ads...)

πŸ‘‰ Back to our business:

πŸ’» Let's create our own dataset where we'll store & query preprocessed data !

  • Using bq and the following env variables, create a new dataset called taxifare on your own GCP_PROJECT
BQ_DATASET=taxifare
BQ_REGION=...
GCP_PROJECT=...
  • Then add 3 new tables processed_1k, processed_200k, processed_all
πŸ’‘ Hints

Although the bq command is part of the Google Cloud SDK that you installed on your machine, it does not seem to follow the same help pattern as the gcloud and gsutil commands.

Try running bq without arguments to list the available sub-commands.

What you are looking for is probably in the mk (make) section.

🎁 Solution
bq mk \
    --project_id $GCP_PROJECT \
    --data_location $BQ_REGION \
    $BQ_DATASET

bq mk --location=$BQ_REGION $BQ_DATASET.processed_1k
bq mk --location=$BQ_REGION $BQ_DATASET.processed_200k
bq mk --location=$BQ_REGION $BQ_DATASET.processed_all

bq show
bq show $BQ_DATASET
bq show $BQ_DATASET.processed_1k

πŸ§ͺ Run the tests with make test_big_query

🎁 Look at the make reset_all_files directive --> It resets all local files (CSVs, models, ...) and the data in your BQ tables and buckets, but preserves the local folder structure, the BQ table schemas, and the GCS buckets.

Very useful to reset the state of your challenge if you are uncertain and want to debug it yourself!

πŸ‘‰ You can safely run make reset_all_files now; it will remove the files from unit 01 and make things clearer

πŸ‘‰ Run make show_sources_all to see that you're starting from a blank state!

βœ… When you are all set, track your results on Kitt with make test_kitt (don't wait, this takes > 1min)

3️⃣ βš™οΈ Train locally, with data on the cloud !

❓Instructions (expand me)

🎯 Your goal is to fill in taxifare.interface.main so that you can run each of its 4 routes, one by one

if __name__ == '__main__':
    # preprocess()
    # train()
    # evaluate()
    # pred()

To do so, you can either:

  • πŸ₯΅ Uncomment the routes above, one after the other, and run python -m taxifare.interface.main from your Terminal

  • πŸ˜‡ Smarter: use each of the following make commands that we created for you below

πŸ’‘ Make sure to read each function's docstring carefully
πŸ’‘ Don't try to complete the routes in parallel; fix them one after the other
πŸ’‘ Take time to read the tracebacks carefully, and add breakpoint() to your code or to the test itself (you are 'engineers' now)!

Preprocess

πŸ’‘ Feel free to refer back to main_local.py when needed! Some of the syntax can be re-used

# Call your preprocess()
make run_preprocess
# Then test this route, but with all combinations of states (.env, cached_csv or not)
make test_preprocess

Train

πŸ’‘ Be sure to understand what happens when MODEL_TARGET = 'gcs' vs 'local' (a hedged illustration follows the commands below)
πŸ’‘ We advise you to set verbose=0 on model training to shorten your logs!

make run_train
make test_train
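
As a hedged illustration of that switch (assuming params.py exposes MODEL_TARGET and BUCKET_NAME; your registry.py almost certainly names and organizes things differently):

# Illustrative sketch of the MODEL_TARGET branching in registry.py (not the official solution)
import os
from google.cloud import storage
from taxifare.params import MODEL_TARGET, BUCKET_NAME  # assumed to be loaded from .env

def upload_model_weights(local_path: str) -> None:
    """Keep the model on disk; if MODEL_TARGET == 'gcs', also push it to your GCS bucket."""
    if MODEL_TARGET == "local":
        return  # nothing more to do: the model already sits on the local disk
    if MODEL_TARGET == "gcs":
        client = storage.Client()
        blob = client.bucket(BUCKET_NAME).blob(f"models/{os.path.basename(local_path)}")
        blob.upload_from_filename(local_path)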

Evaluate

Be sure to understand what happens when MODEL_TARGET = 'gcs' vs 'local'

make run_evaluate
make test_evaluate

Pred

This one is easy

make run_pred
make test_pred

βœ… When you are all set, track your results on Kitt with make test_kitt

🏁 Congrats on the heavy refactoring! You now have a very robust package that can be deployed in the cloud to be used with DATA_SIZE='all' πŸ’ͺ

4️⃣ Train in the Cloud with Virtual Machines

❓Instructions (expand me)

Enable the Compute Engine Service

In GCP, many services are not enabled by default. The service to activate in order to use virtual machines is Compute Engine.

❓How do you enable a GCP service?

Find the gcloud command to enable a service.

πŸ’‘ Hints

Enabling an API
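
If you prefer the command line, enabling the Compute Engine API typically looks like this (a hedged example; double-check the exact service name with the list command first):

gcloud services list --available | grep -i "compute engine"   # find the service name (compute.googleapis.com)
gcloud services enable compute.googleapis.com                 # enable the Compute Engine API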

Create your First Virtual Machine

The taxifare package is ready to train on a machine in the cloud. Let's create our first Virtual Machine instance!

❓Create a Virtual Machine

Head over to the GCP console, specifically the Compute Engine page. The console will allow you to easily explore the available options. Make sure to create an Ubuntu instance (read the how-to below and have a look at the hint after it).

πŸ—Ί How to configure your VM instance

Let's explore the options available. The top right of the interface gives you a monthly estimate of the cost for the selected parameters if the VM remains online all the time.

The default options should be enough for what we want to do now, except for one: we want to choose the operating system that the VM instance will be running.

Go to the "Boot disk" section, click on "CHANGE" at the bottom, change the operating system to Ubuntu, and select the latest Ubuntu xx.xx LTS x86/64 (Long Term Support) version.

Ubuntu is the Linux distro that will resemble the configuration on your machine the most, following the Le Wagon setup. Whether you are on a Mac, using Windows WSL2 or on native Linux, selecting this option will allow you to play with a remote machine using the commands you are already familiar with.

πŸ’‘ Hint

In the future, when you know exactly what type of VM you want to create, you will be able to use the gcloud compute instances command if you want to do everything from the command line; for example:

INSTANCE=taxi-instance
IMAGE_PROJECT=ubuntu-os-cloud
IMAGE_FAMILY=ubuntu-2204-lts

gcloud compute instances create $INSTANCE --image-project=$IMAGE_PROJECT --image-family=$IMAGE_FAMILY

πŸ’» Fill in the INSTANCE variable in the .env project configuration

Setup your VM

You have access to virtually unlimited computing power at your fingertips, ready to help with training runs or any other task you might think of.

❓How do you connect to the VM?

The GCP console allows you to connect to the VM instance through a web interface:

(look for the SSH button next to your instance on the Compute Engine page, which opens a browser-based terminal)

You can disconnect by typing exit or closing the window.

A nice alternative is to connect to the virtual machine right from your command line 🀩


All you need to do is run gcloud compute ssh on a running instance, and run exit when you want to disconnect πŸŽ‰

INSTANCE=taxi-instance

gcloud compute ssh $INSTANCE
πŸ’‘ Error 22

If you encounter a port 22: Connection refused error, just wait a little more for the VM instance to complete its startup.

Just run pwd or hostname if you ever wonder on which machine you are running your commands.

❓How do you setup the VM to run your python code?

Let's run a light version of the Le Wagon setup.

πŸ’» Connect to your VM instance and run the commands of the following sections

βš™οΈ zsh and omz (expand me)

The zsh shell and its Oh My Zsh framework are the CLI configuration you are already familiar with. When prompted, make sure to accept making zsh the default shell.

sudo apt update
sudo apt install -y zsh
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

πŸ‘‰ Now the CLI of the remote machine starts to look a little more like the CLI of your local machine

βš™οΈ pyenv and pyenv-virtualenv (expand me)

Clone the pyenv and pyenv-virtualenv repos on the VM:

git clone https://github.com/pyenv/pyenv.git ~/.pyenv
git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv

Open ~/.zshrc in a Terminal code editor:

nano ~/.zshrc

Add pyenv, ssh-agent and direnv to the list of zsh plugins on the line with plugins=(git) in ~/.zshrc: in the end, you should have plugins=(git pyenv ssh-agent direnv). Then, exit and save (Ctrl + X, Y, Enter).

Make sure that the modifications were indeed saved:

cat ~/.zshrc | grep "plugins="

Add the pyenv initialization script to your ~/.zprofile:

cat << EOF >> ~/.zprofile
export PYENV_ROOT="\$HOME/.pyenv"
export PATH="\$PYENV_ROOT/bin:\$PATH"
eval "\$(pyenv init --path)"
EOF

πŸ‘‰ Now we are ready to install Python

βš™οΈ Python (expand me)

Add dependencies required to build Python:

sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
python3-dev

ℹ️ If a window pops up asking you which services to restart, just press Enter.


Now we need to start a new user session so that the updates in ~/.zshrc and ~/.zprofile are taken into account. Run the command below πŸ‘‡:

zsh --login

Install the same Python version that you use for the bootcamp, and create a taxifare-env virtual env. This can take a while and look like it is stuck, but it is not:

# e.g. with 3.10.6
pyenv install 3.10.6
pyenv global 3.10.6
pyenv virtualenv 3.10.6 taxifare-env
pyenv global taxifare-env

βš™οΈ git authentication with GitHub (expand me)

Copy your private key πŸ”‘ to the VM in order to allow it to access your GitHub account.

⚠️ Run this single command on your machine, not in the VM ⚠️

INSTANCE=taxi-instance

# scp stands for secure copy (cp)
gcloud compute scp ~/.ssh/id_ed25519 $USER@$INSTANCE:~/.ssh/

⚠️ Then, resume running commands in the VM ⚠️

Register the key you just copied after starting ssh-agent:

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

Enter your passphrase if asked to.

πŸ‘‰ You are now able to interact with your GitHub account from the virtual machine

βš™οΈ Python code authentication to GCP (expand me)

The code of your package needs to be able to access your Big Query data warehouse.

To do so, we will login to your account using the command below πŸ‘‡

gcloud auth application-default login

❗️ Note: In a full production environment we would create a service account applying the least-privilege principle for the VM, but this is the easiest approach for development.

Let's verify that your Python code can now access your GCP resources. First, install some packages:

pip install -U pip
pip install google-cloud-storage

Then, run Python code from the CLI. This should list your GCP buckets:

python -c "from google.cloud import storage; \
    buckets = storage.Client().list_buckets(); \
    [print(b.name) for b in buckets]"

Your VM is now fully operational with:

  • A Python virtual env (taxifare-env) to run your code
  • The credentials to connect to your GitHub account
  • The credentials to connect to your GCP account

The only thing that is missing is the code of your project!

πŸ§ͺ Let's run a few tests inside your VM Terminal before we install it:

  • Default shell is /usr/bin/zsh
    echo $SHELL
  • Python version is 3.10.6
    python --version
  • Active GCP project is the same as $GCP_PROJECT in your .env file
    gcloud config list project

Your VM is now a data science beast πŸ”₯

Train in the Cloud

Let's run your first training in the cloud!

❓How do you setup and run your project on the virtual machine?

πŸ’» Clone your package, install its requirements

πŸ’‘ Hint

You can copy your code to the VM by cloning your GitHub project with this syntax:

git clone git@github.com:<user.github_nickname>/{{local_path_to("07-ML-Ops/02-Cloud-training/01-Cloud-training")}}

Enter the directory of today's taxifare package (adapt the command):

cd <path/to/the/package/model/dir>

Create directories to save the model and its parameters/metrics:

make reset_local_files

Create a .env file with all required parameters to use your package:

cp .env.sample .env

Fill in the content of the .env file (complete the missing values, change any values that are specific to your virtual machine):

nano .env

Install direnv to load your .env:

sudo apt update
sudo apt install -y direnv

ℹ️ If a window pops up to ask you which services to restart, just press Enter.

Reconnect (simulate a user reboot) so that direnv works:

zsh --login

Allow your .envrc:

direnv allow .

Install the taxifare package (and all its dependencies)!

pip install .

πŸ”₯ Run the preprocessing and the training in the cloud πŸ”₯!

make run_all


Getting a "Project not set" error from GCP services? You can add a GCLOUD_PROJECT environment variable, set to the same value as your GCP_PROJECT
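
For example, on the VM (assuming direnv is already loading your .env):

echo "GCLOUD_PROJECT=$GCP_PROJECT" >> .env   # append it next to your other variables
direnv reload .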

πŸ§ͺ Track your progress on Kitt to conclude (from your VM)

make test_kitt

πŸ‹πŸ½β€β™‚οΈ Go Big: re-run everything with DATA_SIZE = 'all' CHUNK_SIZE=100k chunks for instance πŸ‹πŸ½β€β™‚οΈ!

🏁 Switch OFF your VM to finish πŸŒ’

You can easily start and stop a VM instance from the GCP console, which allows you to see which instances are running.


πŸ’‘ Hint

A faster way to start and stop your virtual machine is to use the command line. The commands still take some time to complete, but you do not have to navigate through the GCP console interface.

Have a look at the gcloud compute instances command in order to start, stop, or list your instances:

INSTANCE=taxi-instance

gcloud compute instances stop $INSTANCE
gcloud compute instances list
gcloud compute instances start $INSTANCE

🚨 Computing power does not grow on trees 🌳, do not forget to switch the VM off whenever you stop using it! πŸ’Έ


🏁 Remember: Switch OFF your VM with gcloud compute instances stop $INSTANCE
