Driblet - Google Cloud-based ML pipeline
- Overview
- Step 1: Environment setup
- Step 2: Data preprocessing
- Step 3: Model training
- Step 4: Cloud services setup
- Step 5: Airflow configuration
Overview
Driblet is a Cloud-based framework that 'partially' automates a machine learning pipeline for structured data. It is only 'partially' automated because modeling is not part of the automated pipeline and must be done manually, though modeling does not require building a model from scratch.
In general, there are four steps to run the end-to-end pipeline:
- Preprocess datasets (train/eval/test)
- Train the model based on the provided model template
- Set up the Cloud environment
- Configure Airflow variables
The following shows the high-level pipeline workflow:
First, set up Google Cloud and Python environments.
Step 1: Environment setup
- Select or create a Google Cloud Platform project.
- Clone the Driblet repository and place it in the ~/driblet directory.
- Create the Python environment by executing the following command:
cd driblet && chmod +x virtualenv.sh && bash virtualenv.sh && \
source ~/driblet-venv/bin/activate && python setup.py develop
This will do the following three steps:
- Create a Python virtual environment
- Activate it
- Install all required Python packages
NOTE: Proceed to the next section only after the above command has executed successfully.
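Optionally, you can sanity check that the Python environment can see your Google Cloud project before continuing. The snippet below is only a sketch and is not part of the repository; it assumes the google-auth package was installed by setup.py and that application-default credentials exist (for example, via gcloud auth application-default login):

```python
# sanity_check.py - optional, not part of the Driblet repository.
# Confirms that application-default credentials and a project are visible
# to this Python environment before moving on to the next steps.
import google.auth

credentials, project_id = google.auth.default()
print(f"Authenticated against project: {project_id}")
```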
Step 2: Data preprocessing
The dataset needs to be preprocessed before the model can be trained. All preprocessing jobs are handled by workflow/dags/tasks/preprocess/transformer.py.
NOTE: The data preprocessing pipeline expects the dataset to already be split into train, eval and test datasets. If your data is in BigQuery, you can use the steps described in this page. Otherwise you can use the TensorFlow Datasets Splits API.
The following is a step-by-step guide on how to run the data preprocessing pipeline.
1. Configure features
Edit workflow/dags/tasks/preprocess/features_config.py to configure the feature columns in your dataset. This file contains feature names for a dummy dataset based on workflow/dags/tasks/preprocess/test_data/. If you check one of the CSV files, you will see that it has multiple feature columns; the ALL_FEATURES variable contains all column names from the CSV. You need to modify the following global variables to match your dataset's features:
- ALL_FEATURES: All feature columns in the dataset
- TARGET_FEATURE: Column with target values
- ID_FEATURE: Column with unique ids
- EXCLUDED_FEATURES: Features to exclude from training
- FORWARD_FEATURE: Feature to be exported along with prediction values
- CATEGORICAL_FEATURES: Features with categorical values
NOTE: There is no need to modify NUMERIC_FEATURES as it is automatically generated based on the above variables.
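For illustration, a hypothetical edit of those globals might look like the following; the column names are made up and the real file may structure things somewhat differently:

```python
# features_config.py - hypothetical example values; replace with your own columns.
ALL_FEATURES = [
    'customer_id', 'signup_date', 'country', 'device',
    'sessions_7d', 'revenue_30d', 'label',
]
TARGET_FEATURE = 'label'              # column with target values
ID_FEATURE = 'customer_id'            # column with unique ids
EXCLUDED_FEATURES = ['signup_date']   # features to exclude from training
FORWARD_FEATURE = 'customer_id'       # exported along with prediction values
CATEGORICAL_FEATURES = ['country', 'device']
# NUMERIC_FEATURES is generated automatically from the variables above.
```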
When the above is done, move on to the next step.
2. Run preprocessing pipeline
Follow the steps described in the Data Preprocessing Guide to preprocess the data before training the model.
Step 3: Model training
The model expects .tfrecord files for the train/eval/test datasets. For a detailed guide on how to train the model, refer to the Model Training Guide.
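If you want to sanity check the preprocessed files before training, a .tfrecord file can be inspected with TensorFlow. The sketch below assumes TensorFlow 2.x and an uncompressed TFRecord file; the file path is a placeholder:

```python
# inspect_tfrecord.py - illustrative sketch; the file path is a placeholder.
import tensorflow as tf

dataset = tf.data.TFRecordDataset("path/to/train.tfrecord")
for raw_record in dataset.take(1):
    # Parse and print the first example to verify the feature columns.
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
```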
After model training has finished, move to the next step to set up the cloud environment needed to run the pipeline.
Step 4: Cloud services setup
Cloud environment setup involves 9 steps, all of which are handled by setup_cloud.py. The step-by-step process is shown in the image below:
Two things need to be done before starting the cloud environment setup process with setup_cloud.py. Before running the script, update the model_dir and schema_file fields in configuration.yaml.
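To double-check that both fields were set before launching the script, you can read them back. This is only a sketch, assuming configuration.yaml is plain YAML with top-level keys (the real file may nest them differently) and that PyYAML is installed:

```python
# check_config.py - optional sanity check; the key layout is an assumption.
import yaml

with open("configuration.yaml") as f:
    config = yaml.safe_load(f)

print("model_dir:", config.get("model_dir"))
print("schema_file:", config.get("schema_file"))
```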
Then run the Python script to kick off the cloud environment setup process:
python setup_cloud.py
This will take ~40 minutes to finish. When it succeeds, move to the next step.
Step 5: Airflow configuration
1. Go to the Google Composer Web UI and launch the Airflow Web UI to manage the workflow. Now you have access to Airflow, which manages the whole predictive workflow.
2. Change the BigQuery dataset and tables to yours. To do so, follow these steps:
2.1. Click Admin -> Variables
2.2. Click the edit icons and set the following values (a programmatic alternative is sketched at the end of this section):
- bq_dataset: Name of your dataset in BigQuery
- bq_input_table: Name of the table under the dataset. This data will be used for prediction.
- bq_output_table: Name of the table that predictions will be copied to. If you don't set it, a driblet_output table will be created under your BigQuery dataset by default.
3. If everything went well, you will see a success status in the Airflow Web UI.
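As an optional alternative to the Web UI, the same variables can be set programmatically through Airflow's Variable model. The sketch below uses placeholder values and must run somewhere with access to the Composer environment's Airflow metadata database (for example, from a one-off DAG or the Airflow CLI environment):

```python
# set_airflow_variables.py - placeholder values; requires access to Airflow's metadata database.
from airflow.models import Variable

Variable.set("bq_dataset", "my_dataset")
Variable.set("bq_input_table", "my_input_table")
Variable.set("bq_output_table", "driblet_output")
```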