Driblet - Google Cloud-based ML pipeline
- Overview
- Step 1: Environment setup
- Step 2: Data preprocessing
- Step 3: Model training
- Step 4: Cloud services setup
- Step 5: Airflow configuration
Overview
Driblet is a Cloud-based framework that 'partially' automates a machine learning pipeline for structured data. It is only 'partially' automated because modeling is not part of the automated pipeline and must be done manually, though modeling does not require building a model from scratch.
In general, there are four steps to run the end-to-end pipeline:
- Preprocess datasets (train/eval/test)
- Train the model based on the provided model template
- Set up the Cloud environment
- Configure Airflow variables
The following shows the high-level pipeline workflow:
First, set up Google Cloud and Python environments.
Step 1: Environment setup
- Select or create a Google Cloud Platform project.
- Clone the Driblet repository and place it in the ~/driblet directory.
- Create the Python environment by executing the following command:
cd driblet && chmod +x virtualenv.sh && bash virtualenv.sh && \
source ~/driblet-venv/bin/activate && python setup.py develop
This will do the following three steps:
- Create a Python virtual environment
- Activate it
- Install all required Python packages
NOTE: Proceed to the next section only after the above command has executed successfully.
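Optionally, you can sanity check that the Python environment can see your Google Cloud project before continuing. The snippet below is only a sketch and is not part of the repository; it assumes the google-auth package was installed by setup.py and that application-default credentials exist (for example, via gcloud auth application-default login):

```python
# sanity_check.py - optional, not part of the Driblet repository.
# Confirms that application-default credentials and a project are visible
# to this Python environment before moving on to the next steps.
import google.auth

credentials, project_id = google.auth.default()
print(f"Authenticated against project: {project_id}")
```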
Step 2: Data preprocessing
The dataset needs to be preprocessed before the model can be trained. All preprocessing jobs are handled by workflow/dags/tasks/preprocess/transformer.py.
NOTE: The data preprocessing pipeline expects the dataset to already be split into train, eval and test datasets. If your data is in BigQuery, you can use the steps described in this page. Otherwise you can use the TensorFlow Datasets Splits API.
The following is a step-by-step guide on how to run the data preprocessing pipeline.
1. Configure features
Edit workflow/dags/tasks/preprocess/features_config.py to configure the feature columns in your dataset. This file contains feature names for a dummy dataset based on workflow/dags/tasks/preprocess/test_data/. If you check one of the CSV files, you will see that it has multiple feature columns; the ALL_FEATURES variable contains all column names from the CSV. You need to modify the following global variables to match your dataset's features:
- ALL_FEATURES: All feature columns in the dataset
- TARGET_FEATURE: Column with target values
- ID_FEATURE: Column with unique ids
- EXCLUDED_FEATURES: Features to exclude from training
- FORWARD_FEATURE: Feature to be exported along with prediction values
- CATEGORICAL_FEATURES: Features with categorical values
NOTE: There is no need to modify NUMERIC_FEATURES as it is automatically generated based on the above variables.
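For illustration, a hypothetical edit of those globals might look like the following; the column names are made up and the real file may structure things somewhat differently:

```python
# features_config.py - hypothetical example values; replace with your own columns.
ALL_FEATURES = [
    'customer_id', 'signup_date', 'country', 'device',
    'sessions_7d', 'revenue_30d', 'label',
]
TARGET_FEATURE = 'label'              # column with target values
ID_FEATURE = 'customer_id'            # column with unique ids
EXCLUDED_FEATURES = ['signup_date']   # features to exclude from training
FORWARD_FEATURE = 'customer_id'       # exported along with prediction values
CATEGORICAL_FEATURES = ['country', 'device']
# NUMERIC_FEATURES is generated automatically from the variables above.
```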
When the above is done, move on to the next step.
2. Run preprocessing pipeline
Follow the steps described in the Data Preprocessing Guide to preprocess the data before training the model.
Step 3: Model training
The model expects .tfrecord files for the train/eval/test datasets. For a detailed guide on how to train the model, refer to the Model Training Guide.
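If you want to sanity check the preprocessed files before training, a .tfrecord file can be inspected with TensorFlow. The sketch below assumes TensorFlow 2.x and an uncompressed TFRecord file; the file path is a placeholder:

```python
# inspect_tfrecord.py - illustrative sketch; the file path is a placeholder.
import tensorflow as tf

dataset = tf.data.TFRecordDataset("path/to/train.tfrecord")
for raw_record in dataset.take(1):
    # Parse and print the first example to verify the feature columns.
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
```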
After model training has finished, move to the next step to set up the cloud environment needed to run the pipeline.
Step 4: Cloud services setup
Cloud environment setup involves 9 steps, all of which are handled by setup_cloud.py. The step-by-step process is shown in the image below:
Two things need to be done before starting the cloud environment setup process with setup_cloud.py. Before running the script, update the model_dir and schema_file fields in configuration.yaml.
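To double-check that both fields were set before launching the script, you can read them back. This is only a sketch, assuming configuration.yaml is plain YAML with top-level keys (the real file may nest them differently) and that PyYAML is installed:

```python
# check_config.py - optional sanity check; the key layout is an assumption.
import yaml

with open("configuration.yaml") as f:
    config = yaml.safe_load(f)

print("model_dir:", config.get("model_dir"))
print("schema_file:", config.get("schema_file"))
```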
Then run the Python script to kick off the cloud environment setup process:
python setup_cloud.py
This will take ~40 minutes to finish. When it succeeds, move to the next step.
Step 5: Airflow configuration
1. Go to the Google Composer Web UI and launch the Airflow Web UI to manage the workflow. Now you have access to Airflow, which manages the whole predictive workflow.
2. Change the BigQuery dataset and tables to yours. To do so, follow these steps:
2.1. Click Admin -> Variables
2.2. Click the edit icons and set the following values (a programmatic alternative is sketched at the end of this section):
- bq_dataset: Name of your dataset in BigQuery
- bq_input_table: Name of the table under the dataset. This data will be used for prediction.
- bq_output_table: Name of the table that predictions will be copied to. If you don't set it, a driblet_output table will be created under your BigQuery dataset by default.
3. If everything went well, you will see a success status in the Airflow Web UI.
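As an optional alternative to the Web UI, the same variables can be set programmatically through Airflow's Variable model. The sketch below uses placeholder values and must run somewhere with access to the Composer environment's Airflow metadata database (for example, from a one-off DAG or the Airflow CLI environment):

```python
# set_airflow_variables.py - placeholder values; requires access to Airflow's metadata database.
from airflow.models import Variable

Variable.set("bq_dataset", "my_dataset")
Variable.set("bq_input_table", "my_input_table")
Variable.set("bq_output_table", "driblet_output")
```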