Driblet - Google Cloud based ML pipeline

Overview

Driblet is a Cloud-based framework that partially automates the machine learning pipeline for structured data. It is only 'partially' automated because modeling is not part of the automated pipeline and must be done manually, although modeling does not require building a model from scratch (a model template is provided).

In general, there are four steps to run the end-to-end pipeline:

  1. Preprocess the datasets (train/eval/test)
  2. Train the model based on the provided model template
  3. Set up the Cloud environment
  4. Configure Airflow variables

The following diagram shows the high-level pipeline workflow:

Pipeline architecture

First, set up Google Cloud and Python environments.

Step 1: Environment setup

  1. Select or create a Google Cloud Platform project - link.

  2. Clone the Driblet repository into the ~/driblet directory. Clicking the following button will do it for you.

    Download in Google Cloud Shell

  3. Create the Python environment by executing the following command:

cd driblet && chmod +x virtualenv.sh && bash virtualenv.sh && \
  source ~/driblet-venv/bin/activate && python setup.py develop

This will do the following three steps:

  1. Create a Python virtual environment
  2. Activate it
  3. Install all required Python packages

NOTE: Proceed to the next section only after the above command has executed successfully.

Step 2: Data preprocessing

The dataset needs to be preprocessed before the model can be trained. All preprocessing jobs are handled by workflow/dags/tasks/preprocess/transformer.py.

NOTE: The data preprocessing pipeline expects the dataset to already be split into train, eval and test datasets. If your data is in BigQuery, you can use the steps described in this page, for example a hash-based split as sketched below. Otherwise you can use the TensorFlow Datasets Splits API.
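As a hedged sketch of such a split (the project id, dataset, table and id column names below are placeholders, and the 80/10/10 ratio is only an illustration), a deterministic hash-based split in BigQuery could look like this:

from google.cloud import bigquery  # requires the google-cloud-bigquery package

client = bigquery.Client(project='my-project')  # hypothetical project id

# Hash each row's unique id into 10 buckets and route buckets to splits:
# 0-7 -> train (80%), 8 -> eval (10%), 9 -> test (10%).
SPLITS = {'train': 'bucket < 8', 'eval': 'bucket = 8', 'test': 'bucket = 9'}

for split, condition in SPLITS.items():
    query = f"""
    SELECT * EXCEPT(bucket) FROM (
      SELECT *, MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) AS bucket
      FROM `my-project.my_dataset.source_table`
    )
    WHERE {condition}
    """
    job_config = bigquery.QueryJobConfig(
        destination=f'my-project.my_dataset.{split}_table',  # hypothetical output tables
        write_disposition='WRITE_TRUNCATE',
    )
    client.query(query, job_config=job_config).result()

Hashing the id column keeps the split deterministic, so re-running the query assigns every row to the same split.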

Processing architecture

The following is a step-by-step guide on how to run the data preprocessing pipeline.

1. Configure features

Edit workflow/dags/tasks/preprocess/features_config.py to configure the feature columns in your dataset. This file contains feature names for a dummy dataset based on workflow/dags/tasks/preprocess/test_data/. If you inspect one of the CSV files, you will see that it has multiple features:

CSV features

The ALL_FEATURES variable contains all column names from the CSV. You need to modify the following global variables to match your dataset's features (see the sketch below for an example):

  • ALL_FEATURES: All feature columns in dataset
  • TARGET_FEATURE: Column with target values
  • ID_FEATURE: Column with unique ids
  • EXCLUDED_FEATURES: Features to exclude from training
  • FORWARD_FEATURE: Feature to be exported along with prediction values.
  • CATEGORICAL_FEATURES: Features with categorical values

NOTE: There is no need to modify NUMERIC_FEATURES, as it is automatically generated based on the above variables.
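As a minimal sketch, the globals in features_config.py for a hypothetical customer dataset might look like the following (only the variable names come from the file; the column names are made up):

ALL_FEATURES = ['customer_id', 'age', 'country', 'device', 'total_visits', 'has_purchased']
TARGET_FEATURE = 'has_purchased'
ID_FEATURE = 'customer_id'
EXCLUDED_FEATURES = ['customer_id']
FORWARD_FEATURE = 'customer_id'
CATEGORICAL_FEATURES = ['country', 'device']
# NUMERIC_FEATURES is derived automatically from the variables above, so it is not set here.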

When the above is done, move on to the next step.

2. Run preprocessing pipeline

Follow the steps described in the Data Preprocessing Guide to preprocess the data before training the model.

Step 3: Model training

The model expects .tfrecord files for the train/eval/test datasets (a minimal reader sketch follows below). For a detailed guide on how to train the model, refer to the Model Training Guide.
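If you want to sanity-check one of the generated files, a minimal sketch for reading it with tf.data might look like this (the file path and feature spec are hypothetical; the real spec is defined by your features_config.py and the preprocessing output):

import tensorflow as tf

# Hypothetical feature spec; in practice it must match the transformed features.
feature_spec = {
    'total_visits': tf.io.FixedLenFeature([], tf.float32),
    'has_purchased': tf.io.FixedLenFeature([], tf.int64),
}

dataset = tf.data.TFRecordDataset(['train-00000-of-00001.tfrecord'])  # placeholder path
dataset = dataset.map(lambda record: tf.io.parse_single_example(record, feature_spec))

for example in dataset.take(1):
    print(example)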

After model training has finished, move to the next step to set up the cloud environment to run the pipeline.

Step 4: Cloud services setup

Cloud environment setup involves nine steps, all of which are handled by setup_cloud.py. The step-by-step process is shown in the image below:

Cloud-setup

Two fields need to be updated before starting the cloud environment setup process with setup_cloud.py: before running the script, set the model_dir and schema_file fields in configuration.yaml.

Configuration
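As a hedged sketch (assuming the two fields live at the top level of configuration.yaml; the GCS paths are placeholders), the update could also be done programmatically:

import yaml  # requires the PyYAML package

CONFIG_PATH = 'configuration.yaml'

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

# Point these at your trained model artifacts; the paths below are placeholders.
config['model_dir'] = 'gs://my-bucket/driblet/model'
config['schema_file'] = 'gs://my-bucket/driblet/schema.pbtxt'

with open(CONFIG_PATH, 'w') as f:
    yaml.safe_dump(config, f)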

Then run the Python script to kick off the cloud environment setup process:

python setup_cloud.py

This will take ~40 minutes to finish. When it succeeds, move to the next step.

Step 5: Airflow configuration

  1. Go to the Cloud Composer Web UI and launch the Airflow Web UI to manage the workflow.

    Composer Web UI

    Now you have access to Airflow, which manages the whole predictive workflow:

    Airflow Web UI

  2. Change the BigQuery dataset and tables to yours. To do so, follow these steps:

    2.1. Click Admin -> Variables

    Airflow Web UI

    2.2. Click the edit icons and set the following values

    • bq_dataset: Name of your dataset in BigQuery

    • bq_input_table: Name of the table under the dataset. This data will be used for prediction.

    • bq_output_table: Name of the table that predictions will be copied to. If you don't set this, a driblet_output table will be created under your BigQuery dataset by default. See the sketch below for how these variables might be read in the workflow.

      Airflow Web UI
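For reference, here is a hypothetical sketch of how a DAG task might read these variables (the actual Driblet DAG code may differ):

from airflow.models import Variable

bq_dataset = Variable.get('bq_dataset')
bq_input_table = Variable.get('bq_input_table')
# Falls back to the default output table name mentioned above.
bq_output_table = Variable.get('bq_output_table', default_var='driblet_output')

input_table = f'{bq_dataset}.{bq_input_table}'    # e.g. my_dataset.my_input
output_table = f'{bq_dataset}.{bq_output_table}'  # e.g. my_dataset.driblet_output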

If everything went well, you will see a success status in the Airflow Web UI:

Airflow Variables
