This repository compiles prescriptive guidance and code samples for operationalizing the NVIDIA Merlin framework on Google Cloud Vertex AI.
NVIDIA Merlin is an open-source framework for building large-scale deep learning recommender systems. Vertex AI is Google Cloud's unified machine learning platform, designed to help data scientists and machine learning engineers increase experimentation, deploy faster, and manage models with confidence.
The content in this repository centers around five core scenarios:
- Setting up a Merlin experimentation and development environment in Vertex AI Workbench.
- Operationalizing large-scale data preprocessing pipelines with NVIDIA Merlin NVTabular and Vertex AI Pipelines.
- Training large-scale deep learning ranking models with NVIDIA Merlin HugeCTR and Vertex AI Training.
- Deploying models and serving predictions with NVIDIA Triton Inference Server and Vertex AI Prediction.
- Implementing end-to-end data preprocessing, training, and deployment pipelines with Vertex AI Pipelines.
We assume that users of this repository have practical experience with both NVIDIA Merlin and Vertex AI. If you are unfamiliar with some of the core concepts or technologies used in this repo, we recommend referring to the NVIDIA Merlin and Vertex AI documentation.
The dataset used by all samples in this repo is the Criteo 1TB Click Logs dataset provided by the Criteo AI Lab.
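For orientation, each record in the Criteo click logs holds a click label, 13 integer features, and 26 hashed categorical features, stored as tab-separated text. The sketch below captures that layout. The column names (`label`, `I1` to `I13`, `C1` to `C26`) are a common convention and an assumption here, not necessarily the names used by the notebooks, and `parse_record` is a hypothetical helper.

```python
# Column layout of one Criteo click-log record.
# The column names are a common convention and an assumption of this sketch.
NUM_CONTINUOUS = 13   # integer (count) features
NUM_CATEGORICAL = 26  # hashed categorical features

LABEL_COLUMNS = ["label"]
CONTINUOUS_COLUMNS = [f"I{i}" for i in range(1, NUM_CONTINUOUS + 1)]
CATEGORICAL_COLUMNS = [f"C{i}" for i in range(1, NUM_CATEGORICAL + 1)]
ALL_COLUMNS = LABEL_COLUMNS + CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS


def parse_record(line: str) -> dict:
    """Split one tab-separated record into a column -> raw string mapping.

    Missing values appear as empty strings and are kept as-is.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(ALL_COLUMNS):
        raise ValueError(f"expected {len(ALL_COLUMNS)} fields, got {len(values)}")
    return dict(zip(ALL_COLUMNS, values))
```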
The figure below summarizes the high-level architecture of the solution demonstrated in this repo.
Commercial recommenders are trained on huge datasets, often several hundred terabytes in size. At this scale, data preprocessing steps often take much more time than training the recommender models themselves. NVTabular, a core component of Merlin, is a feature engineering and preprocessing library designed to efficiently manipulate terabyte-scale recommender system datasets and significantly reduce data preparation time.
In this repo we demonstrate how to operationalize NVTabular data preprocessing workflows using Vertex AI Pipelines and multi-GPU processing nodes. The repo includes two samples of reusable and customizable data preprocessing pipelines developed using the Kubeflow Pipelines SDK and Google Cloud pipeline components.
The first pipeline demonstrates how to process large CSV datasets managed in Google Cloud Storage. The second pipeline ingests source data from Google BigQuery.
NVIDIA HugeCTR is NVIDIA's GPU-accelerated, highly scalable recommender framework. HugeCTR facilitates implementations of leading deep learning recommender models such as Google's Wide and Deep and Facebook's DLRM.
The repo includes an example of how to operationalize training and hyperparameter tuning of the HugeCTR implementation of the DeepFM model using Vertex AI Training and massively scalable A2 workers.
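As a rough illustration of what an A2 worker looks like in a Vertex AI custom training job, the sketch below builds a worker pool specification for a single A2 node. The field names follow the Vertex AI worker pool spec format. The helper name, the image URI, and the training arguments are hypothetical placeholders, not values taken from this repo.

```python
# A2 machine types keyed by the number of attached A100 GPUs.
A2_MACHINE_TYPES = {
    1: "a2-highgpu-1g",
    2: "a2-highgpu-2g",
    4: "a2-highgpu-4g",
    8: "a2-highgpu-8g",
}


def a2_worker_pool_spec(gpu_count: int, image_uri: str, args: list) -> dict:
    """Build one worker pool spec for a single A2 node (hypothetical helper)."""
    if gpu_count not in A2_MACHINE_TYPES:
        raise ValueError(f"unsupported A2 GPU count: {gpu_count}")
    return {
        "machine_spec": {
            "machine_type": A2_MACHINE_TYPES[gpu_count],
            "accelerator_type": "NVIDIA_TESLA_A100",
            "accelerator_count": gpu_count,
        },
        "replica_count": 1,
        # The image URI and arguments are supplied by the caller; the values
        # used in the example call are placeholders.
        "container_spec": {"image_uri": image_uri, "args": list(args)},
    }


example_spec = a2_worker_pool_spec(
    4, "gcr.io/your-project/your-training-image", ["--max_iter=1000"]
)
```

A list containing this dictionary can be passed as `worker_pool_specs` when creating a custom job with the Vertex AI SDK.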
NVIDIA Triton Inference Server is a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports ensemble models that can be used to implement multi-step inference pipelines.
The repo includes a sample that demonstrates how to create, deploy, and serve a Triton ensemble model using Vertex AI Prediction. The example ensemble implements an inference pipeline that integrates an NVTabular data preprocessing workflow with a HugeCTR deep learning model.
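To make the ensemble idea concrete, the sketch below renders the `ensemble_scheduling` portion of a Triton model configuration that chains a preprocessing step into a ranking model. In each map, the key is the tensor name on the step's model and the value is the tensor name inside the ensemble. All model and tensor names here are hypothetical; refer to the notebooks for the configuration actually used.

```python
def render_map(kind: str, mapping: dict, indent: str) -> str:
    """Render input_map/output_map entries in protobuf text format."""
    return "".join(
        f'{indent}{kind} {{ key: "{k}" value: "{v}" }}\n' for k, v in mapping.items()
    )


def render_ensemble_scheduling(steps: list) -> str:
    """Render an ensemble_scheduling block from a list of step dictionaries."""
    rendered = []
    for step in steps:
        rendered.append(
            "    {\n"
            f'      model_name: "{step["model_name"]}"\n'
            "      model_version: -1\n"  # -1 selects the latest model version
            + render_map("input_map", step["input_map"], "      ")
            + render_map("output_map", step["output_map"], "      ")
            + "    }"
        )
    return "ensemble_scheduling {\n  step [\n" + ",\n".join(rendered) + "\n  ]\n}\n"


# Hypothetical two-step pipeline: preprocessing feeds the ranking model
# through the intermediate ensemble tensor "FEATURES".
EXAMPLE_STEPS = [
    {
        "model_name": "nvt_preprocess",
        "input_map": {"raw_features": "RAW"},
        "output_map": {"processed": "FEATURES"},
    },
    {
        "model_name": "hugectr_ranker",
        "input_map": {"features": "FEATURES"},
        "output_map": {"prediction": "SCORE"},
    },
]

config_text = render_ensemble_scheduling(EXAMPLE_STEPS)
```

The intermediate tensor name is what links the steps: the first step writes `FEATURES` and the second step reads it.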
Design patterns and best practices outlined in the previous sections come together in the last component of the solution: a reference implementation of a machine learning pipeline that integrates data preprocessing, training, and deployment into a unified end-to-end workflow.
A flexible and powerful experimentation and development environment is critical in recommender system projects. In the environment setup section of this repo we outline the steps to configure the environment depicted in the figure below.
The environment is based on Vertex AI Workbench. Container images based on the NVIDIA NGC Merlin training and Merlin inference images are installed as managed notebooks kernels, augmenting the standard features of a managed notebook instance, which include UI and programmatic interfaces to Google Cloud services.
The core content of the repository comprises a series of Jupyter notebooks and a set of Python modules. The notebooks compile detailed guidance on implementing the solution's components described at a high level in the previous section. The Python modules encapsulate reusable code components that are used in the notebooks and in Vertex AI jobs and pipelines.
Currently, the repo includes the following notebooks:
- 00-dataset-management - describes and explores the Criteo dataset, and loads it to BigQuery.
- 01-dataset-preprocessing - guidance for large-scale data preprocessing with NVTabular and Vertex Pipelines.
- 02-model-training-hugectr - guidance for training HugeCTR models with Vertex Training.
- 03-model-inference-hugectr - guidance for deploying Triton ensemble models with Vertex Prediction.
- 04-e2e-pipeline - guidance for building an end-to-end data preprocessing, training, and deployment pipeline.
The Python modules are in the src folder:
- src/pipelines - pipeline and pipeline component definitions
- src/preprocessing - data preprocessing utility functions and classes
- src/serving - deployment and serving utility functions and classes
- src/training - model definitions and training loops
The src folder also includes Dockerfiles for custom container images used by Vertex Pipelines, Vertex Training, and Vertex Prediction. Refer to the notebooks for more detailed information.
This section outlines the steps to configure a GCP environment required to run the code samples in this repo.
In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
From Cloud Shell, run the following gcloud command to enable the required Cloud APIs:
PROJECT_ID=<YOUR PROJECT ID>
gcloud services enable \
aiplatform.googleapis.com \
bigquery.googleapis.com \
bigquerystorage.googleapis.com \
cloudapis.googleapis.com \
cloudbuild.googleapis.com \
compute.googleapis.com \
containerregistry.googleapis.com \
notebooks.googleapis.com \
storage.googleapis.com \
--project=${PROJECT_ID}
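If you prefer to script this step, the same command can be assembled and run from Python. The sketch below only wraps the `gcloud services enable` invocation shown above. The helper names are hypothetical, and running it assumes `gcloud` is installed and authenticated.

```python
import subprocess

# The services listed in the gcloud command above.
REQUIRED_SERVICES = [
    "aiplatform.googleapis.com",
    "bigquery.googleapis.com",
    "bigquerystorage.googleapis.com",
    "cloudapis.googleapis.com",
    "cloudbuild.googleapis.com",
    "compute.googleapis.com",
    "containerregistry.googleapis.com",
    "notebooks.googleapis.com",
    "storage.googleapis.com",
]


def build_enable_command(project_id: str) -> list:
    """Return the argument list for `gcloud services enable` (hypothetical helper)."""
    return ["gcloud", "services", "enable", *REQUIRED_SERVICES, f"--project={project_id}"]


def enable_services(project_id: str) -> None:
    """Run the command; raises CalledProcessError if gcloud exits with an error."""
    subprocess.run(build_enable_command(project_id), check=True)
```

Passing the arguments as a list avoids shell quoting issues and keeps the service list in one place.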
The notebooks in the repo require access to a GCS bucket that is used for staging and managing ML artifacts created by the workflows implemented in the notebooks. The bucket should be in the same GCP region that you will use to run Vertex AI jobs and pipelines.
From Cloud Shell, run the following command to create the bucket:
REGION=<YOUR REGION>
BUCKET_NAME=<YOUR BUCKET_NAME>
gsutil mb -l $REGION gs://$BUCKET_NAME
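Because the bucket and the Vertex AI jobs should share a region, it can help to compare the two explicitly. The minimal sketch below assumes you have already obtained the bucket's location string; Cloud Storage reports locations in upper case (for example `US-CENTRAL1`), while Vertex AI regions are written in lower case. The function name is hypothetical.

```python
def same_region(bucket_location: str, vertex_region: str) -> bool:
    """Return True when a bucket location matches a Vertex AI region name.

    The comparison ignores case and surrounding whitespace. A multi-region
    bucket location such as "US" does not match any single region and
    therefore returns False.
    """
    return bucket_location.strip().lower() == vertex_region.strip().lower()
```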
To run the notebooks in this repo you will use a custom container image that will be configured as a Vertex AI Workbench managed notebooks kernel. The image is based on the NVIDIA NGC Merlin training image, augmented with additional packages required to interface with Vertex AI.
From Cloud Shell, run the following commands to create the container image:
- Get the Dockerfile for the Merlin development image:
SRC_REPO=https://github.com/GoogleCloudPlatform/merlin-on-vertex
LOCAL_DIR=merlin-env-setup
kpt pkg get $SRC_REPO/env@main $LOCAL_DIR
cd $LOCAL_DIR
- Build and push the development image:
PROJECT_ID=merlin-on-vertex # change to your project id.
IMAGE_URI=gcr.io/${PROJECT_ID}/merlin-dev-vertex
gcloud builds submit --timeout "2h" --tag ${IMAGE_URI} . --machine-type=e2-highcpu-8
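The build command above tags the image without an explicit version, so the registry stores it under the default `latest` tag. This is why the notebook setup step refers to `merlin-dev-vertex:latest`. The hypothetical helper below makes that default explicit.

```python
def with_default_tag(uri: str) -> str:
    """Append ':latest' when the image name carries no tag or digest.

    Only the last path component is inspected, so a registry host that
    includes a port (host:port/...) is not mistaken for a tag.
    """
    last_component = uri.rsplit("/", 1)[-1]
    if ":" in last_component or "@" in last_component:
        return uri
    return f"{uri}:latest"


# "your-project-id" is a placeholder, not a real project.
untagged = "gcr.io/your-project-id/merlin-dev-vertex"
tagged = with_default_tag(untagged)
```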
This section provides the steps for provisioning a Vertex AI Workbench managed notebook and configuring a custom kernel based on the image created in the previous step.
- Follow the instructions in the Create a managed notebooks instance how-to guide:
  - In the Use custom Docker images settings, enter the name of the image you created in the previous step: gcr.io/<your-project-id>/merlin-dev-vertex:latest
  - In the Configure hardware settings, select your GPU configuration. We recommend a machine with two NVIDIA Tesla T4 or NVIDIA Tesla A100 GPUs.
After the Vertex Workbench managed notebook is created, perform the following steps:
- Click on the OPEN JUPYTERLAB link on the notebook instance.
- Click on the New Launcher button, then start a new terminal session.
- Clone the repository to your notebook instance:
git clone https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai
cd nvidia-merlin-on-vertex-ai