bert-as-a-service_tfx's Introduction


BERT as a service

This repository demonstrates a simple yet complete machine learning solution that uses a BERT model for text sentiment analysis, built as an end-to-end TensorFlow Extended (TFX) pipeline and following some of the best practices from the MLOps domain. It covers every step from data ingestion to model serving, where the model can be consumed through either REST or gRPC requests.
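
As a rough illustration of the REST option, the sketch below builds a prediction request in the format used by TensorFlow Serving's REST API (`POST /v1/models/<name>:predict` with an `instances` list). The host, port, and model name `bert_sentiment` are hypothetical, and so is the assumption that the served model accepts raw text strings; the real values depend on how the model is deployed and on its serving signature.

```python
import json
import urllib.request


def build_predict_request(texts, host="localhost", port=8501,
                          model_name="bert_sentiment"):
    """Build (but do not send) a TensorFlow Serving REST predict request.

    host, port and model_name are placeholder values for illustration.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    # Row format: one entry in "instances" per input example.
    payload = json.dumps({"instances": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(["A wonderful, moving film."])
print(request.full_url)
# Sending the request requires a running model server:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())["predictions"]
```

The request is only constructed here, so the snippet runs without a server; uncomment the last lines once a model is being served.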


Content

  • Pipelines
    • Notebook (Google Colab)
    • GCP (Kubeflow) [link]
    • GCP (Vertex AI) [link]
    • Local (Airflow) TODO
  • Documentation [link]
  • Data [link]

Pipeline description

The end-to-end TFX pipeline covers most of the main areas of a machine learning solution, from data ingestion and validation to model training and serving. Each step is described below. This repository also aims to provide different options for managing the pipeline through orchestrators: Airflow, Kubeflow, and an interactive option that can be run on Google Colab for demonstration purposes.

  • ExampleGen is the initial input component of a pipeline that ingests and optionally splits the input dataset.
    • Reads the IMDB dataset stored as a CSV file and splits the data into train (2/3) and validation (1/3).
  • StatisticsGen calculates statistics for the dataset.
    • Generate statistics for text and label distribution.
  • SchemaGen examines the statistics and creates a data schema.
  • ExampleValidator looks for anomalies and missing values in the dataset.
    • Validates the input data based on the SchemaGen's schema.
  • Transform performs feature engineering on the dataset.
    • Impute missing data and do basic data pre-processing.
  • Tuner uses KerasTuner to perform hyperparameter tuning for the model.
    • The optimal hyperparameters will be used by the Trainer.
  • Trainer trains the model.
    • Train the custom pre-trained BERT model; this model also has a built-in text tokenizer.
  • Resolver supplies existing artifacts needed for model validation.
    • Resolve a model to be used as a baseline for model validation.
  • Evaluator performs deep analysis of the training results and helps you validate your exported models, ensuring that they are "good enough" to be pushed to production.
    • Evaluate the model's accuracy over the complete dataset and across different data slices, and evaluate new models against a baseline.
  • InfraValidator is used as an early warning layer before pushing a model into production. The name "infra" validator comes from the fact that it validates the model in the actual model serving "infrastructure".
  • Pusher deploys the model on a serving infrastructure.
    • Export the model for serving if the new model improved over the baseline.
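
The 2/3 train and 1/3 validation split described for ExampleGen can be pictured as hash-bucket assignment: each record is hashed into one of three buckets, two of which go to training and one to validation. The sketch below is a plain-Python illustration of that idea, not TFX's implementation; the function name `assign_split` and the choice of MD5 are assumptions made for the example.

```python
import hashlib


def assign_split(record: str, train_buckets: int = 2, eval_buckets: int = 1) -> str:
    """Deterministically assign a record to 'train' or 'eval' by hashing it."""
    total = train_buckets + eval_buckets
    digest = hashlib.md5(record.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total
    return "train" if bucket < train_buckets else "eval"


# Synthetic stand-ins for review texts.
reviews = [f"review number {i}" for i in range(3000)]
counts = {"train": 0, "eval": 0}
for review in reviews:
    counts[assign_split(review)] += 1

# The proportions approach 2/3 and 1/3 as the dataset grows.
print(counts)
```

Because the assignment depends only on the record's content, the same record always lands in the same split across runs, which keeps evaluation data from leaking into training when the pipeline is re-executed.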

Model description

For the modeling part, we use the BERT model with transfer learning for better performance. This means starting from a model that was pre-trained on another task (usually a more generic or similar one). From the pre-trained model we keep all layers up to the output of the last embedding and, more specifically, use only the output of the [CLS] token, shown in the image below. On top of that we add a classifier layer, which is responsible for classifying the input text as positive or negative. This task is known as sentiment analysis and is very common in natural language processing.

image source
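
The classification head described in this section can be illustrated with a toy example. Here the encoder output is faked with small hand-written vectors (real BERT models produce vectors with hundreds of dimensions), and the weights are arbitrary values rather than trained parameters. Only the data flow is the point: take the vector at the [CLS] position, apply one dense layer, and squash the result with a sigmoid.

```python
import math


def classify_from_cls(sequence_output, weights, bias):
    """Return P(positive) from the [CLS] vector of an encoder output.

    sequence_output: list of per-token vectors; index 0 is the [CLS] token.
    weights, bias: parameters of a single-unit dense classifier layer.
    """
    cls_vector = sequence_output[0]  # only the [CLS] position is used
    logit = sum(w * x for w, x in zip(weights, cls_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


# Toy encoder output: 3 token positions, 4 dimensions each.
toy_sequence_output = [
    [0.5, -0.1, 0.3, 0.8],   # [CLS]
    [0.2, 0.9, -0.4, 0.1],
    [-0.3, 0.4, 0.7, 0.0],
]
toy_weights = [1.0, -0.5, 0.25, 0.75]  # arbitrary, untrained values
toy_bias = -0.2

probability = classify_from_cls(toy_sequence_output, toy_weights, toy_bias)
label = 1 if probability >= 0.5 else 0  # 1 = positive, 0 = negative
print(label, round(probability, 3))
```

In an actual training setup the dense layer's weights would be learned together with (or on top of) the pre-trained encoder, typically with a deep learning framework rather than hand-written arithmetic.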


Dataset description

The dataset used for training and evaluating the model is the well-known IMDB review dataset. It has 25,000 movie reviews, each either negative (label 0) or positive (label 1). The dataset was lightly processed for use here: labels were encoded as integers (0 or 1) and, for faster experimentation, the data was reduced to only 5,000 samples.
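
The preprocessing described in this section (integer label encoding and reduction to a smaller sample) could look roughly like the following. The column names `review` and `sentiment`, the string labels `negative`/`positive`, and the use of random sampling with a fixed seed are assumptions for illustration; the source does not state how the subset was drawn.

```python
import csv
import io
import random

LABEL_TO_ID = {"negative": 0, "positive": 1}


def prepare_rows(rows, sample_size, seed=42):
    """Encode string labels as integers and keep at most sample_size rows."""
    encoded = [
        {"review": row["review"], "label": LABEL_TO_ID[row["sentiment"]]}
        for row in rows
    ]
    if len(encoded) <= sample_size:
        return encoded
    return random.Random(seed).sample(encoded, sample_size)


# A tiny in-memory CSV standing in for the full review file.
raw_csv = io.StringIO(
    "review,sentiment\n"
    "Loved every minute of it.,positive\n"
    "A dull and predictable plot.,negative\n"
    "Great acting and a clever script.,positive\n"
)
rows = list(csv.DictReader(raw_csv))
subset = prepare_rows(rows, sample_size=2)  # the repository keeps 5,000
print(subset)
```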

Contributors

  • dimitreoliveira
