
inference's Introduction

Inference

A Rust crate for managing the inference process for machine learning (ML) models. Currently, we support interacting with a Triton Inference Server, loading models from a MinIO Model Store.

Requirements

  • rust: Minimum Supported Rust Version 1.58
  • lld linker: for faster Rust builds.
  • docker: Container engine
  • NVIDIA container toolkit for Docker and a GPU supported by the container toolkit.
    • It may be possible to avoid the NVIDIA GPU requirement by changing the resource reservations for the triton service in docker-compose.yaml so that it doesn't request a GPU. YMMV, depending on whether the model you're trying to serve was compiled/optimized for GPU-only inference.
  • docker compose: Multi-container orchestration. NOTE: the standalone docker-compose binary is deprecated; Compose functionality is now integrated into the docker compose subcommand. To install it alongside an existing Docker installation, run sudo apt-get install docker-compose-plugin. ref.
  • protoc: Google Protocol Buffer compiler. Needed to build protobufs to bind to the Triton gRPC server.

For Debian-based Linux distros, you can install inference's dependencies (except Docker and the NVIDIA container toolkit, which require the special repository configuration documented above) with the following command:

apt-get install build-essential clang lld protobuf-compiler libprotobuf-dev zstd libzstd-dev make cmake pkg-config libssl-dev

inference is tested on Ubuntu 22.04 LTS, but pull requests to fix Windows or macOS issues are welcome.

Quick Start

  1. Clone repo: git clone https://github.com/opensensordotdev/inference.git
  • Ensure all requirements have been installed, especially the lld linker and protoc! Otherwise inference won't build!
  2. make: Download the latest versions of the Triton Inference Server Protocol Buffer files & Triton sample ML models
  3. docker compose up: Start the MinIO and Triton containers + monitoring infrastructure
  • If you have a GPU available, uncomment the below section of the inference.triton service in docker-compose.yaml. For your GPU to work with Triton, the CUDA version on your host OS and the CUDA version expected by Triton have to be compatible.
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                capabilities: [gpu]
  • If you don't have a GPU, comment out that section
  4. Upload the contents of the sample_models directory to the models bucket via the MinIO web UI at localhost:9001
  5. cargo test: Verify all cargo tests pass

Model Inspection

http://localhost:8000/v2/models/simple

This will print the model name and the parameters required to set up its inputs and outputs.
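
If you'd rather hit that endpoint from Rust than from a browser or curl, here's a minimal sketch (not part of this crate) that assumes the reqwest crate with its "blocking" feature enabled:

    // Sketch: fetch model metadata over Triton's HTTP/JSON endpoint (port 8000).
    // Assumes the `reqwest` crate with the "blocking" feature in Cargo.toml.
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let body = reqwest::blocking::get("http://localhost:8000/v2/models/simple")?.text()?;
        println!("{body}");
        Ok(())
    }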

Errata

gRPC Setup

The proto folder contains the Protocol Buffer definitions. Only grpc_service.proto is referenced in build.rs, because model_config.proto is imported by grpc_service.proto. The tonic-generated code ends up in inference.rs.
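
As a rough sketch, the build.rs wiring amounts to something like the following (the exact tonic-build API varies by version, and build_server(false) is one plausible configuration rather than necessarily what this repo uses; the output lands in inference.rs because the proto package is named inference):

    // build.rs (sketch): compile only grpc_service.proto; model_config.proto
    // is pulled in transitively via its import.
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        tonic_build::configure()
            .build_server(false) // only the client bindings are needed to talk to Triton
            .compile(&["proto/grpc_service.proto"], &["proto"])?;
        Ok(())
    }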

  • The JSON/HTTP API is served on port 8000
  • gRPC calls are submitted on port 8001 (see the connection sketch below)
  • Prometheus metrics are exposed on port 8002
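
For example, reaching the gRPC port from Rust might look like the sketch below. The triton module name is illustrative, and the sketch assumes tokio, tonic, and the bindings generated by the build step above:

    // Sketch: connect to Triton's gRPC endpoint on port 8001 and check liveness.
    // The module name `triton` is illustrative; it just wraps the generated code.
    pub mod triton {
        tonic::include_proto!("inference"); // generated from grpc_service.proto
    }

    use triton::grpc_inference_service_client::GrpcInferenceServiceClient;
    use triton::ServerLiveRequest;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mut client = GrpcInferenceServiceClient::connect("http://localhost:8001").await?;
        let reply = client.server_live(ServerLiveRequest {}).await?;
        println!("Triton live: {}", reply.into_inner().live);
        Ok(())
    }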

Multiplexing Tonic Channels

Submitting requests to a gRPC service requires a mutable reference to a Client. This prohibits you from passing a single Client around to multiple Tasks and creates a bottleneck for async code.

Hiding this from users by wrapping what amounts to a synchronous resource in a struct and accessing it through async message passing might help somewhat, but it still doesn't fix the core problem.

While it would be possible to make a connection pool of multiple Client<Channel>s and hide this pool in a struct accessed with async message passing, this is complicated.

It also doesn't work to store a tonic::transport::Channel in the TritonClient struct: doing so requires the struct to implement some obscure internal tonic traits.

The idiomatic way appears to be to store a single master Client in a struct and provide a function that returns a clone of the Client, since cloning clients is cheap.
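
A sketch of that pattern (field and method names are illustrative, not necessarily the crate's actual API):

    use tonic::transport::Channel;

    // Illustrative module wrapping the tonic-generated bindings (inference.rs),
    // as in the gRPC Setup sketch above.
    pub mod triton {
        tonic::include_proto!("inference");
    }
    use triton::grpc_inference_service_client::GrpcInferenceServiceClient;

    /// Owns one "master" client and hands out cheap clones, one per task.
    pub struct TritonClient {
        client: GrpcInferenceServiceClient<Channel>,
    }

    impl TritonClient {
        pub async fn connect(endpoint: String) -> Result<Self, tonic::transport::Error> {
            let client = GrpcInferenceServiceClient::connect(endpoint).await?;
            Ok(Self { client })
        }

        /// Cloning only copies the underlying Channel handle, so each task can
        /// own a mutable client without contending on a lock.
        pub fn client(&self) -> GrpcInferenceServiceClient<Channel> {
            self.client.clone()
        }
    }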

A limitation of this could be that gRPC servers usually limit how many concurrent streams they will multiplex over a single connection (100 is the figure a lot of places throw out). See gRPC performance best practices.

tonic seems to have a default buffer size of 1024 (DEFAULT_BUFFER_SIZE in channel/mod.rs).

This might be useful eventually if you have multiple Triton pods and want to discover which ones are live and update the endpoint list: grpc load balancing, github.

It's not clear whether there's a connection pool under the hood there, or how they're able to connect to multiple servers.
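
If you do end up with a known set of Triton endpoints, tonic's Channel::balance_list gives you the simple static version of this. A sketch with placeholder hostnames:

    use tonic::transport::{Channel, Endpoint};

    // Sketch: one Channel that load-balances over a static list of Triton gRPC
    // endpoints. The hostnames are placeholders, not anything this repo deploys.
    fn balanced_channel() -> Channel {
        let endpoints = ["http://triton-0:8001", "http://triton-1:8001"]
            .into_iter()
            .map(Endpoint::from_static);
        Channel::balance_list(endpoints)
    }

For the dynamic case discussed in the linked issue, tonic also exposes Channel::balance_channel, which hands back a Sender you can push endpoint changes into as pods come and go.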

Alternate Implementations/Inspiration

Triton-client-rs

Integration with DALI

DALI (Data Loading Library) Triton DALI backend


inference's Issues

Question: Parsing model infer response

Hey man!

Was wondering if you have any take on parsing the model infer response back to the expected datatype? I'm looking for a sleek interface that does the parsing automatically based on the outputs.
