QIZHI Platform

Introduction

QIZHI is a cluster management tool and resource scheduling platform. It was initially designed and jointly developed by Microsoft Research (MSR), Microsoft Search Technology Center (STC), Peking University, Xi'an Jiaotong University, Zhejiang University, and the University of Science and Technology of China, and is maintained by NELVT, Peking University and AITISA. The platform incorporates mature designs with a proven track record in large-scale Microsoft production environments, and is tailored primarily for academic and research purposes.

QIZHI supports AI jobs (e.g., deep learning jobs) running in a GPU cluster. The platform provides a set of interfaces to support major deep learning frameworks such as CNTK and TensorFlow. These interfaces are extensible: a new deep learning framework (or another type of workload) can be supported with a few extra lines of script and/or Python code.
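As a rough illustration of what "a few extra lines" of Python could look like, the sketch below uses a small registry that maps a framework name to a function building its launch command. This is not QIZHI's actual interface: the names `LAUNCHERS`, `register`, and `build_command` are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List

# Hypothetical registry: framework name -> function that builds a launch command.
LAUNCHERS: Dict[str, Callable[[str], List[str]]] = {}


def register(framework: str):
    """Decorator that adds a command builder to the registry."""
    def decorator(fn: Callable[[str], List[str]]):
        LAUNCHERS[framework] = fn
        return fn
    return decorator


@register("tensorflow")
def tensorflow_cmd(script: str) -> List[str]:
    # TensorFlow jobs are plain Python scripts.
    return ["python", script]


@register("cntk")
def cntk_cmd(config: str) -> List[str]:
    # CNTK is driven by a configuration file.
    return ["cntk", f"configFile={config}"]


# Supporting another framework takes only a few extra lines.
@register("caffe")
def caffe_cmd(solver: str) -> List[str]:
    return ["caffe", "train", f"--solver={solver}"]


def build_command(framework: str, entry: str) -> List[str]:
    """Look up the builder for `framework` and build its command line."""
    try:
        builder = LAUNCHERS[framework]
    except KeyError:
        raise ValueError(f"unsupported framework: {framework}") from None
    return builder(entry)


print(build_command("tensorflow", "train.py"))
```

The design choice here is that the core never needs to change when a framework is added: a new builder registers itself and becomes available by name.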

QIZHI supports GPU scheduling, a key requirement of deep learning jobs. For better performance, QIZHI supports fine-grained, topology-aware job placement: a job can request GPUs at a specific location (e.g., under the same PCI-E switch).
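To make the idea of topology-aware placement concrete, here is a minimal sketch that picks GPUs sharing one PCI-E switch. It is not the scheduler QIZHI ships: the function name, the `gpu_to_switch` map, and the best-fit tie-breaking rule are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def pick_gpus_same_switch(
    gpu_to_switch: Dict[int, str],
    free_gpus: List[int],
    count: int,
) -> Optional[List[int]]:
    """Return `count` free GPU ids under one PCI-E switch, or None if impossible."""
    # Group the free GPUs by the switch they hang off.
    by_switch: Dict[str, List[int]] = defaultdict(list)
    for gpu in sorted(free_gpus):
        by_switch[gpu_to_switch[gpu]].append(gpu)

    # Keep only the switches that can satisfy the whole request.
    candidates = [gpus for gpus in by_switch.values() if len(gpus) >= count]
    if not candidates:
        return None

    # Best fit (an assumption): use the switch with the fewest free GPUs
    # that still fits, leaving larger groups intact for bigger jobs.
    best = min(candidates, key=len)
    return best[:count]


# Hypothetical 4-GPU machine with two PCI-E switches.
topology = {0: "switch-a", 1: "switch-a", 2: "switch-b", 3: "switch-b"}
print(pick_gpus_same_switch(topology, free_gpus=[1, 2, 3], count=2))
```

In this example GPU 1 is the only free device under `switch-a`, so a two-GPU request can only be satisfied by GPUs 2 and 3 under `switch-b`.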

QIZHI embraces a microservices architecture: every component runs in a container. Kubernetes deploys and manages the static components of the system, while the more dynamic deep learning jobs are scheduled and managed by Hadoop YARN with our GPU enhancement. Training data and training results are stored in Hadoop HDFS.

An Open AI Platform for R&D and Education

QIZHI is completely open: it is released under the Open-Intelligence license. QIZHI is architected in a modular way: different modules can be plugged in as appropriate. This makes QIZHI particularly attractive for evaluating research ideas, which include but are not limited to the following:

Scheduling mechanism for deep learning workload
Deep neural network application that requires evaluation under realistic platform environment
New deep learning framework
Compiler technique for AI
High performance networking for AI
Profiling tool, including network, platform, and AI job profiling
AI Benchmark suite
New hardware for AI, including FPGA, ASIC, Neural Processor
AI Storage support
AI platform management

QIZHI operates in an open model: contributions from academia and industry are all highly welcome.

System Deployment

Prerequisite

The system runs on a cluster of machines, each equipped with one or more GPUs. Each machine in the cluster runs Ubuntu 16.04 LTS and has a statically assigned IP address. To deploy services, the system relies on a Docker registry service (e.g., Docker Hub) to store the Docker images of the services to be deployed. The system also requires a dev machine that runs in the same environment and has full access to the cluster. Finally, the system needs an NTP service for clock synchronization.
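The per-machine prerequisites above can be checked before deployment. The following is a minimal sketch, assuming the cluster is described by a hand-written inventory; the `Machine` record and `check_machine` function are hypothetical and not part of QIZHI. It does not verify that an address is actually static or that NTP is running, since both require access to the machine itself.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass
class Machine:
    """Hypothetical inventory record for one cluster node."""
    hostname: str
    ip: str
    os_release: str  # e.g. "16.04" for Ubuntu 16.04 LTS
    gpu_count: int


def check_machine(m: Machine) -> List[str]:
    """Return a list of prerequisite violations (empty if the node looks fine)."""
    problems: List[str] = []
    try:
        ipaddress.ip_address(m.ip)  # raises ValueError on a malformed address
    except ValueError:
        problems.append(f"{m.hostname}: invalid IP address {m.ip!r}")
    if not m.os_release.startswith("16.04"):
        problems.append(f"{m.hostname}: expected Ubuntu 16.04, found {m.os_release}")
    if m.gpu_count < 1:
        problems.append(f"{m.hostname}: no GPU reported")
    return problems


inventory = [
    Machine("node-1", "10.0.0.11", "16.04", 4),
    Machine("node-2", "10.0.0.300", "18.04", 0),
]
for machine in inventory:
    for problem in check_machine(machine):
        print(problem)
```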

Deployment process

Deploying and using the system consists of the following steps.

1. Build the binary for Hadoop AI and place it in the specified path*
2. Deploy Kubernetes and system services
3. Access the web portal for job submission and cluster management

* If step 1 is skipped, a standard Hadoop 2.7.2 will be installed instead.

Kubernetes deployment

The platform leverages Kubernetes (k8s) to deploy and manage system services. To deploy k8s in the cluster, please refer to the k8s deployment readme for details.

Service deployment

After Kubernetes is deployed, the system leverages built-in k8s features (e.g., configmap) to deploy system services. Please refer to the service deployment readme for details.

Job management

After the system services have been deployed, users can access the web portal, a Web UI for cluster management and job management. Please refer to this tutorial for details about job submission.

Cluster management

The web portal also provides a Web UI for cluster management.

System Architecture

The system architecture is illustrated above. Users submit jobs or monitor cluster status through the Web Portal, which calls APIs provided by the REST server. Third-party tools can also call the REST server directly for job management. Upon receiving API calls, the REST server coordinates with FrameworkLauncher (Launcher for short) to perform job management. The Launcher Server handles requests from the REST server and submits jobs to Hadoop YARN. A job, scheduled by YARN with GPU enhancement, can leverage GPUs in the cluster for deep learning computation. Other types of CPU-based AI workloads or traditional big data jobs can also run on the platform, coexisting with the GPU-based jobs. The platform leverages HDFS to store data, and all jobs are assumed to support HDFS. All static services (blue-lined boxes) are managed by Kubernetes, while jobs (purple-lined boxes) are managed by Hadoop YARN.
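The request path described here (REST server, then Launcher, then YARN) can be traced with a toy in-memory model. Every class below is a hypothetical stand-in written for this sketch; none of them reflects the real REST API, FrameworkLauncher, or YARN interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class YarnStub:
    """Stand-in for Hadoop YARN: records the jobs it is asked to run."""
    jobs: List[Dict] = field(default_factory=list)

    def submit(self, job: Dict) -> int:
        self.jobs.append(job)
        return len(self.jobs)  # stand-in for an application id


@dataclass
class LauncherStub:
    """Stand-in for the Launcher Server: forwards jobs to YARN."""
    yarn: YarnStub

    def launch(self, job: Dict) -> int:
        return self.yarn.submit(job)


@dataclass
class RestServerStub:
    """Stand-in for the REST server called by the Web Portal or other tools."""
    launcher: LauncherStub

    def post_job(self, job: Dict) -> Dict:
        # Reject malformed requests before they reach the Launcher.
        if "name" not in job:
            raise ValueError("job description needs a 'name'")
        app_id = self.launcher.launch(job)
        return {"name": job["name"], "application_id": app_id}


yarn = YarnStub()
rest = RestServerStub(LauncherStub(yarn))
print(rest.post_job({"name": "mnist-train", "gpus": 2}))
```

The layering mirrors the text: callers only ever talk to the REST server, and only the Launcher talks to YARN.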
