MindSpore Observability

Experimental notice: This project is still experimental. It only shows how to run eBPF code in a container to trace the kernel, and provides some simple examples that teach you how to use BCC to develop eBPF tools and how to use ebpf_exporter to visualize system tracing metrics. For now it contains a simple attempt to combine MindSpore with eBPF; practical, real-world examples are expected in the near future.

Introduction of ms_observability

MindSpore is a new open source deep learning training/inference framework that can be used in mobile, edge, and cloud scenarios. It is designed to give data scientists and algorithm engineers a friendly development experience and efficient execution, with native support for the Ascend AI processor and software-hardware co-optimization.

Currently, a common problem with deep learning jobs is that the AI training process is invisible. While running an AI job with MindSpore, we don't know how it is layered, which CPU core it runs on, or even which kernel functions it calls and how execution jumps between them. Once a task hits a bottleneck, developers tend to reach for common monitoring tools, but these are inflexible and usually have blind spots. For example, they can report on long-lived processes but often fail to capture short-lived ones, so information is lost even though many of those processes consume real resources.

To close this gap, the ms_observability project combines MindSpore with eBPF to improve the observability of the kernel throughout the AI training and inference process. eBPF makes the kernel programmable: it dynamically runs small programs on a wide variety of kernel events, which lets non-kernel developers write their own tracing code to solve the real problems they meet. As a result, it can watch the kernel state of an AI job and provide more detailed context for analyzing your system and application.

Getting Started

Prerequisites

Run eBPF code in a container to probe kernel metrics

Download ms_observability code

cd $HOME
git clone https://github.com/hellowaywewe/ms_observability.git

Build and run the ebpf_bcc_exporter container

cd $HOME/ms_observability/docker
docker build -f Dockerfile -t ebpf_bcc_exporter:latest .
cd $HOME/ms_observability
DOCKER_NAME=ebpf_bcc_exporter TAG=latest ./run_docker.sh   # mounts the host kernel into the container

Show the kernel queue IO latency metrics (a simple example of using BCC to develop eBPF code and probe the kernel)

docker exec -it ebpf_bcc_exporter /bin/bash     # open an interactive shell in the container
cd /mnt/ms_observability/ebpf_example && ./io-latency.py 1 2

Visualize kernel metrics in the unified format of Prometheus

Show the kernel queue IO latency metrics (a simple example of using ebpf_exporter to configure and visualize metrics)

~/go/bin/ebpf_exporter --config.file=/mnt/ms_observability/exporter_example/io-latency.yaml

Use the curl command to verify that the metrics are being exported correctly

docker inspect ebpf_bcc_exporter | grep IPAddress  # query the IP address of the container
curl http://<yourContainerIP>:9435/metrics
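
The same check can be scripted. The following is a minimal sketch, not part of ms_observability: it fetches the /metrics endpoint on port 9435 and parses the Prometheus text format into (name, labels, value) tuples using only the Python standard library. The function names parse_metrics and fetch_metrics are hypothetical, and the parser covers only the common case of simple sample lines.

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, then the value.
# A trailing timestamp, if present, is ignored.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escaped characters allowed).
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format output into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(host, port=9435):
    """Fetch and parse the exporter's /metrics page (hypothetical helper)."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

To try it, call fetch_metrics with the container IP found by the docker inspect command above and print the returned tuples. The metric names you see depend on the configuration in io-latency.yaml.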

A simple attempt to combine MindSpore with eBPF

While the MindSpore LeNet job runs on the host, the words "Hello World" are printed each time the kernel function blk_account_io_done is called; otherwise nothing is printed.

Run the lenet-io.py code in the container

docker exec -it ebpf_bcc_exporter /bin/bash
cd /mnt/ms_observability/ebpf_example
./lenet-io.py

Run the MindSpore LeNet training job on the host (requires a MindSpore v0.2.0-alpha environment)

cd $HOME && git clone https://github.com/mindspore-ai/docs.git
conda activate mindspore && cd $HOME/docs/tutorials/tutorial_code/
python lenet.py --device_target="CPU"
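
The probe side of this experiment can be sketched with BCC as follows. This is an illustrative sketch of the technique (a kprobe that prints a message), not the actual contents of lenet-io.py. It assumes BCC is installed, that the code runs as root, and that blk_account_io_done is available as a kprobe target in the running kernel, which can vary across kernel versions. The names BPF_PROGRAM, KERNEL_FUNCTION, and trace_hello are hypothetical.

```python
# C source for the eBPF program. The raw string keeps "\n" as an escape
# sequence for the C compiler instead of a literal newline in the source.
BPF_PROGRAM = r"""
int hello(void *ctx) {
    bpf_trace_printk("Hello World\n");
    return 0;
}
"""

# Kernel function named in the text above.
KERNEL_FUNCTION = "blk_account_io_done"


def trace_hello():
    """Attach the probe and stream trace output until Ctrl-C."""
    # Imported here so the module can be loaded on machines without BCC.
    from bcc import BPF

    b = BPF(text=BPF_PROGRAM)
    # Run hello() every time the kernel enters KERNEL_FUNCTION.
    b.attach_kprobe(event=KERNEL_FUNCTION, fn_name="hello")
    print(f"Tracing {KERNEL_FUNCTION}... press Ctrl-C to stop")
    try:
        # Print each line written by bpf_trace_printk.
        b.trace_print()
    except KeyboardInterrupt:
        pass
```

Calling trace_hello() inside the container and then starting the LeNet job on the host should produce one "Hello World" line per completed block IO request, since the probe fires for any process that triggers the function, not only for MindSpore.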

Future Work

ms_observability is currently in the early stages of experimentation. Most importantly, future work should analyze what needs to be observed in AI scenarios and which of the thousands of available kernel events can be used and traced. We then plan to collaborate with other open source communities:

  1. Work with the iovisor/bcc project to develop AI observability tools based on eBPF.
  2. Enable MindSpore to support eBPF AI observability tools.
  3. Work with the Prometheus and ebpf_exporter projects to visualize the AI kernel metrics.
