
distributed-tutorial's Introduction

Install software to support distributed training on an AI cluster

This tutorial is based on the Horovod distributed training framework, which supports Keras, TensorFlow, and PyTorch.

The following versions have been tested:

  • keras: 2.2.2
  • tensorflow: 1.10.0
  • pytorch: 0.4.0

Hardware environment

This tutorial targets the ShanghaiTech AI cluster.

  • Start containers with --network=host
  • Make sure every node has a different hostname; hostnames must not use .
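
The hostname requirement can be checked before a job is launched. The following is a minimal sketch, not part of the cluster tooling: the function name find_duplicate_hostnames and the node names in the usage comment are hypothetical, and <image> is a placeholder for whatever container image is in use.

```shell
#!/usr/bin/env bash
# Print every hostname that appears more than once in the arguments.
# Prints nothing and returns 0 when all hostnames are distinct;
# returns 1 when at least one duplicate is found.
find_duplicate_hostnames() {
    local dups
    dups=$(printf '%s\n' "$@" | sort | uniq -d)
    if [ -n "$dups" ]; then
        printf '%s\n' "$dups"
        return 1
    fi
    return 0
}

# Usage (hypothetical node names; collect the real ones by running
# `hostname` inside the container on each node):
#   docker run --rm -it --network=host <image> hostname
#   find_duplicate_hostnames node01 node02 node03
```

With host networking the container shares the node's network stack, so the name reported inside the container is the one other nodes will see.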

Software installation

Install all of the following software on the compute nodes.

NCCL

  1. Download NCCL 2.

  2. Follow the installation steps here.

  3. If you installed NCCL 2 from the nccl-<version>.txz package, add its library path to the LD_LIBRARY_PATH environment variable or register it in /etc/ld.so.conf.

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-<version>/lib
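
If the export line ends up in a shell startup file, it may run several times and add the same entry repeatedly. When LD_LIBRARY_PATH starts out empty, it also leaves a leading colon, which the dynamic linker treats as the current directory. A minimal sketch of an idempotent variant follows; the helper name append_path_once and the NCCL_LIB variable are assumptions introduced here, and the nccl-<version> placeholder must still be replaced with the real directory.

```shell
#!/usr/bin/env bash
# Append a directory to a colon-separated path list unless it is already
# present. Prints the resulting list; never emits an empty entry.
append_path_once() {
    local dir="$1" list="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;             # already listed
        *)          printf '%s\n' "${list:+$list:}$dir" ;; # append
    esac
}

# Usage (NCCL_LIB is a hypothetical variable name):
#   NCCL_LIB=/usr/local/nccl-<version>/lib
#   export LD_LIBRARY_PATH="$(append_path_once "$NCCL_LIB" "${LD_LIBRARY_PATH:-}")"
```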

OpenMPI

  1. Download the latest Open MPI release package
  2. Follow the installation steps here.
  3. Run ldconfig
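
After these steps it can help to confirm that the Open MPI tools are actually reachable. The sketch below assumes a default installation that puts mpirun and mpicc on PATH; the helper name check_commands is hypothetical.

```shell
#!/usr/bin/env bash
# Report whether each named command can be found on PATH.
# Returns non-zero if at least one command is missing.
check_commands() {
    local missing=0 cmd
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf 'found: %s\n' "$cmd"
        else
            printf 'missing: %s\n' "$cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_commands mpirun mpicc
#   mpirun --version            # prints the installed Open MPI version
#   ldconfig -p | grep libmpi   # lists libmpi if the linker cache knows it
```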

Deep learning framework

  • Install Keras, TensorFlow, and PyTorch, then build Horovod against NCCL
pip install keras==2.2.2 tensorflow-gpu==1.10.0 torch==0.4.0 torchvision
HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
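
To confirm that the environment matches the tested versions, the pins can be compared against the output of pip freeze. This is a minimal sketch; the helper name check_pins is hypothetical, and it only compares exact name==version lines, ignoring case.

```shell
#!/usr/bin/env bash
# Compare pip-freeze-style text ($1) against one or more name==version pins.
# Prints "ok" or "mismatch" per pin; returns non-zero on any mismatch.
check_pins() {
    local installed="$1"; shift
    local pin status=0
    for pin in "$@"; do
        # -x: whole line, -F: fixed string, -i: pip may print "Keras"
        if printf '%s\n' "$installed" | grep -qixF -- "$pin"; then
            printf 'ok: %s\n' "$pin"
        else
            printf 'mismatch: %s\n' "$pin"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_pins "$(pip freeze)" keras==2.2.2 tensorflow-gpu==1.10.0 torch==0.4.0
#   python -c "import horovod; print(horovod.__version__)"
```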

Example

See README.md in examples
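
For orientation, a Horovod job on Open MPI is typically started with mpirun, passing one process per GPU. The sketch below only prints such a command and does not run it. The node names and train.py are hypothetical, the helper names are introduced here, and the flags follow the common Horovod-with-Open-MPI pattern rather than anything specific to this repository; the README in examples remains the reference.

```shell
#!/usr/bin/env bash
# Sum the slot counts in a host list such as "node01:4,node02:4".
# A host without ":N" counts as one slot.
total_slots() {
    local hosts="$1" total=0 entry slots
    local IFS=','
    for entry in $hosts; do
        case "$entry" in
            *:*) slots="${entry##*:}" ;;
            *)   slots=1 ;;
        esac
        total=$((total + slots))
    done
    printf '%s\n' "$total"
}

# Print (not execute) an mpirun command line for the given host list and
# training command. -x forwards environment variables to remote processes.
print_launch_command() {
    local hosts="$1"; shift
    printf 'mpirun -np %s -H %s -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH %s\n' \
        "$(total_slots "$hosts")" "$hosts" "$*"
}

# Usage (hypothetical hosts and script):
#   print_launch_command node01:4,node02:4 python train.py
```

Printing the command first makes it easy to inspect the process count before committing cluster resources.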

distributed-tutorial's People

Contributors

tonysy
