
pytorch-distributed's Introduction

#!/usr/bin/python
# -*- coding: utf-8 -*-

print('''
Hi there 👋. I am ZHANG Zhi from The Hong Kong Polytechnic University (PolyU).
My current research interest includes EEG analysis using generative models and graph convolutional networks.
Follow me. 😄
''')

pytorch-distributed's People

Contributors

tczhangzhi


pytorch-distributed's Issues

Connections time out

When using nccl as my communication backend in distributed training, I found that none of the operations that gather variables from other processes work: the program stops with a connection time-out. Do you know what causes this issue and how we can solve it?
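A minimal standalone sketch (assuming two GPUs and a free local port; this is not code from the repo) that usually narrows this down: depending on the PyTorch version, the NCCL backend may not implement gather/scatter at all, so all_gather is the safer collective, and the process-group timeout can be raised explicitly. If the timeout still fires, one rank is most likely never reaching the collective at all.

import datetime
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def gather_example(local_rank, world_size):
    dist.init_process_group(backend='nccl',
                            init_method='tcp://localhost:23456',
                            world_size=world_size,
                            rank=local_rank,
                            timeout=datetime.timedelta(minutes=60))
    torch.cuda.set_device(local_rank)
    value = torch.tensor([local_rank], device='cuda')
    gathered = [torch.zeros_like(value) for _ in range(world_size)]
    # every rank must enter the same collective, otherwise the others
    # block until the backend timeout expires
    dist.all_gather(gathered, value)
    print(local_rank, [t.item() for t in gathered])

if __name__ == '__main__':
    mp.spawn(gather_example, nprocs=2, args=(2,))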

Accuracy drops with DDP

Hi, I am using DistributedDataParallel with the nccl backend and a DistributedSampler for the data, but compared with a single-GPU model trained with the same settings, the multi-GPU accuracy drops a lot (and the model has no BN layers).
I also have a few questions I hope you can answer:

  1. Is training on multiple GPUs equivalent to enlarging the batch size?
  2. If I use warm-up, should the corresponding settings be divided by the number of GPUs relative to the single-GPU setup?
  3. With an adaptive optimizer such as AdamW, do I need to multiply the lr by the number of GPUs?

Thanks
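Not an authoritative answer, but a sketch of two settings that often explain the gap (dummy dataset and made-up names, assuming dist.init_process_group has already run in each worker): each process sees batch_size samples, so the effective batch is batch_size * world_size and the base learning rate is frequently scaled linearly as a starting heuristic (for adaptive optimizers such as AdamW this is only a rough guide), and DistributedSampler.set_epoch must be called every epoch or the shuffling order repeats.

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader_and_lr(base_lr=0.1, batch_size=64, epochs=3):
    dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                        num_workers=2, pin_memory=True)

    # effective batch size is batch_size * world_size, so the base learning
    # rate is often scaled linearly (a heuristic, not a rule)
    lr = base_lr * dist.get_world_size()

    for epoch in range(epochs):
        # re-shuffle differently every epoch; forgetting this is a common
        # cause of a DDP accuracy drop
        sampler.set_epoch(epoch)
        for images, labels in loader:
            pass  # the training step would go here
    return lr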

Launching with mp.spawn is much slower than with torch.distributed.launch

Hi, while testing multi-GPU training recently I found that launching with mp.spawn is much slower than with torch.distributed.launch.
With mp.spawn there is a long wait at the start of every epoch, but this does not happen when launching with torch.distributed.launch. Have you run into this yourself?
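I cannot say for certain why the two launchers differ, but a long wait at the start of every epoch is very often DataLoader worker startup. persistent_workers (PyTorch >= 1.7) keeps the worker processes alive between epochs and is cheap to try; a sketch with a dummy dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(512, 8))
# persistent_workers keeps the worker processes alive between epochs instead
# of tearing them down and re-creating them at every epoch boundary
loader = DataLoader(dataset, batch_size=64, num_workers=4,
                    pin_memory=True, persistent_workers=True)

for epoch in range(2):
    for (batch,) in loader:
        pass  # the training step would go here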

distributed.py seed

Putting the seed in main doesn't seem to have any effect: each run of main_worker still gets different random numbers.
Shouldn't it be set inside main_worker instead?
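A sketch of that suggestion (not the repo's exact code): with the spawn start method the child processes do not inherit the parent's RNG state, so seeding only takes effect if it happens inside the worker.

import random
import numpy as np
import torch

def main_worker(local_rank, ngpus_per_node, args):
    # seed every RNG in every spawned process; args.seed is an assumed attribute
    seed = getattr(args, 'seed', 0)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # ... the rest of the worker (init_process_group, model, training loop) ...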

How can I train on specific GPUs?

For example, I am training on an 8-GPU node and want to use 4 of the cards. Training works with GPU ids 0,1,2,3, but not with any other combination of ids.

I followed this to change the GPU id of each process:
Lightning-AI/pytorch-lightning#2407

but it fails with:
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59

My code:

import argparse
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.utils.data.distributed

parser = argparse.ArgumentParser(description='multi process')
parser.add_argument('--gpu-id', type=str, default='0,1,2,4')
parser.add_argument('--world-size', default=1, type=int,
                    help='number of nodes for distributed training')
parser.add_argument('--rank', default=0, type=int,
                    help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://localhost:23456', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='nccl', type=str,
                    help='distributed backend')
args = parser.parse_args()


def main():
    global args

    # only expose the requested cards to this process and its children
    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id

    ngpus_per_node = len(args.gpu_id.split(','))
    args.nprocs = ngpus_per_node
    args.world_size = ngpus_per_node * args.world_size
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))


def main_worker(local_rank, ngpus_per_node, args):
    # look up the physical GPU id for this process
    gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    gpu = int(gpus[local_rank])

    args.gpu = gpu
    best_acc = 0
    args.rank = args.rank * ngpus_per_node + local_rank
    print('rank: {} / {}'.format(args.rank, args.world_size))

    dist.init_process_group(backend=args.dist_backend,
                            init_method=args.dist_url,
                            world_size=args.world_size,
                            rank=args.rank)

    torch.cuda.set_device(gpu)


if __name__ == '__main__':
    main()
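Probably not the only possibility, but the traceback fits one specific mistake in the code above: once CUDA_VISIBLE_DEVICES='0,1,2,4' is set, CUDA renumbers the visible cards as 0..3, so looking up the physical id (4) and passing it to torch.cuda.set_device raises "invalid device ordinal". A sketch of main_worker using the logical index instead:

import torch
import torch.distributed as dist

def main_worker(local_rank, ngpus_per_node, args):
    # after CUDA_VISIBLE_DEVICES='0,1,2,4' the visible cards are renumbered
    # 0..3, so use the logical index (local_rank), not the physical id
    args.gpu = local_rank
    args.rank = args.rank * ngpus_per_node + local_rank
    print('rank: {} / {}'.format(args.rank, args.world_size))

    dist.init_process_group(backend=args.dist_backend,
                            init_method=args.dist_url,
                            world_size=args.world_size,
                            rank=args.rank)
    torch.cuda.set_device(local_rank)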

How can I modify it?

I want to modify it so that it can run on multiple machines, but I don't know how to do it.
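A sketch of one common multi-node layout, consistent with the argument parsing shown in the previous issue: --world-size is the number of nodes, --rank is this node's index, and --dist-url points at node 0 (this_script.py and <node0_ip> are placeholders).

# run the same script on every machine, e.g. with two nodes:
#   node 0: python this_script.py --world-size 2 --rank 0 --dist-url tcp://<node0_ip>:23456
#   node 1: python this_script.py --world-size 2 --rank 1 --dist-url tcp://<node0_ip>:23456

import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(local_rank, ngpus_per_node, args):
    global_rank = args.rank * ngpus_per_node + local_rank   # unique across nodes
    dist.init_process_group(backend='nccl', init_method=args.dist_url,
                            world_size=args.world_size, rank=global_rank)
    torch.cuda.set_device(local_rank)
    print('rank {} / {}'.format(global_rank, args.world_size))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--world-size', default=1, type=int)  # number of nodes
    parser.add_argument('--rank', default=0, type=int)        # this node's index
    parser.add_argument('--dist-url', default='tcp://localhost:23456', type=str)
    args = parser.parse_args()

    ngpus_per_node = torch.cuda.device_count()
    args.world_size = ngpus_per_node * args.world_size        # total process count
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))

if __name__ == '__main__':
    main()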

About saving the model in multi-GPU training.

save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.module.state_dict(),
    'best_acc1': best_acc1,
}, is_best)

In multi-GPU training, is it only necessary to save the model parameters when gpu == 0?

def train(model, start_epoch, end_epoch, tr_loader, optimizer, scheduler, loss_funcs, local_rank):
    for curr_epoch in range(start_epoch, end_epoch):
        train_epoch(curr_epoch, end_epoch, local_rank, loss_funcs, model, optimizer, scheduler, tr_loader)

        # adjust the learning rate per epoch
        if not arg_config["sche_usebatch"]:
            scheduler.step()

        if local_rank == 0:
            # save a checkpoint every epoch; the saved parameters correspond to epoch curr_epoch + 1
            save_checkpoint(
                model=model,
                optimizer=optimizer,
                scheduler=scheduler,
                amp=amp if arg_config["use_amp"] else None,
                exp_name=exp_name,
                current_epoch=curr_epoch + 1,
                full_net_path=path_config["final_full_net"],
                state_net_path=path_config["final_state_net"],
            )  # save parameters

And when loading the parameters, do I only need to follow these steps?

import apex.parallel as apexparallel
import numpy as np
import torch
import torch.backends.cudnn as torchcudnn
from apex import amp

...

    if arg_config["multi_gpu"]:
        model = apexparallel.convert_syncbn_model(model)

    if arg_config["use_amp"]:
        assert torchcudnn.enabled, "Amp requires cudnn backend to be enabled."
        model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    if arg_config["multi_gpu"]:
        model = apexparallel.DistributedDataParallel(model, delay_allreduce=True)

    if arg_config["resume_mode"] == "train":
        # resume model to train the model
        start_epoch = resume_checkpoint(
            model=model,
            optimizer=optimizer,
            scheduler=scheduler,
            amp=amp if arg_config["use_amp"] else None,
            exp_name=exp_name,
            load_path=path_config["final_full_net"],
            mode="all",
        )
    else:
        # only train a new model.
        start_epoch = 0
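Yes, saving from rank 0 only is the usual approach, since after DDP synchronisation every process holds identical weights. A minimal sketch (not the repo's helpers; the barrier and the map_location are the parts worth keeping): rank 0 writes the file, everyone waits, and each rank loads it back onto its own GPU.

import torch
import torch.distributed as dist

def save_on_rank0(model, path, local_rank):
    if local_rank == 0:
        # model.module: unwrap DistributedDataParallel before saving
        torch.save(model.module.state_dict(), path)
    dist.barrier()  # make sure the file exists before any rank tries to read it

def load_everywhere(model, path, local_rank):
    # map the checkpoint onto this process's GPU rather than the cuda:0
    # it was saved from
    map_location = {'cuda:0': 'cuda:{}'.format(local_rank)}
    model.module.load_state_dict(torch.load(path, map_location=map_location))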

RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

Running your script as-is, I keep getting this error and cannot find the cause.

root@pai-worker1:/home/Data/exports/pytorch-distributed# srun -N1 -n2 --gres gpu:2  python distributed_slurm_main.py --dist-file dist_file
Traceback (most recent call last):
  File "distributed_slurm_main.py", line 420, in <module>
    main()
  File "distributed_slurm_main.py", line 131, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/Data/exports/pytorch-distributed/distributed_slurm_main.py", line 137, in main_worker
    dist.init_process_group(backend='nccl',
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

Traceback (most recent call last):
  File "distributed_slurm_main.py", line 420, in <module>
    main()
  File "distributed_slurm_main.py", line 131, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/Data/exports/pytorch-distributed/distributed_slurm_main.py", line 137, in main_worker
    dist.init_process_group(backend='nccl',
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

srun: error: pai-worker1: tasks 0-1: Exited with exit code 1
root@pai-worker1:/home/Data/exports/pytorch-distributed#
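The message itself carries the hint: worker_count=4 against world_size=2 means twice as many processes registered with the store as were declared. With srun -N1 -n2, two tasks start on the node and each of them also calls mp.spawn over the two GPUs, so the declared world_size and the real process count disagree. I am not certain of the script's internals, but one consistent layout is a single srun task per node that spawns one process per GPU, roughly:

import os
import torch
import torch.distributed as dist

# a sketch assuming srun is invoked with -N <nodes> -n <nodes> --gres gpu:<gpus_per_node>
def init_from_slurm(local_rank, ngpus_per_node, dist_url):
    num_nodes = int(os.environ['SLURM_NNODES'])
    node_rank = int(os.environ['SLURM_NODEID'])
    world_size = num_nodes * ngpus_per_node          # must equal the total process count
    rank = node_rank * ngpus_per_node + local_rank
    dist.init_process_group(backend='nccl', init_method=dist_url,
                            world_size=world_size, rank=rank)
    torch.cuda.set_device(local_rank)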

Question about calling backward on the loss

Hi, why does the code call loss.backward() when computing gradients instead of reduce_loss.backward()?
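For context (a sketch of the common pattern, not necessarily the repo's exact code): DistributedDataParallel already averages the gradients across processes during backward, so backpropagation should use the local loss; reduce_loss exists only so that the logged value reflects every GPU rather than just the local one.

import torch
import torch.distributed as dist

def reduce_mean(tensor, nprocs):
    # average a value across processes, for logging only
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= nprocs
    return rt

# loss = criterion(output, target)
# reduce_loss = reduce_mean(loss.detach(), nprocs)   # logging/printing only
# loss.backward()   # DDP averages the gradients across processes here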

[Discussion] mp: duplicate of torch.cuda.set_device(local_rank) and images = images.cuda(local_rank, non_blocking=True)

Hi there,

Great repo!
I'm studying this topic and found that the official ImageNet classification example also uses multiprocessing.

I noticed that they not only call
torch.cuda.set_device(local_rank) (L144)
but also pass the specific GPU id everywhere (their args.gpu refers to the local rank):

model.cuda(args.gpu)  # L145
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])  # L151
criterion = nn.CrossEntropyLoss().cuda(args.gpu)  # L169

loc = 'cuda:{}'.format(args.gpu)  # L183
checkpoint = torch.load(args.resume, map_location=loc)

if args.gpu is not None:  # L 282
    images = images.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)

This is a bit weird. I'm wondering if you have any idea about this phenomenon?

And the doc for torch.cuda.set_device says that:
"Usage of this function is discouraged in favor of device. In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable."

Also, I noticed that even when using mp, sometimes I cannot kill all the processes with Ctrl+D and have to kill them by their PIDs instead. Not sure if you have ever run into this problem?

Thank you!

GPU utilization fluctuates up and down during multi-GPU training on 2080 Ti cards. What is going on?

Fri Jan 3 15:59:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 27% 26C P8 3W / 250W | 71MiB / 11016MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:02:00.0 Off | N/A |
| 27% 26C P8 11W / 250W | 1MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:03:00.0 Off | N/A |
| 49% 52C P2 58W / 250W | 10934MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:82:00.0 Off | N/A |
| 48% 54C P2 102W / 250W | 10930MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1515 G /usr/lib/xorg/Xorg 44MiB |
| 0 2542 G /usr/lib/xorg/Xorg 12MiB |
| 0 4109 G /usr/lib/xorg/Xorg 12MiB |
| 2 14865 C /usr/bin/python3 10923MiB |
| 3 14866 C /usr/bin/python3 10919MiB |
+-----------------------------------------------------------------------------+

The CPU usage of these two processes, 14865 and 14866, also fluctuates up and down.

2020-01-03 15:35:06: epoch 0 step 100 loss_board:11.4835 loss_board2:9.2581 acc:0.0000 acc2:0.0001
2020-01-03 15:35:06: epoch 0 step 100 loss_board:11.4867 loss_board2:9.2649 acc:0.0000 acc2:0.0001
2020-01-03 15:36:12: epoch 0 step 200 loss_board:11.3776 loss_board2:9.2109 acc:0.0000 acc2:0.0000
2020-01-03 15:36:12: epoch 0 step 200 loss_board:11.3761 loss_board2:9.2100 acc:0.0001 acc2:0.0001
2020-01-03 15:37:18: epoch 0 step 300 loss_board:11.3758 loss_board2:9.1939 acc:0.0000 acc2:0.0001
2020-01-03 15:37:18: epoch 0 step 300 loss_board:11.3738 loss_board2:9.2173 acc:0.0000 acc2:0.0002
2020-01-03 15:38:25: epoch 0 step 400 loss_board:11.3720 loss_board2:9.1942 acc:0.0000 acc2:0.0001
2020-01-03 15:38:25: epoch 0 step 400 loss_board:11.3726 loss_board2:9.1976 acc:0.0000 acc2:0.0002
2020-01-03 15:39:31: epoch 0 step 500 loss_board:11.3741 loss_board2:9.1846 acc:0.0000 acc2:0.0001
2020-01-03 15:39:31: epoch 0 step 500 loss_board:11.3731 loss_board2:9.1964 acc:0.0000 acc2:0.0002
2020-01-03 15:40:38: epoch 0 step 600 loss_board:11.3720 loss_board2:9.1539 acc:0.0000 acc2:0.0001
2020-01-03 15:40:38: epoch 0 step 600 loss_board:11.3691 loss_board2:9.1631 acc:0.0000 acc2:0.0001
2020-01-03 15:43:44: epoch 0 step 700 loss_board:11.3655 loss_board2:9.1452 acc:0.0001 acc2:0.0002
2020-01-03 15:43:44: epoch 0 step 700 loss_board:11.3699 loss_board2:9.1368 acc:0.0000 acc2:0.0000
2020-01-03 15:47:08: epoch 0 step 800 loss_board:11.3642 loss_board2:9.1215 acc:0.0001 acc2:0.0000
2020-01-03 15:47:08: epoch 0 step 800 loss_board:11.3608 loss_board2:9.1222 acc:0.0000 acc2:0.0001
2020-01-03 15:50:24: epoch 0 step 900 loss_board:11.3618 loss_board2:9.1017 acc:0.0000 acc2:0.0000
2020-01-03 15:50:24: epoch 0 step 900 loss_board:11.3584 loss_board2:9.1050 acc:0.0000 acc2:0.0003
2020-01-03 15:53:52: epoch 0 step 1000 loss_board:11.3562 loss_board2:9.0890 acc:0.0000 acc2:0.0001
2020-01-03 15:53:52: epoch 0 step 1000 loss_board:11.3541 loss_board2:9.0840 acc:0.0000 acc2:0.0001
2020-01-03 15:57:21: epoch 0 step 1100 loss_board:11.3503 loss_board2:9.0778 acc:0.0001 acc2:0.0001
2020-01-03 15:57:21: epoch 0 step 1100 loss_board:11.3462 loss_board2:9.0838 acc:0.0000 acc2:0.0004
2020-01-03 16:00:44: epoch 0 step 1200 loss_board:11.3430 loss_board2:9.0621 acc:0.0000 acc2:0.0002
2020-01-03 16:00:44: epoch 0 step 1200 loss_board:11.3428 loss_board2:9.0644 acc:0.0001 acc2:0.0003

The training time per 100 steps went from about 1 minute at the start to about 3 minutes later on.

Would adding NVLink help with this?
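Hard to say remotely, but utilisation swinging between 0% and 100% together with the step time growing from about 1 to 3 minutes usually points at the input pipeline (disk or page cache) rather than GPU-to-GPU traffic, in which case NVLink would not change anything. A small probe to see where the time goes (not the repo's code; loader is whatever DataLoader the script builds):

import time

def profile_loader(loader, max_steps=200):
    data_wait, total, n = 0.0, 0.0, 0
    end = time.time()
    for batch in loader:
        data_wait += time.time() - end   # time spent waiting for the batch
        # ... the forward/backward pass would go here ...
        total += time.time() - end
        end = time.time()
        n += 1
        if n >= max_steps:
            break
    print('avg data wait: {:.3f}s  avg step: {:.3f}s'.format(data_wait / n, total / n))

If the data-wait time dominates and grows over the epoch, more num_workers, pin_memory, or faster storage will help far more than NVLink.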

Uneven GPU memory usage across cards

I ran distributed.py and found that the GPU memory usage is unbalanced: the main card uses 10 GB while the other three cards use 8 GB each.
How can I fix this?
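Two habits usually remove that extra ~2 GB on the first card (a sketch; local_rank and the checkpoint path are placeholders, not names from distributed.py):

import torch

def setup_process(local_rank, resume_path=None):
    # 1) pin this process to its own GPU before any tensor touches CUDA,
    #    otherwise every process also creates a context on cuda:0
    torch.cuda.set_device(local_rank)

    # 2) when restoring a checkpoint, map it onto the local GPU instead of
    #    the default cuda:0 it was saved from
    if resume_path is not None:
        return torch.load(resume_path,
                          map_location='cuda:{}'.format(local_rank))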

Can't pickle <function main_worker

When I used multiprocessing distributed training, I encountered an error:
Can't pickle <function main_worker at 0x7f1c444e9d30>: attribute lookup main_worker on __main__ failed.
I get this error even though I did not make any changes to multiprocessing_distributed.py.
Can you help me?
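That error is usually about where main_worker is defined rather than about the repo's code: with the spawn start method the target function is pickled by reference, so it has to be a top-level function in a real .py file. Running the same code from a notebook or a pasted interactive session produces exactly this "attribute lookup ... on __main__ failed" message. A minimal layout that spawn can pickle:

import torch.multiprocessing as mp

def main_worker(local_rank, ngpus_per_node, args):
    # must live at module top level so spawn can find it by name
    print('worker', local_rank, 'of', ngpus_per_node)

def main():
    ngpus_per_node = 2
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, None))

if __name__ == '__main__':
    main()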

Why does the data_prefetcher in the apex version normalize a second time?

@tczhangzhi Hi, thanks for sharing. One question: while reading apex_distributed.py I noticed that the dataset already applies normalize, so why does data_prefetcher normalize again? Doesn't that mean normalization is applied twice in total? Am I misunderstanding something, or is there a special reason for doing it this way?
Normalize in data_prefetcher: [screenshots]
Normalize in the dataset: [screenshot]
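Going only by the screenshots: in the apex ImageNet example this prefetcher is presumably modelled on, the intent is for normalisation to happen exactly once, on the GPU, with the dataset transform stopping at ToTensor. If the dataset here also applies transforms.Normalize, then the statistics really are applied twice. A sketch of the intended split (not the repo's exact code):

import torch
import torchvision.transforms as transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),                   # floats in [0, 1], no Normalize here
])

mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1).cuda()
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1).cuda()

def normalize_on_gpu(batch):
    # batch: float NCHW tensor already moved to the GPU by the prefetcher
    return batch.sub_(mean).div_(std)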
