tczhangzhi / pytorch-distributed Goto Github PK

A quickstart and benchmark for pytorch distributed training.

License: MIT License

Python 98.56% Shell 1.44%

pytorch-distributed's Issues

您好，我想请问一下Pytorch 在多进程分布式训练的时候，一开始加载数据集dataloader的时候，由于是多进程的，就会导致多个进程一起加载dataloader，这些dataloder全部加载到内存中使得内存爆炸，请问有没有什么办法解决呢

使用mp.spawn启动比torch.distributed.launch慢很多

作者您好，我最近测试多GPU训练时遇到了使用mp.spawn启动比torch.distributed.launch慢很多的情况。
我发现使用mp.spawn方法后，每次个epoch开始时都会等待很久，但使用torch.distributed.launch启动时就没有出现这种情况。请问作者您在使用过程中有出现这种情况吗？

大佬，请问如何指定gpu训练

例如，我在一张8卡节点上训练，想用其中4张训练
如果我用0，1，2，3是可以训练的
但是如果我用其他任意组合的gpuid就不可以

我参考了这个把每个进程的gpuid 改了
Lightning-AI/pytorch-lightning#2407

会提示
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59

我的代码

import torch
import torch.nn as nn
import torch.distributed as dist
import torch.utils.data.distributed
import torch.multiprocessing as mp
import argparse
import os



parser = argparse.ArgumentParser(description = 'multi process')

parser.add_argument('--gpu-id',type =str,default='0,1,2,4')
parser.add_argument('--world-size', default=1, type=int,
                    help='number of nodes for distributed training')
parser.add_argument('--rank', default=0, type=int,
                    help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://localhost:23456', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='nccl', type=str,
                    help='distributed backend')
args = parser.parse_args()






def  main():
    global args


    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id
    # args.gpu = list(map(int,args.gpu_id.split(',')))

    # state = {k: v for k, v in args._get_kwargs()}

    # ngpus_per_node = torch.cuda.device_count() #len(args.gpu)

    ngpus_per_node = args.gpu_id.split(',').__len__()
    # print(os.environ['CUDA_VISIBLE_DEVICES'])
    # print('能看到的gpu',ngpus_per_node)
    args.nprocs = ngpus_per_node
    args.world_size = ngpus_per_node * args.world_size
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))


# Random seed

# best_acc = 0  # best test accuracy

def main_worker(local_rank,ngpus_per_node,args):
    # global best_acc
 # start from epoch 0 or last checkpoint epoch

    # if not os.path.isdir(args.checkpoint):
    #     mkdir_p(args.checkpoint)
    # # import pdb
    # pdb.set_trace()
    gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    gpu = int(gpus[local_rank])

    args.gpu = gpu
    best_acc = 0
    # print(best_acc)
    args.rank = args.rank * ngpus_per_node + local_rank#args.gpu[gpu]
    print('rank: {} / {}'.format(args.rank, args.world_size))

    dist.init_process_group(backend=args.dist_backend,
                            init_method=args.dist_url,
                            world_size=args.world_size,
                            rank=args.rank)



    torch.cuda.set_device(gpu)


if __name__ == '__main__':
    main()`

distributed.py seed

seed 放到main里面好像没用吧，这样每次运行main_worker都是的到不相同的随机数。
是不是应该放到main_worker里面

No need to divide with nprocs again

horovod.allreduce calculate the average value by default

pytorch-distributed/horovod_distributed.py

Line 73 in c10e193

rt /= nprocs

Does the test accuracy need to be synchronized in distributed.py?

If directly output the test accuracy, will the code automatically synchronize the accuracy between each GPUs?

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1515 G /usr/lib/xorg/Xorg 44MiB |
| 0 2542 G /usr/lib/xorg/Xorg 12MiB |
| 0 4109 G /usr/lib/xorg/Xorg 12MiB |
| 2 14865 C /usr/bin/python3 10923MiB |
| 3 14866 C /usr/bin/python3 10919MiB |
+-----------------------------------------------------------------------------+

14865 14866 这两个进程CPU占有率也是忽高忽低

2020-01-03 15:35:06: epoch 0 step 100 loss_board:11.4835 loss_board2:9.2581 acc:0.0000 acc2:0.0001
2020-01-03 15:35:06: epoch 0 step 100 loss_board:11.4867 loss_board2:9.2649 acc:0.0000 acc2:0.0001
2020-01-03 15:36:12: epoch 0 step 200 loss_board:11.3776 loss_board2:9.2109 acc:0.0000 acc2:0.0000
2020-01-03 15:36:12: epoch 0 step 200 loss_board:11.3761 loss_board2:9.2100 acc:0.0001 acc2:0.0001
2020-01-03 15:37:18: epoch 0 step 300 loss_board:11.3758 loss_board2:9.1939 acc:0.0000 acc2:0.0001
2020-01-03 15:37:18: epoch 0 step 300 loss_board:11.3738 loss_board2:9.2173 acc:0.0000 acc2:0.0002
2020-01-03 15:38:25: epoch 0 step 400 loss_board:11.3720 loss_board2:9.1942 acc:0.0000 acc2:0.0001
2020-01-03 15:38:25: epoch 0 step 400 loss_board:11.3726 loss_board2:9.1976 acc:0.0000 acc2:0.0002
2020-01-03 15:39:31: epoch 0 step 500 loss_board:11.3741 loss_board2:9.1846 acc:0.0000 acc2:0.0001
2020-01-03 15:39:31: epoch 0 step 500 loss_board:11.3731 loss_board2:9.1964 acc:0.0000 acc2:0.0002
2020-01-03 15:40:38: epoch 0 step 600 loss_board:11.3720 loss_board2:9.1539 acc:0.0000 acc2:0.0001
2020-01-03 15:40:38: epoch 0 step 600 loss_board:11.3691 loss_board2:9.1631 acc:0.0000 acc2:0.0001
2020-01-03 15:43:44: epoch 0 step 700 loss_board:11.3655 loss_board2:9.1452 acc:0.0001 acc2:0.0002
2020-01-03 15:43:44: epoch 0 step 700 loss_board:11.3699 loss_board2:9.1368 acc:0.0000 acc2:0.0000
2020-01-03 15:47:08: epoch 0 step 800 loss_board:11.3642 loss_board2:9.1215 acc:0.0001 acc2:0.0000
2020-01-03 15:47:08: epoch 0 step 800 loss_board:11.3608 loss_board2:9.1222 acc:0.0000 acc2:0.0001
2020-01-03 15:50:24: epoch 0 step 900 loss_board:11.3618 loss_board2:9.1017 acc:0.0000 acc2:0.0000
2020-01-03 15:50:24: epoch 0 step 900 loss_board:11.3584 loss_board2:9.1050 acc:0.0000 acc2:0.0003
2020-01-03 15:53:52: epoch 0 step 1000 loss_board:11.3562 loss_board2:9.0890 acc:0.0000 acc2:0.0001
2020-01-03 15:53:52: epoch 0 step 1000 loss_board:11.3541 loss_board2:9.0840 acc:0.0000 acc2:0.0001
2020-01-03 15:57:21: epoch 0 step 1100 loss_board:11.3503 loss_board2:9.0778 acc:0.0001 acc2:0.0001
2020-01-03 15:57:21: epoch 0 step 1100 loss_board:11.3462 loss_board2:9.0838 acc:0.0000 acc2:0.0004
2020-01-03 16:00:44: epoch 0 step 1200 loss_board:11.3430 loss_board2:9.0621 acc:0.0000 acc2:0.0002
2020-01-03 16:00:44: epoch 0 step 1200 loss_board:11.3428 loss_board2:9.0644 acc:0.0001 acc2:0.0003

训练时间从一开始1分钟到后来的3分钟

这个现象加个nvlink会有用吗

Connections time out

When using nccl as my communication backend in distributed learning, I found that all operations about gathering variables from other groups can't work. The program would be stopped because of connection time-out? Do you know what causes this issue, how can we do to solve it?

Windows 提示Distributed package doesn't have NCCL "Distributed package doesn't have NCCL built in

是不是Windows不能用NCCL的backend呢？如果是这样，请问Windows 想用多GPU怎么解决呢？感谢！

apex 并行加速中的data_prefetcher normalize为什么做两次？

@tczhangzhi 你好，感谢分享。有一个问题，我在看pex_distributed.py 的时候，发现dataset中已经做了normalize了，为什么在data_prefetcher中要在做一次normalize呢？等于一共做了两次normalize？是我理解的有问题还是这么做有特殊的原因？
data_prefetcher 中的normalize

dataset中的normalize

DDP掉精度

您好，我使用的是DistributedDataParallel，通信是nccl，然后数据方面使用的DistributedSampler，但是发现对比单卡同样参数设置的模型，多卡的精度会大幅下降（模型里也没有BN层）。
然后还有几个疑问望解答：

多卡相当于增大batch size吗？
2.假如使用warm up的话，对应设置和单卡时应该除以卡的数量吗？
3.使用adamw自适应优化器，多卡时需要将lr乘以对应卡数吗？
谢谢

how can i modify it

RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

按照你的脚本跑，一直报错，找不到原因。

root@pai-worker1:/home/Data/exports/pytorch-distributed# srun -N1 -n2 --gres gpu:2  python distributed_slurm_main.py --dist-file dist_file
Traceback (most recent call last):
  File "distributed_slurm_main.py", line 420, in <module>
    main()
  File "distributed_slurm_main.py", line 131, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/Data/exports/pytorch-distributed/distributed_slurm_main.py", line 137, in main_worker
    dist.init_process_group(backend='nccl',
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

Traceback (most recent call last):
  File "distributed_slurm_main.py", line 420, in <module>
    main()
  File "distributed_slurm_main.py", line 131, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/Data/exports/pytorch-distributed/distributed_slurm_main.py", line 137, in main_worker
    dist.init_process_group(backend='nccl',
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

srun: error: pai-worker1: tasks 0-1: Exited with exit code 1
root@pai-worker1:/home/Data/exports/pytorch-distributed#

[Discussion] mp: duplicate of torch.cuda.set_device(local_rank) and images = images.cuda(local_rank, non_blocking=True)

Hi there,

Great repo!
I'm studying this topic, and found out that the official repo of imagenet classification also uses multiprocessing.

I noticed one place that they not only use
torch.cuda.set_device(local_rank) (L144)
but also set the specific gpu id everywhere (their args.gpu refers to local rank):

model.cuda(args.gpu)  # L145
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])  # L151
criterion = nn.CrossEntropyLoss().cuda(args.gpu)  # L169

loc = 'cuda:{}'.format(args.gpu)  # L183
checkpoint = torch.load(args.resume, map_location=loc)

if args.gpu is not None:  # L 282
    images = images.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)

This is a bit weird. I'm wondering if you have any idea about this phenomenon?

And the doc for torch.cuda.set_device says that:
"Usage of this function is discouraged in favor of device. In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable."

Also, I noticed that even using mp, sometimes I cannot kill all the processes by Ctrl+D, but need to specifically kill the processes by their PID. Not sure if you ever met this problem?

Thank you!

Can't pickle <function main_worker

When I used multiprocessing distributed, I encountered an error：
Can't pickle <function main_worker at 0x7f1c444e9d30>: attribute lookup main_worker on main failed.
I found this error even if I did not make any changes to the multiprocessing distributed.py
Can you help me?

 save_checkpoint( 

 { 

 'epoch': epoch + 1, 

 'arch': args.arch, 

 'state_dict': model.module.state_dict(), 

 'best_acc1': best_acc1, 

 }, is_best)

In multi-GPU training, is it only necessary to save the model parameters during gpu=0

def train(model, start_epoch, end_epoch, tr_loader, optimizer, scheduler, loss_funcs, local_rank):
    for curr_epoch in range(start_epoch, end_epoch):
        train_epoch(curr_epoch, end_epoch, local_rank, loss_funcs, model, optimizer, scheduler, tr_loader)

        # 根据周期修改学习率
        if not arg_config["sche_usebatch"]:
            scheduler.step()

        if local_rank == 0:
            # 每个周期都进行保存测试，保存的是针对第curr_epoch+1周期的参数
            save_checkpoint(
                model=model,
                optimizer=optimizer,
                scheduler=scheduler,
                amp=amp if arg_config["use_amp"] else None,
                exp_name=exp_name,
                current_epoch=curr_epoch + 1,
                full_net_path=path_config["final_full_net"],
                state_net_path=path_config["final_state_net"],
            )  # 保存参数

and only need to follow the following steps when loading the parameters?

import apex.parallel as apexparallel
import numpy as np
import torch
import torch.backends.cudnn as torchcudnn

...

    if arg_config["multi_gpu"]:
        model = apexparallel.convert_syncbn_model(model)

    if arg_config["use_amp"]:
        assert torchcudnn.enabled, "Amp requires cudnn backend to be enabled."
        model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    if arg_config["multi_gpu"]:
        model = apexparallel.DistributedDataParallel(model, delay_allreduce=True)

    if arg_config["resume_mode"] == "train":
        # resume model to train the model
        start_epoch = resume_checkpoint(
            model=model,
            optimizer=optimizer,
            scheduler=scheduler,
            amp=amp if arg_config["use_amp"] else None,
            exp_name=exp_name,
            load_path=path_config["final_full_net"],
            mode="all",
        )
    else:
        # only train a new model.
        start_epoch = 0

多卡显存占用不均衡

我run distributed.py ，发现显存占用不均衡，主卡占用10GB，另外3个卡占用8GB。
请问怎么解决？

关于损失backward问题

作者大大您好，为何代码中计算梯度的时候用的是loss.backward()而不是reduce_loss.backward() ?

	save_checkpoint(
	{
	'epoch': epoch + 1,
	'arch': args.arch,
	'state_dict': model.module.state_dict(),
	'best_acc1': best_acc1,
	}, is_best)

tczhangzhi / pytorch-distributed Goto Github PK

pytorch-distributed's Issues

Recommend Projects

Recommend Topics

Recommend Org