
pytorch-distributed's Issues

Launching with mp.spawn is much slower than torch.distributed.launch

Hi, while testing multi-GPU training I recently found that launching with mp.spawn is much slower than launching with torch.distributed.launch.
With mp.spawn there is a long wait at the start of every epoch, which does not happen when I launch with torch.distributed.launch. Have you run into this in your own experiments?
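
For reference, one common cause of a long stall at the start of every epoch under mp.spawn is that the DataLoader worker processes are torn down and re-created each epoch, which the 'spawn' start method makes especially expensive. A minimal sketch with a dummy dataset standing in for the real one (persistent_workers needs PyTorch >= 1.7):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset in place of the real one, just to make the sketch runnable.
train_dataset = TensorDataset(torch.randn(256, 3, 32, 32),
                              torch.zeros(256, dtype=torch.long))

# persistent_workers keeps the worker processes alive across epochs instead
# of re-spawning them, which removes the per-epoch startup pause.
train_loader = DataLoader(train_dataset,
                          batch_size=64,
                          num_workers=4,
                          pin_memory=True,
                          persistent_workers=True)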

How can I train on a specific subset of GPUs?

For example, on an 8-GPU node I want to train on only 4 of the cards.
With GPU ids 0,1,2,3 training works,
but any other combination of GPU ids fails.

I changed each process's GPU id following
Lightning-AI/pytorch-lightning#2407

and now I get:
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59

My code:

import argparse
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.utils.data.distributed

parser = argparse.ArgumentParser(description='multi process')
parser.add_argument('--gpu-id', type=str, default='0,1,2,4')
parser.add_argument('--world-size', default=1, type=int,
                    help='number of nodes for distributed training')
parser.add_argument('--rank', default=0, type=int,
                    help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://localhost:23456', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='nccl', type=str,
                    help='distributed backend')
args = parser.parse_args()


def main():
    global args

    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id

    ngpus_per_node = len(args.gpu_id.split(','))
    args.nprocs = ngpus_per_node
    args.world_size = ngpus_per_node * args.world_size
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))


def main_worker(local_rank, ngpus_per_node, args):
    # map local_rank back to the physical GPU id given in --gpu-id
    gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    gpu = int(gpus[local_rank])

    args.gpu = gpu
    best_acc = 0
    args.rank = args.rank * ngpus_per_node + local_rank
    print('rank: {} / {}'.format(args.rank, args.world_size))

    dist.init_process_group(backend=args.dist_backend,
                            init_method=args.dist_url,
                            world_size=args.world_size,
                            rank=args.rank)

    torch.cuda.set_device(gpu)  # <- the "invalid device ordinal" error is raised here


if __name__ == '__main__':
    main()
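
For reference, the likely cause of the invalid device ordinal error above: once CUDA_VISIBLE_DEVICES='0,1,2,4' is set, the selected GPUs are renumbered 0..3 inside each process, so the physical id 4 is no longer a valid ordinal. A minimal sketch of the adjustment, reusing the argument parsing and spawn call above (not necessarily the author's intended fix):

def main_worker(local_rank, ngpus_per_node, args):
    # After CUDA_VISIBLE_DEVICES is set, device ordinals run from 0 to
    # ngpus_per_node - 1, so use local_rank directly instead of mapping it
    # back to the physical id (passing 4 raises "invalid device ordinal").
    torch.cuda.set_device(local_rank)

    args.rank = args.rank * ngpus_per_node + local_rank
    dist.init_process_group(backend=args.dist_backend,
                            init_method=args.dist_url,
                            world_size=args.world_size,
                            rank=args.rank)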

distributed.py seed

Setting the seed in main() does not seem to do anything: each main_worker process still gets different random numbers.
Shouldn't the seeding be placed inside main_worker instead?
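
For reference, a minimal sketch of seeding inside main_worker: each process started by mp.spawn re-imports the module, so a seed set in main() never reaches the workers. The helper name is just illustrative:

import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Call this at the top of main_worker so every spawned process starts
    # from the same, reproducible random state.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices
    torch.backends.cudnn.deterministic = True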

During multi-GPU training on 2080 Ti cards the GPU utilization keeps swinging between high and low. What is going on?

Fri Jan 3 15:59:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   26C    P8     3W / 250W |     71MiB / 11016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 27%   26C    P8    11W / 250W |      1MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:03:00.0 Off |                  N/A |
| 49%   52C    P2    58W / 250W |  10934MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:82:00.0 Off |                  N/A |
| 48%   54C    P2   102W / 250W |  10930MiB / 11019MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1515      G   /usr/lib/xorg/Xorg                            44MiB |
|    0      2542      G   /usr/lib/xorg/Xorg                            12MiB |
|    0      4109      G   /usr/lib/xorg/Xorg                            12MiB |
|    2     14865      C   /usr/bin/python3                           10923MiB |
|    3     14866      C   /usr/bin/python3                           10919MiB |
+-----------------------------------------------------------------------------+

The CPU usage of processes 14865 and 14866 also swings between high and low.

2020-01-03 15:35:06: epoch 0 step 100 loss_board:11.4835 loss_board2:9.2581 acc:0.0000 acc2:0.0001
2020-01-03 15:35:06: epoch 0 step 100 loss_board:11.4867 loss_board2:9.2649 acc:0.0000 acc2:0.0001
2020-01-03 15:36:12: epoch 0 step 200 loss_board:11.3776 loss_board2:9.2109 acc:0.0000 acc2:0.0000
2020-01-03 15:36:12: epoch 0 step 200 loss_board:11.3761 loss_board2:9.2100 acc:0.0001 acc2:0.0001
2020-01-03 15:37:18: epoch 0 step 300 loss_board:11.3758 loss_board2:9.1939 acc:0.0000 acc2:0.0001
2020-01-03 15:37:18: epoch 0 step 300 loss_board:11.3738 loss_board2:9.2173 acc:0.0000 acc2:0.0002
2020-01-03 15:38:25: epoch 0 step 400 loss_board:11.3720 loss_board2:9.1942 acc:0.0000 acc2:0.0001
2020-01-03 15:38:25: epoch 0 step 400 loss_board:11.3726 loss_board2:9.1976 acc:0.0000 acc2:0.0002
2020-01-03 15:39:31: epoch 0 step 500 loss_board:11.3741 loss_board2:9.1846 acc:0.0000 acc2:0.0001
2020-01-03 15:39:31: epoch 0 step 500 loss_board:11.3731 loss_board2:9.1964 acc:0.0000 acc2:0.0002
2020-01-03 15:40:38: epoch 0 step 600 loss_board:11.3720 loss_board2:9.1539 acc:0.0000 acc2:0.0001
2020-01-03 15:40:38: epoch 0 step 600 loss_board:11.3691 loss_board2:9.1631 acc:0.0000 acc2:0.0001
2020-01-03 15:43:44: epoch 0 step 700 loss_board:11.3655 loss_board2:9.1452 acc:0.0001 acc2:0.0002
2020-01-03 15:43:44: epoch 0 step 700 loss_board:11.3699 loss_board2:9.1368 acc:0.0000 acc2:0.0000
2020-01-03 15:47:08: epoch 0 step 800 loss_board:11.3642 loss_board2:9.1215 acc:0.0001 acc2:0.0000
2020-01-03 15:47:08: epoch 0 step 800 loss_board:11.3608 loss_board2:9.1222 acc:0.0000 acc2:0.0001
2020-01-03 15:50:24: epoch 0 step 900 loss_board:11.3618 loss_board2:9.1017 acc:0.0000 acc2:0.0000
2020-01-03 15:50:24: epoch 0 step 900 loss_board:11.3584 loss_board2:9.1050 acc:0.0000 acc2:0.0003
2020-01-03 15:53:52: epoch 0 step 1000 loss_board:11.3562 loss_board2:9.0890 acc:0.0000 acc2:0.0001
2020-01-03 15:53:52: epoch 0 step 1000 loss_board:11.3541 loss_board2:9.0840 acc:0.0000 acc2:0.0001
2020-01-03 15:57:21: epoch 0 step 1100 loss_board:11.3503 loss_board2:9.0778 acc:0.0001 acc2:0.0001
2020-01-03 15:57:21: epoch 0 step 1100 loss_board:11.3462 loss_board2:9.0838 acc:0.0000 acc2:0.0004
2020-01-03 16:00:44: epoch 0 step 1200 loss_board:11.3430 loss_board2:9.0621 acc:0.0000 acc2:0.0002
2020-01-03 16:00:44: epoch 0 step 1200 loss_board:11.3428 loss_board2:9.0644 acc:0.0001 acc2:0.0003

The time per 100 steps grew from about 1 minute at the beginning to about 3 minutes later on.

Would adding NVLink help with this?

Connection times out

When using NCCL as the communication backend for distributed training, I found that every operation that gathers variables from the other processes fails: the program stops with a connection timeout. Do you know what causes this issue and how it can be solved?
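
For reference, with the NCCL backend a collective such as all_gather blocks until every rank in the group calls it, so a timeout usually means at least one rank skipped the call (for example inside an `if rank == 0:` branch). A minimal sketch that assumes the process group is already initialized and each process has set its CUDA device:

import torch
import torch.distributed as dist

def gather_scalar(value: float, world_size: int) -> list:
    # Every rank must execute this function; if only one rank calls it, the
    # others never enter the collective and NCCL eventually times out.
    tensor = torch.tensor([value], device=torch.cuda.current_device())
    gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)
    return [t.item() for t in gathered]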

Why is normalization done twice in the apex data_prefetcher?

@tczhangzhi Hi, thanks for sharing. One question: while reading apex_distributed.py I noticed that the dataset already applies normalization, so why is normalization applied again in data_prefetcher? Doesn't that normalize twice in total? Am I misunderstanding something, or is there a specific reason for doing it this way?
Normalization in data_prefetcher:
[screenshot]
[screenshot]
Normalization in the dataset:
[screenshot]
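
For reference, in NVIDIA's original apex ImageNet example the normalization is deliberately removed from the dataset transforms and applied exactly once, on the GPU, inside the prefetcher; keeping it in both places would indeed normalize twice. A minimal sketch of GPU-side normalization applied once (standard ImageNet statistics; the batch is assumed to be a float NCHW tensor in [0, 1]):

import torch

# Standard ImageNet statistics, shaped for broadcasting over NCHW batches.
MEAN = torch.tensor([0.485, 0.456, 0.406], device='cuda').view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device='cuda').view(1, 3, 1, 1)

def prefetch_normalize(batch: torch.Tensor) -> torch.Tensor:
    # Applied once on the GPU; the dataset transform should then not contain
    # transforms.Normalize, otherwise the statistics are applied twice.
    return batch.sub_(MEAN).div_(STD)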

Accuracy drop with DDP

Hello, I am using DistributedDataParallel with the NCCL backend and a DistributedSampler for the data, but compared to a single-GPU run with the same settings the multi-GPU accuracy drops substantially (the model has no BN layers).
A few more questions I hope you can answer:

  1. Is multi-GPU training equivalent to a larger batch size?
  2. When using warm-up, should the corresponding settings be divided by the number of cards relative to the single-GPU case?
  3. With the adaptive optimizer AdamW, does the learning rate need to be multiplied by the number of cards for multi-GPU training?

Thanks.
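
For reference, a few points that commonly explain a DDP accuracy gap: the effective batch size is the per-GPU batch times the world size; the DistributedSampler needs set_epoch() every epoch so the shuffling actually changes; and the learning-rate schedule usually has to be rescaled for the larger effective batch (linear scaling is a common heuristic for SGD, while for AdamW there is no settled rule). A minimal sketch of the first two points (per_gpu_batch and the commented loop are illustrative):

import torch.distributed as dist

world_size = dist.get_world_size() if dist.is_initialized() else 1
per_gpu_batch = 64                           # illustrative value
effective_batch = per_gpu_batch * world_size

# Common heuristic (not a guarantee): scale the base LR by the same factor.
base_lr = 0.1
scaled_lr = base_lr * world_size

# Re-shuffle correctly each epoch, otherwise every epoch sees the same
# per-rank ordering:
# for epoch in range(num_epochs):
#     train_sampler.set_epoch(epoch)         # train_sampler: DistributedSampler
#     ...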

RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

Running your script as-is, I keep hitting this error and cannot find the cause.

root@pai-worker1:/home/Data/exports/pytorch-distributed# srun -N1 -n2 --gres gpu:2  python distributed_slurm_main.py --dist-file dist_file
Traceback (most recent call last):
  File "distributed_slurm_main.py", line 420, in <module>
    main()
  File "distributed_slurm_main.py", line 131, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/Data/exports/pytorch-distributed/distributed_slurm_main.py", line 137, in main_worker
    dist.init_process_group(backend='nccl',
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

Traceback (most recent call last):
  File "distributed_slurm_main.py", line 420, in <module>
    main()
  File "distributed_slurm_main.py", line 131, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/Data/exports/pytorch-distributed/distributed_slurm_main.py", line 137, in main_worker
    dist.init_process_group(backend='nccl',
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)

srun: error: pai-worker1: tasks 0-1: Exited with exit code 1
root@pai-worker1:/home/Data/exports/pytorch-distributed#
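
For reference, the world_size=2, worker_count=4 in the message suggests a double launch: srun -N1 -n2 starts two tasks, and each task then spawns one process per GPU with mp.spawn, so four processes try to join a group sized for two. A minimal sketch of making the counts agree when combining Slurm with mp.spawn, using one task per node (the environment variable names are the standard Slurm ones):

import os

import torch

# One Slurm task per node (srun -N<nodes> -n<nodes>); mp.spawn then fans out
# over the GPUs of that node.
ngpus_per_node = torch.cuda.device_count()
nnodes = int(os.environ.get('SLURM_NNODES', 1))
node_rank = int(os.environ.get('SLURM_NODEID', 0))

world_size = nnodes * ngpus_per_node   # total number of spawned processes
# each worker's global rank: node_rank * ngpus_per_node + local_rank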

[Discussion] mp: duplicate of torch.cuda.set_device(local_rank) and images = images.cuda(local_rank, non_blocking=True)

Hi there,

Great repo!
I'm studying this topic and noticed that the official ImageNet classification example also uses multiprocessing.

I noticed that they not only call
torch.cuda.set_device(local_rank) (L144)
but also pass the specific GPU id everywhere (their args.gpu refers to the local rank):

model.cuda(args.gpu)  # L145
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])  # L151
criterion = nn.CrossEntropyLoss().cuda(args.gpu)  # L169

loc = 'cuda:{}'.format(args.gpu)  # L183
checkpoint = torch.load(args.resume, map_location=loc)

if args.gpu is not None:  # L 282
    images = images.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)

This looks redundant. Do you have any idea why they do it this way?

And the doc for torch.cuda.set_device says that:
"Usage of this function is discouraged in favor of device. In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable."

Also, I noticed that even when using mp, sometimes I cannot kill all the processes with Ctrl+D and have to kill them one by one by PID. Have you ever run into this?

Thank you!
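
For reference, torch.cuda.set_device(local_rank) changes the process-wide default CUDA device, so a bare .cuda() already lands on that GPU; passing args.gpu everywhere in the official example is therefore redundant in the one-process-per-GPU path, but it makes placement explicit and also covers code paths where set_device is never called. A minimal sketch of the equivalence (local_rank is illustrative; in a worker it would be the spawned process's rank on the node):

import torch

local_rank = 0
torch.cuda.set_device(local_rank)

x = torch.randn(2, 2)
a = x.cuda()            # goes to cuda:local_rank, the new default device
b = x.cuda(local_rank)  # explicit device id, ends up on the same GPU
assert a.device == b.device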

Can't pickle <function main_worker

When I ran the multiprocessing distributed script, I encountered an error:
Can't pickle <function main_worker at 0x7f1c444e9d30>: attribute lookup main_worker on __main__ failed.
I get this error even without making any changes to multiprocessing distributed.py.
Can you help me?
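
For reference, mp.spawn uses the 'spawn' start method, which pickles the target function by its qualified name; main_worker therefore has to be a top-level function of an importable module, not something defined inside another function, a notebook cell, or an interactive session. A minimal sketch of the required layout (file name is illustrative):

# train.py
import torch.multiprocessing as mp

def main_worker(local_rank, nprocs):
    # Top-level function: picklable by name under the 'spawn' start method.
    print(f'worker {local_rank}/{nprocs}')

if __name__ == '__main__':
    mp.spawn(main_worker, nprocs=2, args=(2,))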

How can I modify it?

I want to modify the code so that it can run on multiple machines, but I don't know how to do this.
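
For reference, the spawn-based script already generalizes to several machines: run the same command on every node with the same number of GPUs, point --dist-url at the first node, set --world-size to the number of nodes, and give each node its index via --rank. A rough sketch of what the per-node launches and the derived values look like (host address, port, GPU count, and the exact script name are illustrative):

# Node 0 (its address goes into --dist-url):
#   python multiprocessing_distributed.py --world-size 2 --rank 0 \
#       --dist-url tcp://192.168.1.10:23456
# Node 1:
#   python multiprocessing_distributed.py --world-size 2 --rank 1 \
#       --dist-url tcp://192.168.1.10:23456

ngpus_per_node = 4                  # GPUs on each machine (illustrative)
world_size = 2 * ngpus_per_node     # nodes * GPUs per node = 8 processes
# per-worker global rank: node_rank * ngpus_per_node + local_rank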

About saving the model in multi-GPU training

save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.module.state_dict(),
    'best_acc1': best_acc1,
}, is_best)

In multi-GPU training, is it only necessary to save the model parameters on GPU 0 (local_rank == 0), as in the following?

def train(model, start_epoch, end_epoch, tr_loader, optimizer, scheduler, loss_funcs, local_rank):
    for curr_epoch in range(start_epoch, end_epoch):
        train_epoch(curr_epoch, end_epoch, local_rank, loss_funcs, model, optimizer, scheduler, tr_loader)

        # step the learning-rate scheduler once per epoch
        if not arg_config["sche_usebatch"]:
            scheduler.step()

        if local_rank == 0:
            # save and test every epoch; the saved parameters correspond to epoch curr_epoch + 1
            save_checkpoint(
                model=model,
                optimizer=optimizer,
                scheduler=scheduler,
                amp=amp if arg_config["use_amp"] else None,
                exp_name=exp_name,
                current_epoch=curr_epoch + 1,
                full_net_path=path_config["final_full_net"],
                state_net_path=path_config["final_state_net"],
            )  # save the parameters

And is the following all that is needed when loading the parameters?

import apex.parallel as apexparallel
import numpy as np
import torch
import torch.backends.cudnn as torchcudnn
from apex import amp  # needed for amp.initialize below

...

    if arg_config["multi_gpu"]:
        model = apexparallel.convert_syncbn_model(model)

    if arg_config["use_amp"]:
        assert torchcudnn.enabled, "Amp requires cudnn backend to be enabled."
        model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    if arg_config["multi_gpu"]:
        model = apexparallel.DistributedDataParallel(model, delay_allreduce=True)

    if arg_config["resume_mode"] == "train":
        # resume the model to continue training
        start_epoch = resume_checkpoint(
            model=model,
            optimizer=optimizer,
            scheduler=scheduler,
            amp=amp if arg_config["use_amp"] else None,
            exp_name=exp_name,
            load_path=path_config["final_full_net"],
            mode="all",
        )
    else:
        # only train a new model.
        start_epoch = 0
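
For reference, saving from rank 0 only is the usual pattern, since DDP keeps all replicas identical after every step; the part that is easy to get wrong is loading, where each rank should map the checkpoint onto its own device rather than letting every copy land on GPU 0. A minimal sketch, independent of the config dictionaries above:

import torch
import torch.distributed as dist

def save_on_master(model, path):
    # Only rank 0 writes the file; one copy is enough because all replicas
    # hold the same weights.
    if dist.get_rank() == 0:
        torch.save(model.module.state_dict(), path)
    dist.barrier()  # keep the other ranks from racing ahead of the save

def load_everywhere(model, path, local_rank):
    # map_location puts the tensors on this rank's GPU; a bare torch.load
    # would materialize every rank's copy on the GPU it was saved from.
    state = torch.load(path, map_location='cuda:{}'.format(local_rank))
    model.module.load_state_dict(state)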

Uneven GPU memory usage across cards

When I run distributed.py, the memory usage is uneven: the main card uses 10 GB while the other three cards use 8 GB each.
How can this be fixed?
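
For reference, extra memory on the first card usually means something still defaults to cuda:0 in every process, most often a CUDA context or tensor created before the device is pinned, or a checkpoint loaded without map_location. A minimal sketch of pinning each process to its own card early (the function name is illustrative):

import torch

def bind_process_to_gpu(local_rank):
    # Call this before building the model or creating any CUDA tensor, so
    # each process opens its CUDA context only on its own card instead of
    # also allocating memory on cuda:0.
    torch.cuda.set_device(local_rank)
    return torch.device('cuda', local_rank)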

Question about calling backward on the loss

Hi, why does the code call loss.backward() to compute gradients rather than reduce_loss.backward()?
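
For reference, the reduced loss appears to be used only for logging, while gradient averaging is handled by DDP itself during backward; each process therefore calls backward on its own local loss, and the reduced value is just a synchronized number for printing. A minimal sketch of the split between the logged value and the backward call:

import torch
import torch.distributed as dist

def reduce_mean(tensor, nprocs):
    # For logging only: average a metric across all processes.
    rt = tensor.clone().detach()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    return rt / nprocs

# Inside the training step (sketch):
#   loss = criterion(output, target)
#   reduced_loss = reduce_mean(loss, world_size)  # what gets printed
#   optimizer.zero_grad()
#   loss.backward()       # DDP averages gradients across ranks here
#   optimizer.step()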
