tczhangzhi / pytorch-distributed
A quickstart and benchmark for pytorch distributed training.
License: MIT License
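Hello author, I recently tested multi-GPU training and found that launching with mp.spawn is much slower than launching with torch.distributed.launch.
With mp.spawn there is a long wait at the start of every epoch, which does not happen with torch.distributed.launch. Have you run into this in your own use?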
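For example, I am training on an 8-GPU node and want to use 4 of the GPUs.
If I use 0,1,2,3, training works.
But any other combination of GPU ids does not work.
Following this issue, I changed the GPU id used by each process:
Lightning-AI/pytorch-lightning#2407
It then reports:
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59
My code: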
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.utils.data.distributed
import torch.multiprocessing as mp
import argparse
import os

parser = argparse.ArgumentParser(description='multi process')
parser.add_argument('--gpu-id', type=str, default='0,1,2,4')
parser.add_argument('--world-size', default=1, type=int,
                    help='number of nodes for distributed training')
parser.add_argument('--rank', default=0, type=int,
                    help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://localhost:23456', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='nccl', type=str,
                    help='distributed backend')
args = parser.parse_args()


def main():
    global args
    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id
    # args.gpu = list(map(int,args.gpu_id.split(',')))
    # state = {k: v for k, v in args._get_kwargs()}
    # ngpus_per_node = torch.cuda.device_count() #len(args.gpu)
    ngpus_per_node = len(args.gpu_id.split(','))
    # print(os.environ['CUDA_VISIBLE_DEVICES'])
    # print('visible gpus', ngpus_per_node)
    args.nprocs = ngpus_per_node
    args.world_size = ngpus_per_node * args.world_size
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))


# Random seed
# best_acc = 0 # best test accuracy
def main_worker(local_rank, ngpus_per_node, args):
    # global best_acc
    # start from epoch 0 or last checkpoint epoch
    # if not os.path.isdir(args.checkpoint):
    #     mkdir_p(args.checkpoint)
    # import pdb
    # pdb.set_trace()
    gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    gpu = int(gpus[local_rank])
    args.gpu = gpu
    best_acc = 0
    # print(best_acc)
    args.rank = args.rank * ngpus_per_node + local_rank  # args.gpu[gpu]
    print('rank: {} / {}'.format(args.rank, args.world_size))
    dist.init_process_group(backend=args.dist_backend,
                            init_method=args.dist_url,
                            world_size=args.world_size,
                            rank=args.rank)
    torch.cuda.set_device(gpu)


if __name__ == '__main__':
    main()
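One possible reading of the error: once CUDA_VISIBLE_DEVICES is set, CUDA renumbers the visible devices starting from 0 inside the process, so the physical id 4 from '0,1,2,4' is no longer a valid ordinal when only four devices are visible. The sketch below only illustrates that mapping in plain Python; the helper names `visible_devices` and `logical_device` are hypothetical and not part of the repository or of PyTorch.

```python
from typing import List


def visible_devices(cuda_visible_devices: str) -> List[int]:
    """Parse a CUDA_VISIBLE_DEVICES string into the physical GPU ids it lists."""
    return [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]


def logical_device(local_rank: int, cuda_visible_devices: str) -> int:
    """Return the device ordinal a process with this local rank should select."""
    physical = visible_devices(cuda_visible_devices)
    if not 0 <= local_rank < len(physical):
        raise ValueError(f"local_rank {local_rank} outside 0..{len(physical) - 1}")
    # Visible GPUs are numbered 0..len(physical)-1 inside the process,
    # so the ordinal is the local rank itself, not the physical id.
    return local_rank


# Map each in-process ordinal to the physical GPU it refers to.
mapping = {
    logical_device(rank, "0,1,2,4"): visible_devices("0,1,2,4")[rank]
    for rank in range(4)
}
print(mapping)
```

Under this reading, the script above would call torch.cuda.set_device(local_rank) rather than torch.cuda.set_device(gpu).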
Setting the seed in main does not seem to have any effect: every run of main_worker still gets different random numbers.
Shouldn't it be set inside main_worker instead?
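For context, processes started with the spawn method begin as fresh Python interpreters, so random state set inside main() is not inherited by the workers. A minimal sketch of seeding inside the worker follows; it uses only the standard `random` module, and `seed_worker` and its `per_rank_offset` flag are hypothetical names. In a real training script the same place would presumably also call torch.manual_seed(seed) and numpy.random.seed(seed).

```python
import random


def seed_worker(base_seed: int, rank: int, per_rank_offset: bool = False) -> int:
    """Seed this process's RNG; optionally give each rank a distinct seed."""
    seed = base_seed + rank if per_rank_offset else base_seed
    random.seed(seed)
    return seed


def main_worker(local_rank: int, base_seed: int) -> float:
    """Stand-in for the per-process entry point passed to mp.spawn."""
    seed_worker(base_seed, local_rank)
    # Any random draw made after seeding is now reproducible for this rank.
    return random.random()


# Two runs of the same rank with the same seed produce the same first draw.
print(main_worker(0, 42) == main_worker(0, 42))
```

Whether all ranks should share one seed or use per-rank offsets depends on what the randomness is used for (for example, data augmentation usually wants different streams per rank).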
horovod.allreduce computes the average value by default.
If the test accuracy is printed directly, will the code automatically synchronize the accuracy across GPUs?
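As far as I understand, a value that is simply printed is whatever the local process computed on its own shard; it only becomes a global number if it is explicitly reduced. The sketch below is a plain-Python stand-in for that reduction (no Horovod or torch calls; the function names are hypothetical). It also shows why summing (correct, total) counts across ranks can differ from averaging per-rank accuracies when shards have different sizes.

```python
from typing import List, Tuple


def global_accuracy(per_rank_counts: List[Tuple[int, int]]) -> float:
    """Combine (correct, total) pairs from every rank, as a sum-reduce would."""
    correct = sum(c for c, _ in per_rank_counts)
    total = sum(t for _, t in per_rank_counts)
    return correct / total


def mean_of_accuracies(per_rank_counts: List[Tuple[int, int]]) -> float:
    """Average the per-rank accuracies, as an averaging reduce would."""
    return sum(c / t for c, t in per_rank_counts) / len(per_rank_counts)


# Hypothetical counts: rank 0 saw 100 samples, rank 1 saw 50.
counts = [(90, 100), (30, 50)]
print(global_accuracy(counts))
print(mean_of_accuracies(counts))
```

With equally sized shards the two results coincide, which is the usual situation when a distributed sampler pads the dataset.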
Fri Jan 3 15:59:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 27% 26C P8 3W / 250W | 71MiB / 11016MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:02:00.0 Off | N/A |
| 27% 26C P8 11W / 250W | 1MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:03:00.0 Off | N/A |
| 49% 52C P2 58W / 250W | 10934MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:82:00.0 Off | N/A |
| 48% 54C P2 102W / 250W | 10930MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1515 G /usr/lib/xorg/Xorg 44MiB |
| 0 2542 G /usr/lib/xorg/Xorg 12MiB |
| 0 4109 G /usr/lib/xorg/Xorg 12MiB |
| 2 14865 C /usr/bin/python3 10923MiB |
| 3 14866 C /usr/bin/python3 10919MiB |
+-----------------------------------------------------------------------------+
The CPU usage of the two processes 14865 and 14866 also swings between high and low.
2020-01-03 15:35:06: epoch 0 step 100 loss_board:11.4835 loss_board2:9.2581 acc:0.0000 acc2:0.0001
2020-01-03 15:35:06: epoch 0 step 100 loss_board:11.4867 loss_board2:9.2649 acc:0.0000 acc2:0.0001
2020-01-03 15:36:12: epoch 0 step 200 loss_board:11.3776 loss_board2:9.2109 acc:0.0000 acc2:0.0000
2020-01-03 15:36:12: epoch 0 step 200 loss_board:11.3761 loss_board2:9.2100 acc:0.0001 acc2:0.0001
2020-01-03 15:37:18: epoch 0 step 300 loss_board:11.3758 loss_board2:9.1939 acc:0.0000 acc2:0.0001
2020-01-03 15:37:18: epoch 0 step 300 loss_board:11.3738 loss_board2:9.2173 acc:0.0000 acc2:0.0002
2020-01-03 15:38:25: epoch 0 step 400 loss_board:11.3720 loss_board2:9.1942 acc:0.0000 acc2:0.0001
2020-01-03 15:38:25: epoch 0 step 400 loss_board:11.3726 loss_board2:9.1976 acc:0.0000 acc2:0.0002
2020-01-03 15:39:31: epoch 0 step 500 loss_board:11.3741 loss_board2:9.1846 acc:0.0000 acc2:0.0001
2020-01-03 15:39:31: epoch 0 step 500 loss_board:11.3731 loss_board2:9.1964 acc:0.0000 acc2:0.0002
2020-01-03 15:40:38: epoch 0 step 600 loss_board:11.3720 loss_board2:9.1539 acc:0.0000 acc2:0.0001
2020-01-03 15:40:38: epoch 0 step 600 loss_board:11.3691 loss_board2:9.1631 acc:0.0000 acc2:0.0001
2020-01-03 15:43:44: epoch 0 step 700 loss_board:11.3655 loss_board2:9.1452 acc:0.0001 acc2:0.0002
2020-01-03 15:43:44: epoch 0 step 700 loss_board:11.3699 loss_board2:9.1368 acc:0.0000 acc2:0.0000
2020-01-03 15:47:08: epoch 0 step 800 loss_board:11.3642 loss_board2:9.1215 acc:0.0001 acc2:0.0000
2020-01-03 15:47:08: epoch 0 step 800 loss_board:11.3608 loss_board2:9.1222 acc:0.0000 acc2:0.0001
2020-01-03 15:50:24: epoch 0 step 900 loss_board:11.3618 loss_board2:9.1017 acc:0.0000 acc2:0.0000
2020-01-03 15:50:24: epoch 0 step 900 loss_board:11.3584 loss_board2:9.1050 acc:0.0000 acc2:0.0003
2020-01-03 15:53:52: epoch 0 step 1000 loss_board:11.3562 loss_board2:9.0890 acc:0.0000 acc2:0.0001
2020-01-03 15:53:52: epoch 0 step 1000 loss_board:11.3541 loss_board2:9.0840 acc:0.0000 acc2:0.0001
2020-01-03 15:57:21: epoch 0 step 1100 loss_board:11.3503 loss_board2:9.0778 acc:0.0001 acc2:0.0001
2020-01-03 15:57:21: epoch 0 step 1100 loss_board:11.3462 loss_board2:9.0838 acc:0.0000 acc2:0.0004
2020-01-03 16:00:44: epoch 0 step 1200 loss_board:11.3430 loss_board2:9.0621 acc:0.0000 acc2:0.0002
2020-01-03 16:00:44: epoch 0 step 1200 loss_board:11.3428 loss_board2:9.0644 acc:0.0001 acc2:0.0003
The time per 100 training steps went from about 1 minute at the start to about 3 minutes later on.
Would adding NVLink help with this?
When using NCCL as the communication backend for distributed training, I found that none of the operations that gather variables from other groups work: the program stops with a connection timeout. Do you know what causes this issue and how it can be solved?
Is it the case that the NCCL backend cannot be used on Windows? If so, how can multi-GPU training be done on Windows? Thanks!
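To my knowledge NCCL is not available in PyTorch builds for Windows, while the gloo backend is supported there in recent releases. A small sketch of choosing the backend string by platform is shown below; `choose_backend` is a hypothetical helper, and its return value would be passed as the `backend` argument of dist.init_process_group.

```python
import sys


def choose_backend(platform: str = sys.platform, cuda_available: bool = True) -> str:
    """Pick a torch.distributed backend name for the given platform."""
    if platform.startswith("win"):
        # Assumption: no NCCL on Windows, so fall back to gloo.
        return "gloo"
    # On Linux, NCCL is the usual choice for GPU training; gloo otherwise.
    return "nccl" if cuda_available else "gloo"


print(choose_backend("win32"))
print(choose_backend("linux", cuda_available=True))
```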
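@tczhangzhi Hello, and thanks for sharing. I have a question: while reading pex_distributed.py I noticed that the dataset already applies normalization, so why is normalization applied again in data_prefetcher? Doesn't that mean the data is normalized twice? Am I misunderstanding something, or is there a specific reason for doing it this way?
The normalize in data_prefetcher:
The normalize in the dataset: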
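Hello, I am using DistributedDataParallel with NCCL for communication and DistributedSampler for the data, but compared with a single-GPU model using the same parameter settings, the multi-GPU accuracy drops significantly (the model has no BN layers either).
I also have a few questions I hope you can answer:
Running your script as-is, I keep getting an error and cannot find the cause.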
root@pai-worker1:/home/Data/exports/pytorch-distributed# srun -N1 -n2 --gres gpu:2 python distributed_slurm_main.py --dist-file dist_file
Traceback (most recent call last):
File "distributed_slurm_main.py", line 420, in <module>
main()
File "distributed_slurm_main.py", line 131, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/Data/exports/pytorch-distributed/distributed_slurm_main.py", line 137, in main_worker
dist.init_process_group(backend='nccl',
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
Traceback (most recent call last):
File "distributed_slurm_main.py", line 420, in <module>
main()
File "distributed_slurm_main.py", line 131, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/Data/exports/pytorch-distributed/distributed_slurm_main.py", line 137, in main_worker
dist.init_process_group(backend='nccl',
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
srun: error: pai-worker1: tasks 0-1: Exited with exit code 1
root@pai-worker1:/home/Data/exports/pytorch-distributed#
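The log reports worker_count=4 against world_size=2, and one worker identifies itself as rank 2. That suggests srun -n2 started two tasks that each spawned two processes while world_size stayed at 2, although I cannot confirm this from the traceback alone. The sketch below shows one consistent way to derive both values, assuming one spawned process per GPU inside each Slurm task. The helper name is hypothetical; SLURM_NTASKS and SLURM_PROCID are environment variables set by srun.

```python
from typing import Mapping, Tuple


def slurm_rank_and_world_size(
    env: Mapping[str, str], local_rank: int, gpus_per_task: int
) -> Tuple[int, int]:
    """Derive (global rank, world size) for a process spawned inside a Slurm task."""
    ntasks = int(env["SLURM_NTASKS"])   # number of tasks started by srun
    procid = int(env["SLURM_PROCID"])   # index of this task, 0..ntasks-1
    world_size = ntasks * gpus_per_task
    rank = procid * gpus_per_task + local_rank
    # Every rank must fall inside [0, world_size) for the rendezvous to complete.
    assert 0 <= rank < world_size
    return rank, world_size


# Second of two tasks, each spawning one process per GPU on 2 GPUs.
env = {"SLURM_NTASKS": "2", "SLURM_PROCID": "1"}
for local_rank in range(2):
    print(slurm_rank_and_world_size(env, local_rank, gpus_per_task=2))
```

An alternative under the same assumption is to start a single task (srun -n1) and let mp.spawn create one process per GPU, so that the worker count matches world_size=2.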
Hi there,
Great repo!
I'm studying this topic, and found out that the official repo of imagenet classification also uses multiprocessing.
I noticed that they not only call torch.cuda.set_device(local_rank) (L144) but also pass the specific GPU id everywhere (their args.gpu refers to the local rank):
model.cuda(args.gpu) # L145
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) # L151
criterion = nn.CrossEntropyLoss().cuda(args.gpu) # L169
loc = 'cuda:{}'.format(args.gpu) # L183
checkpoint = torch.load(args.resume, map_location=loc)
if args.gpu is not None: # L282
    images = images.cuda(args.gpu, non_blocking=True)
    target = target.cuda(args.gpu, non_blocking=True)
This is a bit weird. I'm wondering if you have any idea about this phenomenon?
And the doc for torch.cuda.set_device says that:
"Usage of this function is discouraged in favor of device. In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable."
Also, I noticed that even when using mp, sometimes I cannot kill all the processes with Ctrl+D and have to kill them individually by PID. Have you ever run into this problem?
Thank you!
When I used multiprocessing distributed, I encountered an error:
Can't pickle <function main_worker at 0x7f1c444e9d30>: attribute lookup main_worker on main failed.
I get this error even without making any changes to multiprocessing_distributed.py.
Can you help me?
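I implemented distributed training following your approach and found that single-machine single-GPU and multi-machine multi-GPU runs take about the same time to complete the same number of epochs, hence my question.
I want to modify it so that it works across multiple machines, but I don't know how to do it.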
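One piece that has to change for several machines is the rank bookkeeping: each process needs a globally unique rank, and every node has to point at the same rendezvous address. The sketch below lays out the ranks using the same arithmetic as the spawn-based script earlier on this page (node rank times GPUs per node plus local rank). The address in `MASTER_URL` is a made-up placeholder, and `rank_layout` is a hypothetical helper.

```python
from typing import List, Tuple


def rank_layout(num_nodes: int, gpus_per_node: int) -> List[Tuple[int, int, int]]:
    """List (node_rank, local_rank, global_rank) for every process in the job."""
    layout = []
    for node_rank in range(num_nodes):
        for local_rank in range(gpus_per_node):
            global_rank = node_rank * gpus_per_node + local_rank
            layout.append((node_rank, local_rank, global_rank))
    return layout


MASTER_URL = "tcp://10.0.0.1:23456"  # hypothetical address of node 0
num_nodes, gpus_per_node = 2, 4
world_size = num_nodes * gpus_per_node

for node_rank, local_rank, rank in rank_layout(num_nodes, gpus_per_node):
    print(f"node {node_rank} gpu {local_rank}: rank={rank} "
          f"world_size={world_size} init_method={MASTER_URL}")
```

Each machine would then run the script with its own node rank, and every process would pass its global rank, the shared world size, and the shared address to dist.init_process_group.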
pytorch-distributed/multiprocessing_distributed.py
Lines 183 to 189 in e45f4e0
In multi-GPU training, is it only necessary to save the model parameters in the process where gpu=0, like this:
def train(model, start_epoch, end_epoch, tr_loader, optimizer, scheduler, loss_funcs, local_rank):
    for curr_epoch in range(start_epoch, end_epoch):
        train_epoch(curr_epoch, end_epoch, local_rank, loss_funcs, model, optimizer, scheduler, tr_loader)
        # adjust the learning rate once per epoch
        if not arg_config["sche_usebatch"]:
            scheduler.step()
        if local_rank == 0:
            # save after every epoch; the saved state corresponds to epoch curr_epoch + 1
            save_checkpoint(
                model=model,
                optimizer=optimizer,
                scheduler=scheduler,
                amp=amp if arg_config["use_amp"] else None,
                exp_name=exp_name,
                current_epoch=curr_epoch + 1,
                full_net_path=path_config["final_full_net"],
                state_net_path=path_config["final_state_net"],
            )  # save the parameters
and is it only necessary to follow these steps when loading the parameters?
import apex.parallel as apexparallel
import numpy as np
import torch
import torch.backends.cudnn as torchcudnn
...
if arg_config["multi_gpu"]:
    model = apexparallel.convert_syncbn_model(model)
if arg_config["use_amp"]:
    assert torchcudnn.enabled, "Amp requires cudnn backend to be enabled."
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
if arg_config["multi_gpu"]:
    model = apexparallel.DistributedDataParallel(model, delay_allreduce=True)
if arg_config["resume_mode"] == "train":
    # resume model to train the model
    start_epoch = resume_checkpoint(
        model=model,
        optimizer=optimizer,
        scheduler=scheduler,
        amp=amp if arg_config["use_amp"] else None,
        exp_name=exp_name,
        load_path=path_config["final_full_net"],
        mode="all",
    )
else:
    # only train a new model.
    start_epoch = 0
When I run distributed.py, GPU memory usage is unbalanced: the main GPU uses 10GB while the other 3 GPUs use 8GB each.
How can I fix this?
Hello author, why does the code compute gradients with loss.backward() rather than reduce_loss.backward()?
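As far as I understand, DistributedDataParallel-style wrappers average the gradients across processes during the backward pass, so calling backward on the local loss is enough, and a reduced loss is typically only used for logging. The plain-Python check below illustrates the underlying identity for a toy squared-error loss with equally sized shards. All names and data are made up for illustration, and no torch calls are involved.

```python
from typing import List, Tuple

Shard = List[Tuple[float, float]]  # (x, y) samples held by one rank


def local_grad(w: float, shard: Shard) -> float:
    """d/dw of mean((w*x - y)^2) over the samples of one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def averaged_grad(w: float, shards: List[Shard]) -> float:
    """Average of the per-rank gradients, as gradient averaging would give."""
    return sum(local_grad(w, s) for s in shards) / len(shards)


def grad_of_mean_loss(w: float, shards: List[Shard]) -> float:
    """Gradient of the loss averaged over all samples from all ranks."""
    samples = [pair for s in shards for pair in s]
    return local_grad(w, samples)


# Two hypothetical ranks with two samples each.
shards = [[(1.0, 2.0), (2.0, 3.0)], [(3.0, 5.0), (4.0, 9.0)]]
print(averaged_grad(0.5, shards), grad_of_mean_loss(0.5, shards))
```

The two printed values agree, which is why backpropagating each rank's own loss and averaging the gradients matches training on the averaged loss.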