
Comments (7)

tczhangzhi commented on August 23, 2024

I'm not sure if you understood; if not, you can directly use this code:

acc1, acc5 = accuracy(output, target, topk=(1, 5))
...
# sum the per-process top-1 accuracy over all processes
dist.all_reduce(acc1, op=dist.ReduceOp.SUM)
...
# divide by the number of processes (4 GPUs here) to get the average
top1.update(acc1[0] / 4, images.size(0))
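
For context, here is a minimal sketch of the same idea with the world size read from the process group instead of hardcoded. The average_accuracy name is purely illustrative (not from the repo), and it assumes the process group is already initialized and that acc1/acc5 are float CUDA tensors, as in the training script:

import torch.distributed as dist


def average_accuracy(acc1, acc5):
    # illustrative helper: all-reduce the per-process top-1 / top-5 accuracies
    # and divide by the number of processes, so the hardcoded "/ 4" above
    # becomes the actual world size
    world_size = dist.get_world_size()
    for acc in (acc1, acc5):
        dist.all_reduce(acc, op=dist.ReduceOp.SUM)
        acc /= world_size
    return acc1, acc5

Each process would call it right after accuracy(output, target, topk=(1, 5)) and then feed the averaged values to the meters.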

Btw, I don't think we really need to calculate the average accuracy during training; it's a waste of time.


tczhangzhi commented on August 23, 2024

Nope. If you really need it, you can use .share_memory() to share a Tensor's memory.
In general, most distributed libraries only help you handle the synchronization of data, parameters, and gradients.


yifanjiang19 commented on August 23, 2024

Could you give a specific example?
Thanks!


tczhangzhi commented on August 23, 2024

Hm, I'm afraid that's not right.
Here are two ways to communicate between torch.multiprocessing processes:

  1. If you don't care about strictly synchronizing the running results, you can use share_memory_ like this, which is faster:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def evaluate(rank):
    # produce a per-process "accuracy" (a random value here, just for demonstration)
    torch.cuda.manual_seed(rank)
    local_acc = torch.randn(1)[0].cuda(rank)

    print("local_acc:", local_acc)

    return local_acc

def main_worker(gpu, ngpus_per_node, args):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=gpu)

    local_acc = evaluate(gpu)

    # the shared tensors created in the parent process are visible in every worker
    global_acc, global_count = args['global_acc'], args['global_count']

    # in-place updates on shared memory; this read-modify-write is not atomic, so
    # the value each worker prints depends on how many workers have updated so far
    global_acc += local_acc.cpu()
    global_count += 1

    print("global_acc:", global_acc / global_count)

if __name__ == '__main__':
    global_acc = torch.tensor(.0)
    global_count = torch.tensor(.0)
    
    # move the accumulators into shared memory so all spawned processes can see them
    global_acc.share_memory_()
    global_count.share_memory_()

    args = {
        'global_acc': global_acc,
        'global_count': global_count
    }
    
    mp.spawn(main_worker, nprocs=4, args=(4, args))

  2. But if you really need to synchronize the accuracy, I suggest an implementation like this, or something else using all_reduce:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def evaluate(rank):
    # produce a per-process "accuracy" (a random value here, just for demonstration)
    torch.cuda.manual_seed(rank)
    local_acc = torch.randn(1)[0].cuda(rank)

    print("local_acc:", local_acc)

    return local_acc

def main_worker(gpu, ngpus_per_node, args):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=gpu)

    local_acc = evaluate(gpu)

    # sum the local accuracies over all processes, then divide by the GPU count to average
    dist.all_reduce(local_acc, op=dist.ReduceOp.SUM)
    global_acc = local_acc / ngpus_per_node

    print("global:", global_acc)

if __name__ == '__main__':
    args = {}
    mp.spawn(main_worker, nprocs=4, args=(4, args))


yifanjiang19 commented on August 23, 2024

Thanks!
Should the code synchronize the loss between the GPUs before loss.backward()? Or will the backward function synchronize automatically?


tczhangzhi commented on August 23, 2024

No. That's DistributedDataParallel's job.
Wrap your model with DistributedDataParallel and just call backward() as usual. During the backward pass, gradients from each process are averaged (this is what you mean by "synchronize the loss"), and the parameters stay synchronized automatically.
Check it here: https://github.com/pytorch/pytorch/blob/46539eee0363e25ce5eb408c85cefd808cd6f878/torch/nn/parallel/distributed.py#L378-L382
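
For illustration, a minimal self-contained sketch of that pattern, using the same spawn/init setup as the snippets above; the Linear model, SGD optimizer, and random batch are placeholders rather than anything from the repo:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main_worker(gpu, ngpus_per_node, args):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                            world_size=ngpus_per_node, rank=gpu)

    # a tiny placeholder model, moved to this GPU and wrapped once with DDP
    model = nn.Linear(10, 2).cuda(gpu)
    ddp_model = DDP(model, device_ids=[gpu])

    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # one dummy step: each process computes its own local loss on its own data
    images = torch.randn(8, 10).cuda(gpu)
    target = torch.randint(0, 2, (8,)).cuda(gpu)

    loss = criterion(ddp_model(images), target)

    optimizer.zero_grad()
    loss.backward()   # DDP's hooks all-reduce (average) the gradients across processes here
    optimizer.step()  # so every process applies identical updates; no manual loss sync needed


if __name__ == '__main__':
    mp.spawn(main_worker, nprocs=4, args=(4, {}))

The only difference from single-GPU code is the DDP wrapper; the loss itself stays local to each process.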


yifanjiang19 commented on August 23, 2024

Thanks!
