Comments (5)
Verified a fix. Working on a PR.
Hey Morizeyao, thanks for your interest.
> Does the code base support distributed training? If not, is it possible to support it after some code modifications?
We (the authors) haven't tried GPU training yet, but it should be possible. See the example here: https://github.com/tensorflow/mesh#example-network-mnist
@nshazeer in case he wants to chime in.
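For reference, that MNIST example boils down to something like the sketch below (paraphrased from the Mesh TF README; the toy one-layer network, the random placeholder input, and the 4-GPU device list are illustrative assumptions, not part of this repo):

```python
import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Build the model in Mesh TF's named-dimension world.
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")
batch_dim = mtf.Dimension("batch", 100)
io_dim = mtf.Dimension("io", 784)
hidden_dim = mtf.Dimension("hidden", 1024)
classes_dim = mtf.Dimension("classes", 10)

tf_images = tf.random.uniform([100, 784])  # stand-in for real data
images = mtf.import_tf_tensor(mesh, tf_images, shape=[batch_dim, io_dim])
w1 = mtf.get_variable(mesh, "w1", [io_dim, hidden_dim])
w2 = mtf.get_variable(mesh, "w2", [hidden_dim, classes_dim])
hidden = mtf.relu(mtf.einsum([images, w1], output_shape=[batch_dim, hidden_dim]))
logits = mtf.einsum([hidden, w2], output_shape=[batch_dim, classes_dim])

# Data parallelism: split the "batch" dimension across 4 GPUs.
devices = ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]
mesh_shape = [("all_processors", 4)]
layout_rules = [("batch", "all_processors")]
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)

# Lower the Mesh TF graph to ordinary TF ops placed on those devices.
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
tf_logits = lowering.export_to_tf_tensor(logits)
```

The key line is the layout rule mapping the "batch" dimension onto the processor mesh; that mapping is what spreads the batch across the listed devices.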
> By the way, how do I set the batch size and the number of GPUs if I want to train the model on GPU?
The batch size can be set according to how much memory your GPU can hold. The Mesh TF Transformer is in principle smart enough to chop your batch up into microbatches and accumulate gradients if the batch won't fit in memory. The GPU number will depend on which GPUs are available on the machine you're running on and which you want to use. Or do you mean how many GPUs you should use? It should work to use as many as you have available.
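To make that concrete, here is a sketch of the relevant gin bindings (the configurable names `utils.run.batch_size` and `serialize_num_microbatches.tokens_per_microbatch_per_replica` follow `mesh_tensorflow.transformer.utils`, and the numbers are placeholders; verify both against your installed version):

```python
import gin

gin.parse_config("""
# Total batch size, expressed in tokens rather than examples.
utils.run.batch_size = ('tokens_per_batch', 65536)

# If a replica can't hold that many tokens at once, Mesh TF splits each
# batch into microbatches of this size and accumulates gradients.
serialize_num_microbatches.tokens_per_microbatch_per_replica = 8192
""")
```

The same bindings can also be passed to the `t5_mesh_transformer` binary as `--gin_param` flags.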
Thank you for your quick response!
About distributed GPU training, I'll do some research following your link. :)
For the second question, I ran some tests on a machine with 2 GPUs, and the code used only one card.
As you said, the Mesh library should automatically handle splitting the batch, accumulating gradients, and assigning GPUs. I guess the problem may be due to my environment settings.
I will report my findings here soon.
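One sanity check worth trying (a sketch, under the assumption that your build exposes the `utils.run.mesh_shape` and `utils.run.mesh_devices` gin parameters; older versions may not):

```python
import gin
from tensorflow.python.client import device_lib

# If this prints only one GPU, the problem is environmental
# (drivers, CUDA_VISIBLE_DEVICES), not Mesh TF's layout.
print([d.name for d in device_lib.list_local_devices()
       if d.device_type == "GPU"])

# Otherwise, explicitly lay the batch dimension across both cards.
gin.parse_config("""
utils.run.mesh_shape = 'model:1,batch:2'
utils.run.mesh_devices = ['gpu:0', 'gpu:1']
""")
```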
Same question here. I have tried my best to get the code to run on GPUs, but the log shows that tpu_estimator.py says it is training on CPU, and the speed is around 300 times slower than on a TPU in GCP. I believe I have set everything properly. With log_device_placement enabled, the log shows that some operations are assigned to the CPU and some to the GPU. Do you have any idea what is going on?
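For anyone debugging the same thing, this is how device placement logging is turned on in TF1-style code (a generic standalone sketch, not specific to this repo):

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# With log_device_placement on, TF prints every op's assigned device at
# session creation, which shows exactly which ops fell back to the CPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))
```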
> And the speed is around 300 times slower than on a TPU in GCP.
How did you get to this number?
Have you tried changing your batch size?
As @craffel mentioned:
> The batch size can be set according to how much memory your GPU can hold.
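(For concreteness, a factor like "300 times slower" usually comes from the `global_step/sec` values TensorFlow logs during training; a back-of-the-envelope comparison, with made-up numbers, looks like this:)

```python
# Illustrative numbers only -- read the real values from the
# "INFO:tensorflow:global_step/sec: ..." lines in each run's logs.
tpu_steps_per_sec = 3.0
gpu_steps_per_sec = 0.01

# Steps are only comparable at equal tokens-per-batch; otherwise
# compare tokens/sec instead.
print(f"slowdown: ~{tpu_steps_per_sec / gpu_steps_per_sec:.0f}x")
```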
Related Issues (20)
- ValueError when evaluating a tuned model using the Mtf library
- Training T5-3B on a server with 8× A100 (40 GB) GPUs reports an OOM "resource is exhausted" error
- How should I speed up a T5 exported saved_model by using TF-TRT?
- model.finetune(...) does not show the loss of the model
- CUDA OOM with HF Model
- Predictions are inconsistent unless the model is reloaded for each prediction
- How to change teacher forcing to autoregressive decoding in the training stage?
- ERROR:root:Path not found: gs://t5-data/pretrained_models/large/operative_config.gin
- Fine-tuning T5 without a TPU
- About "seqio" in "hf_model.py"
- Question about the metric reported in the paper
- All attempts to get a Google authentication bearer token failed, returning an empty token.
- How to fine-tune T5 with a causal language modeling objective?
- cmd vs entrypoint YouTube video suggestion
- Question about cross-node (multi-node) data parallelism on GPU
- Dependencies in `setup.py` have module conflicts.
- How can I get the best checkpoint on SQuAD?
- Custom Model
- Columns and DataType not explicitly set on line 163 of eval_utils_test.py
- Clarification on T5 Model Pre-training Objective and Denoising Process