
Comments (5)

adarob commented on September 26, 2024

Verified a fix. Working on a PR.

from text-to-text-transfer-transformer.

craffel commented on September 26, 2024

Hey Morizeyao, thanks for your interest.

Does the code base support distributed training? If not, is it possible to support it after some code modifications?

We (the authors) haven't tried GPU training yet, but it should be possible. See the example here: https://github.com/tensorflow/mesh#example-network-mnist
@nshazeer in case he wants to chime in.
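A rough sketch of how that MNIST example maps a mesh onto GPUs, adapted from the Mesh TensorFlow README. The dimension names and the two-GPU device list are illustrative assumptions, not this repository's actual configuration:

```python
import mesh_tensorflow as mtf

# Illustrative only: declare a 1-D mesh over 2 GPUs and split the
# "batch" dimension across it. Names follow the Mesh TensorFlow
# README example, not this codebase.
devices = ["gpu:0", "gpu:1"]
mesh_shape = [("all_processors", 2)]
layout_rules = [("batch", "all_processors")]
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)
```

The layout rules tell Mesh TF which tensor dimensions to split across which mesh dimensions; splitting "batch" gives plain data parallelism.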

By the way, what is the way to set batch size and gpu number if I want to use GPU to train the model?

The batch size can be set according to how much memory your GPU can hold. The Mesh TF Transformer is in principle smart enough to chop your batch up into microbatches and accumulate gradients if the batch won't fit in memory. The GPU number will depend on which GPUs are available on the machine you're running on and which you want to use. Or do you mean how many GPUs you should use? It should work to use as many as you have available.
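The microbatching behavior described above relies on a standard identity: for a loss that sums over examples, gradients computed per microbatch add up to the full-batch gradient. A minimal NumPy sketch (not the Mesh TF implementation) of that identity:

```python
import numpy as np

def grad(w, x, y):
    # Gradient of the summed squared error sum((x @ w - y)**2) w.r.t. w.
    return 2.0 * x.T @ (x @ w - y)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))   # full batch of 8 examples
y = rng.normal(size=8)
w = rng.normal(size=3)

full = grad(w, x, y)

# Accumulate gradients over microbatches of 2 examples each.
acc = np.zeros_like(w)
for i in range(0, 8, 2):
    acc += grad(w, x[i:i + 2], y[i:i + 2])

print(np.allclose(full, acc))  # True: accumulation matches the full batch
```

This is why a batch that doesn't fit in GPU memory can be chopped up without changing the update, only the throughput.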


Morizeyao commented on September 26, 2024

Thank you for your quick response!

About distributed GPU training, I'll do some research following your link. :)

For the second question, I ran a quick test on a machine with 2 GPUs, and the code used only one card.
As you said, the Mesh library automatically handles the batch slicing, gradient accumulation, and GPU assignment, so I suspect the problem may be due to my environment settings.
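One environment setting worth checking when only one of two GPUs is used is `CUDA_VISIBLE_DEVICES`. This is a standard CUDA/TensorFlow convention, not anything specific to this repository, and the device IDs below are examples:

```shell
# Make both GPUs visible to the process before launching training.
# If this variable is set to a single ID, TensorFlow will only see
# that one device regardless of what the code requests.
export CUDA_VISIBLE_DEVICES=0,1
echo "$CUDA_VISIBLE_DEVICES"
```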

I will report my findings here soon.


eelxpeng commented on September 26, 2024

Same question here. I have tried my best to get the code running on GPUs, but the log from tpu_estimator.py says it is training on CPU, and the speed is around 300 times slower than on a TPU in GCP. I believe I have set everything properly. With log_device_placement enabled, some operations are assigned to the CPU and some to the GPU. Do you have any idea what is going on?


danyaljj commented on September 26, 2024

And the speed is around 300 times slower than on tpu in GCP.

How did you arrive at that number?

Have you tried changing your batch size?
As @craffel mentioned:

The batch size can be set according to how much memory your GPU can hold.

