
Comments (3)

Venkat6871 commented on April 27, 2024

Hi @akrupien ,

Here is an example of how you can structure your code to use MultiWorkerMirroredStrategy along with saving checkpoints and using callbacks. This example assumes you have a working model training pipeline and focuses on the TF_CONFIG setup, strategy setup, and checkpoint saving. Please find the gist for reference.
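The gist itself is not reproduced in this thread. For context, here is a minimal sketch of the structure described above; the worker address, port, checkpoint paths, and model layers are all placeholders, not the actual code from the gist:

```python
import json
import os

import tensorflow as tf

# TF_CONFIG must be set before the strategy is constructed. The address,
# port, and task index below are placeholders: on a real two-machine
# cluster the "worker" list has one entry per machine, and each machine
# sets its own "index".
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["localhost:12345"]},  # e.g. add a "host2:12345" entry for a second worker
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build and compile inside the scope so the variables are mirrored
    # across workers. The layers here are purely illustrative.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

callbacks = [
    # Checkpoint the best model by validation loss (path is illustrative).
    tf.keras.callbacks.ModelCheckpoint(
        "/tmp/ckpt/model1.h5", monitor="val_loss", save_best_only=True
    ),
    # Fault tolerance: resumes from the last backed-up epoch after a restart.
    tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup"),
]

# model.fit(train_ds, validation_data=val_ds, epochs=400, callbacks=callbacks)
```

The actual `fit` call is left commented out because it needs your dataset; everything above it runs standalone with a single local worker.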

Thank you!


warmbasket commented on April 27, 2024

Hi @Venkat6871,

Thank you very much for your response and your example! I have changed my code to match your structure, so I only build and compile my model within strategy.scope(). Synchronous training between my machines is still working, which is great. However, I am still having the issue with my dice coefficient metrics. I'll attach a snippet of the training output here so you have an example.

Epoch 1/400
28/28 [==============================] - ETA: 0s - loss: 3.3312 - dice_coef: 0.0686
Epoch 1: val_loss improved from inf to 1.63522, saving model to /home/path/model1.h5
28/28 [==============================] - 184s 4s/step - loss: 1.6656 - dice_coef: 0.0343 - val_loss: 1.6352 - val_dice_coef: 0.0303

Epoch 2/400
28/28 [==============================] - ETA: 0s - loss: 3.1451 - dice_coef: 0.1314
Epoch 2: val_loss improved from 1.63522 to 1.57489, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 833ms/step - loss: 1.5726 - dice_coef: 0.0657 - val_loss: 1.5749 - val_dice_coef: 0.0431

Epoch 3/400
28/28 [==============================] - ETA: 0s - loss: 3.0354 - dice_coef: 0.1781
Epoch 3: val_loss improved from 1.57489 to 1.53716, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 828ms/step - loss: 1.5177 - dice_coef: 0.0890 - val_loss: 1.5372 - val_dice_coef: 0.0577

Epoch 4/400
28/28 [==============================] - ETA: 0s - loss: 2.9451 - dice_coef: 0.2227
Epoch 4: val_loss improved from 1.53716 to 1.51450, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 831ms/step - loss: 1.4726 - dice_coef: 0.1114 - val_loss: 1.5145 - val_dice_coef: 0.0577

Epoch 5/400
28/28 [==============================] - ETA: 0s - loss: 2.8993 - dice_coef: 0.2236
Epoch 5: val_loss improved from 1.51450 to 1.49228, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 829ms/step - loss: 1.4496 - dice_coef: 0.1118 - val_loss: 1.4923 - val_dice_coef: 0.0577

Epoch 6/400
28/28 [==============================] - ETA: 0s - loss: 2.8554 - dice_coef: 0.2237
Epoch 6: val_loss improved from 1.49228 to 1.47076, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 834ms/step - loss: 1.4277 - dice_coef: 0.1119 - val_loss: 1.4708 - val_dice_coef: 0.0577

You'll notice my dice coefficients end up divided by 2; I believe this is because I am using two machines.

It appears to be summing the dice scores and losses from each machine and showing those summed values throughout the steps of the epoch; then, when it saves, it averages the dice scores and losses between the two machines. (I believe it is summing because of the loss values: my typical loss on a single machine after the first epoch is ~1.6, so a loss of 3.3 only seems achievable by summing the losses from each machine.) If I turn off checkpoint saving, it seems to average the dice scores and losses on the last step of the epoch.

I would appreciate some clarification on whether this is what is happening. TensorFlow describes NCCL/ring all-reduce as summing the variables between machines, but the docs do not say explicitly whether the values get averaged back out afterwards; I'd expect them to. I am also confused as to why the summed values are shown during training rather than the averaged values: it looks as though training proceeds through the epoch on an accumulated dice and loss rather than on the average between the machines. Shouldn't training use the average across workers rather than the sum? Otherwise the loss is artificially high.
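On the sum-vs-average question: for custom losses, TensorFlow's documented approach is to scale per-example losses by the *global* batch size (`tf.nn.compute_average_loss`), so that when the per-replica results are summed by all-reduce, the total equals the true mean over all examples. A toy illustration of the arithmetic, with made-up loss values and no cluster needed (a custom dice loss that takes a *local* mean instead would show exactly the doubling described above):

```python
import tensorflow as tf

# Hypothetical per-example losses on each of two workers, each with a
# local batch of 2 (global batch size 4). Values are made up.
worker_a = tf.constant([1.6, 1.8])
worker_b = tf.constant([1.5, 1.7])

# If each worker reduces with a local mean and the two results are
# summed by all-reduce, the reported loss is ~2x the true mean:
local_mean_sum = tf.reduce_mean(worker_a) + tf.reduce_mean(worker_b)  # 1.7 + 1.6 = 3.3

# Dividing by the global batch size instead makes the all-reduced sum
# equal to the mean over all 4 examples:
scaled_a = tf.nn.compute_average_loss(worker_a, global_batch_size=4)  # 3.4 / 4 = 0.85
scaled_b = tf.nn.compute_average_loss(worker_b, global_batch_size=4)  # 3.2 / 4 = 0.80
global_mean = scaled_a + scaled_b                                     # 1.65
```

Note the resemblance to the log above (3.3 during the epoch, ~1.65 afterwards), though the values here were chosen for illustration, not taken from the actual run.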

Thank you again,
@akrupien


warmbasket commented on April 27, 2024

Any ideas?

