
Comments (3)

Venkat6871 commented on April 27, 2024

Hi @akrupien ,

Here is an example of how you can structure your code to use MultiWorkerMirroredStrategy along with saving checkpoints and using callbacks. This example assumes you have a working model training pipeline and focuses on the TF_CONFIG setup, strategy setup, and checkpoint saving. Please find the gist for reference.
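The gist itself is not reproduced in this thread. For context, here is a minimal sketch of the structure described above; the worker address, port, checkpoint paths, and model layers are all placeholders, not the actual code from the gist:

```python
import json
import os

import tensorflow as tf

# TF_CONFIG must be set before the strategy is constructed. The address,
# port, and task index below are placeholders: on a real two-machine
# cluster the "worker" list has one entry per machine, and each machine
# sets its own "index".
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["localhost:12345"]},  # e.g. add a "host2:12345" entry for a second worker
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build and compile inside the scope so the variables are mirrored
    # across workers. The layers here are purely illustrative.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

callbacks = [
    # Checkpoint the best model by validation loss (path is illustrative).
    tf.keras.callbacks.ModelCheckpoint(
        "/tmp/ckpt/model1.h5", monitor="val_loss", save_best_only=True
    ),
    # Fault tolerance: resumes from the last backed-up epoch after a restart.
    tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup"),
]

# model.fit(train_ds, validation_data=val_ds, epochs=400, callbacks=callbacks)
```

The actual `fit` call is left commented out because it needs your dataset; everything above it runs standalone with a single local worker.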

Thank you!


warmbasket commented on April 27, 2024

Hi @Venkat6871,

Thank you very much for your response and your example! I have changed my code to match your structure, so I only build and compile my model within strategy.scope(). Synchronous training between my machines is still working, which is great. However, I am still having the issue with my dice coefficient metrics. I'll attach a snippet of the training output here so you have an example.

Epoch 1/400
28/28 [==============================] - ETA: 0s - loss: 3.3312 - dice_coef: 0.0686
Epoch 1: val_loss improved from inf to 1.63522, saving model to /home/path/model1.h5
28/28 [==============================] - 184s 4s/step - loss: 1.6656 - dice_coef: 0.0343 - val_loss: 1.6352 - val_dice_coef: 0.0303

Epoch 2/400
28/28 [==============================] - ETA: 0s - loss: 3.1451 - dice_coef: 0.1314
Epoch 2: val_loss improved from 1.63522 to 1.57489, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 833ms/step - loss: 1.5726 - dice_coef: 0.0657 - val_loss: 1.5749 - val_dice_coef: 0.0431

Epoch 3/400
28/28 [==============================] - ETA: 0s - loss: 3.0354 - dice_coef: 0.1781
Epoch 3: val_loss improved from 1.57489 to 1.53716, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 828ms/step - loss: 1.5177 - dice_coef: 0.0890 - val_loss: 1.5372 - val_dice_coef: 0.0577

Epoch 4/400
28/28 [==============================] - ETA: 0s - loss: 2.9451 - dice_coef: 0.2227
Epoch 4: val_loss improved from 1.53716 to 1.51450, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 831ms/step - loss: 1.4726 - dice_coef: 0.1114 - val_loss: 1.5145 - val_dice_coef: 0.0577

Epoch 5/400
28/28 [==============================] - ETA: 0s - loss: 2.8993 - dice_coef: 0.2236
Epoch 5: val_loss improved from 1.51450 to 1.49228, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 829ms/step - loss: 1.4496 - dice_coef: 0.1118 - val_loss: 1.4923 - val_dice_coef: 0.0577

Epoch 6/400
28/28 [==============================] - ETA: 0s - loss: 2.8554 - dice_coef: 0.2237
Epoch 6: val_loss improved from 1.49228 to 1.47076, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 834ms/step - loss: 1.4277 - dice_coef: 0.1119 - val_loss: 1.4708 - val_dice_coef: 0.0577

You'll notice my dice coefficients end up divided by 2; I believe this is because I am using two machines.

It appears to be summing the dice scores and losses from each machine and showing those summed values throughout the steps of the epoch; then, when it saves, it averages the dice scores and losses between the two machines. (I believe it is summing because of the loss values: my typical loss on a single machine after the first epoch is ~1.6, so a loss of 3.3 only seems achievable by summing the losses from each machine.) If I turn off checkpoint saving, it seems to average the dice scores and losses on the last step of the epoch.

I would appreciate some clarification on whether this is what is happening. TensorFlow describes NCCL/ring all-reduce as summing the variables between machines, but the docs do not say explicitly whether the values get averaged back out afterwards; I'd expect them to. I am also confused as to why the summed values are shown during training rather than the averaged values: it looks as though training proceeds through the epoch on an accumulated dice and loss rather than on the average between the machines. Shouldn't training use the average across workers rather than the sum? Otherwise the loss is artificially high.
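On the sum-vs-average question: for custom losses, TensorFlow's documented approach is to scale per-example losses by the *global* batch size (`tf.nn.compute_average_loss`), so that when the per-replica results are summed by all-reduce, the total equals the true mean over all examples. A toy illustration of the arithmetic, with made-up loss values and no cluster needed (a custom dice loss that takes a *local* mean instead would show exactly the doubling described above):

```python
import tensorflow as tf

# Hypothetical per-example losses on each of two workers, each with a
# local batch of 2 (global batch size 4). Values are made up.
worker_a = tf.constant([1.6, 1.8])
worker_b = tf.constant([1.5, 1.7])

# If each worker reduces with a local mean and the two results are
# summed by all-reduce, the reported loss is ~2x the true mean:
local_mean_sum = tf.reduce_mean(worker_a) + tf.reduce_mean(worker_b)  # 1.7 + 1.6 = 3.3

# Dividing by the global batch size instead makes the all-reduced sum
# equal to the mean over all 4 examples:
scaled_a = tf.nn.compute_average_loss(worker_a, global_batch_size=4)  # 3.4 / 4 = 0.85
scaled_b = tf.nn.compute_average_loss(worker_b, global_batch_size=4)  # 3.2 / 4 = 0.80
global_mean = scaled_a + scaled_b                                     # 1.65
```

Note the resemblance to the log above (3.3 during the epoch, ~1.65 afterwards), though the values here were chosen for illustration, not taken from the actual run.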

Thank you again,
@akrupien


warmbasket commented on April 27, 2024

Any ideas?

