
Comments (12)

hamelsmu commented on August 22, 2024

FYI, you folks got some good publicity out of this; take a look at the 4-minute mark: https://www.youtube.com/watch?v=NFVKDa5W35I&feature=youtu.be

Congrats, this is one of the most popular MOOCs for deep learning!


jlewi commented on August 22, 2024

I pinged some folks internally.

From @mrry:

I think the problem might be in how Keras' Embedding layer and TensorFlow's _GatherGrad() interact:

  1. The embedding variable will be placed on "/job:ps/task:0", by the tf.train.replica_device_setter().
  2. Keras' Embedding layer uses tf.gather(self.embeddings, indices), which will be placed on "/job:worker/task:0" (and hence copy the embeddings matrix to the worker at each step).
  3. Keras always tries to colocate the gradient calculations with their respective ops, creating a colocation constraint between all the ops in _GatherGrad() and the tf.gather() op on "/job:worker/task:0".
  4. _GatherGrad() always tries to colocate its tf.shape(params) (== tf.shape(self.embeddings)) op with the (large) embedding matrix, creating a colocation constraint between the tf.shape() op and the variable on "/job:ps/task:0".

The two colocation constraints are in conflict, leading to the error.

I can think of a couple of fixes, but unfortunately they're in library code:
a) Keras could potentially use tf.nn.embedding_lookup() instead of tf.gather() in its Embedding layer. The tf.nn.embedding_lookup() function is device aware, so it will place the gather on the same device as the embedding variable(s), and the colocation constraints will be satisfied.
b) _GatherGrad() could set ignore_existing=True when colocating the tf.shape(params) with params.

A couple of potential workarounds (untested):
a) Use tf.nn.embedding_lookup() directly in your program instead of the Keras Embedding layer (a sketch of this follows the snippet for (b) below).
b) Try the following when creating and using an embedding:

dec_emb_layer = Embedding(num_decoder_tokens, latent_dim, name='Decoder-Word-Embedding', mask_zero=False)
with tf.device(None):  # Strip off any device annotations.
  dec_emb = dec_emb_layer(decoder_inputs)
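
Going back to workaround (a): here is a minimal, untested sketch of what using tf.nn.embedding_lookup() directly could look like. The tf.get_variable() initialization and the Lambda wrapper are assumptions for illustration, not code from the original notebook; num_decoder_tokens, latent_dim, and decoder_inputs are the same names used above.

import tensorflow as tf
from keras.layers import Lambda

# Plain TF variable for the decoder embeddings; under
# tf.train.replica_device_setter() it will be placed on the parameter server.
dec_embeddings = tf.get_variable(
    'decoder_word_embeddings',
    shape=[num_decoder_tokens, latent_dim],
    initializer=tf.random_uniform_initializer(-0.05, 0.05))

# tf.nn.embedding_lookup() is device aware, so the gather runs on the same
# device as the variable and the colocation constraints stay consistent.
dec_emb = Lambda(
    lambda ids: tf.nn.embedding_lookup(dec_embeddings, tf.cast(ids, 'int32')),
    name='Decoder-Word-Embedding-Lookup')(decoder_inputs)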

Another option might be to switch to Tensor2Tensor; here's a blog post using CMLE.

To do summarization, use --problems=summarize_cnn_dailymail32k, --model=transformer, and --hparams_set=transformer_prepend.

/cc @lukaszkaiser


jlewi commented on August 22, 2024

Reopening this issue because we aren't doing distributed training.

The original blog post only trained on part of the dataset (2 million out of 5 million issues). It would be great if we could train on all 5 million issues using distributed training.


ankushagarwal commented on August 22, 2024

/cc @hamelsmu

Hamel - Have you tried distributed training using Keras?


hamelsmu commented on August 22, 2024

I have. The way I tried to accomplish this is to put different parts of the computation graph on different GPUs, for example putting the encoder on GPU 1 and the decoder on GPU 2.

However, I did not see a material training speedup from this, even though I could increase the batch size significantly. I didn't spend much time profiling where the bottleneck was, though.
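
For reference, that kind of manual placement looks roughly like the following. This is only a sketch: the layer types, sizes, and names are illustrative rather than the actual notebook code, and num_encoder_tokens, num_decoder_tokens, and latent_dim are assumed to be defined elsewhere.

import tensorflow as tf
from keras.layers import Input, Embedding, GRU

encoder_inputs = Input(shape=(None,), name='Encoder-Input')
decoder_inputs = Input(shape=(None,), name='Decoder-Input')

with tf.device('/gpu:0'):  # encoder on the first GPU
    x = Embedding(num_encoder_tokens, latent_dim, name='Encoder-Word-Embedding')(encoder_inputs)
    _, encoder_state = GRU(latent_dim, return_state=True, name='Encoder-GRU')(x)

with tf.device('/gpu:1'):  # decoder on the second GPU
    y = Embedding(num_decoder_tokens, latent_dim, name='Decoder-Word-Embedding')(decoder_inputs)
    decoder_out, _ = GRU(latent_dim, return_sequences=True, return_state=True,
                         name='Decoder-GRU')(y, initial_state=encoder_state)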


hamelsmu commented on August 22, 2024

Edit: As you can gather from the above, I have only tried splitting the model across multiple GPUs on a single node. I have not attempted to distribute training across multiple nodes.


jlewi commented on August 22, 2024

Thanks @hamelsmu. I'm interested in distributed training across multiple nodes to highlight TFJob's capabilities.

This blog post makes it seem like it's as simple as setting the session for Keras.
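
For context, the pattern that post describes boils down to pointing the Keras backend at a session connected to the cluster, roughly as below. This is a sketch; the cluster addresses are placeholders (in a TFJob they would come from TF_CONFIG).

import tensorflow as tf
from keras import backend as K

cluster = tf.train.ClusterSpec({'ps': ['ps-0:2222'], 'worker': ['worker-0:2222']})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Point the Keras backend at a session connected to the cluster, then build
# and fit the model as usual.
K.set_session(tf.Session(target=server.target))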


ankushagarwal commented on August 22, 2024

The distributed approach I'm thinking of for training this model is data parallelism, where we have a copy of the graph on N worker nodes, send batches of training data to each worker, and combine the losses.

The current code uses the built-in fit() method to train the model. We will have to do the training loop manually instead of relying on fit() (see the sketch after the snippet below).

seq2seq_Model.fit([encoder_input_data, decoder_input_data], np.expand_dims(decoder_target_data, -1),
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.12, callbacks=[csv_logger, model_checkpoint])
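
A rough sketch of the manual loop that could replace fit() is below. build_seq2seq_model(), next_batch(), num_steps, and batch_size are hypothetical stand-ins for the notebook's model-building code and a sharded input pipeline; the optimizer choice is also just illustrative.

import tensorflow as tf
from keras import backend as K

seq2seq_Model = build_seq2seq_model()  # hypothetical: same layers as the notebook
targets = tf.placeholder(tf.float32, shape=(None, None, 1), name='targets')
loss = tf.reduce_mean(
    K.sparse_categorical_crossentropy(targets, seq2seq_Model.output))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

sess = K.get_session()
sess.run(tf.global_variables_initializer())
for step in range(num_steps):
    # Each worker would pull its own shard of (encoder, decoder, target) batches.
    enc_batch, dec_batch, target_batch = next_batch(batch_size)
    _, batch_loss = sess.run(
        [train_op, loss],
        feed_dict={seq2seq_Model.inputs[0]: enc_batch,
                   seq2seq_Model.inputs[1]: dec_batch,
                   targets: target_batch})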


jlewi commented on August 22, 2024

@hamelsmu Likewise!


ankushagarwal commented on August 22, 2024

I am running into an issue while doing distributed training. I am using a single parameter server and a single worker.

Since this model has hundreds of variables, I am using tf.train.replica_device_setter to assign the graph's variables to the parameter server and its ops to the worker. But when I run a training iteration, the following error is thrown:

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'Decoder-Word-Embedding/embeddings' and 'training/Nadam/gradients/Decoder-Word-Embedding/Gather_grad/Shape': Cannot merge devices with incompatible jobs: '/job:worker/task:0' and '/job:ps/task:0'
   [[Node: Decoder-Word-Embedding/embeddings = VariableV2[container="", dtype=DT_FLOAT, shape=[4278,300], shared_name="", _device="/job:ps/task:0"]()]]

When I set the parameter server and the worker to the same server, everything works fine.

My suspicion is that tf.train.replica_device_setter places some variables/ops in an incompatible way.
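
For reference, the placement pattern in question is roughly the following sketch; the addresses are placeholders and build_seq2seq_model() stands in for the actual model code, which is in the gist linked in the next comment.

import tensorflow as tf

cluster = tf.train.ClusterSpec({'ps': ['ps-0:2222'], 'worker': ['worker-0:2222']})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Variables go to /job:ps/task:0 and ops stay on the worker. The Embedding
# layer's tf.gather() therefore lands on the worker, while its gradient tries
# to colocate tf.shape(embeddings) with the variable on the PS, producing the
# conflicting constraints shown above.
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    seq2seq_Model = build_seq2seq_model()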

This has been reported by some other users working on distributed training of seq2seq models here:
tensorflow/tensorflow#3198


ankushagarwal commented on August 22, 2024

Here is the code that I used: https://gist.github.com/ankushagarwal/a2dab3aaf664a4296292c4ff330fb5e6


lazybonesboy commented on August 22, 2024

I can think of a couple of fixes, but unfortunately they're in library code:
a) Keras could potentially use tf.nn.embedding_lookup() instead of tf.gather() in its Embedding layer. The tf.nn.embedding_lookup() function is device aware, so it will place the gather on the same device

I used this approach, and it works. Thank you.

