
Comments (12)

hamelsmu commented on August 22, 2024

FYI, you folks got some good publicity out of this; take a look at the 4-minute mark: https://www.youtube.com/watch?v=NFVKDa5W35I&feature=youtu.be

Congrats, this is one of the most popular MOOCs for deep learning!


jlewi commented on August 22, 2024

I pinged some folks internally.

From @mrry:

I think the problem might be in how Keras' Embedding layer and TensorFlow's _GatherGrad() interact:

  1. The embedding variable will be placed on "/job:ps/task:0", by the tf.train.replica_device_setter().
  2. Keras' Embedding layer uses tf.gather(self.embeddings, indices), which will be placed on "/job:worker/task:0" (and hence copy the embeddings matrix to the worker at each step).
  3. Keras always tries to colocate the gradient calculations with their respective ops, creating a colocation constraint between all the ops in _GatherGrad() and the tf.gather() op on "/job:worker/task:0".
  4. _GatherGrad() always tries to colocate its tf.shape(params) (== tf.shape(self.embeddings)) op with the (large) embedding matrix, creating a colocation constraint between the tf.shape() op and the variable on "/job:ps/task:0".

The two colocation constraints are in conflict, leading to the error.

I can think of a couple of fixes, but unfortunately they're in library code:
a) Keras could potentially use tf.nn.embedding_lookup() instead of tf.gather() in its Embedding layer. The tf.nn.embedding_lookup() function is device aware, so it will place the gather on the same device as the embedding variable(s), and the colocation constraints will be satisfied.
b) _GatherGrad() could set ignore_existing=True when colocating the tf.shape(params) with params.

A couple of potential workarounds (untested):
a) Use tf.nn.embedding_lookup() directly in your program instead of the Keras Embedding layer (a sketch of this follows the snippet for (b) below).
b) Try the following when creating and using an embedding:

dec_emb_layer = Embedding(num_decoder_tokens, latent_dim, name='Decoder-Word-Embedding', mask_zero=False)
with tf.device(None):  # Strip off any device annotations.
  dec_emb = dec_emb_layer(decoder_inputs)
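
Going back to workaround (a): here is a minimal, untested sketch of what using tf.nn.embedding_lookup() directly could look like. The tf.get_variable() initialization and the Lambda wrapper are assumptions for illustration, not code from the original notebook; num_decoder_tokens, latent_dim, and decoder_inputs are the same names used above.

import tensorflow as tf
from keras.layers import Lambda

# Plain TF variable for the decoder embeddings; under
# tf.train.replica_device_setter() it will be placed on the parameter server.
dec_embeddings = tf.get_variable(
    'decoder_word_embeddings',
    shape=[num_decoder_tokens, latent_dim],
    initializer=tf.random_uniform_initializer(-0.05, 0.05))

# tf.nn.embedding_lookup() is device aware, so the gather runs on the same
# device as the variable and the colocation constraints stay consistent.
dec_emb = Lambda(
    lambda ids: tf.nn.embedding_lookup(dec_embeddings, tf.cast(ids, 'int32')),
    name='Decoder-Word-Embedding-Lookup')(decoder_inputs)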

Another option might be to switch to Tensor2Tensor; here's a blog post using CMLE.

To do summarization, use --problems=summarize_cnn_dailymail32k, --model=transformer, and --hparams_set=transformer_prepend.

/cc @lukaszkaiser


jlewi commented on August 22, 2024

Reopening this issue because we aren't doing distributed training.

The original blog post only trained on part of the dataset (2 million out of 5 million issues). It would be great if we could train on all 5 million issues using distributed training.


ankushagarwal commented on August 22, 2024

/cc @hamelsmu

Hamel - Have you tried distributed training using Keras?


hamelsmu commented on August 22, 2024

I have. The way I tried to accomplish this is to put different parts of the computation graph on different GPUs, for example putting the encoder on GPU 1 and the decoder on GPU 2.

However, I did not see a material training speedup from this, even though I could increase the batch size significantly. I didn't spend much time profiling where the bottleneck was, though.
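
For reference, that kind of manual placement looks roughly like the following. This is only a sketch: the layer types, sizes, and names are illustrative rather than the actual notebook code, and num_encoder_tokens, num_decoder_tokens, and latent_dim are assumed to be defined elsewhere.

import tensorflow as tf
from keras.layers import Input, Embedding, GRU

encoder_inputs = Input(shape=(None,), name='Encoder-Input')
decoder_inputs = Input(shape=(None,), name='Decoder-Input')

with tf.device('/gpu:0'):  # encoder on the first GPU
    x = Embedding(num_encoder_tokens, latent_dim, name='Encoder-Word-Embedding')(encoder_inputs)
    _, encoder_state = GRU(latent_dim, return_state=True, name='Encoder-GRU')(x)

with tf.device('/gpu:1'):  # decoder on the second GPU
    y = Embedding(num_decoder_tokens, latent_dim, name='Decoder-Word-Embedding')(decoder_inputs)
    decoder_out, _ = GRU(latent_dim, return_sequences=True, return_state=True,
                         name='Decoder-GRU')(y, initial_state=encoder_state)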


hamelsmu commented on August 22, 2024

Edit: As you can gather from the above, I have only tried splitting the model across multiple GPUs on a single node. I have not attempted to distribute training across multiple nodes.


jlewi commented on August 22, 2024

Thanks @hamelsmu. I'm interested in distributed training across multiple nodes to highlight TFJob's capabilities.

This blog post makes it seem like it's as simple as setting the session for Keras.
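
For context, the pattern that post describes boils down to pointing the Keras backend at a session connected to the cluster, roughly as below. This is a sketch; the cluster addresses are placeholders (in a TFJob they would come from TF_CONFIG).

import tensorflow as tf
from keras import backend as K

cluster = tf.train.ClusterSpec({'ps': ['ps-0:2222'], 'worker': ['worker-0:2222']})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Point the Keras backend at a session connected to the cluster, then build
# and fit the model as usual.
K.set_session(tf.Session(target=server.target))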


ankushagarwal commented on August 22, 2024

The distributed approach I'm thinking of for training this model is data parallelism, where we have a copy of the graph on N worker nodes, send batches of training data to each worker, and combine the losses.

The current code uses the built-in fit() method to train the model. We will have to do the training loop manually instead of relying on fit() (see the sketch after the snippet below).

seq2seq_Model.fit([encoder_input_data, decoder_input_data], np.expand_dims(decoder_target_data, -1),
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.12, callbacks=[csv_logger, model_checkpoint])
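
A rough sketch of the manual loop that could replace fit() is below. build_seq2seq_model(), next_batch(), num_steps, and batch_size are hypothetical stand-ins for the notebook's model-building code and a sharded input pipeline; the optimizer choice is also just illustrative.

import tensorflow as tf
from keras import backend as K

seq2seq_Model = build_seq2seq_model()  # hypothetical: same layers as the notebook
targets = tf.placeholder(tf.float32, shape=(None, None, 1), name='targets')
loss = tf.reduce_mean(
    K.sparse_categorical_crossentropy(targets, seq2seq_Model.output))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

sess = K.get_session()
sess.run(tf.global_variables_initializer())
for step in range(num_steps):
    # Each worker would pull its own shard of (encoder, decoder, target) batches.
    enc_batch, dec_batch, target_batch = next_batch(batch_size)
    _, batch_loss = sess.run(
        [train_op, loss],
        feed_dict={seq2seq_Model.inputs[0]: enc_batch,
                   seq2seq_Model.inputs[1]: dec_batch,
                   targets: target_batch})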


jlewi commented on August 22, 2024

@hamelsmu Likewise!


ankushagarwal commented on August 22, 2024

I am running into an issue while doing distributed training. I am using a single parameter server and a single worker.

Since this model has hundreds of variables, I am using tf.train.replica_device_setter to assign the graph's variables to the parameter server and its ops to the worker. But when I run a training iteration, the following error is thrown:

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'Decoder-Word-Embedding/embeddings' and 'training/Nadam/gradients/Decoder-Word-Embedding/Gather_grad/Shape': Cannot merge devices with incompatible jobs: '/job:worker/task:0' and '/job:ps/task:0'
   [[Node: Decoder-Word-Embedding/embeddings = VariableV2[container="", dtype=DT_FLOAT, shape=[4278,300], shared_name="", _device="/job:ps/task:0"]()]]

When I set the parameter server and the worker to the same server, everything works fine.

My suspicion is that tf.train.replica_device_setter places some variables/ops in an incompatible way.
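
For reference, the placement pattern in question is roughly the following sketch; the addresses are placeholders and build_seq2seq_model() stands in for the actual model code, which is in the gist linked in the next comment.

import tensorflow as tf

cluster = tf.train.ClusterSpec({'ps': ['ps-0:2222'], 'worker': ['worker-0:2222']})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Variables go to /job:ps/task:0 and ops stay on the worker. The Embedding
# layer's tf.gather() therefore lands on the worker, while its gradient tries
# to colocate tf.shape(embeddings) with the variable on the PS, producing the
# conflicting constraints shown above.
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    seq2seq_Model = build_seq2seq_model()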

This has been reported by some other users working on distributed training of seq2seq models here:
tensorflow/tensorflow#3198


ankushagarwal commented on August 22, 2024

Here is the code that I used: https://gist.github.com/ankushagarwal/a2dab3aaf664a4296292c4ff330fb5e6


lazybonesboy commented on August 22, 2024

I can think of a couple of fixes, but unfortunately they're in library code:
a) Keras could potentially use tf.nn.embedding_lookup() instead of tf.gather() in its Embedding layer. The tf.nn.embedding_lookup() function is device aware, so it will place the gather on the same device

I used this approach, and it works. Thank you.

