Comments (17)
Are you using TensorFlow 1.5? Previous reports seem to indicate that only this version produces this issue.
from opennmt-tf.
I'm using TensorFlow 1.6.0 and Python 3.6.3
In my quick experiments, the memory usage does increase but hovers around a fixed value after a few evaluations. In my case the difference in memory usage was about 300MB.
Is your experience similar? If not, can you comment on the initial memory usage you measured and the increased usage after each evaluation?
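For whoever wants to report comparable numbers, here is a minimal sketch (standard library only, not part of opennmt-tf) that logs the peak resident set size of the training process; printing it before training and after each evaluation makes the per-evaluation growth easy to compare:

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process, in MB.

    Linux reports ru_maxrss in kilobytes; macOS reports bytes, so
    adjust the divisor if you run this there.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Hypothetical usage: log once before training starts and once after
# each evaluation, then compare the deltas across evaluations.
print("peak RSS: %.1f MB" % peak_rss_mb())
```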
Yes, it is. When I start the training, the memory usage is ~6GB and it increases by ~3GB at each evaluation step. I'm using a dataset of ~5M parallel sentences and a configuration similar to the one in config/models/nmt_medium.py with a vocabulary size of ~50K.
Can you share the run configuration you are using (the YML file)?
The config.yml file is here: https://pastebin.com/gcDzbhNh
I slightly modified the evaluation hooks and the ExternalEvaluator class: https://pastebin.com/4Ub7ZtNB
and hooks.py: https://pastebin.com/ii5twvB3
The model: https://pastebin.com/Y0r8MV4E
@guillaumekln Do you think it is a bug in TensorFlow? In that case I could open an issue in the TensorFlow repository.
I can't answer confidently at the moment. I spent some time trying to reproduce it based on your feedback but failed. Thanks for providing your complete configuration anyway, I might need to take another look.
Do you face this issue on every training run?
I faced the same issue while training with 'train_and_eval'. However, with the same config settings and the 'train' param I haven't encountered this OOM error. Maybe this helps.
I am also facing this problem. Has anybody found a solution?
@AnubhavSi What TensorFlow version are you using?
If someone can share the data files and training configuration for which the issue appears, that would definitely help.
This is fixed by tensorflow/tensorflow@3edb609, which is available in the latest tf-nightly-gpu package (and should be part of TensorFlow 1.10).
Closing this issue as it is a TensorFlow issue.
TensorFlow version: 1.4.0, Python 2.7, and CUDA 9.1.
I am training on 5M sentence pairs with the default single-GPU Transformer model configuration.
I am facing an OOM error while performing the evaluation; training is fine without evaluation.
As you suggested I tried installing tf-nightly-gpu, but CUDA 9.1 is causing some problems.
I tried decreasing the validation data from 1M to 0.2M sentences, but the OOM error is still there with evaluation.
Is it an OOM on the CPU or the GPU memory?
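One way to tell (a sketch, assuming the `nvidia-smi` CLI is available on the machine) is to sample the GPU memory reported by the driver alongside the process RSS; the parsing is demonstrated on a captured sample line so it runs without a GPU:

```python
import subprocess

def parse_nvidia_smi_used_mib(output):
    """Parse 'memory.used' values (MiB) from nvidia-smi CSV output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]

def gpu_memory_used_mib():
    """Query the driver for per-GPU used memory, in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]).decode()
    return parse_nvidia_smi_used_mib(out)

# Parsing demonstrated on a captured sample, one value per GPU:
sample = "10568\n512\n"
print(parse_nvidia_smi_used_mib(sample))  # [10568, 512]
```

If the value from the driver grows toward the card's limit while the process RSS stays flat, the OOM is on the GPU side.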
Error log:
2018-07-18 19:20:48.816905: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1958593536 totalling 1.82GiB
2018-07-18 19:20:48.816913: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1967099904 totalling 1.83GiB
2018-07-18 19:20:48.816921: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 2058326016 totalling 1.92GiB
2018-07-18 19:20:48.816930: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3677045760 totalling 3.42GiB
2018-07-18 19:20:48.816938: I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 9.84GiB
2018-07-18 19:20:48.816949: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats:
Limit: 10907126989
InUse: 10568435968
MaxInUse: 10837224704
NumAllocs: 1503329
MaxAllocSize: 10276284416
2018-07-18 19:20:48.816981: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **************************x_*********************************************************xxxxxxxxxxxxxxx
2018-07-18 19:20:48.816998: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[354048,1383]
Traceback (most recent call last):
File "main.py", line 3, in <module>
main.main()
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/bin/main.py", line 138, in main
runner.train_and_evaluate()
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/runner.py", line 149, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 430, in train_and_evaluate
executor.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 616, in run_local
metrics = evaluator.evaluate_and_export()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 751, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 355, in evaluate
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 839, in _evaluate_model
config=self._session_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/evaluation.py", line 206, in _evaluate_once
session.run(eval_ops, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 521, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 892, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 967, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1024, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[354816,1386]
[[Node: transformer/decoder/layer_0/masked_multi_head/Softmax = Softmax[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]
[[Node: transformer/decoder/dense/Tensordot/Shape/_1297 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3371_transformer/decoder/dense/Tensordot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]
Caused by op u'transformer/decoder/layer_0/masked_multi_head/Softmax', defined at:
File "main.py", line 3, in <module>
main.main()
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/bin/main.py", line 138, in main
runner.train_and_evaluate()
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/runner.py", line 149, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 430, in train_and_evaluate
executor.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 616, in run_local
metrics = evaluator.evaluate_and_export()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 751, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 355, in evaluate
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 810, in _evaluate_model
features, labels, model_fn_lib.ModeKeys.EVAL, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 694, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/models/model.py", line 113, in _model_fn
logits, predictions = self._build(features, labels, params, mode, config=config)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/models/sequence_to_sequence.py", line 144, in _build
memory_sequence_length=encoder_sequence_length)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/decoders/self_attention_decoder.py", line 246, in decode
memory_sequence_length=memory_sequence_length)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/decoders/self_attention_decoder.py", line 168, in _self_attention_stack
dropout=self.attention_dropout)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/layers/transformer.py", line 276, in multi_head_attention
dropout=dropout)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/layers/transformer.py", line 199, in dot_product_attention
attn = tf.nn.softmax(dot)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 1667, in softmax
return _softmax(logits, gen_nn_ops._softmax, dim, name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 1617, in _softmax
output = compute_op(logits)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 4317, in _softmax
"Softmax", logits=logits, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[354816,1386]
[[Node: transformer/decoder/layer_0/masked_multi_head/Softmax = Softmax[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]
[[Node: transformer/decoder/dense/Tensordot/Shape/_1297 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3371_transformer/decoder/dense/Tensordot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]
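As a sanity check on the log above, the single failing float32 attention tensor of shape [354816, 1386] already accounts for one of the largest chunks reported by the BFC allocator:

```python
# Size of the failing softmax tensor from the log: shape [354816, 1386],
# stored as float32 (4 bytes per element).
elements = 354816 * 1386
bytes_needed = elements * 4
gib = bytes_needed / float(1024 ** 3)
print(bytes_needed)       # 1967099904 -- matches the "1.83GiB" chunk in the log
print(round(gib, 2))      # 1.83
```

This confirms the allocation request is on the GPU and that a single evaluation batch can dominate the card's ~10GB limit.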
This looks like a GPU OOM and it is unrelated to the current issue. If you think there is a bug, please open a new issue. Otherwise, this issue might be helpful: #175