Comments (17)

guillaumekln commented on May 21, 2024

Are you using TensorFlow 1.5? Previous reports seem to indicate that only this version produces this issue.

azinnai commented on May 21, 2024

I'm using TensorFlow 1.6.0 and Python 3.6.3

guillaumekln commented on May 21, 2024

In my quick experiments, the memory usage does increase but hovers around a fixed value after a few evaluations. In my case the difference in memory usage was about 300MB.

Is your experience similar? If not, can you comment on the initial memory usage you measured and the increased usage after each evaluation?
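
If it helps with that measurement, here is a minimal sketch (assuming the third-party psutil package is installed; the function name and example call sites are hypothetical) that logs the host RSS of the training process so the growth per evaluation can be quantified:

import os

import psutil  # assumption: installed separately, e.g. pip install psutil


def log_rss(tag):
    """Print the resident set size (RSS) of the current process in MB."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024.0 ** 2)
    print("[memory] %s: RSS = %.1f MB" % (tag, rss_mb))


# Hypothetical usage: call once before training starts and once after each
# evaluation, then compare the reported values.
# log_rss("before training")
# log_rss("after evaluation 1")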

azinnai commented on May 21, 2024

Yes, it is. When I start the training, the memory usage is ~6GB, and it increases by ~3GB at each evaluation step. I'm using a dataset of ~5M parallel sentences and a configuration similar to the one in config/models/nmt_medium.py, with a vocabulary size of ~50K.

guillaumekln commented on May 21, 2024

Can you share the run configuration you are using (the YML file)?

azinnai commented on May 21, 2024

The config.yml file is here: https://pastebin.com/gcDzbhNh
I slightly modified the evaluation hooks and the ExternalEvaluator class: https://pastebin.com/4Ub7ZtNB
and hooks.py: https://pastebin.com/ii5twvB3
The model: https://pastebin.com/Y0r8MV4E

azinnai commented on May 21, 2024

@guillaumekln Do you think it is a bug in TensorFlow? In that case I could open an issue in the TensorFlow repository.

guillaumekln commented on May 21, 2024

I can't answer confidently at the moment. I spent some time trying to reproduce it based on your feedback but failed. Thanks for providing your complete configuration anyway; I might need to take another look.

Do you face this issue in every training run?

ptamas88 commented on May 21, 2024

I faced the same issue while training with 'train_and_eval'. However, with the same config settings and the 'train' parameter I haven't encountered this OOM error. Maybe this helps.
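
For context, the two run types map to different TF 1.x Estimator calls (a sketch only; the parameter names and step/throttle values are placeholders, not OpenNMT-tf's actual wrapper code):

import tensorflow as tf  # TF 1.x Estimator API


def run(estimator, train_input_fn, eval_input_fn, run_type):
    """Dispatch to the two run types discussed above (names are placeholders)."""
    if run_type == "train":
        # Plain training loop, no periodic evaluation: the workaround above.
        estimator.train(train_input_fn, max_steps=500000)
    else:  # "train_and_eval"
        # Training interleaved with periodic evaluation: the code path
        # where the memory growth / OOM shows up.
        train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=500000)
        eval_spec = tf.estimator.EvalSpec(eval_input_fn, throttle_secs=3600)
        tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)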

AnubhavSi commented on May 21, 2024

I am also facing this problem. Has anybody found a solution?

guillaumekln commented on May 21, 2024

@AnubhavSi What TensorFlow version are you using?

If someone can share the data files and training configuration for which the issue appears, that would definitely help.

guillaumekln commented on May 21, 2024

This is fixed by tensorflow/tensorflow@3edb609, which is available in the latest tf-nightly-gpu package (and should be part of TensorFlow 1.10).

Closing this issue as it is a TensorFlow issue.
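
To check whether an installed build already includes that fix, comparing the reported version against 1.10 is a reasonable first step (a minimal sketch; nightly version strings may need extra care):

import tensorflow as tf
from distutils.version import LooseVersion

# The fix landed in the nightly packages and is expected in TF >= 1.10.
print("TensorFlow version: %s" % tf.__version__)
if LooseVersion(tf.__version__) < LooseVersion("1.10.0"):
    print("This build may still be affected by the evaluation memory growth.")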

AnubhavSi commented on May 21, 2024

TensorFlow version: 1.4.0, Python 2.7, and CUDA 9.1.
I am training on 5M sentence pairs with the default single-GPU Transformer model configuration.
I am facing an OOM error while performing the evaluation; training is fine without evaluation.
As you suggested, I tried installing tf-nightly-gpu, but CUDA 9.1 is causing some problems.

AnubhavSi commented on May 21, 2024

I tried decreasing the validation data from 1M to 0.2M sentences, but the OOM error still occurs with evaluation.

guillaumekln commented on May 21, 2024

Is it an OOM in CPU or GPU memory?

AnubhavSi commented on May 21, 2024

Error log:
2018-07-18 19:20:48.816905: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1958593536 totalling 1.82GiB
2018-07-18 19:20:48.816913: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1967099904 totalling 1.83GiB
2018-07-18 19:20:48.816921: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 2058326016 totalling 1.92GiB
2018-07-18 19:20:48.816930: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 3677045760 totalling 3.42GiB
2018-07-18 19:20:48.816938: I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 9.84GiB
2018-07-18 19:20:48.816949: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats:
Limit: 10907126989
InUse: 10568435968
MaxInUse: 10837224704
NumAllocs: 1503329
MaxAllocSize: 10276284416

2018-07-18 19:20:48.816981: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **************************x_*********************************************************xxxxxxxxxxxxxxx
2018-07-18 19:20:48.816998: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[354048,1383]
Traceback (most recent call last):
File "main.py", line 3, in
main.main()
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/bin/main.py", line 138, in main
runner.train_and_evaluate()
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/runner.py", line 149, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 430, in train_and_evaluate
executor.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 616, in run_local
metrics = evaluator.evaluate_and_export()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 751, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 355, in evaluate
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 839, in _evaluate_model
config=self._session_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/evaluation.py", line 206, in _evaluate_once
session.run(eval_ops, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 521, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 892, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 967, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1024, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[354816,1386]
[[Node: transformer/decoder/layer_0/masked_multi_head/Softmax = SoftmaxT=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: transformer/decoder/dense/Tensordot/Shape/_1297 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3371_transformer/decoder/dense/Tensordot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op u'transformer/decoder/layer_0/masked_multi_head/Softmax', defined at:
File "main.py", line 3, in
main.main()
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/bin/main.py", line 138, in main
runner.train_and_evaluate()
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/runner.py", line 149, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 430, in train_and_evaluate
executor.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 616, in run_local
metrics = evaluator.evaluate_and_export()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 751, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 355, in evaluate
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 810, in _evaluate_model
features, labels, model_fn_lib.ModeKeys.EVAL, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 694, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/models/model.py", line 113, in _model_fn
logits, predictions = self._build(features, labels, params, mode, config=config)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/models/sequence_to_sequence.py", line 144, in _build
memory_sequence_length=encoder_sequence_length)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/decoders/self_attention_decoder.py", line 246, in decode
memory_sequence_length=memory_sequence_length)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/decoders/self_attention_decoder.py", line 168, in _self_attention_stack
dropout=self.attention_dropout)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/layers/transformer.py", line 276, in multi_head_attention
dropout=dropout)
File "/home/anubhav.singh9179/MachineTranslationAPI/opennmt/layers/transformer.py", line 199, in dot_product_attention
attn = tf.nn.softmax(dot)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 1667, in softmax
return _softmax(logits, gen_nn_ops._softmax, dim, name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 1617, in _softmax
output = compute_op(logits)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 4317, in _softmax
"Softmax", logits=logits, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[354816,1386]
[[Node: transformer/decoder/layer_0/masked_multi_head/Softmax = SoftmaxT=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: transformer/decoder/dense/Tensordot/Shape/_1297 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3371_transformer/decoder/dense/Tensordot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

guillaumekln commented on May 21, 2024

This looks like a GPU OOM and is unrelated to the current issue. If you think there is a bug, please open a new issue. Otherwise, this issue might be helpful: #175
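
Not OpenNMT-tf specific, but for the GPU case the usual mitigation is to evaluate with smaller batches, since the attention softmax tensor in the log above grows with the number of tokens per batch. A generic tf.data sketch of capping the evaluation batch size (file names and the batch size are placeholders, and the real NMT preprocessing is omitted):

import tensorflow as tf


def make_eval_input_fn(src_file, tgt_file, batch_size=16):
    """Return an Estimator input_fn that batches the validation files with a
    small, fixed batch size to bound per-batch GPU memory during evaluation."""
    def input_fn():
        src = tf.data.TextLineDataset(src_file)
        tgt = tf.data.TextLineDataset(tgt_file)
        # Real pipelines also tokenize, look up vocabularies, and pad here.
        return tf.data.Dataset.zip((src, tgt)).batch(batch_size)
    return input_fn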
