When I try the WMT'16 EN-DE sample, I encounter the following CUDA_ERROR_OUT_OF_MEMORY

failed to allocate 11.90G CUDA_ERROR_OUT_OF_MEMORY about seq2seq HOT 10 CLOSED

google commented on April 29, 2024

failed to allocate 11.90G CUDA_ERROR_OUT_OF_MEMORY

from seq2seq.

Comments (10)

papajohn commented on April 29, 2024 1

I read some other posts [1], [2] about this issue for other projects. It appears that such an error doesn't matter, and the GPU still gets used (just with some memory growth factor that gets adjusted over time).

You can check whether your GPU is being used with the nvidia-smi command-line tool.

[1] http://stackoverflow.com/questions/39465503/cuda-error-out-of-memory-in-tensorflow
[2] tensorflow/tensorflow#6048
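
The on-demand memory growth described above can also be requested explicitly. A minimal sketch, assuming TensorFlow 1.x (where `tf.ConfigProto` and its `gpu_options` fields exist). The helper name `make_gpu_config` is made up for illustration, and how the seq2seq training script accepts a session config is not covered in this thread:

```python
def make_gpu_config(allow_growth=True, memory_fraction=None):
    """Build a TF 1.x session config that limits up-front GPU allocation.

    Hypothetical helper, not part of seq2seq.
    """
    # Imported lazily so this module can be loaded on machines without TensorFlow.
    import tensorflow as tf

    config = tf.ConfigProto()
    # Allocate GPU memory on demand instead of reserving almost all of it at startup.
    config.gpu_options.allow_growth = allow_growth
    if memory_fraction is not None:
        # Optional hard cap, expressed as a fraction of each visible GPU's memory.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config


# Usage (TF 1.x):
#   import tensorflow as tf
#   sess = tf.Session(config=make_gpu_config(memory_fraction=0.5))
```

With a config like this, the "failed to allocate 11.90G" message may not appear at all, since TensorFlow no longer tries to claim the whole card at once.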

DaoD commented on April 29, 2024

I had the same problem, but I solved it by leaving out the file "train_seq2seq.yml" and using only "nmt_xxxx.yml".
I guess something is wrong in the hooks configuration, but I'm not sure.

zzks commented on April 29, 2024

@DaoD Thanks for your reply!
However, I tried your method and still got CUDA_ERROR_OUT_OF_MEMORY.
Has anyone succeeded in running the sample?
Could you tell us your configuration and environment?

DaoD commented on April 29, 2024

Yes, the error still happens, but the training process continues anyway.
I don't know the reason.

dennybritz commented on April 29, 2024

I think this is a different error and not related to the GPU:

"predicted_tokens": self._pred_dict["predicted_tokens"],
KeyError: 'predicted_tokens'

This is probably the same as #43. I haven't been able to reproduce it and am not sure what is wrong here.

dennybritz commented on April 29, 2024

I'm closing this one because it seems like a duplicate of #43 - please discuss there or re-open the issue if it's not a duplicate.

zzks commented on April 29, 2024

@dennybritz @papajohn Thanks for your replies!
I updated to the latest seq2seq and got some new errors.
However, I tried @DaoD's method (no buckets) again. It still reports CUDA_ERROR_OUT_OF_MEMORY, but the training process continues.

The new errors when using the buckets config are as follows (python3.4):
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 6810 get requests, put_count=8013 evicted_count=2000 eviction_rate=0.249594 and unsatisfied allocation rate=0.119824
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 212 to 233
INFO:tensorflow:Performing full trace on next step.
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH: :/usr/local/cuda/lib64:/usr/local/cudnn/lib64
F tensorflow/core/platform/default/gpu/cupti_wrapper.cc:59] Check failed: ::tensorflow::Status::OK() == (::tensorflow::Env::Default()->GetSymbolFromLibrary( GetDsoHandle(), kName, &f)) (OK vs. Not found: /home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cuptiActivityRegisterCallbacks)could not find cuptiActivityRegisterCallbacksin libcupti DSO
Aborted (core dumped)

Anyway, I can run the sample now.
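
The log above shows TensorFlow looking for libcupti.so.8.0 in the directories on LD_LIBRARY_PATH and not finding it. A small sketch for checking which directories actually contain a libcupti library. The function names are made up for illustration, and the CUPTI location mentioned in the usage note is an assumption based on where many CUDA 8.0 installs place it, so it may differ on your machine:

```python
import glob
import os


def find_libcupti(search_dirs):
    """Return sorted paths of libcupti shared objects found in search_dirs."""
    hits = []
    for directory in search_dirs:
        if not directory:
            # LD_LIBRARY_PATH may contain empty entries (e.g. a leading ':').
            continue
        hits.extend(glob.glob(os.path.join(directory, "libcupti.so*")))
    return sorted(hits)


def ld_library_dirs():
    """Split LD_LIBRARY_PATH into its directory entries."""
    return os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
```

If `find_libcupti(ld_library_dirs())` returns an empty list, that matches the "Couldn't open CUDA library" message. You can then try `find_libcupti(["/usr/local/cuda/extras/CUPTI/lib64"])` (assumed location) and, if the library is there, add that directory to LD_LIBRARY_PATH before starting training.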

ayushidalmia commented on April 29, 2024

I also ran into this. Is there any solution? A lighter unit test would help.

eugenioclrc commented on April 29, 2024

same here...

myagmur01 commented on April 29, 2024

Have you checked which processes are running on the GPU? The error may be caused by several open environments that together overload the GPU. I recommend closing all terminals and opening them again. I hope this works for you.
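
One way to check which processes are holding GPU memory is to query nvidia-smi from a script. This is a rough sketch, assuming `nvidia-smi` is on the PATH and supports the `--query-compute-apps` fields used below (see `nvidia-smi --help-query-compute-apps` on your system). The function names are made up for illustration:

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into (pid, name, MiB) tuples."""
    apps = []
    for line in csv_text.strip().splitlines():
        # Split from both ends so a comma inside the process name does not break parsing.
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        mem = mem.strip()
        # Some drivers report "[N/A]" instead of a number; keep that as None.
        apps.append((int(pid), name.strip(), int(mem) if mem.isdigit() else None))
    return apps


def gpu_processes():
    """Ask nvidia-smi which compute processes currently hold GPU memory."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_compute_apps(out)
```

Calling `gpu_processes()` before starting training lists any leftover Python sessions that still occupy the card, so you can stop them instead of closing every terminal.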
