Comments (10)

larsr avatar larsr commented on September 14, 2024

Are you running something else already on the GPU?

from tensorflow-wavenet.

lelayf avatar lelayf commented on September 14, 2024

Nope, fresh install, fresh reboot, running headless, and this is the only process using the GPU. Has anyone successfully trained the network in less than 3.5 GiB of GPU memory?

ibab avatar ibab commented on September 14, 2024

Looks like it stopped on two particularly large input samples.
Did this happen immediately, or after a few iterations?
If it's just because of large samples, pull request #54 should fix this by always cutting them into fixed-size pieces.
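The idea behind cutting large samples into fixed-size pieces can be sketched as follows. This is an illustration of the approach, not the actual slicing code from #54; `cut_into_pieces` and its piece length are hypothetical names:

```python
import numpy as np

def cut_into_pieces(audio, sample_size):
    """Cut a long 1-D audio array into fixed-size pieces.

    'sample_size' plays the role of the SAMPLE_SIZE training flag;
    the exact logic in #54 may differ -- this is just the idea.
    """
    return [audio[i:i + sample_size]
            for i in range(0, len(audio), sample_size)]

# A 100,000-sample clip becomes pieces of at most 44,100 samples,
# so no single training example has an unbounded memory footprint.
clip = np.zeros(100_000, dtype=np.float32)
pieces = cut_into_pieces(clip, 44_100)
print([len(p) for p in pieces])  # [44100, 44100, 11800]
```

Bounding the length of each training example bounds the size of every activation tensor, which is what matters for GPU memory.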

You can also try setting the BATCH_SIZE to 1, or reducing the number of dilated convolutional layers in wavenet_params.json.
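Reducing the dilated layers means shortening the `"dilations"` list in wavenet_params.json. A shortened fragment might look like the following; the field names are based on the repo's params file and your checkout may differ:

```json
{
    "filter_width": 2,
    "quantization_channels": 256,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
}
```

Fewer entries in `"dilations"` means fewer layers, hence fewer activation tensors held in GPU memory, at the cost of a smaller receptive field.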

3.5 GB isn't a lot, but it should be enough to train the network with some tweaks.

Note that we haven't yet managed to reproduce the results from the WaveNet paper, so it's not worth training the network unless you can invest the time to look for good hyperparameters.

tuba avatar tuba commented on September 14, 2024

I get OOM on 8 GB of video RAM, always after about 160 steps.
Setting BATCH_SIZE = 1 seems to solve the problem for me.

lelayf avatar lelayf commented on September 14, 2024

BATCH_SIZE=1 allows the process to run for a few dozen steps before OOMing. I just tried adding a second identical GPU to increase total memory to 7 GB, but it behaves similarly. The second GPU is recognized by TF, but I think we would need explicit placement to make it useful. I will now look into #54.

ibab avatar ibab commented on September 14, 2024

@lelayf: Yeah, we're not making use of extra GPUs at the moment.
#54 should definitely help. You can adjust the SAMPLE_SIZE downwards if you still run into problems.
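For reference, explicit placement in TF 1.x means wrapping op construction in `tf.device(...)`. A minimal sketch of how layers could be spread round-robin over two GPUs; `device_for_layer` is a hypothetical helper, not something tensorflow-wavenet provides:

```python
def device_for_layer(layer_index, num_gpus=2):
    """Round-robin a dilated-convolution layer onto one of the GPUs.

    Hypothetical helper: the returned string is what you would pass
    to tf.device(...) when building that layer, e.g.:

        with tf.device(device_for_layer(i)):
            ...  # create layer i's variables and ops
    """
    return '/gpu:%d' % (layer_index % num_gpus)

print([device_for_layer(i) for i in range(4)])
# ['/gpu:0', '/gpu:1', '/gpu:0', '/gpu:1']
```

Note that splitting layers across devices adds cross-GPU transfers between consecutive layers, so it only pays off when a single GPU genuinely cannot hold the model.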

lelayf avatar lelayf commented on September 14, 2024

It seems the training now stalls silently after a few hundred steps; I do not get OOMs. I tried a sample_size of 96000, then 64000, and finally ran with no sample_size on the command line. I also tried different GPUs, a GRID K2 and a Tesla K80; in all cases the same silent stalling arises.

ibab avatar ibab commented on September 14, 2024

Could this be the same problem as in #65?
The audio pipeline stopped processing data after traversing the input files once.
Updating to one of the newer commits should fix that.
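The failure mode described here, a pipeline that runs dry after one pass over the files, is avoided by a reader that loops forever. A minimal sketch of that idea; the repo's actual reader thread differs in detail:

```python
import random

def cycle_files(filenames, shuffle=True):
    """Yield filenames indefinitely instead of stopping after one pass.

    Sketch of the looping behavior the comment describes, not the
    repo's actual audio_reader implementation.
    """
    while True:
        order = list(filenames)
        if shuffle:
            random.shuffle(order)
        for name in order:
            yield name

gen = cycle_files(['a.wav', 'b.wav'], shuffle=False)
print([next(gen) for _ in range(4)])  # ['a.wav', 'b.wav', 'a.wav', 'b.wav']
```

Because the generator never raises StopIteration, a consumer thread feeding a queue from it never starves after the first epoch.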

ibab avatar ibab commented on September 14, 2024

Closing this, as all mentioned issues should be fixed at this point.
Feel free to comment if you still experience problems with this.

mikelibg avatar mikelibg commented on September 14, 2024

Getting the same OOM when running on a 61 GiB AWS instance, even with sample size 10,000.
commit: 3c973c0
python: 3.6
with --silence_threshold=0 on one utterance (p225) from the original VCTK corpus

>>> psutil.virtual_memory()
svmem(total=64389132288, available=60933558272, percent=5.4, used=2929561600, free=59257782272, active=3300839424, inactive=1334075392, buffers=73183232, cached=2128605184, shared=21499904, slab=143368192)

Exception:
2018-05-13 06:20:26.790439: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at slice_op.cc:154 : Resource exhausted: OOM when allocating tensor with shape[1,44050,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Storing checkpoint to ./logdir/train/2018-05-13T06-19-22 ... Done.
Traceback (most recent call last):
  File "/home/ubuntu/training/tensorflow-wavenet/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/ubuntu/training/tensorflow-wavenet/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ubuntu/training/tensorflow-wavenet/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,38934,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: wavenet_1/loss/Slice = Slice[Index=DT_INT32, T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](wavenet_1/loss/Reshape, wavenet_1/loss/Slice/begin, wavenet_1/loss/Slice/size)]]
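Note that the 61 GiB reported by psutil is host RAM; the allocator that fails here is GPU_0_bfc, i.e. GPU memory. A back-of-envelope for the tensor in the traceback shows why long samples add up fast (the helper below is just illustrative arithmetic):

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense float32 tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

# The failing tensor from the traceback: [1, 38934, 256] float32.
print(tensor_bytes([1, 38934, 256]))  # 39868416 bytes, roughly 38 MB
```

Roughly 38 MB for a single activation tensor; with one such tensor per layer, plus gradients, the total quickly exhausts a GPU regardless of how much host RAM the instance has, which is why shrinking sample size or the layer count helps.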
