Comments (10)

larsr avatar larsr commented on September 14, 2024

Are you running something else already on the GPU?

from tensorflow-wavenet.

lelayf avatar lelayf commented on September 14, 2024

Nope, fresh install, fresh reboot, running headless, and this is the only process using the GPU. Has anyone successfully trained the network in less than 3.5 GiB of GPU memory?

ibab avatar ibab commented on September 14, 2024

Looks like it stopped on two particularly large input samples.
Did this happen immediately, or after a few iterations?
If it's just because of large samples, pull request #54 should fix this by always cutting them into fixed-size pieces.
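The idea behind cutting large samples into fixed-size pieces can be sketched as follows. This is an illustration of the approach, not the actual slicing code from #54; `cut_into_pieces` and its piece length are hypothetical names:

```python
import numpy as np

def cut_into_pieces(audio, sample_size):
    """Cut a long 1-D audio array into fixed-size pieces.

    'sample_size' plays the role of the SAMPLE_SIZE training flag;
    the exact logic in #54 may differ -- this is just the idea.
    """
    return [audio[i:i + sample_size]
            for i in range(0, len(audio), sample_size)]

# A 100,000-sample clip becomes pieces of at most 44,100 samples,
# so no single training example has an unbounded memory footprint.
clip = np.zeros(100_000, dtype=np.float32)
pieces = cut_into_pieces(clip, 44_100)
print([len(p) for p in pieces])  # [44100, 44100, 11800]
```

Bounding the length of each training example bounds the size of every activation tensor, which is what matters for GPU memory.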

You can also try setting the BATCH_SIZE to 1, or reducing the number of dilated convolutional layers in wavenet_params.json.
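Reducing the dilated layers means shortening the `"dilations"` list in wavenet_params.json. A shortened fragment might look like the following; the field names are based on the repo's params file and your checkout may differ:

```json
{
    "filter_width": 2,
    "quantization_channels": 256,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
}
```

Fewer entries in `"dilations"` means fewer layers, hence fewer activation tensors held in GPU memory, at the cost of a smaller receptive field.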

3.5 GB isn't a lot, but it should be enough to train the network with some tweaks.

Note that we haven't yet managed to reproduce the results from the WaveNet paper, so it's not worth training the network unless you can invest the time to look for good hyperparameters.

tuba avatar tuba commented on September 14, 2024

I get OOM on 8 GB of video RAM, always after about 160 steps.
Setting BATCH_SIZE = 1 seems to solve the problem for me.

lelayf avatar lelayf commented on September 14, 2024

BATCH_SIZE=1 allows the process to run for a few dozen steps before OOMing. I just tried adding a second identical GPU to increase total memory to 7 GB, but it behaves similarly. The second GPU is recognized by TF, but I think we would need explicit placement to make it useful. I will now look into #54.

ibab avatar ibab commented on September 14, 2024

@lelayf: Yeah, we're not making use of extra GPUs at the moment.
#54 should definitely help. You can adjust the SAMPLE_SIZE downwards if you still run into problems.
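For reference, explicit placement in TF 1.x means wrapping op construction in `tf.device(...)`. A minimal sketch of how layers could be spread round-robin over two GPUs; `device_for_layer` is a hypothetical helper, not something tensorflow-wavenet provides:

```python
def device_for_layer(layer_index, num_gpus=2):
    """Round-robin a dilated-convolution layer onto one of the GPUs.

    Hypothetical helper: the returned string is what you would pass
    to tf.device(...) when building that layer, e.g.:

        with tf.device(device_for_layer(i)):
            ...  # create layer i's variables and ops
    """
    return '/gpu:%d' % (layer_index % num_gpus)

print([device_for_layer(i) for i in range(4)])
# ['/gpu:0', '/gpu:1', '/gpu:0', '/gpu:1']
```

Note that splitting layers across devices adds cross-GPU transfers between consecutive layers, so it only pays off when a single GPU genuinely cannot hold the model.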

lelayf avatar lelayf commented on September 14, 2024

It seems the training now stalls silently after a few hundred steps; I do not get OOMs. I tried a sample_size of 96000, then 64000, and finally ran with no sample_size on the command line. I also tried different GPUs, a GRID K2 and a Tesla K80; in all cases the same silent stalling arises.

ibab avatar ibab commented on September 14, 2024

Could this be the same problem as in #65?
The audio pipeline stopped processing data after traversing the input files once.
Updating to one of the newer commits should fix that.
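The failure mode described here, a pipeline that runs dry after one pass over the files, is avoided by a reader that loops forever. A minimal sketch of that idea; the repo's actual reader thread differs in detail:

```python
import random

def cycle_files(filenames, shuffle=True):
    """Yield filenames indefinitely instead of stopping after one pass.

    Sketch of the looping behavior the comment describes, not the
    repo's actual audio_reader implementation.
    """
    while True:
        order = list(filenames)
        if shuffle:
            random.shuffle(order)
        for name in order:
            yield name

gen = cycle_files(['a.wav', 'b.wav'], shuffle=False)
print([next(gen) for _ in range(4)])  # ['a.wav', 'b.wav', 'a.wav', 'b.wav']
```

Because the generator never raises StopIteration, a consumer thread feeding a queue from it never starves after the first epoch.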

ibab avatar ibab commented on September 14, 2024

Closing this, as all mentioned issues should be fixed at this point.
Feel free to comment if you still experience problems with this.

mikelibg avatar mikelibg commented on September 14, 2024

Getting the same OOM when running on a 61 GiB AWS instance, even with sample size 10,000.
commit: 3c973c0
python: 3.6
with --silence_threshold=0 on one utterance (p225) from the original VCTK corpus

>>> psutil.virtual_memory()
svmem(total=64389132288, available=60933558272, percent=5.4, used=2929561600, free=59257782272, active=3300839424, inactive=1334075392, buffers=73183232, cached=2128605184, shared=21499904, slab=143368192)

Exception:
2018-05-13 06:20:26.790439: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at slice_op.cc:154 : Resource exhausted: OOM when allocating tensor with shape[1,44050,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Storing checkpoint to ./logdir/train/2018-05-13T06-19-22 ... Done.
Traceback (most recent call last):
  File "/home/ubuntu/training/tensorflow-wavenet/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/ubuntu/training/tensorflow-wavenet/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ubuntu/training/tensorflow-wavenet/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,38934,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: wavenet_1/loss/Slice = Slice[Index=DT_INT32, T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](wavenet_1/loss/Reshape, wavenet_1/loss/Slice/begin, wavenet_1/loss/Slice/size)]]
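Note that the 61 GiB reported by psutil is host RAM; the allocator that fails here is GPU_0_bfc, i.e. GPU memory. A back-of-envelope for the tensor in the traceback shows why long samples add up fast (the helper below is just illustrative arithmetic):

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense float32 tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

# The failing tensor from the traceback: [1, 38934, 256] float32.
print(tensor_bytes([1, 38934, 256]))  # 39868416 bytes, roughly 38 MB
```

Roughly 38 MB for a single activation tensor; with one such tensor per layer, plus gradients, the total quickly exhausts a GPU regardless of how much host RAM the instance has, which is why shrinking sample size or the layer count helps.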
