
Comments (37)

aymericdamien commented on April 28, 2024

Are you using a GPU? Usually, this error is raised when GPU memory is full.

burness commented on April 28, 2024

@aymericdamien Thanks! I found the reason: I was using an IPython notebook to run the code, but I forgot to close another one, and that script was wasting too much memory.

laventura commented on April 28, 2024

@pumplerod - I found a solution / kludge that somehow seems to work, although I can't explain why / how.

Before starting your Jupyter notebook / tensorflow program, set this:

export CUDA_VISIBLE_DEVICES=1

This seems to work, in that the scripts now run OK. I'm not sure if it's strictly required.
Give it a try and see.
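
The same thing can be done from inside a script; here's a minimal sketch, assuming the variable is set before TensorFlow is imported (the device index "1" is just an example):

import os

# Must be set before TensorFlow is imported.
# "0" selects the first GPU, "1" the second; an empty string hides all GPUs from CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf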

subodhp commented on April 28, 2024

Yup, GPU memory being full is the reason. IPython kernels stuck in background processes do that.

Thanks,
Subodh
thesubodh.com

pumplerod commented on April 28, 2024

Wow. Thanks. That seems to have worked. Not sure how it's related, but before trying your solution I got rid of the error by specifying
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True, gpu_options=gpu_options))

However when I tried to run training it crashed the jupyter notebook.

Boltuzamaki commented on April 28, 2024

It occurs because the GPU memory is full. The best way is to reduce the batch size.

For example, if batch_size = 32,

make it 16, 8, 4, or 2 - whatever it takes until the error is resolved.

It works every single time for me.
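
As an illustration in Keras (a hedged sketch; model, x_train and y_train stand in for your own model and data):

# Halve the batch size until the out-of-memory error goes away.
model.fit(x_train, y_train, batch_size=16)  # was batch_size=32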

Mazecreator commented on April 28, 2024

Just stumbled upon this thread. I think you have hidden your GPU from the CUDA drivers with this line:

export CUDA_VISIBLE_DEVICES=1

What this is telling CUDA is that it should only use "Device 1" in your system. So, unless you have 2 GPU devices, you have hidden the primary "Device 0". I am sure if you set this as follows TF will see your GPU again, but your other problems may return:

export CUDA_VISIBLE_DEVICES=0
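
One way to check which devices TensorFlow actually sees after setting the variable (a sketch against the TF 1.x API used in this thread):

from tensorflow.python.client import device_lib

# Prints the devices visible to TensorFlow; if CUDA_VISIBLE_DEVICES points at a
# non-existent index, only the CPU device will be listed.
print(device_lib.list_local_devices())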

SlobodanNinkov commented on April 28, 2024

Reduce the size of the batches sent to run or eval; that should do the trick.

MrYakobo commented on April 28, 2024

It occurs because the GPU memory is full. The best way is to reduce the batch size.

For example, if batch_size = 32,

make it 16, 8, 4, or 2 - whatever it takes until the error is resolved.

It works every single time for me.

For me, removing val_split helped as well. 🤷

bmy-ashampoo commented on April 28, 2024

I had a similar problem when loading a previously trained model from disk (so changing the batch_size wasn't an option). This is what fixed it:

with tf.device('/CPU:0'):
    loaded = tf.saved_model.load(model_path)
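
A similar pattern should work when the model was saved with Keras rather than as a raw SavedModel; this is only a sketch, and model_path is assumed to point at your saved model:

import tensorflow as tf

with tf.device('/CPU:0'):
    # Place the restored weights in host memory instead of on the GPU.
    model = tf.keras.models.load_model(model_path)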

normanheckscher commented on April 28, 2024

The MacBook's Nvidia GPU isn't dedicated to compute; it is shared between TensorFlow and the screen.

I regularly have out-of-memory issues. I'm using a mid-2012 rMBP with a GeForce 650.

Before running TensorFlow, I close all processes using the GPU (look at the resource monitor's video card column) to force OSX to use the integrated video card. Doing this releases some memory and I can execute TensorFlow scripts. Not all memory is cleared when I check it with cuda-smi. You can quickly see which graphics card is being used with the gfx.io app. I found it good to disable WebGL in Safari (although it's needed for TensorBoard). Restarting Safari and PyCharm before running TensorFlow scripts is helpful to clear GPU memory. Stopping non-essential apps in the background is also helpful.

https://github.com/phvu/cuda-smi

https://gfx.io

Could an OSX issue be a possibility?

The MacBook isn't the best "all in one" dev platform for TensorFlow, but it can be made to work... albeit frustratingly.

It would be good to force OSX to use the integrated video chip for the screen and the Nvidia card exclusively for TensorFlow. I'm totally unsure, however; some early discussions about the hardware indicated that Apple has locked down certain parts of the GPU access... so if it can't be used exclusively now, it's likely to be difficult/impossible in the future.

UkiDLucas commented on April 28, 2024

I can use the MacBook Pro NVidia GPU, but only for minimal applications:

import tensorflow as tf
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8) #0.333
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True, gpu_options=gpu_options))

When I increase the number of Conv2D filters, e.g. from 32 to 64, I start to get a DEAD KERNEL, so I lower the number of images I process per batch, e.g. from 256 to 24.

You have to keep trying until you get the right balance between the depth of your neural network, the batch size, and the amount of GPU memory.

In the end, it is much faster than the CPU, but too fragile; after much frustration, I am going back to the CPU and a more powerful Linux GPU instance.

sebtac commented on April 28, 2024

Running into the same issue with the smallest possible model, Cart-Pole, on a GTX 1080 with 8 GB. Is it a TensorFlow bug that can be fixed somehow, or are we simply trying to fit models that are too big (an over-enthusiastic batch size is probably the main reason for that)?

sebtac

dhruvchamania commented on April 28, 2024

In my case, it was an issue with the dataset. Removing the problematic images (very large images from Google, or weird images that you get from web scraping) solved my problem.

laventura commented on April 28, 2024

@burness @subodhp I'm getting the same error ("Ran out of memory")
[MacbookPro 2013 with 16 GB RAM, GPU (2GB RAM), TensorFlow 0.11, CUDA 8.0, CUDNN 5.x]

I tried shutting down the Jupyter Notebook and restarting it... but it crashed with the same error.
Is this solved?
How does one resolve GPU memory full errors?

Thanks!

I tensorflow/core/common_runtime/bfc_allocator.cc:689] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 5 Chunks of size 256 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 31488 totalling 30.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 46609152 totalling 44.45MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 44.48MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit: 57622528
InUse: 46643200
MaxInUse: 46643200
NumAllocs: 11
MaxAllocSize: 46609152

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ********************************************************************xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 390.6KiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:958] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: Reshape_1/_2__cf__2 = Constdtype=DT_FLOAT, value=Tensor<type: float shape: [10000,10] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]]

laventura commented on April 28, 2024

I rebooted my Macbook and started afresh.
system: [MacbookPro 2013, with 16 GB RAM, GPU with 2GB RAM; TensorFlow 0.11, CUDA 8.0, CUDNN 5.x]
Here's the error I get (see the attached error-tf.txt at the bottom for all details).

  1. How is the free memory only 20.49 MiB (on a recently rebooted system) if there's 2.0 GiB available to the GPU?
  2. Is there a way to track GPU memory usage?
  3. Is there a way to disable GPU usage for an iPython notebook?

Thanks!

Some relevant parts I see:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 20.49MiB
...

I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit: 21487616
InUse: 33792
MaxInUse: 65280
NumAllocs: 9
MaxAllocSize: 31488

W

tensorflow/core/common_runtime/bfc_allocator.cc:270] *___________________________________________________________________________________________________
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 29.91MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:958] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: Const = Constdtype=DT_FLOAT, value=Tensor<type: float shape: [10000,784] values: -0.5 -0.49607843 -0.5...>, _device="/job:localhost/replica:0/task:0/gpu:0"]]

error-tf.txt

pumplerod commented on April 28, 2024

@laventura, did you ever find a solution to the GPU out-of-memory error? I have the same problem with the same setup, though I got an error trying to allocate 10.8 MiB.

sapeyes commented on April 28, 2024

@pumplerod

Oh, yours was very helpful for me. I got an error message about running out of memory when trying to allocate only ~29 MiB.
I added your code with a fraction of 0.8, since 80% of the memory was free (of 2 GiB, 1.6 GiB was free).
My code started working. After that, I deleted ALL GPU options and it still works. Very curious...

laventura commented on April 28, 2024

Update on this:

Earlier, the GPU was being recognized by an older TensorFlow. Then I upgraded TF to 0.11rc2 and later to 0.12.

Now, my TF does not recognize any GPU at all.

Also, the deviceQuery does not report any GPU either. I'm going totally bonkers in this CUDA hell.

See details here:
tensorflow/tensorflow#2882

Also on NVIDIA Devtalk, if any one has any bright insights - would be very helpful to me!
https://devtalk.nvidia.com/default/topic/990015/cuda-setup-and-installation/help-cuda-7-5-or-8-devicequery-failing-not-working-on-macbookpro-2013-os-x-10-11-gt750m/

laventura commented on April 28, 2024

@Mazecreator & Others,

Indeed; when I set CUDA_VISIBLE_DEVICES=0, the deviceQuery returns successfully. However, now TensorFlow complains again with "Dst Tensor Not initialized" !!

This is so frustrating!!

It appears that CUDA is leaking memory... I see that the free memory listed (when a Python script starts) keeps getting lower and lower, though I don't know for sure if that's the problem.
The suggestions above (setting TF's GPUOptions) are all workarounds - they require manual code changes / intervention in existing scripts that were supposed to work as-is.

See here: deviceQuery

 py35 ▶ ~ ▶ Developer ❯ … ❯ x86_64 ❯ darwin ❯ release ▶ $ ▶ echo $CUDA_HOME 
/usr/local/cuda
 py35 ▶ ~ ▶ Developer ❯ … ❯ x86_64 ❯ darwin ❯ release ▶ $ ▶ echo $CUDA_VISIBLE_DEVICES

 py35 ▶ ~ ▶ Developer ❯ … ❯ x86_64 ❯ darwin ❯ release ▶ $ ▶ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 750M"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2048 MBytes (2147024896 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            926 MHz (0.93 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 750M
Result = PASS
 py35 ▶ ~ ▶ Developer ❯ … ❯ x86_64 ❯ darwin ❯ release ▶ $ ▶ 

 py35 ▶ ~ ▶ Developer ❯ CUDA ❯ cuda-smi ▶ master ▶ ❓ ▶ $ ▶ ./cuda-smi 
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 369.92 of 2047.6 MB (i.e. 18.1%) Free

Running a Python script with TensorFlow:

 py35 ▶ ~ ▶ Developer ❯ … ❯ self_driving_car ❯ traffic-signs ❯ CarND-Alexnet-Fe ▶ master ▶ 4✎ ▶ $ ▶ python imagenet_inference.py 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.dylib locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.dylib locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.dylib locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.1.dylib locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.dylib locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] OS X does not support NUMA - returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 305.92MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): 	Total Chunks: 1, Chunks in use: 0 97.01MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 144.00MiB was 128.00MiB, Chunk State: 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x700a60000 of size 1280
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x700a60500 of size 139520
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x700a82600 of size 512
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x700a82800 of size 1228800
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x700bae800 of size 1024
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x700baec00 of size 3538944
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x700f0ec00 of size 1536
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x700f0f200 of size 2654208
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x701197200 of size 1536
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x701197800 of size 1769472
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x701347800 of size 1024
I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x701347c00 of size 101725184
I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 512 totalling 512B
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 1024 totalling 2.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 1536 totalling 3.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 139520 totalling 136.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1228800 totalling 1.17MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1769472 totalling 1.69MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 2654208 totalling 2.53MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3538944 totalling 3.38MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 8.91MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:                   111063040
InUse:                     9337856
MaxInUse:                  9337856
NumAllocs:                      11
MaxAllocSize:              3538944

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********___________________________________________________________________________________________
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 144.00MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:965] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:390] Executor failed to create kernel. Internal: Dst tensor is not initialized.
	 [[Node: Variable_10/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [9216,4096] values: [-0.0043384791 -0.0071635786 -0.0067223078]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1021, in _do_call
    return fn(*args)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
    status, run_metadata)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
	 [[Node: Variable_10/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [9216,4096] values: [-0.0043384791 -0.0071635786 -0.0067223078]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "imagenet_inference.py", line 19, in <module>
    sess.run(init)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
	 [[Node: Variable_10/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [9216,4096] values: [-0.0043384791 -0.0071635786 -0.0067223078]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op 'Variable_10/initial_value', defined at:
  File "imagenet_inference.py", line 16, in <module>
    probs = AlexNet(x, feature_extract=False)
  File "/Users/aa/Developer/courses/self_driving_carnd/traffic-signs/CarND-Alexnet-Feature-Extraction/alexnet.py", line 139, in AlexNet
    fc6W = tf.Variable(net_data["fc6"][0])
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 224, in __init__
    expected_shape=expected_shape)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 333, in _init_from_args
    initial_value, name="initial_value", dtype=dtype)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 669, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 169, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Users/aa/Developer/miniconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Dst tensor is not initialized.
	 [[Node: Variable_10/initial_value = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [9216,4096] values: [-0.0043384791 -0.0071635786 -0.0067223078]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

 py35 ▶ ~ ▶ Developer ❯ … ❯ self_driving_car ❯ traffic-signs ❯ CarND-Alexnet-Fe ▶ master ▶ 4✎ ▶ $ ▶ 

monajalal commented on April 28, 2024

I get the same error and I have 12GB of GPU memory:

mona@pascal:~/computer_vision/VPilot$ python train.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:1938: UserWarning: Expected no kwargs, you passed 1
kwargs passed to function are ignored with Tensorflow backend
  warnings.warn('\n'.join(msg))
Epoch 1/1000
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:03:00.0
Total memory: 11.92GiB
Free memory: 412.50MiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x4547d60
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:83:00.0
Total memory: 11.92GiB
Free memory: 534.50MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:855] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1:   N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40c, pci bus id: 0000:83:00.0)
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 512.0KiB was 512.0KiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740000 of size 1280
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740d00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740e00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b740f00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b741000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b741100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b741200 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b741300 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b741400 of size 4096
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b742400 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b742500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b742600 of size 2048
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b742e00 of size 2048
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b743600 of size 1024
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b743a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b743b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b743c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b743d00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x130b743e00 of size 222806528
I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 21 Chunks of size 256 totalling 5.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1024 totalling 1.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 2048 totalling 4.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 4096 totalling 4.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 222806528 totalling 212.48MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 212.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                   222822400
InUse:                   222822400
MaxInUse:                222822400
NumAllocs:                      27
MaxAllocSize:            222806528
 
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***********************************************************************************************xxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 512.0KiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:958] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:334] Executor failed to create kernel. Internal: Dst tensor is not initialized.
     [[Node: Const_37 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [512,256] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
  File "train.py", line 55, in <module>
    callbacks=[ckp_callback]
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 935, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1553, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1316, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 1919, in __call__
    session = get_session()
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 121, in get_session
    _initialize_variables()
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 275, in _initialize_variables
    sess.run(tf.initialize_variables(uninitialized_variables))
  File "/home/mona/tensorflow/_python_build/tensorflow/python/client/session.py", line 717, in run
    run_metadata_ptr)
  File "/home/mona/tensorflow/_python_build/tensorflow/python/client/session.py", line 915, in _run
    feed_dict_string, options, run_metadata)
  File "/home/mona/tensorflow/_python_build/tensorflow/python/client/session.py", line 965, in _do_run
    target_list, options, run_metadata)
  File "/home/mona/tensorflow/_python_build/tensorflow/python/client/session.py", line 985, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InternalError: Dst tensor is not initialized.
     [[Node: Const_37 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [512,256] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
 
Caused by op u'Const_37', defined at:
  File "train.py", line 55, in <module>
    callbacks=[ckp_callback]
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 935, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1450, in fit_generator
    self._make_train_function()
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 761, in _make_train_function
    self.total_loss)
  File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 234, in get_updates
    accumulators = [K.zeros(shape) for shape in shapes]
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 482, in zeros
    return variable(tf.constant_initializer(0., dtype=tf_dtype)(shape),
  File "/home/mona/tensorflow/_python_build/tensorflow/python/ops/init_ops.py", line 145, in _initializer
    return constant_op.constant(value, dtype=dtype, shape=shape)
  File "/home/mona/tensorflow/_python_build/tensorflow/python/framework/constant_op.py", line 167, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/home/mona/tensorflow/_python_build/tensorflow/python/framework/ops.py", line 2388, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/mona/tensorflow/_python_build/tensorflow/python/framework/ops.py", line 1300, in __init__
    self._traceback = _extract_stack()
 
InternalError (see above for traceback): Dst tensor is not initialized.
     [[Node: Const_37 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [512,256] values: 0 0 0...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

laventura commented on April 28, 2024

@monajalal --

It appears that the GPU is running out of memory for some reason.
WHY that is happening, I can't say; it is the most confounding thing since the executed programs have ended.

Probably a memory leak?? If so, it could be at the GPU driver level??

See here too: tensorflow/tensorflow#7025 (comment)

I've tried searching for how to release/clear GPU memory, but haven't found anything good / credible / useful.

Do let me know if you or anyone comes across a solution.

Until then, this TensorFlow + GPU combo is a total fail for me (on my Macbook). 😡

laventura commented on April 28, 2024

@normanheckscher - Thanks for the tips. Good to know about the Macbook GPU.

I downloaded gfx.io -it's helpful in understanding when the GPU is being used.
I've used cuda-smi; it's useful in showing the free GPU mem, but doesn't really show the processes using it. I was hoping an nvidia-smi kind of thing would exist for Macs.

When you said "I close all processes using the GPU (look at resource monitor video card column) to force OSX to use the integrated video card" which 'resource monitor video card' column do you refer to? In ActivityMonitor? If so, I didn't find it. :-(

Yeah, I try closing most of the programs that use GPUs (mostly Chrome etc. that I use) before running TF scripts. Sometimes, the TF scripts run out of mem almost immediately after a fresh reboot, which is kind of confounding.

I'm slowly coming to the realization that the TensorFlow + GPU combo isn't very effective/efficient on MacBooks. 😕

I'm rather sadly investigating a Theano combo (instead of TF) with Keras, which is my main high-level framework of choice. Sadly, because I don't know enough Theano and don't have enough bandwidth to learn it effectively. :-/

normanheckscher commented on April 28, 2024

Sorry @laventura, I meant the Activity Monitor on OSX. In the CPU or Memory tab, where you can see the running processes, select "View > Columns > Graphics Card" and a new "Requires High Perf GPU" column will appear. Sort by this column and you can see which processes are using the Nvidia card.

The MacBookPro can be used for learning and development. I want to use TensorFlow, and I find OSX a very good environment to work in, so I deal with these little irritations while I get myself up to speed with TensorFlow. When my models need more memory, I'll make the call between building a headless Linux box and going with a service such as AWS. If I were starting from scratch I'd consider a dedicated GPU notebook that could run Linux; however, I'm not flush with cash and I don't see the need to purchase a new hardware environment when the one I have works.

Best of luck to you.

bpanahij commented on April 28, 2024

This is not just a MacBook issue. I am seeing this on my laptop with a GTX 1060 (6 GB) running Ubuntu.

This seems to help:
keras-team/keras#3675

Use:

max_q_size=1,
pickle_safe=False

in fit_generator()

After adding these two options I am up and running again.
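
For context, a rough sketch of where those arguments go in the old Keras 1.x fit_generator signature (model and train_gen are placeholders; in Keras 2 these options were renamed to max_queue_size and use_multiprocessing):

model.fit_generator(train_gen,
                    samples_per_epoch=2048,  # example value
                    nb_epoch=10,             # example value
                    max_q_size=1,            # queue at most one batch ahead
                    pickle_safe=False)       # no multiprocessing for the generator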

jasgrewal commented on April 28, 2024

Stumbled onto this thread; perhaps my two cents can help. Launching python with a preceding THEANO_FLAGS='device=gpu0' or THEANO_FLAGS='device=gpu1' etc. (the latter if you have more than one GPU) helps. For example, this terminal command will run the Python code on gpu4 (you can use gpustat to track the usage of the different GPUs on your machine in real time):

THEANO_FLAGS='device=gpu4' python /run/this/script.py

If your convolutional filters are large, having smaller training batches can be one way to overcome the memory issue. That is, if the network initialization fits in memory first.

philipperemy commented on April 28, 2024

You don't have ENOUGH GPU MEMORY.

estathop commented on April 28, 2024

When the system is idle and not processing, shouldn't Python somehow not hold on to the whole GPU memory? That would be a useful feature.

soufianesabiri commented on April 28, 2024

In my case (a laptop), the command export CUDA_VISIBLE_DEVICES=1 made the training really slow, so I assume it used the integrated graphics card. I had to use the value 0 instead.

sunn-e commented on April 28, 2024

I'm having the same issue. It's a Windows machine. I've now reduced my RNN size and embedding size... let's see.

sunn-e commented on April 28, 2024

Not working.

iedmrc commented on April 28, 2024

Wow. Thanks. That seems to have worked. Not sure how it's related, but before trying your solution I got rid of the error by specifying
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True, gpu_options=gpu_options))

However when I tried to run training it crashed the jupyter notebook.

There it is! Thank you for the answer. It worked in my case!
There is also another similar solution:

config = tf.ConfigProto(gpu_options= tf.GPUOptions(allow_growth=True))
# allow_growth=True is the important part here
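
To take effect, the config then has to be passed when the session is created (TF 1.x), e.g.:

sess = tf.Session(config=config)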

Adesoji1 commented on April 28, 2024

WARNING:tensorflow:sample_weight modes were coerced from
...
to
['...']
WARNING:tensorflow:sample_weight modes were coerced from
...
to
['...']
WARNING:tensorflow:sample_weight modes were coerced from
...
to
['...']
WARNING:tensorflow:sample_weight modes were coerced from
...
to
['...']
Train for 11523 steps, validate for 4153 steps
Epoch 1/5
1/11523 [..............................] - ETA: 33:47:16

InternalError Traceback (most recent call last)
in
6 epochs=EPOCHS,
7 validation_data=validation_generator,
----> 8 validation_steps=validation_generator.samples//validation_generator.batch_size)

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\util\deprecation.py in new_func(*args, **kwargs)
322 'in a future version' if date is None else ('after %s' % date),
323 instructions)
--> 324 return func(*args, **kwargs)
325 return tf_decorator.make_decorator(
326 func, new_func, 'deprecated',

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\keras\engine\training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1304 use_multiprocessing=use_multiprocessing,
1305 shuffle=shuffle,
-> 1306 initial_epoch=initial_epoch)
1307
1308 @deprecation.deprecated(

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
817 max_queue_size=max_queue_size,
818 workers=workers,
--> 819 use_multiprocessing=use_multiprocessing)
820
821 def evaluate(self,

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
340 mode=ModeKeys.TRAIN,
341 training_context=training_context,
--> 342 total_epochs=epochs)
343 cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
344

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
126 step=step, mode=mode, size=current_batch_size) as batch_logs:
127 try:
--> 128 batch_outs = execution_function(iterator)
129 except (StopIteration, errors.OutOfRangeError):
130 # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py in execution_function(input_fn)
96 # numpy translates Tensors to values in Eager mode.
97 return nest.map_structure(_non_none_constant_value,
---> 98 distributed_function(input_fn))
99
100 return execution_function

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\eager\def_function.py in call(self, *args, **kwds)
566 xla_context.Exit()
567 else:
--> 568 result = self._call(*args, **kwds)
569
570 if tracing_count == self._get_tracing_count():

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\eager\def_function.py in _call(self, *args, **kwds)
597 # In this case we have created variables on the first call, so we run the
598 # defunned version which is guaranteed to never create variables.
--> 599 return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
600 elif self._stateful_fn is not None:
601 # Release the lock early so that multiple threads can perform the call

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\eager\function.py in call(self, *args, **kwargs)
2361 with self._lock:
2362 graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2363 return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
2364
2365 @Property

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\eager\function.py in _filtered_call(self, args, kwargs)
1609 if isinstance(t, (ops.Tensor,
1610 resource_variable_ops.BaseResourceVariable))),
-> 1611 self.captured_inputs)
1612
1613 def _call_flat(self, args, captured_inputs, cancellation_manager=None):

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1690 # No tape is watching; skip to running the function.
1691 return self._build_call_outputs(self._inference_function.call(
-> 1692 ctx, args, cancellation_manager=cancellation_manager))
1693 forward_backward = self._select_forward_and_backward_functions(
1694 args,

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\eager\function.py in call(self, ctx, args, cancellation_manager)
543 inputs=args,
544 attrs=("executor_type", executor_type, "config_proto", config),
--> 545 ctx=ctx)
546 else:
547 outputs = execute.execute_with_cancellation(

~\anaconda3\envs\ev_2\lib\site-packages\tensorflow_core\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
65 else:
66 message = e.message
---> 67 six.raise_from(core._status_to_exception(e.code, message), None)
68 except TypeError as e:
69 keras_symbolic_tensors = [

~\anaconda3\envs\ev_2\lib\site-packages\six.py in raise_from(value, from_value)

InternalError: Dst tensor is not initialized.
[[{{node IteratorGetNext/_2}}]] [Op:__inference_distributed_function_24557]

Function call stack:
distributed_function

Please, how do I resolve this on Windows OS?

NeurAlch commented on April 28, 2024

In case someone comes here from Google like me: I had a similar issue, and in my case restarting the Jupyter server and my IDE (IntelliJ) fixed it... I'm guessing a memory leak.

aleon1138 commented on April 28, 2024

There is another trick which worked for me - I delete any dangling IPython output that might be lying around:

%reset -f out

My guess is that the GPU can't release memory if there are any Python variables still somehow linked to it.

