Hi, when I run the code on my server (4x V100, CUDA 9.0, cuDNN 7.0), I get the error below.
Could you please help me?
Which versions of CUDA and cuDNN do you use?
`/home/admin/algomodule/test/kaggle-web-traffic# python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500
WARNING:tensorflow:From /home/admin/algomodule/test/kaggle-web-traffic/model.py:144: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-10-02 06:00:37.510047: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-10-02 06:00:37.909980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:37.911006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:08.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.047527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.048568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:09.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.179680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.180730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0a.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.319747: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.320794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0b.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.320867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-10-02 06:00:40.205535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-02 06:00:40.205600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-10-02 06:00:40.205610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y Y
2019-10-02 06:00:40.205616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y Y
2019-10-02 06:00:40.205631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N Y
2019-10-02 06:00:40.205641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: Y Y Y N
2019-10-02 06:00:40.205992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14941 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
2019-10-02 06:00:40.508989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14941 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
2019-10-02 06:00:40.811745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14941 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0a.0, compute capability: 7.0)
2019-10-02 06:00:41.114312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14941 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0b.0, compute capability: 7.0)
1: 0%| | 0/566 [00:00<?, ?it/s]
2019-10-02 06:00:47.758076: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.770054: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.782300: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "trainer.py", line 786, in <module>
train(**param_dict)
File "trainer.py", line 599, in train
step = trainer.train_step(sess, epoch)
File "trainer.py", line 251, in train_step
results = self._metric_step(Stage.TRAIN, ops, sess, epoch, summary_every=20)
File "trainer.py", line 235, in _metric_step
results = sess.run(ops)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Caused by op 'm_0/cudnn_gru/CudnnRNN', defined at:
File "trainer.py", line 786, in <module>
train(**param_dict)
File "trainer.py", line 520, in train
all_models.append(create_model(scope, i, prefix=prefix, seed=seed + i))
File "trainer.py", line 474, in create_model
train_model = Model(pipe, hparams, is_train=True, graph_prefix=prefix, asgd_decay=asgd_decay, seed=seed)
File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 342, in __init__
transpose_output=False)
File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 65, in make_encoder
rnn_out, (rnn_state,) = cuda_model(inputs=rnn_time_input)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 412, in call
training)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 487, in _forward
seed=self._seed)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 922, in _cudnn_rnn
outputs, output_h, output_c, _ = gen_cudnn_rnn_ops.cudnn_rnn(**args)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 115, in cudnn_rnn
is_training=is_training, name=name)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]`
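
To make it easier to compare setups, here is a minimal sketch of how the CUDA and cuDNN versions could be read in a comparable form. It assumes the CUDA release is taken from the text printed by `nvcc --version` and the cuDNN version from the `CUDNN_MAJOR` / `CUDNN_MINOR` / `CUDNN_PATCHLEVEL` defines in `cudnn.h`. The helper names and the sample strings are made up for illustration and are not taken from my machine.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the CUDA release (e.g. '9.0') found in `nvcc --version` text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def parse_cudnn_header(header_text):
    """Return 'major.minor.patch' from the CUDNN_* defines in cudnn.h, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 7".
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % name, header_text, re.MULTILINE)
        if match is None:
            return None  # header is incomplete or has an unexpected layout
        parts.append(match.group(1))
    return ".".join(parts)


if __name__ == "__main__":
    # Illustrative inputs only; on a real machine these would come from
    # running `nvcc --version` and reading the installed cudnn.h.
    sample_nvcc = "Cuda compilation tools, release 9.0, V9.0.176"
    sample_header = (
        "#define CUDNN_MAJOR 7\n"
        "#define CUDNN_MINOR 0\n"
        "#define CUDNN_PATCHLEVEL 5\n"
    )
    print("CUDA release:", parse_nvcc_release(sample_nvcc))
    print("cuDNN version:", parse_cudnn_header(sample_header))
```

On a real system the header usually sits under the CUDA include directory (for example `/usr/local/cuda/include/cudnn.h`, though that path is an assumption and differs between installs). The resulting versions can then be checked against the CUDA/cuDNN versions the installed TensorFlow build was compiled for.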