huaweicloud / dls-example
Introduction to using the Deep Learning Service (DLS) of Huawei Cloud
Created a prediction job using the TFServing engine. The job shows as running, but the sample client code cannot connect.
Job type: prediction job
Job ID:
Engine type: TF
Run parameters:
Number of compute nodes:
Compute node specification:
Command: python dls-tfserving-client/python/predict.py --task_type="image_classification" --host=117.78.40.73 --port=30955 --data_path="/data/aicenter_eric/learn/tensorflow/test/0_pb1.b276.ps1521.ps1521-s.page2.c_20170308143640954568336910845.jpg" --labels_file_path="dls-tfserving-client/data/flowers/labels.txt" --model_name="resnet_v1_50"
Error: grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.UNAVAILABLE, details="Connect Failed")
The job log shows the following messages and then nothing more for a long time; it stays stuck on the last line below.
INFO:tensorflow:Find tfrecord files. Using tfrecord files in this job.
INFO:tensorflow:Automatically extracting num_samples from tfrecord. If the dataset is large, it may take some time. You can also manually specify the num_samples to Dataset to save time.
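Job type: preset model library, create training job
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification:
Preset model library -> ResNet_v1_50 -> Create training job -> select a dataset and submit the job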
train_url=s3://zzy/zzy/data/log/carped/
model_name=resnet_v1_50
checkpoint_url=s3://zzy/zzy/pretrained_model/resnet_v1_50/
2018-05-26 15:19:35.802270: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 2 of dimension 0 out of bounds.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, slice index 2 of dimension 0 out of bounds.
[[Node: strided_slice_6 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Shape_2, strided_slice_6/stack, strided_slice_6/stack_1, strided_slice_6/stack_2)]]
INFO:tensorflow:Saving checkpoints for 273 into s3://zzy/zzy/data/log/carped/model.ckpt.
Traceback (most recent call last):
File "resnet_v1_50_code/finetune_model.py", line 518, in <module>
main()
File "resnet_v1_50_code/finetune_model.py", line 514, in main
export_model=mox.ExportKeys.TF_SERVING)
File "moxing/tensorflow/executor/learning_builder.pyx", line 375, in moxing.tensorflow.executor.learning_builder.run
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 228, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 329, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._profile_train
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 483, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
File "moxing/tensorflow/executor/learning.pyx", line 661, in moxing.tensorflow.executor.learning.Learning.run
File "moxing/tensorflow/executor/learning.pyx", line 1156, in moxing.tensorflow.executor.learning.Learning.training
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
self._close_internal(exception_type)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
self._sess.close()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
self._sess.close()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
ignore_live_threads=True)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
target_list_as_strings, status, None)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 2 of dimension 0 out of bounds.
[[Node: strided_slice_6 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Shape_2, strided_slice_6/stack, strided_slice_6/stack_1, strided_slice_6/stack_2)]]
Followed the tutorial on the official site: https://support.huaweicloud.com/usermanual-dls/dls_01_0077.html
Training failed on ROMA.
Job type:
Job ID: d2f13d03-cd72-4d60-88cf-15b5e3fad120
Engine type: (TensorFlow or MXNet)
Run parameters: train_url=s3://final-model; model_name=resnet_v1_50; checkpoint_url=s3://final-model/resnet_v1_50
Number of compute nodes: 1
Compute node specification: 8 cores | 64 GiB | 1*P100
Traceback (most recent call last):
File "/usr/local/anaconda2/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/home/mind/ci-workspace/tf-model-v1.0.0-dailybuild/moxing/build/moxing/tensorflow/datasets/raw/async_raw_reader.py", line 168, in request_stop
File "/usr/local/anaconda2/lib/python2.7/multiprocessing/process.py", line 147, in join
assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
Error in sys.exitfunc:
Traceback (most recent call last):
File "/usr/local/anaconda2/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/home/mind/ci-workspace/tf-model-v1.0.0-dailybuild/moxing/build/moxing/tensorflow/datasets/raw/async_raw_reader.py", line 168, in request_stop
File "/usr/local/anaconda2/lib/python2.7/multiprocessing/process.py", line 147, in join
assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
Submitting a PyTorch-based training job fails with the following error:
RuntimeError: unable to write to file </torch_76_3625483894> at /pytorch/aten/src/TH/THAllocator.c:383
Job type: training job
Job ID:
Engine type: pytorch
Run parameters:
Number of compute nodes:
Compute node specification:
Traceback (most recent call last):
File "pytorch/main.py", line 205, in <module>
main()
File "pytorch/main.py", line 116, in main
train(train_loader, model, criterion, optimizer, epoch)
File "pytorch/main.py", line 136, in train
for i, (input, target) in enumerate(train_loader):
File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
return self._process_next_batch(batch)
File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in default_collate
return [default_collate(samples) for samples in transposed]
File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 113, in default_collate
storage = batch[0].storage()._new_shared(numel)
File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 114, in _new_shared
return cls._new_using_filename(size)
RuntimeError: unable to write to file </torch_76_3625483894> at /pytorch/aten/src/TH/THAllocator.c:383
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 35, in poll
pid, sts = os.waitpid(self.pid, flag)
File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 178, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 87) is killed by signal: Bus error.
Error when reading .npy/.npz files using moxing
Error log:
Traceback (most recent call last):
File "src/main_kaggle.py", line 12, in <module>
train_img, train_mask = read_train_data(mod)
File "/home/work/user-job-dir/src/data_util.py", line 18, in read_train_data
X_train = np.load(mox.file.read('s3://bucket-medical/ISLES/npy_data/train_'+MODULE_SELECTED+'_img.npy', binary=True))
File "/usr/local/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 404, in load
magic = fid.read(N)
AttributeError: 'bytes' object has no attribute 'read'
Job type:
Job ID:
Engine type: (TensorFlow)
Run parameters:
Number of compute nodes:
Compute node specification:
Source code:
X_train = np.load(mox.file.read('s3://bucket-medical/ISLES/npy_data/train_'+MODULE_SELECTED+'_img.npy', binary=True))
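A possible workaround for the report above, offered as a minimal sketch rather than a confirmed fix: the traceback shows np.load calling fid.read(N) on a bytes object, so wrapping the bytes in an in-memory buffer gives NumPy the file-like object it expects. The helper name load_npy_from_bytes is hypothetical, and the bytes are generated locally here as a stand-in for what mox.file.read(..., binary=True) returned in the report.

```python
import io

import numpy as np


def load_npy_from_bytes(raw):
    """Load an array from the raw bytes of a .npy file."""
    # np.load accepts a path or a file-like object, not bytes,
    # so the bytes are wrapped in an in-memory buffer first.
    return np.load(io.BytesIO(raw))


# Stand-in for the bytes returned by the OBS read in the report.
buffer = io.BytesIO()
np.save(buffer, np.arange(6).reshape(2, 3))
raw_bytes = buffer.getvalue()

array = load_npy_from_bytes(raw_bytes)
print(array.shape)
```

In the reported code this would correspond to passing the result of the mox.file.read call through io.BytesIO before handing it to np.load.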
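Calling num_gpus = mox.get_flag('num_gpus') raises an error.
(Briefly describe the problem; if it is a bug, describe the steps to reproduce it.)
Job type:
Job ID:
Engine type: (TensorFlow)
Run parameters:
Number of compute nodes: 1
Compute node specification: 16 cores | 128 GiB | 2*P100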
WARNING:root:Use of the keyword argument names (flag_name, default_value, docstring) is deprecated, please use (name, default, help) instead.
WARNING:root:Use of the keyword argument names (flag_name, default_value, docstring) is deprecated, please use (name, default, help) instead.
Traceback (most recent call last):
File "bert/run_classifier_new.py", line 760, in <module>
main()
File "bert/run_classifier_new.py", line 608, in main
if not flags.do_train and not flags.do_eval and not flags.do_predict:
File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/flags.py", line 84, in __getattr__
wrapped(_sys.argv)
File "/home/work/anaconda3/lib/python3.6/site-packages/absl/flags/_flagvalues.py", line 630, in __call__
name, value, suggestions=suggestions)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'num_gpus'
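When saving the trained model, the TensorBoard data files are saved but the ModelCheckpoint data files are not. The reference post https://bbs.huaweicloud.com/forum/thread-11660-1-1.html also mentions
"because cp_callback and tb_callback cannot write directly, ...". Could someone explain?
Job type:
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes: 1
Compute node specification: 8 cores | 64 GiB | 1*P100 | 750 GB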
--
(Your Log Here)
![image](https://user-images.githubusercontent.com/32593000/55632541-1e603d00-57ed-11e9-9c9c-392c11ddfea2.png)
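The program started normally and trained a ResNet50 model (model file of about 300 MB), but after several epochs the following message suddenly appeared (see log) and the job failed. The reported cause is "Unable to connect to endpoint", possibly due to an unstable OBS connection.
(Briefly describe the problem; if it is a bug, describe the steps to reproduce it.)
Job type:
Job ID: resnet-42586680-10
Engine type: (TensorFlow)
Run parameters: none
Number of compute nodes: 1
Compute node specification: single node, 8 GPUs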
Caused by op u'ModelSaver/save/SaveV2', defined at:
File "resnet_cloud_multi/imagenet_resnet_cloud.py", line 218, in <module>
launch_train_with_config(config, trainer)
...
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): : Unable to connect to endpoint
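The training dataset exists on OBS and data_url was set to the OBS path when creating the training job, but training fails with: No such file or directory
Job type: training job
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification:
Relevant code: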
if not gfile.Exists(image_dir):
    print("Image directory '" + image_dir + "' not found.")
    return None
result = {}
sub_dirs = [x[0] for x in os.walk(image_dir)]
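Submitted a training job on the public cloud (for example, the iceberg recognition project). While reading tfrecord data from OBS during training, an error occurs; the error message is roughly as follows: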
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, truncated record at 9832454
[[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
Job type: training job
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification:
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, truncated record at 9832454
[[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
2018-08-30 13:29:29.379592: W tensorflow/core/framework/op_kernel.cc:1192] Internal: : Unable to connect to endpoint
Traceback (most recent call last):
File "code/train_iceberg.py", line 164, in <module>
save_model_secs=120)
File "moxing/tensorflow/executor/learning_builder.pyx", line 379, in moxing.tensorflow.executor.learning_builder.run
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 228, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 287, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._profile_train
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 484, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
File "moxing/tensorflow/executor/learning.pyx", line 649, in moxing.tensorflow.executor.learning.Learning.run
File "moxing/tensorflow/executor/learning.pyx", line 1117, in moxing.tensorflow.executor.learning.Learning.training
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
self._close_internal(exception_type)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
self._sess.close()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
self._sess.close()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
ignore_live_threads=True)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
target_list_as_strings, status, None)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 9832454
[[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
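Since the log above shows both "truncated record" and "Unable to connect to endpoint", it may help to tell a damaged file apart from an interrupted read. The sketch below is only an illustration: it walks the TFRecord framing (little-endian uint64 length, uint32 CRC of the length, payload, uint32 CRC of the payload) on a local copy and reports where a record is cut short. It deliberately skips CRC verification, and scan_tfrecord_framing and frame are hypothetical helper names.

```python
import io
import struct


def scan_tfrecord_framing(stream):
    """Walk TFRecord framing; return (record_count, offset_of_truncated_record_or_None)."""
    count = 0
    while True:
        start = stream.tell()
        header = stream.read(12)  # uint64 length + uint32 CRC of the length
        if not header:
            return count, None  # clean end of file
        if len(header) < 12:
            return count, start
        (length,) = struct.unpack("<Q", header[:8])
        body = stream.read(length + 4)  # payload + uint32 CRC of the payload
        if len(body) < length + 4:
            return count, start
        count += 1


def frame(payload):
    """Frame a payload like a TFRecord, with zeroed CRC fields (not checked here)."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


good = frame(b"first") + frame(b"second")
print(scan_tfrecord_framing(io.BytesIO(good)))
print(scan_tfrecord_framing(io.BytesIO(good[:-3])))
```

If a local copy of the file scans cleanly, the truncation seen during training is more likely to come from the connection to OBS than from the data itself.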
This is the training dataset.
Job type: training job
Job ID:
Engine type: TensorFlow
Run parameters: train_url=s3://donotdel-dls/models/
model_name=resnet_v1_50
checkpoint_url=s3://donotdel-dls/models/resnet_v1_50/
batch_size=2
Number of compute nodes: 1
Compute node specification: 8 cores | 32 GiB
--
INFO:tensorflow:Total 3301 files found in s3://donotdel-dls-test/models/data/
INFO:tensorflow:Total number of samples is 3301
INFO:tensorflow:Labels index to name:
INFO:tensorflow:0: daisy
INFO:tensorflow:1: dandelion
INFO:tensorflow:2: roses
INFO:tensorflow:3: sunflowers
INFO:tensorflow:4: tulips
Traceback (most recent call last):
File "resnet_v1_50_code/finetune_model.py", line 501, in <module>
main()
File "resnet_v1_50_code/finetune_model.py", line 262, in main
raise ValueError("learning_rate_strategy should be like: 'e_0:l_0,e_1:l_1,...,e_n:l_n. "
ValueError: learning_rate_strategy should be like: 'e_0:l_0,e_1:l_1,...,e_n:l_n. e_i:l_i represents that from epoch e_i-1 to e_i use learning rate l_i. And training will stop at epoch e_n'
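To make the expected format concrete, here is a minimal sketch of how a value such as 10:0.01,20:0.001 could be parsed and looked up. It is not the code in finetune_model.py; the function names are hypothetical, and treating epochs as 0-based with each e_i as an exclusive upper bound is an assumption drawn from the wording of the error message.

```python
def parse_learning_rate_strategy(text):
    """Parse 'e_0:l_0,e_1:l_1,...' into a list of (end_epoch, learning_rate)."""
    schedule = []
    for item in text.split(","):
        epoch_text, sep, rate_text = item.strip().partition(":")
        if not sep:
            raise ValueError("expected 'epoch:learning_rate', got %r" % item)
        schedule.append((int(epoch_text), float(rate_text)))
    epochs = [epoch for epoch, _ in schedule]
    if epochs != sorted(set(epochs)):
        raise ValueError("epochs must be strictly increasing: %r" % epochs)
    return schedule


def learning_rate_at(schedule, epoch):
    """Return the rate for a 0-based epoch, or None once training has ended."""
    for end_epoch, rate in schedule:
        if epoch < end_epoch:
            return rate
    return None


schedule = parse_learning_rate_strategy("10:0.01,20:0.001")
print(schedule)
print(learning_rate_at(schedule, 12))
```

Under this reading, 10:0.01,20:0.001 means a rate of 0.01 for the first 10 epochs, 0.001 for the next 10, and a stop at epoch 20.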
Training job log shows the error: error: unrecognized arguments: --data_url=s3://my_bucket/data
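The message format matches Python's argparse, which exits when it receives an option the script never declared. A minimal sketch of two ways around that, assuming the training script uses argparse: declare the options the platform passes, and use parse_known_args so that any other extra option is returned instead of aborting. The function name parse_job_args and the --num_gpus value are only illustrative.

```python
import argparse


def parse_job_args(argv):
    """Parse job arguments, tolerating options the script does not declare."""
    parser = argparse.ArgumentParser()
    # Declare the options the platform passes so argparse accepts them.
    parser.add_argument("--data_url", type=str, default=None)
    parser.add_argument("--train_url", type=str, default=None)
    # parse_known_args returns unknown options instead of exiting with an error.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_job_args(["--data_url=s3://my_bucket/data", "--num_gpus=1"])
print(args.data_url)
print(unknown)
```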
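Hello, when training a model with TensorFlow, how should the label file and the image files be read?
When starting a prediction job, a message similar to the following appears: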
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:268] No versions of servable resnet_v1_50 found under base path s3://dls-test/log/resnet_v1_50/1/
Job type: prediction job
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification:
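One common cause of the "No versions of servable" message above is that the base path handed to TensorFlow Serving is the version directory itself (here it ends in resnet_v1_50/1/) rather than its parent, since TensorFlow Serving looks for numbered version sub-directories under the base path. The sketch below only illustrates that layout on a local temporary directory; find_model_versions is a hypothetical helper, not part of DLS or TensorFlow Serving.

```python
import os
import tempfile


def find_model_versions(base_path):
    """Return the numeric version sub-directories found directly under base_path."""
    versions = []
    for name in os.listdir(base_path):
        if name.isdigit() and os.path.isdir(os.path.join(base_path, name)):
            versions.append(int(name))
    return sorted(versions)


# Build a throwaway layout: <root>/resnet_v1_50/1/saved_model.pb
root = tempfile.mkdtemp()
model_dir = os.path.join(root, "resnet_v1_50")
version_dir = os.path.join(model_dir, "1")
os.makedirs(version_dir)
open(os.path.join(version_dir, "saved_model.pb"), "wb").close()

print(find_model_versions(model_dir))    # the model directory holds the version folder
print(find_model_versions(version_dir))  # the version folder itself holds none
```

If this is what happened in the report, pointing the job at s3://dls-test/log/resnet_v1_50/ instead of the 1/ sub-directory would be the thing to try.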
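(Briefly describe the problem; if it is a bug, describe the steps to reproduce it.)
How do I set the run parameters when submitting a custom network model to DLS?
Job type:
Job ID:
Engine type: (TensorFlow or MXNet) TensorFlow
Run parameters:
Number of compute nodes: 8 cores
Compute node specification: 60G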
(Your Log Here)
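The model is a Transformer (paper: Attention Is All You Need).
With a model size of 1024 units (about 382x4 MB of parameters), the error below appears.
With a model size of 512 units (about 148x4 MB of parameters), training runs normally.
Setting the environment variables S3_REQUEST_TIMEOUT_MSEC and S3_REQUEST_TIMEOUT to larger values, such as 6000000, did not help; the error still occurs with the large model.
Training uses Adam, so the amount of data written when saving a checkpoint is even larger.
With SGD the problem does not occur.
Job type:
Job ID:
Engine type: TensorFlow 1.4
Run parameters:
Number of compute nodes: 1
Compute node specification: 8xP100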
INFO:tensorflow:Create CheckpointSaverHook.
2018-06-30 15:52:46.447723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:2d:00.0, compute capability: 6.0)
INFO:tensorflow:Saving checkpoints for 1 into s3://mt-models/transformer/model_big/model.ckpt.
2018-06-30 15:55:35.289438: W tensorflow/core/framework/op_kernel.cc:1192] Internal: : Unable to connect to endpoint
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: : Unable to connect to endpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 616, in <module>
tf.app.run()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 611, in main
schedule=schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 219, in run
return _execute_schedule(experiment, schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 47, in _execute_schedule
return task()
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 641, in train_and_evaluate
self.train(delay_secs=0)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 378, in train
hooks=self._train_monitors + extra_hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 823, in _call_train
hooks=hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 303, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 786, in _train_model
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 521, in run
run_metadata=run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 892, in run
run_metadata=run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 967, in run
raise six.reraise(*original_exc_info)
File "/usr/local/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 952, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1032, in run
run_metadata=run_metadata))
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 452, in after_run
self._save(run_context.session, global_step)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 468, in _save
self._get_saver().save(session, self._save_path, global_step=step)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1573, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: : Unable to connect to endpoint
Caused by op 'save/SaveV2', defined at:
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 616, in <module>
tf.app.run()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 611, in main
schedule=schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 219, in run
return _execute_schedule(experiment, schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 47, in _execute_schedule
return task()
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 641, in train_and_evaluate
self.train(delay_secs=0)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 378, in train
hooks=self._train_monitors + extra_hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 823, in _call_train
hooks=hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 303, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 783, in _train_model
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 368, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 673, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 493, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 851, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 856, in _create_session
return self._sess_creator.create_session()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 554, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 419, in create_session
self._scaffold.finalize()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
self._saver.build()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1227, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 742, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 381, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 355, in _AddShardedSaveOpsForV2
sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 296, in _AddSaveOps
save = self.save_op(filename_tensor, saveables)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 239, in save_op
tensors)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1163, in save_v2
shape_and_slices=shape_and_slices, tensors=tensors, name=name)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): : Unable to connect to endpoint
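Job type:
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification:
(Your Log Here)
(Briefly describe the problem; if it is a bug, describe the steps to reproduce it.)
Job type: training
Job ID: prejob-92946102
Engine type: TensorFlow
Run parameters:
Number of compute nodes: 1
Compute node specification: 8 cores | 64 GiB | 1*P100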
INFO:tensorflow:RGB checkpoint restored
Traceback (most recent call last):
File "kinetics-i3d-master/evaluate_sample.py", line 146, in <module>
tf.app.run(main)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "kinetics-i3d-master/evaluate_sample.py", line 113, in main
rgb_sample = np.load(_SAMPLE_PATHS['rgb'])
File "/usr/local/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 372, in load
fid = open(file, "rb")
IOError: [Errno 2] No such file or directory: 's3://bucket-3216/train_data/data/v_CricketShot_g04_c01_rgb.npy'
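The traceback above ends in Python's built-in open, which only understands local paths, so an s3:// path is reported as missing even when the object exists on OBS. One way to handle this, sketched under assumptions, is to copy the object to local disk first and pass the local path to np.load. The helper stage_to_local and the fetch_bytes callable are hypothetical; in a DLS job, fetch_bytes could wrap a call such as mox.file.read(path, binary=True), which appears in another report in this list.

```python
import os
import tempfile


def stage_to_local(remote_path, fetch_bytes, local_dir=None):
    """Copy a remote object to a local file and return the local path.

    fetch_bytes is a caller-supplied function (hypothetical here) that takes
    the remote path and returns the object's content as bytes.
    """
    local_dir = local_dir or tempfile.mkdtemp()
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    with open(local_path, "wb") as handle:
        handle.write(fetch_bytes(remote_path))
    return local_path


# Fake storage standing in for OBS so the sketch runs anywhere.
fake_store = {"s3://bucket/data/sample.npy": b"example bytes"}
local_file = stage_to_local("s3://bucket/data/sample.npy", fake_store.__getitem__)
with open(local_file, "rb") as handle:
    print(handle.read() == fake_store["s3://bucket/data/sample.npy"])
```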
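Started a training job based on TensorFlow 1.8, using the tf.gfile module in the code to connect to OBS.
(Basic environment variables such as the AK/SK are already set in DLS.)
After the job starts, the following log messages are printed very frequently.
Job type: training job
Job ID:
Engine type: (TensorFlow or MXNet): TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification: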
2018-07-17 11:29:04.024753: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:04.025943: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:04.026045: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.556230: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.556383: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.558030: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.558174: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.582349: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.582455: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.583321: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.583429: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.585176: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.585264: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.586086: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.586174: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.589516: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.589625: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.590120: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.590219: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:07.076022: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:07.076192: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:07.078109: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
IOError: [Errno 2] No such file or directory: 's3://bucket-3216/train_data/data/label_map.txt'
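(Briefly describe the problem; if it is a bug, describe the steps to reproduce it.)
Job type:
Job ID:
Engine type: (TensorFlow or MXNet)
Run parameters:
Number of compute nodes:
Compute node specification: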
Traceback (most recent call last):
File "CnnModel/DCNN.py", line 152, in <module>
model = cnn_model()
File "CnnModel/DCNN.py", line 66, in cnn_model
callbacks=[check],
File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1042, in fit
validation_steps=validation_steps)
File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 219, in fit_loop
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 452, in on_epoch_end
filepath = self.filepath.format(epoch=epoch + 1, **logs)
AttributeError: 'File' object has no attribute 'format'
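The last frame of the traceback above shows the checkpoint callback calling filepath.format(epoch=epoch + 1, **logs), so the filepath it was given needs to be a string template rather than an open file object. The sketch below mirrors only that one call with the standard library; checkpoint_path is a hypothetical helper, and the template name is illustrative.

```python
import os
import tempfile


def checkpoint_path(filepath_template, epoch, logs):
    """Mirror the call in the traceback: filepath.format(epoch=epoch + 1, **logs)."""
    return filepath_template.format(epoch=epoch + 1, **logs)


local_dir = tempfile.mkdtemp()
template = os.path.join(local_dir, "weights.{epoch:02d}-{val_loss:.2f}.hdf5")

# A string template formats without trouble.
print(os.path.basename(checkpoint_path(template, epoch=0, logs={"val_loss": 0.5})))

# An open file object has no .format, which reproduces the reported error type.
with open(os.path.join(local_dir, "model.hdf5"), "wb") as handle:
    try:
        checkpoint_path(handle, epoch=0, logs={"val_loss": 0.5})
    except AttributeError as error:
        print(type(error).__name__)
```

A plausible approach, not confirmed by the report, is to give the callback a local string path and copy the resulting files to OBS after they are written.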
The code is structured as follows:
project_dir
  |- main.py
  |- module_dir
    |- module_file.py
main.py contains the code
from module_dir import module_file
and the following error occurs:
Traceback (most recent call last):
File "project_dir/main.py", line 1, in <module>
from module_dir import module_file
ImportError: No module named module_dir
Job type: training job
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification:
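Two usual causes of this ImportError are worth checking; neither is confirmed for this report. Under Python 2, a directory is only importable as a package if it contains an __init__.py file, and in any version the directory that holds module_dir must be on sys.path. The sketch below rebuilds the layout in a temporary directory and shows both steps; in main.py the inserted path would be os.path.dirname(os.path.abspath(__file__)).

```python
import importlib
import os
import sys
import tempfile

# Recreate the layout from the report in a temporary directory.
project_dir = tempfile.mkdtemp()
module_dir = os.path.join(project_dir, "module_dir")
os.makedirs(module_dir)
# Python 2 only treats a directory as a package if it contains __init__.py.
open(os.path.join(module_dir, "__init__.py"), "w").close()
with open(os.path.join(module_dir, "module_file.py"), "w") as handle:
    handle.write("VALUE = 42\n")

# Make sure the directory that contains module_dir is searched for imports.
if project_dir not in sys.path:
    sys.path.insert(0, project_dir)
importlib.invalidate_caches()

module_file = importlib.import_module("module_dir.module_file")
print(module_file.VALUE)
```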
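Training job log shows the error: AttributeError: 'module' object has no attribute '_FlagValues'
DLS service -> preset model library -> create training job -> selecting one of my own datasets and training produces an error.
Job ID:
Engine type: TensorFlow
Run parameters: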
train_url=s3://cat-body-six-classes/model.resnet_v1_50/
batch_size=32
learning_rate_strategy=10:0.01,20:0.001
file_pattern=flowers_*
max_epoches=20
image_size=224
num_classes=6
samples_per_epoch=589
checkpoint_exclude_patterns=logits.global_step
Number of compute nodes: 1
Compute node specification: 1*P100
Screenshot:
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, input must be 4-dimensional[1,1,300,300,3]
[[Node: ResizeBilinear = ResizeBilinear[T=DT_UINT8, align_corners=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims, ResizeBilinear/size)]]
INFO:tensorflow:Saving checkpoints for 2 into s3://cat-body-six-classes/log/model.ckpt.
Traceback (most recent call last):
File "resnet_v1_50_code/finetune_model.py", line 519, in <module>
main()
File "resnet_v1_50_code/finetune_model.py", line 515, in main
export_model=mox.ExportKeys.TF_SERVING)
File "moxing/tensorflow/executor/learning_builder.pyx", line 375, in moxing.tensorflow.executor.learning_builder.run
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 237, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 491, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
File "moxing/tensorflow/executor/learning.pyx", line 653, in moxing.tensorflow.executor.learning.Learning.run
File "moxing/tensorflow/executor/learning.pyx", line 1124, in moxing.tensorflow.executor.learning.Learning.training
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
self._close_internal(exception_type)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
self._sess.close()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
self._sess.close()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
ignore_live_threads=True)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
target_list_as_strings, status, None)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: input must be 4-dimensional[1,1,300,300,3]
[[Node: ResizeBilinear = ResizeBilinear[T=DT_UINT8, align_corners=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims, ResizeBilinear/size)]]
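The reported shape `[1,1,300,300,3]` has five dimensions, which suggests the tensor was already 4-D before `ExpandDims` added a batch dimension. One possible cause, an assumption that the thread does not confirm, is an image that decodes with a leading frame dimension, as GIF data does. The sketch below scans a dataset directory for files whose content is GIF regardless of extension; the helper names and demo files are hypothetical:

```python
import os
import tempfile

# Both GIF versions start with one of these six-byte signatures.
GIF_MAGIC = (b"GIF87a", b"GIF89a")


def is_gif(path):
    """Return True if the file starts with a GIF signature, whatever its extension."""
    with open(path, "rb") as f:
        return f.read(6) in GIF_MAGIC


def find_gifs(root):
    """List files under root whose content is GIF."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if is_gif(path):
                hits.append(path)
    return hits


# Demo with two fake files in a temporary directory.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "cat_001.jpg"), "wb") as f:
    f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 16)  # starts like a JPEG
with open(os.path.join(demo, "cat_002.jpg"), "wb") as f:
    f.write(b"GIF89a" + b"\x00" * 16)  # GIF content behind a .jpg name

suspects = find_gifs(demo)
print([os.path.basename(p) for p in suspects])
```

Any file this flags could be converted or removed before the tfrecord files are regenerated.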
After the job is submitted, the code does not run and the print output does not appear.
Code entry point:
if __name__ == '__main__':
    print('Configuring CNN model...')
    print('train level : ' + str(args.level))
method=train; level=2; embedding_dim=300; type=new; l2beta=0.001; hidden_dim=1000; max_contract_length=20000; print_per_batch=10; max_train=30000; model=rcnn; risk=Payment Collection; num_filters=256; learning_rate=0.001
2018-07-02 15:23:35.936210: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-02 15:23:36.558640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:2d:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:36.896654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:31:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.245999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:35:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.607229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:39:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.989295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 4 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:a9:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:38.372123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 5 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:ad:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:38.768131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 6 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b1:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:39.173232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 7 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b5:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:39.189643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-07-02 15:23:39.189973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3 4 5 6 7
2018-07-02 15:23:39.189989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y Y Y Y N N N N
2018-07-02 15:23:39.189998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: Y Y Y Y N N N N
2018-07-02 15:23:39.190005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2: Y Y Y Y N N N N
2018-07-02 15:23:39.190014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3: Y Y Y Y N N N N
2018-07-02 15:23:39.190021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 4: N N N N Y Y Y Y
2018-07-02 15:23:39.190027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 5: N N N N Y Y Y Y
2018-07-02 15:23:39.190036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 6: N N N N Y Y Y Y
2018-07-02 15:23:39.190042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 7: N N N N Y Y Y Y
2018-07-02 15:23:39.190068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:2d:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:31:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:35:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:39:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla P100-PCIE-16GB, pci bus id: 0000:a9:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla P100-PCIE-16GB, pci bus id: 0000:ad:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla P100-PCIE-16GB, pci bus id: 0000:b1:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla P100-PCIE-16GB, pci bus id: 0000:b5:00.0, compute capability: 6.0)
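After submitting a training job that continuously writes data to TensorBoard, the following error is reported before the data reaches 5 GB: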
2018-08-17 13:06:50.929457: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-08-17 13:06:50.929633: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-08-17 13:06:50.936199: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError
EntityTooLarge
Your proposed upload exceeds the maximum allowed object size.:
2018-08-17 13:06:50.936241: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
Job type: Training job
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification:
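The `EntityTooLarge` message says the upload exceeded the store's maximum object size. The thread gives neither the actual limit nor a fix. As a generic illustration only, the sketch below plans how a large payload could be split into byte ranges that each stay under a cap; `plan_parts` and the limit value are hypothetical, and the real limit should be taken from the object store's documentation:

```python
def plan_parts(total_size, max_part_size):
    """Split total_size bytes into (offset, length) ranges no larger than max_part_size."""
    if total_size < 0 or max_part_size <= 0:
        raise ValueError("total_size must be >= 0 and max_part_size must be > 0")
    parts = []
    offset = 0
    while offset < total_size:
        length = min(max_part_size, total_size - offset)
        parts.append((offset, length))
        offset += length
    return parts


# Hypothetical cap for illustration; not the documented limit of any service.
HYPOTHETICAL_LIMIT = 5 * 1024 ** 3

parts = plan_parts(12 * 1024 ** 3, HYPOTHETICAL_LIMIT)
print(len(parts), parts[-1])
```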
When a training job is started, it finishes very quickly, and the console prints nothing related to loss or accuracy.
The log output is as follows:
INFO:tensorflow:Restoring parameters from s3://bucket_name/log/model.ckpt-xxx
INFO:tensorflow:Saving checkpoints for xxx into s3://bucket_name/log
Job type: Training job
Job ID:
Engine type: TensorFlow
Run parameters:
Number of compute nodes:
Compute node specification:
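The log shows the job restoring an existing checkpoint and then saving again straight away. One possible explanation, an assumption the thread does not confirm, is that the restored global step already meets the configured stopping step, so no training steps remain. The sketch below reads the step from a `model.ckpt-<step>` path and computes how many steps are left; the helper names and `max_steps` values are hypothetical:

```python
import re


def checkpoint_step(path):
    """Return the trailing step number of a 'model.ckpt-<step>' path, or None."""
    match = re.search(r"\.ckpt-(\d+)$", path)
    return int(match.group(1)) if match else None


def steps_remaining(checkpoint_path, max_steps):
    """Number of training steps left after resuming from checkpoint_path."""
    step = checkpoint_step(checkpoint_path) or 0
    return max(0, max_steps - step)


# If this prints 0, a resumed job would have nothing left to train.
print(steps_remaining("s3://bucket_name/log/model.ckpt-1000", 1000))
```

If that is the cause, pointing the job at an empty output directory or raising the stopping step would let training proceed.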
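Creating a new Python notebook in the development environment shows: Unexpected error while saving file: Untitled.ipynb unable to open database file
(Briefly describe the problem; if it is a bug, describe the steps to reproduce.)
Resolved. Errors in the user's own code during debugging generated core dumps, which used up the disk space. When debugging again, the user can delete the core dumps generated under /home/work, after which the problem no longer occurs.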
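The cleanup described above can be sketched as a small script that finds core dump files under a directory and optionally deletes them. The file-name pattern (`core` or `core.<pid>`) and the helper names are assumptions, since core dump naming depends on system configuration. The demo runs against a temporary directory rather than /home/work:

```python
import os
import re
import tempfile

# Assumed naming convention: "core" or "core.<pid>". Adjust for the actual system.
CORE_NAME = re.compile(r"core(\.\d+)?")


def find_core_dumps(root):
    """Return (path, size_in_bytes) for each file under root that looks like a core dump."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if CORE_NAME.fullmatch(name):
                path = os.path.join(dirpath, name)
                hits.append((path, os.path.getsize(path)))
    return hits


def delete_core_dumps(root, dry_run=True):
    """Delete matching files unless dry_run is set; return the total bytes involved."""
    freed = 0
    for path, size in find_core_dumps(root):
        if not dry_run:
            os.remove(path)
        freed += size
    return freed


# Demo in a temporary directory standing in for /home/work.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "core.1234"), "wb") as f:
    f.write(b"\x00" * 1024)
with open(os.path.join(demo, "train.py"), "w") as f:
    f.write("print('hello')\n")

would_free = delete_core_dumps(demo, dry_run=True)
freed = delete_core_dumps(demo, dry_run=False)
print(would_free, freed)
```

Running with `dry_run=True` first shows how much space would be recovered before anything is removed.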