dls-example's Issues

Client code for a TFServing prediction job fails with "Connect Failed"

Basic information

  • Python version: 2.7
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

I created a prediction job that uses the TFServing engine. The job shows as running, but the sample client code cannot connect to it.

Job information

  • Job type: prediction job

  • Job ID:

  • Engine type: TF

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs

Command: python dls-tfserving-client/python/predict.py --task_type="image_classification" --host=117.78.40.73 --port=30955 --data_path="/data/aicenter_eric/learn/tensorflow/test/0_pb1.b276.ps1521.ps1521-s.page2.c_20170308143640954568336910845.jpg" --labels_file_path="dls-tfserving-client/data/flowers/labels.txt" --model_name="resnet_v1_50"
Error: grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.UNAVAILABLE, details="Connect Failed")
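
A gRPC status of UNAVAILABLE with "Connect Failed" usually means the client never reached the server at all. Before debugging the prediction client, it can help to check plain TCP reachability of the host and port. The helper below is a minimal sketch using only the standard library; the name can_connect is made up for this illustration and is not part of the sample client.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example call with the values from the report:
#   can_connect("117.78.40.73", 30955)
```

If this returns False from the machine that runs predict.py, the problem is likely in networking (address, port mapping, firewall) rather than in the client code or the model.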

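The preset model library reads TFRecord datasets very slowly

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / Steps to reproduce

The job log shows the messages below and then nothing happens for a long time. The log stays stuck on these lines.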

INFO:tensorflow:Find tfrecord files. Using tfrecord files in this job.
INFO:tensorflow:Automatically extracting num_samples from tfrecord. If the dataset is large, it may take some time. You can also manually specify the num_samples to Dataset to save time.

Job information

  • Job type: preset model library, create training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:
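
The log message suggests supplying num_samples manually instead of letting the job scan the dataset. One way to get that number once, offline, is to walk the TFRecord framing without parsing the payloads. The sketch below assumes an uncompressed TFRecord file on a local disk and does not verify checksums; count_tfrecords is a hypothetical helper, and how num_samples is then passed to the preset model is not shown in this report.

```python
import os
import struct


def count_tfrecords(path):
    """Count records in an uncompressed TFRecord file.

    Each record is framed as: uint64 payload length, uint32 CRC of the
    length, the payload bytes, uint32 CRC of the payload (little-endian).
    The CRC fields are skipped, not verified.
    """
    size = os.path.getsize(path)
    pos = 0
    count = 0
    with open(path, "rb") as f:
        while pos < size:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("truncated length header at %d" % pos)
            (length,) = struct.unpack("<Q", header)
            pos += 8 + 4 + length + 4
            if pos > size:
                raise ValueError("truncated record, file ends at %d" % size)
            f.seek(pos)  # jump to the start of the next record
            count += 1
    return count
```

Because only the 8-byte length header of each record is read, this is much cheaper than decoding every example.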

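Training job error: InvalidArgumentError: slice index 2 of dimension 0 out of bounds.

Basic information

  • Python version: 2.7
  • MoXing version: 1.0.6
  • Browser: Chrome

Problem description / Steps to reproduce

Preset model library -> ResNet_v1_50 -> Create training job -> select a dataset and submit the job

Job information

  • Job type: training job
  • Job ID:
  • Engine type: TensorFlow
  • Run parameters: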

train_url=s3://zzy/zzy/data/log/carped/
model_name=resnet_v1_50
checkpoint_url=s3://zzy/zzy/pretrained_model/resnet_v1_50/

  • Number of compute nodes: 1
  • Compute node specification: 1*P100

Related source code / Output logs

2018-05-26 15:19:35.802270: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 2 of dimension 0 out of bounds.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, slice index 2 of dimension 0 out of bounds.
	 [[Node: strided_slice_6 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Shape_2, strided_slice_6/stack, strided_slice_6/stack_1, strided_slice_6/stack_2)]]
INFO:tensorflow:Saving checkpoints for 273 into s3://zzy/zzy/data/log/carped/model.ckpt.
Traceback (most recent call last):
  File "resnet_v1_50_code/finetune_model.py", line 518, in <module>
    main()
  File "resnet_v1_50_code/finetune_model.py", line 514, in main
    export_model=mox.ExportKeys.TF_SERVING)
  File "moxing/tensorflow/executor/learning_builder.pyx", line 375, in moxing.tensorflow.executor.learning_builder.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 228, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 329, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._profile_train
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 483, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
  File "moxing/tensorflow/executor/learning.pyx", line 661, in moxing.tensorflow.executor.learning.Learning.run
  File "moxing/tensorflow/executor/learning.pyx", line 1156, in moxing.tensorflow.executor.learning.Learning.training
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
    ignore_live_threads=True)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
    target_list_as_strings, status, None)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 2 of dimension 0 out of bounds.
	 [[Node: strided_slice_6 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Shape_2, strided_slice_6/stack, strided_slice_6/stack_1, strided_slice_6/stack_2)]]

Training job fails with AssertionError: can only join a started process

Basic information

  • Python version: (2.7)
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / Steps to reproduce

Following the official tutorial at https://support.huaweicloud.com/usermanual-dls/dls_01_0077.html, training fails on ROMA.

Job information

  • Job type:

  • Job ID: d2f13d03-cd72-4d60-88cf-15b5e3fad120

  • Engine type: (TensorFlow or MXNet)

  • Run parameters: train_url=s3://final-model; model_name=resnet_v1_50; checkpoint_url=s3://final-model/resnet_v1_50

  • Number of compute nodes: 1

  • Compute node specification: 8 cores | 64 GiB | 1*P100

Related source code / Output logs

Traceback (most recent call last):
  File "/usr/local/anaconda2/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/mind/ci-workspace/tf-model-v1.0.0-dailybuild/moxing/build/moxing/tensorflow/datasets/raw/async_raw_reader.py", line 168, in request_stop
  File "/usr/local/anaconda2/lib/python2.7/multiprocessing/process.py", line 147, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/local/anaconda2/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/mind/ci-workspace/tf-model-v1.0.0-dailybuild/moxing/build/moxing/tensorflow/datasets/raw/async_raw_reader.py", line 168, in request_stop
  File "/usr/local/anaconda2/lib/python2.7/multiprocessing/process.py", line 147, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
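
The assertion in this log is raised by multiprocessing when join() is called on a Process object whose start() was never called. Here it fires inside an exit handler, so the reader process probably never started and the real failure may be earlier in the job log. The MoXing internals are not visible in this report; the sketch below only illustrates the condition in user code, with join_if_started as a hypothetical helper name.

```python
import multiprocessing


def join_if_started(proc, timeout=None):
    """Join `proc` only if it was started; return True if a join happened."""
    # Process.pid stays None until start() has been called.
    if proc.pid is None:
        return False
    proc.join(timeout)
    return True
```

A cleanup routine that uses a guard like this can run safely even when setup failed before the worker was launched.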

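PyTorch job fails with RuntimeError: unable to write to file

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Submitting a PyTorch-based training job produces the following error:

RuntimeError: unable to write to file </torch_76_3625483894> at /pytorch/aten/src/TH/THAllocator.c:383

Job information

  • Job type: training job

  • Job ID:

  • Engine type: pytorch

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs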

Traceback (most recent call last):
  File "pytorch/main.py", line 205, in <module>
    main()
  File "pytorch/main.py", line 116, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "pytorch/main.py", line 136, in train
    for i, (input, target) in enumerate(train_loader):
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 113, in default_collate
    storage = batch[0].storage()._new_shared(numel)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 114, in _new_shared
    return cls._new_using_filename(size)
RuntimeError: unable to write to file </torch_76_3625483894> at /pytorch/aten/src/TH/THAllocator.c:383

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 35, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 178, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 87) is killed by signal: Bus error.
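
The traceback fails in _new_shared while a DataLoader worker collates a batch, and the worker then dies with a bus error. That combination commonly points to the container's shared memory (typically /dev/shm) being too small for multi-process data loading, although this report does not confirm it. A hedged workaround is to fall back to num_workers=0, which loads data in the main process and so avoids passing batches between processes through shared memory. The helper name and the 1 GiB threshold below are assumptions for illustration.

```python
import os
import shutil


def choose_num_workers(requested, shm_path="/dev/shm", min_free_bytes=1 << 30):
    """Return `requested` workers if shared memory has room, otherwise 0."""
    if not os.path.isdir(shm_path):
        return 0
    free = shutil.disk_usage(shm_path).free
    return requested if free >= min_free_bytes else 0


# Intended use (not executed here):
#   loader = DataLoader(dataset, num_workers=choose_num_workers(4))
```

If the platform allows it, enlarging the shared memory of the container is the alternative that keeps parallel loading.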

Error when reading .npy/.npz files using moxing

Basic information

  • Python version: (3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Error when reading .npy/.npz files using moxing

Error log:
Traceback (most recent call last):
  File "src/main_kaggle.py", line 12, in <module>
    train_img, train_mask = read_train_data(mod)
  File "/home/work/user-job-dir/src/data_util.py", line 18, in read_train_data
    X_train = np.load(mox.file.read('s3://bucket-medical/ISLES/npy_data/train_'+MODULE_SELECTED+'_img.npy', binary=True))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 404, in load
    magic = fid.read(N)
AttributeError: 'bytes' object has no attribute 'read'

Job information

  • Job type:

  • Job ID:

  • Engine type: (TensorFlow)

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs

Source code:
X_train = np.load(mox.file.read('s3://bucket-medical/ISLES/npy_data/train_'+MODULE_SELECTED+'_img.npy', binary=True))
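
The AttributeError in this report arises because np.load expects a path or a file-like object, while the read call returns the file contents as a bytes object. Wrapping the bytes in io.BytesIO gives np.load something it can call read() on. The sketch below stands in for the remote read with bytes produced locally; load_npy_from_bytes is a hypothetical helper name.

```python
import io

import numpy as np


def load_npy_from_bytes(raw):
    """Load an array from the raw bytes of a .npy file."""
    return np.load(io.BytesIO(raw))


# Stand-in for the bytes returned by the read call in the report.
buf = io.BytesIO()
np.save(buf, np.arange(6).reshape(2, 3))
raw = buf.getvalue()

arr = load_npy_from_bytes(raw)
print(arr.shape)  # (2, 3)
```

In the reported code this would correspond to passing io.BytesIO(...) around the existing mox.file.read(...) call instead of passing its result directly.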


absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'num_gpus'

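Basic information

  • Python version: (3.6)
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / Steps to reproduce

Calling num_gpus = mox.get_flag('num_gpus') raises an error.
(Briefly describe the problem; if it is a bug, describe the steps to reproduce)

Job information

  • Job type:

  • Job ID:

  • Engine type: (TensorFlow)

  • Run parameters:

  • Number of compute nodes: 1

  • Compute node specification: 16 cores | 128 GiB | 2*P100

Related source code / Output logs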

WARNING:root:Use of the keyword argument names (flag_name, default_value, docstring) is deprecated, please use (name, default, help) instead.
WARNING:root:Use of the keyword argument names (flag_name, default_value, docstring) is deprecated, please use (name, default, help) instead.
Traceback (most recent call last):
  File "bert/run_classifier_new.py", line 760, in <module>
    main()
  File "bert/run_classifier_new.py", line 608, in main
    if not flags.do_train and not flags.do_eval and not flags.do_predict:
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/flags.py", line 84, in __getattr__
    wrapped(_sys.argv)
  File "/home/work/anaconda3/lib/python3.6/site-packages/absl/flags/_flagvalues.py", line 630, in __call__
    name, value, suggestions=suggestions)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'num_gpus'

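Checkpoint fails to save

Basic information

  • Python version: 3.6
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / Steps to reproduce

When saving the trained model, the TensorBoard data files are saved, but the ModelCheckpoint data files are not. In addition, the reference post https://bbs.huaweicloud.com/forum/thread-11660-1-1.html mentions (image) "because cp_callback and tb_callback cannot write directly, ...". Could someone clarify this?

Job information

  • Job type:

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes: 1

  • Compute node specification: 8 cores | 64 GiB | 1*P100 | 750 GB

Related source code / Output logs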

(Your Log Here)
![image](https://user-images.githubusercontent.com/32593000/55632541-1e603d00-57ed-11e9-9c9c-392c11ddfea2.png)

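Training job suddenly fails and stops

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (1.8.2)
  • Browser: Chrome

Problem description / Steps to reproduce

The program starts normally and trains a ResNet50 model (model file of about 300 MB). After several epochs the message below suddenly appears (see log) and the job fails. The reported cause is "Unable to connect to endpoint", possibly due to an unstable OBS connection.

(Briefly describe the problem; if it is a bug, describe the steps to reproduce)

Job information

  • Job type:

  • Job ID: resnet-42586680-10

  • Engine type: (TensorFlow)

  • Run parameters: none

  • Number of compute nodes: 1

  • Compute node specification: single node, 8 GPUs

Related source code / Output logs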

Caused by op u'ModelSaver/save/SaveV2', defined at:
  File "resnet_cloud_multi/imagenet_resnet_cloud.py", line 218, in <module>
    launch_train_with_config(config, trainer)
...
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): : Unable to connect to endpoint
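
If the cause really is a transient connection problem, as the reporter suspects, one common mitigation is to retry the failing storage operation with an increasing delay. The sketch below is generic and uses only the standard library; retry is a hypothetical helper, and it applies to calls made from user code (for example, copying a locally saved checkpoint to object storage), not to the save op that runs inside the TensorFlow graph.

```python
import time


def retry(fn, attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**i seconds and try again.

    The exception from the last attempt is re-raised unchanged.
    """
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                raise
            sleep(base_delay * (2 ** i))
```

Narrowing retry_on to the specific exception types seen in the job keeps unrelated bugs from being retried silently.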

Prediction client: no module named predict_pb2

I ran the sample command as follows:
python dls-tfserving-client/python/predict.py
--task_type="image_classification"
--host=my.host
--port=my.port
--data_path="xx/dls-tfserving-client/data/flowers/flower1.jpg"
--labels_file_path="xx/dls-tfserving-client/data/flowers/labels.txt"
--model_name="graph"
The following problem occurs. How can it be resolved?
[image]

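Training job reports that the dataset cannot be found: No such file or directory

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / Steps to reproduce

The training dataset exists on OBS and the training job's data_url is set to its OBS path, but training fails with: No such file or directory

Job information

  • Job type: training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs

Log screenshot:
[image]

Related code: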

  if not gfile.Exists(image_dir):
    print("Image directory '" + image_dir + "' not found.")
    return None
  result = {}
  sub_dirs = [x[0] for x in os.walk(image_dir)]

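TensorFlow training job on the public cloud fails with: truncated record at xxxxx

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

After submitting a training job on the public cloud (for example, the iceberg recognition project), an error occurs during training while reading TFRecord data from OBS. The error message looks roughly like this: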

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, truncated record at 9832454
	 [[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]

Job information

  • Job type: training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, truncated record at 9832454
	 [[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
2018-08-30 13:29:29.379592: W tensorflow/core/framework/op_kernel.cc:1192] Internal: : Unable to connect to endpoint
Traceback (most recent call last):
  File "code/train_iceberg.py", line 164, in <module>
    save_model_secs=120) 
  File "moxing/tensorflow/executor/learning_builder.pyx", line 379, in moxing.tensorflow.executor.learning_builder.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 228, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 287, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._profile_train
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 484, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
  File "moxing/tensorflow/executor/learning.pyx", line 649, in moxing.tensorflow.executor.learning.Learning.run
  File "moxing/tensorflow/executor/learning.pyx", line 1117, in moxing.tensorflow.executor.learning.Learning.training
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
    ignore_live_threads=True)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
    target_list_as_strings, status, None)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 9832454
	 [[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]

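Creating a training job fails (ValueError: learning_rate_strategy should be like: ...)

Basic information

  • Python version: 2.7
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

This is the training dataset:

[image]

Job information

  • Job type: training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters: train_url=s3://donotdel-dls/models/
    model_name=resnet_v1_50
    checkpoint_url=s3://donotdel-dls/models/resnet_v1_50/
    batch_size=2

  • Number of compute nodes: 1

  • Compute node specification: 8 cores | 32 GiB

Related source code / Output logs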

INFO:tensorflow:Total 3301 files found in s3://donotdel-dls-test/models/data/
INFO:tensorflow:Total number of samples is 3301
INFO:tensorflow:Labels index to name:
INFO:tensorflow:0: daisy
INFO:tensorflow:1: dandelion
INFO:tensorflow:2: roses
INFO:tensorflow:3: sunflowers
INFO:tensorflow:4: tulips
Traceback (most recent call last):
  File "resnet_v1_50_code/finetune_model.py", line 501, in <module>
    main()
  File "resnet_v1_50_code/finetune_model.py", line 262, in main
    raise ValueError("learning_rate_strategy should be like: 'e_0:l_0,e_1:l_1,...,e_n:l_n. "
ValueError: learning_rate_strategy should be like: 'e_0:l_0,e_1:l_1,...,e_n:l_n. e_i:l_i represents that from epoch e_i-1 to e_i use learning rate l_i. And training will stop at epoch e_n'
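
The error message describes the expected format: comma-separated epoch:rate pairs, where each rate applies up to its epoch and training stops at the last epoch. The sketch below parses and checks such a string locally before a job is submitted. It is not the validation code from finetune_model.py; the function names are hypothetical, and treating each interval as half-open (from the previous epoch up to, but not including, e_i) is an assumption.

```python
def parse_lr_strategy(text):
    """Parse 'e_0:l_0,e_1:l_1,...' into a list of (end_epoch, lr) pairs."""
    pairs = []
    for item in text.split(","):
        epoch_str, sep, lr_str = item.partition(":")
        if not sep:
            raise ValueError("expected 'epoch:lr', got %r" % item)
        pairs.append((int(epoch_str), float(lr_str)))
    epochs = [e for e, _ in pairs]
    if epochs != sorted(set(epochs)):
        raise ValueError("epochs must be strictly increasing")
    return pairs


def lr_for_epoch(pairs, epoch):
    """Return the learning rate for `epoch`, or None once training has ended."""
    for end_epoch, lr in pairs:
        if epoch < end_epoch:
            return lr
    return None
```

For example, "5:0.1,10:0.01" would mean a rate of 0.1 for epochs 0 to 4, 0.01 for epochs 5 to 9, and a stop at epoch 10 under the assumption above.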

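Prediction job cannot find the model at startup: no versions of servable model found

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / Steps to reproduce

When starting a prediction job, a message similar to the following appears: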

tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:268] No versions of servable resnet_v1_50 found under base path s3://dls-test/log/resnet_v1_50/1/

Job information

  • Job type: prediction job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

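How to set parameters when submitting a custom network model

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

(Briefly describe the problem; if it is a bug, describe the steps to reproduce)
When submitting a custom network model on DLS, how should the run parameters be set?

Job information

  • Job type:

  • Job ID:

  • Engine type: (TensorFlow or MXNet) TensorFlow

  • Run parameters:

  • Number of compute nodes: 8 cores

  • Compute node specification: 60G

Related source code / Output logs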

(Your Log Here)

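Saving the model fails with "Unable to connect to endpoint"

Basic information

  • Python version: 3.6
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / Steps to reproduce

The model is a Transformer (paper: Attention Is All You Need).
With a model size of 1024 units (parameters of about 382x4 MB), the error occurs.
With a model size of 512 units (parameters of about 148x4 MB), training runs normally.
I tried raising the environment variables S3_REQUEST_TIMEOUT_MSEC and S3_REQUEST_TIMEOUT to larger values, such as 6000000, but the error still occurs with the large model.

Training uses Adam, so the checkpoint holds more parameters when it is saved.
With SGD the problem does not occur either.

Job information

  • Job type:

  • Job ID:

  • Engine type: Tensorflow 1.4

  • Run parameters:

  • Number of compute nodes: 1

  • Compute node specification: 8xP100

Related source code / Output logs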

INFO:tensorflow:Create CheckpointSaverHook.
2018-06-30 15:52:46.447723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:2d:00.0, compute capability: 6.0)
INFO:tensorflow:Saving checkpoints for 1 into s3://mt-models/transformer/model_big/model.ckpt.
2018-06-30 15:55:35.289438: W tensorflow/core/framework/op_kernel.cc:1192] Internal: : Unable to connect to endpoint
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: : Unable to connect to endpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 616, in <module>
tf.app.run()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 611, in main
schedule=schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 219, in run
return _execute_schedule(experiment, schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 47, in _execute_schedule
return task()
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 641, in train_and_evaluate
self.train(delay_secs=0)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 378, in train
hooks=self._train_monitors + extra_hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 823, in _call_train
hooks=hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 303, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 786, in _train_model
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 521, in run
run_metadata=run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 892, in run
run_metadata=run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 967, in run
raise six.reraise(*original_exc_info)
File "/usr/local/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 952, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1032, in run
run_metadata=run_metadata))
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 452, in after_run
self._save(run_context.session, global_step)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 468, in _save
self._get_saver().save(session, self._save_path, global_step=step)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1573, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: : Unable to connect to endpoint

Caused by op 'save/SaveV2', defined at:
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 616, in <module>
tf.app.run()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 611, in main
schedule=schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 219, in run
return _execute_schedule(experiment, schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 47, in _execute_schedule
return task()
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 641, in train_and_evaluate
self.train(delay_secs=0)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 378, in train
hooks=self._train_monitors + extra_hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 823, in _call_train
hooks=hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 303, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 783, in _train_model
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 368, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 673, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 493, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 851, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 856, in _create_session
return self._sess_creator.create_session()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 554, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 419, in create_session
self._scaffold.finalize()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
self._saver.build()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1227, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 742, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 381, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 355, in _AddShardedSaveOpsForV2
sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 296, in _AddSaveOps
save = self.save_op(filename_tensor, saveables)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 239, in save_op
tensors)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1163, in save_v2
shape_and_slices=shape_and_slices, tensors=tensors, name=name)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): : Unable to connect to endpoint

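Cannot open an uploaded Jupyter notebook, and cannot import a .py file in the same directory

Basic information

  • Python version: 3.6
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / Steps to reproduce

  1. I cannot open the file I uploaded. File path: bucket-8579 > zhicheng > keyword_extraction.ipynb. Opening it shows a message along the lines of "string do not have split method".
  2. I tried import Ipynb_importer, but it fails saying the file cannot be found, even though the file is in the working directory.
    (Briefly describe the problem; if it is a bug, describe the steps to reproduce)

Job information

  • Job type:

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs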

(Your Log Here)

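numpy.load fails to read a file

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

(Briefly describe the problem; if it is a bug, describe the steps to reproduce)

Job information

  • Job type: training

  • Job ID: prejob-92946102

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes: 1

  • Compute node specification: 8 cores | 64 GiB | 1*P100

Related source code / Output logs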

INFO:tensorflow:RGB checkpoint restored
Traceback (most recent call last):
  File "kinetics-i3d-master/evaluate_sample.py", line 146, in <module>
    tf.app.run(main)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "kinetics-i3d-master/evaluate_sample.py", line 113, in main
    rgb_sample = np.load(_SAMPLE_PATHS['rgb'])
  File "/usr/local/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 372, in load
    fid = open(file, "rb")
IOError: [Errno 2] No such file or directory: 's3://bucket-3216/train_data/data/v_CricketShot_g04_c01_rgb.npy'
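
The traceback shows np.load calling the built-in open() on an s3:// URL. Built-in open() only understands local filesystem paths, so it reports "No such file or directory" even when the object exists in the bucket. A common pattern is to copy the object to local disk first and load from there. The sketch below leaves the actual copy to a caller-supplied function; localize, is_remote and the download parameter are hypothetical names. In a TensorFlow 1.x job that already reaches OBS through tf.gfile, tf.gfile.Copy might serve as that function, which is an assumption and not something this report confirms.

```python
import os
import tempfile


def is_remote(path):
    """True for object-storage URLs such as s3://bucket/key."""
    return "://" in path


def localize(path, download, cache_dir=None):
    """Return a local path for `path`, downloading it first if it is remote.

    `download(src, dst)` is a caller-supplied function that copies the
    remote object `src` to the local file `dst`.
    """
    if not is_remote(path):
        return path
    cache_dir = cache_dir or tempfile.mkdtemp()
    local_path = os.path.join(cache_dir, os.path.basename(path))
    download(path, local_path)
    return local_path


# Intended use (not executed here):
#   rgb_sample = np.load(localize(_SAMPLE_PATHS['rgb'], download=my_copy))
```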

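TensorFlow 1.8 job repeatedly prints aws_logging messages

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

A training job is started on TensorFlow 1.8, and the code uses the tf.gfile module to access OBS.
(Basic environment variables such as the AK/SK are already set in DLS.)
After the training job starts, the following log messages are printed frequently.

Job information

  • Job type: training job

  • Job ID:

  • Engine type: (TensorFlow or MXNet): TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs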

2018-07-17 11:29:04.024753: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:04.025943: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:04.026045: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.556230: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.556383: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.558030: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.558174: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.582349: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.582455: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.583321: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.583429: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.585176: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.585264: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.586086: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.586174: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.589516: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.589625: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.590120: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.590219: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:07.076022: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:07.076192: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:07.078109: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
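
The repeated lines above are INFO-level (`I`) messages emitted through TensorFlow's native logging. A minimal sketch of one way to hide them, assuming the TensorFlow build used by the job honors the standard `TF_CPP_MIN_LOG_LEVEL` environment variable; the helper name `quiet_tf_native_logs` is hypothetical, and the variable has to be set before `tensorflow` is imported:

```python
import os


def quiet_tf_native_logs(level=1):
    """Ask TensorFlow's C++ logging to drop messages below `level`.

    0 = show everything, 1 = hide INFO, 2 = also hide WARNING, 3 = also hide ERROR.
    Must run before `import tensorflow` to take effect.
    """
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = str(level)
    return os.environ["TF_CPP_MIN_LOG_LEVEL"]


quiet_tf_native_logs(1)
# import tensorflow as tf  # import only after the variable is set
```

Note that this hides all native INFO messages, not only those from aws_logging.cc.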

Error when reading data

Basic information

  • Python version: (2.7 / 3.6)

IOError: [Errno 2] No such file or directory: 's3://bucket-3216/train_data/data/label_map.txt'
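
The path in the error starts with `s3://`, which Python's built-in `open()` treats as a local file name, hence Errno 2. A minimal sketch, assuming the file is instead read through an opener that understands the storage scheme; `read_label_lines` and its `open_fn` parameter are hypothetical names, and the in-memory opener, bucket name and file contents below are made-up stand-ins for a real store:

```python
import io


def read_label_lines(path, open_fn):
    """Read a UTF-8 text file through `open_fn` and return its non-empty lines."""
    f = open_fn(path, "rb")
    try:
        raw = f.read()
    finally:
        f.close()
    return [line.strip() for line in raw.decode("utf-8").splitlines() if line.strip()]


# Stand-in opener for illustration only: serves bytes from a dict, not from OBS.
_FAKE_STORE = {"s3://example-bucket/label_map.txt": b"cat\ndog\n\nbird\n"}


def fake_open(path, mode="rb"):
    return io.BytesIO(_FAKE_STORE[path])


labels = read_label_lines("s3://example-bucket/label_map.txt", fake_open)
```

On DLS, `open_fn` could be an opener such as `tf.gfile.Open`, the module another issue on this page reports using to reach OBS; that substitution is not tested here.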

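Error when saving the trained model from a callback

Basic information

  • Python version: (3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

(Briefly describe the problem; if it is a bug, describe the steps to reproduce it.)

Job information

  • Related job type:

  • Job ID:

  • Engine type: (TensorFlow or MXNet)

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs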

Traceback (most recent call last):
  File "CnnModel/DCNN.py", line 152, in <module>
    model = cnn_model()
  File "CnnModel/DCNN.py", line 66, in cnn_model
    callbacks=[check],
  File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1042, in fit
    validation_steps=validation_steps)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 219, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 77, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 452, in on_epoch_end
    filepath = self.filepath.format(epoch=epoch + 1, **logs)
AttributeError: 'File' object has no attribute 'format'
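
The last frame of the traceback calls `self.filepath.format(epoch=epoch + 1, **logs)`, so the callback expects `filepath` to be a string template rather than an open file object. A minimal sketch of that formatting step in plain Python; the template and the `val_loss` value are made-up examples, and `render_checkpoint_path` is a hypothetical helper, not part of Keras:

```python
def render_checkpoint_path(filepath, epoch, logs):
    """Mimic the formatting call shown in the traceback (epoch is zero-based)."""
    if not isinstance(filepath, str):
        raise TypeError(
            "filepath must be a string template, got %s" % type(filepath).__name__
        )
    return filepath.format(epoch=epoch + 1, **logs)


path = render_checkpoint_path(
    "weights.{epoch:02d}-{val_loss:.2f}.hdf5", 2, {"val_loss": 0.1234}
)
print(path)  # weights.03-0.12.hdf5
```

Passing the path as a plain string, instead of a file object opened on that path, would avoid the `AttributeError` at this step.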

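Training job fails with ImportError: No module named module_dir

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / Steps to reproduce

The code is structured as follows:

project_dir
  |- main.py
  |- module_dir
    |- module_file.py

main.py contains the following code:

from module_dir import module_file

The following error occurs: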

Traceback (most recent call last):
  File "project_dir/main.py", line 1, in <module>
    from module_dir import module_file
ImportError: No module named module_dir
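
One possible cause, stated here as an assumption because the issue records no resolution: under Python 2.7 a directory is importable as a package only if it contains an `__init__.py`, and the directory holding `main.py` has to be on `sys.path`. A minimal sketch that ensures the second condition; `add_script_dir_to_path` is a hypothetical helper:

```python
import os
import sys


def add_script_dir_to_path(script_file):
    """Put the directory containing `script_file` at the front of sys.path."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    if script_dir not in sys.path:
        sys.path.insert(0, script_dir)
    return script_dir


# In project_dir/main.py this would be called before the import:
# add_script_dir_to_path(__file__)
# from module_dir import module_file
```

For the first condition, an empty `module_dir/__init__.py` file is enough.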

Job information

  • Related job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

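Training job error: input must be 4-dimensional[1,1,300,300,3].

Problem description / Steps to reproduce

In the DLS service, open the preset model library, create a training job, and select one of your own datasets; training then fails with this error.

Job information

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:
    train_url=s3://cat-body-six-classes/model.resnet_v1_50/
    batch_size=32
    learning_rate_strategy=10:0.01,20:0.001
    file_pattern=flowers_*
    max_epoches=20
    image_size=224
    num_classes=6
    samples_per_epoch=589
    checkpoint_exclude_patterns=logits.global_step

  • Number of compute nodes: 1

  • Compute node specification: 1*P100

  • Screenshot:

Related source code / Output logs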

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, input must be 4-dimensional[1,1,300,300,3]
[[Node: ResizeBilinear = ResizeBilinear[T=DT_UINT8, align_corners=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims, ResizeBilinear/size)]]
INFO:tensorflow:Saving checkpoints for 2 into s3://cat-body-six-classes/log/model.ckpt.
Traceback (most recent call last):
  File "resnet_v1_50_code/finetune_model.py", line 519, in <module>
    main()
  File "resnet_v1_50_code/finetune_model.py", line 515, in main
    export_model=mox.ExportKeys.TF_SERVING)
  File "moxing/tensorflow/executor/learning_builder.pyx", line 375, in moxing.tensorflow.executor.learning_builder.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 237, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 491, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
  File "moxing/tensorflow/executor/learning.pyx", line 653, in moxing.tensorflow.executor.learning.Learning.run
  File "moxing/tensorflow/executor/learning.pyx", line 1124, in moxing.tensorflow.executor.learning.Learning.training
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
    ignore_live_threads=True)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
    target_list_as_strings, status, None)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: input must be 4-dimensional[1,1,300,300,3]
[[Node: ResizeBilinear = ResizeBilinear[T=DT_UINT8, align_corners=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims, ResizeBilinear/size)]]
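
The shape in the error, `[1,1,300,300,3]`, has five dimensions, while the message says the `ResizeBilinear` input must be 4-dimensional. Since that input comes from `ExpandDims`, the decoded image presumably already had four dimensions, `[1,300,300,3]`, before the batch axis was added. One guess, not confirmed by the issue, is that a file in the dataset decodes to a stack of frames rather than a single image. A minimal shape-only sketch of the check; `shape_after_expand_dims` and `check_resize_input` are hypothetical helpers that operate on plain tuples, not TensorFlow calls:

```python
def shape_after_expand_dims(shape, axis=0):
    """Return the shape produced by inserting a size-1 axis at `axis`."""
    dims = list(shape)
    dims.insert(axis, 1)
    return tuple(dims)


def check_resize_input(decoded_shape):
    """Raise if adding a batch axis would not give a 4-D [batch, h, w, c] shape."""
    batched = shape_after_expand_dims(decoded_shape)
    if len(batched) != 4:
        raise ValueError("input must be 4-dimensional, got %s" % list(batched))
    return batched


print(check_resize_input((300, 300, 3)))  # (1, 300, 300, 3)
```

Calling `check_resize_input((1, 300, 300, 3))` raises, reproducing the five-dimensional shape from the log.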

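Job does not run

Basic information

  • Python version: (3.6)
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / Steps to reproduce

After the job is submitted, the code does not run and the print output does not appear.

Code entry point: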
if __name__ == '__main__':
    print('Configuring CNN model...')
    print('train level : ' + str(args.level))
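
For reference, a minimal runnable sketch of an entry point in the shape the snippet above seems to intend. It assumes run parameters such as `level` arrive as `--level=2` command-line flags, which the issue does not confirm, and it flushes stdout so that early `print` output is not held back in a buffer; `parse_args` and `main` are illustrative names:

```python
from __future__ import print_function

import argparse
import sys


def parse_args(argv=None):
    """Parse the flags this sketch knows about and ignore any others."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--level", type=int, default=1)
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(argv=None):
    args = parse_args(argv)
    print('Configuring CNN model...')
    print('train level : ' + str(args.level))
    sys.stdout.flush()  # make sure the lines reach the job log immediately
    return args


if __name__ == '__main__':
    main()
```

The guard compares `__name__` with `'__main__'`, both with double underscores; a comparison against the bare name `name` would raise a `NameError` instead of running the block.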

Job information

  • Related job type:
  • Job ID: 0e03da5b-c4a8-4de9-ac1f-38deb740b531
  • Engine type: TensorFlow1.4 PY 3.6
  • Run parameters:

method=train; level=2; embedding_dim=300; type=new; l2beta=0.001; hidden_dim=1000; max_contract_length=20000; print_per_batch=10; max_train=30000; model=rcnn; risk=Payment Collection; num_filters=256; learning_rate=0.001

  • Number of compute nodes: 1
  • Compute node specification: 64 cores | 512 GiB | 8*P100

Related source code / Output logs

2018-07-02 15:23:35.936210: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-02 15:23:36.558640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:2d:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:36.896654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:31:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.245999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:35:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.607229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:39:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.989295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 4 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:a9:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:38.372123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 5 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:ad:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:38.768131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 6 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b1:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:39.173232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 7 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b5:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:39.189643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-07-02 15:23:39.189973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3 4 5 6 7
2018-07-02 15:23:39.189989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y Y Y Y N N N N
2018-07-02 15:23:39.189998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: Y Y Y Y N N N N
2018-07-02 15:23:39.190005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2: Y Y Y Y N N N N
2018-07-02 15:23:39.190014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3: Y Y Y Y N N N N
2018-07-02 15:23:39.190021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 4: N N N N Y Y Y Y
2018-07-02 15:23:39.190027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 5: N N N N Y Y Y Y
2018-07-02 15:23:39.190036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 6: N N N N Y Y Y Y
2018-07-02 15:23:39.190042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 7: N N N N Y Y Y Y
2018-07-02 15:23:39.190068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:2d:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:31:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:35:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:39:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla P100-PCIE-16GB, pci bus id: 0000:a9:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla P100-PCIE-16GB, pci bus id: 0000:ad:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla P100-PCIE-16GB, pci bus id: 0000:b1:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla P100-PCIE-16GB, pci bus id: 0000:b5:00.0, compute capability: 6.0)

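TensorFlow stops writing TensorBoard data to S3 when it reaches 5 GB

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Submit a training job that keeps writing data to TensorBoard. Before the data reaches 5 GB, the following error is reported: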

2018-08-17 13:06:50.929457: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-08-17 13:06:50.929633: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-08-17 13:06:50.936199: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError
EntityTooLarge
Your proposed upload exceeds the maximum allowed object size.:	
2018-08-17 13:06:50.936241: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.

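Job information

  • Related job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output logs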

2018-08-17 13:06:50.929457: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-08-17 13:06:50.929633: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-08-17 13:06:50.936199: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError
EntityTooLarge
Your proposed upload exceeds the maximum allowed object size.:
2018-08-17 13:06:50.936241: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
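
The `EntityTooLarge` response indicates that the object store rejected an upload for exceeding its maximum object size. A minimal sketch of one workaround, assuming the limit is 5 GiB per object (the figure from the issue title) and that the training code can switch to a fresh log file or directory before reaching it; `LogSizeTracker`, its safety margin, and the way bytes are counted are all hypothetical:

```python
class LogSizeTracker(object):
    """Track bytes written to the current log object and signal when to rotate."""

    def __init__(self, limit_bytes=5 * 1024 ** 3, safety=0.9):
        # Rotate a little before the hard limit to leave headroom.
        self.threshold = int(limit_bytes * safety)
        self.written = 0
        self.rotations = 0

    def add(self, nbytes):
        """Record a write; return True if a new log object should be started first."""
        if self.written + nbytes > self.threshold:
            self.rotations += 1
            self.written = nbytes  # the record goes into the new object
            return True
        self.written += nbytes
        return False
```

When `add()` returns `True`, the caller would open a new log destination before writing the record, so that no single object grows past the threshold.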

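Training job prints no training information and finishes very quickly

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / Steps to reproduce

After a training job is started, it finishes almost immediately, and the console prints nothing related to loss or accuracy.

The output log is as follows: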

INFO:tensorflow:Restoring parameters from s3://bucket_name/log/model.ckpt-xxx
INFO:tensorflow:Saving checkpoints for xxx into s3://bucket_name/log
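
The two log lines show a checkpoint being restored and a new one saved straight away, which is what would happen if the restored step already met the configured training length. This reading is an assumption, since the issue records no resolution. A minimal sketch that extracts the step from a checkpoint name of the form shown in the log and compares it with a target; `max_steps` and both helper names are hypothetical:

```python
import re


def checkpoint_step(checkpoint_path):
    """Return the integer step encoded in a name like 'model.ckpt-1200', or None."""
    match = re.search(r"model\.ckpt-(\d+)$", checkpoint_path)
    return int(match.group(1)) if match else None


def remaining_steps(checkpoint_path, max_steps):
    """Steps left to train after resuming; 0 means the job would end at once."""
    step = checkpoint_step(checkpoint_path)
    if step is None:
        return max_steps
    return max(0, max_steps - step)


print(remaining_steps("s3://bucket_name/log/model.ckpt-1200", 1000))  # 0
```

If the result is 0, pointing the job at an empty output directory or raising the step limit are two things worth trying.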

Job information

  • Related job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

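Cannot create a new Python notebook in the development environment

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Creating a new Python notebook in the development environment fails with: Unexpected error while saving file : Untitled.ipynb unable to open database file

(Briefly describe the problem; if it is a bug, describe the steps to reproduce it.)

Resolved. While debugging, the user's own code crashed and produced core dumps, which left the disk without free space. Before debugging again, delete the core dumps generated under /home/work and the problem goes away.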
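
A minimal sketch of the cleanup described above. It assumes core dumps are regular files named `core` or `core.<digits>`, which depends on the system's core pattern and is not stated in the issue; `find_core_dumps` and `remove_core_dumps` are hypothetical helpers, and removal is a dry run unless requested:

```python
import os
import re

# Assumed naming: "core" or "core.<pid>"; adjust to the system's core pattern.
_CORE_NAME = re.compile(r"^core(\.\d+)?$")


def find_core_dumps(root):
    """Return sorted paths of files under `root` whose names look like core dumps."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if _CORE_NAME.match(name):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def remove_core_dumps(root, dry_run=True):
    """List matching files and delete them only when dry_run is False."""
    paths = find_core_dumps(root)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths


# Example: list candidates under the directory named in the issue.
# print(remove_core_dumps("/home/work"))
```

Reviewing the dry-run list first guards against deleting an unrelated file that happens to match the pattern.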
