dls-example's People

Contributors

calvinxky, cathy0908, charliechow, gclouding, hahayc, hydrogenqaq, sleepfin, xinglihua2

dls-example's Issues

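Reading a TFRecord dataset with the preset model library is very slow

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / steps to reproduce

The training job log prints the messages below and then shows no further progress for a long time. The log stays stuck on the last of these lines.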

INFO:tensorflow:Find tfrecord files. Using tfrecord files in this job.
INFO:tensorflow:Automatically extracting num_samples from tfrecord. If the dataset is large, it may take some time. You can also manually specify the num_samples to Dataset to save time.

Job information

  • Job type: Preset model library - create training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:
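
The log message itself suggests passing num_samples to the dataset so that it does not have to be extracted at startup. As a rough sketch, the count can be obtained once, offline, by walking the record headers of a TFRecord file. This assumes the file is available on local disk and relies on the TFRecord layout (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). The function name is hypothetical and the CRCs are not verified.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count the records in one local TFRecord file without decoding them."""
    size = os.path.getsize(path)
    count = 0
    offset = 0
    with open(path, "rb") as f:
        while offset < size:
            header = f.read(8)  # uint64 little-endian payload length
            if len(header) < 8:
                raise ValueError("truncated record header in %s" % path)
            (length,) = struct.unpack("<Q", header)
            # Skip the 4-byte length CRC, the payload and the 4-byte payload CRC.
            offset += 8 + 4 + length + 4
            if offset > size:
                raise ValueError("truncated record body in %s" % path)
            f.seek(offset)
            count += 1
    return count
```

Summing this over all files of the dataset gives a value that could then be supplied as num_samples, as the log message recommends.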

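Failed to save checkpoints

Basic information

  • Python version: 3.6
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / steps to reproduce

When saving the trained model, the TensorBoard data files are saved but the ModelCheckpoint data files are not. The reference document https://bbs.huaweicloud.com/forum/thread-11660-1-1.html also mentions "because cp_callback and tb_callback cannot write directly, ...". Could someone explain this?

Job information

  • Job type:

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes: 1

  • Compute node specification: 8 cores | 64 GiB | 1*P100 | 750 GB

Source code / output log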
![image](https://user-images.githubusercontent.com/32593000/55632541-1e603d00-57ed-11e9-9c9c-392c11ddfea2.png)
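
The quoted sentence says the callbacks cannot write directly, presumably to object storage. A common workaround is to let the callbacks write to a local scratch directory and copy the results to the final location afterwards. The sketch below only illustrates that pattern: the function names are hypothetical, and shutil.copytree is a local stand-in for whatever OBS-aware copy function the platform provides.

```python
import os
import shutil
import tempfile


def make_local_output_dir(prefix="train_out_"):
    """Create a scratch directory on local disk for callbacks to write into."""
    return tempfile.mkdtemp(prefix=prefix)


def upload_outputs(local_dir, remote_dir, copy_tree=shutil.copytree):
    """Copy everything written under local_dir to remote_dir.

    copy_tree is an assumption: shutil.copytree only handles local paths and
    requires that remote_dir does not exist yet. On DLS an OBS-aware copy
    function would be passed in instead.
    """
    copy_tree(local_dir, remote_dir)
    return sorted(os.listdir(local_dir))
```

The checkpoint and TensorBoard callbacks would be pointed at the directory returned by make_local_output_dir, and upload_outputs would run after training (or after each epoch, with a fresh destination each time).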

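Training job fails with AssertionError: can only join a started process

Basic information

  • Python version: 2.7
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / steps to reproduce

Following the official tutorial at https://support.huaweicloud.com/usermanual-dls/dls_01_0077.html, training failed on ROMA.

Job information

  • Job type:

  • Job ID: d2f13d03-cd72-4d60-88cf-15b5e3fad120

  • Engine type: (TensorFlow or MXNet)

  • Run parameters: train_url=s3://final-model; model_name=resnet_v1_50; checkpoint_url=s3://final-model/resnet_v1_50

  • Number of compute nodes: 1

  • Compute node specification: 8 cores | 64 GiB | 1*P100

Source code / output log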

Traceback (most recent call last):
  File "/usr/local/anaconda2/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/mind/ci-workspace/tf-model-v1.0.0-dailybuild/moxing/build/moxing/tensorflow/datasets/raw/async_raw_reader.py", line 168, in request_stop
  File "/usr/local/anaconda2/lib/python2.7/multiprocessing/process.py", line 147, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/local/anaconda2/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/mind/ci-workspace/tf-model-v1.0.0-dailybuild/moxing/build/moxing/tensorflow/datasets/raw/async_raw_reader.py", line 168, in request_stop
  File "/usr/local/anaconda2/lib/python2.7/multiprocessing/process.py", line 147, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
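
The traceback shows an exit handler calling join() on a multiprocessing process that was never started, which is what the assertion rejects. The sketch below illustrates the general failure mode and one way to guard against it. It is not MoXing's code, and StoppableWorker is a hypothetical name.

```python
import multiprocessing


class StoppableWorker(object):
    """Wrap a Process and remember whether start() was ever called."""

    def __init__(self, target):
        self._process = multiprocessing.Process(target=target)
        self._started = False

    def start(self):
        self._process.start()
        self._started = True

    def request_stop(self):
        # Process.join() asserts that the process has been started, so an
        # exit handler must skip the join when start() never ran.
        if not self._started:
            return False
        self._process.join()
        return True
```

In practice this error at exit is often a follow-on symptom: something failed before the reader process was started, so the earlier part of the job log is usually the place to look for the real cause.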

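TensorFlow stops writing TensorBoard data to S3 at 5 GB

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / steps to reproduce

After a training job is submitted that keeps writing data to TensorBoard, the following error is reported before the data reaches 5 GB: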

2018-08-17 13:06:50.929457: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-08-17 13:06:50.929633: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-08-17 13:06:50.936199: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError
EntityTooLarge
Your proposed upload exceeds the maximum allowed object size.:	
2018-08-17 13:06:50.936241: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.

Job information

  • Job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Source code / output log

2018-08-17 13:06:50.929457: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-08-17 13:06:50.929633: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-08-17 13:06:50.936199: W tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError
EntityTooLarge
Your proposed upload exceeds the maximum allowed object size.:
2018-08-17 13:06:50.936241: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
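
EntityTooLarge indicates that a single uploaded object exceeded the store's size limit, which for S3-style single uploads is about 5 GB. As a hedged sketch, a job that writes its event files locally could check their sizes periodically and start a new event file before one grows too large. The 5 GiB constant, the 0.8 safety fraction and the function name are assumptions for illustration, and the check only works on local directories.

```python
import os

SINGLE_UPLOAD_LIMIT = 5 * 1024 ** 3  # assumed single-object upload limit


def files_near_limit(log_dir, limit=SINGLE_UPLOAD_LIMIT, fraction=0.8):
    """Return (path, size) for local files at or above fraction * limit."""
    threshold = limit * fraction
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(log_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size >= threshold:
                flagged.append((path, size))
    return flagged
```

When the list is non-empty, creating a fresh summary writer (and therefore a new event file) keeps each uploaded object below the limit.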

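Training job error: InvalidArgumentError: slice index 2 of dimension 0 out of bounds.

Basic information

  • Python version: 2.7
  • MoXing version: 1.0.6
  • Browser: Chrome

Problem description / steps to reproduce

Preset model library -> ResNet_v1_50 -> create training job -> select a dataset and submit the job

Job information

  • Job type: Training job
  • Job ID:
  • Engine type: TensorFlow
  • Run parameters:

train_url=s3://zzy/zzy/data/log/carped/
model_name=resnet_v1_50
checkpoint_url=s3://zzy/zzy/pretrained_model/resnet_v1_50/

  • Number of compute nodes: 1
  • Compute node specification: 1*P100

Source code / output log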

2018-05-26 15:19:35.802270: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 2 of dimension 0 out of bounds.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, slice index 2 of dimension 0 out of bounds.
	 [[Node: strided_slice_6 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Shape_2, strided_slice_6/stack, strided_slice_6/stack_1, strided_slice_6/stack_2)]]
INFO:tensorflow:Saving checkpoints for 273 into s3://zzy/zzy/data/log/carped/model.ckpt.
Traceback (most recent call last):
  File "resnet_v1_50_code/finetune_model.py", line 518, in <module>
    main()
  File "resnet_v1_50_code/finetune_model.py", line 514, in main
    export_model=mox.ExportKeys.TF_SERVING)
  File "moxing/tensorflow/executor/learning_builder.pyx", line 375, in moxing.tensorflow.executor.learning_builder.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 228, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 329, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._profile_train
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 483, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
  File "moxing/tensorflow/executor/learning.pyx", line 661, in moxing.tensorflow.executor.learning.Learning.run
  File "moxing/tensorflow/executor/learning.pyx", line 1156, in moxing.tensorflow.executor.learning.Learning.training
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
    ignore_live_threads=True)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
    target_list_as_strings, status, None)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 2 of dimension 0 out of bounds.
	 [[Node: strided_slice_6 = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Shape_2, strided_slice_6/stack, strided_slice_6/stack_1, strided_slice_6/stack_2)]]

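How to set parameters when submitting a custom network model

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / steps to reproduce

How should the run parameters be set when submitting a custom network model on DLS?

Job information

  • Job type:

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes: 8 cores

  • Compute node specification: 60G

Source code / output log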
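
Other issues in this list write run parameters as key=value pairs (for example train_url=... and batch_size=2), and one traceback shows a script rejecting a num_gpus command-line flag it had not defined. That suggests each run parameter reaches the script as a --key=value command-line argument. Assuming this is the case, a custom script could declare its parameters roughly as follows. The parameter names and defaults here are illustrative only, and parse_known_args is used so that extra flags injected by the platform do not abort the job.

```python
import argparse


def parse_run_parameters(argv=None):
    """Parse --key=value run parameters; unknown flags are returned, not fatal."""
    parser = argparse.ArgumentParser(description="custom training script")
    parser.add_argument("--data_url", type=str, default="")   # dataset path
    parser.add_argument("--train_url", type=str, default="")  # output path
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    args, unknown = parser.parse_known_args(argv)
    return args, unknown
```

With this in place, a run parameter such as batch_size=2 entered in the job form would be read as args.batch_size, and any flag the script does not declare ends up in the unknown list.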

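Creating a training job fails (ValueError: learning_rate_strategy should be like: ...)

Basic information

  • Python version: 2.7
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / steps to reproduce

A screenshot of the training dataset was attached.

Job information

  • Job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters: train_url=s3://donotdel-dls/models/
    model_name=resnet_v1_50
    checkpoint_url=s3://donotdel-dls/models/resnet_v1_50/
    batch_size=2

  • Number of compute nodes: 1

  • Compute node specification: 8 cores | 32 GiB

Source code / output log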

INFO:tensorflow:Total 3301 files found in s3://donotdel-dls-test/models/data/
INFO:tensorflow:Total number of samples is 3301
INFO:tensorflow:Labels index to name:
INFO:tensorflow:0: daisy
INFO:tensorflow:1: dandelion
INFO:tensorflow:2: roses
INFO:tensorflow:3: sunflowers
INFO:tensorflow:4: tulips
Traceback (most recent call last):
  File "resnet_v1_50_code/finetune_model.py", line 501, in <module>
    main()
  File "resnet_v1_50_code/finetune_model.py", line 262, in main
    raise ValueError("learning_rate_strategy should be like: 'e_0:l_0,e_1:l_1,...,e_n:l_n. "
ValueError: learning_rate_strategy should be like: 'e_0:l_0,e_1:l_1,...,e_n:l_n. e_i:l_i represents that from epoch e_i-1 to e_i use learning rate l_i. And training will stop at epoch e_n'
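
The error message describes the expected format: comma-separated epoch:rate pairs, where each rate applies up to its epoch and training stops at the last epoch. The run parameters listed in this issue do not set learning_rate_strategy at all, whereas another issue in this list passes learning_rate_strategy=10:0.01,20:0.001. The following sketch shows how such a string could be parsed and queried. It is an illustration of the format, not the code in finetune_model.py, and both function names are hypothetical.

```python
def parse_learning_rate_strategy(text):
    """Parse 'e_0:l_0,e_1:l_1,...' into a list of (end_epoch, learning_rate)."""
    schedule = []
    for item in text.split(","):
        epoch_str, sep, lr_str = item.strip().partition(":")
        if not sep:
            raise ValueError("expected 'epoch:learning_rate', got %r" % item)
        schedule.append((int(epoch_str), float(lr_str)))
    epochs = [epoch for epoch, _ in schedule]
    if epochs != sorted(set(epochs)):
        raise ValueError("epochs must be strictly increasing: %r" % epochs)
    return schedule


def learning_rate_at(schedule, epoch):
    """Return the rate for a 0-based epoch, or None once training should stop."""
    for end_epoch, learning_rate in schedule:
        if epoch < end_epoch:
            return learning_rate
    return None
```

For the string 10:0.01,20:0.001 this gives a rate of 0.01 for epochs 0 to 9, 0.001 for epochs 10 to 19, and no rate from epoch 20 on, which matches the "training will stop at epoch e_n" wording.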

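Cannot create a new Python notebook in the development environment

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / steps to reproduce

Creating a new Python notebook in the development environment fails with: Unexpected error while saving file : Untitled.ipynb unable to open database file

Resolved. The user's own code crashed during debugging and produced core dumps, which filled up the disk. Before debugging again, delete the core dumps generated under /home/work and the problem no longer occurs.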
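
To see whether core dumps are what is filling the disk, a short script can list candidate files and their sizes before anything is deleted. This is a sketch under an assumption: the file name patterns "core" and "core.*" are a common convention, not something stated in the issue, and find_core_dumps is a hypothetical helper.

```python
import fnmatch
import os


def find_core_dumps(root, patterns=("core", "core.*")):
    """Return (path, size_in_bytes) for likely core dumps, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                path = os.path.join(dirpath, name)
                found.append((path, os.path.getsize(path)))
    return sorted(found, key=lambda entry: entry[1], reverse=True)
```

Calling find_core_dumps("/home/work") and reviewing the result first, then removing the confirmed files with os.remove, avoids deleting an unrelated file that happens to match the pattern.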

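Training job prints no training information and finishes very quickly

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / steps to reproduce

After a training job is started, it finishes very quickly and the console prints nothing about loss or accuracy.

The output log is as follows: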

INFO:tensorflow:Restoring parameters from s3://bucket_name/log/model.ckpt-xxx
INFO:tensorflow:Saving checkpoints for xxx into s3://bucket_name/log

Job information

  • Job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

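Job does not run

Basic information

  • Python version: 3.6
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / steps to reproduce

After the job is submitted, the code does not run and the print output never appears.

Code entry point: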
if __name__ == '__main__':

    print('Configuring CNN model...')
    print('train level : ' + str(args.level))

Job information

  • Job type:
  • Job ID: 0e03da5b-c4a8-4de9-ac1f-38deb740b531
  • Engine type: TensorFlow 1.4, Python 3.6
  • Run parameters:

method=train; level=2; embedding_dim=300; type=new; l2beta=0.001; hidden_dim=1000; max_contract_length=20000; print_per_batch=10; max_train=30000; model=rcnn; risk=Payment Collection; num_filters=256; learning_rate=0.001

  • Number of compute nodes: 1
  • Compute node specification: 64 cores | 512 GiB | 8*P100

Source code / output log

2018-07-02 15:23:35.936210: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-02 15:23:36.558640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:2d:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:36.896654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:31:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.245999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:35:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.607229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:39:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:37.989295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 4 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:a9:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:38.372123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 5 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:ad:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:38.768131: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 6 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b1:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:39.173232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 7 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b5:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-07-02 15:23:39.189643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-07-02 15:23:39.189973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3 4 5 6 7
2018-07-02 15:23:39.189989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y Y Y Y N N N N
2018-07-02 15:23:39.189998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: Y Y Y Y N N N N
2018-07-02 15:23:39.190005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2: Y Y Y Y N N N N
2018-07-02 15:23:39.190014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3: Y Y Y Y N N N N
2018-07-02 15:23:39.190021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 4: N N N N Y Y Y Y
2018-07-02 15:23:39.190027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 5: N N N N Y Y Y Y
2018-07-02 15:23:39.190036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 6: N N N N Y Y Y Y
2018-07-02 15:23:39.190042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 7: N N N N Y Y Y Y
2018-07-02 15:23:39.190068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:2d:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:31:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:35:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:39:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla P100-PCIE-16GB, pci bus id: 0000:a9:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla P100-PCIE-16GB, pci bus id: 0000:ad:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla P100-PCIE-16GB, pci bus id: 0000:b1:00.0, compute capability: 6.0)
2018-07-02 15:23:39.190119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla P100-PCIE-16GB, pci bus id: 0000:b5:00.0, compute capability: 6.0)

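Training job reports that the dataset cannot be found: No such file or directory

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / steps to reproduce

The training dataset exists on OBS and data_url was set to the OBS path when the training job was created, but training fails with: No such file or directory

Job information

  • Job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Source code / output log

A screenshot of the log was attached.

Relevant code: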

  if not gfile.Exists(image_dir):
    print("Image directory '" + image_dir + "' not found.")
    return None
  result = {}
  sub_dirs = [x[0] for x in os.walk(image_dir)]
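
The snippet checks the directory with gfile.Exists but then lists it with os.walk. Python's os module only understands local paths, so an s3:// or obs:// URL is treated as a relative local path that does not exist, and os.walk silently yields nothing. The sketch below shows one way to route remote URLs to a storage-aware walker. The helper name and the remote_walk parameter are hypothetical; in TensorFlow 1.x, tf.gfile.Walk yields tuples of the same shape and could plausibly be passed in.

```python
import os

REMOTE_PREFIXES = ("s3://", "obs://")


def list_sub_dirs(image_dir, remote_walk=None):
    """List image_dir and its sub-directories, locally or via remote_walk."""
    if image_dir.startswith(REMOTE_PREFIXES):
        if remote_walk is None:
            raise ValueError(
                "os.walk cannot read %r; pass a storage-aware walker" % image_dir)
        walker = remote_walk
    else:
        walker = os.walk
    # Each entry is (dirpath, dirnames, filenames); keep only the dirpath.
    return [entry[0] for entry in walker(image_dir)]
```

The same limitation applies to the built-in open(): reading a file stored on OBS needs a storage-aware API rather than a plain local file call.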

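Training job fails with ImportError: No module named module_dir

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / steps to reproduce

The code is laid out as follows:

project_dir
  |- main.py
  |- module_dir
    |- module_file.py

main.py contains the following code:

from module_dir import module_file

The following error occurs: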

Traceback (most recent call last):
  File "project_dir/main.py", line 1, in <module>
    from module_dir import module_file
ImportError: No module named module_dir

Job information

  • Job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:
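
Two things commonly cause this error with the layout shown: under Python 2.7 a directory is only importable as a package if it contains an __init__.py file, and the directory holding main.py has to be on sys.path when the platform launches the script. The issue does not say which one applied here, so the following is a sketch that covers both. The helper names are hypothetical.

```python
import os
import sys


def add_project_root(script_path):
    """Put the directory containing script_path at the front of sys.path."""
    root = os.path.dirname(os.path.abspath(script_path))
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


def missing_init_files(root, package_dirs):
    """Return the package directories under root that lack __init__.py."""
    return [name for name in package_dirs
            if not os.path.isfile(os.path.join(root, name, "__init__.py"))]
```

At the top of main.py, add_project_root(__file__) would run before the from module_dir import module_file line, and missing_init_files(root, ["module_dir"]) can be used once to confirm that the package marker file was uploaded along with the code.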

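Prediction job cannot find the model: no versions of servable model found

Basic information

  • Python version: 2.7
  • MoXing version:
  • Browser:

Problem description / steps to reproduce

When a prediction job is started, a message similar to the following appears: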

tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:268] No versions of servable resnet_v1_50 found under base path s3://dls-test/log/resnet_v1_50/1/

Job information

  • Job type: Prediction job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:
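
TensorFlow Serving looks for numbered version sub-directories under the base path it is given. The base path in the log ends in /1/, which looks like a version directory itself, so one plausible reading is that the job was pointed one level too deep. The sketch below checks a local copy of an export directory and suggests the parent when the last path component is numeric. Both helper names are hypothetical, and list_servable_versions only works on local paths.

```python
import os


def list_servable_versions(base_path):
    """Return the numeric version sub-directories found under a local base_path."""
    versions = []
    for name in os.listdir(base_path):
        if name.isdigit() and os.path.isdir(os.path.join(base_path, name)):
            versions.append(int(name))
    return sorted(versions)


def suggest_base_path(path):
    """If path ends in a numeric version directory, return its parent instead."""
    parent, leaf = os.path.split(path.rstrip("/"))
    return parent + "/" if leaf.isdigit() else path
```

For the path in the log, suggest_base_path returns s3://dls-test/log/resnet_v1_50/, which is the directory that would contain the version folder 1.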

Prediction client: no module named predict_pb2

Running the example command on the command line, as follows:
python dls-tfserving-client/python/predict.py \
  --task_type="image_classification" \
  --host=my.host \
  --port=my.port \
  --data_path="xx/dls-tfserving-client/data/flowers/flower1.jpg" \
  --labels_file_path="xx/dls-tfserving-client/data/flowers/labels.txt" \
  --model_name="graph"
produces the error in the title (a screenshot was attached). How can this be resolved?

Error when reading data

Basic information

  • Python version: (2.7 / 3.6)

IOError: [Errno 2] No such file or directory: 's3://bucket-3216/train_data/data/label_map.txt'

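Training job suddenly fails and stops

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: 1.8.2
  • Browser: Chrome

Problem description / steps to reproduce

The program started normally and trained a ResNet50 model (model file of about 300 MB), but after several epochs the message below (see the log) suddenly appeared and the job failed. The reported cause is "Unable to connect to endpoint", possibly because the OBS connection is unstable.

Job information

  • Job type:

  • Job ID: resnet-42586680-10

  • Engine type: TensorFlow

  • Run parameters: none

  • Number of compute nodes: 1

  • Compute node specification: single node, 8 GPUs

Source code / output log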

Caused by op u'ModelSaver/save/SaveV2', defined at:
File "resnet_cloud_multi/imagenet_resnet_cloud.py", line 218, in <module>
launch_train_with_config(config, trainer)
...
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): : Unable to connect to endpoint
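
If the failure really is a transient connection problem, as the reporter suspects, one generic mitigation is to retry the save with exponential backoff instead of letting a single failed request end the job. The sketch below shows the pattern only: the retry helper is hypothetical, and the exception type to retry on is an assumption that would have to match what the save call actually raises (the traceback shows a TensorFlow InternalError).

```python
import time


def retry(operation, attempts=3, base_delay=1.0, retry_on=(IOError,),
          sleep=time.sleep):
    """Call operation(); on a listed exception wait and try again.

    The delay doubles after each failure. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The save step would be wrapped as a zero-argument callable, for example retry(lambda: save_checkpoint(), retry_on=(SomeError,)), where save_checkpoint and SomeError stand for the job's own save function and error class.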

Training job error: input must be 4-dimensional[1,1,300,300,3].

Problem description / steps to reproduce

DLS service -> preset model library -> create training job -> select one of my own datasets and train; an error occurs.

Job information

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:
    train_url=s3://cat-body-six-classes/model.resnet_v1_50/
    batch_size=32
    learning_rate_strategy=10:0.01,20:0.001
    file_pattern=flowers_*
    max_epoches=20
    image_size=224
    num_classes=6
    samples_per_epoch=589
    checkpoint_exclude_patterns=logits.global_step

  • Number of compute nodes: 1

  • Compute node specification: 1*P100

  • Screenshot:

Source code / output log

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, input must be 4-dimensional[1,1,300,300,3]
[[Node: ResizeBilinear = ResizeBilinear[T=DT_UINT8, align_corners=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims, ResizeBilinear/size)]]
INFO:tensorflow:Saving checkpoints for 2 into s3://cat-body-six-classes/log/model.ckpt.
Traceback (most recent call last):
File "resnet_v1_50_code/finetune_model.py", line 519, in <module>
main()
File "resnet_v1_50_code/finetune_model.py", line 515, in main
export_model=mox.ExportKeys.TF_SERVING)
File "moxing/tensorflow/executor/learning_builder.pyx", line 375, in moxing.tensorflow.executor.learning_builder.run
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 237, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
File "moxing/tensorflow/executor/learning_wrapper.pyx", line 491, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
File "moxing/tensorflow/executor/learning.pyx", line 653, in moxing.tensorflow.executor.learning.Learning.run
File "moxing/tensorflow/executor/learning.pyx", line 1124, in moxing.tensorflow.executor.learning.Learning.training
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
self._close_internal(exception_type)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
self._sess.close()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
self._sess.close()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
ignore_live_threads=True)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
target_list_as_strings, status, None)
File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: input must be 4-dimensional[1,1,300,300,3]
[[Node: ResizeBilinear = ResizeBilinear[T=DT_UINT8, align_corners=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims, ResizeBilinear/size)]]
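
The reported shape [1,1,300,300,3] has five dimensions, so the decoded image was already 4-D before ExpandDims added one more. One possible cause, not confirmed in the issue, is that some files in the dataset are GIFs: TensorFlow's generic image decoder returns a 4-D tensor (frames, height, width, channels) for GIF input. As a hedged sketch, the dataset can be scanned locally for files that carry a GIF signature regardless of their extension. The helper names are hypothetical.

```python
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def is_gif_file(path):
    """Return True if the file starts with a GIF signature."""
    with open(path, "rb") as f:
        return f.read(6) in GIF_SIGNATURES


def find_disguised_gifs(paths):
    """Return the files that are GIFs but do not have a .gif extension."""
    return [path for path in paths
            if not path.lower().endswith(".gif") and is_gif_file(path)]
```

Any file this reports could be converted to JPEG or PNG, or removed from the dataset, before the training job is created again.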

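Cannot open an uploaded Jupyter notebook or import a .py file in the same directory

Basic information

  • Python version: 3.6
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / steps to reproduce

  1. I cannot open the file I uploaded. File path: bucket-8579 > zhicheng > keyword_extraction.ipynb. Opening it shows a message along the lines of "string do not have split method".
  2. I tried import Ipynb_importer, but it fails with a file-not-found error even though the file is in the working directory.

Job information

  • Job type:

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Source code / output log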

absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'num_gpus'

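Basic information

  • Python version: 3.6
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / steps to reproduce

Calling num_gpus = mox.get_flag('num_gpus') produces an error.

Job information

  • Job type:

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes: 1

  • Compute node specification: 16 cores | 128 GiB | 2*P100

Source code / output log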

WARNING:root:Use of the keyword argument names (flag_name, default_value, docstring) is deprecated, please use (name, default, help) instead.
WARNING:root:Use of the keyword argument names (flag_name, default_value, docstring) is deprecated, please use (name, default, help) instead.
Traceback (most recent call last):
  File "bert/run_classifier_new.py", line 760, in <module>
    main()
  File "bert/run_classifier_new.py", line 608, in main
    if not flags.do_train and not flags.do_eval and not flags.do_predict:
  File "/home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/flags.py", line 84, in __getattr__
    wrapped(_sys.argv)
  File "/home/work/anaconda3/lib/python3.6/site-packages/absl/flags/_flagvalues.py", line 630, in __call__
    name, value, suggestions=suggestions)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'num_gpus'

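Saving the model fails with "Unable to connect to endpoint"

Basic information

  • Python version: 3.6
  • MoXing version: (leave blank if not used)
  • Browser: Chrome

Problem description / steps to reproduce

The model is a Transformer (paper: Attention is all you need).
With a model size of 1024 units (parameters about 382x4 MB), the error occurs.
With a model size of 512 units (parameters about 148x4 MB), training runs normally.
Setting the environment variables S3_REQUEST_TIMEOUT_MSEC and S3_REQUEST_TIMEOUT to larger values, such as 6000000, did not help; the error still occurs with the large model.

Training uses Adam, so the saved checkpoint is larger than the parameters alone.
With SGD the problem does not occur.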
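
The sizes quoted above can be turned into a rough checkpoint estimate. This assumes that "382x4 MB" means about 382 million float32 parameters at 4 bytes each, and uses the fact that Adam keeps two extra slot variables (first and second moment) per parameter while plain SGD keeps none. The function name is hypothetical and the result ignores any other saved state.

```python
def checkpoint_size_mb(param_millions, bytes_per_value=4, slots_per_param=0):
    """Estimate checkpoint size in MB from millions of parameters."""
    return param_millions * bytes_per_value * (1 + slots_per_param)


# Large model: 1024 units, about 382 million parameters.
large_sgd = checkpoint_size_mb(382)                      # parameters only
large_adam = checkpoint_size_mb(382, slots_per_param=2)  # plus Adam m and v
# Small model: 512 units, about 148 million parameters.
small_adam = checkpoint_size_mb(148, slots_per_param=2)
```

Under these assumptions the large model saves roughly 4584 MB with Adam versus 1528 MB with SGD, and the small model roughly 1776 MB with Adam, which is consistent with only the large Adam run failing.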

Job information

  • Job type:

  • Job ID:

  • Engine type: TensorFlow 1.4

  • Run parameters:

  • Number of compute nodes: 1

  • Compute node specification: 8xP100

Source code / output log

INFO:tensorflow:Create CheckpointSaverHook.
2018-06-30 15:52:46.447723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:2d:00.0, compute capability: 6.0)
INFO:tensorflow:Saving checkpoints for 1 into s3://mt-models/transformer/model_big/model.ckpt.
2018-06-30 15:55:35.289438: W tensorflow/core/framework/op_kernel.cc:1192] Internal: : Unable to connect to endpoint
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: : Unable to connect to endpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 616, in <module>
tf.app.run()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 611, in main
schedule=schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 219, in run
return _execute_schedule(experiment, schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 47, in _execute_schedule
return task()
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 641, in train_and_evaluate
self.train(delay_secs=0)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 378, in train
hooks=self._train_monitors + extra_hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 823, in _call_train
hooks=hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 303, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 786, in _train_model
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 521, in run
run_metadata=run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 892, in run
run_metadata=run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 967, in run
raise six.reraise(*original_exc_info)
File "/usr/local/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 952, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1032, in run
run_metadata=run_metadata))
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 452, in after_run
self._save(run_context.session, global_step)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 468, in _save
self._get_saver().save(session, self._save_path, global_step=step)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1573, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: : Unable to connect to endpoint

Caused by op 'save/SaveV2', defined at:
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 616, in <module>
tf.app.run()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/work/user-job-dir/seq2seq-master/bin/train.py", line 611, in main
schedule=schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 219, in run
return _execute_schedule(experiment, schedule)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/learn_runner.py", line 47, in _execute_schedule
return task()
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 641, in train_and_evaluate
self.train(delay_secs=0)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 378, in train
hooks=self._train_monitors + extra_hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/experiment.py", line 823, in _call_train
hooks=hooks)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 303, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/work/user-job-dir/seq2seq-master/seq2seq/estimator/estimator.py", line 783, in _train_model
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 368, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 673, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 493, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 851, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 856, in _create_session
return self._sess_creator.create_session()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 554, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 419, in create_session
self._scaffold.finalize()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
self._saver.build()
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1227, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 742, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 381, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 355, in _AddShardedSaveOpsForV2
sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 296, in _AddSaveOps
save = self.save_op(filename_tensor, saveables)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 239, in save_op
tensors)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1163, in save_v2
shape_and_slices=shape_and_slices, tensors=tensors, name=name)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): : Unable to connect to endpoint
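
"Unable to connect to endpoint" is reported here while the graph's `save/SaveV2` op runs, which suggests the checkpoint path points at object storage and the S3 client could not reach it. A minimal sketch for checking the environment variables that TensorFlow 1.x's S3 file system reads is shown below. Whether the platform presets these variables for a given job is not stated in this report, and the helper name `report_s3_env` is hypothetical.

```python
import os

# Environment variables read by TensorFlow 1.x's S3 file system.
S3_VARS = ("S3_ENDPOINT", "S3_USE_HTTPS", "S3_VERIFY_SSL",
           "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
SECRET_VARS = {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}


def report_s3_env(environ=None):
    """Return {name: status} for each variable without leaking secret values."""
    env = os.environ if environ is None else environ
    report = {}
    for name in S3_VARS:
        value = env.get(name)
        if not value:
            report[name] = "MISSING"
        elif name in SECRET_VARS:
            report[name] = "set (hidden)"
        else:
            report[name] = value
    return report


if __name__ == "__main__":
    for name, status in report_s3_env().items():
        print("%-24s %s" % (name, status))
```

Running this at the top of the training script, before any `s3://` path is touched, shows whether the endpoint and credentials are visible to the process at all.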

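PyTorch job fails with RuntimeError: unable to write to file

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Submitting a PyTorch-based training job fails with the following error: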

RuntimeError: unable to write to file </torch_76_3625483894> at /pytorch/aten/src/TH/THAllocator.c:383

Job information

  • Job type: Training job

  • Job ID:

  • Engine type: pytorch

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output log

Traceback (most recent call last):
  File "pytorch/main.py", line 205, in <module>
    main()
  File "pytorch/main.py", line 116, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "pytorch/main.py", line 136, in train
    for i, (input, target) in enumerate(train_loader):
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 113, in default_collate
    storage = batch[0].storage()._new_shared(numel)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 114, in _new_shared
    return cls._new_using_filename(size)
RuntimeError: unable to write to file </torch_76_3625483894> at /pytorch/aten/src/TH/THAllocator.c:383

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 35, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 178, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 87) is killed by signal: Bus error.
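
The failing call is `_new_shared`, and the worker dies with `Bus error`. This combination commonly points at exhausted shared memory (`/dev/shm`), which DataLoader worker processes use to hand batches to the main process. The sketch below checks the free space and falls back to `num_workers=0` when it looks too small. The 1 GiB threshold and both helper names are assumptions for illustration, not values taken from the report.

```python
import shutil


def pick_num_workers(requested, shm_free, min_free=1 << 30):
    """Fall back to in-process loading when shared memory looks too small.

    The 1 GiB default threshold is an arbitrary assumption.
    """
    if shm_free < min_free:
        return 0  # with num_workers=0 batches are built in the main process
    return requested


def shm_free_bytes(path="/dev/shm"):
    """Free bytes on the shared-memory mount, or 0 if it cannot be queried."""
    try:
        return shutil.disk_usage(path).free
    except OSError:
        return 0


if __name__ == "__main__":
    workers = pick_num_workers(4, shm_free_bytes())
    print("num_workers =", workers)
    # Then: torch.utils.data.DataLoader(dataset, num_workers=workers, ...)
```

Loading in the main process is slower, so this trades throughput for a job that completes. Reducing the batch size is another way to lower shared-memory use.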

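TensorFlow 1.8 job repeatedly prints aws_logging messages

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Start a training job on TensorFlow 1.8 and use the tf.gfile module in the code to access OBS.
(Basic environment variables such as the AK/SK are already set in DLS.)
After the training job starts, the following log messages are printed frequently.

Job information

  • Job type: Training job

  • Job ID:

  • Engine type: (TensorFlow or MXNet): TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output log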

2018-07-17 11:29:04.024753: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:04.025943: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:04.026045: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.556230: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.556383: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.558030: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.558174: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.582349: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.582455: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.583321: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.583429: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.585176: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.585264: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.586086: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.586174: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.589516: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.589625: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:05.590120: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:05.590219: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:07.076022: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
2018-07-17 11:29:07.076192: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-07-17 11:29:07.078109: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
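
Every line above carries the `I` (INFO) severity tag of TensorFlow's C++ logger, whose verbosity is controlled by the `TF_CPP_MIN_LOG_LEVEL` environment variable. A minimal sketch for hiding INFO-level C++ messages follows. The helper name is hypothetical, and the setting hides all C++ INFO output, not only the `aws_logging` lines.

```python
import os


def quiet_tf_cpp_info(level=1):
    """Set TF_CPP_MIN_LOG_LEVEL: 0 = show all, 1 = hide INFO, 2 = also hide WARNING."""
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = str(level)
    return os.environ["TF_CPP_MIN_LOG_LEVEL"]


# Call this before the first `import tensorflow`, so the variable is already
# in place when the C++ runtime starts logging.
quiet_tf_cpp_info()
# import tensorflow as tf  # import only after the variable is set
```

Setting the same variable in the job's environment configuration, if the platform allows it, should have the same effect without touching the code.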

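Client code for a TFServing prediction job fails with Connect Failed

Basic information

  • Python version: 2.7
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Created a prediction job with the TFServing engine. The job shows as running, but the sample client code cannot connect to it.

Job information

  • Job type: Prediction job

  • Job ID:

  • Engine type: TF

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output log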

Command: python dls-tfserving-client/python/predict.py --task_type="image_classification" --host=117.78.40.73 --port=30955 --data_path="/data/aicenter_eric/learn/tensorflow/test/0_pb1.b276.ps1521.ps1521-s.page2.c_20170308143640954568336910845.jpg" --labels_file_path="dls-tfserving-client/data/flowers/labels.txt" --model_name="resnet_v1_50"
Error: grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.UNAVAILABLE, details="Connect Failed")
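
`StatusCode.UNAVAILABLE` with "Connect Failed" means the gRPC channel could not be established. A quick way to separate a network problem (wrong host or port, firewall, proxy) from a serving problem is to test plain TCP reachability first. The sketch below does only that. It does not speak gRPC, and the helper name `can_connect` is hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure
        return False


# Example with the host and port from the command in the report:
#   can_connect("117.78.40.73", 30955)
```

If this returns False from the client machine, the request never reaches the serving process, so the model name and input data are not the cause.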

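Error when saving the trained model with a callback

Basic information

  • Python version: (3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

(Briefly describe the problem. If it is a bug, describe the steps to reproduce it.)

Job information

  • Job type:

  • Job ID:

  • Engine type: (TensorFlow or MXNet)

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output log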

Traceback (most recent call last):
  File "CnnModel/DCNN.py", line 152, in <module>
    model = cnn_model()
  File "CnnModel/DCNN.py", line 66, in cnn_model
    callbacks=[check],
  File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1042, in fit
    validation_steps=validation_steps)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 219, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 77, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 452, in on_epoch_end
    filepath = self.filepath.format(epoch=epoch + 1, **logs)
AttributeError: 'File' object has no attribute 'format'
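
The last frame calls `self.filepath.format(...)`, so the checkpoint callback expects `filepath` to be a string template. The `AttributeError` suggests a file object was passed instead. The sketch below mirrors that formatting step with a plain string and with a file-like object. The template path is a hypothetical local path, and the helper only imitates the line in the traceback rather than calling Keras.

```python
import io


def format_checkpoint_path(filepath, epoch, logs):
    """Mirror the formatting step seen in the traceback (epoch is 0-based)."""
    return filepath.format(epoch=epoch + 1, **logs)


# A string template works.
template = "/cache/weights.{epoch:02d}-{val_loss:.2f}.hdf5"  # hypothetical path
print(format_checkpoint_path(template, 2, {"val_loss": 0.25}))

# A file-like object has no .format(), which reproduces the reported error type.
try:
    format_checkpoint_path(io.BytesIO(), 2, {"val_loss": 0.25})
except AttributeError as err:
    print("AttributeError:", err)
```

One possible workaround, assuming the goal is to store checkpoints in object storage, is to pass a local string path to the callback and copy the resulting files to the bucket after training.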

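TensorFlow training job on the public cloud fails with: truncated record at xxxxx

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Submitted a training job on the public cloud (for example, the iceberg recognition project). While reading TFRecord data from OBS during training, an error occurs. The error message looks roughly like this: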

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, truncated record at 9832454
	 [[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]

Job information

  • Job type: Training job

  • Job ID:

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output log

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, truncated record at 9832454
	 [[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
2018-08-30 13:29:29.379592: W tensorflow/core/framework/op_kernel.cc:1192] Internal: : Unable to connect to endpoint
Traceback (most recent call last):
  File "code/train_iceberg.py", line 164, in <module>
    save_model_secs=120) 
  File "moxing/tensorflow/executor/learning_builder.pyx", line 379, in moxing.tensorflow.executor.learning_builder.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 228, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper.run
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 287, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._profile_train
  File "moxing/tensorflow/executor/learning_wrapper.pyx", line 484, in moxing.tensorflow.executor.learning_wrapper.LearningWrapper._run
  File "moxing/tensorflow/executor/learning.pyx", line 649, in moxing.tensorflow.executor.learning.Learning.run
  File "moxing/tensorflow/executor/learning.pyx", line 1117, in moxing.tensorflow.executor.learning.Learning.training
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
    self._sess.close()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
    ignore_live_threads=True)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
    target_list_as_strings, status, None)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 9832454
	 [[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
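
The log shows both `truncated record` and `Unable to connect to endpoint`, so the file itself may be intact and the read may have been cut short by a connection failure. One way to tell the two apart is to check a local copy of the file. A TFRecord file is a sequence of records, each framed as an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. The sketch below walks that framing and reports where it stops being complete. It checks lengths only and skips CRC verification, and the function name is hypothetical.

```python
import os
import struct


def find_truncation(path):
    """Return (complete_records, offset_of_incomplete_record or None)."""
    size = os.path.getsize(path)
    count = 0
    offset = 0
    with open(path, "rb") as f:
        while offset < size:
            f.seek(offset)
            header = f.read(12)              # uint64 length + uint32 length CRC
            if len(header) < 12:
                return count, offset
            (length,) = struct.unpack("<Q", header[:8])
            end = offset + 12 + length + 4   # payload + uint32 payload CRC
            if end > size:
                return count, offset
            count += 1
            offset = end
    return count, None
```

If a local copy of the file passes this check, a transient read failure from object storage becomes the more likely explanation than a damaged dataset.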

Error when reading .npy/.npz files using moxing

Basic information

  • Python version: (3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

Error when reading .npy/.npz files using moxing

Error log:
Traceback (most recent call last):
File "src/main_kaggle.py", line 12, in <module>
train_img, train_mask = read_train_data(mod)
File "/home/work/user-job-dir/src/data_util.py", line 18, in read_train_data
X_train = np.load(mox.file.read('s3://bucket-medical/ISLES/npy_data/train_'+MODULE_SELECTED+'_img.npy', binary=True))
File "/usr/local/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 404, in load
magic = fid.read(N)
AttributeError: 'bytes' object has no attribute 'read'

Job information

  • Job type:

  • Job ID:

  • Engine type: (TensorFlow)

  • Run parameters:

  • Number of compute nodes:

  • Compute node specification:

Related source code / Output log

Source code:
X_train = np.load(mox.file.read('s3://bucket-medical/ISLES/npy_data/train_'+MODULE_SELECTED+'_img.npy', binary=True))

Error log:
Traceback (most recent call last):
File "src/main_kaggle.py", line 12, in <module>
train_img, train_mask = read_train_data(mod)
File "/home/work/user-job-dir/src/data_util.py", line 18, in read_train_data
X_train = np.load(mox.file.read('s3://bucket-medical/ISLES/npy_data/train_'+MODULE_SELECTED+'_img.npy', binary=True))
File "/usr/local/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 404, in load
magic = fid.read(N)
AttributeError: 'bytes' object has no attribute 'read'
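
The traceback shows `np.load` calling `fid.read(N)` on its argument, so it needs a path or a file-like object, while `mox.file.read(..., binary=True)` evidently returned raw `bytes`. Wrapping the bytes in `io.BytesIO` gives them the `read` and `seek` methods that `np.load` uses. This is a minimal sketch: both helper names are hypothetical, and NumPy is imported lazily so the wrapper itself depends only on the standard library.

```python
import io

NPY_MAGIC = b"\x93NUMPY"  # first six bytes of every .npy file


def to_file_like(payload):
    """Wrap raw bytes in a seekable file-like object."""
    return io.BytesIO(payload)


def load_npy_bytes(payload):
    """Load an array from the bytes of a .npy/.npz file (requires NumPy)."""
    import numpy as np
    return np.load(to_file_like(payload))


# In the reported code this would read, roughly:
#   data = mox.file.read(path, binary=True)   # returns bytes, per the traceback
#   X_train = load_npy_bytes(data)
```

This keeps the whole file in memory, which is fine for arrays that fit in RAM. For very large files, copying to local disk first and loading from the local path may be preferable.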

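numpy.load fails to read a file

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (leave blank if not used)
  • Browser:

Problem description / Steps to reproduce

(Briefly describe the problem. If it is a bug, describe the steps to reproduce it.)

Job information

  • Job type: Training

  • Job ID: prejob-92946102

  • Engine type: TensorFlow

  • Run parameters:

  • Number of compute nodes: 1

  • Compute node specification: 8 cores | 64 GiB | 1*P100

Related source code / Output log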

INFO:tensorflow:RGB checkpoint restored
Traceback (most recent call last):
  File "kinetics-i3d-master/evaluate_sample.py", line 146, in <module>
    tf.app.run(main)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "kinetics-i3d-master/evaluate_sample.py", line 113, in main
    rgb_sample = np.load(_SAMPLE_PATHS['rgb'])
  File "/usr/local/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 372, in load
    fid = open(file, "rb")
IOError: [Errno 2] No such file or directory: 's3://bucket-3216/train_data/data/v_CricketShot_g04_c01_rgb.npy'
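
As the traceback shows, `np.load` hands a string path to the built-in `open`, which only understands local file systems, so an `s3://` URL is reported as a missing file. A possible workaround is to copy the object to local disk first and load the local copy. In the sketch below the copy routine is passed in by the caller. Which routine is available on the platform (for example a MoXing or `tf.gfile` copy function) is an assumption, and `localize` and the `/cache` directory are hypothetical names.

```python
import os


def localize(path, copy_fn, cache_dir="/cache"):
    """Return a local path for `path`, copying it first if it is an s3:// URL.

    copy_fn(src, dst) is supplied by the caller and performs the download.
    """
    if not path.startswith("s3://"):
        return path  # already local, nothing to do
    local_path = os.path.join(cache_dir, os.path.basename(path))
    copy_fn(path, local_path)
    return local_path


# Usage, roughly:
#   rgb_sample = np.load(localize(_SAMPLE_PATHS['rgb'], some_copy_function))
```

Files with the same base name from different prefixes would overwrite each other in this sketch, so a real job may want to mirror the remote directory layout under the cache directory.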
