杨博士你好,我尝试用你的网络训练自己的数据集,训练的环境是:
tensorflow1.15
cuda10
cudnn7.0.4
可以正常编译和读数据,但是训练到第30多个epoch时训练中断,训练日志如下,希望杨博士能够解答
epoch 32 end time is : 2021-01-08 14:21:29.075227
train files shuffled!
is training ep : 33
total train batch num: 100
ep 33 i 0 psemce 0.0 bbvert -0.23279694 l2 0.060497742 ce 0.26028115 siou -0.5535758 bbscore 0.0038582273 pmask 0.6467816
ep 33 i 0 test psem 0.0 bbvert 1.9851223 l2 0.059121773 ce 2.2534707 siou -0.3274702 bbscore 0.0048341155 pmask 3.4519947
test pred bborder [[2 1 0]]
ep 33 i 20 psemce 0.0 bbvert -0.44303733 l2 0.030050844 ce 0.16678412 siou -0.6398723 bbscore 0.0026201883 pmask 0.63400114
ep 33 i 20 test psem 0.0 bbvert -0.4450612 l2 0.04898257 ce 0.23810822 siou -0.732152 bbscore 0.0016184862 pmask 0.47476012
test pred bborder [[2 0 1]]
ep 33 i 40 psemce 0.0 bbvert 0.43996847 l2 0.08725857 ce 0.8109299 siou -0.45822 bbscore 0.0034747643 pmask 0.9270937
ep 33 i 40 test psem 0.0 bbvert -0.07955924 l2 0.040802542 ce 0.4581066 siou -0.5784684 bbscore 0.0087565 pmask 1.0110209
test pred bborder [[2 0 1]]
ep 33 i 60 psemce 0.0 bbvert 0.057684183 l2 0.071036406 ce 0.5709717 siou -0.58432394 bbscore 0.00046247765 pmask 0.48829234
ep 33 i 60 test psem 0.0 bbvert -0.3817188 l2 0.03684793 ce 0.2770622 siou -0.69562894 bbscore 0.0019779946 pmask 0.54431504
test pred bborder [[2 0 1]]
ep 33 i 80 psemce 0.0 bbvert 0.038050413 l2 0.034902867 ce 0.6071266 siou -0.60397905 bbscore 0.015431552 pmask 0.8978894
ep 33 i 80 test psem 0.0 bbvert 1.1076844 l2 0.07785928 ce 1.3761435 siou -0.34631833 bbscore 0.033992507 pmask 1.7343999
test pred bborder [[2 0 1]]
model saved in : ./log/train_mod/model033.cptk
epoch 33 end time is : 2021-01-08 14:21:44.245053
train files shuffled!
is training ep : 34
total train batch num: 100
ep 34 i 0 psemce 0.0 bbvert -0.41581324 l2 0.057975773 ce 0.28829214 siou -0.76208115 bbscore 0.0003172583 pmask 0.34254307
ep 34 i 0 test psem 0.0 bbvert 1.7912706 l2 0.08668331 ce 2.017744 siou -0.31315675 bbscore 0.00576146 pmask 2.1253805
test pred bborder [[0 2 1]]
ep 34 i 20 psemce 0.0 bbvert -0.14073128 l2 0.034625944 ce 0.4937689 siou -0.6691261 bbscore 0.0047056335 pmask 0.7088615
ep 34 i 20 test psem 0.0 bbvert 1.9534252 l2 0.0907397 ce 2.1705353 siou -0.3078499 bbscore 0.0019757028 pmask 2.682012
test pred bborder [[0 2 1]]
ep 34 i 40 psemce 0.0 bbvert 0.27091432 l2 0.053299602 ce 0.7194822 siou -0.5018675 bbscore 0.001409175 pmask 0.60204554
ep 34 i 40 test psem 0.0 bbvert -0.28416353 l2 0.09286666 ce 0.19349718 siou -0.5705274 bbscore 0.003192804 pmask 0.21589296
test pred bborder [[1 0 2]]
2021-01-08 14:21:52.432564: W tensorflow/core/framework/op_kernel.cc:1639] Invalid argument: ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)
File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only
row_ind, col_ind = linear_sum_assignment(valid_cost)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment
raise ValueError("matrix contains invalid numeric entries")
ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)
File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only
row_ind, col_ind = linear_sum_assignment(valid_cost)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment
raise ValueError("matrix contains invalid numeric entries")
ValueError: matrix contains invalid numeric entries
[[{{node bbox/PyFunc}}]]
[[gradients/backbone/fa_layer1/ThreeInterpolate_grad/ThreeInterpolateGrad/_425]]
(1) Invalid argument: ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)
File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only
row_ind, col_ind = linear_sum_assignment(valid_cost)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment
raise ValueError("matrix contains invalid numeric entries")
ValueError: matrix contains invalid numeric entries