Comments (18)
Did you set distributed: True? If so, please delete this parameter or set it to False:
nas:
    trainer:
        distributed: False
from vega.
I only set "pipeline: [fully_train]", so I am running just the fully_train step. When I set "distributed: True", it reports the above errors.
from vega.
This is a bug. Please change run_cluster_horovod_train.sh to run_horovod_train.sh in line 146 of the vega/core/pipeline/fully_train_pipe_step.py file.
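For reference, a minimal sketch of what that one-line change might look like, assuming the script path is built relative to the pipeline module; the variable names below are illustrative, not copied from the release, so match them against the actual code around line 146:

import os

# vega/core/pipeline/fully_train_pipe_step.py, around line 146 (sketch only).
pwd_dir = os.path.dirname(os.path.abspath(__file__))
# before: refers to a cluster launch script that is not shipped in the package
# launch_script = os.path.join(pwd_dir, "horovod", "run_cluster_horovod_train.sh")
# after: use the Horovod launch script that is actually included
launch_script = os.path.join(pwd_dir, "horovod", "run_horovod_train.sh")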
from vega.
Thank you for your reply!
from vega.
After I applied the above fix, another error appeared: "RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message."
from vega.
Is there any other error before this message? Please provide more logs.
from vega.
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 234, in connect
Sat Jan 30 08:17:46 2021[0]: _raise(error)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[0]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[0]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f259f203048>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 234, in connect
Sat Jan 30 08:17:46 2021[1]: _raise(error)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[1]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[1]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f3eccac1f60>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[1]:
Sat Jan 30 08:17:46 2021[1]:During handling of the above exception, another exception occurred:
Sat Jan 30 08:17:46 2021[1]: record = Report().receive(self.trainer.step_name, self.trainer.worker_id)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/report.py", line 192, in receive
Sat Jan 30 08:17:46 2021[1]: value = ShareMemory("{}.{}".format(step_name, worker_id)).get()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 69, in __init__
Sat Jan 30 08:17:46 2021[1]: self.var = Variable(name, client=ShareMemoryClient().client)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/common/utils.py", line 39, in get_instance
Sat Jan 30 08:17:46 2021[1]: instances[cls] = cls(*args, **kw)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 45, in __init__
Sat Jan 30 08:17:46 2021[1]: self._client = Client(address=address)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 736, in __init__
Sat Jan 30 08:17:46 2021[1]: self.start(timeout=timeout)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 940, in start
Sat Jan 30 08:17:46 2021[1]: sync(self.loop, self._start, **kwargs)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
Sat Jan 30 08:17:46 2021[1]: raise exc.with_traceback(tb)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
Sat Jan 30 08:17:46 2021[1]: result[0] = yield future
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
Sat Jan 30 08:17:46 2021[1]: value = future.result()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 1037, in _start
Sat Jan 30 08:17:46 2021[1]: await self._ensure_connected(timeout=timeout)
from vega.
Can the entire log file be attached? Thanks!
from vega.
Sat Jan 30 08:15:57 2021[0]:INFO:root:worker id [0], epoch [1/1000], train step [530/542], loss [ 5.755, 9.083], lr [ 0.000]
Sat Jan 30 08:16:11 2021[0]:INFO:root:worker id [0], epoch [1/1000], train step [540/542], loss [ 5.659, 9.017], lr [ 0.000]
Sat Jan 30 08:17:35 2021[0]:INFO:root:worker id [0], epoch [1/1000], current valid perfs [SRMetric: 30.339], best valid perfs [SRMetric: 30.339]
Sat Jan 30 08:17:46 2021[0]:Traceback (most recent call last):
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 234, in connect
Sat Jan 30 08:17:46 2021[0]: _raise(error)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[0]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[0]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f259f203048>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[0]:
Sat Jan 30 08:17:46 2021[0]:During handling of the above exception, another exception occurred:
Sat Jan 30 08:17:46 2021[0]:
Sat Jan 30 08:17:46 2021[0]:Traceback (most recent call last):
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/vega/core/pipeline/horovod/horovod_train.py", line 44, in <module>
Sat Jan 30 08:17:46 2021[0]: trainer.train_process()
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 153, in train_process
Sat Jan 30 08:17:46 2021[0]: self._train_loop()
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 291, in _train_loop
Sat Jan 30 08:17:46 2021[0]: self.callbacks.after_epoch(epoch)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/callback_list.py", line 185, in after_epoch
Sat Jan 30 08:17:46 2021[0]: callback.after_epoch(epoch, logs)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/report_callback.py", line 36, in after_epoch
Sat Jan 30 08:17:46 2021[0]: self._broadcast(epoch)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/report_callback.py", line 44, in _broadcast
Sat Jan 30 08:17:46 2021[0]: record = Report().receive(self.trainer.step_name, self.trainer.worker_id)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/report/report.py", line 192, in receive
Sat Jan 30 08:17:46 2021[0]: value = ShareMemory("{}.{}".format(step_name, worker_id)).get()
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 69, in __init__
Sat Jan 30 08:17:46 2021[0]: self.var = Variable(name, client=ShareMemoryClient().client)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/common/utils.py", line 39, in get_instance
Sat Jan 30 08:17:46 2021[0]: instances[cls] = cls(*args, **kw)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 45, in __init__
Sat Jan 30 08:17:46 2021[0]: self._client = Client(address=address)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 736, in __init__
Sat Jan 30 08:17:46 2021[0]: self.start(timeout=timeout)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 940, in start
Sat Jan 30 08:17:46 2021[0]: sync(self.loop, self._start, **kwargs)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
Sat Jan 30 08:17:46 2021[0]: raise exc.with_traceback(tb)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
Sat Jan 30 08:17:46 2021[0]: result[0] = yield future
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
Sat Jan 30 08:17:46 2021[0]: value = future.result()
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 1037, in _start
Sat Jan 30 08:17:46 2021[0]: await self._ensure_connected(timeout=timeout)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 1092, in _ensure_connected
from vega.
Sat Jan 30 08:17:46 2021[0]: self.scheduler.address, timeout=timeout, **self.connection_args
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 245, in connect
Sat Jan 30 08:17:46 2021[0]: _raise(error)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[0]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[0]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f259f203048>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[1]:Traceback (most recent call last):
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 234, in connect
Sat Jan 30 08:17:46 2021[1]: _raise(error)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[1]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[1]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f3eccac1f60>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[1]:
Sat Jan 30 08:17:46 2021[1]:During handling of the above exception, another exception occurred:
Sat Jan 30 08:17:46 2021[1]:
Sat Jan 30 08:17:46 2021[1]:Traceback (most recent call last):
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/vega/core/pipeline/horovod/horovod_train.py", line 44, in <module>
Sat Jan 30 08:17:46 2021[1]: trainer.train_process()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 153, in train_process
Sat Jan 30 08:17:46 2021[1]: self._train_loop()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 291, in _train_loop
Sat Jan 30 08:17:46 2021[1]: self.callbacks.after_epoch(epoch)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/callback_list.py", line 185, in after_epoch
Sat Jan 30 08:17:46 2021[1]: callback.after_epoch(epoch, logs)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/report_callback.py", line 36, in after_epoch
Sat Jan 30 08:17:46 2021[1]: self._broadcast(epoch)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/report_callback.py", line 44, in _broadcast
Sat Jan 30 08:17:46 2021[1]: record = Report().receive(self.trainer.step_name, self.trainer.worker_id)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/report.py", line 192, in receive
Sat Jan 30 08:17:46 2021[1]: value = ShareMemory("{}.{}".format(step_name, worker_id)).get()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 69, in __init__
Sat Jan 30 08:17:46 2021[1]: self.var = Variable(name, client=ShareMemoryClient().client)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/common/utils.py", line 39, in get_instance
Sat Jan 30 08:17:46 2021[1]: instances[cls] = cls(*args, **kw)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 45, in __init__
Sat Jan 30 08:17:46 2021[1]: self._client = Client(address=address)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 736, in __init__
Sat Jan 30 08:17:46 2021[1]: self.start(timeout=timeout)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 940, in start
Sat Jan 30 08:17:46 2021[1]: sync(self.loop, self._start, **kwargs)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
Sat Jan 30 08:17:46 2021[1]: raise exc.with_traceback(tb)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
Sat Jan 30 08:17:46 2021[1]: result[0] = yield future
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
Sat Jan 30 08:17:46 2021[1]: value = future.result()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 1037, in _start
Sat Jan 30 08:17:46 2021[1]: await self._ensure_connected(timeout=timeout)
from vega.
The above is the entire log file. It seems that when the first epoch finished, it raised the "ConnectionRefusedError: [Errno 111] Connection refused" error.
from vega.
Have you solved the above issue?
from vega.
We're debugging. Please wait a moment.
@menglifenglin
from vega.
Can the esr_ea.yml file be shared?
from vega.
general:
    # parallel_search: True
    parallel_fully_train: True
    backend: pytorch
    local_base_path: /data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/examples/tasks/

pipeline: [fully_train]

fully_train:
    pipe_step:
        type: FullyTrainPipeStep
    trainer:
        type: Trainer
        callbacks: ESRTrainerCallback
        node_num: 20
        epochs: 1000
        distributed: True
        optimizer:
            type: Adam
            params:
                lr: 0.0001
        lr_scheduler:
            type: MultiStepLR
            params:
                milestones: [8000,12000,13500,14500]
                gamma: 0.5
        loss:
            type: L1Loss
        metric:
            type: SRMetric
            params:
                scale: 2
                max_rgb: 255
        scale: 2
        seed: 10
        range:
            node_num: 20
    model:
        models_folder: /data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/tasks3/0112.112555.518/output/nas/
    dataset:
        type: DIV2K
        train:
            root_HR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_train/hr/
            root_LR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_train/lr/
            upscale: 2
            crop: 64
            hflip: true
            vflip: true
            rot90: true
            shuffle: false
            batch_size: 16
            fixed_size: true
        test:
            root_HR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_valid/hr/
            root_LR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_valid/lr/
            upscale: 2
            fixed_size: true
            crop: 64
from vega.
Please try this configuration:
Comment out 'parallel_fully_train: True', and
replace "models_folder" with "model_desc_file", specifying a model description file such as desc_2.json.
general:
    # parallel_search: True
    # parallel_fully_train: True
    backend: pytorch
    local_base_path: /data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/examples/tasks/

pipeline: [fully_train]

fully_train:
    pipe_step:
        type: FullyTrainPipeStep
    trainer:
        type: Trainer
        callbacks: ESRTrainerCallback
        node_num: 20
        epochs: 1000
        distributed: True
        optimizer:
            type: Adam
            params:
                lr: 0.0001
        lr_scheduler:
            type: MultiStepLR
            params:
                milestones: [8000,12000,13500,14500]
                gamma: 0.5
        loss:
            type: L1Loss
        metric:
            type: SRMetric
            params:
                scale: 2
                max_rgb: 255
        scale: 2
        seed: 10
        range:
            node_num: 20
    model:
        # models_folder: /data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/tasks3/0112.112555.518/output/nas/
        model_desc_file: "/data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/tasks3/0112.112555.518/output/nas/desc_<worker id>.json"
    dataset:
        type: DIV2K
        train:
            root_HR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_train/hr/
            root_LR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_train/lr/
            upscale: 2
            crop: 64
            hflip: true
            vflip: true
            rot90: true
            shuffle: false
            batch_size: 16
            fixed_size: true
        test:
            root_HR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_valid/hr/
            root_LR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_valid/lr/
            upscale: 2
            fixed_size: true
            crop: 64
from vega.
It works well, thanks!
from vega.
After the training is complete, the following warning is displayed:
<stderr>:WARNING:root:model statics failed, ex= 'NoneType' object has no attribute 'get'
The model is not in the output directory; obtain the model from the worker directory. This issue has been fixed in our upcoming release.
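For anyone hitting the same warning, here is a rough sketch of where to look for the worker-side model files, assuming the usual task directory layout under local_base_path (the "workers/fully_train/<worker id>" subfolder names are an assumption and may differ by vega version):

import glob

# Replace <task id> with the timestamped task folder created for this run.
task_dir = "/data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/examples/tasks/<task id>"
# List whatever the fully_train workers wrote (checkpoints, weights, descriptions).
print(glob.glob(task_dir + "/workers/fully_train/*/*"))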
from vega.