Comments (18)

zhangjiajin commented on May 31, 2024

Did you set distributed: True? Please delete this parameter or set it to False.

nas:
    trainer:
        distributed: False

from vega.

menglifenglin commented on May 31, 2024

I only set "pipeline: [fully_train]", so only the fully_train step runs. When I set "distributed: True", it reports the above errors.

zhangjiajin commented on May 31, 2024

This is a bug. Please change run_cluster_horovod_train.sh to run_horovod_train.sh in line 146 of the vega/core/pipeline/fully_train_pipe_step.py file.
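The one-line change described above could also be applied with a small script. This is only a sketch: `patch_script_name` is a hypothetical helper, not part of vega, and it replaces every occurrence of the old script name in the file rather than only the one on line 146.

```python
from pathlib import Path

# Script names taken from the comment above.
OLD = "run_cluster_horovod_train.sh"
NEW = "run_horovod_train.sh"


def patch_script_name(path):
    """Replace OLD with NEW in the file at `path`.

    Returns the number of occurrences that were replaced (0 if the file
    was already patched or never contained OLD).
    """
    file = Path(path)
    text = file.read_text()
    count = text.count(OLD)
    if count:
        file.write_text(text.replace(OLD, NEW))
    return count
```

It would be called with the location of fully_train_pipe_step.py inside the installed vega package, for example patch_script_name("vega/core/pipeline/fully_train_pipe_step.py"); the exact path depends on the installation.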

menglifenglin commented on May 31, 2024

Thank you for your reply!

menglifenglin commented on May 31, 2024

After I applied the above fix, another error appeared: "RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message."

zhangjiajin commented on May 31, 2024

Is there any other error before this message? Please provide more logs.
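One way to look for earlier errors is to scan the log for the first exception summary line printed by each rank. The sketch below assumes every line is prefixed with a timestamp and a rank in square brackets (for example "Sat Jan 30 08:17:46 2021[0]:"); `first_error_per_rank` is a hypothetical helper, and the two regular expressions are assumptions about that format rather than anything Horovod defines.

```python
import re

# Matches a prefix such as "Sat Jan 30 08:17:46 2021[0]:" and captures the rank.
LINE_RE = re.compile(
    r"^\w{3} \w{3} +\d+ [\d:]{8} \d{4}\[(?P<rank>\d+)\]:(?P<msg>.*)$"
)
# An exception summary line starts with a name ending in "Error" or "Exception".
ERROR_RE = re.compile(r"^\s*\w+(?:Error|Exception): ")


def first_error_per_rank(lines):
    """Return {rank: first exception summary line} for a rank-prefixed log."""
    errors = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without the timestamp/rank prefix
        rank = int(match.group("rank"))
        msg = match.group("msg")
        if rank not in errors and ERROR_RE.match(msg):
            errors[rank] = msg.strip()
    return errors
```

Passing the lines of the log file, for example first_error_per_rank(open("train.log")), where train.log is a placeholder name, gives one entry per rank that raised an exception.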

menglifenglin commented on May 31, 2024

Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 234, in connect
Sat Jan 30 08:17:46 2021[0]: _raise(error)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[0]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[0]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f259f203048>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 234, in connect
Sat Jan 30 08:17:46 2021[1]: _raise(error)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[1]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[1]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f3eccac1f60>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[1]:
Sat Jan 30 08:17:46 2021[1]:During handling of the above exception, another exception occurred:
Sat Jan 30 08:17:46 2021[1]: record = Report().receive(self.trainer.step_name, self.trainer.worker_id)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/report.py", line 192, in receive
Sat Jan 30 08:17:46 2021[1]: value = ShareMemory("{}.{}".format(step_name, worker_id)).get()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 69, in __init__
Sat Jan 30 08:17:46 2021[1]: self.var = Variable(name, client=ShareMemoryClient().client)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/common/utils.py", line 39, in get_instance
Sat Jan 30 08:17:46 2021[1]: instances[cls] = cls(*args, **kw)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 45, in __init__
Sat Jan 30 08:17:46 2021[1]: self._client = Client(address=address)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 736, in __init__
Sat Jan 30 08:17:46 2021[1]: self.start(timeout=timeout)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 940, in start
Sat Jan 30 08:17:46 2021[1]: sync(self.loop, self._start, **kwargs)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
Sat Jan 30 08:17:46 2021[1]: raise exc.with_traceback(tb)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
Sat Jan 30 08:17:46 2021[1]: result[0] = yield future
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
Sat Jan 30 08:17:46 2021[1]: value = future.result()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 1037, in _start
Sat Jan 30 08:17:46 2021[1]: await self._ensure_connected(timeout=timeout)

zhangjiajin commented on May 31, 2024

Can the entire log file be attached? Thanks!

@menglifenglin

menglifenglin commented on May 31, 2024

Sat Jan 30 08:15:57 2021[0]:INFO:root:worker id [0], epoch [1/1000], train step [530/542], loss [ 5.755, 9.083], lr [ 0.000]
Sat Jan 30 08:16:11 2021[0]:INFO:root:worker id [0], epoch [1/1000], train step [540/542], loss [ 5.659, 9.017], lr [ 0.000]
Sat Jan 30 08:17:35 2021[0]:INFO:root:worker id [0], epoch [1/1000], current valid perfs [SRMetric: 30.339], best valid perfs [SRMetric: 30.339]
Sat Jan 30 08:17:46 2021[0]:Traceback (most recent call last):
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 234, in connect
Sat Jan 30 08:17:46 2021[0]: _raise(error)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[0]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[0]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f259f203048>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[0]:
Sat Jan 30 08:17:46 2021[0]:During handling of the above exception, another exception occurred:
Sat Jan 30 08:17:46 2021[0]:
Sat Jan 30 08:17:46 2021[0]:Traceback (most recent call last):
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/vega/core/pipeline/horovod/horovod_train.py", line 44, in
Sat Jan 30 08:17:46 2021[0]: trainer.train_process()
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 153, in train_process
Sat Jan 30 08:17:46 2021[0]: self._train_loop()
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 291, in _train_loop
Sat Jan 30 08:17:46 2021[0]: self.callbacks.after_epoch(epoch)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/callback_list.py", line 185, in after_epoch
Sat Jan 30 08:17:46 2021[0]: callback.after_epoch(epoch, logs)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/report_callback.py", line 36, in after_epoch
Sat Jan 30 08:17:46 2021[0]: self._broadcast(epoch)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/report_callback.py", line 44, in _broadcast
Sat Jan 30 08:17:46 2021[0]: record = Report().receive(self.trainer.step_name, self.trainer.worker_id)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/report/report.py", line 192, in receive
Sat Jan 30 08:17:46 2021[0]: value = ShareMemory("{}.{}".format(step_name, worker_id)).get()
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 69, in __init__
Sat Jan 30 08:17:46 2021[0]: self.var = Variable(name, client=ShareMemoryClient().client)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/common/utils.py", line 39, in get_instance
Sat Jan 30 08:17:46 2021[0]: instances[cls] = cls(*args, **kw)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 45, in __init__
Sat Jan 30 08:17:46 2021[0]: self._client = Client(address=address)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 736, in __init__
Sat Jan 30 08:17:46 2021[0]: self.start(timeout=timeout)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 940, in start
Sat Jan 30 08:17:46 2021[0]: sync(self.loop, self._start, **kwargs)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
Sat Jan 30 08:17:46 2021[0]: raise exc.with_traceback(tb)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
Sat Jan 30 08:17:46 2021[0]: result[0] = yield future
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
Sat Jan 30 08:17:46 2021[0]: value = future.result()
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 1037, in _start
Sat Jan 30 08:17:46 2021[0]: await self._ensure_connected(timeout=timeout)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 1092, in _ensure_connected

menglifenglin commented on May 31, 2024

Sat Jan 30 08:17:46 2021[0]: self.scheduler.address, timeout=timeout, **self.connection_args
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 245, in connect
Sat Jan 30 08:17:46 2021[0]: _raise(error)
Sat Jan 30 08:17:46 2021[0]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[0]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[0]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f259f203048>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[1]:Traceback (most recent call last):
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 234, in connect
Sat Jan 30 08:17:46 2021[1]: _raise(error)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in _raise
Sat Jan 30 08:17:46 2021[1]: raise IOError(msg)
Sat Jan 30 08:17:46 2021[1]:OSError: Timed out trying to connect to 'tcp://127.0.0.1:8000' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f3eccac1f60>: ConnectionRefusedError: [Errno 111] Connection refused
Sat Jan 30 08:17:46 2021[1]:
Sat Jan 30 08:17:46 2021[1]:During handling of the above exception, another exception occurred:
Sat Jan 30 08:17:46 2021[1]:
Sat Jan 30 08:17:46 2021[1]:Traceback (most recent call last):
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/vega/core/pipeline/horovod/horovod_train.py", line 44, in
Sat Jan 30 08:17:46 2021[1]: trainer.train_process()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 153, in train_process
Sat Jan 30 08:17:46 2021[1]: self._train_loop()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer_base.py", line 291, in _train_loop
Sat Jan 30 08:17:46 2021[1]: self.callbacks.after_epoch(epoch)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/callback_list.py", line 185, in after_epoch
Sat Jan 30 08:17:46 2021[1]: callback.after_epoch(epoch, logs)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/report_callback.py", line 36, in after_epoch
Sat Jan 30 08:17:46 2021[1]: self._broadcast(epoch)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/trainer/callbacks/report_callback.py", line 44, in _broadcast
Sat Jan 30 08:17:46 2021[1]: record = Report().receive(self.trainer.step_name, self.trainer.worker_id)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/report.py", line 192, in receive
Sat Jan 30 08:17:46 2021[1]: value = ShareMemory("{}.{}".format(step_name, worker_id)).get()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 69, in __init__
Sat Jan 30 08:17:46 2021[1]: self.var = Variable(name, client=ShareMemoryClient().client)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/common/utils.py", line 39, in get_instance
Sat Jan 30 08:17:46 2021[1]: instances[cls] = cls(*args, **kw)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/zeus/report/share_memory.py", line 45, in __init__
Sat Jan 30 08:17:46 2021[1]: self._client = Client(address=address)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 736, in __init__
Sat Jan 30 08:17:46 2021[1]: self.start(timeout=timeout)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 940, in start
Sat Jan 30 08:17:46 2021[1]: sync(self.loop, self._start, **kwargs)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
Sat Jan 30 08:17:46 2021[1]: raise exc.with_traceback(tb)
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
Sat Jan 30 08:17:46 2021[1]: result[0] = yield future
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/tornado/gen.py", line 762, in run
Sat Jan 30 08:17:46 2021[1]: value = future.result()
Sat Jan 30 08:17:46 2021[1]: File "/root/.local/lib/python3.6/site-packages/distributed/client.py", line 1037, in _start
Sat Jan 30 08:17:46 2021[1]: await self._ensure_connected(timeout=timeout)

menglifenglin commented on May 31, 2024

The above is the entire log file. It seems that when the first epoch finished, it raised the "ConnectionRefusedError: [Errno 111] Connection refused" error.
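Since the traceback shows the `distributed` client failing to reach tcp://127.0.0.1:8000, a quick check is whether anything is listening at that address on the machine where the rank runs. The following is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name, and a successful connection only shows that the port is open, not that the expected service is behind it.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including ConnectionRefusedError
        # and timeouts) when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the address in the log, the call would be can_connect("127.0.0.1", 8000).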

menglifenglin commented on May 31, 2024

Have you solved the above problem?

zhangjiajin commented on May 31, 2024

We're debugging. Please wait a moment.
@menglifenglin

zhangjiajin commented on May 31, 2024

Can the esr_ea.yml file be shared?

@menglifenglin

menglifenglin commented on May 31, 2024

general:
#     parallel_search: True
    parallel_fully_train: True
    backend: pytorch
    local_base_path: /data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/examples/tasks/
pipeline: [fully_train]

        
fully_train:
    pipe_step:
        type: FullyTrainPipeStep

    trainer:
        type: Trainer
        callbacks: ESRTrainerCallback
        node_num: 20
        epochs: 1000
        distributed: True
        optimizer:
            type: Adam
            params:
                lr: 0.0001
        lr_scheduler:
            type: MultiStepLR
            params:
                milestones: [8000,12000,13500,14500]
                gamma: 0.5

        loss:
            type: L1Loss
        metric:
            type: SRMetric
            params:
                scale: 2 
                max_rgb: 255
        scale: 2
        seed: 10
        range:
            node_num: 20
            
    model: 
        models_folder: /data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/tasks3/0112.112555.518/output/nas/ 
        
    dataset:
        type: DIV2K
        train:
            root_HR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_train/hr/
            root_LR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_train/lr/
            upscale: 2
            crop: 64
            hflip: true
            vflip: true
            rot90: true 
            shuffle: false
            batch_size: 16
            fixed_size: true
        test:
            root_HR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_valid/hr/
            root_LR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_valid/lr/
            upscale: 2
            fixed_size: true
            crop: 64

zhangjiajin commented on May 31, 2024

Please try this configuration:

Comment out "parallel_fully_train: True", so that it reads '# parallel_fully_train: True'.
Replace "models_folder" with "model_desc_file", and specify a model description file, such as desc_2.json.

general:
#     parallel_search: True
#     parallel_fully_train: True
    backend: pytorch
    local_base_path: /data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/examples/tasks/
pipeline: [fully_train]

        
fully_train:
    pipe_step:
        type: FullyTrainPipeStep

    trainer:
        type: Trainer
        callbacks: ESRTrainerCallback
        node_num: 20
        epochs: 1000
        distributed: True
        optimizer:
            type: Adam
            params:
                lr: 0.0001
        lr_scheduler:
            type: MultiStepLR
            params:
                milestones: [8000,12000,13500,14500]
                gamma: 0.5

        loss:
            type: L1Loss
        metric:
            type: SRMetric
            params:
                scale: 2 
                max_rgb: 255
        scale: 2
        seed: 10
        range:
            node_num: 20
            
    model: 
#        models_folder: /data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/tasks3/0112.112555.518/output/nas/ 
        model_desc_file: "/data/glusterfs_hz_cv_v2/11127485/NAS_test/vega-master/tasks3/0112.112555.518/output/nas/desc_<worker id>.json"
        
    dataset:
        type: DIV2K
        train:
            root_HR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_train/hr/
            root_LR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_train/lr/
            upscale: 2
            crop: 64
            hflip: true
            vflip: true
            rot90: true 
            shuffle: false
            batch_size: 16
            fixed_size: true
        test:
            root_HR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_valid/hr/
            root_LR: /data/glusterfs_hz_cv_v2/11127485/dataset/div2k/div2k_valid/lr/
            upscale: 2
            fixed_size: true
            crop: 64
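The two edits in this configuration can also be checked programmatically once the YAML file has been parsed into a nested dict. The sketch below is only an illustration: `review_fully_train_config` is a hypothetical helper, not part of vega, and tying both checks to "distributed: True" is an assumption based on this thread rather than a documented rule.

```python
def review_fully_train_config(cfg):
    """Return notes on settings that the suggested configuration changes.

    `cfg` is the configuration as a nested dict. The checks mirror the two
    edits suggested in this thread and nothing more.
    """
    notes = []
    general = cfg.get("general") or {}
    fully_train = cfg.get("fully_train") or {}
    trainer = fully_train.get("trainer") or {}
    model = fully_train.get("model") or {}

    # Assumption: the edits matter when Horovod training is enabled.
    if not trainer.get("distributed"):
        return notes
    if general.get("parallel_fully_train"):
        notes.append("comment out general.parallel_fully_train")
    if "models_folder" in model:
        notes.append("replace fully_train.model.models_folder with model_desc_file")
    return notes
```

An empty list means neither of the two settings was found in its original form.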

menglifenglin commented on May 31, 2024

It works well, thanks!

zhangjiajin commented on May 31, 2024

After training completes, the following warning is displayed:

<stderr>:WARNING:root:model statics failed, ex= 'NoneType' object has no attribute'get'

The model is not in the output directory. Obtain the model from the worker directory.

This issue has been fixed in our upcoming release.

@menglifenglin
