
eda's People

Contributors

yanmin-wu

eda's Issues

Question about Table 1 and the joint_det setting

Dear authors,

Thanks for your great work. I have a question about the --joint_det setting. It is set in all the scripts, which means the model is always trained with additional utterance data constructed from ScanNet object labels.

Is this a widely used protocol in 3D visual grounding? And do the results in Tab. 1 come from the joint_det setting?

Best,

Question of visualization

Thanks for your excellent work. I have run your code, but because I entered this research field only recently, I have trouble visualizing your results, such as the detection boxes in Figure 1. Would you please release the visualization code? Thank you very much!

Should we add `self.text_encoder.eval()`?

To freeze the text encoder, you set param.requires_grad to False:

self.tokenizer = RobertaTokenizerFast.from_pretrained(t_type, local_files_only=True)
self.text_encoder = RobertaModel.from_pretrained(t_type, local_files_only=True)
for param in self.text_encoder.parameters():
    param.requires_grad = False

However, requires_grad = False only stops gradient updates. To make sure the Dropout layers in RobertaModel behave as in evaluation (no neurons are dropped at random), we also need to add self.text_encoder.eval() at the end of the code above:

self.tokenizer = RobertaTokenizerFast.from_pretrained(t_type, local_files_only=True)
self.text_encoder = RobertaModel.from_pretrained(t_type, local_files_only=True)
for param in self.text_encoder.parameters():
    param.requires_grad = False
self.text_encoder.eval()
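
One related detail: calling .train() on the parent module switches every submodule back to training mode, including the frozen text encoder, so a single eval() call in the constructor may not persist through the training loop. Below is a minimal sketch of one way to keep the encoder in eval mode. The class name, the small stand-in encoder, and the head are hypothetical and are not taken from the repository.

```python
import torch
from torch import nn


class FrozenEncoderModel(nn.Module):
    """Toy model with a frozen sub-encoder (stand-in for RobertaModel)."""

    def __init__(self):
        super().__init__()
        # Hypothetical stand-in: a linear layer followed by dropout.
        self.text_encoder = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
        self.head = nn.Linear(8, 2)
        # Freeze the weights and switch dropout off.
        for param in self.text_encoder.parameters():
            param.requires_grad = False
        self.text_encoder.eval()

    def train(self, mode=True):
        # nn.Module.train() propagates the mode to all children,
        # so re-apply eval() to the frozen encoder afterwards.
        super().train(mode)
        self.text_encoder.eval()
        return self


model = FrozenEncoderModel()
model.train()
print(model.text_encoder.training, model.head.training)
```

With this override the trainable head follows the usual train/eval switching while the frozen encoder stays deterministic.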

Question about "point_instance_label"

Hi, thanks for your great work.

I have a question about "point_instance_label" that came up while reading the code.

In src/joint_det_dataset.py, function _get_target_boxes:

point_instance_label = -np.ones(len(scan.pc))
for t, tid in enumerate(tids):
    point_instance_label[scan.three_d_objects[tid]['points']] = t

Is there any problem with setting the label of the point to the sequence index t?

In my opinion, the code should be modified as follows:

point_instance_label = -np.ones(len(scan.pc))
for t, tid in enumerate(tids):
    point_instance_label[scan.three_d_objects[tid]['points']] = tid

In this way, all points in the same scene can be correctly classified into the corresponding object IDs.

If we set point_instance_label to t, the label of a point can only be 0 on the ScanRefer dataset when joint_det is false, which leads to logic errors.
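
To make the difference between the two schemes concrete, here is a small sketch on toy data. The scan layout, point indices, and the helper name label_points are made up for illustration. Which scheme is appropriate depends on how the rest of the code consumes the label (as a position in the target list or as a scene-level object ID), which this sketch does not settle.

```python
import numpy as np

# Toy scene: 10 points, three objects with disjoint point indices (hypothetical data).
num_points = 10
three_d_objects = [
    {"points": np.array([0, 1, 2])},
    {"points": np.array([3, 4])},
    {"points": np.array([5, 6, 7])},
]
tids = [2, 0]  # target object IDs, in the order they appear in the target list


def label_points(tids, use_object_id):
    """Label each point with either the object ID (tid) or its position in tids (t)."""
    labels = -np.ones(num_points)
    for t, tid in enumerate(tids):
        labels[three_d_objects[tid]["points"]] = tid if use_object_id else t
    return labels


by_index = label_points(tids, use_object_id=False)
by_object_id = label_points(tids, use_object_id=True)
print(by_index)      # labels are positions in tids: 0 for object 2, 1 for object 0
print(by_object_id)  # labels are scene-level object IDs: 2 and 0
```

With a single target (len(tids) == 1), the first scheme can only produce the labels 0 and -1, which matches the observation above.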

Ask for visualization code

Could I ask you for the visualization code that plots the bounding boxes on the RGB-D data and saves the result as a 2D RGB image, as shown in your figures?

ImportError: Could not import _ext module.

I created the environment as the README describes and started training on SR3D. Here are the errors:

/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
"The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train_dist_mod.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:3333
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=3333
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3/attempt_0/1/error.json
Traceback (most recent call last):
File "/home/sd/Harddisk/sba/BS/EDA-master/pointnet2/pointnet2_utils.py", line 26, in
import pointnet2._ext as _ext
ImportError: /home/sd/.local/lib/python3.7/site-packages/pointnet2-0.0.0-py3.7-linux-x86_64.egg/pointnet2/_ext.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train_dist_mod.py", line 19, in
from main_utils import parse_option, BaseTrainTester
File "/home/sd/Harddisk/sba/BS/EDA-master/main_utils.py", line 27, in
from models import HungarianMatcher, SetCriterion, compute_hungarian_loss
File "/home/sd/Harddisk/sba/BS/EDA-master/models/init.py", line 7, in
from .bdetr import BeaUTyDETR
File "/home/sd/Harddisk/sba/BS/EDA-master/models/bdetr.py", line 18, in
from .backbone_module import Pointnet2Backbone
File "/home/sd/Harddisk/sba/BS/EDA-master/models/backbone_module.py", line 23, in
from pointnet2_modules import PointnetSAModuleVotes, PointnetFPModule
File "/home/sd/Harddisk/sba/BS/EDA-master/pointnet2/pointnet2_modules.py", line 21, in
import pointnet2_utils
File "/home/sd/Harddisk/sba/BS/EDA-master/pointnet2/pointnet2_utils.py", line 30, in
"Could not import _ext module.\n"
ImportError: Could not import _ext module.
Please see the setup instructions in the README: https://github.com/erikwijmans/Pointnet2_PyTorch/blob/master/README.rst
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1972860) of binary: /home/sd/anaconda3/envs/EDA/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=3333
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3/attempt_1/1/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1972896) of binary: /home/sd/anaconda3/envs/EDA/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=3333
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3/attempt_2/1/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1972938) of binary: /home/sd/anaconda3/envs/EDA/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=3333
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_2x3bsv88/none_l9qwazu3/attempt_3/1/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1972968) of binary: /home/sd/anaconda3/envs/EDA/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007307529449462891 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1972968", "role": "default", "hostname": "Continent", "state": "FAILED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "1972969", "role": "default", "hostname": "Continent", "state": "FAILED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [1], "role_rank": [1], "role_world_size": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "Continent", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 3}}
/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py:354: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE                

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 1972968 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
# do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/sd/anaconda3/envs/EDA/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/sd/anaconda3/envs/EDA/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/launch.py", line 173, in
main()
File "/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/run.py", line 624, in run
)(*cmd_args)
File "/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/sd/anaconda3/envs/EDA/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


    train_dist_mod.py FAILED       

=======================================
Root Cause:
[0]:
time: 2023-12-23_12:33:23
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 1972968)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2023-12-23_12:33:23
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 1972969)
error_file: <N/A>
msg: "Process failed with exitcode 1"


Can anybody help me?
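
Two things stand out in the traceback. The extension is loaded from /home/sd/.local/lib/python3.7/site-packages, not from the EDA conda environment, and the missing symbol demangles to std::__exception_ptr::exception_ptr::_M_addref(), which belongs to libstdc++. One plausible cause (an assumption, not a confirmed diagnosis) is that the pointnet2 extension was compiled against a different compiler runtime or PyTorch build than the one active at run time. The sketch below only reports where Python would load each module from, which can help confirm whether a stale build outside the environment is being picked up. The helper name module_origin is hypothetical.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a module would be loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


print("python executable:", sys.executable)
for name in ("torch", "pointnet2"):
    # A path under ~/.local instead of the conda env suggests a build from another setup.
    print(name, "->", module_origin(name))
```

If pointnet2 resolves to a location outside the active environment, rebuilding the extension inside that environment is a reasonable next step to try.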

Problem of Text-Decoupling.

Text decoupling takes a lot of time at the start of every training run, which makes debugging the code slow. How can this be improved? Thanks for your great work.
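
One generic way to avoid repeating an expensive preprocessing step is to cache its output on disk, keyed by the input text. The sketch below is an assumption about how this could be done, not the repository's mechanism: cached_parse and its arguments are hypothetical, and parse_fn stands in for whatever routine performs the text decoupling, assumed here to return JSON-serializable data.

```python
import hashlib
import json
from pathlib import Path


def cached_parse(utterance, parse_fn, cache_dir):
    """Return parse_fn(utterance), reusing a JSON file on disk when one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the utterance so that any text maps to a safe, fixed-length file name.
    key = hashlib.sha1(utterance.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = parse_fn(utterance)  # the slow step runs only on a cache miss
    path.write_text(json.dumps(result))
    return result
```

After the first run populates the cache, later runs only read small JSON files. The cache directory should be cleared whenever the parsing logic changes.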

Data processing for VG-w/o-ON

Thanks for your great work.
I see that EDA proposes a new task, VG-w/o-ON. However, I do not know how to get the object name in the sentence. Could you provide the code that processes the data for VG-w/o-ON?

I am confused about the log.txt file

  1. I downloaded your publicly available log file log_67_6.txt regarding Sr3D, and noticed something strange: the max_epoch parameter recorded in the log file is 300, but according to the log file, the training lasted for only 48 epochs. Can you explain why? Did you manually stop the program?

  2. When I ran the command sh scripts/train_sr3d.sh, the log.txt showed that the max_epoch was 400, and the training seemed to really continue towards 400 epochs. Therefore, I had to manually stop the program at the 75th epoch.

  3. In your published log file, log_67_6.txt, detailed validation results are output every 3 epochs, but this does not happen when I run sh scripts/train_sr3d.sh. I have included a link to my log file for your reference.

Problem of text label

As mentioned in the paper, you decouple every text into 5 parts with corresponding labels. How can I get these labels in the code? I have tried 'positive_map', 'modify_positive_map', 'pron_positive_map', 'other_entity_map', and 'rel_positive_map', but they don't seem to correspond to the texts. I would really appreciate a reply to this question.

Problem of evaluation

When evaluating the model, how can I get the overall accuracy? I can't find it in the output log.
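
If the log reports an accuracy and a sample count for each subset, the overall accuracy can be recovered as the count-weighted mean of the per-subset values. The sketch below assumes that such per-subset numbers are available. The function name, the subset names, and the values are placeholders, not results from the paper or the log.

```python
def overall_accuracy(subsets):
    """Count-weighted mean of per-subset accuracies.

    subsets maps a subset name to (accuracy, number_of_samples).
    """
    total = sum(count for _, count in subsets.values())
    if total == 0:
        raise ValueError("no samples")
    correct = sum(acc * count for acc, count in subsets.values())
    return correct / total


# Placeholder numbers for illustration only.
example = {"subset_a": (0.80, 100), "subset_b": (0.50, 300)}
print(overall_accuracy(example))
```

Note that a plain average of the per-subset accuracies gives a different number whenever the subsets differ in size.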

group_free_pred_bboxes

How do you obtain the bboxes? Are the bbox sizes normalized? I cannot visualize them on the ScanNetv2 dataset.

The pretrained PointNet++ backbone

Hi, does the file "gf_detector_l6o256.pth" contain the backbone weights of Group-Free trained on ScanNet with a vocabulary of 485 object categories?

Problem of learning rate

Thanks for your excellent work! In my experiments I have found that the model is particularly sensitive to the learning rate. When training on one 3090 GPU without adjusting the lr, the results are as follows:
The long line is the baseline; the short lines are the results of training on a single GPU without adjusting the lr.
[two attached result plots]
Even when training with two 3090 GPUs, adjusting the learning rate, for example multiplying it by 2 or by 1.4, shows the same behavior.
Do you have any advice on this problem? Thank you very much!
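
For reference, two common heuristics for adapting the learning rate when the number of GPUs, and hence the effective batch size, changes are linear scaling and square-root scaling. The sketch below illustrates both. It is a general rule of thumb, not the authors' recommendation, and the function name and example numbers are hypothetical.

```python
def scale_lr(base_lr, base_batch_size, batch_size, rule="linear"):
    """Rescale a learning rate for a new effective batch size.

    rule="linear": lr grows in proportion to the batch size.
    rule="sqrt":   lr grows with the square root of the batch-size ratio.
    """
    ratio = batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical setup: reference run on 4 GPUs x 12 samples, new run on 1 GPU x 12 samples.
print(scale_lr(1e-4, base_batch_size=48, batch_size=12))               # linear
print(scale_lr(1e-4, base_batch_size=48, batch_size=12, rule="sqrt"))  # square root
```

The factors 2 and 1.4 mentioned above correspond roughly to the linear and square-root rules for a doubling of the batch size. Neither rule is guaranteed to reproduce the reference results.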

Overfitting when training with 2 GPUs

Hi,
I tried to train EDA on ScanRefer with 2 A100 GPUs following all your settings, but the performance is about 2% lower than yours. I also observed overfitting, as the figure shows: after about 21 epochs the training losses keep decreasing, but the validation losses rise.
[attached figure: training and validation loss curves]
