
LAR-Look-Around-and-Refer

This is the official implementation for our paper

"LAR:Look Around and Refer:2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding"

Eslam Mohamed BAKR, Yasmeen Alsaedy, and Mohamed Elhouseiny

📑 Website        📖 paper        🎬 video

📒 News

  • (Aug 19, 2022): LAR: Look Around and Refer is accepted at NeurIPS 2022.

🚀 Awesome Features in the code

  • Supports distributed training.
  • Our projection code can be easily adapted and extended to other applications, as it does not rely on any 3D library (e.g., Open3D); see the sketch after this list.
  • We adopt a fully PyTorch implementation of PointNet++ instead of the CUDA implementation.
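
As a flavor of how such library-free projection can look, here is a minimal NumPy-only sketch; it is not the repository's actual SIG/projection code, and the pinhole-camera convention, focal length, and image size are illustrative assumptions.

import numpy as np

def project_points_to_image(points_xyz, colors_rgb, cam_pos, focal=300.0, img_size=256):
    """Project colored 3D points onto a simple pinhole image (no 3D libraries).

    points_xyz : (N, 3) points in world coordinates.
    colors_rgb : (N, 3) RGB values in [0, 1].
    cam_pos    : (3,) camera position; the camera looks along -Z (axis-aligned, for brevity).
    """
    pts = np.asarray(points_xyz, dtype=np.float64) - np.asarray(cam_pos, dtype=np.float64)
    colors = np.asarray(colors_rgb, dtype=np.float32)

    # Keep only points in front of the camera (negative camera-frame Z).
    in_front = pts[:, 2] < -1e-6
    pts, colors = pts[in_front], colors[in_front]

    # Pinhole projection to pixel coordinates, centered in the image.
    z = -pts[:, 2]
    u = (focal * pts[:, 0] / z + img_size / 2).astype(int)
    v = (focal * pts[:, 1] / z + img_size / 2).astype(int)

    # Rasterize: points landing inside the image overwrite the pixel color
    # (no z-buffer, to keep the sketch short).
    img = np.zeros((img_size, img_size, 3), dtype=np.float32)
    valid = (u >= 0) & (u < img_size) & (v >= 0) & (v < img_size)
    img[v[valid], u[valid]] = colors[valid]
    return img

Because everything above is plain array arithmetic, the same idea can be dropped into other pipelines without pulling in Open3D or similar dependencies.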

📚 Introduction

The 3D visual grounding task has been explored with visual and language streams to comprehend referential language and identify targeted objects in 3D scenes. However, most existing methods devote the visual stream to capturing 3D visual clues using off-the-shelf point cloud encoders. The main question we address is: can we consolidate the 3D visual stream with 2D clues and efficiently utilize them in both the training and testing phases? The main idea is to assist the 3D encoder by incorporating rich 2D object representations without requiring extra 2D inputs. To this end, we leverage 2D clues, synthetically generated from 3D point clouds, which empirically show their aptitude to boost the quality of the learned visual representations. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer datasets. Our experiments show consistent performance gains over counterparts, and our proposed module, dubbed LAR, significantly outperforms state-of-the-art 3D visual grounding techniques on all three benchmarks.

🔥 Architecture

Detailed overview of LAR. The Visual Module incorporates rich 3D representations from the extracted 3D object points together with the 2D synthetic image features using a Visual Transformer. The 2D synthetic images are first extracted by SIG and then processed by a shared ConvNeXt backbone. Simultaneously, the language description is tokenized and embedded into a feature vector. The Multi-Modal Module then fuses the outputs of the Visual Module with the language features separately, using two Transformers. Finally, the outputs of the Multi-Modal Transformers are processed by a Fusion Transformer.
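
To make this data flow concrete, here is a simplified, self-contained PyTorch skeleton of the described pipeline. All module choices, dimensions, and names below are illustrative assumptions (tiny stand-in encoders instead of PointNet++ and ConvNeXt, a placeholder vocabulary size), not the repository's actual classes.

import torch
import torch.nn as nn

class LARSketch(nn.Module):
    """Illustrative skeleton of the described data flow, not the official model."""

    def __init__(self, d_model=256, vocab_size=1288, num_layers=2, num_heads=8):
        super().__init__()

        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers)

        # 3D stream: stand-in per-object point encoder (PointNet-like max pooling).
        self.point_encoder = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                           nn.Linear(d_model, d_model))
        # 2D stream: stand-in image encoder for the SIG synthetic views.
        self.img_encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(32, d_model))
        self.visual_transformer = make_encoder()   # fuses 3D and 2D object tokens
        self.word_embedding = nn.Embedding(vocab_size, d_model)
        self.mm_transformer_3d = make_encoder()    # 3D visual tokens + words
        self.mm_transformer_2d = make_encoder()    # 2D visual tokens + words
        self.fusion_transformer = make_encoder()   # final fusion of both streams
        self.ref_head = nn.Linear(d_model, 1)      # per-object referring score

    def forward(self, obj_points, obj_images, tokens):
        # obj_points: (B, O, P, 3), obj_images: (B, O, 3, H, W), tokens: (B, T)
        B, O = obj_points.shape[:2]
        feat3d = self.point_encoder(obj_points).max(dim=2).values            # (B, O, D)
        feat2d = self.img_encoder(obj_images.flatten(0, 1)).view(B, O, -1)   # (B, O, D)
        visual = self.visual_transformer(torch.cat([feat3d, feat2d], dim=1))
        vis3d, vis2d = visual[:, :O], visual[:, O:]
        lang = self.word_embedding(tokens)                                   # (B, T, D)
        mm3d = self.mm_transformer_3d(torch.cat([vis3d, lang], dim=1))
        mm2d = self.mm_transformer_2d(torch.cat([vis2d, lang], dim=1))
        fused = self.fusion_transformer(torch.cat([mm3d, mm2d], dim=1))
        return self.ref_head(fused[:, :O]).squeeze(-1)                       # (B, O) logits

The per-object scores at the end stand in for the referring head; the official training loop lives in train_referit3d.py.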

📌 Prerequisites

📌 Data

👉 ScanNet

First, download the train/val scans of ScanNet if you do not have them locally. Please refer to the instructions from referit3d for more details. The output is the scan file keep_all_points_00_view_with_global_scan_alignment.pkl / keep_all_points_with_global_scan_alignment.pkl.
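
As a quick sanity check that the download is readable, the .pkl file can be inspected with plain pickle. This is only a sketch: the repository loads these files through its own helpers (see referit3d/in_out), and no assumption is made here about the layout of the unpickled objects.

import pickle

scan_path = "keep_all_points_00_view_with_global_scan_alignment.pkl"
objects = []
with open(scan_path, "rb") as f:
    while True:
        try:
            objects.append(pickle.load(f))   # a .pkl file may contain several pickled objects
        except EOFError:
            break
print("loaded", len(objects), "top-level pickled objects")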

👉 Ref3D Linguistic Data

You can download the Nr3D and Sr3D/Sr3D+ data from Referit3D, and pass the file path via -referit3D-file.
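
The linguistic data ships as CSV files, so a quick way to peek at what you downloaded before passing the path to the training script is shown below; the filename is a placeholder for whatever you downloaded from Referit3D, and no column names are assumed.

import pandas as pd

nr3d = pd.read_csv("nr3d.csv")   # placeholder path; use the CSV you downloaded
print(len(nr3d), "referential utterances")
print(nr3d.columns.tolist())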

👉 LAR Synthetic 2D Features

You can generate the synthetic data once and save it to disk to save training time; see the projection codes.

Simplified overview of our 2D Synthetic Images Generator (SIG) module. First, we determine the prominent face for each object w.r.t. the scene center. Then, the camera is placed at a distance $d^f$ from that face and at a distance $d^{up}$ from the room's floor. Finally, we randomly extend the region of interest by $\epsilon$.
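
A minimal sketch of that camera-placement recipe, assuming axis-aligned object boxes, a floor at z = 0, and illustrative values for $d^f$, $d^{up}$, and $\epsilon$; this is not the repository's SIG implementation.

import numpy as np

def place_sig_camera(bbox_min, bbox_max, scene_center, d_f=1.0, d_up=1.5, eps_max=0.1, rng=None):
    """Sketch of the camera placement described above (all distances are illustrative).

    bbox_min, bbox_max : (3,) axis-aligned bounding box of the object.
    scene_center       : (3,) center of the scene; the floor is assumed at z = 0.
    """
    rng = rng or np.random.default_rng()
    bbox_min, bbox_max = np.asarray(bbox_min, float), np.asarray(bbox_max, float)
    center = (bbox_min + bbox_max) / 2.0

    # 1) Prominent face: the box face oriented toward the scene center (XY plane only).
    to_scene = np.asarray(scene_center, float)[:2] - center[:2]
    axis = int(np.argmax(np.abs(to_scene)))          # 0 -> an x-face, 1 -> a y-face
    sign = 1.0 if to_scene[axis] >= 0 else -1.0
    face_point = center.copy()
    face_point[axis] = bbox_max[axis] if sign > 0 else bbox_min[axis]

    # 2) Camera at distance d_f from that face along its outward normal,
    #    and at height d_up above the floor.
    cam = face_point.copy()
    cam[axis] += sign * d_f
    cam[2] = d_up

    # 3) Randomly extend the region of interest by epsilon on every side.
    eps = rng.uniform(0.0, eps_max)
    return cam, bbox_min - eps, bbox_max + eps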

📌 Training

Please refer to the following example command for training on Nr3D. Feel free to change the parameters; please see the arguments for the valid options.

scanfile=keep_all_points_00_view_with_global_scan_alignment.pkl  ## use keep_all_points_with_global_scan_alignment.pkl if Sr3D is included
python train_referit3d.py --patience 100 --max-train-epochs 100 --init-lr 1e-4 --batch-size 16 --transformer --model mmt_referIt3DNet -scannet-file $scanfile -referit3D-file $nr3dfile_csv --log-dir log/$exp_id --n-workers 2 --gpu 0 --unit-sphere-norm True --feat2d clsvecROI --context_2d unaligned --mmt_mask train2d --warmup

📌 Evaluation

Please find the pretrained models here (clsvecROI on Nr3D). Note that there is a known issue here.

python train_referit3d.py --transformer --model mmt_referIt3DNet -scannet-file $scanfile -referit3D-file $nr3dfile --log-dir log/$exp_id --n-workers 2 --gpu $gpu --unit-sphere-norm True --feat2d clsvecROI --mode evaluate --pretrain-path $pretrain_path/best_model.pth
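
If you only want to inspect the released checkpoint outside the training script, the sketch below is one way to do it; the structure and key names of the saved dictionary are an assumption, and the official loading path is --pretrain-path inside train_referit3d.py.

import torch

ckpt = torch.load("best_model.pth", map_location="cpu")   # path to the downloaded checkpoint
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])   # peek at the top-level keys (names not guaranteed)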

💝 Credits

The project is built based on the following repositories:

Parts of the code or models are from ScanRef, MMF, TAP, and ViLBERT.

☎️ Contact us

[email protected]

📬 Citation

Please consider citing our paper if you find it useful.

@inproceedings{bakrlook,
  title={Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding},
  author={Bakr, Eslam Mohamed and Alsaedy, Yasmeen Youssef and Elhoseiny, Mohamed},
  booktitle={Advances in Neural Information Processing Systems}
}

lar-look-around-and-refer's Issues

Error running train/eval scripts

Hi authors,

I'm trying to run your train/eval code, but I got the following error:

(LAR) [fjd@gpu10 LAR-Look-Around-and-Refer]$ python train_referit3d.py --transformer --model mmt_referIt3DNet -scannet-file $scanfile -referit3D-file $nr3dfile_csv --log-dir log/$exp_id --n-workers 2 --gpu 0 --unit-sphere-norm True --feat2d clsvecROI --mode evaluate --pretrained-path $pretrain_path/best_model.pth -load-dense False -load-imgs False --img-encoder False
1.13.1+cu117
Number of available GPUs is: 1
True
{'augment_with_sr3d': None,
'batch_size': 32,
'camaug': False,
'checkpoint_dir': 'log/test/06-20-2023-13-43-55/checkpoints',
'clspred3d': False,
'cluster_pid': None,
'cocoon': False,
'config_file': None,
'context_2d': None,
'context_info_2d_cached_file': None,
'context_obj': None,
'contrastiveloss': False,
'dgcnn_intermediate_feat_dim': [128, 128, 128, 128],
'dist_backend': 'nccl',
'dist_url': 'tcp://224.66.41.62:23456',
'eval_path': None,
'experiment_tag': None,
'feat2d': 'clsvecROI',
'feat2ddim': 2048,
'fine_tune': False,
'freeze_backbone': False,
'geo3d': False,
'gpu': 0,
'graph_out_dim': 128,
'img_encoder': False,
'imgaug': False,
'imgsize': 32,
'init_lr': 0.0005,
'knn': 7,
'lang_cls_alpha': 0.5,
'language_fusion': 'both',
'language_latent_dim': 768,
'load_dense': False,
'load_imgs': False,
'log_dir': 'log/test/06-20-2023-13-43-55',
'max_distractors': 51,
'max_seq_len': 24,
'max_test_objects': 88,
'max_train_epochs': 100,
'mentions_target_class_only': True,
'min_word_freq': 3,
'mmt_latent_dim': 768,
'mmt_mask': None,
'mode': 'evaluate',
'model': 'mmt_referIt3DNet',
'multiprocessing_distributed': False,
'n_workers': 2,
'obj_cls_alpha': 0.5,
'object_encoder': 'pnet_pp',
'object_latent_dim': 768,
'patience': 10,
'points_per_object': 1024,
'pretrained_path': '/share/data/2pals/fjd/SAT_release/SAT_clsvecROI_Nr3D/best_model.pth',
'random_seed': 2020,
'rank': -1,
'referit3D_file': '/share/data/2pals/fjd/referit3d/refer_it_3d/nr3d_train.csv',
'resume_path': None,
's_vs_n_weight': None,
'save_args': True,
'scannet_file': '/share/data/2pals/fjd/referit3d/keep_all_points_00_view_with_global_scan_alignment/keep_all_points_00_view_with_global_scan_alignment.pkl',
'sceneCocoonPath': None,
'sharetwoTrans': False,
'softtripleloss': False,
'tensorboard_dir': 'log/test/06-20-2023-13-43-55/tb_logs',
'train_scanRefer': False,
'train_vis_enc_only': False,
'transformer': True,
'tripleloss': False,
'twoStreams': False,
'twoTrans': False,
'unit_sphere_norm': True,
'vocab_file': None,
'warmup': True,
'word_dropout': 0.1,
'word_embedding_dim': 64,
'world_size': -1}
Use GPU: 0 for training
starting caching the pkl files.....
Loaded in RAM 707 scans
524 instance classes exist in these scans
#train/test scans: 1201 / 312
Finish caching the pkl files, Done
Dropping utterances without explicit mention to the target class 28716->28716
95-th percentile of token length for remaining (training) data is: 20.0
Dropping utterances with more than 24 tokens, 28716->28716
(mean) Random guessing among target-class test objects nan
Dropped 196 scans to reduce mem-foot-print.
Length of vocabulary, with min_word_freq=3 is 1288
511 training scans will be used.
90-th percentile of number of objects in the (training) scans is: 52.00
Traceback (most recent call last):
  File "train_referit3d.py", line 445, in <module>
    main_worker(args.gpu, ngpus_per_node, args)
  File "train_referit3d.py", line 95, in main_worker
    mean_rgb, vocab = compute_auxiliary_data(referit_data, all_scans_in_dict, args)
  File "/share/data/ripl/fjd/LAR-Look-Around-and-Refer/referit3d/in_out/neural_net_oriented.py", line 199, in compute_auxiliary_data
    obj_cnt = objects_counter_percentile(testing_scan_ids, all_scans, prc)
  File "/share/data/ripl/fjd/LAR-Look-Around-and-Refer/referit3d/in_out/neural_net_oriented.py", line 45, in objects_counter_percentile
    return np.percentile(all_obs_len, prc)
  File "<__array_function__ internals>", line 6, in percentile
  File "/share/data/ripl/fjd/miniconda3/envs/LAR/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3868, in percentile
    a, q, axis, out, overwrite_input, interpolation, keepdims)
  File "/share/data/ripl/fjd/miniconda3/envs/LAR/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3988, in _quantile_unchecked
    interpolation=interpolation)
  File "/share/data/ripl/fjd/miniconda3/envs/LAR/lib/python3.7/site-packages/numpy/lib/function_base.py", line 3564, in _ureduce
    r = func(a, **kwargs)
  File "/share/data/ripl/fjd/miniconda3/envs/LAR/lib/python3.7/site-packages/numpy/lib/function_base.py", line 4098, in _quantile_ureduce_func
    n = np.isnan(ap[-1])
IndexError: index -1 is out of bounds for axis 0 with size 0

I would appreciate it if you could provide any insights into why this error could happen and how to possibly solve it. Thanks!
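
For context on what the traceback is saying: the failure happens when np.percentile is asked to reduce an empty array, which is consistent with the "(mean) Random guessing among target-class test objects nan" line above, i.e. the evaluated split appears to contain no objects/scans after filtering. A minimal, hypothetical reproduction of the same failure mode:

import numpy as np

# Hypothetical: objects_counter_percentile receives an empty list when the
# evaluated split ends up with zero scans after filtering.
all_obs_len = []
try:
    np.percentile(all_obs_len, 90)
except Exception as e:   # the NumPy version shown in the traceback raises IndexError here
    print(type(e).__name__, e)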
