3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng📧, Siyuan Huang📧, Qing Li📧

This repository is the official implementation of the ICCV 2023 paper "3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment".

Paper | arXiv | Project | HuggingFace Demo | Checkpoints

Abstract

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning.
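
To make the "no sophisticated task-specific design" point concrete, here is a conceptual sketch (not the authors' code; layer counts and dimensions are invented) of plain self-attention serving both single-modal encoding and multi-modal fusion:

# Conceptual sketch of the unified design: the same self-attention
# block encodes each modality and then fuses the joint sequence.
import torch
import torch.nn as nn

def make_encoder(d_model=256, n_heads=8, n_layers=4):
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

text_encoder = make_encoder()    # single-modal: text tokens
object_encoder = make_encoder()  # single-modal: 3D object tokens
fusion = make_encoder()          # multi-modal: joint sequence

text = torch.randn(2, 20, 256)     # placeholder sentence features
objects = torch.randn(2, 50, 256)  # placeholder per-object features

joint = torch.cat([text_encoder(text), object_encoder(objects)], dim=1)
fused = fusion(joint)
print(fused.shape)  # torch.Size([2, 70, 256])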

Install

  1. Create the conda environment
conda env create --name 3dvista --file=environments.yml
  2. Install pointnet2
cd vision/pointnet2
python3 setup.py install
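
After these two steps, a quick import check helps confirm the build succeeded. This is a hedged sketch: the module name "pointnet2" is assumed from the package directory and may differ for your build.

# Post-install sanity check; "pointnet2" is an assumed module name
# based on the vision/pointnet2 package directory.
import importlib

for mod in ("torch", "pointnet2"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as err:
        print(f"{mod}: FAILED ({err})")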

Prepare dataset

  1. Follow Vil3dref to download the ScanNet data and place it under data/scanfamily/scan_data; the folder should look like:
./data/scanfamily/scan_data/
β”œβ”€β”€ instance_id_to_gmm_color
β”œβ”€β”€ instance_id_to_loc
β”œβ”€β”€ instance_id_to_name
└── pcd_with_global_alignment
  2. Download scanrefer+referit3d, scanqa, and sqa3d, and put them under data/scanfamily/annotations:
data/scanfamily/annotations/
β”œβ”€β”€ meta_data
β”‚   β”œβ”€β”€ cat2glove42b.json
β”‚   β”œβ”€β”€ scannetv2-labels.combined.tsv
β”‚   β”œβ”€β”€ scannetv2_raw_categories.json
β”‚   β”œβ”€β”€ scanrefer_corpus.pth
β”‚   └── scanrefer_vocab.pth
β”œβ”€β”€ qa
β”‚   β”œβ”€β”€ ScanQA_v1.0_test_w_obj.json
β”‚   β”œβ”€β”€ ScanQA_v1.0_test_wo_obj.json
β”‚   β”œβ”€β”€ ScanQA_v1.0_train.json
β”‚   └── ScanQA_v1.0_val.json
β”œβ”€β”€ refer
β”‚   β”œβ”€β”€ nr3d.jsonl
β”‚   β”œβ”€β”€ scanrefer.jsonl
β”‚   β”œβ”€β”€ sr3d+.jsonl
β”‚   └── sr3d.jsonl
β”œβ”€β”€ splits
β”‚   β”œβ”€β”€ scannetv2_test.txt
β”‚   β”œβ”€β”€ scannetv2_train.txt
β”‚   └── scannetv2_val.txt
└── sqa_task
    β”œβ”€β”€ answer_dict.json
    └── balanced
        β”œβ”€β”€ v1_balanced_questions_test_scannetv2.json
        β”œβ”€β”€ v1_balanced_questions_train_scannetv2.json
        β”œβ”€β”€ v1_balanced_questions_val_scannetv2.json
        β”œβ”€β”€ v1_balanced_sqa_annotations_test_scannetv2.json
        β”œβ”€β”€ v1_balanced_sqa_annotations_train_scannetv2.json
        └── v1_balanced_sqa_annotations_val_scannetv2.json
  3. Download all checkpoints and put them under project/pretrain_weights (a quick layout check follows the table below).
Checkpoint   Link  Note
Pre-trained  link  3D-VisTA pre-trained checkpoint.
ScanRefer    link  Fine-tuned on ScanRefer from the pre-trained checkpoint.
ScanQA       link  Fine-tuned on ScanQA from the pre-trained checkpoint.
Sr3D         link  Fine-tuned on Sr3D from the pre-trained checkpoint.
Nr3D         link  Fine-tuned on Nr3D from the pre-trained checkpoint.
SQA          link  Fine-tuned on SQA from the pre-trained checkpoint.
Scan2Cap     link  Fine-tuned on Scan2Cap from the pre-trained checkpoint.
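
Once the data and checkpoints are in place, a layout check like the sketch below can catch missing files before a run. The paths simply mirror the trees above; nothing here is specific to the training code.

# Minimal layout check; paths mirror the directory trees in this README.
from pathlib import Path

expected = [
    "data/scanfamily/scan_data/pcd_with_global_alignment",
    "data/scanfamily/annotations/meta_data/scanrefer_vocab.pth",
    "data/scanfamily/annotations/qa/ScanQA_v1.0_train.json",
    "data/scanfamily/annotations/refer/scanrefer.jsonl",
    "data/scanfamily/annotations/splits/scannetv2_train.txt",
    "data/scanfamily/annotations/sqa_task/answer_dict.json",
    "project/pretrain_weights",
]
for p in expected:
    print(("OK     " if Path(p).exists() else "MISSING"), p)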

Run 3D-VisTA

To run 3D-VisTA, use the following command, where {task} is one of scanrefer, scanqa, sr3d, nr3d, sqa, or scan2cap:

python3 run.py --config project/vista/{task}_config.yml
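
To sweep every downstream task sequentially, a small driver script works too; this is a convenience sketch that assumes the config files are named exactly as above.

# Launch each downstream task in turn; assumes the per-task configs
# under project/vista/ are named as in this README.
import subprocess

for task in ["scanrefer", "scanqa", "sr3d", "nr3d", "sqa", "scan2cap"]:
    subprocess.run(
        ["python3", "run.py", "--config", f"project/vista/{task}_config.yml"],
        check=True,  # stop on the first failing task
    )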

Acknowledgement

We would like to thank the authors of Vil3dref for their open-source release.

News

  • [ 2023.08 ] First version!
  • [ 2023.09 ] We release code for all downstream tasks.

Citation

@inproceedings{zhu2023vista,
  title={3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment},
  author={Zhu, Ziyu and Ma, Xiaojian and Chen, Yixin and Deng, Zhidong and Huang, Siyuan and Li, Qing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023}
}


3d-vista's Issues

pc_type

Dear Authors,

I am confused about pc_type. When should pc_type be set to "pred"?

Best,
Jian Ding

Demo

Is the demo code currently available? That is, can I use my own description to get the corresponding grounding results?

How do I run the pre-training script?

Dear author:

Thanks for your interesting work.

I have downloaded the pre-training data from Hugging Face. How do I run the pre-training script? Or, when will the pre-training script be released?

Thanks!!

ScanScribe dataset

Hello! Cool work! Will more data be released? I think there are currently only 219 unique scenes from the 3DSSG scenes, right?

scanrefer_vocab.pth and other files missing

Hello,
When I try to run scan2cap, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: './data/scanfamily/annotations/meta_data/scanrefer_vocab.pth'

Where can I find this file?

The sr3d and nr3d files also seem to differ from those linked. Can you point me to the '.jsonl' files under annotations/refer?

Demo

Hello, thanks for your great work!
The demo on Hugging Face is not working.
I would really appreciate it if you could take a look. Thank you!

pre-trained dataset

Hello, 3D-VisTA is a nice piece of work. I hope you can release the pre-training ScanNet datasets.

Cannot find scanrefer_corpus.pth & scanrefer_vocab.pth

Dear author:

Thanks for your interesting work.

Following the instructions in the README, I could not find these two files: scanrefer_corpus.pth & scanrefer_vocab.pth. Could you tell me where to find them?

Thanks!!

How to run 3D-VisTA for Pre-Train?

Dear author:

Thanks for your interesting work.

I would like to understand the details of the pre-training section of the paper. How can I implement the pre-training code? Or, when will the pre-training code be released?

Thanks!!

save_mask.zip for ScanRefer training

Hello, thanks for your great work!
I notice that save_mask.zip only contains detection results for evaluation. Could you release the save_mask.zip files for the other scenes, for ScanRefer training?
I would really appreciate it, thank you!

Fine-tuning on ScanQA

Dear Authors,

Thanks for sharing this excellent work. Has the code for fine-tuning on ScanQA been released? And what are the command and config?

Best,
Jian Ding

installation dependency

The environment.yml currently holds nearly 200 lines, and I'm having problems installing it.

Could you suggest the "core" dependencies that should be installed first?

Ask for more Mask3D results

Hi authors, thanks for your project.

I am now fine-tuning your model on my private dataset, which is a kind of 3D visual grounding task.

However, I found that the Mask3D segmentation results in your project only include the ScanRefer scenes.

I wonder if you have more Mask3D results that cover all the scenes of ScanNet.

Looking forward to your response!

Questions about the pc_type and scanrefer_metrics

Dear author:

When I run 3D-VisTA with the command python run.py --config project/vista/scanrefer_config.yml, I run into the following two issues:

  1. After setting pc_type to pred, I cannot find the save_mask folder under data/scanfamily.
  2. In the ScanRefer eval pipeline, the values of og_Acc_Iou25 and og_Acc_Iou50 are exactly the same, but theoretically og_Acc_Iou25 should be higher.

How can I solve these two problems?

Best
Xiaolong

pointnet2 installation error

Hi,
Can you provide more details on installing pointnet2? I have an issue installing pointnet2==3.0.0.

Thank you
