Chat-3D v2

This is the official repository for the paper "Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers". [paper]

News

[2024.04] 🔥 A refined implementation of Chat-3D v2 is released. The old version v2.0 has been archived in branch v2.0. This main branch is now for the new version (v2.1).

[2024.01] Update training guide for grounding on ScanRefer.

[2023.12] Code release. The main training architecture is based on our former work Chat-3D.

🔥 v2.1 vs v2.0

🔨 Preparation

  • Prepare the environment:

    (Different from v2.0)

    conda create -n chat-3d-v2 python=3.9.17
    conda activate chat-3d-v2
    conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
    pip install -r requirements.txt
  • Download LLM backbone:

    • We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.

    • Change the llama_model_path in config.py to the location of vicuna-7b-v1.5.

  • Annotations and extracted features:

    Please follow the instructions in preprocess.
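Before training, it can help to confirm that the setup matches the steps above. The following is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper, and the path passed to it is a placeholder for whatever you set as llama_model_path in config.py. It only reports warnings and does not modify anything.

```python
import importlib.util
import os
import sys


def check_environment(llama_model_path):
    """Return a list of warnings about the local setup (empty if none found).

    Hypothetical helper for illustration; the expected versions are the ones
    pinned in the conda commands of this README.
    """
    problems = []

    # The README creates the environment with Python 3.9.17.
    if sys.version_info[:2] != (3, 9):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {found} found; the README uses Python 3.9.17")

    # Only import torch if it is actually installed.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch

        if not torch.__version__.startswith("2.2.1"):
            problems.append(
                f"PyTorch {torch.__version__} found; the README pins 2.2.1"
            )
        if not torch.cuda.is_available():
            problems.append("CUDA is not available to PyTorch")

    # llama_model_path should point at the downloaded vicuna-7b-v1.5 folder.
    if not os.path.isdir(llama_model_path):
        problems.append(f"LLM directory not found: {llama_model_path}")

    return problems


if __name__ == "__main__":
    # Placeholder path; replace with your own llama_model_path.
    for problem in check_environment("/path/to/vicuna-7b-v1.5"):
        print("WARNING:", problem)
```

An empty result does not guarantee that training will run; it only means none of these basic checks failed.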

🤖 Training and Inference

  • Training

    • Modify run.sh:

      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      Explanation of "train_tag" and "val_tag"
      • Use # to separate different datasets

      • Datasets:

        • scanrefer: ScanRefer Dataset
        • scan2cap: Scan2Cap Dataset
        • scanqa: ScanQA Dataset
        • sqa3d: SQA3D Dataset
        • multi3dref: Multi3dRefer Dataset
        • nr3d_caption: A captioning dataset originated from Nr3D.
        • obj_align: A dataset originated from ScanRefer to align the object identifiers with object tokens.
      • You can try different combinations of training datasets or add customized datasets.

    • Run: bash scripts/run.sh

    • Brief training info:

      Batch Size   GPU        VRAM Usage per GPU   Training Time   ckpt
      32           4 * A100   ~ 70 GB              ~ 8 hours       Google Drive
      1            1 * A100   ~ 28 GB              ~ 3 days        -
  • Inference

    • Modify run.sh (the pretrained checkpoint is available on Google Drive):

      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=True
      pretrained_path="/path/to/pretrained_model.pth"
    • Run: bash scripts/run.sh
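Since train_tag and val_tag are plain "#"-separated strings, a typo in a dataset name is easy to miss. The sketch below shows one way to split a tag string and flag names that are not in the dataset list above. It is an illustration only, not the repository's own parsing code: `KNOWN_TAGS`, `split_tags`, and `unknown_tags` are hypothetical names, and a flagged tag may simply be a customized dataset you added yourself.

```python
# Dataset tags listed in this README; extend the set for customized datasets.
KNOWN_TAGS = {
    "scanrefer",
    "scan2cap",
    "scanqa",
    "sqa3d",
    "multi3dref",
    "nr3d_caption",
    "obj_align",
}


def split_tags(tag_string):
    """Split a '#'-separated tag string into a list, dropping empty entries."""
    return [tag.strip() for tag in tag_string.split("#") if tag.strip()]


def unknown_tags(tag_string, known=KNOWN_TAGS):
    """Return the tags that are not in `known` (typos or customized datasets)."""
    return [tag for tag in split_tags(tag_string) if tag not in known]


train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
val_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref"

print(split_tags(val_tag))
print(unknown_tags(train_tag))           # nothing flagged for the default tags
print(unknown_tags("scanrefer#scanQA"))  # tags are compared case-sensitively here
```

The comparison is case-sensitive by design in this sketch, so a tag such as scanQA is reported rather than silently accepted.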

📄 Citation

If you find this project useful in your research, please consider citing:

@article{huang2023chat,
  title={Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers},
  author={Huang, Haifeng and Wang, Zehan and Huang, Rongjie and Liu, Luping and Cheng, Xize and Zhao, Yang and Jin, Tao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2312.08168},
  year={2023}
}
@article{wang2023chat,
  title={Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes},
  author={Wang, Zehan and Huang, Haifeng and Zhao, Yang and Zhang, Ziang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2308.08769},
  year={2023}
}

Stay tuned for our project. 🔥

If you have any questions or suggestions, feel free to drop us an email ([email protected], [email protected]) or open an issue.

😊 Acknowledgement

Thanks to the following open-source projects:

LLMs: LLaMA, Vicuna

3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer

3D Segmentors: PointGroup, Mask3D

3D Encoders: ULIP, Uni3D

Multi-modal LLMs: VideoChat, LEO

3D Expert Models: vil3dref
