
maskclustering's Introduction

MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation

Mi Yan1,2, Jiazhao Zhang1,2, Yan Zhu1, He Wang1,2,3
1Peking University, 2Beijing Academy of Artificial Intelligence, 3Galbot

CVPR 2024


Given an RGB-D scan and a reconstructed point cloud, MaskClustering leverages multi-view verification to merge per-frame 2D instance masks into 3D instances, achieving strong zero-shot open-vocabulary 3D instance segmentation performance on the ScanNet, ScanNet++, and MatterPort3D datasets.
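At a high level, the view-consensus idea can be sketched as follows. This is a simplified, illustrative toy version, not the repository's actual implementation: the 0.5 overlap ratios and the merge threshold are placeholder choices, and each 2D mask is represented only by the set of 3D point indices it covers.

# Toy sketch of view-consensus mask clustering (illustrative only).
# Two masks are merged when, among the frames that observe both point sets,
# a large enough fraction sees them covered by a single 2D mask.

def consensus_rate(points_a, points_b, frames):
    """frames: list of dicts mapping mask_id -> set of visible 3D point indices."""
    observed, consistent = 0, 0
    for frame_masks in frames:
        visible = set().union(*frame_masks.values()) if frame_masks else set()
        if not (points_a & visible and points_b & visible):
            continue  # this frame does not observe both masks
        observed += 1
        # consistent if a single mask in this frame covers most of both visible point sets
        for mask_points in frame_masks.values():
            if (len(points_a & mask_points) > 0.5 * len(points_a & visible) and
                    len(points_b & mask_points) > 0.5 * len(points_b & visible)):
                consistent += 1
                break
    return consistent / observed if observed else 0.0

def cluster(masks, frames, threshold=0.9):
    """Greedily merge mask point sets whose consensus rate exceeds the threshold."""
    clusters = [set(m) for m in masks]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if consensus_rate(clusters[i], clusters[j], frames) >= threshold:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Example: three masks over a toy scene of 6 points, observed in two frames.
frames = [{0: {0, 1, 2}, 1: {3, 4}}, {0: {0, 1, 2, 3, 4}}]
print(cluster([{0, 1, 2}, {3, 4}, {5}], frames, threshold=0.5))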

Fast Demo

Step 1: Install dependencies

First, install PyTorch following the official instructions, e.g., for CUDA 11.8:

conda install pytorch==2.0.0 torchvision==0.15.0 pytorch-cuda=11.8 -c pytorch -c nvidia

Then, install PyTorch3D. You can try 'pip install pytorch3d', but it did not work for me, so I install it from source:

cd third_party
git clone git@github.com:facebookresearch/pytorch3d.git
cd pytorch3d && pip install -e .

Finally, install other dependencies:

cd ../..
pip install -r requirements.txt
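Optionally, you can sanity-check the environment with a short Python snippet (a minimal check, assuming the conda environment above is active):

# Minimal environment check: the core packages import and CUDA is visible.
import torch
import pytorch3d

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pytorch3d:", pytorch3d.__version__)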

Step 2: Download the demo data from Google Drive or Baidu Drive (password: szrt), then unzip it into ./data so that your directory looks like data/demo/scene0608_00, etc.

Step 3: Run the clustering demo and visualize the class-agnostic result using Pyviz3d:

bash demo.sh

Quantitative Results

In this section, we provide a comprehensive guide to installing the full version of MaskClustering, preparing the data, and running experiments on the ScanNet, ScanNet++, and MatterPort3D datasets.

Further installation

To run the full pipeline of MaskClustering, you need to install the 2D instance segmentation tool CropFormer and OpenCLIP.

CropFormer

The official installation of CropFormer consists of two steps: installing detectron2 and then CropFormer. For your convenience, I have combined the two steps into the following script. If you have any problems, please refer to the original CropFormer installation guide.

cd third_party
git clone git@github.com:facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ../
git clone git@github.com:qqlu/Entity.git
cp -r Entity/Entityv2/CropFormer detectron2/projects
cd detectron2/projects/CropFormer/entity_api/PythonAPI
make
cd ../..
cd mask2former/modeling/pixel_decoder/ops
sh make.sh
pip install -U openmim
mim install mmcv

We add an additional script to CropFormer so that it processes all sequences sequentially.

cd ../../../../../../../../
cp mask_predict.py third_party/detectron2/projects/CropFormer/demo_cropformer

Finally, download the CropFormer checkpoint and modify the 'cropformer_path' variable in script.py.

CLIP

Install the OpenCLIP library:

pip install open_clip_torch

The checkpoint will be downloaded automatically when you run the script. However, if you prefer to download it manually, you can get it from here and set the path when loading the CLIP model with the 'create_model_and_transforms' function.
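For reference, a manually downloaded checkpoint can be passed to 'create_model_and_transforms' roughly as follows (the model name and path below are placeholders; match them to the checkpoint this repository actually uses):

import open_clip

# Placeholder model name and checkpoint path; adjust to the checkpoint you downloaded.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="/path/to/open_clip_checkpoint.bin"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")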

Data Preparation

ScanNet

Please follow the official ScanNet guide to sign the agreement and send it to the ScanNet team. After receiving their response, you can download the data. You only need to download the ['.aggregation.json', '.sens', '.txt', '_vh_clean_2.0.010000.segs.json', '_vh_clean_2.ply', '_vh_clean_2.labels.ply'] files. Please also enable the 'label_map' option to download the 'scannetv2-labels.combined.tsv' file.

After downloading the data, run the following scripts to prepare it. Please change the 'raw_data_dir', 'target_data_dir', 'split_file_path', 'label_map_file', and 'gt_dir' variables before running.
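For illustration, the variables might be set like this (placeholder values; check which of the two scripts each variable lives in):

# Placeholder paths; edit the corresponding variables inside the preprocessing scripts.
raw_data_dir = "/path/to/scannet/scans"                    # raw ScanNet release
target_data_dir = "data/scannet/processed"                 # output of process_val.py
split_file_path = "/path/to/scannetv2_val.txt"             # list of validation scenes
label_map_file = "/path/to/scannetv2-labels.combined.tsv"  # label map from ScanNet
gt_dir = "data/scannet/gt"                                 # output of prepare_gt.py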

cd preprocess/scannet
python process_val.py
python prepare_gt.py

After running the script, you will get the following directory structure:

data/scannet
  ├── processed
      ├── scene0011_00
          ├── pose                            <- folder with camera poses
          │      ├── 0.txt 
          │      ├── 10.txt 
          │      └── ...  
          ├── color                           <- folder with RGB images
          │      ├── 0.jpg  (or .png/.jpeg)
          │      ├── 10.jpg (or .png/.jpeg)
          │      └── ...  
          ├── depth                           <- folder with depth images
          │      ├── 0.png  (or .jpg/.jpeg)
          │      ├── 10.png (or .jpg/.jpeg)
          │      └── ...  
          ├── intrinsic
          │      ├── intrinsic_depth.txt       <- camera intrinsics
          │      └── ...
          └── scene0011_00_vh_clean_2.ply      <- point cloud of the scene
  └── gt                                       <- folder with ground truth 3D instance masks
      ├── scene0011_00.txt
      └── ...
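For reference, a single frame from this layout can be loaded roughly as follows (a minimal sketch assuming the standard ScanNet conventions: plain-text 4x4 pose and intrinsic matrices, and 16-bit depth PNGs in millimeters):

import numpy as np
from PIL import Image  # any image library that preserves 16-bit PNGs works

scene_dir = "data/scannet/processed/scene0011_00"
frame_id = 0

# Per-frame camera-to-world pose and the shared depth intrinsics, both 4x4 text matrices.
pose = np.loadtxt(f"{scene_dir}/pose/{frame_id}.txt")
intrinsic = np.loadtxt(f"{scene_dir}/intrinsic/intrinsic_depth.txt")

# Depth is stored in millimeters as 16-bit integers; convert to meters.
depth = np.asarray(Image.open(f"{scene_dir}/depth/{frame_id}.png"), dtype=np.float32) / 1000.0

print(pose.shape, intrinsic.shape, depth.shape)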

ScanNet++

Please follow the official ScanNet++ guide to sign the agreement and download the data. To help reproduce our results, we provide the configs we use to download and preprocess ScanNet++ in preprocess/scannetpp. Please modify the paths in these configs and copy them into the corresponding folders before running the scripts. Then clone the ScanNet++ toolkit.

To extract the RGB and depth images, run the following scripts:

  python -m iphone.prepare_iphone_data iphone/configs/prepare_iphone_data.yml
  python -m common.render common/configs/render.yml

Since the original mesh is of very high resolution, we downsample it and generate the ground truth accordingly:

  python -m semantic.prep.prepare_training_data semantic/configs/prepare_training_data.yml
  python -m semantic.prep.prepare_semantic_gt semantic/configs/prepare_semantic_gt.yml

After running the script, you will get the following directory structure:

data/scannetpp
  ├── data
      ├── 0d2ee665be
          ├── iphone
          │       ├── rgb
          │       │      ├── frame_000000.jpg
          │       │      ├── frame_000001.jpg
          │       │      └── ...
          │       ├── render_depth
          │       │      ├── frame_000000.png
          │       │      ├── frame_000001.png
          │       │      └── ...
          │       └── ...
          └── scans
      └── ...
  ├── gt
  ├── metadata
  ├── pcld_0.25     <- downsampled point cloud of the scene
  └── splits

MatterPort3D

Please follow the official MatterPort3D guide to sign the agreement and download the data. We use a subset of its test scenes so that Mask3D stays within memory constraints. The list of scenes we use can be found in splits/matterport3d.txt. Download only the following: ['undistorted_color_images', 'undistorted_depth_images', 'undistorted_camera_parameters', 'house_segmentations']. After downloading, unzip the files. Your directory structure should resemble the following (or you can modify the paths in 'preprocess/matterport3d/process.py' and 'dataset/matterport.py'):

data/matterport3d/scans
  ├── 2t7WUuJeko7
      ├── 2t7WUuJeko7
          ├── house_segmentations
          |         ├── 2t7WUuJeko7.ply
          |         └── ...
          ├── undistorted_camera_parameters
          |         └── 2t7WUuJeko7.conf
          ├── undistorted_color_images
          |         ├── xxx_i0_0.jpg
          |         └── ...
          └── undistorted_depth_images
                    ├── xxx_d0_0.png
                    └── ...
  ├── ARNzJeq3xxb
  ├── ...
  └── YVUC4YcDtcY

Then run the following script to prepare the ground truth:

cd preprocess/matterport3d
python process.py

Running Experiments

Simply find the corresponding config in the 'configs' folder and run the command below. Remember to change the 'cropformer_path' variable in the config and the 'CUDA_LIST' variable in run.py (an illustrative sketch of these two settings follows the command).

  python run.py --config config_name
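The two settings might look roughly like this (placeholder values; the exact variable format and location are defined by the config files and run.py in this repository):

# In the dataset config: path to the downloaded CropFormer checkpoint (placeholder path).
cropformer_path = "/path/to/CropFormer_checkpoint.pth"

# In run.py: which GPU ids to spread the scenes over (placeholder; the actual format may differ).
CUDA_LIST = [0, 1]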

For example, to run the ScanNet experiment, you can run the following command:

  python run.py --config scannet

run.py predicts the 2D instance masks, runs mask clustering, extracts open-vocabulary features, and evaluates the results. The evaluation results are saved in the 'data/evaluation' folder.

Time cost

We report the GPU hours of each step on an NVIDIA 3090 GPU, together with the average processing time per scene.

               2D mask prediction   Mask clustering   CLIP feature extraction   Overall   Time per scene
ScanNet                5                  6.5                    2               13.5        2.6 min
ScanNet++              4.5                4                      0.5              9           10 min
MatterPort3D           0.5                1                      0.25             2           15 min

Visualization

To visualize the 3D class-agnostic result of one specific scene, run the following command:

  python -m visualize.vis_scene --config scannet --seq_name scene0608_00
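If you want to inspect a point cloud with Pyviz3d directly, a minimal sketch using its documented API looks like this (the arrays below are random placeholders):

import numpy as np
import pyviz3d.visualizer as viz

# Placeholder data: N x 3 point positions and N x 3 uint8 colors.
points = np.random.rand(1000, 3)
colors = (np.random.rand(1000, 3) * 255).astype(np.uint8)

v = viz.Visualizer()
v.add_points("instances", points, colors, point_size=25)
v.save("vis_output")  # writes an HTML scene you can serve and open in a browser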


maskclustering's Issues

Where can I get the gt files for ScanNet (rather than ScanNet++) for evaluation?

Thanks for your excellent work and generous sharing! I notice the data structure of ScanNet in your README goes as follows:

data/scannet/processed
  ├── scene0011_00
  │     ├── pose                            <- folder with camera poses
  │     │      ├── 0.txt
  │     │      ├── 10.txt
  │     │      └── ...
  │     ├── color                           <- folder with RGB images
  │     │      ├── 0.jpg  (or .png/.jpeg)
  │     │      ├── 10.jpg (or .png/.jpeg)
  │     │      └── ...
  │     ├── depth                           <- folder with depth images
  │     │      ├── 0.png  (or .jpg/.jpeg)
  │     │      ├── 10.png (or .jpg/.jpeg)
  │     │      └── ...
  │     ├── intrinsic
  │     │      ├── intrinsic_depth.txt      <- camera intrinsics
  │     │      └── ...
  │     └── scene0011_00_vh_clean_2.ply     <- point cloud of the scene
  └── ...

Since I cannot find any files in txt or npz format for evaluation under this path, I wonder where I can download them. Thanks!

How to improve 3D point cloud with missing labels?

Hello authors, thanks for open-sourcing such excellent work! I've been working with the code recently, and I've noticed that some of the finer details (such as table legs, chair legs, etc.) are missing after the masks are merged in 3D, even though the 2D masks capture them.

Could you give some suggestions on how to avoid losing these fine details (e.g., which config settings to adjust)? Thanks a lot!

Problems with evaluation script and scannetpp

Congratulations on the amazing work and thank you for sharing this user-friendly code base.

I would like to ask for help with a few issues:

  1. It appears that each scene contains more iphone/rgb images than the number of cameras defined in its iphone/colmap files. Is this correct, or is there something wrong with my setup? If it is correct, could this be the reason for the segmented point clouds missing pieces of the sampled point cloud?

  2. Following the provided instructions, I was able to set up the ScanNetPP dataset and run the run.py script up to the class-agnostic evaluation step. However, I'm encountering issues with this evaluation. It outputs NaN for all classes except for doors. Despite this, the setup and code seem correct, as I can visualize the segmentation results and they look good. I have also tried evaluating single scenes to simplify the problem, but I only get metrics for the door class. I have attached the ground truth and prediction file for a scene in case it helps in identifying the problem.

a24f64f7fb.zip

Could you please assist me with these issues? Your help would be greatly appreciated.

About Training Time

Dear Authors,

Do you plan to release the training code? How long does your method take to train?
