
GeoDiffusion


This repository contains the implementation of the paper:

GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation
Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung
International Conference on Learning Representations (ICLR), 2024.


Installation

Clone this repo and create the GeoDiffusion environment with conda. We tested the code with python==3.7.16, pytorch==1.12.1, and cuda==10.2 on Tesla V100 GPU servers; other versions may work as well.

  1. Initialize the conda environment:

    git clone https://github.com/KaiChen1998/GeoDiffusion.git
    conda create -n geodiffusion python=3.7 -y
    conda activate geodiffusion
  2. Install the required packages:

    cd GeoDiffusion
    # when running training
    pip install -r requirements/train.txt
    # only when running inference with DPM-Solver++
    pip install -r requirements/dev.txt
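
To confirm the environment is set up correctly before training, here is a minimal sanity check (a sketch, assuming the diffusers dependency is installed via requirements/train.txt and a CUDA-capable GPU is visible):

# Verify that the pinned PyTorch build sees the GPU and that diffusers imports cleanly.
import torch
import diffusers

print("torch", torch.__version__, "| diffusers", diffusers.__version__)
print("CUDA available:", torch.cuda.is_available())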

Download Pre-trained Models

Dataset                  Image Resolution   Grid Size   Download
nuImages                 256x256            256x256     HF Hub
nuImages                 512x512            512x512     HF Hub
nuImages_time_weather    512x512            512x512     HF Hub
COCO-Stuff               256x256            256x256     HF Hub
COCO-Stuff               512x512            256x256     HF Hub
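
The checkpoints can also be fetched programmatically with the huggingface_hub client. This is only a sketch: REPO_ID is a placeholder for the repository linked in the table above, not an actual repository name.

# Download a pre-trained GeoDiffusion checkpoint from the Hugging Face Hub.
# REPO_ID is a placeholder; use the repository linked in the table above.
from huggingface_hub import snapshot_download

ckpt_path = snapshot_download(repo_id="REPO_ID", local_dir="./geodiffusion-ckpt")
print("Checkpoint downloaded to", ckpt_path)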

Detection Data Generation with GeoDiffusion

Download the pre-trained models and put them under the root directory. Run the following command to generate detection data with GeoDiffusion. For simplicity, the layout definition process is embedded directly in the file run_layout_to_image.py. Check here for the detailed definition.

python run_layout_to_image.py $CKPT_PATH --output_dir ./results/
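
For reference, the layout consumed by run_layout_to_image.py looks roughly as follows. This is a sketch based on the COCO-Stuff example discussed in the issues below; each entry is a category name followed by normalized [x1, y1, x2, y2] coordinates, and the file itself remains the authoritative definition.

# A minimal layout sketch: category names with normalized [x1, y1, x2, y2] boxes in [0, 1].
layout = {
    "bbox": [
        ["person",    0.10, 0.40, 0.30, 0.90],
        ["bus",       0.35, 0.30, 0.85, 0.95],
        ["sky-other", 0.00, 0.00, 1.00, 0.20],
    ]
}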

Train GeoDiffusion

1. Prepare dataset

We primarily use the nuImages and COCO-Stuff datasets to train GeoDiffusion. Download the image files from the official websites. For better training performance, we follow mmdetection3d to convert the nuImages dataset into COCO format (our converted annotations can also be downloaded via HuggingFace), while the converted annotation file for COCO-Stuff can be downloaded via HuggingFace. After all files are downloaded, the data structure should be as follows.

├── data
│   ├── coco
│   │   │── coco_stuff_annotations
│   │   │   │── train
│   │   │   │   │── instances_stuff_train2017.json
│   │   │   │── val
│   │   │   │   │── instances_stuff_val2017.json
│   │   │── train2017
│   │   │── val2017
│   ├── nuimages
│   │   │── annotation
│   │   │   │── train
│   │   │   │   │── nuimages_v1.0-train.json
│   │   │   │── val
│   │   │   │   │── nuimages_v1.0-val.json
│   │   │── samples
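
After downloading, the converted COCO-format annotation files can be sanity-checked with pycocotools. This is a quick sketch; the paths follow the tree above and pycocotools is assumed to be installed.

# Quick sanity check of the converted COCO-format annotations.
from pycocotools.coco import COCO

coco = COCO("data/coco/coco_stuff_annotations/train/instances_stuff_train2017.json")
print(len(coco.getImgIds()), "images,", len(coco.getCatIds()), "categories")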

2. Launch distributed training

We use Accelerate to launch efficient distributed training (with 8 x V100 GPUs by default). We encourage readers to check the official documentation for personalized training settings. The default training parameters are provided in this script; to change the training dataset, simply change the dataset_config_name argument.

# COCO-Stuff
bash tools/dist_train.sh \
	--dataset_config_name configs/data/coco_stuff_256x256.py \
	--output_dir work_dirs/geodiffusion_coco_stuff

# nuImages
bash tools/dist_train.sh \
	--dataset_config_name configs/data/nuimage_256x256.py \
	--output_dir work_dirs/geodiffusion_nuimages

We also support continuing to fine-tune a pre-trained GeoDiffusion checkpoint on downstream tasks to support more geometric controls, in the Textual Inversion manner, by training only the newly added tokens. We encourage readers to check here and here for more details.

bash tools/dist_train.sh \
	--dataset_config_name configs/data/coco_stuff_256x256.py \
	--train_text_encoder_params added_embedding \
	--output_dir work_dirs/geodiffusion_coco_stuff_continue

3. Launch batch inference

Different from the more user-friendly inference demo provided here, this section provides scripts to run batch inference over an entire dataset. Note that inference settings might differ across checkpoints. We encourage readers to check the generation_config.json file under each pre-trained checkpoint in the Model Zoo for details.

# COCO-Stuff
# We encourage readers to check https://github.com/ZejianLi/LAMA?tab=readme-ov-file#testing
# to report quantitative results on COCO-Stuff L2I benchmark.
bash tools/dist_test.sh PATH_TO_CKPT \
	--dataset_config_name configs/data/coco_stuff_256x256.py

# nuImages
bash tools/dist_test.sh PATH_TO_CKPT \
	--dataset_config_name configs/data/nuimage_256x256.py
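
For a quick FID estimate against the validation images, the pytorch_fid package (also mentioned in the issues below) can be used as a rough check. The directory paths here are placeholders, and the official COCO-Stuff L2I numbers should still be reported with the LAMA evaluation pipeline linked above.

# Rough FID between generated samples and validation images (paths are placeholders).
import torch
from pytorch_fid import fid_score

fid = fid_score.calculate_fid_given_paths(
    ["work_dirs/generated_images", "data/coco/val2017"],
    batch_size=50,
    device="cuda" if torch.cuda.is_available() else "cpu",
    dims=2048,
)
print("FID:", fid)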

Qualitative Results

More results can be found in the main paper.


The GeoDiffusion Family

We aim to construct a controllable and flexible pipeline for perception-data corner-case generation and visual world modeling! Check out our latest works:

  • GeoDiffusion: text-prompted geometric controls for 2D object detection.
  • MagicDrive: multi-view street scene generation for 3D object detection.
  • TrackDiffusion: multi-object video generation for multi-object tracking (MOT).
  • DetDiffusion: customized corner case generation.
  • Geom-Erasing: geometric controls for implicit concept removal.

Citation

@article{chen2023integrating,
  author    = {Chen, Kai and Xie, Enze and Chen, Zhe and Hong, Lanqing and Li, Zhenguo and Yeung, Dit-Yan},
  title     = {Integrating Geometric Control into Text-to-Image Diffusion Models for High-Quality Detection Data Generation via Text Prompt},
  journal   = {arXiv preprint arXiv:2306.04607},
  year      = {2023},
}

Acknowledgement

We adopt the following open-source projects:

  • diffusers: basic codebase to train Stable Diffusion models.
  • mmdetection: dataloader to handle images with various geometric conditions.
  • mmdetection3d & LAMA: data pre-processing of the training datasets.

Issues

Generation quality of the model

Thanks for your inspiring work!

However, I encountered a problem: when I use the model trained on COCO-Stuff with an image size of 512*512, the generation quality seems poor.

The prompt from coco-stuff is:

  layout = {
    "bbox":
      [
        ['metal', 0.04218750074505806, 0.25647059082984924, 0.10000000149011612, 0.5247058868408203],
        ['chair', 0.17940625548362732, 0.4312705993652344, 0.35014063119888306, 0.5062353014945984],
        ['sky-other', 0.606249988079071, 0.0, 0.734375, 0.09882353246212006],
        ['person', 0.0, 0.5493882298469543, 0.07332812249660492, 0.7298117876052856],
        ['pavement', 0.0, 0.5976470708847046, 0.9781249761581421, 1.0],
        ['building-other', 0.0, 0.0, 1.0, 0.7152941226959229],
        ['person', 0.8331093788146973, 0.5236706137657166, 0.913937509059906, 0.8113176226615906],
        ['chair', 0.422062486410141, 0.4221176505088806, 0.6030937433242798, 0.499505877494812],
        ['bus', 0.1626562476158142, 0.29044705629348755, 0.8476094007492065, 0.9376470446586609],
        ['person', 0.32343751192092896, 0.3623529374599457, 0.792187511920929, 0.5176470875740051],
        ['person', 0.9270156025886536, 0.49814116954803467, 0.9953437447547913, 0.8023764491081238],
        ['clothes', 0.15000000596046448, 0.567058801651001, 1.0, 1.0]
      ]
  }

The generation config is:

{
 "dataset": "coco_stuff",
 "num_bucket_per_side": [256, 256],
 "width": 512,
 "height": 512,
 "prompt_template": "An image with {bbox}",
 "cfg_scale": 4.5,
 "num_inference_steps": 50,
 "max_num_bbox": 18
}

However, the generation result using run_layout_to_image.py looks strange (attached image: coco_stuff_0).

I've tried different prompts and the results are very confusing.

What could be going wrong with my setup? Thanks!

Training dataset generation

In the trainability part, the paper says to 'first filter bounding boxes smaller than 0.2% of the image area, then augment the bounding boxes by randomly flipping with 0.5 probability and shifting no more than 256 pixels.' How is this implemented?
I found that train_pipeline sets random_flip with a probability of 0.5 in the data config; however, this makes the generated images differ from the ground-truth annotations during Faster R-CNN training. Should I first filter the annotations and randomly flip them with a probability of 0.5, and then set train_pipeline.random_flip to zero in the data config when generating the images for Faster R-CNN training?

COCO dataset's trainability?

I see that the nuImages trainability experiments use GeoDiffusion trained at 800x456. Is the COCO trainability experiment also trained with an 800x456 GeoDiffusion input size, or is it 512x512?

Learning rate for training

I find that the learning rate in the training script is 1.5e-4, different from the "4e-5 for U-Net and 3e-5 for the text encoder" stated in the paper. I have also tried different learning rates and found that any lr between 5e-5 and 1.5e-4 works, with similar loss curves. What should the learning rate be set to?

COCO-Stuff 256x256 pre-trained model reproduction

The COCO-Stuff 256x256 pre-trained model generates 3097 x 5 = 15485 images for testing, and the FID against the 3097 val-set images is only 24.11, which does not reach the accuracy reported in the paper. I resize all images to 256x256 and then calculate the FID. Is this the correct procedure?

nuImages FID?

Does nuImages also generate five times as much data to calculate FID? Is the protocol the same as for COCO?

nuImages 256x256 reproduction

The pre-trained nuImages 256x256 model generates 14772 x 5 = 73860 images for testing, and the FID against the 14772 val-set images is 19.48. Moreover, I retrained the nuImages 256x256 model from Stable Diffusion and reached 15.90, still not matching the accuracy reported in the paper (14.58). I resize the images to 256x256 and use fid_score from the pytorch_fid package to calculate the FID. Is this correct?

Congratulations on Your Paper Being Accepted to ICLR 2024!

Dear Kai Chen, Enze Xie, Zhe Chen, Lanqing Hong, Zhenguo Li, and Dit-Yan Yeung,

I hope this message finds you well. I am writing to extend my heartfelt congratulations to you and your co-authors on the acceptance of your paper, "GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation," to the International Conference on Learning Representations (ICLR) 2024!

Your achievement reflects the dedication, hard work, and innovative thinking that went into your research. ICLR is one of the premier conferences in machine learning and artificial intelligence, and being accepted to present your work there is a significant recognition of its quality and impact on the field.

On behalf of the open source community, I want to express our sincere appreciation for your contributions to advancing the frontiers of knowledge in machine learning. Your research not only pushes the boundaries of what is possible but also inspires others in our community to explore new ideas and approaches.

Additionally, I would like to extend our gratitude for your commitment to open science and sharing your knowledge with the community. It is through the generosity of researchers like you that we can collectively accelerate progress and foster collaboration in the field.

In light of the importance of reproducibility and open access to research, we kindly request your assistance in making the training code associated with your paper publicly available. Having access to the training code would greatly benefit researchers and practitioners interested in replicating and building upon your findings.

As a token of our appreciation, we would like to offer our assistance in any way possible. Whether it be providing support with open-sourcing the training code associated with your paper or promoting your work through our channels, please do not hesitate to reach out to us.

Once again, congratulations on this well-deserved accomplishment! We look forward to seeing your presentation at ICLR 2024 and witnessing the continued impact of your research in the years to come.

Warm regards,
CatLoves

Can you release the extra dataset generated by GeoDiffusion using the COCO train-set ground truth as layouts, which is used in the paper to train a Faster R-CNN?

A wonderful work!
In your paper, you mention that, to demonstrate how the data generated by GeoDiffusion improves detector performance, a GeoDiffusion model was trained on the COCO dataset and an extra dataset was obtained by using the COCO train set as input layouts for GeoDiffusion. Could you release this generated extra dataset?

Training on custom dataset (KITTI)

I am trying to use your work to train on the KITTI dataset, but it does not work.

The error message says that num_samples becomes zero.

The following is my directory tree for the KITTI dataset:
kitti-tree.txt

Any ideas on why this is happening?

Has the COCO-Stuff dataset been filtered?

I trained on COCO-Stuff at 256x256 and found that the resulting FID is poor. I looked at the paper and found that the data had been filtered. Is this the reason? Could you provide the dataset used in the paper?
