BoxDiff 🎨 (ICCV 2023)

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

Jinheng Xie¹ Yuexiang Li² Yawen Huang² Haozhe Liu²,³ Wentian Zhang² Yefeng Zheng² Mike Zheng Shou¹

¹ National University of Singapore ² Tencent Jarvis Lab ³ KAUST

Setup

Note that we have only tested the code with PyTorch==1.12.0. You can build the environment with pip as follows:

pip3 install -r requirements.txt
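PyTorch 1.12.0 is the only version reported as tested, so a quick check after installing can rule out version drift early. The helpers below are hypothetical (not part of the repository). They read the installed version through `importlib.metadata` and ignore local build tags such as `+cu113` when comparing.

```python
from importlib import metadata
from typing import Optional

TESTED_TORCH = "1.12.0"  # the only PyTorch version the README reports testing

def base_version(version: str) -> str:
    """Strip a local build tag, e.g. '1.12.0+cu113' -> '1.12.0'."""
    return version.split("+", 1)[0]

def installed_torch_version() -> Optional[str]:
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None

def matches_tested_version(version: str, tested: str = TESTED_TORCH) -> bool:
    """True when the version (minus any local tag) equals the tested one."""
    return base_version(version) == tested

if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed")
    elif matches_tested_version(found):
        print(f"torch {found} matches the tested version {TESTED_TORCH}")
    else:
        print(f"warning: torch {found} differs from the tested version {TESTED_TORCH}")
```

A mismatch is only a warning: other versions may work, they just have not been tested.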

To apply BoxDiff to the GLIGEN pipeline, install diffusers from the gligen/diffusers repository as follows:

git clone git@github.com:gligen/diffusers.git
cd diffusers
pip3 install -e .

Usage

To add spatial control to the Stable Diffusion model, use run_sd_boxdiff.py. For example:

CUDA_VISIBLE_DEVICES=0 python3 run_sd_boxdiff.py --prompt "as the aurora lights up the sky, a herd of reindeer leisurely wanders on the grassy meadow, admiring the breathtaking view, a serene lake quietly reflects the magnificent display, and in the distance, a snow-capped mountain stands majestically, fantasy, 8k, highly detailed" --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30] --token_indices [3,12,21,30,46] --bbox [[1,3,512,202],[75,344,421,495],[1,327,508,507],[2,217,507,341],[1,135,509,242]] --refine False

Another example:

CUDA_VISIBLE_DEVICES=0 python3 run_sd_boxdiff.py --prompt "A rabbit wearing sunglasses looks very proud"  --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9] --token_indices [2,4] --bbox [[67,87,366,512],[66,130,364,262]]
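For sweeps over many seeds or layouts it can be easier to assemble these commands in Python. The sketch below is a hypothetical helper, not part of the repository; it uses only the flags that appear in the examples above (`--prompt`, `--P`, `--L`, `--seeds`, `--token_indices`, `--bbox`, `--refine`), renders lists without spaces as the examples do, and checks that there is one box per token index.

```python
import os
import shlex
import subprocess
from typing import List, Optional, Sequence

def compact(values) -> str:
    """Render a (nested) list without spaces, e.g. [[1, 2], [3, 4]] -> '[[1,2],[3,4]]'."""
    if isinstance(values, (list, tuple)):
        return "[" + ",".join(compact(v) for v in values) + "]"
    return str(values)

def build_sd_command(
    prompt: str,
    token_indices: Sequence[int],
    seeds: Sequence[int],
    bboxes: Optional[Sequence[Sequence[int]]] = None,
    P: float = 0.2,
    L: int = 1,
    refine: Optional[bool] = None,
) -> List[str]:
    """Build the argument list for run_sd_boxdiff.py from the README's flags."""
    if bboxes is not None and len(bboxes) != len(token_indices):
        # Each token index is paired with exactly one conditioning box.
        raise ValueError("each token index needs exactly one bounding box")
    cmd = [
        "python3", "run_sd_boxdiff.py",
        "--prompt", prompt,
        "--P", str(P),
        "--L", str(L),
        "--seeds", compact(seeds),
        "--token_indices", compact(token_indices),
    ]
    if bboxes is not None:  # leaving --bbox out falls back to drawing boxes by hand
        cmd += ["--bbox", compact(bboxes)]
    if refine is not None:
        cmd += ["--refine", str(refine)]
    return cmd

def run(cmd: List[str], gpu: str = "0") -> None:
    """Run on a single GPU; a list argument bypasses the shell entirely."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpu}
    subprocess.run(cmd, check=True, env=env)

if __name__ == "__main__":
    cmd = build_sd_command(
        "A rabbit wearing sunglasses looks very proud",
        token_indices=[2, 4],
        seeds=list(range(1, 10)),
        bboxes=[[67, 87, 366, 512], [66, 130, 364, 262]],
    )
    print(shlex.join(cmd))  # shell-quoted, ready to paste into a terminal
```

Passing a list to `subprocess.run` avoids shell glob expansion of the bracketed arguments; zsh, for instance, rejects an unmatched `[...]` pattern by default.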

Note that --token_indices specifies the indices of the words in the text prompt that you want to control, and each token index has exactly one corresponding conditioning box in --bbox. P and L are hyper-parameters of the proposed constraints.
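In the rabbit examples, `rabbit` is index 2 and `sunglasses` is index 4, which is consistent with counting positions of the CLIP tokenizer's output, where position 0 holds the start-of-text token. Because CLIP's byte-pair encoding can split one word into several pieces, counting words by hand can drift on long prompts. The helper below is hypothetical (not part of the repository): it merges BPE pieces back into words using CLIP's `</w>` end-of-word suffix and returns the position of each target word's first piece. Which piece of a split word BoxDiff should receive is an assumption to verify against the script.

```python
from typing import Dict, List, Sequence, Tuple

SPECIAL_TOKENS = {"<|startoftext|>", "<|endoftext|>"}
END_OF_WORD = "</w>"  # CLIP's BPE marks the last piece of every word with this suffix

def group_words(tokens: Sequence[str]) -> List[Tuple[str, int]]:
    """Merge BPE pieces into words, returning (word, index of its first piece)."""
    words: List[Tuple[str, int]] = []
    current, start = "", -1
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue
        if not current:  # first piece of a new word
            start = i
        if tok.endswith(END_OF_WORD):
            words.append((current + tok[: -len(END_OF_WORD)], start))
            current = ""
        else:
            current += tok
    return words

def find_token_indices(tokens: Sequence[str], targets: Sequence[str]) -> List[int]:
    """Return the token index of the first occurrence of each target word."""
    first_seen: Dict[str, int] = {}
    for word, index in group_words(tokens):
        first_seen.setdefault(word, index)
    # CLIP's tokenizer lowercases text, so targets are compared in lowercase.
    missing = [t for t in targets if t.lower() not in first_seen]
    if missing:
        raise KeyError(f"not found among prompt tokens: {missing}")
    return [first_seen[t.lower()] for t in targets]

if __name__ == "__main__":
    # Hand-written pieces for illustration. Real ones can come from Hugging Face
    # transformers, e.g. the CLIP ViT-L/14 tokenizer used by Stable Diffusion v1.x:
    #   from transformers import CLIPTokenizer
    #   tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    #   tokens = tok.convert_ids_to_tokens(tok(prompt).input_ids)
    tokens = ["<|startoftext|>", "a</w>", "rabbit</w>", "wearing</w>",
              "sunglasses</w>", "looks</w>", "very</w>", "proud</w>", "<|endoftext|>"]
    print(find_token_indices(tokens, ["rabbit", "sunglasses"]))  # [2, 4]
```

Check which checkpoint run_sd_boxdiff.py loads before relying on a particular tokenizer.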

If --bbox is not specified, the script provides an interface for drawing the bounding boxes to use as conditions:

CUDA_VISIBLE_DEVICES=0 python3 run_sd_boxdiff.py --prompt "A rabbit wearing sunglasses looks very proud"  --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9] --token_indices [2,4]

To add spatial control to the GLIGEN model, use run_gligen_boxdiff.py. For example:

CUDA_VISIBLE_DEVICES=0 python3 run_gligen_boxdiff.py --prompt "A rabbit wearing sunglasses looks very proud" --gligen_phrases ["a rabbit","sunglasses"] --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9] --token_indices [2,4] --bbox [[67,87,366,512],[66,130,364,262]] --refine False
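The GLIGEN example passes one grounding phrase per box. In upstream diffusers, the GLIGEN pipeline expects its boxes as `[xmin, ymin, xmax, ymax]` normalized to [0, 1], whereas the boxes in these commands look like pixel coordinates on a 512×512 canvas; whether run_gligen_boxdiff.py (or the gligen/diffusers version) does this conversion internally is not stated here. The hypothetical helper below, which assumes that pixel layout, checks that phrases and boxes line up and scales the boxes into [0, 1].

```python
from typing import List, Sequence

CANVAS = 512  # assumption: boxes are [x1, y1, x2, y2] pixels on a 512x512 canvas

def validate_boxes(boxes: Sequence[Sequence[float]], size: int = CANVAS) -> None:
    """Reject boxes that are malformed, empty, or outside the canvas."""
    for box in boxes:
        if len(box) != 4:
            raise ValueError(f"expected [x1, y1, x2, y2], got {list(box)}")
        x1, y1, x2, y2 = box
        if not (0 <= x1 < x2 <= size and 0 <= y1 < y2 <= size):
            raise ValueError(f"box {list(box)} is not inside a {size}x{size} canvas")

def to_gligen_boxes(
    phrases: Sequence[str], boxes: Sequence[Sequence[float]], size: int = CANVAS
) -> List[List[float]]:
    """Check one phrase per box, then scale pixel boxes into [0, 1]."""
    if len(phrases) != len(boxes):
        raise ValueError(f"{len(phrases)} phrases but {len(boxes)} boxes")
    validate_boxes(boxes, size)
    return [[c / size for c in box] for box in boxes]

if __name__ == "__main__":
    phrases = ["a rabbit", "sunglasses"]
    boxes = [[67, 87, 366, 512], [66, 130, 364, 262]]
    for phrase, box in zip(phrases, to_gligen_boxes(phrases, boxes)):
        print(f"{phrase!r}: {box}")
```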

The synthesized results are saved in the following directory structure:

outputs/
|-- text prompt/
|   |-- 0.png 
|   |-- 0_canvas.png 
|   |-- 1.png
|   |-- 1_canvas.png 
|   |-- ...
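In this layout each sample `N.png` sits next to an `N_canvas.png` inside a folder named after the text prompt. The hypothetical helper below pairs the two files per index for browsing or batch evaluation; it relies only on that naming pattern and makes no assumption about what the canvas image contains.

```python
import re
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Matches "N.png" and "N_canvas.png"; group 2 is set only for canvas files.
IMAGE_RE = re.compile(r"^(\d+)(_canvas)?\.png$")

def collect_results(prompt_dir: Path) -> Dict[int, Tuple[Optional[Path], Optional[Path]]]:
    """Map each sample index to (image, canvas); a missing file is None."""
    pairs: Dict[int, List[Optional[Path]]] = {}
    for path in sorted(prompt_dir.iterdir()):
        match = IMAGE_RE.match(path.name)
        if match is None:
            continue  # ignore anything that does not follow the naming pattern
        slot = pairs.setdefault(int(match.group(1)), [None, None])
        slot[1 if match.group(2) else 0] = path
    return {idx: (pair[0], pair[1]) for idx, pair in sorted(pairs.items())}

if __name__ == "__main__":
    root = Path("outputs")
    prompt_dirs = sorted(p for p in root.iterdir() if p.is_dir()) if root.is_dir() else []
    for prompt_dir in prompt_dirs:
        for idx, (image, canvas) in collect_results(prompt_dir).items():
            print(prompt_dir.name, idx, image, canvas)
```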

Customize Your Layout

VisorGPT can generate customized layouts that serve as spatial conditions for image synthesis with BoxDiff.

Citation

@article{xie2023boxdiff,
  title={BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion},
  author={Xie, Jinheng and Li, Yuexiang and Huang, Yawen and Liu, Haozhe and Zhang, Wentian and Zheng, Yefeng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2307.10816},
  year={2023}
}

Acknowledgment: the code is largely based on the repositories of diffusers, google, and yuval-alaluf.
