Parts2Whole

[Arxiv 2024] From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation

  • Inference code and pretrained models.
  • Evaluation code.
  • Training code.
  • Training data.
  • New model based on Stable Diffusion 2-1.

🔥 Updates

[2024-05-06] 🔥🔥🔥 Code is released. Enjoy the human parts composition!

🏠 Project Page | Paper | Model

img:teaser

Abstract: We propose Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, including pose images and various aspects of human appearance. We first develop a semantic-aware appearance encoder to retain details of different human parts, which processes each image based on its textual label to a series of multi-scale feature maps rather than one image token, preserving the image dimension. Second, our framework supports multi-image conditioned generation through a shared self-attention mechanism that operates across reference and target features during the diffusion process. We enhance the vanilla attention mechanism by incorporating mask information from the reference human images, allowing for precise selection of any part.

🔨 Method Overview

img:pipeline

⚒️ Installation

Clone our repo and install the packages listed in requirements.txt. We tested the model on an 80 GB A800 GPU with CUDA 11.8 and PyTorch 2.0.1; inference on smaller GPUs is also possible.

conda create -n parts2whole
conda activate parts2whole
pip install -r requirements.txt

Download the checkpoints here and place them in the pretrained_weights/parts2whole directory. Alternatively, use the simple download script we provide:

python download_weights.py

🎨 Inference

See inference.py. Modify the checkpoint path and inputs as needed, then run:

python inference.py

You may need to modify the following code in the inference.py script:

### Define configurations ###
device = "cuda"
torch_dtype = torch.float16
seed = 42
model_dir = "pretrained_weights/parts2whole"  # checkpoint path on your local machine
use_decoupled_cross_attn = True
decoupled_cross_attn_path = "pretrained_weights/parts2whole/decoupled_attn.pth"  # included in model_dir
### Define input data ###
height, width = 768, 512
prompt = "This person is wearing a short-sleeve shirt." # input prompt
input_dict = {
    "appearance": {
        "face": "testset/face_man1.jpg",
        "whole body clothes": "testset/clothes_man1.jpg",
    },
    "mask": {
        "face": "testset/face_man1_mask.jpg",
        "whole body clothes": "testset/clothes_man1_mask.jpg",
    },
    "structure": {"densepose": "testset/densepose_man1.jpg"},
}

⭐️⭐️⭐️ Note that input_dict must contain the keys appearance, mask, and structure. The first two specify the appearance of individual parts taken from multiple reference images, and structure specifies the pose, such as a DensePose map.

⭐️⭐️⭐️ The keys inside these three parts are also constrained. The keys in appearance and mask must be identical and chosen from "upper body clothes", "lower body clothes", "whole body clothes", "hair or headwear", "face", and "shoes". The key in structure must be "densepose". (The openpose model has not been released yet.)
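The key rules above can be checked before running inference. The following is a minimal sketch, not part of the repository: the helper name validate_input_dict and the constants ALLOWED_PARTS and ALLOWED_STRUCTURES are hypothetical, and the check only covers the constraints stated in this README.

```python
# Hypothetical helper (not part of the repo): checks the key rules described above.
ALLOWED_PARTS = {
    "upper body clothes",
    "lower body clothes",
    "whole body clothes",
    "hair or headwear",
    "face",
    "shoes",
}
ALLOWED_STRUCTURES = {"densepose"}


def validate_input_dict(input_dict):
    """Return a list of problems found in input_dict (empty if none)."""
    problems = []
    # All three top-level keys are required.
    for key in ("appearance", "mask", "structure"):
        if key not in input_dict:
            problems.append(f"missing top-level key: {key!r}")
    if problems:
        return problems

    appearance = set(input_dict["appearance"])
    mask = set(input_dict["mask"])
    # appearance and mask must name exactly the same parts.
    if appearance != mask:
        problems.append(
            f"appearance and mask keys differ: {sorted(appearance ^ mask)}"
        )
    # Every part key must be one of the documented choices.
    for part in sorted((appearance | mask) - ALLOWED_PARTS):
        problems.append(f"unknown part key: {part!r}")
    # Only "densepose" is documented as a structure key.
    for name in sorted(set(input_dict["structure"]) - ALLOWED_STRUCTURES):
        problems.append(f"unknown structure key: {name!r}")
    return problems
```

Calling validate_input_dict(input_dict) on the example configuration above should return an empty list; any returned strings describe which rule was violated.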

🔨🔨🔨 To make it easy to obtain the mask for each reference image, we provide tools and explain how to use them in Tools. You can first use Real-ESRGAN to increase the resolution of a reference image, and then use SegFormer to obtain masks of the various human body parts.

😊 Evaluation

For evaluation, first install the additional packages:

pip install git+https://github.com/openai/CLIP.git # for clip
pip install dreamsim # for dreamsim
pip install lpips # for lpips

We provide easy-to-use evaluation scripts in the scripts/evals folder. The scripts take data in a unified format, organized as two lists of images. Modify the image-loading code as needed, and check the scripts for more details.
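One possible way to build the two image lists is sketched below. It assumes generated and reference images live in two separate directories and share file stems; that layout, the function name paired_image_lists, and the constant IMAGE_SUFFIXES are illustrative assumptions, not the interface of the scripts in scripts/evals.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def paired_image_lists(generated_dir, reference_dir):
    """Return two equally long lists of paths, matched by file stem."""

    def index(directory):
        # Map file stem -> path for every image file in the directory.
        return {
            p.stem: p
            for p in sorted(Path(directory).iterdir())
            if p.suffix.lower() in IMAGE_SUFFIXES
        }

    generated = index(generated_dir)
    reference = index(reference_dir)
    # Keep only stems present in both directories, in a stable order.
    shared = sorted(generated.keys() & reference.keys())
    return [generated[s] for s in shared], [reference[s] for s in shared]
```

Images that appear in only one directory are silently dropped here, so it can be worth comparing the list lengths with the directory sizes before computing metrics.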

🔨 Tools

Real-ESRGAN

To restore images with Real-ESRGAN, first download RealESRGAN_x4plus.pth into ./pretrained_weights/Real-ESRGAN. Then run:

python -m scripts.real_esrgan -n RealESRGAN_x4plus -i /path/to/dir -o /path/to/dir --face_enhance

SegFormer

To segment human images with SegFormer and obtain the hat, hair, face, and clothes parts, run:

python scripts/segformer_b2_clothes.py --image-path /path/to/image --output-dir /path/to/dir

Labels: 0: "Background", 1: "Hat", 2: "Hair", 3: "Sunglasses", 4: "Upper-clothes", 5: "Skirt", 6: "Pants", 7: "Dress", 8: "Belt", 9: "Left-shoe", 10: "Right-shoe", 11: "Face", 12: "Left-leg", 13: "Right-leg", 14: "Left-arm", 15: "Right-arm", 16: "Bag", 17: "Scarf"
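To turn a SegFormer label map into a mask for one of the part keys used in input_dict, the label ids can be grouped per part. The sketch below is illustrative only: the grouping in PART_TO_LABEL_IDS is an assumption based on the label names above, not a mapping taken from the repository, and part_mask is a hypothetical helper that works on plain nested lists (with NumPy arrays, np.isin would be the usual choice).

```python
# ASSUMED grouping of SegFormer label ids into Parts2Whole part keys.
# Adjust it if your use case needs a different split (e.g. scarf or belt).
PART_TO_LABEL_IDS = {
    "hair or headwear": {1, 2},          # Hat, Hair
    "face": {11},                        # Face
    "upper body clothes": {4},           # Upper-clothes
    "lower body clothes": {5, 6},        # Skirt, Pants
    "whole body clothes": {4, 5, 6, 7},  # Upper-clothes, Skirt, Pants, Dress
    "shoes": {9, 10},                    # Left-shoe, Right-shoe
}


def part_mask(label_map, part):
    """Return a 0/255 mask (list of rows) selecting the pixels of one part.

    label_map is a 2D list of integer label ids, one per pixel.
    """
    ids = PART_TO_LABEL_IDS[part]
    return [[255 if label in ids else 0 for label in row] for row in label_map]
```

For example, part_mask(label_map, "face") marks every pixel labeled 11 with 255 and all other pixels with 0.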

😭 Limitations

At present, the training data generalizes only moderately well and contains a relatively large proportion of women, so the model's generalization, for example to stylized images, still needs improvement. We are working to improve the model's robustness and capabilities, and we welcome contributions and pull requests from the community.

🤝 Acknowledgement

We thank the authors of the following open-source projects:

diffusers | magic-animate | Moore-AnimateAnyone | DeepFashion-MultiModal | Real-ESRGAN

📎 Citation

If you find this repository useful, please consider citing:

@misc{huang2024parts2whole,
  title={From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation},
  author={Huang, Zehuan and Fan, Hongxing and Wang, Lipeng and Sheng, Lu},
  journal={arXiv preprint arXiv:2404.15267},
  year={2024}
}
