
Better Aligning Text-to-Image Models with Human Preference

[teaser figure]

This is the official repository for the paper: Better Aligning Text-to-Image Models with Human Preference. The paper demonstrates that Stable Diffusion can be improved by learning from human preferences. The adapted model is better aligned with user intent and produces images with fewer artifacts, such as weird limbs and faces.

Updates

Human preference dataset

[dataset examples]

The dataset is collected from the Stable Foundation Discord server. We record human choices on images generated with the same prompt but with different random seeds. The compressed dataset can be downloaded from here. Once unzipped, you should get a folder with the following structure:

dataset
---- preference_images/
-------- {instance_id}_{image_id}.jpg
---- preference_train.json
---- preference_test.json

The annotation file, preference_{train/test}.json, is organized as:

[
    {
        'human_preference': int,
        'prompt': str,
        'id': int,
        'file_path': list[str],
        'user_hash': str,
        'contain_name': boolean,
    },
    ...
]

The annotation file contains a list of dicts, one per instance in our dataset. Besides the image paths, prompt, id, and human preference, we also provide a hash of the user id. Prompts that contain person names are flagged by the contain_name field.
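
For example, the annotations can be loaded with a few lines of Python. This is a minimal sketch, assuming the dataset is unzipped to ./dataset and that human_preference indexes the chosen image in file_path, as the field types suggest:

import json
import os

dataset_root = "dataset"  # placeholder: path to the unzipped dataset folder

with open(os.path.join(dataset_root, "preference_train.json")) as f:
    annotations = json.load(f)

instance = annotations[0]
print(instance["prompt"])
# assumption: human_preference is the index of the image the annotator chose
preferred_path = instance["file_path"][instance["human_preference"]]
print(os.path.join(dataset_root, preferred_path))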

Human Preference Classifier

The pretrained human preference classifier can be downloaded from OneDrive. Before running the human preference classifier, please make sure you have set up the CLIP environment as specified in the official repo.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# load the CLIP ViT-L/14 backbone, then overwrite its weights
# with the human preference classifier checkpoint
model, preprocess = clip.load("ViT-L/14", device=device)
params = torch.load("path/to/hpc.pth")['state_dict']
model.load_state_dict(params)

# score two candidate images against the same prompt
image1 = preprocess(Image.open("image1.png")).unsqueeze(0).to(device)
image2 = preprocess(Image.open("image2.png")).unsqueeze(0).to(device)
images = torch.cat([image1, image2], dim=0)
text = clip.tokenize(["your prompt here"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)

    # the human preference score (HPS) is the cosine similarity
    # between the image and prompt embeddings
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    hps = image_features @ text_features.T
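
Since hps holds one score per image for the given prompt, the image preferred by the classifier can be read off directly, e.g.:

# hps has shape (2, 1): one score per image for the single prompt
scores = hps.squeeze(-1)
preferred = scores.argmax().item()
print(f"image{preferred + 1} is preferred (scores: {scores.tolist()})")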

Remember to replace path/to/hpc.pth with the path to the downloaded checkpoint. The training script is based on OpenCLIP; we thank the community for their valuable work. The script will be released soon.

Adapted model

Checkpoint

The LoRA checkpoint of the adapted model can be found here. We also provide the regularization-only model, trained without the guidance of human preferences, here.

Inference

You will need diffusers and PyTorch installed in your environment; please check this blog for details. After that, you can run the following command for inference:

python generate_images.py --unet_weight /path/to/checkpoint.bin --prompts /path/to/prompt_list.json --folder /path/to/output/folder

Note that you need to add 'Weird image. ' to the negative prompt when running inference; the reason is explained in our paper. If you want to run inference with AUTOMATIC1111/stable-diffusion-webui, please check this issue.
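
If you prefer calling diffusers directly instead of generate_images.py, the following is a minimal sketch, assuming checkpoint.bin stores the LoRA weights as a diffusers attention-processor state dict (the base model id and all paths are placeholders):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumption: SD v1.5 base weights
    torch_dtype=torch.float16,
).to("cuda")

# assumption: the checkpoint is a LoRA attention-processor state dict
lora_state_dict = torch.load("/path/to/checkpoint.bin")
pipe.unet.load_attn_procs(lora_state_dict)

image = pipe(
    "your prompt here",
    negative_prompt="Weird image. ",  # the negative prompt noted above
).images[0]
image.save("output.png")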

Gradio demo

  • We also provide a UI built with Gradio for testing our method. Running the following commands in a terminal will launch the demo:
    # install dependencies
    pip install -r gradio_requirements.txt
    python app_gradio.py
    
  • This demo is also hosted on Hugging Face here.

Training

Please refer to the paper for the training details. The training script will be released soon.

Visualizations

[visualization figures: vis1, vis2]

Citation

If you find the work helpful, please cite our paper:

@misc{wu2023better,
      title={Better Aligning Text-to-Image Models with Human Preference}, 
      author={Xiaoshi Wu and Keqiang Sun and Feng Zhu and Rui Zhao and Hongsheng Li},
      year={2023},
      eprint={2303.14420},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
