cogview2's Introduction

Generate vivid Images for Chinese / English text

CogView2 is a hierarchical transformer (6B-9B-9B parameters) for general-domain text-to-image generation. This implementation is based on the SwissArmyTransformer library (v0.2).

@article{ding2022cogview2,
  title={CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers},
  author={Ding, Ming and Zheng, Wendi and Hong, Wenyi and Tang, Jie},
  journal={arXiv preprint arXiv:2204.14217},
  year={2022}
}

Web Demo

  • Thanks to the Hugging Face team for integrating CogView2 into Hugging Face Spaces 🤗 using Gradio. Try out the web demo: Hugging Face Spaces

  • Thanks to the Replicate team for deploying a web demo! Try it at Replicate.

Getting Started

Setup

  • Hardware: Linux servers with Nvidia A100s are recommended, but it is also fine to run the pretrained models with a smaller --max-inference-batch-size or to train smaller models on less powerful GPUs.
  • Environment: install dependencies via pip install -r requirements.txt.
  • LocalAttention: Make sure you have CUDA installed and compile the local attention kernel.
git clone https://github.com/Sleepychord/Image-Local-Attention
cd Image-Local-Attention && python setup.py install

If you don't install this kernel, you can still run the first stage (20*20 tokens) of text-to-image generation via --only-first-stage.

Download

Our code will automatically download or detect the models in the path defined by the environment variable SAT_HOME. You can download them from here and place them (folders named coglm/cogview2-dsr/cogview2-itersr) under SAT_HOME.
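
If you download the archives manually, a minimal setup might look like the following sketch (the path is illustrative, not a required value):

# Hypothetical example: unzip the three model archives and point SAT_HOME at their parent folder.
export SAT_HOME=/path/to/sat_models
ls $SAT_HOME
# coglm  cogview2-dsr  cogview2-itersr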

Text-to-Image Generation

./text2image.sh --input-source input.txt

The main arguments useful for inference are:

  • --input-source [path or "interactive"]. The path of the input file; it can also be "interactive", which will launch a CLI.
  • --output-path [path]. The folder containing the results.
  • --batch-size [int]. The number of samples to generate per query.
  • --max-inference-batch-size [int]. Maximum batch size per forward pass. Reduce it if you hit OOM.
  • --debug. Only save concatenated images for all generated samples, and name them by input text and date.
  • --with-id. When toggled, you must specify an "id" before each input, e.g. 001\t一个漂亮的女孩 ("a beautiful girl"), with \t denoting TAB (NOT space). For each input it will generate batch-size separate images in a folder named after the "id". Conflicts with --debug.
  • --device [int]. Which GPU to run on.
  • --inverse-prompt. Sort the generated images by the perplexity of recovering the original text from each image.
  • --only-first-stage. Only run the first stage (20*20 tokens), without super-resolution.
  • --style. The style of the generated images, choices=['none', 'mainbody', 'photo', 'flat', 'comics', 'oil', 'sketch', 'isometric', 'chinese', 'watercolor']. The default style is mainbody, usually an isolated object on a white background.

You should set the environment variable SAT_HOME to specify the path where the downloaded models are stored.

Chinese input is usually much better than English input.
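
For instance, a minimal run might look like this (the prompt and all flag values below are purely illustrative):

# Write a single Chinese prompt ("a cute kitten wearing a hat") into input.txt, then generate.
echo "一只戴着帽子的可爱小猫" > input.txt
./text2image.sh --input-source input.txt --output-path samples --batch-size 4 --max-inference-batch-size 2 --style photo --device 0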

Text-guided Completion

./text_guided_completion.sh --input-source input_comp.txt

The input format is text image_path h0 w0 h1 w1, where all separators are TAB (NOT space). The image at image_path will be center-cropped to 480*480 pixels, and the square from (h0, w0) to (h1, w1) will be masked. These coordinates range from 0 to 1. The model will fill the square with the object described in text. Please use a square much larger than the desired region.
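
For illustration only, one line of input_comp.txt might look like this, with "a cat wearing sunglasses" as the text (the image path and coordinates are made up, and <TAB> stands for a literal TAB character):

一只戴着墨镜的猫<TAB>photos/street.jpg<TAB>0.25<TAB>0.25<TAB>0.75<TAB>0.75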
(Figure: the text-guided completion pipeline)

Gallery

(Figure: more generated samples)

cogview2's People

Contributors

ak391, chenxwh, dm-thu, sleepychord

cogview2's Issues

Some questions about finetuning

Hi,
I would like to finetune CogView2 on my own dataset. I used cogdata to process the dataset with a JSON file {'img1': 'text1', 'img2': 'text2', ...} and a tar file containing the images whose names are the keys of the JSON file.

When I run pretrain_coglm.py, an error occurs:

File "pretrain_coglm.py", line 210, in forward_step
    tokens, position_ids, labels, attention_mask, loss_mask = get_batch(
  File "pretrain_coglm.py", line 61, in get_batch
    raise ValueError('temporally not support pure image samples')

I commented out the raise line and encountered another error:

File "pretrain_coglm.py", line 214, in forward_step
    logits, *mems = model(tokens, position_ids, attention_mask)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/torch/nn/modules/mo
dule.py", line 1102, in _call_impl                                                                
    return forward_call(*input, **kwargs)                                                         
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/deepspeed/utils/nvt
x.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1568, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/SwissArmyTransformer/model/base_model.py", line 111, in forward
    return self.transformer(*args, **kwargs)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/SwissArmyTransformer/model/transformer.py", line 411, in forward
    hidden_states = HOOKS_DEFAULT['word_embedding_forward'](self, input_ids, output_cross_layer=output_cross_layer,**kw_args)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/SwissArmyTransforme
r/transformer_defaults.py", line 117, in word_embedding_forward_default
    return self.transformer.word_embeddings(input_ids)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/torch/nn/modules/mo
dule.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/SwissArmyTransforme
r/mpu/layers.py", line 121, in forward
    output_parallel = F.embedding(masked_input, self.weight,
  File "/home/xinpeng/miniconda3/envs/cogview_py38/lib/python3.8/site-packages/torch/nn/functional
.py", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D

My environment is as follows:

  • python==3.8.0
  • torch==1.10.0+cu111
  • ipython==7.21.0
  • deepspeed==0.6.3
  • SwissArmyTransformer==0.2.1
  • icetk==0.0.7
  • sentencepiece==0.1.98

Additionally, I would like to know if there is a command to resume the checkpoint you provided for further fine-tuning. How can I do that?

Thank you!

ValueError: temporally not support pure image samples

When I run pretrain_coglm.py using my own dataset made by cogdata, this error occurs while training the fifth epoch.
The setting of cogdata is: cogdata create_task --description test --task_type IcetkImageTextTask --saver_type BinarySaver --length_per_sample 512 --img_sizes 256 --txt_len 111 --dtype int32 --model_path="/home/.icetk_models" test_task

ask for suggestions on "RuntimeError: CUDA out of memory. Tried to allocate..." error

Hi! I am new to PyTorch and would like to try this fantastic text-to-image project. I just tried to clone the repo and run text2image.sh for prediction only (not training) on different GPUs, like 1 x V100-32G or 2 x 3090 Ti 24G, but both throw this "CUDA out of memory" error. I also tried reducing both batch-size and max-inference-batch-size down to 1, but it still doesn't work.

So, any suggestions on that issue other than moving to higher-performance GPUs like an A100 or RTX A6000? For example, is it possible to change some configs to run prediction only on the CPU, to provide a smaller model .pt file, or to modify some part of the code/config to fully use 2 x 24G GPUs (currently only one is used during prediction)?

Thanks!

invalid syntax SwissArmyTransformer

getting this error when running

text2image.sh --input-source input.txt

Traceback (most recent call last):
  File "cogview2_text2image.py", line 20, in <module>
    from SwissArmyTransformer.model import CachedAutoregressiveModel
  File "/usr/local/lib/python3.7/dist-packages/SwissArmyTransformer/model/__init__.py", line 5, in <module>
    from .encoder_decoder_model import EncoderDecoderModel
  File "/usr/local/lib/python3.7/dist-packages/SwissArmyTransformer/model/encoder_decoder_model.py", line 85
    return encoder_outputs, decoder_outputs, *mems
                                             ^
SyntaxError: invalid syntax

using SwissArmyTransformer==0.2.1

zipfile.BadZipFile: File is not a zip file

When running the final inference with ./text2image.sh --input-source input.txt, a zipfile.BadZipFile: File is not a zip file error is raised unless the --only-first-stage argument is added.

Full error details:

WARNING: No training data specified
using world size: 1 and model-parallel size: 1

initializing model parallel with size 1
building InferenceModel model ...
number of parameters on model parallel rank 0: 5902307328
global rank 0 is loading checkpoint /sharefs/cogview-new/coglm/432000/mp_rank_00_model_states.pt
successfully loaded /sharefs/cogview-new/coglm/432000/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "/home/dell/gxc/txt2image/CogView2-main/CogView2-main/cogview2_text2image.py", line 233, in <module>
    main(args)
  File "/home/dell/gxc/txt2image/CogView2-main/CogView2-main/cogview2_text2image.py", line 58, in main
    srg = SRGroup(args)
  File "/home/dell/gxc/txt2image/CogView2-main/CogView2-main/sr_pipeline/sr_group.py", line 24, in __init__
    dsr_path = auto_create('cogview2-dsr', path=home_path)
  File "/home/dell/anaconda3/envs/videoimage/lib/python3.9/site-packages/SwissArmyTransformer/resources/download.py", line 49, in auto_create
    f = zipfile.ZipFile(file_path, 'r')
  File "/home/dell/anaconda3/envs/videoimage/lib/python3.9/zipfile.py", line 1266, in __init__
    self._RealGetContents()
  File "/home/dell/anaconda3/envs/videoimage/lib/python3.9/zipfile.py", line 1333, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

The model was downloaded automatically by the code. I also checked it, and it is in zip format.

Fix README

This part right here is incorrect:

...and place them (folders named coglm/dsr/itersr) under SAT_HOME.

dsr and itersr should be renamed to cogview2-dsr and cogview2-itersr respectively if you choose to download the files manually.

PermissionError: [Errno 13] Permission denied: '/sharefs'

(py3.8) 202312150037@ubuntu:/data/zht/learn_pytorch/CogView2-main$ ./text2image.sh --input-source input.txt
[2024-04-03 09:01:21,408] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
WARNING: No training data specified
using world size: 1 and model-parallel size: 1

initializing model parallel with size 1
Traceback (most recent call last):
  File "cogview2_text2image.py", line 233, in <module>
    main(args)
  File "cogview2_text2image.py", line 48, in main
    model, args = InferenceModel.from_pretrained(args, 'coglm')
  File "/home/202312150037/anaconda3/envs/py3.8/lib/python3.8/site-packages/SwissArmyTransformer/model/base_model.py", line 152, in from_pretrained
    model_path = auto_create(name, path=home_path, url=url)
  File "/home/202312150037/anaconda3/envs/py3.8/lib/python3.8/site-packages/SwissArmyTransformer/resources/download.py", line 36, in auto_create
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
  File "/home/202312150037/anaconda3/envs/py3.8/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/202312150037/anaconda3/envs/py3.8/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/sharefs'

InferenceModel class do not have from_pretrained()

I built CogView2 on Colab. When I run your script, I get this error:

The InferenceModel class does not have from_pretrained().

!./text2image.sh --input-source input.txt
WARNING: No training data specified
using world size: 1 and model-parallel size: 1 
Traceback (most recent call last):
  File "cogview2_text2image.py", line 233, in <module>
    main(args)
  File "cogview2_text2image.py", line 48, in main
    model, args = InferenceModel.from_pretrained(args, 'coglm')
AttributeError: type object 'InferenceModel' has no attribute 'from_pretrained'

Are there any omissions in the committed sources?

CUDA out of memory

Hi, is there any way to run the three stages on three separate GPUs? 24G of memory still hits an out-of-memory error.

Model download and GPU

I clicked on the hyperlinked "here" to download the model and it times out. My plan is to make a quick Colab notebook. Related: for the GPU, you say to use an A100. With Google's Pro+ account, I usually get a V100 with 16 GB of memory. Is this enough? I do occasionally get an A100, but not often.

New error message: Prediction was canceled.

I have not seen this one before: Prediction was canceled.

I press the Submit button, it turns light grey with a spinning circle. Error message appears in the right panel. I press the cancel button but the message remains.

I have to refresh the page to get out of a loop that appears to do nothing.

Save individual frames

Hi, I was wondering how to save individual frames when using the --only-first-stage flag.
Thanks!

Checkpoint download error

When I run:

from baai_modelhub import AutoPull
auto_pull = AutoPull()
auto_pull.get_model(model_name='cogview2-ch',
                    model_save_path='./checkpoints/'
                    )

It returns an error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='120.131.5.115', port=8080): Max retries exceeded with url: /api/downloadFromCode (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c360d85b0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

Image variations

Hi! Great research and implementation! Is it possible to make variations of an input image? As I read the paper, this should be possible. Maybe you can suggest how to do it?

Replicate CUDA out of memory

Replicate fails after generating some images:

CUDA out of memory. Tried to allocate 1.22 GiB (GPU 0; 39.59 GiB total capacity; 30.86 GiB already allocated; 1.07 GiB free; 34.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
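
For what it's worth, the allocator hint in that message can be tried by setting the environment variable before launching; the value below is only an illustrative starting point, not a verified fix for this issue:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128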

Loading coglm with pytorch 2.1.2+cu121 takes too long

Environment:
Driver Version: 535.129.03, CUDA Version: 12.2
sat: 0.4.9 (the latest version of sat)

Code used to test the speed of loading coglm:
(screenshot of the test code, not reproduced here)

Results:
torch version: 1.13.1+cu117
model load time: 209.14 s

torch version: 2.1.2+cu121
model load time: 527.55 s

Another question: is there a script for generating images with the Hugging Face coglm weights? I see that the Hugging Face weights are available at https://huggingface.co/THUDM/CogView2

web demo

Hi, thanks for releasing the code and models, would you be interested in creating a web demo for CogView2 using Gradio on Hugging Face?

The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. Models, datasets, and Spaces (web demos) can be added to a user account or organization, similar to GitHub.

Here is an example Gradio demo for dalle-mini: https://huggingface.co/spaces/dalle-mini/dalle-mini

and another example for a CVPR 2022 paper: https://huggingface.co/spaces/CVPR/ml-talking-face

and here is a guide for adding a web demo to the organization: https://huggingface.co/blog/gradio-spaces

Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.

`cd Image-Local-Attention && python setup.py install` does not work

Thanks for uploading the code. I am able to run the inference code using --only-first-stage. However, I'm unable to run the whole model because the local attention kernel installation does not work.

Previously, it gave me an error saying my GCC version was too old (it needs to be 5 or higher, and 8 or lower).
After I updated GCC, it throws a long error trace with the two main errors being:

FAILED: ./CogView2/Image-Local-Attention/build/temp.linux-x86_64-3.10/src/weighting.o

raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

Could you specify which gcc version you used?

Uploading pretrained models to Hugging Face Hub

First of all, awesome work, and thank you for making the pretrained models publicly available.

My question is about redistribution of the models. Is it OK to make your pretrained models publicly available through Hugging Face Hub?

I've made this Gradio demo app for CogView2 and am currently working on making the app available on Hugging Face Spaces. To speed up the download time, I've uploaded the models to a private repo. Since the repo is private, I'm the only one who can download them from it for now. But I think it would be appreciated if they were made publicly available as a second download source.

which coglm.zip should be downloaded?

You can download from [here](https://model.baai.ac.cn/model-detail/100041) and place them (folders named coglm/cogview2-dsr/cogview2-itersr) under SAT_HOME.

There are 3 zip files. The script says it should download coglm.zip, but the text above makes it look like we should download cogview2-itersr.zip?

How long should install script run?

It's been 11 hours and it's still going. Is the model download really this large, or is something wrong? Is there a verbose install option so we can see what's happening? I'm afraid to start over and wait another 11 hours.

If this is unusual, I will stop the process and try again.

ValueError('could not find the metadata file')

    raise ValueError('could not find the metadata file {}, please check --load'.format(
ValueError: could not find the metadata file /dev/shm/sharefs/cogview2-itersr/latest, please check --load

Can't run on GPU

Hello! The work looks awesome! But I can't run it on my GPU. I have an Nvidia 3060 with 12 GB of memory but get RuntimeError: CUDA out of memory. I tried using --max-inference-batch-size 1 and --only-first-stage but it doesn't help. Can anyone help me, or tell me how to run it on the CPU?

Suggestions for roadmap

Hi CogView team. First off, great work! This method's results are very impressive. I just wanted to post some observations that might help inform the future roadmap.

  1. Generations tend to include watermarks and other artifacts of online images. One common one I see is a white bar at the bottom with black pseudo-text (example images: bar_example, watermark_example).

  2. Add support for non-square inpainting/replacement boundaries.

  3. Hands and arms seem to be deformed, have extra fingers, etc. I suggest adding more data with pictures of hands to help the model fix those issues.

Congrats on the paper and keep up the good work!

Add requirements.txt

The readme mentions that you should pip install -r requirements.txt, but this file isn't in the repo. It would be great if we could add that 😄
