CogVideo

This is the official repo for the paper: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers.

News! The demo for CogVideo is available!

News! The code and model for text-to-video generation is now available! Currently we only supports simplified Chinese input.

CogVideo_samples.mp4

Read our paper CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers on ArXiv for a formal introduction.
Try our demo at https://wudao.aminer.cn/cogvideo/
Run our pretrained models for text-to-video generation. Please use A100 GPU.
Cite our paper if you find our work helpful

@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}

Web Demo

The demo for CogVideo is at https://wudao.aminer.cn/cogvideo/, where you can get hands-on practice on text-to-video generation. The original input is in Chinese.

Generated Samples

Video samples generated by CogVideo. The actual text inputs are in Chinese. Each sample is a 4-second clip of 32 frames, and here we sample 9 frames uniformly for display purposes.

CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.

Getting Started

Setup

Hardware: Linux servers with Nvidia A100s are recommended, but it is also okay to run the pretrained models with smaller --max-inference-batch-size and --batch-size or training smaller models on less powerful GPUs.
Environment: install dependencies via pip install -r requirements.txt.
LocalAttention: Make sure you have CUDA installed and compile the local attention kernel.

git clone https://github.com/Sleepychord/Image-Local-Attention
cd Image-Local-Attention && python setup.py install

Download

Our code will automatically download or detect the models into the path defined by environment variable SAT_HOME. You can also manually download CogVideo-Stage1 and CogVideo-Stage2 and place them under SAT_HOME (with folders named cogvideo-stage1 and cogvideo-stage2)

Text-to-Video Generation

./script/inference_cogvideo_pipeline.sh

Arguments useful in inference are mainly:

--input-source [path or "interactive"]. The path of the input file with one query per line. A CLI would be launched when using "interactive".
--output-path [path]. The folder containing the results.
--batch-size [int]. The number of samples will be generated per query.
--max-inference-batch-size [int]. Maximum batch size per forward. Reduce it if OOM.
--stage1-max-inference-batch-size [int] Maximum batch size per forward in Stage 1. Reduce it if OOM.
--both-stages. Run both stage1 and stage2 sequentially.
--use-guidance-stage1 Use classifier-free guidance in stage1, which is strongly suggested to get better results.

You'd better specify an environment variable SAT_HOME to specify the path to store the downloaded model.

Currently only Chinese input is supported.

billygareth / cogvideo Goto Github PK

cogvideo's Introduction

CogVideo

Web Demo

Generated Samples

Getting Started

Setup

Download

Text-to-Video Generation

cogvideo's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent