Emu: An Open Multimodal Generalist

Generative Pretraining in Multimodality

Quan Sun¹*, Qiying Yu²,¹*, Yufeng Cui¹*, Fan Zhang¹*, Xiaosong Zhang¹*, Yueze Wang¹, Hongcheng Gao¹,
Jingjing Liu², Tiejun Huang¹,³, Xinlong Wang¹

¹ BAAI, ² THU, ³ PKU
* Equal Contribution

| Paper | Demo |

Emu is a multimodal generalist that can seamlessly generate images and text in a multimodal context. Emu is trained with a unified autoregressive objective, i.e., predict-the-next-element, where an element is either a visual embedding or a textual token. Trained under this objective, Emu can serve as a generalist interface for both image-to-text and text-to-image tasks.
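
To make the predict-the-next-element idea concrete, here is a minimal sketch of a per-position loss over an interleaved sequence. It is not the training code from this repository. It assumes a cross-entropy loss for textual tokens and a squared-error regression loss for visual embeddings, weighted equally; the function names (`text_loss`, `visual_loss`, `sequence_loss`) and the toy inputs are hypothetical.

```python
import math


def text_loss(logits, target_id):
    """Cross-entropy for one next-token prediction (softmax over the vocabulary)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def visual_loss(pred, target):
    """Mean squared error between a predicted and a target visual embedding."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def sequence_loss(steps):
    """Average the per-position loss over an interleaved sequence.

    Each step is ("text", logits, target_id) or ("image", pred_vec, target_vec).
    Equal weighting of the two loss types is an assumption of this sketch.
    """
    total = 0.0
    for kind, pred, target in steps:
        if kind == "text":
            total += text_loss(pred, target)
        else:
            total += visual_loss(pred, target)
    return total / len(steps)


# Toy interleaved sequence: text token, visual embedding, text token.
steps = [
    ("text", [2.0, 0.5, -1.0], 0),
    ("image", [0.1, 0.4], [0.0, 0.5]),
    ("text", [0.0, 0.0, 0.0], 2),
]
loss = sequence_loss(steps)
```

The point of the sketch is that both modalities share one left-to-right prediction loop; only the per-element loss differs by element type.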

Generalist Interface

Emu serves as a generalist interface capable of diverse multimodal tasks, such as image captioning, image/video question answering, and text-to-image generation, along with new abilities such as in-context text and image generation and image blending.

Setup

Clone this repository and install the required packages:

git clone https://github.com/baaivision/Emu
cd Emu

pip install -r requirements.txt

Model Weights

We release the pretrained and instruction-tuned weights of Emu. Our weights are subject to LLaMA's license.

Model name	Weight
Emu	🤗 HF link (27GB)
Emu-I	🤗 HF link (27GB)

Inference

At present, we provide inference code that takes interleaved image-text input and produces text output.

For the instruction-tuned model, we provide examples for image captioning, visual question answering, and interleaved multi-image understanding:

python inference.py --instruct --ckpt-path $Instruct_CKPT_PATH

For the pretrained model, we provide an example for in-context learning:

python inference.py --ckpt-path $Pretrain_CKPT_PATH
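
If you prefer to launch the two commands above from Python, the following sketch wraps them with `subprocess`. It uses only the `--instruct` and `--ckpt-path` flags shown above; the helper names `build_inference_command` and `run_inference` are hypothetical and not part of this repository, and the script is assumed to be run from the repository root.

```python
import os
import subprocess
import sys


def build_inference_command(ckpt_path, instruct=False, script="inference.py"):
    """Return the argv list for the inference commands shown above."""
    cmd = [sys.executable, script]
    if instruct:
        # Instruction-tuned model: captioning, VQA, multi-image understanding.
        cmd.append("--instruct")
    cmd += ["--ckpt-path", ckpt_path]
    return cmd


def run_inference(ckpt_path, instruct=False):
    """Run inference.py, failing early if the checkpoint path does not exist."""
    if not os.path.exists(ckpt_path):
        raise FileNotFoundError(f"checkpoint not found: {ckpt_path}")
    return subprocess.run(build_inference_command(ckpt_path, instruct), check=True)
```

For example, `run_inference(os.environ["Instruct_CKPT_PATH"], instruct=True)` would mirror the first command, provided that environment variable is set.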

Schedule

We are committed to open-sourcing all Emu-related materials, including:

The weights of Emu and Emu-I
Inference example for interleaved image-text as input, text as output
Video inference example
Weights of image decoder & image generation/blending example
YT-Storyboard-1B pretraining data
Pretraining code
Instruction tuning code
Evaluation code

We hope to foster the growth of our community through open-sourcing and promoting collaboration👬. Let's step towards multimodal intelligence together🍻.

Acknowledgement

We thank the great work from LLaMA, BLIP-2, Stable Diffusion, and FastChat.

Citation

If you find Emu useful for your research and applications, please consider starring this repository and citing:

@article{Emu,
  title={Generative Pretraining in Multimodality},
  author={Sun, Quan and Yu, Qiying and Cui, Yufeng and Zhang, Fan and Zhang, Xiaosong and Wang, Yueze and Gao, Hongcheng and Liu, Jingjing and Huang, Tiejun and Wang, Xinlong},
  journal={arXiv preprint arXiv:2307.05222},
  year={2023},
}
