
umoe-scaling-unified-multimodal-llms's Introduction

If you appreciate our project, please consider giving us a star ⭐ on GitHub to stay updated with the latest developments.

🚀 Welcome to the repo of Uni-MoE!

Uni-MoE is a MoE-based unified multimodal model that can handle diverse modalities, including audio, speech, image, text, and video.

🤗 Hugging Face · Project Page · Demo · Paper

Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang

🔥 News

  • [5/31] 🔥 The checkpoint of Uni-MoE-v2 with 8 experts is now available for download and inference. For more details, please refer to the Uni_MoE_v2_weights table.
  • [4/28] 🔥 We have upgraded the Uni-MoE codebase to support training across multiple nodes and GPUs. Explore this functionality in our revamped fine-tuning script. We have also introduced a version that integrates distributed MoE modules, allowing the model to be trained with parallel processing at both the expert and modality levels for better efficiency and scalability. For more details, please refer to the Uni_MoE_v2 documentation.
  • [3/7] 🔥 We released Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts. We propose a unified Multimodal LLM (MLLM) built on the MoE framework that can process diverse modalities, including audio, image, text, and video. Check out the paper and demo.

Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA and Vicuna. The dataset and the models trained on it should not be used outside of research purposes.

🎨 Case Show

📀 Demo Video

Demo 2 shows real-time speech understanding (starting at 30 s).

demo1.mp4
demo2.mp4

🌟 Structure

The model architecture of Uni-MoE is shown below. Training proceeds in three stages: 1) utilize pairs from different modalities and languages to build connectors that map these inputs into a unified language space, establishing a foundation for multimodal understanding; 2) develop modality-specific experts using cross-modal data to ensure deep understanding, preparing for a cohesive multi-expert model; 3) incorporate the multiple trained experts into the LLM and refine the unified multimodal model with the LoRA technique on mixed multimodal data.
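The sketch below illustrates the core MoE idea behind this architecture: a lightweight gate routes each token to a small number of expert feed-forward networks and mixes their outputs. It is a minimal, self-contained PyTorch example with assumed sizes and a top-2 router, not the released Uni-MoE implementation (which additionally uses LoRA-based experts and distributed routing).

# Minimal illustration of sparse MoE routing with PyTorch (a simplified sketch under assumed
# hyperparameters; not the released Uni-MoE code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, hidden=64, ffn=128, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, n_experts, bias=False)  # lightweight token router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)  # stand-ins for modality-specific experts
        ])

    def forward(self, x):  # x: (tokens, hidden)
        weights, idx = torch.topk(F.softmax(self.gate(x), dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    block = SparseMoEBlock()
    print(block(torch.randn(10, 64)).shape)  # torch.Size([10, 64])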

โšก๏ธ Install

The following instructions are for Linux. We recommend the following requirements:

  • Python == 3.9.16
  • CUDA Version >= 11.7

  1. Clone this repository and navigate to the Uni-MoE folder:
git clone https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.git
cd UMOE-Scaling-Unified-Multimodal-LLMs/Uni_MoE
  2. Install the package:
conda create -n unimoe python==3.9.16
conda activate unimoe
pip install -r env.txt
  3. Replace all the absolute pathnames '/path/to/' with your specific path to the Uni-MoE folder (including all the eval_x.py/inference_x.py/train_mem_x.py/data.py/demo.py files and the config.json files from the model weights); a helper sketch for this step follows below.
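For step 3, a small helper like the sketch below can rewrite the '/path/to/' placeholders in bulk. It is only an illustration (not part of the repository), and REAL_ROOT is an assumed location for your clone, so review the printed files before trusting it.

# Helper sketch: rewrite the '/path/to/' placeholders in the listed scripts and in the
# config.json files shipped with the weights. REAL_ROOT is an assumption; adjust it.
from pathlib import Path

PLACEHOLDER = "/path/to/"
REAL_ROOT = "/home/user/UMOE-Scaling-Unified-Multimodal-LLMs/"  # assumed clone location

def patch(root):
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".json", ".sh"}:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if PLACEHOLDER in text:
            path.write_text(text.replace(PLACEHOLDER, REAL_ROOT), encoding="utf-8")
            print(f"patched {path}")

if __name__ == "__main__":
    patch(REAL_ROOT)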

โšก๏ธ Uni-MOE Weights

To use our model, all weights should be downloaded.

After downloading all of them, organize the weights in the 'Uni_MoE/checkpoint' folder as follows:

└── checkpoint
    ├── Uni-MoE-audio-base
    ├── Uni-MoE-audio-e2
    ├── Uni-MoE-speech-base
    ├── Uni-MoE-speech-e2
    ├── Uni-MoE-speech-base-interval
    ├── Uni-MoE-speech-v1.5
    ├── clip-vit-large-patch14-336
    ├── whisper-small
    └── BEATs_iter3_plus_AS2M.pt
Model                                 Checkpoint
vision encoder                        CLIP ViT-L/14 336px
speech encoder                        whisper small
audio encoder                         BEATs_iter3+ (AS2M)
Uni-MoE-audio-base-model              Uni-MoE/Uni-MoE-audio-base
Uni-MoE-audio-fine-tuned-checkpoint   Uni-MoE/Uni-MoE-audio-e2
Uni-MoE-speech-base-model             Uni-MoE/Uni-MoE-speech-base
Uni-MoE-speech-fine-tuned-checkpoint  Uni-MoE/Uni-MoE-speech-e2
Uni-MoE-speech-base-interval          Uni-MoE/Uni-MoE-speech-base-interval
Uni-MoE-speech-v1.5                   Uni-MoE/Uni-MoE-speech-v1.5
  • Uni-MoE-speech refers to MOE-Task2 and Uni-MoE-audio refers to MOE-Task3 in our paper.
  • 'Uni-MoE-base' is the backbone containing the LLM and the parameters trained in Training Stage 2: Training Modality-Specific Experts.
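Before running inference, a quick sanity check like the following sketch (not a repository script) can confirm that every expected entry exists under Uni_MoE/checkpoint:

# Sanity-check sketch: verify that the expected checkpoint entries are in place.
from pathlib import Path

EXPECTED = [
    "Uni-MoE-audio-base", "Uni-MoE-audio-e2",
    "Uni-MoE-speech-base", "Uni-MoE-speech-e2",
    "Uni-MoE-speech-base-interval", "Uni-MoE-speech-v1.5",
    "clip-vit-large-patch14-336", "whisper-small",
    "BEATs_iter3_plus_AS2M.pt",
]

def check(checkpoint_dir="Uni_MoE/checkpoint"):
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED if not (root / name).exists()]
    if missing:
        raise SystemExit(f"missing checkpoint entries: {missing}")
    print("all expected checkpoint entries found")

if __name__ == "__main__":
    check()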

๐Ÿ—๏ธ Dataset

Training Data

DataSet                   Type
LLaVA-Instruct-150K       image (train2014)
Video-Instruct-Dataset    video (from YouTube)
WavCaps                   audio
AudioCaps                 audio (Cap)
ClothoAQA                 audio (QA)
ClothoV1                  audio (Cap)
MELD                      audio (Music)
RACE                      Speech (TTS)
LibriSpeech               Speech (Long)

We use TTS techniques to convert long text into speech in order to construct long-speech understanding data.
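As an illustration of this construction step, the sketch below converts a text passage into a WAV file with the open-source Coqui TTS package; the package, model name, and file paths are assumptions for demonstration and not necessarily the pipeline used to build the released data.

# Illustrative only: synthesize speech from a long text passage with an off-the-shelf TTS model.
from TTS.api import TTS  # pip install TTS (Coqui TTS)

def text_to_wav(passage, out_path="race_sample.wav"):
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")  # assumed English voice
    tts.tts_to_file(text=passage, file_path=out_path)

if __name__ == "__main__":
    text_to_wav("Last weekend the students visited the science museum and ...")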

Evaluation Data

DataSet                          Input Type
AOKVQA                           Text-Image
OKVQA                            Text-Image
VQAv2                            Text-Image
ClothoAQA                        Text-Audio
ClothoV1                         Text-Audio
ClothoV2                         Text-Audio
POPE                             Text-Image
TextVQA                          Text-Image
MM-Vet                           Text-Image
SEEDBench(Image)                 Text-Image
MMBench                          Text-Image
MMBench-Audio                    Text-Image-Speech(Long)
English-High-School-Listening    Text-Speech(Long)
RACE                             Text-Speech(Long)
MSVD                             Text-Video-Audio
Activitynet-QA                   Text-Video-Audio

College Entrance English Examination Listening Part

We built a real speech understanding dataset, English-High-School-Listening, to test practical long-speech comprehension. It comprises 150 questions about long audio segments with an average length of 109 seconds, and 50 questions about short audio segments with an average length of 14 seconds.

Experimental Results

🌈 How to infer and deploy your demo

  1. Make sure that all the weights are downloaded and the running environment is set correctly.
  2. Run the inference scripts inference_audio.sh and inference_speech.sh (bash inference_audio.sh, bash inference_speech.sh), or run the following commands directly:
# audio inference
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_audio/inference_all.py

# speech inference
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_speech/inference_all.py

To launch the online demo (we highly recommend launching it with Uni-MoE-speech-v1.5, which requires the base parameters of Uni-MoE-speech-base-interval), run:

cd /path/to/Uni_MoE
conda activate unimoe
python demo/demo.py
python demo/app.py
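If you prefer to drive both inference entry points from Python, a thin wrapper like the sketch below simply shells out to the provided scripts; it assumes the 'unimoe' environment is already active and that the placeholder path has been replaced with your own clone location.

# Convenience sketch (not a repository script): run both inference entry points in turn.
import subprocess
from pathlib import Path

UNI_MOE_ROOT = Path("/path/to/Uni_MoE")  # placeholder path from the README; fill in your own

for script in ("Uni_MoE_audio/inference_all.py", "Uni_MoE_speech/inference_all.py"):
    print(f"running {script}")
    subprocess.run(["python", script], cwd=UNI_MOE_ROOT, check=True)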

🌈 How to train and evaluate on datasets

Training:

  1. Make sure that all the weights are downloaded and the environment is set correctly, especially for the base model.
  2. Our training data can be downloaded from UMOE-Speech-453k.json and UMOE-Cap-453k.json.
  3. Relevant vision and audio files: Dataset
  4. Run the training scripts finetune_audio.sh or finetune_speech.sh (bash finetune_audio.sh, bash finetune_speech.sh); remember to point them at your own training set.
  5. For multi-GPU training, run finetune_speech_dp.sh (bash finetune_speech_dp.sh); again, adjust the training set to your own preference (see the sketch below).
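When adapting the training set, a small helper like the sketch below can carve out a subset of the training JSON before fine-tuning; it assumes the file is a flat JSON array of records and does not inspect individual fields, so adjust it to the actual format.

# Hypothetical helper: take a random subset of a JSON training list before fine-tuning.
import json
import random

def subset(in_path, out_path, n=10000, seed=0):
    with open(in_path, "r", encoding="utf-8") as f:
        records = json.load(f)  # assumed to be a list of training records
    random.Random(seed).shuffle(records)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records[:n], f, ensure_ascii=False, indent=2)
    print(f"wrote {min(n, len(records))} of {len(records)} records to {out_path}")

if __name__ == "__main__":
    subset("UMOE-Speech-453k.json", "UMOE-Speech-10k.json")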

Evaluation:

  1. Prepare the evaluation set in the same format as samples.json.
  2. Run the evaluation scripts eval_audio.sh or eval_speech.sh (bash eval_audio.sh, bash eval_speech.sh), or run the following commands directly:
# audio evaluation (ClothoV1)
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_audio/eval.py \
 --data_path /path/to/clotho.json \
 --data_type clothov1 \
 --output test.json

# speech evaluation (VQA)
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_speech/eval.py \
 --data_path /path/to/vqa_eval.json \
 --data_type vqa \
 --output test.json
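To summarize the evaluation output, something like the sketch below computes a quick exact-match accuracy over test.json; the "prediction" and "answer" keys are assumed field names rather than the repository's documented schema, so adjust them to the actual output format.

# Hypothetical post-processing: exact-match accuracy over an eval output file.
import json

def exact_match_accuracy(path="test.json"):
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)  # assumed list of records with "prediction" and "answer" fields
    hits = sum(
        str(r.get("prediction", "")).strip().lower() == str(r.get("answer", "")).strip().lower()
        for r in records
    )
    return hits / max(len(records), 1)

if __name__ == "__main__":
    print(f"exact-match accuracy: {exact_match_accuracy():.3f}")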

We recommend a GPU with 80 GB of memory to run all experiments.
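A quick way to confirm this recommendation is met is to query the visible device with standard PyTorch calls, as in this convenience sketch:

# Convenience check: report total memory of the visible CUDA device.
import torch

if not torch.cuda.is_available():
    raise SystemExit("no CUDA device visible")
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.0f} GB total memory")
if total_gb < 80:
    print("warning: less than the recommended 80 GB; expect OOM on full experiments")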

Star History

Star History Chart

Citation

If you find Uni-MoE useful for your research and applications, please cite using this BibTeX:

@article{li2024uni,
  title={Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts},
  author={Li, Yunxin and Jiang, Shenyuan and Hu, Baotian and Wang, Longyue and Zhong, Wanqi and Luo, Wenhan and Ma, Lin and Zhang, Min},
  journal={arXiv preprint arXiv:2405.11273},
  year={2024}
}

umoe-scaling-unified-multimodal-llms's People

Contributors

expapa, yunxinli, xychen-hitsz, eltociear


umoe-scaling-unified-multimodal-llms's Issues

demo

demo1.mp4
demo2.mp4

Audio Understanding for Uni-MoE v2

I found that Uni-MoE v2 is not trained on audio understanding tasks and does not use the BEATs audio encoder.

Is Uni-MoE v2 not designed for understanding general audio events, like natural sounds?

Error when running demo.py

When I try to run demo.py on a single H100 (80 GB), I get the following error while loading the model (I downloaded all the models from the requirements and installed all dependencies). Please help me check this issue: @longyuewangdcu @eltociear @YanshekWoo @imryanxu @expapa

While copying the parameter named "base_model.model.model.layers.30.mlp.experts.3.down_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.30.mlp.gate.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.30.mlp.gate.lora_B.default.weight", whose dimensions in the model are torch.Size([4, 8]) and whose dimensions in the checkpoint are torch.Size([4, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.q_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.q_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.k_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.k_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.v_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.o_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 4096]) and whose dimensions in the checkpoint are torch.Size([8, 4096]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
While copying the parameter named "base_model.model.model.layers.31.self_attn.o_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([4096, 8]) and whose dimensions in the checkpoint are torch.Size([4096, 8]), an exception occurred : ('CUDA error: no kernel image is available for execution on the device\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n',).
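This error usually means the installed PyTorch build does not include kernels for the GPU's compute capability (the H100 is sm_90, which requires a recent CUDA-enabled build). As a diagnostic suggestion rather than an official fix, the sketch below prints the relevant versions and supported architectures so the mismatch can be confirmed:

# Diagnostic sketch: check whether the installed PyTorch build targets this GPU.
# "no kernel image is available" typically means the build's CUDA architectures
# do not include the device's compute capability (sm_90 for H100).
import torch

print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("supported archs:", torch.cuda.get_arch_list())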

Clarification on 3-Step Training Approach and Commands for Uni-MoE v2

I like the innovative three-step training approach for training MLLMs. It intrigued me, and I went through the scripts trying to replicate the three-step training technique to train my own model. However, I have a few queries.

  1. Is it possible to replicate all three training steps with the scripts in the uni-moe-v2 folder?
  2. Could you share the command to train uni-moe-v2-speech, since there are only inference and eval scripts?
  3. Relating the three-step training approach to the given model checkpoints: the Uni-MoE 8-expert base is the result of step 1, the Uni-MoE 8-expert experts model is the model after step 2, and the Uni-MoE 8-expert finetune model is the model after step 3. Is my understanding correct?

Can't download audio encoder

Can't download [Fine-tuned BEATs_iter3+ (AS2M)](https://valle.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D)

The link is inaccessible. Here is the error report:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>AuthenticationFailed</Code>
<Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:2fcc9820-701e-003e-1003-b0562d000000 Time:2024-05-27T07:03:18.3397596Z</Message>
<AuthenticationErrorDetail>Signature did not match. String to sign used was rl 2023-03-01T07:51:05Z 2033-03-02T07:51:00Z /blob/valle/share 2020-08-04 c </AuthenticationErrorDetail>
</Error>

Is there a Hugging Face link for this encoder?

Upgrade Uni_MoE to support LLaVA correctly

LLaVA requires Python 3.10 (https://github.com/haotian-liu/LLaVA), and the majority of containers ship only with 3.10.x. Python 3.9.x is not a stable version for the long term. Given all this, it is essential to upgrade Uni_MoE to the 3.10.x ecosystem as soon as possible; it is unusable otherwise.

Errors reported when installing from Uni_MoE without upgrading:

pip install -r env.txt
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining llava from git+https://github.com/haotian-liu/LLaVA.git@e61aa3f88f58f8e871b9c2476d743724e271c776#egg=llava (from -r env.txt (line 83))
  Skipping because already up-to-date.
  Installing build dependencies ... error
  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [7 lines of output]
      Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com, https://pypi.ngc.nvidia.com
      Collecting setuptools>=61.0
        Downloading setuptools-70.1.0-py3-none-any.whl.metadata (6.0 kB)
      Downloading setuptools-70.1.0-py3-none-any.whl (882 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 882.6/882.6 kB 24.1 MB/s eta 0:00:00
      Installing collected packages: setuptools
      ERROR: Cannot set --home and --prefix together
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.
