
LLaVA-Phi: Small Multi-Modal Assistant

LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model [Paper]

Release

[1/15] Our model and training code are released.

[1/5] Our code is currently undergoing an internal review and will be released shortly (expected next week).

Contents

  • Install
  • LLaVA-Phi Weights
  • Demo
  • Train
  • Evaluation

Install

  1. Clone this repository and navigate to the llava-phi folder
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
  2. Install the package
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

LLaVA-Phi Weights

#Todo

Demo

#Todo

Train

Below is the latest training configuration for LLaVA v1.5. For legacy models, please refer to the README of that version for now; we will add them to a separate doc later.

LLaVA training consists of two stages: (1) feature alignment stage: use our 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data (with VQA data from academic-oriented tasks) to teach the model to follow multimodal instructions.

LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
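
For a concrete sketch of the arithmetic (the specific values here are illustrative assumptions, not the exact flags in our release scripts): the pretraining global batch size of 256 can be reached on 8 GPUs with per_device_train_batch_size 32 and gradient_accumulation_steps 1 (32 x 1 x 8 = 256); to keep the same global batch size on 4 GPUs, double the accumulation steps (32 x 2 x 4 = 256).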

Hyperparameters

We use a similar set of hyperparameters as Vicuna in finetuning. The hyperparameters used in pretraining and finetuning are provided below. Note that they may not match those reported in the arXiv paper, as this is an ongoing project and we make frequent changes to the code.

  1. Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| LLaVA-Phi | 256 | 1e-3 | 1 | 2048 | 0 |

  2. Finetuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| LLaVA-Phi | 128 | 2e-5 | 1 | 2048 | 0 |

Download base checkpoints

Our base model is phi-2; you should download the weights from here.

Integrate the model

Pretrain (feature alignment)

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.

Training script with DeepSpeed ZeRO-2: pretrain.sh.

  • --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
  • --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
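
For orientation, here is a minimal sketch of what the ZeRO-2 pretraining launch could look like. The script path, config file names, checkpoint locations, and batch-size values below are assumptions for illustration only; follow pretrain.sh in this repo for the exact command.

# Feature alignment: train only the vision-language projector while the
# vision encoder and the LLM stay frozen (paths and values are illustrative).
deepspeed llava_phi/train/train.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/phi-2 \
    --data_path ./playground/data/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --model_max_length 2048 \
    --output_dir ./checkpoints/llava-phi-pretrain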

Visual Instruction Tuning

  1. Prepare data

Please download the annotation file of our final instruction tuning data mixture, llava_v1_5_mix665k.json, and download the images from the constituent datasets: COCO (train2017), GQA, OCR-VQA, TextVQA, and Visual Genome (VG_100K and VG_100K_2).

After downloading all of them, organize the data as follows in ./playground/data:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
  2. Start training!

You may download our pretrained projectors from the Model Zoo. It is not recommended to use legacy projectors, as they may have been trained with a different version of the codebase; if any option is mismatched, the model will not function or train as expected.

Training script with DeepSpeed ZeRO-3: finetune.sh.

New options to note:

  • --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
  • --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
  • --image_aspect_ratio pad: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.
  • --group_by_modality_length True: this should only be used when your instruction tuning dataset contains both language data (e.g. ShareGPT) and multimodal data (e.g. LLaVA-Instruct). It makes the training sampler draw only a single modality (either image or language) per batch, which we observe speeds up training by ~25% and does not affect the final outcome.
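
Similarly, here is a minimal sketch of a ZeRO-3 finetuning launch. The script path, config names, checkpoint paths, and batch-size values are again assumptions for illustration; follow finetune.sh for the exact command.

# Visual instruction tuning: 16 per-device x 1 accumulation x 8 GPUs = 128 global batch size.
deepspeed llava_phi/train/train.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ./checkpoints/phi-2 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-phi-pretrain/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --model_max_length 2048 \
    --output_dir ./checkpoints/llava-phi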

Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding.

See Evaluation.md.
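
Concretely, greedy decoding just means sampling is disabled at generation time (temperature 0, no nucleus or top-k sampling). As a purely hypothetical illustration (the module path, flags, and file names below are assumptions, not necessarily this repo's evaluation entry point; see Evaluation.md for the actual commands):

python -m llava_phi.eval.model_vqa \
    --model-path ./checkpoints/llava-phi \
    --question-file ./playground/data/eval/questions.jsonl \
    --answers-file ./playground/data/eval/answers.jsonl \
    --temperature 0  # hypothetical invocation; temperature 0 => greedy decoding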

Usage and License Notices

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. This project is licensed permissively under the Apache 2.0 license and does not impose any additional constraints.

Citation

If you find LLaVA-Phi useful for your research and applications, please cite using this BibTeX:

@article{zhu2024llava,
  title={LLaVA-$\phi$: Efficient Multi-Modal Assistant with Small Language Model},
  author={Zhu, Yichen and Zhu, Minjie and Liu, Ning and Ou, Zhicai and Mou, Xiaofeng and Tang, Jian},
  journal={arXiv preprint arXiv:2401.02330},
  year={2024}
}

Acknowledgement

We build our project based on

  • LLaVA: an amazing open-source project for vision-language assistants
  • LLaMA-Factory: we use this codebase to finetune the Phi model
