
VoCo-LLaMA: Towards Vision Compression with Large Language Models

Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

TL;DR

We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By fully utilizing the way LLMs understand vision tokens, our method can compress hundreds of vision tokens into a single VoCo token while minimizing visual information loss.

VoCo-LLaMA demonstrates the ability to understand video through continual training on time-series compressed token sequences of video frames.

VoCo-LLaMA presents a promising way to unlock the full potential of VLMs' context window.
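
Conceptually, the compression works by modifying the attention mask: text tokens are blocked from attending to the raw vision tokens and can reach the visual information only through the VoCo token, which forces that token to absorb it during training. Below is a minimal, hedged sketch of such a mask; the function name and token counts are illustrative, not the released implementation.

import torch

def voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a sequence laid out
    as [vision tokens | VoCo tokens | text tokens]."""
    n = n_vision + n_voco + n_text
    mask = torch.ones(n, n).tril().bool()    # standard causal mask
    text_start = n_vision + n_voco
    mask[text_start:, :n_vision] = False     # cut direct text -> vision attention
    return mask

# e.g. 576 CLIP patch tokens squeezed behind a single VoCo token
mask = voco_attention_mask(n_vision=576, n_voco=1, n_text=32)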


News

  • [2024/06/17] Upload paper and release vision compression code.

Preparation

Install

  1. Clone this repository and navigate to the VoCo-LLaMA folder
git clone https://github.com/Yxxxb/VoCo-LLaMA.git
cd VoCo-LLaMA
  2. Install the package
conda create -n voco_llama python=3.10 -y
conda activate voco_llama
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training and patch the attention-mask utilities of transformers (adjust the destination to your own environment path; see the helper below)
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
cp VoCo-LLaMA/llava/model/language_model/cache_py/modeling_attn_mask_utils.py /data/miniconda3/envs/voco_llama/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py
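
The destination of the cp command above assumes a specific Miniconda location. A small helper (an assumption about your setup, not part of the repo) to print the correct target inside the active environment:

# Print where transformers lives in the active environment, so the cp
# command can target the right modeling_attn_mask_utils.py.
import os
import transformers

print(os.path.join(os.path.dirname(transformers.__file__),
                   "modeling_attn_mask_utils.py"))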

Data and Pre-trained weights

VoCo-LLaMA training requires only visual instruction fine-tuning. Please download the aligned LLaVA checkpoints (base LLM and projection layers), the annotation file of the LLaVA instruction tuning data, llava_v1_5_mix665k.json, and the images from the constituent datasets (COCO, GQA, OCR-VQA, TextVQA, and Visual Genome).

After downloading all of them, organize the data as follows in ./playground/data:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
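
An optional sanity check (assuming the layout above) that the expected image folders are in place:

from pathlib import Path

# Expected sub-folders, taken from the tree above.
root = Path("./playground/data")
for sub in ["coco/train2017", "gqa/images", "ocr_vqa/images",
            "textvqa/train_images", "vg/VG_100K", "vg/VG_100K_2"]:
    status = "ok" if (root / sub).is_dir() else "MISSING"
    print(f"{status:>7}  {root / sub}")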

Train

VoCo-LLaMA is trained on 8 A100 GPUs with 40GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size (per_device_train_batch_size x gradient_accumulation_steps x num_gpus) the same, as in the sketch below.
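
For example, with hypothetical numbers (check scripts/finetune.sh for the actual defaults):

# Scaling from 8 GPUs down to 4 while keeping the global batch size fixed.
per_device, accum, gpus = 16, 1, 8
global_bs = per_device * accum * gpus        # 128

gpus = 4                                     # fewer GPUs ...
accum = global_bs // (per_device * gpus)     # ... so accumulation becomes 2
assert per_device * accum * gpus == global_bs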

Train VoCo-LLaMA with vision instruction tuning by running the following command:

bash scripts/finetune.sh

Evaluation

For the evaluations of visual understanding, we follow the relevant settings in LLaVA. Please refer to the official LLaVA repository for details of data setup and testing.

Citation

If you find this work useful, please consider citing our paper:

@article{ye2024voco,
  author={Ye, Xubing and Gan, Yukang and Huang, Xiaoke and Ge, Yixiao and Shan, Ying and Tang, Yansong},
  title={{VoCo-LLaMA: Towards Vision Compression with Large Language Models}},
  journal={arXiv preprint arXiv:2406.12275},
  year={2024},
}

Acknowledgement

  • LLaVA: the codebase we built upon.
  • Vicuna: our base model, Vicuna-7B, with its amazing language capabilities!
