
lavender's Introduction

[CVPR 2023] LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Paper | Slide | Poster | Video

This repo is the official implementation of the CVPR 2023 paper
"LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling"
by Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang.

We explore a unified video-language framework, LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. This unification leads to a simplified model architecture: only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval, and video captioning. Extensive analyses further demonstrate that LAVENDER can

  • Seamlessly support all downstream tasks with just a single set of parameter values when multi-task finetuned
  • Generalize to various downstream tasks with limited training samples
  • Enable zero-shot evaluation on video question answering tasks
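
To make the unified interface concrete, here is a minimal conceptual sketch in PyTorch (not the official LAVENDER implementation; the vocabulary size, hidden size, and "true"/"false" token ids below are hypothetical placeholders). A single MLM head over the text vocabulary predicts a token at an appended [MASK] position, which covers both video-text matching and question answering.

# Conceptual sketch only, NOT the official LAVENDER code.
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN = 30522, 768           # hypothetical BERT-like sizes
TRUE_ID, FALSE_ID = 2995, 6270            # hypothetical ids of the tokens "true"/"false"

mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)  # the only task head needed on top of the encoder

def predict_at_mask(encoder_out, mask_pos):
    # encoder_out: (batch, seq_len, HIDDEN) multimodal encoder output
    # mask_pos: (batch,) index of the [MASK] token in each sequence
    return mlm_head(encoder_out[torch.arange(encoder_out.size(0)), mask_pos])

# Video-text matching as MLM: append [MASK] and compare the logits of "true" vs. "false".
encoder_out = torch.randn(2, 16, HIDDEN)  # stand-in for real encoder features
mask_pos = torch.tensor([15, 15])         # [MASK] appended at the end of each sequence
logits = predict_at_mask(encoder_out, mask_pos)
vtm_score = logits[:, TRUE_ID] - logits[:, FALSE_ID]

# Open-ended QA as MLM: the vocabulary word with the highest logit at [MASK] is the answer.
answer_token = logits.argmax(dim=-1)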


Requirements

This code is largely based on the official PyTorch implementation of VIOLET and was developed under Python 3.8, PyTorch 1.7, and Torchvision 0.8.
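
Before running anything, you may want to confirm your environment matches these versions; a quick sanity check (not part of the repo) is:

# Environment sanity check (not part of the repo).
import sys
import torch
import torchvision

print("python     :", sys.version.split()[0])   # expected 3.8.x
print("torch      :", torch.__version__)        # expected 1.7.x
print("torchvision:", torchvision.__version__)  # expected 0.8.x
print("cuda ok    :", torch.cuda.is_available())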

Data preprocessing

Copied from VIOLET

Because we rely on external datasets that we cannot redistribute, we provide preprocessing tools to extract sparsely sampled video frames into our compressed format.

cd _tools

# We use 5 frames for both pre-training and downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use file.seek() instead of loading the entire file to reduce memory cost during distributed pre-training
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx

Partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) are provided to help format the input data.
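
The .lineidx file stores one byte offset per line of the matching .tsv, which is what allows the loader to file.seek() directly to a sample. The reader below is a simplified sketch of that idea, not the exact loader used in this repo, and the column layout in the usage comment is an illustrative assumption.

# Simplified sketch of lineidx-based random access (not the repo's actual loader).
class TSVReader:
    def __init__(self, tsv_path, lineidx_path):
        with open(lineidx_path) as f:
            self.offsets = [int(line) for line in f if line.strip()]  # byte offset of each .tsv line
        self.tsv = open(tsv_path, "r")

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        self.tsv.seek(self.offsets[idx])              # jump straight to the requested sample
        return self.tsv.readline().rstrip("\n").split("\t")

# Usage (column layout is an assumption for illustration):
# reader = TSVReader("msrvtt.tsv", "msrvtt.lineidx")
# video_id, *frame_fields = reader[0]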

Pretraining

  • Visit Video Swin Transformer to download the pre-trained weights. Place swin_base_patch244_window877_kinetics*_22k.pth under the ${REPO_DIR}/_models/video_swin_transformer directory. The data structure should follow the hierarchy below.
    ${REPO_DIR}
    |-- _models
    |   |-- video_swin_transformer
    |   |   |-- swin_base_patch244_window877_kinetics600_22k.pth
    |   |   |-- swin_base_patch244_window877_kinetics400_22k.pth
    |-- _args
    |-- _datasets
    |-- _imgs
    |-- ...
    |-- ...
    
  • Download pretraining datasets (WebVid2.5M & CC3M) provided by VIOLET to ./_datasets. The data structure should follow the hierarchy below.
    ${REPO_DIR}  
    |-- _models 
    |-- _args 
    |-- _datasets
    |   |-- txt_webvid2.5.json
    |   |-- webvid2.5_val.tsv
    |   |-- webvid2.5_val.lineidx
    |   |-- webvid2.5_train_1.tsv
    |   |-- webvid2.5_train_1.lineidx
    |   |-- ...
    |   |-- webvid2.5_train_9.tsv
    |   |-- webvid2.5_train_9.lineidx
    |   |-- txt_cc3m.json
    |   |-- cc3m_val.tsv
    |   |-- cc3m_val.lineidx
    |   |-- cc3m_train_1.tsv
    |   |-- cc3m_train_1.lineidx
    |   |-- ...
    |   |-- cc3m_train_9.tsv
    |   |-- cc3m_train_9.lineidx
    |-- _imgs 
    |-- ... 
    |-- ... 
    
  • Pretrain via single-node, multi-GPU distributed training.

Task-specific Baseline: Pre-training with Video-Text Matching (VTM) + MLM

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot
  • Pretrained checkpoint on WebVid2.5M+CC3M: link

LAVENDER: Unified Pre-training with VTM as MLM + MLM

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot
  • Pretrained checkpoint on WebVid2.5M+CC3M: link
  • Scale-up pre-trained checkpoint with 14M videos + 16M images: link

Downstream

Download downstream datasets to ./_datasets.

Multiple-Choice Question Answering

  • TGIF-Action
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qamc_mlm_gen_ans_idx.py --config _args/args_tgif-action.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • TGIF-Transition
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qamc_mlm_gen_ans_idx.py --config _args/args_tgif-transition.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • MSRVTT-MC
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retmc_mlm_head.py --config _args/args_msrvtt-mc.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • LSMDC-MC
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retmc_mlm_head.py --config _args/args_lsmdc-mc.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    

For the task-specific baseline, update the main script to main_qamc_task_specific.py or main_retmc_task_specific.py, and point --path_ckpt to the pre-trained task-specific baseline checkpoint.

Open-Ended Question Answering

  • TGIF-Frame
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_tgif-frame.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • MSRVTT-QA
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_msrvtt-qa.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • MSVD-QA
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_msvd-qa.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • LSMDC-FiB
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm_lsmdc_fib.py --config _args/args_lsmdc-fib.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    

For the task-specific baseline, update the main script to main_qaoe_task_specific.py, and point --path_ckpt to the pre-trained task-specific baseline checkpoint.
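
For context, the two variants differ roughly as sketched below (hypothetical code and sizes, not the actual scripts): the task-specific baseline attaches a dedicated classifier over a fixed answer vocabulary, while the MLM variant reuses the shared MLM head and reads the answer off the [MASK] position.

# Rough illustration of the two head types (hypothetical sizes).
import torch
import torch.nn as nn

HIDDEN, VOCAB_SIZE, NUM_ANSWERS = 768, 30522, 1500
feat = torch.randn(4, HIDDEN)                 # stand-in for encoder features at the answer position

# Task-specific baseline (main_qaoe_task_specific.py): per-task classifier over a fixed answer set.
qa_head = nn.Linear(HIDDEN, NUM_ANSWERS)
answer_idx = qa_head(feat).argmax(-1)         # index into the task's answer list

# LAVENDER (main_qaoe_mlm.py): shared MLM head over the full text vocabulary.
mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)
answer_token = mlm_head(feat).argmax(-1)      # token id in the tokenizer vocabulary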

Text-to-Video Retrieval

  • MSRVTT
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_msrvtt-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_msrvtt-retrieval.json --path_ckpt <path to the finetuned msrvtt-retrieval model ckpt>
    
  • DiDeMo
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_didemo-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_didemo-retrieval.json --path_ckpt <path to the finetuned didemo-retrieval model ckpt>
    
  • MSVD
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_msvd-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_msvd-retrieval.json --path_ckpt <path to the finetuned msvd-retrieval model ckpt>
    
  • LSMDC
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_ckpt <path to the finetuned lsmdc-retrieval model ckpt>
    

For the task-specific baseline, update the main scripts to main_retrieval_task_specific.py and eval_retrieval_task_specific.py, and point --path_ckpt to the task-specific checkpoints.

Video Captioning (MSRVTT, MSVD)

Finetuning on video captioning requires additional environment and dataset setup. We closely follow the instructions from SwinBERT; please check their repo for more details.

Note that the data folder should have the following structure:

${REPO_DIR}  
    |-- _datasets  
    |   |-- MSRVTT-v2  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |-- MSVD  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |-- ... 
    |-- ... 

Once the Docker environment and the datasets have been set up correctly, run the following commands for training.

  • MSRVTT
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_caption.py --config _args/args_msrvtt-cap.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • MSVD
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_caption.py --config _args/args_msvd-cap.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    

For the task-specific baseline, simply update --path_ckpt to the task-specific pre-trained weights.

Multi-task Training

Data Filtering

As mentioned in our paper, the testing splits of the above tasks may overlap with the training data of other tasks. We therefore first perform a data filtering step to remove each task's testing data from the training data of the other tasks.

python _tools/multi_task_vid_filter.py --dataset lsmdc

python _tools/multi_task_vid_filter.py --dataset msrvtt

python _tools/multi_task_vid_filter.py --dataset msvd 

python _tools/multi_task_vid_filter.py --dataset tgif
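
Conceptually, the filter collects the test-split video IDs of each task and drops any training example from another task whose video appears in that set. The snippet below is a simplified sketch of this idea, not the actual _tools/multi_task_vid_filter.py, and the 'video_id' field is an assumed schema.

# Simplified sketch of the filtering idea (assumed 'video_id' schema).
def filter_train_split(train_examples, test_video_ids):
    banned = set(test_video_ids)
    return [ex for ex in train_examples if ex["video_id"] not in banned]

# Toy usage:
train = [{"video_id": "video1", "question": "what is shown?"},
         {"video_id": "video2", "question": "who is speaking?"}]
print(filter_train_split(train, {"video2"}))   # only the video1 example remains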

Training

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_multi_task_mlm.py --config _args/args_multi-task_all.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

For the task-specific baseline, update the main script to main_multi_task_multi_head.py and point --path_ckpt to the task-specific pre-trained weights.

Citation

If you find this code useful, please consider citing the following papers:

@inproceedings{li2023lavender, 
  author = {Linjie Li and Zhe Gan and Kevin Lin and Chung-Ching Lin and Ce Liu and Zicheng Liu and Lijuan Wang}, 
  title = {LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023} 
}
@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}

License

Our research code is released under the MIT license.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


lavender's Issues

What is the dependency version?

Hi, could you give the original dependency versions? I ran into some bugs when initializing DeepSpeed:


raise RuntimeError("The given group does not exist")

Question about image-text dataset

Thanks for your great work!
I noticed that you also use image-text datasets for pre-training. How do you use these datasets to pre-train the video backbone?
Do you also use curriculum learning as in Frozen? If so, how do you handle the relative position bias and positional encoding of Video Swin?

main_caption.py

To reproduce the captioning results in the paper, can I directly execute main_caption.py, or are any other steps required?

What makes LAVENDER-TS outperform VIOLET?

In Section 4.5, you mentioned that:

VIOLET [16] adopts the same model architecture and finetuning objectives as our task-specific baseline LAVENDER-TS.

However, LAVENDER-TS in your paper already outperforms VIOLET. From Tables 2, 5, and 6 we know:

             TGIF-Action  MSVD-QA  DiDeMo-Avg.  MSRVTT-Cap.
LAVENDER-TS  94.5         46.7     59.0         55.7
VIOLET       92.5         -        56.7         -

I am not challenging the contribution of the unified framework, since it has been proven by Table 2. I just wonder what technical improvements you've made, because VIOLET used much more data than LAVENDER-TS.

How to test LAVENDER on a single video?

Hi, I wasn't able to follow the instructions (it seems they only cover evaluation of the finetuned downstream task-specific models on large, pre-defined datasets?).

How can I test the model on a single video?
