
CAST: Cross-Attention in Space and Time for Video Action Recognition [NeurIPS 2023] [Project Page] [arXiv]

[Figure: CAST framework overview]


🔧 Installation

We conduct all experiments with 16 NVIDIA GeForce RTX 3090 GPUs. First, install PyTorch 1.10.0 and torchvision 0.11.0:

conda create -n vmae_1.10  python=3.8 ipykernel -y
conda activate vmae_1.10
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 -c pytorch

Then, install timm, triton, DeepSpeed, and others.

pip install triton==1.0.0
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
git checkout 3a3dfe66bb
DS_BUILD_OPS=1 pip install . --global-option="build_ext"
pip install TensorboardX decord einops scipy pandas requests
ds_report

If DeepSpeed is installed correctly, running the 'ds_report' command should produce output like the following. For other DeepSpeed-related issues, please refer to the DeepSpeed GitHub page.

[Screenshot: ds_report output]
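As an additional sanity check after installation, this one-liner is a minimal sketch that only assumes the standard import names of the packages installed above:

python -c "import torch, torchvision, deepspeed; print(torch.__version__, torchvision.__version__, deepspeed.__version__)"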

📝 Data Preparation

EPIC-KITCHENS-100

  • The pre-processing of EPIC-KITCHENS-100 can be summarized into 4 steps:

    1. Download the dataset from the official website.

    2. Preprocess the dataset by resizing the short edge of each video to 256px (see the ffmpeg sketch after this list). You can refer to the MMAction2 Data Benchmark.

    3. Generate the annotations needed by the dataloader ("<video_id>,<verb_class>,<noun_class>" per line). The annotations usually include train.csv and val.csv. The format of each *.csv file is:

      video_1,verb_1,noun_1
      video_2,verb_2,noun_2
      video_3,verb_3,noun_3
      ...
      video_N,verb_N,noun_N
      
    4. Place all video files inside DATA_PATH.
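A minimal sketch of the short-edge resize in step 2, assuming ffmpeg is available and the RAW_DIR/OUT_DIR variables (hypothetical names) point at the source and destination folders:

RAW_DIR=YOUR_PATH
OUT_DIR=YOUR_PATH

# Resize the short edge to 256px while keeping the aspect ratio;
# -2 lets ffmpeg choose the other dimension, rounded to an even number.
for f in "$RAW_DIR"/*.mp4; do
  ffmpeg -i "$f" \
    -vf "scale='if(gt(iw,ih),-2,256)':'if(gt(iw,ih),256,-2)'" \
    -c:v libx264 -c:a copy "$OUT_DIR/$(basename "$f")"
done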

Something-Something-V2

  • The pre-processing of Something-Something-V2 can be summarized into 4 steps:

    1. Download the dataset from the official website.

    2. Preprocess the dataset by converting the videos from .webm to .mp4 at the original height of 240px (see the ffmpeg sketch after this list). You can refer to the MMAction2 Data Benchmark.

    3. Generate the annotations needed by the dataloader ("<video_id> <video_class>" per line). The annotations usually include train.csv, val.csv and test.csv. The format of each *.csv file is:

      video_1.mp4  label_1
      video_2.mp4  label_2
      video_3.mp4  label_3
      ...
      video_N.mp4  label_N
      
    4. Place all video files inside DATA_PATH.
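A minimal sketch of the .webm-to-.mp4 conversion in step 2, again assuming ffmpeg and hypothetical RAW_DIR/OUT_DIR variables:

RAW_DIR=YOUR_PATH
OUT_DIR=YOUR_PATH

# Re-encode each .webm video to .mp4 at the original 240px height;
# the width follows the aspect ratio, rounded to an even number.
for f in "$RAW_DIR"/*.webm; do
  ffmpeg -i "$f" -vf "scale=-2:240" -c:v libx264 -an \
    "$OUT_DIR/$(basename "${f%.webm}.mp4")"
done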

Kinetics-400

  • The pre-processing of Kinetics-400 can be summarized into 4 steps:

    1. Download the dataset from the official website or OpenDataLab.

    2. Preprocess the dataset by resizing the short edge of each video to 320px. You can refer to the MMAction2 Data Benchmark.

    3. Generate the annotations needed by the dataloader ("<video_id> <video_class>" per line; see the sketch after this list). The annotations usually include train.csv, val.csv and test.csv. The format of each *.csv file is:

      video_1.mp4  label_1
      video_2.mp4  label_2
      video_3.mp4  label_3
      ...
      video_N.mp4  label_N
      
    4. Split all video files into DATA_PATH/train and DATA_PATH/val.
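A minimal sketch of the annotation generation in step 3, assuming a hypothetical mapping file labels.csv with one "video_name,class_index" pair per line:

# Convert "video_name,class_index" pairs into the space-separated
# "<video_id> <video_class>" lines the dataloader expects.
while IFS=, read -r vid cls; do
  printf '%s.mp4 %s\n' "$vid" "$cls"
done < labels.csv > train.csv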

Expert model preparation

We use pre-trained weights for the spatial and temporal experts. The spatial expert (CLIP) uses the official pre-trained weights. The temporal expert (VideoMAE) uses weights pre-trained on three datasets: EK100, K400, and SSV2. Of these, the K400 and SSV2 weights are the official releases, while the EK100 weights are ones we pre-trained ourselves. Put each downloaded expert weight into the VMAE_PATH and CLIP_PATH of the fine-tune script.
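For example, the path setup could look like the following sketch; the file names here are hypothetical and should match whatever the downloaded checkpoints are actually called:

# Hypothetical local layout for the downloaded expert weights.
CLIP_MODEL_PATH=./weights/clip_vit_b16.pth
VMAE_MODEL_PATH=./weights/videomae_b16_ek100_pretrain.pth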

Fine-tuning CAST

We provide the off-the-shelf scripts in the scripts folder.

  • For example, to fine-tune CAST on Kinetics-400 with 16 GPUs (2 nodes x 8 GPUs), use the following script.
DATA_PATH=YOUR_PATH
ANNOTATION_PATH=YOUR_PATH
VMAE_MODEL_PATH=YOUR_PATH
CLIP_MODEL_PATH=YOUR_PATH


OMP_NUM_THREADS=1 python -m torch.distributed.launch \
  --nproc_per_node=8 \
  --master_port ${MASTER_PORT} --nnodes=2 \
  --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} \
  YOUR_PATH/run_bidirection_compo.py \
  --data_set Kinetics-400 \
  --nb_classes 400 \
  --vmae_model compo_bidir_vit_base_patch16_224 \
  --anno_path ${ANNOTATION_PATH} \
  --data_path ${DATA_PATH} \
  --clip_finetune ${CLIP_MODEL_PATH} \
  --vmae_finetune ${VMAE_MODEL_PATH} \
  --log_dir ${YOUR_PATH} \
  --output_dir ${YOUR_PATH} \
  --batch_size 6 \
  --input_size 224 \
  --short_side_size 224 \
  --save_ckpt_freq 25 \
  --num_sample 1 \
  --num_frames 16 \
  --opt adamw \
  --lr 1e-3 \
  --opt_betas 0.9 0.999 \
  --weight_decay 0.05 \
  --epochs 70 \
  --dist_eval \
  --test_num_segment 5 \
  --test_num_crop 3 \
  --num_workers 8 \
  --drop_path 0.2 \
  --layer_decay 0.75 \
  --mixup_switch_prob 0 \
  --mixup_prob 0.5 \
  --reprob 0. \
  --init_scale 1. \
  --update_freq 6 \
  --seed 0 \
  --enable_deepspeed \
  --warmup_epochs 5

Evaluation

Evaluation command for EK100:

python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --composition --eval

Evaluation command for SSV2 and K400:

python ./run_bidirection.py --fine_tune {YOUR_FINETUNED_WEIGHT} --eval

Model Zoo

EPIC-KITCHENS-100

| Method | Spatial Expert | Temporal Expert | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | 50 | 16x2x3 | log/checkpoint | 49.3 |

Something-Something V2

| Method | Spatial Expert | Temporal Expert | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on SSV2) | 50 | 16x2x3 | log/checkpoint | 71.6 |

Kinetics-400

| Method | Spatial Expert | Temporal Expert | Epochs | #Frames x Clips x Crops | Fine-tune | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | 70 | 16x5x3 | log/checkpoint | 85.3 |

Acknowledgements

This project is built upon VideoMAE, MAE, CLIP and BEiT. Thanks to the contributors of these great codebases.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@inproceedings{cast,
  title={CAST: Cross-Attention in Space and Time for Video Action Recognition},
  author={Lee, Dongho and Lee, Jongseo and Choi, Jinwoo},
  booktitle={NeurIPS},
  year={2023}
}
