
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Home Page: https://arxiv.org/abs/2203.12602

License: Other

Python 92.38% Shell 7.62%
self-supervised-learning action-recognition video-understanding masked-autoencoder transformer vision-transformer video-transformer mae pytorch video-representation-learning video-analysis neurips-2022

videomae's Introduction

Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).

VideoMAE Framework

License: CC BY-NC 4.0
Hugging Face Models | Hugging Face Spaces | Colab

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab

📰 News

[2023.4.18] 🎈Everyone can download Kinetics-400, which is used in VideoMAE, from this link.
[2023.4.18] Code and pre-trained models of VideoMAE V2 have been released! Check and enjoy this repo!
[2023.4.17] We propose EVAD, an end-to-end Video Action Detection framework.
[2023.2.28] Our VideoMAE V2 is accepted by CVPR 2023! 🎉
[2023.1.16] Code and pre-trained models for Action Detection in VideoMAE are available!
[2022.12.27] 🎈Everyone can download extracted VideoMAE features of THUMOS, ActivityNet, HACS and FineAction from InternVideo.
[2022.11.20] 👀 VideoMAE is integrated into Hugging Face Spaces and Colab, supported by @Sayak Paul.
[2022.10.25] 👀 VideoMAE is integrated into MMAction2, and the results on Kinetics-400 can be reproduced successfully.
[2022.10.20] The pre-trained models and scripts of ViT-S and ViT-H are available!
[2022.10.19] The pre-trained models and scripts on UCF101 are available!
[2022.9.15] VideoMAE is accepted by NeurIPS 2022 as a spotlight presentation! 🎉
[2022.8.8] 👀 VideoMAE is integrated into official 🤗HuggingFace Transformers now! Hugging Face Models
[2022.7.7] We have added new results on the downstream AVA 2.2 benchmark. Please refer to our paper for details.
[2022.4.24] Code and pre-trained models are available now!
[2022.3.24] Code and pre-trained models will be released here. Feel free to watch this repository for the latest updates.

✨ Highlights

🔥 Masked Video Modeling for Video Pre-Training

VideoMAE performs the task of masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
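Concretely, tube masking samples one spatial mask per clip and shares it across every temporal cube, so a patch location is hidden in all frames at once. Below is a minimal NumPy sketch of this idea (an illustration only, not the repository's exact masking generator):

    import numpy as np

    def tube_mask(num_temporal_cubes, num_spatial_patches, mask_ratio=0.9, rng=None):
        """Sample one spatial mask and repeat it along time ("tube" masking).
        Returns a boolean array where True marks a masked token."""
        rng = rng or np.random.default_rng()
        num_masked = int(mask_ratio * num_spatial_patches)
        spatial_mask = np.concatenate([
            np.zeros(num_spatial_patches - num_masked, dtype=bool),
            np.ones(num_masked, dtype=bool),
        ])
        rng.shuffle(spatial_mask)
        return np.tile(spatial_mask, num_temporal_cubes)  # same mask for every temporal cube

    # Example: a 16x224x224 clip with 2x16x16 cubes -> 8 temporal cubes and 14x14 spatial patches.
    mask = tube_mask(8, 14 * 14, mask_ratio=0.9)
    print(mask.shape, mask.mean())  # (1568,) ~0.9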

⚡️ A Simple, Efficient and Strong Baseline in SSVP

VideoMAE uses a simple masked autoencoder with a plain ViT backbone to perform video self-supervised learning. Thanks to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.

😮 High performance, but NO extra data required

VideoMAE works well on video datasets of different scales, achieving 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To the best of our knowledge, VideoMAE is the first to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones, without any extra data or pre-trained models.

🚀 Main Results

✨ Something-Something V2

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 70.8 | 92.4 |
| VideoMAE | no | ViT-L | 224x224 | 16x2x3 | 74.3 | 94.6 |
| VideoMAE | no | ViT-L | 224x224 | 32x1x3 | 75.4 | 95.2 |

✨ Kinetics-400

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 224x224 | 16x5x3 | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 224x224 | 16x5x3 | 86.6 | 97.1 |
| VideoMAE | no | ViT-L | 320x320 | 32x4x3 | 86.1 | 97.3 |
| VideoMAE | no | ViT-H | 320x320 | 32x4x3 | 87.4 | 97.6 |

✨ AVA 2.2

Please check the code and checkpoints in VideoMAE-Action-Detection.

| Method | Extra Data | Extra Label | Backbone | #Frame x Sample Rate | mAP |
| :-- | :-- | :-- | :-- | :-- | :-- |
| VideoMAE | Kinetics-400 | ✗ | ViT-S | 16x4 | 22.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-S | 16x4 | 28.4 |
| VideoMAE | Kinetics-400 | ✗ | ViT-B | 16x4 | 26.7 |
| VideoMAE | Kinetics-400 | ✓ | ViT-B | 16x4 | 31.8 |
| VideoMAE | Kinetics-400 | ✗ | ViT-L | 16x4 | 34.3 |
| VideoMAE | Kinetics-400 | ✓ | ViT-L | 16x4 | 37.0 |
| VideoMAE | Kinetics-400 | ✗ | ViT-H | 16x4 | 36.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-H | 16x4 | 39.5 |
| VideoMAE | Kinetics-700 | ✗ | ViT-L | 16x4 | 36.1 |
| VideoMAE | Kinetics-700 | ✓ | ViT-L | 16x4 | 39.3 |

✨ UCF101 & HMDB51

| Method | Extra Data | Backbone | UCF101 | HMDB51 |
| :-- | :-- | :-- | :-- | :-- |
| VideoMAE | no | ViT-B | 91.3 | 62.6 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |

🔨 Installation

Please follow the instructions in INSTALL.md.

➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

🔄 Pre-training

The pre-training instruction is in PRETRAIN.md.

⤴️ Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE.md.

📍Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

👀 Visualization

We provide the script for visualization in vis.sh. Colab notebook for better visualization is coming soon.

☎️ Contact

Zhan Tong: [email protected]

👍 Acknowledgements

Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.

🔒 License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.

✏️ Citation

If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}

videomae's People

Contributors

congee524, separius, wanglimin, yztongzhan


videomae's Issues

Accuracy calculation in test has redundant instances

Under the Finetune section, it's written that during testing you consider multiple segments and multiple crops:

    --test_num_segment 2 \
    --test_num_crop 3 \

But while calculating accuracy, we don't aggregate the scores over all these segments/crops:
https://github.com/MCG-NJU/VideoMAE/blob/main/engine_for_finetuning.py#L180
https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/metrics.py#L25

acc1, acc5 = accuracy(output, target, topk=(1, 5))
Instead, while calculating accuracy, we should have done something like this:

    import pandas as pd
    scores = pd.DataFrame({"id": ids, "label": labels, "output": outputs})
    scores = scores.groupby(["id", "label"]).aggregate({"output": "max"})

Adding VideoMAE to HuggingFace Transformers

Hi VideoMAE team :)

I've implemented VideoMAE as a fork of 🤗 HuggingFace Transformers, and I'm going to add it soon to the library (see huggingface/transformers#17821). Here's a notebook that illustrates inference with it: https://colab.research.google.com/drive/1ZX_XnM0ol81FbcxrFS3nNLkmn-0fzvQk?usp=sharing
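For readers who want to try that integration, here is a minimal, hedged sketch of inference with the Transformers classes (assuming a transformers version that ships VideoMAEImageProcessor; the checkpoint name and the random dummy clip are placeholders to be replaced with a real checkpoint and decoded frames):

    import numpy as np
    import torch
    from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

    ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"  # placeholder checkpoint name
    processor = VideoMAEImageProcessor.from_pretrained(ckpt)
    model = VideoMAEForVideoClassification.from_pretrained(ckpt)

    # Dummy clip: 16 RGB frames of 224x224 (replace with real decoded frames).
    video = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

    inputs = processor(video, return_tensors="pt")  # pixel_values of shape (1, 16, 3, 224, 224)
    with torch.no_grad():
        logits = model(**inputs).logits
    print(model.config.id2label[logits.argmax(-1).item()])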

The reason I'm adding VideoMAE is because I really like the simplicity of it, it was literally a single line of code change from ViT (nn.Conv2d -> nn.Conv3d).

As you may or may not know, any model on the Hugging Face Hub has its own Git repository. E.g. the VideoMAE-base checkpoint fine-tuned on Kinetics-400 can be found here: https://huggingface.co/nielsr/videomae-base. If you check the "files and versions" tab, it includes the weights. The model hub uses Git LFS (Large File Storage) to handle large files such as model weights. This means that any model has its own Git commit history!

A model card can also be added to the repo, which is just a README.

Are you interested in creating an organization on the hub, such that we can store all model checkpoints there (rather than under my user name)?

Let me know!

Kind regards,

Niels
ML Engineer @ HuggingFace

Request of a best pretrained checkpoint for downstream tasks

Thanks for your excellent work. With only self-supervised learning, VideoMAE performs surprisingly well. Besides that, I think (self-supervised + fully supervised) models are also very meaningful, especially for downstream tasks/datasets.

@yztongzhan Could you please provide a model/checkpoint that is the best pre-trained model you can offer for learning video representations for unknown downstream video datasets?

For example, a model pre-trained on Kinetics-700 with VideoMAE in a self-supervised manner, and then trained on Kinetics-700 with supervision.

Such a model should be very useful. One could easily fine-tune it on any downstream dataset for many possible purposes, just like a ResNet-50 pre-trained on ImageNet.

MoCoV3 weights

Hi,
First of all, thank you very much for your clean code and impressive results.
Second, is it possible for you to share the weights of the pretrained MoCoV3 baseline?

Best regards,
Sepehr

pretrained models on ImageNet1K

Hi,

Thank you for sharing the exciting work!

Could you please release the pre-trained models on ImageNet-1K (with input size 224x224 or 320x320)?

Many many thanks!

How to preprocess?

Hello.
First, thank you for this repository.

I'm trying to preprocess.

I already looked at INSTALL.md and completed the installation.
Now I am looking at DATASET.md and trying to follow it.
However, I already have Kinetics-400 downloaded, so I just need to pre-process the data.

i) Download the dataset from the official website.

ii) Preprocess the dataset by resizing the short edges of the video to 320px. You can refer to the MMAction2 data benchmark for TSN and SlowOnly.

iii) Create the annotations required by the dataloader ("<path_to_video> <video_class>" in the annotations). The annotations usually include train.csv, val.csv, and test.csv (here val.csv is equivalent to test.csv). The format of the *.csv file is:

I'm at the second step you mentioned, and I'm not sure how to do it. How do I pre-process?

Thank you for reading!
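A hedged sketch of steps ii) and iii) from the quoted instructions, assuming ffmpeg is installed and using hypothetical paths and a placeholder label: it resizes the short side of every video to 320px while keeping the aspect ratio and writes annotation lines in the "<path_to_video> <video_class>" format.

    import os
    import subprocess

    SRC, DST = "kinetics400/train_raw", "kinetics400/train_320"  # hypothetical paths
    os.makedirs(DST, exist_ok=True)

    # Resize the short side to 320 px, keeping the aspect ratio
    # (-2 keeps the other side even, as most codecs require).
    SCALE = "scale='if(gt(iw,ih),-2,320)':'if(gt(iw,ih),320,-2)'"

    lines = []
    for name in sorted(os.listdir(SRC)):
        if not name.endswith(".mp4"):
            continue
        dst_path = os.path.join(DST, name)
        subprocess.run(
            ["ffmpeg", "-y", "-i", os.path.join(SRC, name), "-vf", SCALE, "-c:a", "copy", dst_path],
            check=True,
        )
        lines.append(f"{dst_path} 0")  # "<path_to_video> <video_class>"; 0 is a placeholder label

    with open("train.csv", "w") as f:
        f.write("\n".join(lines) + "\n")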

release csv label files for ssv2

Hi, congratulate on your great work!
Could you release label files (train.csv val.csv test.csv) for ssv2?
I tried to generate these files following the guidance in DATASET.md and used them for evaluation on SSv2 by running run_class_finetuning.py with --eval, but the accuracy is abnormal (too low), and I guess maybe the csv files I generated were wrong.
Could you release the csv files of K400 and ssv2 for everyone to download? Thanks!

Training recipe for AVA

Thanks for the excellent work!

Could you provide the training configuration for the AVA dataset? I tried to reproduce the results but observed overfitting.

Linear probe experiment

Is there a script or details available for linear probing? The original MAE paper suggests there's a substantial difference between the setups for fine-tuning and linear probing of MAE models, and I wanted to be able to reproduce the results.

Thanks!
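There is no official linear-probe recipe quoted in this thread, but the general PyTorch pattern (freeze everything except the classification head, then optimize only the head) can be sketched as follows; the tiny stand-in module below is hypothetical and would be replaced by the actual fine-tuning model:

    import torch
    import torch.nn as nn

    class TinyBackboneWithHead(nn.Module):
        """Stand-in for a backbone plus classification `head` (placeholder, not the real ViT)."""
        def __init__(self, embed_dim=768, num_classes=400):
            super().__init__()
            self.backbone = nn.Linear(embed_dim, embed_dim)  # placeholder for the encoder
            self.fc_norm = nn.LayerNorm(embed_dim)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, x):
            return self.head(self.fc_norm(self.backbone(x)))

    model = TinyBackboneWithHead()

    # Linear probing: freeze everything except the classification head.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("head")

    optimizer = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad],
        lr=0.1, momentum=0.9, weight_decay=0.0,
    )
    print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")

Note that the MAE paper's linear-probe protocol differs from plain fine-tuning (e.g. optimizer and augmentation choices), so the hyperparameters above are illustrative only.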

Preprocess for ssv2 dataset

Hello! I read your DATASET.md for SSv2, but I have no idea how to run data_clean.py on the original SSv2 videos. Why are .txt files used in data_clean.py? Could you kindly give one or two instructions for running data_clean.py on SSv2?

Does batch-size matter?

Hi, thank you for the nice work.

I have a question.
Does batch-size matter when I train a model?
What if someone only has two 24Gb GPUs, then what is the good choice of batch-size for the case?
Can I apply the equation appeared in Appendix B to compute the learning rate for small batch-size, such as 2 or 4, even 1?
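For reference, a small sketch of that linear scaling rule (the effective learning rate is the base rate scaled by total batch size over 256, where the total batch size counts all GPUs and repeated samples); whether it still behaves well for very small batches such as 1-4 is an empirical question:

    # Linear learning-rate scaling: lr = base_lr * total_batch_size / 256
    base_lr = 1.5e-4                 # e.g. the base value quoted elsewhere in these issues
    batch_per_gpu, num_gpus = 4, 2   # hypothetical two-GPU setup
    total_batch = batch_per_gpu * num_gpus
    lr = base_lr * total_batch / 256
    print(total_batch, lr)           # 8 -> 4.6875e-06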

Preprocessing for UCF101

Hi, I was wondering if there are any available resources on the preprocessing for UCF101 similar to the given data_clean.py script for SSV2?

Can I use fewer GPUs for pre-training?

Hi, thank you for sharing this amazing work.

I know you have used 64 GPUs spread across 8 nodes for pre-training. However, I don't have that many GPU resources, so I wonder whether I could pre-train this model on 8 GPUs or fewer.

Would there be a large performance decrease?

Code for demo/inference

Thanks for the great work! I am wondering whether the demo/inference code has been released so that we can quickly test out the model on custom videos?

reproduce performance on K400 without using deepspeed

Hi,
Thanks for your work!
When I reproduce the fine-tuning experiments on K400 without the DeepSpeed backend (using your released pre-trained checkpoint), I only get 78.0 with ViT-B, which is 2.7% lower than the number (80.7) stated in the paper. Do you know the reason for that? Is the different backend the main reason?
Thanks!

Code release

When will the code for training and testing be released?

tensorRT

Thank you very much for your code and contributions. Is it possible to use TensorRT to speed it up?

inference

How do I use a pre-trained model to complete my inference?

loss_scale_value has no key "scale"

Hi author:
Thanks for your impressive work. When I run run_mae_pretraining.py, I get an error at engine_for_pretraining.py, line 79, in train_one_epoch:
loss_scale_value = loss_scaler.state_dict()["scale"], as shown below. P.S. I used the pre-trained SSv2 checkpoint.
(screenshot of the full traceback attached in the original issue)

about inference speed?

Thanks for the great work! But I have one question that was not discussed in the paper, which is the inference speed.

I understand that during training the speed is fast, since you mask out 90% of the patches. However, during test time, I assume you retain all the patches. Since the attention in ViT is quadratic in cost, does it mean you will need 10*10 = 100 times the FLOPs compared to the pre-training phase?

If that is the case, how much time does it take to do a single forward pass?
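A rough back-of-the-envelope version of the arithmetic in the question (self-attention cost grows quadratically with the token count, while the MLP and projection layers grow only linearly):

    # 1568 = 8 x 14 x 14 tokens for a 16-frame 224x224 clip; roughly 10% are kept during pre-training.
    n_full = 8 * 14 * 14
    n_kept = int(0.1 * n_full)
    print(n_full**2 / n_kept**2)  # ~100x more attention FLOPs at full-token inference
    print(n_full / n_kept)        # ~10x more FLOPs for the linear (MLP/projection) layers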

weights from pretrained model "decoder.blocks.0 ... decoder.blocks.1 ..." not used in finetune training?

Hi Author:
I am studying your paper and trained a MAE on the Kinetics dataset, and I have two questions; I would appreciate it if you could give feedback.
1. I run run_class_finetuning.py from a pre-trained model, and the warning shown below appears. Is this correct?
2. Fine-tuning shows GPU memory usage of 18 GB with a batch size of 2. Could I freeze the forward_features part and only train the classification head?

(screenshot of the warning attached in the original issue)

The accuracy I get is low

I trained the model on UCF101 (100 epochs of pre-training, 100 epochs of fine-tuning), and the accuracy is 67-68, but your accuracy is above 90. Is it related to the parameter settings? Can you tell me the parameter settings?

Original image leaking during reconstruction

Hi,
thanks for sharing this code.

I have one question: you use the nn.Conv3d operation at L145 in the PatchEmbed class to implement the patching process, which produces the patches along the temporal and spatial dimensions. After this operation, you use x_vis = x[~mask].reshape(B, -1, C) # ~mask means visible to get the unmasked patches.

However, I am curious whether this learnable nn.Conv3d module could extract features from the originally masked regions, resulting in information leaking from the masked patches, since the receptive field of this 3D convolution may cover the masked patches in the original frames.

I hope you can resolve my confusion.

About preparing SthV2

Hi, Thank you for your work!

I read your 'DATASET.md'.

Are there two key points in processing the SthV2 dataset: first, change the suffix to .mp4, and second, resize the short side to 320px? (And is it only videos with an original height of 240p that are selected and then resized?)

Pretraining time

Hi, thanks for your solid work!

Would you mind sharing some pre-training time statistics? For example, how long does it take to pre-train on Kinetics-400 using 64 V100s?

The learning rate for ssv2 dataset

Hi, I have tried to reproduce the VideoMAE performance on the SSv2 dataset. I run the experiments on four A100 machines (each with eight 80 GB GPUs) and set --nnodes=4 and batch_size 64, so that the total batch size is the same. However, the performance is not consistent with the reported one. I checked the log you provided and noticed that the learning rate is different: it seems that your learning rate is not 1.2e-3 (1.5e-4 * 2048 / 256) after the warm-up stage. Thanks a lot; I am looking forward to your reply.

RuntimeError: Unknown model (pretrain_videomae_base_patch16_224)

Hello, thanks for your sharing again. I have downloaded the kinetic_400_vitl_epoch_1600 pre-trained weights and am trying to do visualization on videos. The vis.sh I run is the same as the official one except for the path. However, timm raises the error "RuntimeError: Unknown model (pretrain_videomae_base_patch16_224)" while loading the model.

I tried to modify pretrain_videomae_base_patch16_224 to vit_base_patch16_224, but then another error was raised: "TypeError: __init__() got an unexpected keyword argument 'decoder_depth'". Subsequently, I commented out the keyword "decoder_depth" in line 80 of run_videomae_viz.py, but another error occurred: "AttributeError: 'VisionTransformer' object has no attribute 'encoder'".

I would really appreciate it if you could help. Thanks.
(screenshots of the errors attached in the original issue)

preprocess for ssv2

Could you release the preprocess code for ssv2?
That will help me a lot.

About the batch_size

Hi,
Thank you for the awesome work.
I have a question about the batch size. To be specific, when pre-training ViT-B on K400, the script sets the batch_size to 32, which means 32 videos per GPU. If one video clip consists of 16 frames (as set), one GPU will need to process 32*16 = 512 frames.
Is this right? Or do I misunderstand something?
Another question: your paper reports the batch_size as 1024 when pre-training ViT-B, which is inconsistent with your scripts here.

Resume training from checkpoint

Hi! I fine-tuned the model on my dataset, and I'd like to resume from a saved .pt checkpoint. But when I start fine-tuning, it always begins from epoch 0.
My finetune.sh:

# Set the path to save checkpoints

OUTPUT_DIR='/home/jovyan/people/Murtazin/VideoMAE/output_ckpts/eval_lr_1e-3_epoch_55'
# path to Kinetics set (train.csv/val.csv/test.csv)
DATA_PATH='/home/jovyan/datasets/sign_language/WLASL/WLASL_kinetic_hardcode'
# path to pretrain model
MODEL_PATH='/home/jovyan/people/Murtazin/VideoMAE/ckpts/checkpoint.pth' 
PT_PATH='/home/jovyan/people/Murtazin/VideoMAE/output_ckpts/eval_lr_1e-3_epoch_100/checkpoint-45/mp_rank_00_model_states.pt' 

# batch_size can be adjusted according to number of GPUs
# this script is for 64 GPUs (8 nodes x 8 GPUs)
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 \
    run_class_finetuning.py \
    --model vit_large_patch16_224 \
    --data_set WLASL \
    --nb_classes 2000 \
    --data_path ${DATA_PATH} \
    --resume ${PT_PATH} \
    --log_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 2 \
    --num_sample 2 \
    --input_size 224 \
    --short_side_size 224 \
    --save_ckpt_freq 10 \
    --num_frames 32 \
    --sampling_rate 2 \
    --opt adamw \
    --lr 2e-3 \
    --opt_betas 0.9 0.999 \
    --weight_decay 0.05 \
    --epochs 55 \
    --dist_eval \
    --test_num_segment 5 \
    --test_num_crop 3 \
    --enable_deepspeed

Pre-training and fine-tuning model/scripts on UCF101 or HMDB51

Hi, thanks for sharing your work, it's a brilliant job!

Even though you mentioned training and fine-tuning configurations on UCF101 and HMDB51 in the appendix of the paper, I would appreciate it if you could provide pre-trained models for these two benchmarks, or provide scripts for training and fine-tuning.

Thanks.

Pretraining VideoMAE on HMDB51

Hi Zhan,

Thank you for your excellent work! We are surprised by VideoMAE's data efficiency (paper section 4.3), especially on small datasets like HMDB51. We are trying to use your code to reproduce your experiment on HMDB51. However, we cannot achieve the same fine-tuning accuracy as reported in Table 2 (61.1%):

(screenshot of our results attached in the original issue)

Our experiments show that the model has converged after 2400 pre-training epochs. We are using eight Tesla V100 32GB GPUs. Also, we changed the batch size, learning rate, and temporal stride as described in the appendix.

I wonder if you could also share your complete experiment configurations for UCF101 and HMDB51? Settings such as the number of warmup epochs may also be critical for reproduction.

Pretraining time for UCF101

Hello author,

I saw you mentioned that 8 GPUs were used for pre-training on the 9.5k UCF101 video clips. May I ask what the total time is for pre-training on UCF101 for 3200 epochs with batch size 192 (the setting in your paper)?

Appreciate it!

from petrel_client.client import Client

Hi,

I am trying to use data_clean.py for Kinetics 400 but am unable to use petrel_client from the import statement:
from petrel_client.client import Client
I have installed petrel, but that has not worked, and I have not been able to find a petrel_client package online.

How is the Classification Head added onto the encoder?

Looking into the code, I can find a linear layer on top of the pre-trained encoder that looks as follows (for Kinetics-400):

      (11): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=False)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath(p=0.10000000149011612)
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (norm): Identity()
    (fc_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
    (head): Linear(in_features=768, out_features=400, bias=True)

What I don't understand is how exactly this layer is connected to the encoder. The encoder should deliver N (sequence length) tokens with d dimensions (768 in this case). Did you employ a CLS token (I don't think so), do you take the mean over the N tokens, or how exactly do you connect the encoder to the classification head? I cannot find any information on that in your paper; I might be wrong though, sorry in that case :)
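Based only on the modules printed above (norm is an Identity, fc_norm is a LayerNorm before the head), a hedged sketch of the usual wiring when mean pooling is used is shown below; this is an assumption to be confirmed against modeling_finetune.py, not a quote of the repository code:

    import torch
    import torch.nn as nn

    embed_dim, num_classes, num_tokens = 768, 400, 8 * 14 * 14
    fc_norm = nn.LayerNorm(embed_dim)
    head = nn.Linear(embed_dim, num_classes)

    tokens = torch.randn(2, num_tokens, embed_dim)  # encoder output (B, N, C), no CLS token
    logits = head(fc_norm(tokens.mean(dim=1)))      # mean over the N tokens, then norm + linear
    print(logits.shape)                             # torch.Size([2, 400])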

Mixup and Cutmix

Hello,

Thank you for sharing your work. I am looking at the code for fine-tuning the model and trying to understand how mixup and cutmix are applied to videos. The train dataloader seems to provide a batch of shape (B, C, T, H, W). However, the mixup function from timm requires a batch of shape (B, C, H, W). I couldn't find the code for reshaping the batch before sending it to the mixup function. Am I missing something? Should we reshape the batch from (B, C, T, H, W) to (B, C*T, H, W) or (B*T, C, H, W)? Which is the correct way?

Thank you.

Bias in "Attention" layer.

In modeling_finetune.py, line 82, why do you set the bias of k to 0 (requires_grad=False), which means the model only learns the bias parameters of q and v? The official timm code for images learns all three biases of q, k, and v. Is there something special about videos?

BTW, why should the qkv be computed as in lines 84-85 instead of line 83, which is commented out? Thanks!

def forward(self, x):
    B, N, C = x.shape
    qkv_bias = None
    if self.q_bias is not None:
        qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
    # qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
    qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
    qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)

    q = q * self.scale
    attn = (q @ k.transpose(-2, -1))

    
    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)

    x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
    x = self.proj(x)
    x = self.proj_drop(x)
    return x

Missing Comparison with MaskFeat

Good job! After reading your VideoMAE, I feel this work should be compared with MaskFeat [https://arxiv.org/abs/2112.09133] in terms of both method and performance. MaskFeat, a work concurrent with MAE (ImageMAE), also performs masked modeling for the video classification task. The main difference between MaskFeat and VideoMAE is that the reconstruction target is different (HOG features vs. RGB pixels). There is no conflict of interest; this is just my personal opinion.

Downstream training loss and accuracy are nearly constants

Hello, thanks for your effort in coding VideoMAE. I followed the installation guide completely, then selected two categories from UCF-101 for downstream training (video binary classification). The data is preprocessed, including resizing the short edge to 240px or 320px, and saved in .mp4 format. The train.csv, val.csv, and test.csv files are all prepared as well. However, why are the training loss/accuracy nearly constant at 0.69/0.5 throughout training?

The dataset I specified is UCF-101 and nb_classes is set to 2. The only other thing I modified from the official script is the learning rate, since I am working in a single-GPU, non-distributed environment. I have tried multiple learning-rate settings, from 64x to 0.015x, but all of them lead to constant losses. Could you please help me? Thank you a lot.

(screenshot of the training logs attached in the original issue)

Add the length of the video when preparing datasets

According to the data-loading code, it seems we should add the length of each video when preparing datasets. See https://github.com/MCG-NJU/VideoMAE/blob/main/DATASET.md for details. The generated annotation for video datasets should look like:
dataset_root/video_1.mp4 100 label_1, where 100 means the length of video_1.mp4.
