
videomaev2's Introduction

[CVPR 2023] Official Implementation of VideoMAE V2

[Figure: overall flowchart of VideoMAE V2]


VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao
Nanjing University, Shanghai AI Lab, CAS

News

[2023.05.29] VideoMAE V2-g features for THUMOS14 and FineAction datasets are available at TAD.md now.
[2023.05.11] Testing of our distilled models is now supported in MMAction2 (dev version)! See PR#2460.
[2023.05.11] The feature extraction script for TAD datasets has been released! See instructions at TAD.md.
[2023.04.19] ViT-giant model weights have been released! You can get the download links from MODEL_ZOO.md.
[2023.04.18] Code and the distilled models (vit-s & vit-b) have been released!
[2023.04.03] Code and models will be released soon.

Model Zoo

We provide the model weights in MODEL_ZOO.md. The distilled models are also listed there and summarized in the table below.

| Model | Dataset | Teacher Model | #Frame | K710 Top-1 | K400 Top-1 | K600 Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-small | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 77.6 | 83.7 | 83.1 |
| ViT-base | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 81.5 | 86.6 | 85.9 |

Installation

Please follow the instructions in INSTALL.md.

Data Preparation

Please follow the instructions in DATASET.md for data preparation.

Pre-training

The pre-training instruction is in PRETRAIN.md.

Fine-tuning

The fine-tuning instruction is in FINETUNE.md.

Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@InProceedings{wang2023videomaev2,
    author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
    title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14549-14560}
}

@misc{videomaev2,
      title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
      author={Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
      year={2023},
      eprint={2303.16727},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

videomaev2's People

Contributors: congee524, jerryflymi

videomaev2's Issues

Error when running run_class_finetuning.py

Detected CUDA files, patching ldflags
Emitting ninja build file /home/ravindu.nagasinghe/.cache/torch_extensions/py38_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/THC -isystem /home/ravindu.nagasinghe/.conda/envs/videomae/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++17 -c /home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/THC -isystem /home/ravindu.nagasinghe/.conda/envs/videomae/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++17 -c /home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
ERROR: No supported gcc/g++ host compiler found.
Use 'nvcc -ccbin ' to specify a host compiler.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "run_class_finetuning.py", line 927, in
main(opts, ds_init)
File "run_class_finetuning.py", line 727, in main
model, optimizer, _, _ = ds_init(
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/init.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 303, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1202, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1264, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
return self.jit_load(verbose)
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
op_module = load(name=self.name,
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

Initialize student model's weights

Hello! Thank you for the great work!

In the paper, page 11, there is a sentence:

Distillation results.
Using the procedure of [9], we are able to compress VideoMAE V2-g into a much smaller ViT-B. Specifically, we initialize the student model with the VideoMAE V2-B weights after the post-pre-training.

Can you explain this process? Where is this part in the code? How can I do such initialization?

My intention is just to use a ViT-Small model for fine-tuning by initializing it from a bigger pretrained model like vit_base or vit_huge.
I tried setting --model vit_small_patch16_224 while loading MODEL_PATH=vit_g_hybrid_pt_1200e.pth, but the weights didn't load due to size mismatches.

MODEL_PATH='YOUR_PATH/model_zoo/vit_g_hybrid_pt_1200e.pth'  # Model for initializing parameters
...
--model vit_giant_patch14_224 \
...

Thank you!
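Not an answer from the authors, but a hedged sketch of how one might initialize a student from a pretrained checkpoint of the same architecture (as in the paper, where the ViT-B student is initialized from VideoMAE V2-B weights): copy only the tensors whose names and shapes match and leave the rest at random init. Loading vit_g weights into vit_small this way copies essentially nothing, because the hidden dimensions differ. The `models` import, checkpoint path, and key handling below are assumptions about the repo layout.

```python
# Hedged sketch: shape-matched partial weight initialization of a student model.
import torch
from timm.models import create_model

import models  # assumption: importing the repo's model definitions registers the video ViTs

student = create_model('vit_small_patch16_224', pretrained=False, num_classes=710)

ckpt = torch.load('YOUR_PATH/checkpoint.pth', map_location='cpu')   # placeholder path
state = ckpt.get('model', ckpt.get('module', ckpt))                 # unwrap common checkpoint keys

student_sd = student.state_dict()
matched = {k: v for k, v in state.items()
           if k in student_sd and v.shape == student_sd[k].shape}
print(f'copying {len(matched)} / {len(student_sd)} tensors; the rest stay randomly initialized')
student.load_state_dict(matched, strict=False)
```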

Apply VideoMAEV2 to other directions.

Hello, thanks for your great work! It has achieved strong results in action recognition. I have an idea to apply VideoMAE V2 to facial expression recognition; do you think this is feasible?

I find that there seem to be some strange things in the evaluation of the model

I did some simple fine-tuning and part of it looks normal:

[screenshot of the training log]

But when I re-tested the saved .pt file with the '--eval' flag, I got slightly different results; in particular, the single-view test results differed noticeably (65.xx vs. 67.xx):

[screenshot of the evaluation log]

Is this normal or a bug? Is there something wrong with my understanding?
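For context (not an official answer): the accuracy printed during training typically comes from a single view per video, while the final '--eval' test aggregates several temporal clips and spatial crops (the 5x3 in the model zoo's 16x5x3), so some gap between the two numbers is expected. A minimal sketch of how multi-view scores are usually aggregated, assuming a model that takes (B, C, T, H, W) input; this is not the repository's exact test code.

```python
# Hedged sketch of multi-view evaluation: average softmax scores over
# several temporal clips and spatial crops of one video, then take argmax.
import torch

@torch.no_grad()
def multi_view_predict(model, views):
    # views: tensor of shape (num_views, C, T, H, W) holding all clips/crops of one video
    logits = model(views)                      # (num_views, num_classes)
    probs = torch.softmax(logits, dim=-1)
    return probs.mean(dim=0).argmax().item()   # average over views, then classify
```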

Finetuned smaller models

Hello! I'm very thankful for the weights that you have released to the public, but may I get access to ViT-B and ViT-S finetuned weights, please?

VideoMAEv2-L Weights/Checkpoints

I noticed in the distilled models section that there are weights available for VideoMAEv2-small and base for the k710 dataset.
By chance, are there any fine-tuned VideoMAEv2-Large or huge weights available for the K710 dataset as well?

Pre-train Action recognition videoMAE model on UCF101

Hi!!
I'm interested in pre-training an action recognition model with VideoMAE V2 on the UCF101 dataset, but I'm running into some difficulties and would appreciate your assistance.

Data preparation: for pre-training, you have written that each video data line should look like:

video_path 0 -1

What do the 0 and -1 mean?
What are the next steps to pre-train the model on UCF101, and what good practices should be followed before pre-training?
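Not an official answer, but a hedged sketch of building a pre-training annotation file in the "video_path 0 -1" format quoted above for UCF101. The video root, file extension, and output name are placeholders; DATASET.md remains the authoritative reference for what the two trailing fields mean.

```python
# Hedged sketch: write one "video_path 0 -1" line per UCF101 video.
import glob
import os

video_root = '/data/UCF101/videos'   # assumption: a folder of .avi files
out_file = 'ucf101_pretrain.csv'     # hypothetical output name

with open(out_file, 'w') as f:
    for path in sorted(glob.glob(os.path.join(video_root, '**', '*.avi'), recursive=True)):
        f.write(f'{path} 0 -1\n')    # format taken from the question above; see DATASET.md
```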

Unable to load the distilled model weights provided in the model zoo

How can one load and use the pre-trained distilled models from the model zoo?

First, I create the model as follows (I needed to comment out all non-default params, as they are not recognized):

model = create_model(
        'vit_base_patch16_224',
        img_size=224,
        pretrained=False,
        num_classes=710,
        #all_frames=args.num_frames * args.num_segments,
        #tubelet_size=args.tubelet_size,
        #drop_rate=args.drop,
        #drop_path_rate=args.drop_path,
        #attn_drop_rate=args.attn_drop_rate,
        #head_drop_rate=args.head_drop_rate,
        #drop_block_rate=None,
        #use_mean_pooling=args.use_mean_pooling,
        #init_scale=args.init_scale,
        #with_cp=args.with_checkpoint,
    )

When I am trying to load the weights:
https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/distill/vit_s_k710_dl_from_giant.pth

using the utils.load_state_dict() function, I get multiple errors, including:
size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([768, 3, 2, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 3, 16, 16]).

I assume this might be because the tubelet size is missing (by default it is set to 2, and that could be the dimension I'm missing). So I guess the main question is: how do I load the model (and which model)?

Any help appreciated, thanks!
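A hedged sketch, not the authors' answer: the 5-D patch_embed shape in the checkpoint suggests the repo's video ViT (created with all_frames and tubelet_size, as in the commented-out arguments above) rather than timm's image ViT. Something along these lines, where the `models` import path, the placeholder checkpoint filename, and the 'module'/'model' key handling are assumptions; run_class_finetuning.py is the authoritative construction.

```python
# Hedged sketch: build the video ViT-B and load a distilled checkpoint non-strictly.
import torch
from timm.models import create_model

import models  # assumption: registers the repo's video vit_base_patch16_224

model = create_model(
    'vit_base_patch16_224',
    pretrained=False,
    num_classes=710,
    all_frames=16,     # 16 input frames
    tubelet_size=2,    # yields the (768, 3, 2, 16, 16) patch_embed.proj.weight shape
)

ckpt = torch.load('path/to/distilled_checkpoint.pth', map_location='cpu')  # placeholder path
state = ckpt.get('module', ckpt.get('model', ckpt))
msg = model.load_state_dict(state, strict=False)
print('missing:', msg.missing_keys, 'unexpected:', msg.unexpected_keys)
```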

Pretrained smaller models availability

Hello. Thank you for the great work.
1. Could you provide the ViT-B and ViT-S models?
2. How much GPU VRAM is required to fine-tune the pretrained ViT-g model on a custom video dataset? When I try to fine-tune it with a batch size of 1 on a V100 with 32 GB of memory, I get a CUDA out-of-memory error. Is there something wrong with what I am doing?

grad_norm becomes inf and then nan when the input resolution is 112x112 during pre-training with a ViT-Small backbone

Hello, thank you very much for your significant contribution to the computer vision community! When I set the input resolution to 112x112 and pre-train with a ViT-Small backbone, grad_norm becomes inf, then nan, and then returns to normal. Is this normal or abnormal? If the training is abnormal, what should I do to avoid it? Looking forward to your answer, and thank you!

[training log screenshots]

Pretrained Action Detection on AVA-Kinetics model weights

Hello,
I'm Rohit, a researcher from IIIT-Hyderabad, India, working on a project related to action detection. Could you please provide your pretrained model weights for AVA-Kinetics/AVA v2.2? They would really help me build a model on top of them and run ablation studies.

I filled out your form for model weights, but I received weights for video action classification, which is not what I want; I am looking for pretrained weights for action detection.

Thanks,
Rohit

Starting the pre-training from a checkpoint

Can I just do:
torchrun --standalone --nproc_per_node=${NGPU} run_mae_pretraining.py . . . .
--model pretrain_videomae_giant_patch14_224
with --resume checkpoints/vit_g_hybrid_pt_1200e.pth ??

This gave: Error(s) in loading state_dict for OptimizedModule:
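For what it's worth, "OptimizedModule" in that error points at a torch.compile wrapper, whose parameter names carry an `_orig_mod.` prefix, and the released .pth files are plain weights rather than full training checkpoints with optimizer state, which is what --resume expects. A hedged, generic PyTorch workaround (not the repository's documented resume path) is to load the weights into the underlying module and then start pre-training from them:

```python
# Hedged sketch: load plain released weights into a model that may be torch.compile-wrapped.
import torch

def load_released_weights(model, path):
    ckpt = torch.load(path, map_location='cpu')
    state = ckpt.get('model', ckpt.get('module', ckpt))   # unwrap common checkpoint keys
    target = getattr(model, '_orig_mod', model)           # unwrap torch.compile if present
    msg = target.load_state_dict(state, strict=False)
    print('missing:', len(msg.missing_keys), 'unexpected:', len(msg.unexpected_keys))
    return model

# hypothetical usage: load_released_weights(model, 'checkpoints/vit_g_hybrid_pt_1200e.pth')
```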

Can you share the vit_base_patch16_224 checkpoint after 1200e?

Reproduction of TAD

Awesome Work!
Do you have plans to share a reproduction of the temporal action detection (TAD) task?

Turning VideoMAEv2 into a next-frame prediction model

Great work and thanks for the code!

I was just wondering how you see the chances that, with a proper masking strategy, one could do full next-frame prediction on an unseen video. This applies to both VideoMAE V2 and VideoMAE, I guess. The masking strategy could be simply masking the whole (last) frame, given a set of unmasked earlier frames, and then obtaining the reconstruction of the masked frame. Do you think this is feasible?
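No claim that this is supported out of the box, but as a thought experiment the mask needed for "predict the last frame" is easy to write down. A minimal sketch under the default 16-frame, 224x224, patch-16, tubelet-2 tokenization (so a single frame cannot be masked in isolation; the finest temporal unit is a 2-frame tubelet); this is not the repository's mask generator.

```python
# Hedged sketch: a boolean token mask that hides only the last temporal tubelet
# (the final two frames with tubelet_size=2), so the model would reconstruct it
# from the visible earlier frames. Token layout (T/tubelet, H/patch, W/patch)
# is an assumption based on the default configuration.
import torch

num_frames, tubelet, patch, size = 16, 2, 16, 224
t, h, w = num_frames // tubelet, size // patch, size // patch   # 8 x 14 x 14 tokens

mask = torch.zeros(t, h, w, dtype=torch.bool)
mask[-1] = True                       # hide every token of the last tubelet
mask = mask.flatten()                 # shape (1568,), True = masked / to be predicted
print(mask.sum().item(), 'of', mask.numel(), 'tokens masked')
```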

Clarification on published logs.

In the log linked from the Google Sheet, vit_g_hybrid_pt_1200e_log.txt, the epochs only go up to 300. Shouldn't they go up to 1200? The loss seems to have settled at ~0.52. Is this correct?

I would love it if you could publish the weights for the base variant as well.

Request for the training script for VideoMAE-V2-Base

Hi, thanks for your nice work! I need to train VideoMAE V2-Base on the Kinetics-400 dataset, but I couldn't find the training script for it, and I also couldn't find the performance of VideoMAE V2-Base on Kinetics-400 in your paper. Could you tell me how VideoMAE V2-Base performs on Kinetics-400?

How to train on my own dataset?

I want to run some experiments on my own dataset using the fine-tuning method, but in the training script the 'data_set' parameter only covers a few public datasets.
My dataset only has 10 classes. If I change --nb_classes to 10, an AssertionError is raised.
After skipping this assertion, training starts on my dataset, but ACC@1 stays at 0%.
Is there something I should do to train correctly on my own dataset?
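Not an official answer, but when ACC@1 is pinned at 0%, a frequent culprit is label indexing in the custom annotation file. A hedged sanity check, assuming a whitespace-separated annotation file whose last field is the integer label (the actual format is defined in DATASET.md) and a hypothetical file name:

```python
# Hedged sanity check: labels should be zero-based integers in [0, nb_classes).
nb_classes = 10
labels = []
with open('my_dataset_train.csv') as f:            # hypothetical annotation file
    for line in f:
        if line.strip():
            labels.append(int(line.split()[-1]))   # last field as label (assumption)

assert min(labels) >= 0 and max(labels) < nb_classes, \
    f'labels span [{min(labels)}, {max(labels)}], expected [0, {nb_classes - 1}]'
print('classes seen:', sorted(set(labels)))
```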

Hi! I have another question. I froze some modules so their parameters are not updated and ran the V2 version of vit_b_k400_ft.sh. With a batch size of 4, one epoch takes about 1 hour 20 minutes; with a batch size of 8, one epoch also takes about an hour; and with a batch size of 32, one epoch still takes about an hour. Is this normal? In other words, when the batch size grows 4x, the time per step also grows about 4x, so the total time per epoch barely changes. Yet whether the batch size is 4, 8, or 32, GPU utilization looks fully saturated (the GPU-Util / Compute M. column). Is it normal that increasing the batch size does not proportionally reduce training time? Currently, batch sizes of 4 and 8 can both finish ten full epochs, but with 32 I get: RuntimeError: DataLoader worker (pid 34621) is killed by signal: Killed.

With batch size 4:
Epoch: [0] [ 520/14651] eta: 1:15:14 lr: 0.000000 min_lr: 0.000000 loss: 5.9574 (5.9875) loss_scale: 1024.0000 (507.0864) weight_decay: 0.0500 (0.0500) grad_norm: 3.9344 (3.8902) time: 0.3269 (0.2472 -- 0.6107) data: 0.0189 (0.0001 -- 0.1967) max mem: 1399
Epoch: [0] [ 540/14651] eta: 1:15:06 lr: 0.000000 min_lr: 0.000000 loss: 6.2149 (5.9955) loss_scale: 2048.0000 (564.0518) weight_decay: 0.0500 (0.0500) grad_norm: 3.9445 (3.8901) time: 0.3168 (0.2267 -- 1.0072) data: 0.0463 (0.0001 -- 0.7580) max mem: 1399
Epoch: [0] [ 560/14651] eta: 1:15:02 lr: 0.000000 min_lr: 0.000000 loss: 6.1452 (6.0008) loss_scale: 2048.0000 (616.9554) weight_decay: 0.0500 (0.0500) grad_norm: 4.0181 (3.8955) time: 0.3250 (0.2347 -- 1.2508) data: 0.0027 (0.0001 -- 0.0454) max mem: 1399
Epoch: [0] [ 580/14651] eta: 1:14:44 lr: 0.000000 min_lr: 0.000000 loss: 5.7761 (5.9944) loss_scale: 2048.0000 (666.2169) weight_decay: 0.0500 (0.0500) grad_norm: 3.9082 (3.8942) time: 0.2955 (0.2435 -- 0.7480) data: 0.0323 (0.0001 -- 0.5108) max mem: 1399
Epoch: [0] [ 600/14651] eta: 1:14:26 lr: 0.000000 min_lr: 0.000000 loss: 5.8527 (5.9894) loss_scale: 2048.0000 (712.1997) weight_decay: 0.0500 (0.0500) grad_norm: 4.0570 (3.8975) time: 0.2939 (0.2322 -- 0.6804) data: 0.0239 (0.0003 -- 0.4686) max mem: 1399
Epoch: [0] [ 620/14651] eta: 1:14:29 lr: 0.000000 min_lr: 0.000000 loss: 6.1052 (5.9927) loss_scale: 2048.0000 (755.2206) weight_decay: 0.0500 (0.0500) grad_norm: 3.7728 (3.8941) time: 0.3377 (0.2268 -- 0.9082) data: 0.0112 (0.0001 -- 0.1255) max mem: 1399
[2023-05-18 23:48:49,839] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 128 iterations
[2023-05-18 23:48:49,839] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048 to 4096

With batch size 8:
Epoch: [0] [ 520/7325] eta: 1:07:53 lr: 0.000001 min_lr: 0.000001 loss: 5.5630 (5.9820) loss_scale: 1024.0000 (507.0864) weight_decay: 0.0500 (0.0500) grad_norm: 3.5644 (3.7585) time: 0.6204 (0.4038 -- 2.1102) data: 0.0023 (0.0004 -- 0.0050) max mem: 2446
Epoch: [0] [ 540/7325] eta: 1:07:22 lr: 0.000001 min_lr: 0.000001 loss: 6.0109 (5.9832) loss_scale: 2048.0000 (564.0518) weight_decay: 0.0500 (0.0500) grad_norm: 3.7292 (3.7575) time: 0.5229 (0.3955 -- 1.2074) data: 0.0026 (0.0003 -- 0.0143) max mem: 2446
Epoch: [0] [ 560/7325] eta: 1:06:38 lr: 0.000001 min_lr: 0.000001 loss: 5.7411 (5.9757) loss_scale: 2048.0000 (616.9554) weight_decay: 0.0500 (0.0500) grad_norm: 3.9810 (3.7646) time: 0.4626 (0.3977 -- 0.6022) data: 0.0028 (0.0002 -- 0.0100) max mem: 2446
Epoch: [0] [ 580/7325] eta: 1:06:15 lr: 0.000001 min_lr: 0.000001 loss: 6.0568 (5.9757) loss_scale: 2048.0000 (666.2169) weight_decay: 0.0500 (0.0500) grad_norm: 3.8520 (3.7678) time: 0.5432 (0.3913 -- 1.0046) data: 0.0483 (0.0012 -- 0.5670) max mem: 2446
Epoch: [0] [ 600/7325] eta: 1:05:51 lr: 0.000001 min_lr: 0.000001 loss: 6.0425 (5.9778) loss_scale: 2048.0000 (712.1997) weight_decay: 0.0500 (0.0500) grad_norm: 3.8539 (3.7686) time: 0.5360 (0.3927 -- 0.8390) data: 0.0418 (0.0010 -- 0.4310) max mem: 2446
Epoch: [0] [ 620/7325] eta: 1:05:36 lr: 0.000001 min_lr: 0.000001 loss: 6.0304 (5.9792) loss_scale: 2048.0000 (755.2206) weight_decay: 0.0500 (0.0500) grad_norm: 3.8550 (3.7696) time: 0.5713 (0.4063 -- 1.3603) data: 0.0031 (0.0008 -- 0.0113) max mem: 2446
[2023-05-17 19:13:54,854] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 128 iterations
[2023-05-17 19:13:54,855] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048 to 4096

With batch size 32:
Epoch: [0] [ 520/1831] eta: 0:51:08 lr: 0.000014 min_lr: 0.000014 loss: 6.0383 (5.9943) loss_scale: 1024.0000 (507.0864) weight_decay: 0.0500 (0.0500) grad_norm: 3.6213 (3.5492) time: 2.2246 (1.5183 -- 6.5994) data: 0.0011 (0.0003 -- 0.0022) max mem: 8732
Epoch: [0] [ 540/1831] eta: 0:50:09 lr: 0.000015 min_lr: 0.000015 loss: 5.9155 (5.9919) loss_scale: 2048.0000 (564.0518) weight_decay: 0.0500 (0.0500) grad_norm: 3.6299 (3.5474) time: 2.0826 (1.3866 -- 4.9960) data: 0.0013 (0.0004 -- 0.0036) max mem: 8732
Epoch: [0] [ 560/1831] eta: 0:49:23 lr: 0.000015 min_lr: 0.000015 loss: 5.8051 (5.9882) loss_scale: 2048.0000 (616.9554) weight_decay: 0.0500 (0.0500) grad_norm: 3.5953 (3.5460) time: 2.3389 (1.4880 -- 8.4142) data: 0.0011 (0.0006 -- 0.0026) max mem: 8732
Epoch: [0] [ 580/1831] eta: 0:48:25 lr: 0.000016 min_lr: 0.000016 loss: 5.8914 (5.9849) loss_scale: 2048.0000 (666.2169) weight_decay: 0.0500 (0.0500) grad_norm: 3.7235 (3.5488) time: 2.0733 (1.5494 -- 6.0287) data: 0.0010 (0.0005 -- 0.0022) max mem: 8732
Epoch: [0] [ 600/1831] eta: 0:47:43 lr: 0.000016 min_lr: 0.000016 loss: 6.0225 (5.9861) loss_scale: 2048.0000 (712.1997) weight_decay: 0.0500 (0.0500) grad_norm: 3.7547 (3.5498) time: 2.4241 (1.5126 -- 9.1236) data: 0.0012 (0.0007 -- 0.0032) max mem: 8732
Epoch: [0] [ 620/1831] eta: 0:46:56 lr: 0.000017 min_lr: 0.000017 loss: 5.8877 (5.9853) loss_scale: 2048.0000 (755.2206) weight_decay: 0.0500 (0.0500) grad_norm: 3.5791 (3.5478) time: 2.3172 (1.4694 -- 9.0961) data: 0.0011 (0.0002 -- 0.0023) max mem: 8732
[2023-05-18 21:03:26,806] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 128 iterations
[2023-05-18 21:03:26,806] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 128 iterations
[2023-05-18 21:03:26,806] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048 to 4096

No module named 'petrel_client'

code:

try:
    from petrel_client.client import Client
    petrel_backend_imported = True
except (ImportError, ModuleNotFoundError):
    petrel_backend_imported = False

How can I install petrel_client? Thanks!

Visualization Script

Hi,
thanks for the great repo!
Can you please publish code for the visualization of the predicted completions for a video?

Thanks,
Kfir

The feature details of THUMOS14 dataset.

Hi! Thanks for your excellent work! Could you please give me some details about the feature extraction for the THUMOS14 dataset (e.g., the FPS of the videos, the number of frames in each clip, and the stride)? Are the features a fixed length, or a different length for each video? Thank you very much!
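While the exact THUMOS14 settings live in TAD.md and the released extraction script, the general recipe is a sliding window over the frame sequence, so the number of features per video grows with its duration rather than being fixed. A hedged sketch of that pattern; the clip length, stride, and the pooling of forward_features are assumptions, not the repository's actual parameters.

```python
# Hedged sketch: sliding-window clip feature extraction for TAD.
import torch

@torch.no_grad()
def extract_clip_features(model, frames, clip_len=16, stride=4):
    # frames: (C, T, H, W) decoded video tensor; assumes T >= clip_len.
    feats = []
    total = frames.shape[1]
    for start in range(0, max(total - clip_len, 0) + 1, stride):
        clip = frames[:, start:start + clip_len].unsqueeze(0)   # (1, C, clip_len, H, W)
        tokens = model.forward_features(clip)                   # assumed (1, N, D) token output
        feats.append(tokens.mean(dim=1))                        # mean-pool tokens -> (1, D)
    return torch.cat(feats, dim=0)                              # (num_windows, D)
```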

Hi! I ran the V2 version of vit_b_k400_ft.sh and the final test (final_test) takes about 20 hours, as shown below. I then ran VideoMAE's final_test and it takes roughly as long. But I remember testing used to take only about two hours. What is going on? Am I misremembering? I spent an afternoon modifying the V2 version and it is still like this, and I can't find the cause.

Test: [4430/9263] eta: 10:39:52 loss: 0.8962 (0.9741) acc1: 75.0000 (77.9959) acc5: 100.0000 (92.2083) time: 8.2166 (0.1927 -- 84.2684) data: 7.9541 (0.0002 -- 84.0091) max mem: 2736
Test: [4440/9263] eta: 10:39:05 loss: 0.8962 (0.9739) acc1: 75.0000 (78.0033) acc5: 100.0000 (92.2174) time: 8.7052 (0.2151 -- 64.5609) data: 8.4290 (0.0002 -- 64.2934) max mem: 2736
Test: [4450/9263] eta: 10:37:37 loss: 0.6310 (0.9729) acc1: 87.5000 (78.0162) acc5: 100.0000 (92.2293) time: 9.0514 (0.1833 -- 69.0573) data: 8.7904 (0.0002 -- 68.8211) max mem: 2736
Test: [4460/9263] eta: 10:35:15 loss: 0.6533 (0.9734) acc1: 75.0000 (78.0038) acc5: 100.0000 (92.2271) time: 4.6454 (0.1833 -- 69.0573) data: 4.4049 (0.0002 -- 68.8211) max mem: 2736
Test: [4470/9263] eta: 10:34:31 loss: 1.2881 (0.9741) acc1: 75.0000 (77.9719) acc5: 100.0000 (92.2249) time: 6.6842 (0.1785 -- 99.6060) data: 6.4549 (0.0001 -- 99.3840) max mem: 2736
Test: [4480/9263] eta: 10:33:26 loss: 0.8821 (0.9734) acc1: 75.0000 (77.9848) acc5: 100.0000 (92.2367) time: 10.2560 (0.1761 -- 99.6060) data: 10.0390 (0.0001 -- 99.3840) max mem: 2736
Test: [4490/9263] eta: 10:31:32 loss: 0.3921 (0.9726) acc1: 87.5000 (78.0004) acc5: 100.0000 (92.2456) time: 7.0091 (0.1761 -- 52.3157) data: 6.7684 (0.0001 -- 52.0523) max mem: 2736
Test: [4500/9263] eta: 10:29:47 loss: 0.3635 (0.9713) acc1: 87.5000 (78.0299) acc5: 100.0000 (92.2601) time: 5.0883 (0.1988 -- 45.4116) data: 4.8232 (0.0004 -- 45.1441) max mem: 2736
Test: [4510/9263] eta: 10:27:42 loss: 0.4875 (0.9712) acc1: 87.5000 (78.0232) acc5: 100.0000 (92.2606) time: 4.5622 (0.1988 -- 45.4116) data: 4.3024 (0.0002 -- 45.1441) max mem: 2736
Test: [4520/9263] eta: 10:26:07 loss: 0.9449 (0.9718) acc1: 62.5000 (78.0054) acc5: 100.0000 (92.2639) time: 5.0264 (0.1956 -- 34.2879) data: 4.7782 (0.0002 -- 33.9954) max mem: 2736
Test: [4530/9263] eta: 10:25:31 loss: 0.8519 (0.9710) acc1: 75.0000 (78.0236) acc5: 100.0000 (92.2754) time: 9.2248 (0.1768 -- 112.2758) data: 8.9004 (0.0001 -- 111.7937) max mem: 2736
Test: [4540/9263] eta: 10:23:28 loss: 0.2287 (0.9698) acc1: 100.0000 (78.0555) acc5: 100.0000 (92.2842) time: 7.8819 (0.1757 -- 112.2758) data: 7.4237 (0.0001 -- 111.7937) max mem: 2736
Test: [4550/9263] eta: 10:25:02 loss: 0.2287 (0.9689) acc1: 100.0000 (78.0653) acc5: 100.0000 (92.2984) time: 14.1884 (0.1757 -- 158.7735) data: 13.8075 (0.0001 -- 158.5454) max mem: 2736
Test: [4560/9263] eta: 10:23:18 loss: 0.6578 (0.9690) acc1: 75.0000 (78.0585) acc5: 100.0000 (92.3016) time: 15.1305 (0.1843 -- 158.7735) data: 14.9077 (0.0002 -- 158.5454) max mem: 2736

Wondering about more pre-training scripts and results

Wonderful work.
In the paper's Table 2 you report the Top-1 accuracy of ViT-B on Something-Something V2 with the dual masking method, but there is no pre-training script for it in this repo.
So I wonder: is the pre-training script the same as in VideoMAE v1? Did you also pre-train on Kinetics-400 with the dual masking method, and what were the results?

Impact of Something Something and Kinetics during Unlabeled Pre-training

Hello,

In your unlabeled pre-training phase, you train on two datasets (Something-Something V2 and Kinetics) with a reconstruction loss and later fine-tune with a loss computed against their labels. Did you find that training on these two datasets during the unsupervised stage helped when fine-tuning? Did you find training on the validation sets to help as well?

Also, I'm wondering if you've ever played around with combining the first two stages of training (training against an added reconstruction loss and cross-entropy loss). Thank you ahead of time, and great work on this paper!
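On the last point, purely as an illustration of the idea being asked about (nothing like this is claimed in the paper): a combined objective would simply add a weighted cross-entropy term to the masked-reconstruction loss, e.g.:

```python
# Hedged, purely illustrative sketch of a joint reconstruction + classification loss.
# lambda_ce is an arbitrary weight, not a value from the paper.
import torch.nn.functional as F

def joint_loss(pred_patches, target_patches, logits, labels, lambda_ce=0.1):
    recon = F.mse_loss(pred_patches, target_patches)   # reconstruction on masked tokens
    ce = F.cross_entropy(logits, labels)               # supervised classification term
    return recon + lambda_ce * ce
```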
