
videomaev2's Introduction

[CVPR 2023] Official Implementation of VideoMAE V2

[Figure: overall flowchart of VideoMAE V2]


VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao
Nanjing University, Shanghai AI Lab, CAS

News

[2023.05.29] VideoMAE V2-g features for THUMOS14 and FineAction datasets are available at TAD.md now.
[2023.05.11] Testing of our distilled models is now supported in MMAction2 (dev version)! See PR#2460.
[2023.05.11] The feature extraction script for TAD datasets has been released! See instructions at TAD.md.
[2023.04.19] ViT-giant model weights have been released! You can get the download links from MODEL_ZOO.md.
[2023.04.18] Code and the distilled models (vit-s & vit-b) have been released!
[2023.04.03] Code and models will be released soon.

Model Zoo

We provide the model weights in MODEL_ZOO.md. The distilled models are also listed there and summarized in the table below.

| Model | Dataset | Teacher Model | #Frame | K710 Top-1 | K400 Top-1 | K600 Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-small | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 77.6 | 83.7 | 83.1 |
| ViT-base | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 81.5 | 86.6 | 85.9 |

Installation

Please follow the instructions in INSTALL.md.

Data Preparation

Please follow the instructions in DATASET.md for data preparation.

Pre-training

The pre-training instruction is in PRETRAIN.md.

Fine-tuning

The fine-tuning instruction is in FINETUNE.md.

Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@InProceedings{wang2023videomaev2,
    author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
    title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14549-14560}
}

@misc{videomaev2,
      title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
      author={Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
      year={2023},
      eprint={2303.16727},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

videomaev2's People

Contributors: congee524, jerryflymi

videomaev2's Issues

Error when running run_class_finetuning.py

Detected CUDA files, patching ldflags
Emitting ninja build file /home/ravindu.nagasinghe/.cache/torch_extensions/py38_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/THC -isystem /home/ravindu.nagasinghe/.conda/envs/videomae/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++17 -c /home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/TH -isystem /home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/include/THC -isystem /home/ravindu.nagasinghe/.conda/envs/videomae/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++17 -c /home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
ERROR: No supported gcc/g++ host compiler found.
Use 'nvcc -ccbin ' to specify a host compiler.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "run_class_finetuning.py", line 927, in
main(opts, ds_init)
File "run_class_finetuning.py", line 727, in main
model, optimizer, _, _ = ds_init(
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/init.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 303, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1202, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1264, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
return self.jit_load(verbose)
File "/home/ravindu.nagasinghe/.conda/envs/videomae/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
op_module = load(name=self.name,
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/ravindu.nagasinghe/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

Initialize student model's weights

Hello! Thank you for the great work!

In the paper, page 11, there is a sentence:

Distillation results.
Using the procedure of [9], we are able to compress VideoMAE V2-g into a much smaller ViT-B. Specifically, we initialize the student model with the VideoMAE V2-B weights after the post-pre-training.

Can you explain this process? Where is this part in the code? How can I do such initialization?

My intention is just to use a ViT-Small model for fine-tuning by initializing it from a bigger pretrained model like vit_base or vit_huge.
I tried setting --model vit_small_patch16_224 while loading MODEL_PATH=vit_g_hybrid_pt_1200e.pth, but the weights didn't load due to size mismatches.

MODEL_PATH='YOUR_PATH/model_zoo/vit_g_hybrid_pt_1200e.pth'  # Model for initializing parameters
...
--model vit_giant_patch14_224 \
...

Thank you!
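Not an answer from the authors, but a hedged sketch of how one might initialize a student from a pretrained checkpoint of the same architecture (as in the paper, where the ViT-B student is initialized from VideoMAE V2-B weights): copy only the tensors whose names and shapes match and leave the rest at random init. Loading vit_g weights into vit_small this way copies essentially nothing, because the hidden dimensions differ. The `models` import, checkpoint path, and key handling below are assumptions about the repo layout.

```python
# Hedged sketch: shape-matched partial weight initialization of a student model.
import torch
from timm.models import create_model

import models  # assumption: importing the repo's model definitions registers the video ViTs

student = create_model('vit_small_patch16_224', pretrained=False, num_classes=710)

ckpt = torch.load('YOUR_PATH/checkpoint.pth', map_location='cpu')   # placeholder path
state = ckpt.get('model', ckpt.get('module', ckpt))                 # unwrap common checkpoint keys

student_sd = student.state_dict()
matched = {k: v for k, v in state.items()
           if k in student_sd and v.shape == student_sd[k].shape}
print(f'copying {len(matched)} / {len(student_sd)} tensors; the rest stay randomly initialized')
student.load_state_dict(matched, strict=False)
```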

Apply VideoMAEV2 to other directions.

Hello, thanks for your great work! It has achieved strong results in action recognition. I have an idea to apply VideoMAE V2 to facial expression recognition; do you think this is feasible?

I find that there seem to be some strange things in the evaluation of the model

I did some simple fine-tuning and part of it looks normal:

[screenshot of the training log]

But when I re-tested the saved .pt file with the '--eval' flag, I got slightly different results; in particular, the single-view test results differed noticeably (65.xx vs. 67.xx):

[screenshot of the evaluation log]

Is this normal or a bug? Is there something wrong with my understanding?
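For context (not an official answer): the accuracy printed during training typically comes from a single view per video, while the final '--eval' test aggregates several temporal clips and spatial crops (the 5x3 in the model zoo's 16x5x3), so some gap between the two numbers is expected. A minimal sketch of how multi-view scores are usually aggregated, assuming a model that takes (B, C, T, H, W) input; this is not the repository's exact test code.

```python
# Hedged sketch of multi-view evaluation: average softmax scores over
# several temporal clips and spatial crops of one video, then take argmax.
import torch

@torch.no_grad()
def multi_view_predict(model, views):
    # views: tensor of shape (num_views, C, T, H, W) holding all clips/crops of one video
    logits = model(views)                      # (num_views, num_classes)
    probs = torch.softmax(logits, dim=-1)
    return probs.mean(dim=0).argmax().item()   # average over views, then classify
```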

Finetuned smaller models

Hello! I'm very thankful for the weights that you have released to the public, but may I get access to ViT-B and ViT-S finetuned weights, please?

VideoMAEv2-L Weights/Checkpoints

I noticed in the distilled models section that there are weights available for VideoMAEv2-small and base for the k710 dataset.
By chance, are there any fine-tuned VideoMAEv2-Large or huge weights available for the K710 dataset as well?

Pre-train Action recognition videoMAE model on UCF101

Hi!!
I'm interested in pre-training an action recognition model with VideoMAE V2 on the UCF101 dataset, but I'm running into some difficulties and would appreciate your assistance.

Data preparation: for pre-training, you have written that each video data line should look like:

video_path 0 -1

What do the 0 and -1 mean?
What are the next steps to pre-train the model on UCF101, and what good practices should be followed before pre-training?
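Not an official answer, but a hedged sketch of building a pre-training annotation file in the "video_path 0 -1" format quoted above for UCF101. The video root, file extension, and output name are placeholders; DATASET.md remains the authoritative reference for what the two trailing fields mean.

```python
# Hedged sketch: write one "video_path 0 -1" line per UCF101 video.
import glob
import os

video_root = '/data/UCF101/videos'   # assumption: a folder of .avi files
out_file = 'ucf101_pretrain.csv'     # hypothetical output name

with open(out_file, 'w') as f:
    for path in sorted(glob.glob(os.path.join(video_root, '**', '*.avi'), recursive=True)):
        f.write(f'{path} 0 -1\n')    # format taken from the question above; see DATASET.md
```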

Unable to load the distilled model weights provided in the model zoo

How can one load and use the pre-trained distilled models from the model zoo?

First, I create the model as follows (I needed to comment out all non-default params, as they are not recognized):

model = create_model(
        'vit_base_patch16_224',
        img_size=224,
        pretrained=False,
        num_classes=710,
        #all_frames=args.num_frames * args.num_segments,
        #tubelet_size=args.tubelet_size,
        #drop_rate=args.drop,
        #drop_path_rate=args.drop_path,
        #attn_drop_rate=args.attn_drop_rate,
        #head_drop_rate=args.head_drop_rate,
        #drop_block_rate=None,
        #use_mean_pooling=args.use_mean_pooling,
        #init_scale=args.init_scale,
        #with_cp=args.with_checkpoint,
    )

When I am trying to load the weights:
https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/distill/vit_s_k710_dl_from_giant.pth

using the utils.load_state_dict() function, I get multiple errors, including:
size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([768, 3, 2, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 3, 16, 16]).

I assume this might be because the tubelet size is missing (by default it is set to 2, and that could be the dimension I'm missing). So I guess the main question is: how do I load the model (and which model)?

Any help appreciated, thanks!
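A hedged sketch, not the authors' answer: the 5-D patch_embed shape in the checkpoint suggests the repo's video ViT (created with all_frames and tubelet_size, as in the commented-out arguments above) rather than timm's image ViT. Something along these lines, where the `models` import path, the placeholder checkpoint filename, and the 'module'/'model' key handling are assumptions; run_class_finetuning.py is the authoritative construction.

```python
# Hedged sketch: build the video ViT-B and load a distilled checkpoint non-strictly.
import torch
from timm.models import create_model

import models  # assumption: registers the repo's video vit_base_patch16_224

model = create_model(
    'vit_base_patch16_224',
    pretrained=False,
    num_classes=710,
    all_frames=16,     # 16 input frames
    tubelet_size=2,    # yields the (768, 3, 2, 16, 16) patch_embed.proj.weight shape
)

ckpt = torch.load('path/to/distilled_checkpoint.pth', map_location='cpu')  # placeholder path
state = ckpt.get('module', ckpt.get('model', ckpt))
msg = model.load_state_dict(state, strict=False)
print('missing:', msg.missing_keys, 'unexpected:', msg.unexpected_keys)
```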

Pretrained smaller models availability

Hello. Thank you for the great work.
1. Could you provide the ViT-B and ViT-S models?
2. How much GPU VRAM is required to fine-tune the pretrained ViT-g model on a custom video dataset? When I try to fine-tune it with a batch size of 1 on a V100 with 32 GB of memory, I get a CUDA out-of-memory error. Is there something wrong with what I am doing?

grad_norm becomes inf and then nan when the input resolution is 112x112 during pre-training with a ViT-Small backbone

Hello, thank you very much for your significant contribution to the computer vision community! When I set the input resolution to 112x112 and pre-train with a ViT-Small backbone, grad_norm becomes inf, then nan, and then returns to normal. Is this normal or abnormal? If the training is abnormal, what should I do to avoid it? Looking forward to your answer, and thank you!

[training log screenshots]

Pretrained Action Detection on AVA-Kinetics model weights

Hello,
I'm Rohit, a researcher from IIIT-Hyderabad, India, working on a project related to action detection. Could you please provide your pretrained model weights for AVA-Kinetics/AVA v2.2? They would really help me build a model on top of them and run ablation studies.

I filled out your form for model weights, but I received weights for video action classification, which is not what I want; I am looking for pretrained weights for action detection.

Thanks,
Rohit

Starting the pre-training from a checkpoint

Can I just do:
torchrun --standalone --nproc_per_node=${NGPU} run_mae_pretraining.py . . . .
--model pretrain_videomae_giant_patch14_224
with --resume checkpoints/vit_g_hybrid_pt_1200e.pth ??

This gave: Error(s) in loading state_dict for OptimizedModule:
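For what it's worth, "OptimizedModule" in that error points at a torch.compile wrapper, whose parameter names carry an `_orig_mod.` prefix, and the released .pth files are plain weights rather than full training checkpoints with optimizer state, which is what --resume expects. A hedged, generic PyTorch workaround (not the repository's documented resume path) is to load the weights into the underlying module and then start pre-training from them:

```python
# Hedged sketch: load plain released weights into a model that may be torch.compile-wrapped.
import torch

def load_released_weights(model, path):
    ckpt = torch.load(path, map_location='cpu')
    state = ckpt.get('model', ckpt.get('module', ckpt))   # unwrap common checkpoint keys
    target = getattr(model, '_orig_mod', model)           # unwrap torch.compile if present
    msg = target.load_state_dict(state, strict=False)
    print('missing:', len(msg.missing_keys), 'unexpected:', len(msg.unexpected_keys))
    return model

# hypothetical usage: load_released_weights(model, 'checkpoints/vit_g_hybrid_pt_1200e.pth')
```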

Can you share the vit_base_patch16_224 checkpoint after 1200e?

Reproduction of TAD

Awesome Work!
Do you have plans to share a reproduction of the temporal action detection (TAD) task?

Turning VideoMAEv2 into a next-frame prediction model

Great work and thanks for the code!

I was just wondering how you see the chances that, with a proper masking strategy, one could do full next-frame prediction on an unseen video. This applies to both VideoMAE V2 and VideoMAE, I guess. The masking strategy could be simply masking the whole (last) frame, given a set of unmasked earlier frames, and then obtaining the reconstruction of the masked frame. Do you think this is feasible?
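No claim that this is supported out of the box, but as a thought experiment the mask needed for "predict the last frame" is easy to write down. A minimal sketch under the default 16-frame, 224x224, patch-16, tubelet-2 tokenization (so a single frame cannot be masked in isolation; the finest temporal unit is a 2-frame tubelet); this is not the repository's mask generator.

```python
# Hedged sketch: a boolean token mask that hides only the last temporal tubelet
# (the final two frames with tubelet_size=2), so the model would reconstruct it
# from the visible earlier frames. Token layout (T/tubelet, H/patch, W/patch)
# is an assumption based on the default configuration.
import torch

num_frames, tubelet, patch, size = 16, 2, 16, 224
t, h, w = num_frames // tubelet, size // patch, size // patch   # 8 x 14 x 14 tokens

mask = torch.zeros(t, h, w, dtype=torch.bool)
mask[-1] = True                       # hide every token of the last tubelet
mask = mask.flatten()                 # shape (1568,), True = masked / to be predicted
print(mask.sum().item(), 'of', mask.numel(), 'tokens masked')
```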

Clarification on published logs.

In the log linked from the Google Sheet, vit_g_hybrid_pt_1200e_log.txt, the epochs only go up to 300. Shouldn't they go up to 1200? The loss seems to have settled at ~0.52. Is this correct?

I would love it if you could publish the weights for the base variant as well.

Request for the training script for VideoMAE-V2-Base

Hi, thanks for your nice work! I need to train VideoMAE V2-Base on the Kinetics-400 dataset, but I couldn't find the training script for it, and I also couldn't find the performance of VideoMAE V2-Base on Kinetics-400 in your paper. Could you tell me how VideoMAE V2-Base performs on Kinetics-400?

How to train on my own dataset?

I want to run some experiments on my own dataset using the fine-tuning method, but in the training script the 'data_set' parameter only covers a few public datasets.
My dataset only has 10 classes. If I change --nb_classes to 10, an AssertionError is raised.
After skipping this assertion, training starts on my dataset, but ACC@1 stays at 0%.
Is there something I should do to train correctly on my own dataset?
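Not an official answer, but when ACC@1 is pinned at 0%, a frequent culprit is label indexing in the custom annotation file. A hedged sanity check, assuming a whitespace-separated annotation file whose last field is the integer label (the actual format is defined in DATASET.md) and a hypothetical file name:

```python
# Hedged sanity check: labels should be zero-based integers in [0, nb_classes).
nb_classes = 10
labels = []
with open('my_dataset_train.csv') as f:            # hypothetical annotation file
    for line in f:
        if line.strip():
            labels.append(int(line.split()[-1]))   # last field as label (assumption)

assert min(labels) >= 0 and max(labels) < nb_classes, \
    f'labels span [{min(labels)}, {max(labels)}], expected [0, {nb_classes - 1}]'
print('classes seen:', sorted(set(labels)))
```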

Hi! I have another question. I froze some modules so their parameters are not updated and ran the V2 version of vit_b_k400_ft.sh. With a batch size of 4, one epoch takes about 1 hour 20 minutes; with a batch size of 8, one epoch also takes about an hour; and with a batch size of 32, one epoch still takes about an hour. Is this normal? In other words, when the batch size grows 4x, the time per step also grows about 4x, so the total time per epoch barely changes. Yet whether the batch size is 4, 8, or 32, GPU utilization looks fully saturated (the GPU-Util / Compute M. column). Is it normal that increasing the batch size does not proportionally reduce training time? Currently, batch sizes of 4 and 8 can both finish ten full epochs, but with 32 I get: RuntimeError: DataLoader worker (pid 34621) is killed by signal: Killed.

With batch size 4:
Epoch: [0] [ 520/14651] eta: 1:15:14 lr: 0.000000 min_lr: 0.000000 loss: 5.9574 (5.9875) loss_scale: 1024.0000 (507.0864) weight_decay: 0.0500 (0.0500) grad_norm: 3.9344 (3.8902) time: 0.3269 (0.2472 -- 0.6107) data: 0.0189 (0.0001 -- 0.1967) max mem: 1399
Epoch: [0] [ 540/14651] eta: 1:15:06 lr: 0.000000 min_lr: 0.000000 loss: 6.2149 (5.9955) loss_scale: 2048.0000 (564.0518) weight_decay: 0.0500 (0.0500) grad_norm: 3.9445 (3.8901) time: 0.3168 (0.2267 -- 1.0072) data: 0.0463 (0.0001 -- 0.7580) max mem: 1399
Epoch: [0] [ 560/14651] eta: 1:15:02 lr: 0.000000 min_lr: 0.000000 loss: 6.1452 (6.0008) loss_scale: 2048.0000 (616.9554) weight_decay: 0.0500 (0.0500) grad_norm: 4.0181 (3.8955) time: 0.3250 (0.2347 -- 1.2508) data: 0.0027 (0.0001 -- 0.0454) max mem: 1399
Epoch: [0] [ 580/14651] eta: 1:14:44 lr: 0.000000 min_lr: 0.000000 loss: 5.7761 (5.9944) loss_scale: 2048.0000 (666.2169) weight_decay: 0.0500 (0.0500) grad_norm: 3.9082 (3.8942) time: 0.2955 (0.2435 -- 0.7480) data: 0.0323 (0.0001 -- 0.5108) max mem: 1399
Epoch: [0] [ 600/14651] eta: 1:14:26 lr: 0.000000 min_lr: 0.000000 loss: 5.8527 (5.9894) loss_scale: 2048.0000 (712.1997) weight_decay: 0.0500 (0.0500) grad_norm: 4.0570 (3.8975) time: 0.2939 (0.2322 -- 0.6804) data: 0.0239 (0.0003 -- 0.4686) max mem: 1399
Epoch: [0] [ 620/14651] eta: 1:14:29 lr: 0.000000 min_lr: 0.000000 loss: 6.1052 (5.9927) loss_scale: 2048.0000 (755.2206) weight_decay: 0.0500 (0.0500) grad_norm: 3.7728 (3.8941) time: 0.3377 (0.2268 -- 0.9082) data: 0.0112 (0.0001 -- 0.1255) max mem: 1399
[2023-05-18 23:48:49,839] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 128 iterations
[2023-05-18 23:48:49,839] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048 to 4096

With batch size 8:
Epoch: [0] [ 520/7325] eta: 1:07:53 lr: 0.000001 min_lr: 0.000001 loss: 5.5630 (5.9820) loss_scale: 1024.0000 (507.0864) weight_decay: 0.0500 (0.0500) grad_norm: 3.5644 (3.7585) time: 0.6204 (0.4038 -- 2.1102) data: 0.0023 (0.0004 -- 0.0050) max mem: 2446
Epoch: [0] [ 540/7325] eta: 1:07:22 lr: 0.000001 min_lr: 0.000001 loss: 6.0109 (5.9832) loss_scale: 2048.0000 (564.0518) weight_decay: 0.0500 (0.0500) grad_norm: 3.7292 (3.7575) time: 0.5229 (0.3955 -- 1.2074) data: 0.0026 (0.0003 -- 0.0143) max mem: 2446
Epoch: [0] [ 560/7325] eta: 1:06:38 lr: 0.000001 min_lr: 0.000001 loss: 5.7411 (5.9757) loss_scale: 2048.0000 (616.9554) weight_decay: 0.0500 (0.0500) grad_norm: 3.9810 (3.7646) time: 0.4626 (0.3977 -- 0.6022) data: 0.0028 (0.0002 -- 0.0100) max mem: 2446
Epoch: [0] [ 580/7325] eta: 1:06:15 lr: 0.000001 min_lr: 0.000001 loss: 6.0568 (5.9757) loss_scale: 2048.0000 (666.2169) weight_decay: 0.0500 (0.0500) grad_norm: 3.8520 (3.7678) time: 0.5432 (0.3913 -- 1.0046) data: 0.0483 (0.0012 -- 0.5670) max mem: 2446
Epoch: [0] [ 600/7325] eta: 1:05:51 lr: 0.000001 min_lr: 0.000001 loss: 6.0425 (5.9778) loss_scale: 2048.0000 (712.1997) weight_decay: 0.0500 (0.0500) grad_norm: 3.8539 (3.7686) time: 0.5360 (0.3927 -- 0.8390) data: 0.0418 (0.0010 -- 0.4310) max mem: 2446
Epoch: [0] [ 620/7325] eta: 1:05:36 lr: 0.000001 min_lr: 0.000001 loss: 6.0304 (5.9792) loss_scale: 2048.0000 (755.2206) weight_decay: 0.0500 (0.0500) grad_norm: 3.8550 (3.7696) time: 0.5713 (0.4063 -- 1.3603) data: 0.0031 (0.0008 -- 0.0113) max mem: 2446
[2023-05-17 19:13:54,854] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 128 iterations
[2023-05-17 19:13:54,855] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048 to 4096

With batch size 32:
Epoch: [0] [ 520/1831] eta: 0:51:08 lr: 0.000014 min_lr: 0.000014 loss: 6.0383 (5.9943) loss_scale: 1024.0000 (507.0864) weight_decay: 0.0500 (0.0500) grad_norm: 3.6213 (3.5492) time: 2.2246 (1.5183 -- 6.5994) data: 0.0011 (0.0003 -- 0.0022) max mem: 8732
Epoch: [0] [ 540/1831] eta: 0:50:09 lr: 0.000015 min_lr: 0.000015 loss: 5.9155 (5.9919) loss_scale: 2048.0000 (564.0518) weight_decay: 0.0500 (0.0500) grad_norm: 3.6299 (3.5474) time: 2.0826 (1.3866 -- 4.9960) data: 0.0013 (0.0004 -- 0.0036) max mem: 8732
Epoch: [0] [ 560/1831] eta: 0:49:23 lr: 0.000015 min_lr: 0.000015 loss: 5.8051 (5.9882) loss_scale: 2048.0000 (616.9554) weight_decay: 0.0500 (0.0500) grad_norm: 3.5953 (3.5460) time: 2.3389 (1.4880 -- 8.4142) data: 0.0011 (0.0006 -- 0.0026) max mem: 8732
Epoch: [0] [ 580/1831] eta: 0:48:25 lr: 0.000016 min_lr: 0.000016 loss: 5.8914 (5.9849) loss_scale: 2048.0000 (666.2169) weight_decay: 0.0500 (0.0500) grad_norm: 3.7235 (3.5488) time: 2.0733 (1.5494 -- 6.0287) data: 0.0010 (0.0005 -- 0.0022) max mem: 8732
Epoch: [0] [ 600/1831] eta: 0:47:43 lr: 0.000016 min_lr: 0.000016 loss: 6.0225 (5.9861) loss_scale: 2048.0000 (712.1997) weight_decay: 0.0500 (0.0500) grad_norm: 3.7547 (3.5498) time: 2.4241 (1.5126 -- 9.1236) data: 0.0012 (0.0007 -- 0.0032) max mem: 8732
Epoch: [0] [ 620/1831] eta: 0:46:56 lr: 0.000017 min_lr: 0.000017 loss: 5.8877 (5.9853) loss_scale: 2048.0000 (755.2206) weight_decay: 0.0500 (0.0500) grad_norm: 3.5791 (3.5478) time: 2.3172 (1.4694 -- 9.0961) data: 0.0011 (0.0002 -- 0.0023) max mem: 8732
[2023-05-18 21:03:26,806] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 128 iterations
[2023-05-18 21:03:26,806] [INFO] [fused_optimizer.py:370:_update_scale] No Grad overflow for 128 iterations
[2023-05-18 21:03:26,806] [INFO] [fused_optimizer.py:371:_update_scale] Increasing dynamic loss scale from 2048 to 4096

No module named 'petrel_client'

code:

try:
    from petrel_client.client import Client
    petrel_backend_imported = True
except (ImportError, ModuleNotFoundError):
    petrel_backend_imported = False

How can I install petrel_client? Thanks!

Visualization Script

Hi,
thanks for the great repo!
Can you please publish code for the visualization of the predicted completions for a video?

Thanks,
Kfir

The feature details of THUMOS14 dataset.

Hi! Thanks for your excellent work! Could you please give me some details about the feature extraction for the THUMOS14 dataset (e.g., the FPS of the videos, the number of frames in each clip, and the stride)? Are the features a fixed length, or a different length for each video? Thank you very much!
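While the exact THUMOS14 settings live in TAD.md and the released extraction script, the general recipe is a sliding window over the frame sequence, so the number of features per video grows with its duration rather than being fixed. A hedged sketch of that pattern; the clip length, stride, and the pooling of forward_features are assumptions, not the repository's actual parameters.

```python
# Hedged sketch: sliding-window clip feature extraction for TAD.
import torch

@torch.no_grad()
def extract_clip_features(model, frames, clip_len=16, stride=4):
    # frames: (C, T, H, W) decoded video tensor; assumes T >= clip_len.
    feats = []
    total = frames.shape[1]
    for start in range(0, max(total - clip_len, 0) + 1, stride):
        clip = frames[:, start:start + clip_len].unsqueeze(0)   # (1, C, clip_len, H, W)
        tokens = model.forward_features(clip)                   # assumed (1, N, D) token output
        feats.append(tokens.mean(dim=1))                        # mean-pool tokens -> (1, D)
    return torch.cat(feats, dim=0)                              # (num_windows, D)
```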

Hi! I ran the V2 version of vit_b_k400_ft.sh and the final test (final_test) takes about 20 hours, as shown below. I then ran VideoMAE's final_test and it takes roughly as long. But I remember testing used to take only about two hours. What is going on? Am I misremembering? I spent an afternoon modifying the V2 version and it is still like this, and I can't find the cause.

Test: [4430/9263] eta: 10:39:52 loss: 0.8962 (0.9741) acc1: 75.0000 (77.9959) acc5: 100.0000 (92.2083) time: 8.2166 (0.1927 -- 84.2684) data: 7.9541 (0.0002 -- 84.0091) max mem: 2736
Test: [4440/9263] eta: 10:39:05 loss: 0.8962 (0.9739) acc1: 75.0000 (78.0033) acc5: 100.0000 (92.2174) time: 8.7052 (0.2151 -- 64.5609) data: 8.4290 (0.0002 -- 64.2934) max mem: 2736
Test: [4450/9263] eta: 10:37:37 loss: 0.6310 (0.9729) acc1: 87.5000 (78.0162) acc5: 100.0000 (92.2293) time: 9.0514 (0.1833 -- 69.0573) data: 8.7904 (0.0002 -- 68.8211) max mem: 2736
Test: [4460/9263] eta: 10:35:15 loss: 0.6533 (0.9734) acc1: 75.0000 (78.0038) acc5: 100.0000 (92.2271) time: 4.6454 (0.1833 -- 69.0573) data: 4.4049 (0.0002 -- 68.8211) max mem: 2736
Test: [4470/9263] eta: 10:34:31 loss: 1.2881 (0.9741) acc1: 75.0000 (77.9719) acc5: 100.0000 (92.2249) time: 6.6842 (0.1785 -- 99.6060) data: 6.4549 (0.0001 -- 99.3840) max mem: 2736
Test: [4480/9263] eta: 10:33:26 loss: 0.8821 (0.9734) acc1: 75.0000 (77.9848) acc5: 100.0000 (92.2367) time: 10.2560 (0.1761 -- 99.6060) data: 10.0390 (0.0001 -- 99.3840) max mem: 2736
Test: [4490/9263] eta: 10:31:32 loss: 0.3921 (0.9726) acc1: 87.5000 (78.0004) acc5: 100.0000 (92.2456) time: 7.0091 (0.1761 -- 52.3157) data: 6.7684 (0.0001 -- 52.0523) max mem: 2736
Test: [4500/9263] eta: 10:29:47 loss: 0.3635 (0.9713) acc1: 87.5000 (78.0299) acc5: 100.0000 (92.2601) time: 5.0883 (0.1988 -- 45.4116) data: 4.8232 (0.0004 -- 45.1441) max mem: 2736
Test: [4510/9263] eta: 10:27:42 loss: 0.4875 (0.9712) acc1: 87.5000 (78.0232) acc5: 100.0000 (92.2606) time: 4.5622 (0.1988 -- 45.4116) data: 4.3024 (0.0002 -- 45.1441) max mem: 2736
Test: [4520/9263] eta: 10:26:07 loss: 0.9449 (0.9718) acc1: 62.5000 (78.0054) acc5: 100.0000 (92.2639) time: 5.0264 (0.1956 -- 34.2879) data: 4.7782 (0.0002 -- 33.9954) max mem: 2736
Test: [4530/9263] eta: 10:25:31 loss: 0.8519 (0.9710) acc1: 75.0000 (78.0236) acc5: 100.0000 (92.2754) time: 9.2248 (0.1768 -- 112.2758) data: 8.9004 (0.0001 -- 111.7937) max mem: 2736
Test: [4540/9263] eta: 10:23:28 loss: 0.2287 (0.9698) acc1: 100.0000 (78.0555) acc5: 100.0000 (92.2842) time: 7.8819 (0.1757 -- 112.2758) data: 7.4237 (0.0001 -- 111.7937) max mem: 2736
Test: [4550/9263] eta: 10:25:02 loss: 0.2287 (0.9689) acc1: 100.0000 (78.0653) acc5: 100.0000 (92.2984) time: 14.1884 (0.1757 -- 158.7735) data: 13.8075 (0.0001 -- 158.5454) max mem: 2736
Test: [4560/9263] eta: 10:23:18 loss: 0.6578 (0.9690) acc1: 75.0000 (78.0585) acc5: 100.0000 (92.3016) time: 15.1305 (0.1843 -- 158.7735) data: 14.9077 (0.0002 -- 158.5454) max mem: 2736

Wondering about more pre-training scripts and results

Wonderful work.
In the paper's Table 2 you report the Top-1 accuracy of ViT-B on Something-Something V2 with the dual masking method, but there is no pre-training script for it in this repo.
So I wonder: is the pre-training script the same as in VideoMAE v1? Did you also pre-train on Kinetics-400 with the dual masking method, and what were the results?

Impact of Something Something and Kinetics during Unlabeled Pre-training

Hello,

In your unlabeled pre-training phase, you train on two datasets (Something-Something V2 and Kinetics) with a reconstruction loss and later fine-tune with a loss computed against their labels. Did you find that training on these two datasets during the unsupervised stage helped when fine-tuning? Did you find training on the validation sets to help as well?

Also, I'm wondering if you've ever played around with combining the first two stages of training (training against an added reconstruction loss and cross-entropy loss). Thank you ahead of time, and great work on this paper!
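On the last point, purely as an illustration of the idea being asked about (nothing like this is claimed in the paper): a combined objective would simply add a weighted cross-entropy term to the masked-reconstruction loss, e.g.:

```python
# Hedged, purely illustrative sketch of a joint reconstruction + classification loss.
# lambda_ce is an arbitrary weight, not a value from the paper.
import torch.nn.functional as F

def joint_loss(pred_patches, target_patches, logits, labels, lambda_ce=0.1):
    recon = F.mse_loss(pred_patches, target_patches)   # reconstruction on masked tokens
    ce = F.cross_entropy(logits, labels)               # supervised classification term
    return recon + lambda_ce * ce
```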
