taoyang1122 / adapt-image-models
This project forked from amazon-science/adapt-image-models
[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition
License: Apache License 2.0
Hello!
I tried to reproduce your training results on the Something-Something v2 dataset, but failed. I used your sthv2 config file, vitclip_base_sthv2.py,
and the command:
bash tools/dist_train.sh <PATH/TO/CONFIG> <NUM_GPU> --test-last --validate --cfg-options work_dir=<PATH/TO/OUTPUT>
I followed the MMAction2 data preparation procedure that you refer to.
Environment info:
python 3.8.16, pytorch 1.10.0, torchvision 0.11.0, cudatoolkit 11.3.1, mmcv 1.4.0
I used 8 NVIDIA A100 40GB GPUs for this reproduction attempt.
I attach the last lines of the log file:
2023-06-11 10:18:25,528 - mmaction - INFO - Epoch [50][2560/2640] lr: 2.960e-07, eta: 0:02:06, time: 1.552, data_time: 0.008, memory: 10710, loss_cls: 2.8023, loss: 2.8023
2023-06-11 10:18:57,682 - mmaction - INFO - Epoch [50][2580/2640] lr: 2.960e-07, eta: 0:01:35, time: 1.608, data_time: 0.009, memory: 10710, loss_cls: 2.6813, loss: 2.6813
2023-06-11 10:19:28,511 - mmaction - INFO - Epoch [50][2600/2640] lr: 2.960e-07, eta: 0:01:03, time: 1.542, data_time: 0.190, memory: 10710, loss_cls: 2.8156, loss: 2.8156
2023-06-11 10:20:01,052 - mmaction - INFO - Epoch [50][2620/2640] lr: 2.960e-07, eta: 0:00:31, time: 1.621, data_time: 0.005, memory: 10710, loss_cls: 2.7159, loss: 2.7159
2023-06-11 10:20:28,968 - mmaction - INFO - Epoch [50][2640/2640] lr: 2.960e-07, eta: 0:00:00, time: 1.403, data_time: 0.104, memory: 10710, loss_cls: 2.6996, loss: 2.6996
2023-06-11 10:20:30,023 - mmaction - INFO - Saving checkpoint at 50 epochs
2023-06-11 10:27:48,532 - mmaction - INFO - Evaluating top_k_accuracy ...
2023-06-11 10:27:49,280 - mmaction - INFO -
top1_acc 0.3803
top5_acc 0.6874
2023-06-11 10:27:49,280 - mmaction - INFO - Evaluating mean_class_accuracy ...
2023-06-11 10:27:49,319 - mmaction - INFO -
mean_acc 0.3086
2023-06-11 10:27:49,371 - mmaction - INFO - The previous best checkpoint [REDACTED]/best_top1_acc_epoch_45.pth was removed
2023-06-11 10:27:50,692 - mmaction - INFO - Now best checkpoint is saved as best_top1_acc_epoch_50.pth.
2023-06-11 10:27:50,693 - mmaction - INFO - Best top1_acc is 0.3803 at 50 epoch.
2023-06-11 10:27:50,702 - mmaction - INFO - Epoch(val) [50][3098] top1_acc: 0.3803, top5_acc: 0.6874, mean_class_accuracy: 0.3086
2023-06-11 10:37:59,836 - mmaction - INFO - Testing results of the last checkpoint
2023-06-11 10:37:59,836 - mmaction - INFO - top1_acc: 0.3875
2023-06-11 10:37:59,836 - mmaction - INFO - top5_acc: 0.6963
2023-06-11 10:37:59,837 - mmaction - INFO - mean_class_accuracy: 0.3136
As I understand it, the paper reports 66.4 top-1 and 90.5 top-5 accuracy for this configuration. Yet, as the logs show, my results (38.75 top-1 and 69.63 top-5 on the final test) are much worse.
Am I missing something?
Please let me know. Thank you.
What is the difference between the swin2d_base.py and swin2d_adapter_base.py files in the configs folder?
We are sorry, but you do not have access to Google Docs. Some reasons why you may not have access:
Unable to download the model.
Hi, I just wonder if the configs for SSV2 will be available in the future.
Hello,
I am trying to train the model on a server with 4x 2080Ti GPUs, using the command below:
bash tools/dist_train.sh ./configs/recognition/vit/vitclip_large_k400.py 4 --test-last --validate --cfg-options work_dir=./work_dirs
but it fails with this error:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=========
tools/train.py FAILED
How can I solve it? Thanks a lot~
Will init_weights be called automatically when the model is built? I don't know how to execute the init_weights function.
def __init__(self, input_resolution: int, num_frames: int, patch_size: int, width: int, layers: int, heads: int, drop_path_rate, num_tadapter=1, adapter_scale=0.5, pretrained=None):
    super().__init__()
    self.input_resolution = input_resolution
    self.pretrained = pretrained
    self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)
    scale = width ** -0.5
    self.layers = layers
    self.class_embedding = nn.Parameter(scale * torch.randn(width))
    self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
    self.ln_pre = LayerNorm(width)
    self.num_frames = num_frames
    self.temporal_embedding = nn.Parameter(torch.zeros(1, num_frames, width))
    self.transformer = Transformer(num_frames, width, layers, heads, num_tadapter=num_tadapter, scale=adapter_scale, drop_path=drop_path_rate)
    self.ln_post = LayerNorm(width)

def init_weights(self, pretrained=None):
    def _init_weights(m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    if pretrained:
        self.pretrained = pretrained
    if isinstance(self.pretrained, str):
        self.apply(_init_weights)
        # logger = get_root_logger()
        print(f'load model from: {self.pretrained}')
        ## Load OpenAI CLIP pretrained weights
        if self.layers == 12:
            clip_model, preprocess = clip.load("ViT-B/16", device="cpu")
        else:
            clip_model, preprocess = clip.load("ViT-L/14", device="cpu")
        pretrain_dict = clip_model.visual.state_dict()
        del clip_model
        del pretrain_dict['proj']
        msg = self.load_state_dict(pretrain_dict, strict=False)
        print('Missing keys: {}'.format(msg.missing_keys))
        print('Unexpected keys: {}'.format(msg.unexpected_keys))
        print(f"=> loaded successfully '{self.pretrained}'")
        torch.cuda.empty_cache()
    elif self.pretrained is None:
        self.apply(_init_weights)
    else:
        raise TypeError('pretrained must be a str or None')

    ## initialize S_Adapter
    for n, m in self.transformer.named_modules():
        if 'S_Adapter' in n:
            for n2, m2 in m.named_modules():
                if 'D_fc2' in n2:
                    if isinstance(m2, nn.Linear):
                        nn.init.constant_(m2.weight, 0)
                        nn.init.constant_(m2.bias, 0)

    ## initialize T_Adapter
    for n, m in self.transformer.named_modules():
        if 'T_Adapter' in n:
            for n2, m2 in m.named_modules():
                if 'D_fc2' in n2:
                    if isinstance(m2, nn.Linear):
                        nn.init.constant_(m2.weight, 0)
                        nn.init.constant_(m2.bias, 0)

    ## initialize MLP_Adapter
    for n, m in self.transformer.named_modules():
        if 'MLP_Adapter' in n:
            for n2, m2 in m.named_modules():
                if 'D_fc2' in n2:
                    if isinstance(m2, nn.Linear):
                        nn.init.constant_(m2.weight, 0)
                        nn.init.constant_(m2.bias, 0)
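Whether init_weights runs automatically depends on what builds the model. The sketch below only illustrates the general pattern, with hypothetical class names (Backbone, Recognizer) that are not taken from the repository: a wrapper that calls init_weights in its constructor initializes the backbone for you, while a backbone created on its own stays uninitialized until you call the method yourself. I believe the MMAction2 recognizer wrapper follows this pattern, but please verify in mmaction/models/recognizers/base.py of your checkout.

```python
class Backbone:
    """Hypothetical stand-in for a backbone such as ViT_CLIP."""

    def __init__(self):
        self.initialized = False

    def init_weights(self):
        # A real backbone would load pretrained weights here.
        self.initialized = True


class Recognizer:
    """Hypothetical wrapper that initializes its backbone on construction."""

    def __init__(self, backbone):
        self.backbone = backbone
        self.init_weights()

    def init_weights(self):
        self.backbone.init_weights()


standalone = Backbone()             # built directly: nothing calls init_weights
wrapped = Recognizer(Backbone())    # built through the wrapper: called for you

print(standalone.initialized, wrapped.backbone.initialized)
```

If you instantiate the backbone directly (for example in a notebook), calling backbone.init_weights() once after construction is the safe option.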
Hello!
I don't see which GPUs were used in this project. I have four 40 GB NVIDIA V100 GPUs; could I run the best model?
thanks!
Hello, I really appreciate your work. May I ask where I can download the ImageNet-pretrained ViT model?
Hello,
Do you plan to share checkpoint files of models trained on the SSv2 dataset?
Hello, thank you for the insightful work!
I'm trying to reproduce the results, but I have some questions about the hyperparameter settings.
I also have some problems reproducing the results in Table 6. Does the memory there mean the peak memory usage (during both forward and backward passes) per GPU? What clip_len is used there?
Hello,
In most of the config files, I can see that the optimizer's "lr" argument is set to 3e-4 and warmup_iters in lr_config is set to 2.5. Does this mean the learning rate starts from 0, increases linearly to 3e-4 by epoch 2.5, and then anneals back to 0 by epoch 50? Do you scale the learning rate based on the batch size?
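One rough way to check this against the logs: the sketch below assumes a cosine schedule that is stepped once per epoch with a minimum of 0, combined with a linear warmup over the first 2.5 epochs. The function name lr_at and the warmup start factor warmup_ratio are assumptions for illustration, not values read from the config. Under these assumptions the last epoch of a 50-epoch run gives 2.960e-07, which is the lr value printed in the epoch-50 training logs quoted in other issues on this page.

```python
import math


def lr_at(progress, base_lr=3e-4, max_epochs=50,
          warmup_epochs=2.5, warmup_ratio=0.1, min_lr=0.0):
    """Learning rate after `progress` epochs (may be fractional).

    Assumed schedule: cosine annealing evaluated at whole epochs,
    scaled by a linear warmup factor during the first epochs.
    """
    epoch = math.floor(progress)
    cosine = min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / max_epochs))
    if progress < warmup_epochs:
        # The factor (1 - k) grows linearly from warmup_ratio to 1.
        k = (1 - progress / warmup_epochs) * (1 - warmup_ratio)
        return cosine * (1 - k)
    return cosine


for p in (0, 1, 2.5, 25, 49):
    print(f"after {p:>4} epochs: lr = {lr_at(p):.3e}")
```

Whether the warmup really starts at 0 depends on the warmup start factor, which this sketch only guesses. Batch-size scaling is not modeled here at all.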
Hi, thanks for the cool work.
I am using Video-Swin-Transformer (Swin-S, Swin-B, and Swin-L are too big for my device).
I want to switch to your work. Do you have pretrained weights for smaller models such as ViT-S/16 or ViT-T/16?
I am looking forward to hearing from you.
Hi, I am trying to reproduce CLIP with adapters for action recognition, but I ran into this error:
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected is_sm80 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
Have you encountered this before?
Hello author,
Thanks for this research. I want to train vitclip_large_k400 on 4 V100 GPUs,
but when I run the command "bash tools/dist_train.sh configs/recognition/vit/vitclip_large_k400.py 3 --test-last --validate --cfg-options model.backbone.pretrained=openaiclip work_dir=work_dirs_vit/k400_vitlarge/debug",
I get the error below:
/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
Traceback (most recent call last):
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/run.py", line 564, in determine_local_world_size
return int(nproc_per_node)
ValueError: invalid literal for int() with base 10: '--validate'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/run.py", line 709, in run
config, cmd, cmd_args = config_from_args(args)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/run.py", line 617, in config_from_args
nproc_per_node = determine_local_world_size(args.nproc_per_node)
File "/home/u1043795/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/distributed/run.py", line 582, in determine_local_world_size
raise ValueError(f"Unsupported nproc_per_node value: {nproc_per_node}")
ValueError: Unsupported nproc_per_node value: --validate
I found that in practical applications, your method did not significantly improve training efficiency. Although the number of parameters to train is much smaller, it still requires calculating gradients for almost all layers due to backpropagation. As a result, there was not a significant reduction in memory usage or training time.
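To illustrate why freezing weights alone does not remove the backward pass, here is a minimal scalar sketch (a hypothetical three-layer chain with a hand-derived gradient, not the AIM code). The only trainable value is an adapter a in the first layer, yet its gradient is a product that runs through the two frozen layers above it, so the backward pass still has to traverse them.

```python
def forward(x, a, frozen):
    """Chain of scalar layers; the adapter `a` sits in the first layer."""
    w1, w2, w3 = frozen
    h1 = w1 * x + a * x   # frozen weight plus trainable adapter branch
    h2 = w2 * h1          # frozen layer
    return w3 * h2        # frozen layer


def grad_wrt_adapter(x, frozen):
    """dy/da by the chain rule: dy/dh2 * dh2/dh1 * dh1/da."""
    _, w2, w3 = frozen
    return w3 * w2 * x    # passes through both frozen layers


frozen = (0.5, 2.0, -1.5)
x, a, eps = 3.0, 0.1, 1e-6

analytic = grad_wrt_adapter(x, frozen)
# Central finite difference as an independent check of the chain rule.
numeric = (forward(x, a + eps, frozen) - forward(x, a - eps, frozen)) / (2 * eps)
print(analytic, numeric)
```

What freezing does save is the storage for weight gradients and optimizer state of the frozen parameters. Activation memory and most of the backward compute remain whenever a trainable parameter sits early in the network.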
Hello, I encountered an interruption while training your model. When I tried to resume the training, it always started from the beginning. How can I solve this issue?
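One thing to check, offered as a hedged pointer rather than a confirmed fix: the config dumps quoted on this page contain resume_from = None, which suggests a run starts from scratch unless that key is set. A minimal config fragment, assuming a checkpoint named epoch_40.pth exists in a hypothetical work directory work_dirs/my_run:

```python
# Hypothetical paths: replace with your own work_dir and checkpoint file.
work_dir = 'work_dirs/my_run'

# Resuming restores the weights together with the optimizer state and
# the epoch counter stored in the checkpoint.
resume_from = f'{work_dir}/epoch_40.pth'

# load_from, by contrast, only loads weights and starts a fresh schedule.
load_from = None
```

The same key can presumably also be overridden on the command line through --cfg-options, in the same way work_dir is overridden in the commands quoted on this page.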
Hi, after reading your paper I would like to know how to freeze layers so that their parameters are not updated.
Hello, thank you for the insightful research.
In the paper, the views during test time are described as follows:
Views = #frames × #temporal × #spatial
From what I understand, #temporal and #spatial represent the number of temporal and spatial samplings during test time.
I'm not very familiar with mmaction, so I'm not sure which part of the config file to refer to.
How many views are there for the base-diving48 case?
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=16,
        num_clips=1,
        frame_uniform=True,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
Is it 32x1x1? What does max_testing_views mean?
I tried looking into the mmaction documentation but couldn't grasp it, so I'm asking here.
Thank you.
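A small sketch of how the view count could be read off a test pipeline, assuming that #frames corresponds to clip_len, #temporal to num_clips, and #spatial to the number of crops the crop transform produces. The helper count_views and the crop table are illustrative assumptions, not part of mmaction. Under this counting the pipeline above gives 32 x 1 x 3 rather than 32 x 1 x 1, because ThreeCrop yields three spatial crops.

```python
# Number of spatial crops produced by each crop transform (assumed mapping).
CROPS = {'CenterCrop': 1, 'ThreeCrop': 3, 'TenCrop': 10}


def count_views(pipeline):
    """Return (#frames, #temporal clips, #spatial crops) for a test pipeline."""
    frames, clips, crops = None, 1, 1
    for step in pipeline:
        if step['type'] == 'SampleFrames':
            frames = step['clip_len']
            clips = step.get('num_clips', 1)
        elif step['type'] in CROPS:
            crops = CROPS[step['type']]
    return frames, clips, crops


# Only the steps that affect the view count are repeated here.
test_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=32, frame_interval=16, num_clips=1,
         frame_uniform=True, test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
]

frames, clips, crops = count_views(test_pipeline)
print(f"{frames} x {clips} x {crops}")
```

As far as I can tell, max_testing_views does not change the number of views; it only caps how many views are passed through the backbone in one forward pass to bound test-time memory. Please treat that as my reading of mmaction rather than a statement from the authors.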
What an effective method, thank you for sharing. I have a question: I want to reproduce the visualizations, but I don't know how to do it. Could you explain how to run the visualization experiments?
Looking forward to your reply. Thank you.
Hello, thanks for the amazing project!
I would like to ask how to configure the loss function in the training of a recognition model such as ViT. Specifically, I have seen that for tasks like object detection, the loss function is set as "loss_bbox." However, I am unsure of how to set the loss configuration for recognition tasks.
Thanks for helping!
Why is nothing printed when I add print statements to the files under https://github.com/taoyang1122/adapt-image-models/tree/main/mmaction?
Thank you for your excellent work! I have a question about clip_len and frame_interval for Kinetics. Appendix A.1 says: "We evaluate the model on 8, 16, 32 frames and the sampling interval is 16, 8, 4, respectively." Does this mean that for Kinetics-400/700 the data pipeline (train, val, test) should be the same? For example, configs/recognition/vit/vit_imagenet_k400.py matches the paper, i.e., clip_len=8, frame_interval=16 for the train/val/test pipelines.
adapt-image-models/configs/recognition/vit/vit_imagenet_k400.py, lines 19 to 21, 32 to 39, and 49 to 56 in 392647e
But for the CLIP-pretrained models, the configs are confusing.
vitclip_base_k400: clip_len=32, frame_interval=16 for the train pipeline, while clip_len=32, frame_interval=8 for the val/test pipelines. However, if clip_len=32, shouldn't frame_interval be 4?
adapt-image-models/configs/recognition/vit/vitclip_base_k400.py, lines 19 to 21, 32 to 39, and 49 to 56 in 392647e
vitclip_large_k400: clip_len=16, frame_interval=16 for the train/val/test pipelines. However, if clip_len=16, shouldn't frame_interval be 8?
adapt-image-models/configs/recognition/vit/vitclip_large_k400.py, lines 19 to 21, 32 to 39, and 49 to 56 in 392647e
Thank you.
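The arithmetic behind this question can be written out as follows. The sketch approximates the number of raw frames covered by one clip as clip_len * frame_interval (the helper raw_frame_span is a name made up for this example). The three settings from Appendix A.1 all cover the same span, while the CLIP configs listed above cover two to four times as much.

```python
def raw_frame_span(clip_len, frame_interval):
    """Approximate number of source frames covered by one sampled clip."""
    return clip_len * frame_interval


# (clip_len, frame_interval) pairs stated in Appendix A.1 of the paper.
paper_settings = [(8, 16), (16, 8), (32, 4)]

# (clip_len, frame_interval) pairs read from the configs discussed above.
config_settings = {
    'vit_imagenet_k400 train/val/test': (8, 16),
    'vitclip_base_k400 train': (32, 16),
    'vitclip_base_k400 val/test': (32, 8),
    'vitclip_large_k400 train/val/test': (16, 16),
}

paper_spans = [raw_frame_span(c, i) for c, i in paper_settings]
print("paper spans:", paper_spans)
for name, (c, i) in config_settings.items():
    print(f"{name}: span = {raw_frame_span(c, i)}")
```

How the sampler handles a span that is longer than the video itself is not covered by this sketch.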
As shown in the question, did you change any configuration when using a longer input length?
Where is the checkpoint of the base model loaded? I cannot find where you load the checkpoint of the ViT backbone.
Hi, thanks for the great work!
I wonder whether any training logs are available for the CLIP-pretrained models?
Your work is amazing. However, I am confused about how to run the training code.
I ran the command given in the file run_exp.sh:
bash tools/dist_train.sh configs/recognition/vit/vitclip_base_diving48.py 4 --test-last --validate --cfg-options model.backbone.pretrained=openaiclip work_dir=work_dirs_vit/diving48/debug
and got this error:
2023-11-17 19:41:19,976 - mmaction - INFO - Evaluating top_k_accuracy ...
Traceback (most recent call last):
File "tools/train.py", line 210, in <module>
main()
File "tools/train.py", line 198, in main
train_model(
File "/root/data/adapt-image-models/mmaction/apis/train.py", line 195, in train_model
runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_epoch')
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
self._do_evaluate(runner)
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 505, in _do_evaluate
key_score = self.evaluate(runner, results)
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 361, in evaluate
eval_res = self.dataloader.dataset.evaluate(
File "/root/data/adapt-image-models/mmaction/datasets/base.py", line 207, in evaluate
top_k_acc = top_k_accuracy(results, gt_labels, topk)
File "/root/data/adapt-image-models/mmaction/core/evaluation/accuracy.py", line 104, in top_k_accuracy
max_k_preds = np.argsort(scores, axis=1)[:, -k:][:, ::-1]
File "<__array_function__ internals>", line 180, in argsort
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 1120, in argsort
return _wrapfunc(a, 'argsort', axis=axis, kind=kind, order=order)
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 54, in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
File "/root/miniconda3/envs/AIM/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 43, in _wrapit
result = getattr(asarray(obj), method)(*args, **kwds)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
I would be grateful for your advice.
If all weights are initialized to the same value, they will also stay identical during training.
Is there any reason to initialize this to a constant 0 rather than something else, for example a Gaussian with a mean of 0 and a very small variance?
Thanks
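A small hand-computed sketch of why a zero-initialized D_fc2 need not stay symmetric. It assumes the adapter has the residual form x + D_fc2(act(D_fc1(x))), with only D_fc2 set to zero (as in the init_weights code quoted in another issue on this page) and D_fc1 keeping a non-constant init. The concrete numbers, the ReLU activation, and the squared-error loss are illustrative choices, not taken from the repository.

```python
def matvec(W, v):
    """Matrix-vector product for small nested lists."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


x = [1.0, -2.0]
target = [0.5, 0.5]
W1 = [[0.3, -0.1], [0.2, -0.4]]   # D_fc1: fixed stand-in for a random init
W2 = [[0.0, 0.0], [0.0, 0.0]]     # D_fc2: zero init

h = relu(matvec(W1, x))                             # hidden activations
y = [xi + oi for xi, oi in zip(x, matvec(W2, h))]   # residual adapter output
identity_at_init = (y == x)       # the adapter changes nothing at step 0

# For the loss 0.5 * ||y - target||^2, dL/dW2[j][i] = (y[j] - target[j]) * h[i].
err = [yj - tj for yj, tj in zip(y, target)]
lr = 0.1
W2 = [[W2[j][i] - lr * err[j] * h[i] for i in range(2)] for j in range(2)]

print(identity_at_init, W2)
```

The gradient of each D_fc2 entry is the product of an output error and a hidden activation, and those differ from entry to entry, so the weights separate after the first update. The zero init also makes the adapter an identity mapping at the start, so training begins from the behavior of the pretrained model.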
Hello!
I tried to reproduce your model training on Diving48, but failed. I used your Diving48 config file, vitclip_base_diving48.py, in two variants: (1) the original version and (2) clip_len = 8, frame_interval = 8,
with the command bash tools/dist_train.sh <PATH/TO/CONFIG> <NUM_GPU> --test-best --validate --cfg-options work_dir=<PATH/TO/OUTPUT>.
I wonder what the problem is. Please let me know. Thank you.
Environment info:
python 3.9.13, pytorch 1.10.0, cuda 11.3
Here is part of the log:
2023-03-08 21:00:08,793 - mmaction - INFO - Epoch [50][540/627] lr: 2.960e-07, eta: 0:01:07, time: 0.698, data_time: 0.000, memory: 19879, top1_acc: 0.3833, top5_acc: 0.8187, loss_cls: 1.8822, loss: 1.8822
2023-03-08 21:00:22,859 - mmaction - INFO - Epoch [50][560/627] lr: 2.960e-07, eta: 0:00:51, time: 0.703, data_time: 0.006, memory: 19879, top1_acc: 0.3604, top5_acc: 0.8083, loss_cls: 1.9930, loss: 1.9930
2023-03-08 21:00:36,925 - mmaction - INFO - Epoch [50][580/627] lr: 2.960e-07, eta: 0:00:36, time: 0.703, data_time: 0.006, memory: 19879, top1_acc: 0.3688, top5_acc: 0.8250, loss_cls: 1.9158, loss: 1.9158
2023-03-08 21:00:50,874 - mmaction - INFO - Epoch [50][600/627] lr: 2.960e-07, eta: 0:00:20, time: 0.697, data_time: 0.000, memory: 19879, top1_acc: 0.4146, top5_acc: 0.8438, loss_cls: 1.8412, loss: 1.8412
2023-03-08 21:01:04,933 - mmaction - INFO - Epoch [50][620/627] lr: 2.960e-07, eta: 0:00:05, time: 0.703, data_time: 0.006, memory: 19879, top1_acc: 0.3896, top5_acc: 0.8125, loss_cls: 1.9206, loss: 1.9206
2023-03-08 21:01:10,258 - mmaction - INFO - Saving checkpoint at 50 epochs
2023-03-08 21:03:12,731 - mmaction - INFO - Evaluating top_k_accuracy ...
2023-03-08 21:03:12,740 - mmaction - INFO -
top1_acc 0.1025
top5_acc 0.3548
2023-03-08 21:03:12,740 - mmaction - INFO - Evaluating mean_class_accuracy ...
2023-03-08 21:03:12,741 - mmaction - INFO -
mean_acc 0.0586
2023-03-08 21:03:12,799 - mmaction - INFO - The previous best checkpoint /data/aim/outputs/diving48/best_top1_acc_epoch_45.pth was removed
2023-03-08 21:03:14,531 - mmaction - INFO - Now best checkpoint is saved as best_top1_acc_epoch_50.pth.
2023-03-08 21:03:14,531 - mmaction - INFO - Best top1_acc is 0.1025 at 50 epoch.
2023-03-08 21:03:14,532 - mmaction - INFO - Epoch(val) [50][985] top1_acc: 0.1025, top5_acc: 0.3548, mean_class_accuracy: 0.0586
2023-03-08 21:03:15,535 - mmaction - INFO - Warning: test_best set as True, but is not applicable (eval_hook.best_ckpt_path is None)
I'd like to learn how the visualizations in your paper were produced. Can you share the code?
First of all, thanks for your great work!
My question is shown in the title, and the specific error message is as follows:
[> ] 777/19881, 1.0 task/s, elapsed: 769s, ETA: 18903s
Traceback (most recent call last):
File "tools/test.py", line 364, in <module>
main()
File "tools/test.py", line 349, in main
outputs = inference_pytorch(args, cfg, distributed, data_loader)
File "tools/test.py", line 167, in inference_pytorch
args.gpu_collect)
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/mmcv/engine/test.py", line 70, in multi_gpu_test
for i, data in enumerate(data_loader):
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in next
data = self._next_data()
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
raise exception
decord._ffi.base.DECORDError: Caught DECORDError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/lzh/2022/tjq/adapt-image-models/mmaction/datasets/base.py", line 285, in getitem
return self.prepare_test_frames(idx)
File "/home/lzh/2022/tjq/adapt-image-models/mmaction/datasets/base.py", line 276, in prepare_test_frames
return self.pipeline(results)
File "/home/lzh/2022/tjq/adapt-image-models/mmaction/datasets/pipelines/compose.py", line 41, in call
data = t(data)
File "/home/lzh/2022/tjq/adapt-image-models/mmaction/datasets/pipelines/loading.py", line 965, in call
container = decord.VideoReader(file_obj, num_threads=self.num_threads)
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/video_reader.py", line 42, in init
ba, ctx.device_type, ctx.device_id, width, height, num_threads, 2)
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/_ffi/_ctypes/function.py", line 175, in call
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "/home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/_ffi/base.py", line 63, in check_call
raise DECORDError(py_str(_LIB.DECORDGetLastError()))
decord._ffi.base.DECORDError: [15:44:17] /io/decord/src/video/video_reader.cc:125: Check failed: st_nb >= 0 (-1381258232 vs. 0) ERROR cannot find video stream with wanted index: -1
Stack trace returned 10 entries:
[bt] (0) /home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/libdecord.so(dmlc::StackTrace(unsigned long)+0x50) [0x7f4606a29990]
[bt] (1) /home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/libdecord.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1d) [0x7f4606a2aa7d]
[bt] (2) /home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/libdecord.so(decord::VideoReader::SetVideoStream(int)+0xee) [0x7f4606a7a6ae]
[bt] (3) /home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/libdecord.so(decord::VideoReader::VideoReader(std::string, DLContext, int, int, int, int)+0x3cd) [0x7f4606a7b28d]
[bt] (4) /home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/libdecord.so(+0x6a039) [0x7f4606a6a039]
[bt] (5) /home/lzh/anaconda3/envs/AIM/lib/python3.7/site-packages/decord/libdecord.so(DECORDFuncCall+0x52) [0x7f4606a26572]
[bt] (6) /home/lzh/anaconda3/envs/AIM/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f465f9679dd]
[bt] (7) /home/lzh/anaconda3/envs/AIM/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f465f967067]
[bt] (8) /home/lzh/anaconda3/envs/AIM/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2e7) [0x7f465c9ec437]
[bt] (9) /home/lzh/anaconda3/envs/AIM/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12ea4) [0x7f465c9ecea4]
I downloaded the K400 dataset used in my project from https://github.com/cvdfoundation/kinetics-dataset, and I have verified that none of my videos are damaged. Is there an error in my configuration file?
My configuration file is as follows:
model = dict(
    backbone=dict(drop_path_rate=0.2, adapter_scale=0.5, num_frames=8),
    cls_head=dict(num_classes=400),
    test_cfg=dict(max_testing_views=4))
dataset_type = 'VideoDataset'
#data_root = 'data/kinetics400/train_256'
#data_root_val = 'data/kinetics400/val_256'
#ann_file_train = 'data/kinetics400/train_video_list.txt'
#ann_file_val = 'data/kinetics400/val_video_list.txt'
#ann_file_test = 'data/kinetics400/val_video_list.txt'
data_root = '/data/K400/k400/train'
data_root_val = '/data/K400/k400/'
ann_file_train = '/data/K400/kinetics400/kinetics400_train_list.txt'
ann_file_val = '/data/K400/kinetics400/kinetics400_val_list.txt'
ann_file_test = '/data/K400/kinetics400/kinetics400_test_list.txt'
img_norm_cfg = dict(
    mean=[122.769, 116.74, 104.04], std=[68.493, 66.63, 70.321], to_bgr=False)
train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=8, frame_interval=16, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=16,
        num_clips=1,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=16,
        num_clips=3,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=8,
    workers_per_gpu=2,
    val_dataloader=dict(
        videos_per_gpu=1,
        workers_per_gpu=1
    ),
    test_dataloader=dict(
        videos_per_gpu=1,
        workers_per_gpu=1
    ),
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=val_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=data_root_val,
        pipeline=test_pipeline))
evaluation = dict(
    interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'])
I suspected a problem with the decord version, but after switching to decord=0.6.0/0.40/0.4.1 the same error was reported.
So I have no choice but to bother you.
Looking forward to your reply, thank you very much!
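A first-pass check that may help narrow this down: the sketch below walks an annotation list and reports entries whose file is missing or empty under the data prefix. It assumes each annotation line has the form "relative/path.mp4 label"; the helper name find_unreadable is made up for this example. It does not decode the videos, so a file that exists but has no valid video stream would still need a separate check.

```python
import os


def find_unreadable(ann_file, data_prefix):
    """List annotation entries whose video file is missing or empty.

    Assumes each non-empty line looks like '<relative/path.mp4> <label>'.
    """
    problems = []
    with open(ann_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split off the trailing label; the rest is the relative path.
            rel_path = line.rsplit(' ', 1)[0]
            full_path = os.path.join(data_prefix, rel_path)
            if not os.path.isfile(full_path):
                problems.append((full_path, 'missing'))
            elif os.path.getsize(full_path) == 0:
                problems.append((full_path, 'empty'))
    return problems
```

With the config above it would be called as find_unreadable(ann_file_test, data_root_val). Since the test split there pairs kinetics400_test_list.txt with the data_root_val prefix, it may be worth confirming that the paths in that list actually resolve under that directory.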
Hi! Great work. I want to fine-tune the model on the actions in my own dataset. How can I do this?
Hello, I want to ask about the input layout before multi-head attention in vit_clip.py versus vit_imagenet.py.
In vit_clip.py, the input before T-MSA has layout t (b n) d, but in vit_imagenet.py the input before T-MSA has layout (b n) t d.
The paper describes it as (N+1) x T x D.
So which one is correct?
Thanks a lot.
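Both layouts can describe the same temporal attention if the two attention modules expect different axis orders. The sketch below assumes that vit_clip.py uses an attention module taking sequence-first input (as torch.nn.MultiheadAttention does by default) while vit_imagenet.py uses a batch-first one; this is an assumption about the two files, not something confirmed here. Tokens are represented by their (b, t, n) index only, with the feature dimension d omitted, and the sizes are hypothetical.

```python
B, T, N = 2, 3, 2  # hypothetical tiny sizes: batch, frames, patches


def batch_first_layout():
    """'(b n) t': each row is the T-frame sequence of one (b, n) patch."""
    return [[(b, t, n) for t in range(T)]
            for b in range(B) for n in range(N)]


def sequence_first_layout():
    """'t (b n)': each column is the T-frame sequence of one (b, n) patch."""
    return [[(b, t, n) for b in range(B) for n in range(N)]
            for t in range(T)]


# Sequences attended over: rows when batch-first, columns when sequence-first.
rows = batch_first_layout()
cols = [list(col) for col in zip(*sequence_first_layout())]

print(rows == cols)
```

In both cases attention mixes the T frames of one spatial position, which is consistent with treating the patch axis as the batch in the paper's (N+1) x T x D description.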
Thank you for your excellent work.
After using the Adapter, I only passed the Adapter parameters to the optimizer. However, the training memory did not go down. I verified that the rest of the Transformer parameters were set to requires_grad = False. The code is as follows:
import torch

# `model` is the adapter-augmented network built elsewhere.
for name, param in model.named_parameters():
    if "Adapter" in name:
        param.requires_grad = True
    else:
        param.requires_grad = False
optimizer = torch.optim.AdamW(
    params=filter(lambda p: p.requires_grad, model.parameters()),
    lr=3e-4,
    weight_decay=0.05,
)
Looking forward to your reply.
When I train on the Diving48 dataset with CLIP base, I set videos_per_gpu=4 because of my machine's memory limit (V100 32 GB). I use 14 GPUs, so the batch size is 56, which is close to 64. But in the end I got top1_accuracy=85.3. Here are my training logs; could you give me some advice? Maybe I should change some hyperparameters?
sys.platform: linux
Python: 3.7.13 (default, Oct 18 2022, 18:57:03) [GCC 11.2.0]
CUDA available: True
GPU 0,1: Tesla V100-PCIE-32GB
CUDA_HOME: /share/soft/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (GCC) 7.5.0
PyTorch: 1.10.0+cu102
PyTorch compiling details: PyTorch built with:
2023-05-31 00:40:47,429 - mmaction - INFO - Distributed training: True
2023-05-31 00:40:47,986 - mmaction - INFO - Config: model = dict(
type='Recognizer3D',
backbone=dict(
type='ViT_CLIP',
input_resolution=224,
patch_size=16,
num_frames=32,
width=768,
layers=12,
heads=12,
drop_path_rate=0.2,
adapter_scale=0.5,
num_tadapter=1,
pretrained='openaiclip'),
cls_head=dict(
type='I3DHead',
in_channels=768,
num_classes=48,
spatial_type='avg',
dropout_ratio=0.5),
test_cfg=dict(average_clips='prob', max_testing_views=4))
checkpoint_config = dict(interval=10)
log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
dataset_type = 'VideoDataset'
data_root = '/share/group/datanlpr_ai/wtyuan/diving48/videos'
data_root_val = '/share/group/datanlpr_ai/wtyuan/diving48//videos'
ann_file_train = '/share/group/datanlpr_ai/wtyuan/diving48/diving48_train_list_videos.txt'
ann_file_val = '/share/group/datanlpr_ai/wtyuan/diving48/diving48_val_list_videos.txt'
ann_file_test = '/share/group/datanlpr_ai/wtyuan/diving48/diving48_val_list_videos.txt'
img_norm_cfg = dict(
mean=[122.769, 116.74, 104.04], std=[68.493, 66.63, 70.321], to_bgr=False)
train_pipeline = [
dict(type='DecordInit'),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=16,
num_clips=1,
frame_uniform=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='RandomResizedCrop'),
dict(type='Resize', scale=(224, 224), keep_ratio=False),
dict(type='Flip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[122.769, 116.74, 104.04],
std=[68.493, 66.63, 70.321],
to_bgr=False),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
dict(type='DecordInit'),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=16,
num_clips=1,
frame_uniform=True,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=224),
dict(type='Flip', flip_ratio=0),
dict(
type='Normalize',
mean=[122.769, 116.74, 104.04],
std=[68.493, 66.63, 70.321],
to_bgr=False),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
dict(type='DecordInit'),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=16,
num_clips=1,
frame_uniform=True,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 224)),
dict(type='ThreeCrop', crop_size=224),
dict(type='Flip', flip_ratio=0),
dict(
type='Normalize',
mean=[122.769, 116.74, 104.04],
std=[68.493, 66.63, 70.321],
to_bgr=False),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs'])
]
data = dict(
videos_per_gpu=4,
workers_per_gpu=2,
val_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1),
test_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1),
train=dict(
type='VideoDataset',
ann_file=
'/share/group/datanlpr_ai/wtyuan/diving48/diving48_train_list_videos.txt',
data_prefix='/share/group/datanlpr_ai/wtyuan/diving48/videos',
pipeline=[
dict(type='DecordInit'),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=16,
num_clips=1,
frame_uniform=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='RandomResizedCrop'),
dict(type='Resize', scale=(224, 224), keep_ratio=False),
dict(type='Flip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[122.769, 116.74, 104.04],
std=[68.493, 66.63, 70.321],
to_bgr=False),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs', 'label'])
]),
val=dict(
type='VideoDataset',
ann_file=
'/share/group/datanlpr_ai/wtyuan/diving48/diving48_val_list_videos.txt',
data_prefix='/share/group/datanlpr_ai/wtyuan/diving48//videos',
pipeline=[
dict(type='DecordInit'),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=16,
num_clips=1,
frame_uniform=True,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=224),
dict(type='Flip', flip_ratio=0),
dict(
type='Normalize',
mean=[122.769, 116.74, 104.04],
std=[68.493, 66.63, 70.321],
to_bgr=False),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs'])
]),
test=dict(
type='VideoDataset',
ann_file=
'/share/group/datanlpr_ai/wtyuan/diving48/diving48_val_list_videos.txt',
data_prefix='/share/group/datanlpr_ai/wtyuan/diving48//videos',
pipeline=[
dict(type='DecordInit'),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=16,
num_clips=1,
frame_uniform=True,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 224)),
dict(type='ThreeCrop', crop_size=224),
dict(type='Flip', flip_ratio=0),
dict(
type='Normalize',
mean=[122.769, 116.74, 104.04],
std=[68.493, 66.63, 70.321],
to_bgr=False),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs'])
]))
evaluation = dict(
interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'])
optimizer = dict(
type='AdamW',
lr=0.0003,
betas=(0.9, 0.999),
weight_decay=0.05,
paramwise_cfg=dict(
custom_keys=dict(
class_embedding=dict(decay_mult=0.0),
positional_embedding=dict(decay_mult=0.0),
ln_1=dict(decay_mult=0.0),
ln_2=dict(decay_mult=0.0),
ln_pre=dict(decay_mult=0.0),
ln_post=dict(decay_mult=0.0))))
lr_config = dict(
policy='CosineAnnealing',
min_lr=0,
warmup='linear',
warmup_by_epoch=True,
warmup_iters=2.5)
total_epochs = 50
work_dir = 'work_dirs_vit/diving48/paper_sota_clstoken_fixmodelconfig'
find_unused_parameters = True
fp16 = None
optimizer_config = dict(
type='DistOptimizerHook',
update_interval=1,
grad_clip=None,
coalesce=True,
bucket_size_mb=-1,
use_fp16=True)
gpu_ids = range(0, 14)
omnisource = False
module_hooks = []
2023-05-31 00:40:58,009 - mmaction - INFO - workflow: [('train', 1)], max: 50 epochs
2023-05-31 00:40:58,009 - mmaction - INFO - Checkpoints will be saved to /share/home/ai_zhuxinxin/wtyuan/git/aim3/aim3/work_dirs_vit/diving48/paper_sota_clstoken_fixmodelconfig by HardDiskBackend.
2023-05-31 00:41:34,264 - mmaction - INFO - Epoch [1][20/269] lr: 3.763e-05, eta: 6:45:30, time: 1.812, data_time: 0.949, memory: 20587, top1_acc: 0.0455, top5_acc: 0.1830, loss_cls: 3.7788, loss: 3.7788
''''''''''''''''''''''''''''''''''''''''''''''
2023-05-31 04:07:05,175 - mmaction - INFO - Epoch [50][260/269] lr: 2.960e-07, eta: 0:00:07, time: 0.761, data_time: 0.001, memory: 20590, top1_acc: 0.9313, top5_acc: 0.9839, loss_cls: 0.2157, loss: 0.2157
2023-05-31 04:07:11,845 - mmaction - INFO - Saving checkpoint at 50 epochs
2023-05-31 04:08:10,086 - mmaction - INFO - Evaluating top_k_accuracy ...
2023-05-31 04:08:10,093 - mmaction - INFO -
top1_acc 0.8594
top5_acc 0.9888
2023-05-31 04:08:10,093 - mmaction - INFO - Evaluating mean_class_accuracy ...
2023-05-31 04:08:10,094 - mmaction - INFO -
mean_acc 0.7878
2023-05-31 04:08:10,095 - mmaction - INFO - Epoch(val) [50][141] top1_acc: 0.8594, top5_acc: 0.9888, mean_class_accuracy: 0.7878
2023-05-31 04:09:08,388 - mmaction - INFO - Testing results of the last checkpoint
2023-05-31 04:09:08,388 - mmaction - INFO - top1_acc: 0.8533
2023-05-31 04:09:08,388 - mmaction - INFO - top5_acc: 0.9873
2023-05-31 04:09:08,388 - mmaction - INFO - mean_class_accuracy: 0.7874
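As a sanity check on the schedule, the two learning rates printed in this log can be recomputed from the lr_config values alone. The sketch below is a plain-Python reimplementation of epoch-wise cosine annealing with linear warmup, not the library's own code. It assumes a warmup ratio of 0.1 (the config does not set one) and assumes that the line logged as iteration 20 reflects the 0-based iteration 19.

```python
import math

BASE_LR = 3e-4          # optimizer lr from the logged config
MIN_LR = 0.0            # lr_config min_lr
TOTAL_EPOCHS = 50       # total_epochs
ITERS_PER_EPOCH = 269   # from "Epoch [1][20/269]" in the log
WARMUP_EPOCHS = 2.5     # warmup_iters with warmup_by_epoch=True
WARMUP_RATIO = 0.1      # assumption: not present in the config


def cosine_lr(epoch):
    """Epoch-wise cosine annealing from BASE_LR to MIN_LR (epoch is 0-based)."""
    cos_out = math.cos(math.pi * epoch / TOTAL_EPOCHS) + 1
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * cos_out


def lr_at(epoch, iter_in_epoch):
    """Learning rate at a 0-based epoch and 0-based iteration within it."""
    regular = cosine_lr(epoch)
    cur_iter = epoch * ITERS_PER_EPOCH + iter_in_epoch
    warmup_iters = WARMUP_EPOCHS * ITERS_PER_EPOCH
    if cur_iter >= warmup_iters:
        return regular
    # Linear warmup: scale the regular lr up from WARMUP_RATIO to 1.
    k = (1 - cur_iter / warmup_iters) * (1 - WARMUP_RATIO)
    return regular * (1 - k)


print(f"{lr_at(0, 19):.3e}")    # compare with "Epoch [1][20/269] lr: 3.763e-05"
print(f"{lr_at(49, 259):.3e}")  # compare with "Epoch [50][260/269] lr: 2.960e-07"
```

If both printed values agree with the log, the run followed the schedule described by lr_config, so the schedule itself is unlikely to explain any accuracy gap.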
Which file should I change to modify the adapters in the model? I would like to see how you insert the adapters.
Where can I find the code for T-MSA and S-MSA? Also, where does the reshape of the patch embeddings take place? Thanks.
How many videos are in the train set and the val set of the Diving48 dataset used in the paper?
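The reproduction log quoted earlier on this page gives a rough bound for that particular run, which is not necessarily the split used in the paper. The sketch below assumes that gpu_ids = range(0, 14) means 14 processes, that the distributed sampler pads the list so every process gets an equal shard, and that the last partial batch is kept. The helper names are hypothetical.

```python
import math


def iters_per_epoch(num_videos, num_gpus, videos_per_gpu):
    """Batches per epoch seen by each process under the stated assumptions."""
    per_gpu = math.ceil(num_videos / num_gpus)   # padded, equal-sized shards
    return math.ceil(per_gpu / videos_per_gpu)   # last partial batch kept


def dataset_size_bounds(iters, num_gpus, videos_per_gpu):
    """Smallest and largest dataset size that would produce `iters` batches."""
    min_shard = (iters - 1) * videos_per_gpu + 1
    max_shard = iters * videos_per_gpu
    low = (min_shard - 1) * num_gpus + 1
    high = max_shard * num_gpus
    return low, high


# Training: "Epoch [1][20/269]" with videos_per_gpu=4 on 14 GPUs.
print(dataset_size_bounds(269, 14, 4))
# Validation: "Epoch(val) [50][141]" with videos_per_gpu=1, if 141 is also
# a per-process batch count (an assumption about how that line is logged).
print(dataset_size_bounds(141, 14, 1))
```

Under these assumptions the run's train list holds between 15009 and 15064 videos and its val list between 1961 and 1974. The exact counts are the line counts of the two annotation files.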
Hi, thanks for the cool work.
I am using Video-Swin-Transformer (Swin-S, Swin-B, and Swin-L are all too big for my device).
I want to switch to your work. Do you have pretrained model weights like ViT-S/16 or ViT-T/16?
I am looking forward to hearing from you.
Have the authors tried unfreezing the LayerNorm layers and fine-tuning their parameters? If so, what were the results?
Regarding the reported 88.9% accuracy of ViT-B on the Diving48 dataset in the paper, I would like to know the testing protocol on which this result is based.
val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=16,
        num_clips=1,
        frame_uniform=True,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
With the validation settings above, testing vit_b_clip_32frame_diving48.pth gives 88.88%.
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=16,
        num_clips=1,
        frame_uniform=True,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
With the test configuration above, the result is lower than 88.9%. Moreover, the ThreeCrop operation does not match the 32×1×1 views stated in the paper. I would therefore like to understand the testing protocol behind the 88.9% result reported in the paper.
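To make the mismatch concrete, the view count implied by each pipeline can be read off the config. The sketch below uses a hypothetical helper, count_views, and assumes the views notation means frames × temporal clips × spatial crops, with CenterCrop, ThreeCrop, and TenCrop producing 1, 3, and 10 crops respectively. Only the pipeline steps that affect the count are reproduced.

```python
# Spatial crops produced by each crop transform (assumed mapping).
CROPS_PER_OP = {"CenterCrop": 1, "ThreeCrop": 3, "TenCrop": 10}


def count_views(pipeline):
    """Return (frames, temporal clips, spatial crops) for a pipeline list."""
    frames = clips = None
    crops = 1
    for step in pipeline:
        if step["type"] == "SampleFrames":
            frames = step["clip_len"]
            clips = step.get("num_clips", 1)
        elif step["type"] in CROPS_PER_OP:
            crops = CROPS_PER_OP[step["type"]]
    if frames is None:
        raise ValueError("pipeline has no SampleFrames step")
    return frames, clips, crops


# Only the steps relevant to the view count, copied from the configs above.
val_pipeline = [
    dict(type="SampleFrames", clip_len=32, frame_interval=16, num_clips=1,
         frame_uniform=True, test_mode=True),
    dict(type="CenterCrop", crop_size=224),
]
test_pipeline = [
    dict(type="SampleFrames", clip_len=32, frame_interval=16, num_clips=1,
         frame_uniform=True, test_mode=True),
    dict(type="ThreeCrop", crop_size=224),
]

print(count_views(val_pipeline))   # (32, 1, 1)
print(count_views(test_pipeline))  # (32, 1, 3)
```

Under this reading, the val_pipeline matches 32×1×1 and the test_pipeline does not.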
Thanks for your great work. I have two questions:
With the same kinetics400 validation set (19796 videos) as that of mmaction, the same settings as your configs/recognition/vit/vitclip_base_k400.py (32 x 3 x 1 views during testing), and the checkpoint vit_b_clip_32frame_k400.pth you provided, my evaluation results on the kinetics400 validation set are 83.34 (acc@1) and 96.45 (acc@5), which are lower than the results given in README.md, i.e., 84.7 (acc@1) and 96.7 (acc@5). Is there any possible reason for the gap (e.g., do you have a smaller kinetics400 validation set due to expired links)?
The checkpoint vit_b_clip_32frame_diving48.pth you provided is tested on 32 x 1 x 1 views, according to README.md, but the views in configs/recognition/vit/vitclip_base_diving48.py are 32 x 1 x 3. My evaluation results are 88.43 (acc@1, 32 x 1 x 3) and 88.32 (acc@1, 32 x 1 x 1), which are lower than the result given in README.md, i.e., 88.9 (acc@1, 32 x 1 x 1). Is there any possible reason for the gap?
I am also confused about the following mismatch: the checkpoint vit_b_clip_32frame_k700.pth you provided is tested on 32 x 3 x 3 views, according to README.md, but the views in configs/recognition/vit/vitclip_base_k700.py are 8 x 3 x 3.