Comments (9)
If you want to use 6 GPUs for pretraining or fine-tuning, you can run the following command:
python -m torch.distributed.launch --nproc_per_node=6 --master_port=12320 \
    run_mae_pretraining.py \
    --data_path ${DATA_PATH} --mask_type tube --mask_ratio 0.9 \
    --model pretrain_videomae_base_patch16_224 --decoder_depth 4 \
    --batch_size 32 --num_frames 16 --sampling_rate 4 \
    --opt adamw --opt_betas 0.9 0.95 --warmup_epochs 40 \
    --save_ckpt_freq 20 --epochs 801 \
    --log_dir ${OUTPUT_DIR} --output_dir ${OUTPUT_DIR}
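As a side note, torch.distributed.launch is deprecated in recent PyTorch releases in favor of torchrun; assuming the same script and arguments, an equivalent invocation would be:

```shell
# torchrun replaces torch.distributed.launch in recent PyTorch releases;
# all script arguments stay the same as in the command above.
torchrun --nproc_per_node=6 --master_port=12320 \
    run_mae_pretraining.py \
    --data_path ${DATA_PATH} --mask_type tube --mask_ratio 0.9 \
    --model pretrain_videomae_base_patch16_224 --decoder_depth 4 \
    --batch_size 32 --num_frames 16 --sampling_rate 4 \
    --opt adamw --opt_betas 0.9 0.95 --warmup_epochs 40 \
    --save_ckpt_freq 20 --epochs 801 \
    --log_dir ${OUTPUT_DIR} --output_dir ${OUTPUT_DIR}
```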
from videomae.
Hi, if I want to set the batch size to 8, does the LR need to be changed?
from videomae.
I've had issues training with fewer GPUs (6 V100s): pretraining on SSV2, we get about 40% higher loss than the pretrained model reports (from the log files) at 800 epochs. The LR is scaled by the number of training steps per epoch, which depends on the number of tasks (GPUs).
Does anyone have any suggestions?
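For reference, MAE-style recipes scale the base LR linearly with the effective batch size (batch per GPU × number of GPUs × accumulation steps). A minimal sketch of that rule, assuming the usual reference batch of 256 (check this repo's engine code for the exact value it uses):

```python
def effective_lr(base_lr, batch_per_gpu, num_gpus, accum_iter=1, base_batch=256):
    """Linear LR scaling rule: lr = base_lr * total_batch / base_batch.

    base_batch=256 is an assumption following MAE-style recipes, not
    necessarily the exact constant this repo uses.
    """
    total_batch = batch_per_gpu * num_gpus * accum_iter
    return base_lr * total_batch / base_batch

# Reference setup: 8 GPUs x batch 32 = 256, so the LR is unchanged.
print(effective_lr(1.5e-4, 32, 8))   # -> 0.00015
# 6 GPUs x batch 32 = 192: smaller effective batch, LR scaled down.
print(effective_lr(1.5e-4, 32, 6))
```

Under this rule, dropping from 8 to 6 GPUs (or shrinking the per-GPU batch) silently lowers the effective LR unless you compensate with accum_iter.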
from videomae.
In my case, I added an accum_iter argument during training, which scales up the effective batch size via gradient accumulation. The accum_iter implementation can be found in the MAE repo.
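The accumulation pattern from the MAE repo looks roughly like this (a minimal sketch with a toy model and data, not the actual VideoMAE training loop); note that gradients are cleared only after the optimizer step:

```python
import torch

# Toy stand-ins for the real model and dataloader; accum_iter multiplies
# the effective batch size without increasing per-step memory.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_iter = 4
data = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]

opt.zero_grad()
for step, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Divide so the accumulated gradients average over the window.
    (loss / accum_iter).backward()
    if (step + 1) % accum_iter == 0:
        opt.step()
        opt.zero_grad()  # clear grads only AFTER stepping, never mid-window
```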
from videomae.
@Zi-hao-Wei Would you mind sharing your modifications/code?
from videomae.
You can find it here: https://github.com/Zi-hao-Wei/VideoMAE_acc_iter. Maybe I can open a pull request to the original repo?
from videomae.
@Zi-hao-Wei When I tried setting accum_iter to 8 on 8 A40s, the loss did not improve (green: accum_iter = 8 vs. grey: accum_iter = 1).
What's a bit weird is that it works "ok" with accum_iter set to 6 (on 8 GPUs): the model is able to learn, but the resulting loss is worse than training on a single GPU.
Were you able to get good results with this? I wonder whether the learning rates are equivalent to the original implementation on more GPUs; that's the only thing I can think of.
My learning rate appears to start at 0:
{"train_lr": 0.0, "train_min_lr": 0.0, "train_loss": 1.4127931836105527, "train_loss_scale": 65536.0, "train_weight_decay": 0.05, "train_grad_norm": 0.8564211130142212, "epoch": 0, "n_parameters": 94210944}
{"train_lr": 2.958407871198569e-05, "train_min_lr": 2.958407871198569e-05, "train_loss": 1.4128651916980743, "train_loss_scale": 65536.0, "train_weight_decay": 0.05, "train_grad_norm": 0.8564602732658386, "epoch": 1, "n_parameters": 94210944}
{"train_lr": 5.916815742397138e-05, "train_min_lr": 5.916815742397138e-05, "train_loss": 1.3674379118851252, "train_loss_scale": 65536.0, "train_weight_decay": 0.05, "train_grad_norm": 0.761830747127533, "epoch": 2, "n_parameters": 94210944}
{"train_lr": 8.875223613595705e-05, "train_min_lr": 8.875223613595705e-05, "train_loss": 1.3142942587534587, "train_loss_scale": 65536.0, "train_weight_decay": 0.05, "train_grad_norm": 0.6155216693878174, "epoch": 3, "n_parameters": 94210944}
and it does not match the log in MODEL_ZOO.md:
https://drive.google.com/file/d/1kP3_-465jCL7PRNFq1JcAghPo2BONRWY/view
Here's the difference:
800 epoch pretrained checkpoint from MODEL_ZOO.md:
Reproduced using the available training script, on 8 × A40s with accum_iter = 6 (accum_iter = 8 does not train):
Also, I think you are clearing the gradients too early in your implementation. My modifications are in this branch: https://github.com/mjlbach/VideoMAE/tree/dataset_location. However, your implementation matches the original placement in this codebase (I'm confused why it was moved relative to MAE in the first place).
from videomae.
Thank you for sharing. I will try to follow your implementation.
from videomae.
Too small a batch size may degrade model performance; maybe you can use torch.utils.checkpoint.checkpoint to save memory instead.
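Activation checkpointing trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them, which lets you keep a larger per-GPU batch. A minimal sketch (the toy Block below is hypothetical, not VideoMAE's actual transformer block; use_reentrant=False assumes a reasonably recent PyTorch):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A toy block standing in for a transformer layer; with checkpointing,
# its intermediate activations are recomputed in backward, not stored.
class Block(torch.nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU())

    def forward(self, x):
        return self.net(x)

blocks = torch.nn.ModuleList(Block() for _ in range(4))
inp = torch.randn(2, 32, requires_grad=True)
out = inp
for blk in blocks:
    # Wrap each block call; gradients still flow to the input.
    out = checkpoint(blk, out, use_reentrant=False)
out.sum().backward()
```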
from videomae.
Related Issues (20)
- What's the finetuning differences between ViT-B 80%acc and 81%acc?
- Can I fine-tune it on a video dataset of 32 frames?
- The batch size and training epochs do not match the paper
- About the encoder layer output
- Issue Encountered When Loading the Model: "pretrain_videomae_base_patch16_224" HOT 5
- learnable position embedding HOT 1
- the acc of small batch datasets is too low HOT 4
- Can VideoMAE be used to learn the motion characteristics and appearance characteristics of objects in videos?
- MoCoV3 Training Configuration
- ViT-S and ViT-H models on huggingface
- VideoMAE ViT-H pre-train does not contain the decoder weights HOT 2
- dist_init_required
- About pre-trained models
- Fail to finetune from the provided pretrained model checkpoint on UCF101 HOT 5
- How many videos are in your validation set? HOT 3
- could you please provide me the weight of VideoMAE pre-trained on Kinetics-400,I want to use the the weight to extract the features of the thumos14 HOT 4
- The dataset files in the link are not available HOT 2
- BUG: Incorrect temporal indexing?
- UCF101 HOT 4
- Questions about performence on ssv2