Comments (14)

dandelin commented on August 28, 2024

Unfortunately, we only experimented with batch_size=4096, so we have no empirical results for other settings.
That said, I believe performance will be preserved at smaller batch sizes such as 2048 or 1024.

For low-resource regimes, the published code provides a "gradient accumulation" option.
It automatically computes the number of steps over which to accumulate gradients from the given per_gpu_batchsize and the number of GPUs. (see https://github.com/dandelin/ViLT/blob/master/run.py#L42-L44)
In theory, gradient accumulation produces the same result as the non-accumulated version. (However, we did not use gradient accumulation in our experiments, so this is not guaranteed.)
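
For reference, the computation is roughly the following (a paraphrased sketch of those lines in run.py, not the exact source; the variable names mirror the config keys):

```python
# Paraphrased sketch of the accumulation-step computation in run.py
# (the exact expression in the repository may differ slightly).
batch_size = 4096          # desired effective batch size
per_gpu_batchsize = 32     # micro-batch that fits on one GPU
num_gpus = 8
num_nodes = 1

# Accumulate enough micro-batches so that
# per_gpu_batchsize * num_gpus * num_nodes * grad_steps == batch_size.
grad_steps = max(batch_size // (per_gpu_batchsize * num_gpus * num_nodes), 1)
print(grad_steps)  # -> 16
```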

dandelin commented on August 28, 2024

@Jxu-Thu Thank you.
I'll investigate this issue soon.

Jxu-Thu commented on August 28, 2024

If I use fewer resources, e.g. num_gpus=8 and num_nodes=1 (batch size 4096 with accum_steps=8), should I modify any other configuration, such as max_steps?

dandelin commented on August 28, 2024

@Jxu-Thu
As far as I know, PyTorch Lightning increments the LightningModule's internal step only once accumulation is complete.
(https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L813)
So you should not need to change any other configuration to use the gradient accumulation feature.
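
A minimal sketch of how this interacts with the Trainer options (argument names such as gpus follow the PyTorch Lightning version ViLT was written against and may differ in newer releases; this is not the exact call in run.py):

```python
import pytorch_lightning as pl

# The accumulated optimizer step, not the micro-batch, is what advances
# trainer.global_step, so max_steps still counts effective 4096-sample updates.
trainer = pl.Trainer(
    gpus=8,
    num_nodes=1,
    precision=16,
    max_steps=100_000,           # optimizer steps, i.e. accumulated updates
    accumulate_grad_batches=16,  # 16 micro-batches of 32 x 8 = 256 -> 4096
)
```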

Jxu-Thu commented on August 28, 2024

Many thanks for your kind reply!
I am trying to reproduce the results with 24 V100 GPUs, accumulation steps of 3, and a batch size over 4k, without modifying any other configuration.

dandelin commented on August 28, 2024

@Jxu-Thu
Also, please pull the latest commit (#12 (comment))

Jxu-Thu commented on August 28, 2024

Thanks for the reminder.

Jxu-Thu commented on August 28, 2024

I am seeing very slow training because of the huge number of iterations in each epoch, so I tried to inspect why there are so many iterations even with a small batch size.
With vg+mscoco+gcc+sbu (about 9M samples) and batch size 32, I get:
Epoch 0: 0%| | 0/2392933 [00:00<?, ?it/s]
With vg+mscoco (about 5M samples) and batch size 32, I get:
Epoch 0: 0%| | 0/169000 [00:00<?, ?it/s]

Why does adding gcc+sbu (only about 4M samples) increase the iterations from roughly 169k to 2.39M?
For vg+mscoco, 32 x 169k ≈ 5.4M samples, which matches the dataset size.
However, for vg+mscoco+gcc+sbu, 32 x 2.39M ≈ 76M, and I cannot understand where so many iterations come from.
I have carefully checked the code but found no clues. Could you help me?
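
For reference, my back-of-the-envelope expectation (the sample counts are approximate):

```python
import math

# Approximate sample counts per dataset combination.
datasets = {
    "coco+vg": 5_000_000,
    "coco+vg+gcc+sbu": 9_000_000,
}
per_gpu_batchsize = 32

for name, n_samples in datasets.items():
    # Expected single-GPU iterations per epoch.
    print(name, math.ceil(n_samples / per_gpu_batchsize))
# coco+vg         -> ~156k iterations (close to the ~169k shown above)
# coco+vg+gcc+sbu -> ~281k iterations (nowhere near the ~2.39M shown above)
```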

dandelin commented on August 28, 2024

@Jxu-Thu could you share the config for each run using sacred's print_config command? (https://sacred.readthedocs.io/en/stable/command_line.html#print-config)

Jxu-Thu commented on August 28, 2024

vg+mscoco+gcc+sbu

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg', 'sbu', 'gcc']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
  irtr = 0
  itm = 1
  mlm = 1
  mpp = 0
  nlvr2 = 0
  vqa = 0
INFO - ViLT - Completed after 0:00:00


coco+vg

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
  irtr = 0
  itm = 1
  mlm = 1
  mpp = 0
  nlvr2 = 0
  vqa = 0
INFO - ViLT - Completed after 0:00:00

dandelin commented on August 28, 2024

@Jxu-Thu I ran your settings.

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32
=> Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'
=> Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3

Since it works fine with my datasets, I suspect you have some duplicated or corrupted arrow files for the SBU or GCC dataset.
Please double-check the sanity of your arrow files.
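
For example, a quick way to check the row counts per arrow file (a rough sketch using pyarrow; the file names below are just placeholders and may not match your actual shards):

```python
import pyarrow as pa

# Placeholder paths; point these at your actual SBU / GCC arrow shards.
arrow_files = [
    "data/VilT_dataset/sbu_0.arrow",
    "data/VilT_dataset/conceptual_caption_train_0.arrow",
]

for path in arrow_files:
    # Memory-map each file and read the full table to report its row count.
    table = pa.ipc.RecordBatchFileReader(pa.memory_map(path, "r")).read_all()
    print(path, table.num_rows)
```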

Jxu-Thu commented on August 28, 2024

Thanks! I made a mistake in the data processing. After fixing it, I get a similar number of iterations to yours.

HarmanDotpy commented on August 28, 2024

Hi,
I am facing an issue where increasing the number of GPUs and nodes does not change the number of steps. For example, if I run
python run.py with data_root=/mnt/nfs/dandelin num_gpus=4 num_nodes=8 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'

the number of steps is still roughly 169158, while I believe it should be reduced to about 169k/(4*8). I also observe that the time per epoch with just 1 GPU is less than with 32 GPUs.
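
For context, the scaling I would expect looks roughly like this (a rough sketch, using the single-GPU iteration count from the run quoted above):

```python
single_gpu_iters = 169_158   # iterations/epoch observed with 1 GPU above
world_size = 4 * 8           # num_gpus * num_nodes

# With a DistributedSampler each rank processes ~1/world_size of the data,
# so I would expect each rank's progress bar to show roughly:
print(single_gpu_iters // world_size)  # ~5286
```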

Has anyone faced these issues before?

HarmanDotpy commented on August 28, 2024

> @Jxu-Thu I ran your settings.
>
> python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32
> => Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]
>
> python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'
> => Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3
>
> Since it works fine with my datasets, I suspect you have some duplicated or corrupted arrow files for the SBU or GCC dataset. Please double-check the sanity of your arrow files.

What is the total batch size for this run?
