Comments (14)

dandelin commented on August 28, 2024

Unfortunately, we only experimented with batch_size=4096, so we have no empirical results for other settings.
That said, I believe performance will be preserved at smaller batch sizes such as 2048 or 1024.

For low-resource regimes, the published code provides a "gradient accumulation" option.
It automatically computes the number of steps over which to accumulate gradients from the given per_gpu_batchsize and the number of GPUs. (see https://github.com/dandelin/ViLT/blob/master/run.py#L42-L44)
In theory, gradient accumulation produces the same result as the non-accumulated version. (However, we did not use gradient accumulation in our experiments, so this is not guaranteed.)
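
For reference, the computation is roughly the following (a paraphrased sketch of those lines in run.py, not the exact source; the variable names mirror the config keys):

```python
# Paraphrased sketch of the accumulation-step computation in run.py
# (the exact expression in the repository may differ slightly).
batch_size = 4096          # desired effective batch size
per_gpu_batchsize = 32     # micro-batch that fits on one GPU
num_gpus = 8
num_nodes = 1

# Accumulate enough micro-batches so that
# per_gpu_batchsize * num_gpus * num_nodes * grad_steps == batch_size.
grad_steps = max(batch_size // (per_gpu_batchsize * num_gpus * num_nodes), 1)
print(grad_steps)  # -> 16
```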

dandelin commented on August 28, 2024

@Jxu-Thu Thank you.
I'll investigate this issue soon.

Jxu-Thu commented on August 28, 2024

If I use fewer resources, e.g. num_gpus=8 and num_nodes=1 (batch size 4096 with accum_steps=8), should I modify any other configuration, such as max_steps?

dandelin commented on August 28, 2024

@Jxu-Thu
As far as I know, PyTorch Lightning increments the LightningModule's internal step only once accumulation is complete.
(https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L813)
So you should not need to change any other configuration to use the gradient accumulation feature.
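
A minimal sketch of how this interacts with the Trainer options (argument names such as gpus follow the PyTorch Lightning version ViLT was written against and may differ in newer releases; this is not the exact call in run.py):

```python
import pytorch_lightning as pl

# The accumulated optimizer step, not the micro-batch, is what advances
# trainer.global_step, so max_steps still counts effective 4096-sample updates.
trainer = pl.Trainer(
    gpus=8,
    num_nodes=1,
    precision=16,
    max_steps=100_000,           # optimizer steps, i.e. accumulated updates
    accumulate_grad_batches=16,  # 16 micro-batches of 32 x 8 = 256 -> 4096
)
```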

Jxu-Thu commented on August 28, 2024

Many thanks for your kind reply!
I am trying to reproduce the results with 24 V100 GPUs, accumulation steps of 3, and a batch size over 4k, without modifying any other configuration.

dandelin commented on August 28, 2024

@Jxu-Thu
Also, please pull the latest commit (#12 (comment))

Jxu-Thu commented on August 28, 2024

Thanks for the reminder.

Jxu-Thu commented on August 28, 2024

I am seeing very slow training because of the huge number of iterations in each epoch, so I tried to inspect why there are so many iterations even with a small batch size.
With vg+mscoco+gcc+sbu (about 9M samples) and batch size 32, I get:
Epoch 0: 0%| | 0/2392933 [00:00<?, ?it/s]
With vg+mscoco (about 5M samples) and batch size 32, I get:
Epoch 0: 0%| | 0/169000 [00:00<?, ?it/s]

Why does adding gcc+sbu (only about 4M samples) increase the iterations from roughly 169k to 2.39M?
For vg+mscoco, 32 x 169k ≈ 5.4M samples, which matches the dataset size.
However, for vg+mscoco+gcc+sbu, 32 x 2.39M ≈ 76M, and I cannot understand where so many iterations come from.
I have carefully checked the code but found no clues. Could you help me?
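
For reference, my back-of-the-envelope expectation (the sample counts are approximate):

```python
import math

# Approximate sample counts per dataset combination.
datasets = {
    "coco+vg": 5_000_000,
    "coco+vg+gcc+sbu": 9_000_000,
}
per_gpu_batchsize = 32

for name, n_samples in datasets.items():
    # Expected single-GPU iterations per epoch.
    print(name, math.ceil(n_samples / per_gpu_batchsize))
# coco+vg         -> ~156k iterations (close to the ~169k shown above)
# coco+vg+gcc+sbu -> ~281k iterations (nowhere near the ~2.39M shown above)
```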

dandelin commented on August 28, 2024

@Jxu-Thu could you share the config for each run using sacred's print_config command? (https://sacred.readthedocs.io/en/stable/command_line.html#print-config)

Jxu-Thu commented on August 28, 2024

vg+mscoco+gcc+sbu

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg', 'sbu', 'gcc']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
  irtr = 0
  itm = 1
  mlm = 1
  mpp = 0
  nlvr2 = 0
  vqa = 0
INFO - ViLT - Completed after 0:00:00


coco+vg

INFO - ViLT - Running command 'print_config'
INFO - ViLT - Started
Configuration (modified, added, typechanged, doc):
batch_size = 4096 # this is a desired batch size; pl trainer will accumulate gradients when per step batch is smaller.
data_root = 'data/VilT_dataset' # below params varies with the environment
datasets = ['coco', 'vg']
decay_power = 1
draw_false_image = 1
draw_false_text = 0
drop_rate = 0.1
end_lr = 0
exp_name = 'debug_pretrain'
fast_dev_run = False
get_recall_metric = False # Downstream Setting
hidden_size = 768
image_only = False
image_size = 384
learning_rate = 0.0001
load_path = ''
log_dir = 'checkpoint_vilt/pre_train'
lr_mult = 1 # multiply lr for downstream heads
max_epoch = 100
max_image_len = 200
max_steps = 100000
max_text_len = 40
mlm_prob = 0.15
mlp_ratio = 4
num_gpus = 1
num_heads = 12
num_layers = 12
num_nodes = 1
num_workers = 0 # debug 8
optim_type = 'adamw' # Optimizer Setting
patch_size = 32
per_gpu_batchsize = 32 # you should define this manually with per_gpu_batch_size=#
precision = 16
resume_from = None # PL Trainer Setting
seed = 0 # the random seed for this experiment
test_only = False
tokenizer = 'bert-base-uncased'
train_transform_keys = ['pixelbert'] # Image setting
val_check_interval = 1.0
val_transform_keys = ['pixelbert']
vit = 'vit_base_patch32_384' # Transformer Setting
vocab_size = 30522
vqav2_label_size = 3129 # Text Setting
warmup_steps = 2500
weight_decay = 0.01
whole_word_masking = True
loss_names:
  irtr = 0
  itm = 1
  mlm = 1
  mpp = 0
  nlvr2 = 0
  vqa = 0
INFO - ViLT - Completed after 0:00:00

dandelin commented on August 28, 2024

@Jxu-Thu I ran your settings.

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32
=> Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]

python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'
=> Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3

Since it works fine with my datasets, I suspect you have some duplicated or corrupted arrow files for the SBU or GCC dataset.
Please double-check the sanity of your arrow files.
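
For example, a quick way to check the row counts per arrow file (a rough sketch using pyarrow; the file names below are just placeholders and may not match your actual shards):

```python
import pyarrow as pa

# Placeholder paths; point these at your actual SBU / GCC arrow shards.
arrow_files = [
    "data/VilT_dataset/sbu_0.arrow",
    "data/VilT_dataset/conceptual_caption_train_0.arrow",
]

for path in arrow_files:
    # Memory-map each file and read the full table to report its row count.
    table = pa.ipc.RecordBatchFileReader(pa.memory_map(path, "r")).read_all()
    print(path, table.num_rows)
```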

Jxu-Thu commented on August 28, 2024

Thanks! I made a mistake in the data processing. After fixing it, I get a similar number of iterations to yours.

HarmanDotpy commented on August 28, 2024

Hi,
I am facing an issue where increasing the number of GPUs and nodes does not change the number of steps. For example, if I run
python run.py with data_root=/mnt/nfs/dandelin num_gpus=4 num_nodes=8 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'

the number of steps is still roughly 169158, while I believe it should be reduced to about 169k/(4*8). I also observe that the time per epoch with just 1 GPU is less than with 32 GPUs.
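
For context, the scaling I would expect looks roughly like this (a rough sketch, using the single-GPU iteration count from the run quoted above):

```python
single_gpu_iters = 169_158   # iterations/epoch observed with 1 GPU above
world_size = 4 * 8           # num_gpus * num_nodes

# With a DistributedSampler each rank processes ~1/world_size of the data,
# so I would expect each rank's progress bar to show roughly:
print(single_gpu_iters // world_size)  # ~5286
```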

Has anyone faced these issues before?

HarmanDotpy commented on August 28, 2024

> @Jxu-Thu I ran your settings.
>
> python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32
> => Epoch 0: 0%| | 130/290436 [03:56<146:58:18, 1.82s/it, loss=11.2, v_num=1]
>
> python run.py with data_root=/mnt/nfs/dandelin num_gpus=1 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=32 datasets='["coco", "vg"]'
> => Epoch 0: 0%|▏ | 137/169158 [04:00<82:27:29, 1.76s/it, loss=11.2, v_num=3
>
> Since it works fine with my datasets, I suspect you have some duplicated or corrupted arrow files for the SBU or GCC dataset. Please double-check the sanity of your arrow files.

What is the total batch size for this run?
