Comments (12)

Haochen-Wang409 commented on August 19, 2024

Hi, have you reproduced the performance of ViT-B? Also, could you provide more details of the configuration of your experiments using ViT-L?

from droppos.

vateye commented on August 19, 2024

Yes, I can reproduce the result for ViT-B but not for ViT-L. For both, I follow the configuration in Appendix A.1 of the paper.
For pre-training:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes 4 --node_rank 0 \
    main_pretrain.py \
    --batch_size 128 \
    --accum_iter 1 \
    --model DropPos_mae_vit_large_patch16_dec512d2b \
    \
    --drop_pos_type mae_pos_target \
    --use_mask_token \
    --pos_mask_ratio 0.75 \
    --pos_weight 0.1 \
    --label_smoothing_sigma 1 \
    --sigma_decay \
    --attn_guide \
    \
    --input_size 224 \
    --token_size 14 \
    --mask_ratio 0.75 \
    --epochs 200 \
    --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --data_path /path/to/imagenet \
    --output_dir  ./output_dir \
    --log_dir   ./log_dir \
    --experiment droppos_pos_mask0.75_posmask0.75_smooth1to0_sim_in1k_ep200
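
As a sanity check on this configuration: MAE-style codebases, which DropPos builds on, typically derive the actual learning rate from `--blr` by scaling with the effective batch size (lr = blr × eff_batch_size / 256). A minimal sketch of the arithmetic implied by the command above — the scaling rule is an assumption carried over from the MAE recipe, not something stated in this thread:

```python
# Effective batch size and learning rate implied by the pre-training command,
# assuming the MAE-style rule lr = blr * eff_batch_size / 256.
batch_size = 128      # per-GPU batch size (--batch_size)
nproc_per_node = 8    # GPUs per node
nnodes = 4            # number of nodes
accum_iter = 1        # gradient accumulation steps (--accum_iter)
blr = 1.5e-4          # base learning rate (--blr)

eff_batch_size = batch_size * accum_iter * nproc_per_node * nnodes
lr = blr * eff_batch_size / 256

print(eff_batch_size)  # 4096
print(lr)              # 0.0024
```

If any of the node or GPU counts differ from the paper's setup, the effective batch size (and hence the actual learning rate) shifts accordingly, which is a common source of silent reproduction gaps.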

For fine-tuning:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes 4 --node_rank 0 \
    main_finetune.py \
    --batch_size 32 \
    --accum_iter 1 \
    --model vit_large_patch16 \
    --finetune /path/to/checkpoint \
    \
    --epochs 50 \
    --warmup_epochs 5 \
    --blr 1e-3 --layer_decay 0.75 --weight_decay 0.05 \
    --drop_path 0.1 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
    --dist_eval \
    --data_path /path/to/imagenet \
    --nb_classes 1000 \
    --output_dir  ./output_dir \
    --log_dir   ./log_dir \
    --experiment droppos_vit_large_patch16_in1k_ep200
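
For the fine-tuning command, the same MAE-style conventions presumably apply: `--blr` is scaled by the effective batch size, and `--layer_decay 0.75` shrinks the learning rate geometrically toward the input layers. A rough sketch of what these flags imply (the exact per-layer parameter grouping in the codebase may differ; this follows the MAE recipe as an assumption):

```python
# Effective fine-tuning hyperparameters implied by the command above, assuming
# the MAE-style rule lr = blr * eff_batch_size / 256 plus layer-wise lr decay.
batch_size = 32
nproc_per_node, nnodes, accum_iter = 8, 4, 1
blr = 1e-3
layer_decay = 0.75
num_layers = 24  # transformer depth of ViT-L/16

eff_batch_size = batch_size * accum_iter * nproc_per_node * nnodes  # 1024
lr = blr * eff_batch_size / 256                                     # 0.004

# The top block trains at the full lr; each earlier block is scaled by 0.75.
per_layer_lr = [lr * layer_decay ** (num_layers - i) for i in range(num_layers + 1)]
print(eff_batch_size)          # 1024
print(lr)                      # 0.004
print(per_layer_lr[-1] == lr)  # True (topmost block gets the full rate)
```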

Haochen-Wang409 commented on August 19, 2024

How about enlarging the decoder by setting its depth to 4 or 8?

vateye commented on August 19, 2024

Is this the default setting for ViT-L? If possible, could you provide the command you ran to train ViT-L to its best performance?

Haochen-Wang409 commented on August 19, 2024

The default setting is exactly what we have provided. However, training ViT-L is much more unstable than training ViT-B.

vateye commented on August 19, 2024

So under your default setting, the model used for pre-training is DropPos_mae_vit_large_patch16_dec512d2b rather than DropPos_mae_vit_large_patch16_dec512d8b. I will launch multiple jobs to see whether stability is the core issue.

vateye commented on August 19, 2024

Hi Haochen,

I have rerun the job several times, and I have also tried different decoder depths and different numbers of pre-training epochs. I still cannot reproduce the results reported in the paper (ViT-L with 200 epochs reportedly reaches 84.5 top-1 accuracy).

Regarding the "unstable training": I ran the same job with the hyperparameters suggested in the paper in three different trials, varying the seed for each pre-training run. After 200 epochs of pre-training ViT-L, I get only 82.83, 83.04, and 82.86.
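
For what it's worth, the spread across those three seeds is tiny compared with the gap to the paper's reported accuracy, which suggests seed-level instability alone does not explain the difference. A quick computation over the figures above:

```python
# Mean and sample standard deviation of the three seeded ViT-L runs above.
import statistics

accs = [82.83, 83.04, 82.86]
mean = statistics.mean(accs)
spread = statistics.stdev(accs)  # sample standard deviation

print(round(mean, 2))    # 82.91
print(round(spread, 2))  # 0.11
```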

Regarding decoder depth: I ran experiments with depth=2 and depth=8. The results after 200 epochs of pre-training are 82.83 and 82.73, respectively.

Regarding the epoch budget: I pre-trained ViT-L for 200, 400, and 800 epochs. The fine-tuning performance is similar across all three (82.83 vs. 82.51 vs. 83.20) and much lower than what the paper reports.

vateye commented on August 19, 2024

Could you please share the scripts and accompanying training logs for ViT-L? Having access to these would help us understand and replicate your work more effectively. :)

Haochen-Wang409 commented on August 19, 2024

I would like to first clarify that the top-1 accuracy of ViT-L with 200 epochs of pre-training is 83.7, not 84.5 (see Table 5 in our paper for details).
Using the exact configuration reported on page 15 of our paper is expected to reproduce that result, although setting the base learning rate of fine-tuning to 5e-4 sometimes gives better results. Reaching only ~83% top-1 with ViT-L is quite unexpected.
Finally, I will release the pre-trained and fine-tuned checkpoints for both ViT-B and ViT-L with 800 epochs of pre-training within a few hours.

vateye commented on August 19, 2024

Sorry for the wrong reference number. Are there any training logs and scripts that can reproduce the results? Releasing them would help the research community fully reproduce your work.

Haochen-Wang409 commented on August 19, 2024

I am willing to provide the training log, but I can only find the fine-tuning log for ViT-L with 800 epochs of pre-training, which is attached below. Unfortunately, I currently have no resources to re-run DropPos, since it requires a large number of GPUs :(

DropPos_vit_large_patch16_ft_log.txt

I have checked the training scripts you provided. The global batch size is correct.

Haochen-Wang409 commented on August 19, 2024

The pre-trained and fine-tuned models are available at: https://pan.baidu.com/s/1xj9XiHgagKGJrJt88IfhLw?pwd=4gik.
The fetch code is 4gik.
