haochen-wang409 / droppos

[NeurIPS'23] DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

License: Apache License 2.0

Python 98.01% Shell 1.99%

ade20k coco computer-vision deep-learning detection image-classification imagenet position-embedding segmentation self-supervised-learning

droppos's Issues

Cannot Reproduce the results on ViT-L

Hi, I used the official code and the hyperparameters suggested in the paper to train ViT-L for 200 epochs. After fine-tuning, I can only reach 82.8 Top-1 accuracy on ImageNet-1K. Are there any missing details for training DropPos?

Why does DropPos achieve exactly the same performance as HPM?

It is interesting that DropPos achieves exactly the same performance as your CVPR paper (HPM). Is it a coincidence, or is there some internal connection?

pretrain model

Hi, would it be possible to release the pre-trained weights?

pre-trained and fine-tuned models

Hi,
I would like to download the models you uploaded recently. Would you consider uploading them to something like Google Drive or Dropbox? Or is there a way to download them via the link you provided without registering and installing Baidu?
Thanks

Final loss value of the pre-training stage

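Hello, roughly what loss value should this model reach at the end of pre-training? I am not sure whether the loss from my current training run is too high.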

A question about the strategy of DropPos

Hi author, thank you for contributing such interesting and solid work.

I have a question (maybe a trivial one): the reconstruction targets of DropPos are the actual positions of the masked PEs, right? But why do you first mask a subset of patches? (I can understand that this is necessary for MAE, since its target is RGB pixels.) Is it because reconstructing the masked PEs alone is too simple a pretext task for pre-training ViT (what the paper calls a trivial solution)?

If so, feeding all patches directly into the encoder would produce suboptimal results: since all patches are visible to the encoder, it can infer the masked PEs from all possible positions. In contrast, if we only let it "see" part of the patches, it has to infer the masked PEs from the visible patches alone.

Is my understanding correct? I hope you can share some insight. Thanks a lot!
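
The two-stage selection discussed in this question (first mask a subset of patches, then drop the position embeddings of some of the remaining visible patches) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's implementation: the helper name `sample_droppos_indices`, the uniform random sampling, and the example ratios are all hypothetical.

```python
import random


def sample_droppos_indices(num_patches, patch_mask_ratio, pos_mask_ratio, seed=0):
    """Hypothetical helper (not from the DropPos repo).

    Stage 1: keep a random subset of patches visible to the encoder.
    Stage 2: among the visible patches, pick those whose position
             embedding is dropped; their original indices are the
             targets of the position-prediction task.
    """
    rng = random.Random(seed)

    # Stage 1: random patch masking, as in MAE-style pre-training.
    order = list(range(num_patches))
    rng.shuffle(order)
    num_visible = int(num_patches * (1 - patch_mask_ratio))
    visible = sorted(order[:num_visible])

    # Stage 2: drop position embeddings for a fraction of visible patches.
    num_dropped = int(num_visible * pos_mask_ratio)
    dropped = sorted(rng.sample(visible, num_dropped))
    dropped_set = set(dropped)
    kept = [i for i in visible if i not in dropped_set]

    return visible, dropped, kept


# Illustrative values only: 14 x 14 = 196 patches, both ratios set to 0.75.
visible, dropped, kept = sample_droppos_indices(196, 0.75, 0.75)
print(len(visible), len(dropped), len(kept))
```

In this sketch the model would only ever see the `visible` patches, so the position of each patch in `dropped` has to be inferred from partial image content plus the few patches in `kept` that still carry position information.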

Position encoding for downstream tasks when pos_mask_ratio=1 and other questions

Hi,
Thank you for the impressive work. I want to double-check a few points about the paper and code.

When setting pos_mask_ratio=1 in pre-training, do we apply any position encoding in downstream tasks, e.g., linear probing? Also, could we say that DropPos is almost equivalent to Zhai et al. [1] under this setting?
I found "--multi_task" in the pre-training code, but there seem to be no reported results for it. I am curious how much it improves performance.
The visible patches with masked positions are processed by the encoder. This differs from MAE: shouldn't they be added later, at the decoder stage (which might further speed up training)? Under this setting, what is the difference between the encoder and the decoder?

[1] Zhai et al., Position Prediction as an Effective Pretraining Strategy
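
To make the pos_mask_ratio=1 case concrete: every visible patch would then be a prediction target, and none would carry a position embedding. The sketch below computes a position-prediction loss only over tokens without position embeddings. It assumes a plain softmax cross-entropy over patch indices; the function name `position_cross_entropy` and its arguments are hypothetical, and the loss actually used in the paper or the repository may differ.

```python
import math


def position_cross_entropy(logits, targets, has_pos):
    """Hypothetical sketch: mean cross-entropy over position-dropped tokens.

    logits  : list of per-token score lists, one score per candidate position
    targets : list of true position indices, one per token
    has_pos : list of bools, True if the token kept its position embedding
    """
    losses = []
    for row, target, keeps_position in zip(logits, targets, has_pos):
        if keeps_position:
            continue  # tokens that still have a PE are not prediction targets
        # Numerically stable log-sum-exp for the softmax normaliser.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


# With pos_mask_ratio=1, has_pos is False for every visible token.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = position_cross_entropy(uniform_logits, [0, 2, 3], [False, False, False])
print(loss)  # equals log(4) for uniform scores over 4 positions
```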
