Hi, Thank you for the impressive work. I want to double-check a few points about t

Position encoding for downstream task when pos_mask_ratio=1 and other questions about droppos

KJ-rc commented on July 1, 2024

Position encoding for downstream task when pos_mask_ratio=1 and other questions

from droppos.

Haochen-Wang409 commented on July 1, 2024

Hi, thanks for your attention to our work! Here are point-to-point responses:

1. Positional embeddings are added for downstream tasks when pos_mask_ratio=1 is set in pre-training. DropPos with pos_mask_ratio=1 is not equivalent to MP3 [1], because the visible patches in DropPos are encoded with positional embeddings, while [1] adds no positional information to context tokens. Moreover, DropPos employs a patch masking stage, which makes it more efficient than [1].
2. The multi_task setting is expected to improve top-1 accuracy on ImageNet-1K by about 0.5% with a ViT-B backbone pre-trained for 200 epochs.
3. DropPos tries to reconstruct dropped positions from patch appearance. The visible patches without positional embeddings provide sufficient information for position reconstruction. As in most self-supervised methods, the encoder is responsible for learning scalable feature representations, while the decoder serves the particular pretext task, i.e., reconstructing dropped positions in DropPos.
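
For readers following the thread, here is a minimal NumPy sketch of the position-dropping idea described above. It is not the DropPos implementation: the function name `drop_positions`, the use of a zero vector in place of a dropped positional embedding, and the toy shapes are all assumptions made for illustration.

```python
import numpy as np


def drop_positions(patch_tokens, pos_embed, pos_mask_ratio, rng):
    """Add positional embeddings to a random subset of visible tokens only.

    Hypothetical helper: a dropped position is modeled here as a zero
    vector, which is an assumption of this sketch.
    """
    num_patches = patch_tokens.shape[0]
    num_dropped = int(round(num_patches * pos_mask_ratio))

    # Pick which tokens lose their positional embedding.
    dropped = np.zeros(num_patches, dtype=bool)
    dropped[rng.permutation(num_patches)[:num_dropped]] = True

    # Tokens that keep their position get pos_embed added; the rest get nothing.
    encoder_input = patch_tokens + np.where(dropped[:, None], 0.0, pos_embed)
    return encoder_input, dropped


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 visible patches, width 8
pos = rng.standard_normal((16, 8))     # one positional embedding per patch
encoder_input, dropped = drop_positions(tokens, pos, pos_mask_ratio=0.75, rng=rng)

# The pretext targets are the true position indices of the dropped tokens.
targets = np.flatnonzero(dropped)
```

In this sketch, the model would be trained to predict `targets` from the appearance of the tokens whose positions were dropped.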


KJ-rc commented on July 1, 2024

Thank you for the explanation. I still have a few questions.

When pos_mask_ratio=1, DropPos doesn't see any position information either, does it?
Regarding my 3rd question, if no new tokens are added in the decoder, what's the difference between a "12-layer encoder + 2-layer decoder" setting and a "14-layer encoder" setting?


Haochen-Wang409 commented on July 1, 2024

There seems to be no difference. The only thing that may matter is which layer's features are chosen for downstream classification.
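
A toy NumPy sketch of this point, assuming the encoder and decoder blocks share the same width and that nothing (no new tokens, no projection) is inserted between them. The residual `tanh` blocks and the names `make_layers` and `run_layers` are illustrative stand-ins, not DropPos code.

```python
import numpy as np


def make_layers(num_layers, dim, rng):
    """Create toy weight matrices standing in for transformer blocks."""
    return [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(num_layers)]


def run_layers(x, layers):
    """Apply toy residual blocks and return the output after every block."""
    outputs = []
    for w in layers:
        x = x + np.tanh(x @ w)
        outputs.append(x)
    return outputs


rng = np.random.default_rng(0)
layers = make_layers(14, 8, rng)
tokens = rng.standard_normal((5, 8))

# "12-layer encoder + 2-layer decoder": the decoder consumes the encoder output as-is.
enc_out = run_layers(tokens, layers[:12])[-1]
dec_out = run_layers(enc_out, layers[12:])[-1]

# "14-layer encoder": the same blocks run as one stack.
all_out = run_layers(tokens, layers)

# Same computation either way; the choice is which layer feeds the downstream task.
assert np.allclose(dec_out, all_out[-1])
features_12 = all_out[11]  # what a 12-layer "encoder" would hand downstream
features_14 = all_out[13]  # what a 14-layer "encoder" would hand downstream
```

Under these assumptions the two settings compute the same function, and the encoder/decoder boundary only marks where downstream features are read out.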


KJ-rc commented on July 1, 2024

Thank you. That answers my questions.
