haochen-wang409 / droppos

[NeurIPS'23] DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

Official implementation of our paper "DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions", NeurIPS 2023.

by Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tong Wang, and Zhaoxiang Zhang

[arXiv] [Paper]

🔔🔔🔔 We are happy to announce that DropPos has been accepted by NeurIPS 2023! 🔔🔔🔔

🔔🔔🔔 The pre-trained and fine-tuned models are available here with fetch code 4gik! 🔔🔔🔔

Notes

Motivation

TL;DR We present a novel self-supervised pretext task for pre-training vision transformers, i.e., reconstructing dropped positions (DropPos), which achieves competitive results on various evaluation protocols, such as image classification, object detection, and semantic segmentation.

Abstract. As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings and then the model classifies the actual position for each non-overlapping patch among all possible positions solely based on their visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, considering there may be different patches with similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since it is not necessary to reconstruct their exact positions in these cases. Empirical evaluations of DropPos show strong capabilities. DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs.
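The formulation in the abstract can be illustrated with a small, dependency-free sketch. It is not the code in this repository: the function names, the ratios, the Gaussian form of the smoothing kernel, and the `sigma` parameter are all assumptions made for illustration. It shows (1) keeping only a subset of patches visible, (2) dropping the positional embedding of some visible patches, which then become the classification targets, and (3) relaxing the one-hot position label into a smoothed distribution over the patch grid.

```python
import math
import random


def build_droppos_inputs(grid_size, patch_keep_ratio, pos_drop_ratio, seed=0):
    """Choose visible patches and the visible patches whose position is dropped.

    Hypothetical helper: the ratios are illustrative, not the paper's settings.
    Patches are indexed 0 .. grid_size**2 - 1 in row-major order.
    """
    rng = random.Random(seed)
    num_patches = grid_size * grid_size
    num_visible = max(1, round(num_patches * patch_keep_ratio))
    visible = sorted(rng.sample(range(num_patches), num_visible))
    # Only visible patches can lose their positional embedding; the model
    # must classify the true index of each of them among all positions.
    num_dropped = max(1, round(num_visible * pos_drop_ratio))
    dropped = sorted(rng.sample(visible, num_dropped))
    return visible, dropped


def smoothed_targets(position, grid_size, sigma):
    """Soft label over all grid positions, peaked at the true position.

    Assumes a Gaussian kernel on the Euclidean grid distance, so that
    predicting a nearby position is penalized less than a distant one.
    """
    row, col = divmod(position, grid_size)
    weights = []
    for other in range(grid_size * grid_size):
        r, c = divmod(other, grid_size)
        sq_dist = (r - row) ** 2 + (c - col) ** 2
        weights.append(math.exp(-sq_dist / (2 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(logits, target):
    """Cross-entropy between predicted logits and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


if __name__ == "__main__":
    visible, dropped = build_droppos_inputs(14, patch_keep_ratio=0.25, pos_drop_ratio=0.75)
    target = smoothed_targets(dropped[0], grid_size=14, sigma=1.0)
    loss = soft_cross_entropy([0.0] * (14 * 14), target)
    print(len(visible), len(dropped), loss)
```

In a real model the logits would come from a position-prediction head applied to the encoder output of each patch in `dropped`. The attentive reconstruction strategy mentioned in the abstract is not covered by this sketch.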

Method

Results

Method    Model      Pre-training epochs   Top-1 Acc. (%)
DropPos   ViT-B/16   200                   83.0
DropPos   ViT-B/16   800                   84.2
DropPos   ViT-L/16   800                   85.8

Acknowledgement

The pretraining and finetuning of our project are based on DeiT, MAE and HPM. Thanks for their wonderful work.

For object detection and semantic segmentation, please refer to Detectron2 and MMSegmentation, respectively. The corresponding configurations can be found here (detection) and here (segmentation).

License

This project is released under the Apache License 2.0. See LICENSE for details.

Citation

@article{wang2023droppos,
  author  = {Wang, Haochen and Fan, Junsong and Wang, Yuxi and Song, Kaiyou and Wang, Tong and Zhang, Zhaoxiang},
  journal = {Advances in Neural Information Processing Systems (NeurIPS)},
  title   = {DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions},
  year    = {2023},
}
