AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

📖 Paper: arXiv

💡 Contributions:

  • We propose AdaMAE, a novel, adaptive, and end-to-end trainable token sampling strategy for MAEs that takes into account the spatiotemporal properties of all input tokens to sample fewer but informative tokens.

  • We empirically show that AdaMAE samples more tokens from high spatiotemporal information regions of the input, resulting in learning meaningful representations for downstream tasks.

  • We demonstrate the efficiency of AdaMAE in terms of performance and GPU memory against random patch, tube, and frame sampling by conducting a thorough ablation study on the SSv2 dataset.

  • We show that AdaMAE outperforms the state of the art (SOTA) by $0.7\%$ and $1.1\%$ in top-1 accuracy on $SSv2$ and $Kinetics-400$, respectively.
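
The token-sampling idea in the contributions above can be illustrated with a minimal sketch. It assumes every token already has a scalar score; in AdaMAE such scores would come from the trainable sampling network, whereas here they are hand-written placeholders. The function names `softmax` and `sample_visible_tokens` are hypothetical and are not taken from the AdaMAE codebase. Drawing visible tokens without replacement from a softmax distribution is an assumption of this sketch, and the end-to-end training of the sampler is not shown.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_visible_tokens(scores, mask_ratio, rng):
    """Draw visible token indices without replacement, weighted by softmax(scores).

    Hypothetical helper: tokens with higher scores are more likely to stay visible.
    """
    weights = softmax(scores)
    num_visible = max(1, round(len(scores) * (1.0 - mask_ratio)))
    remaining = list(range(len(scores)))
    visible = []
    for _ in range(num_visible):
        # Pick a position among the tokens that have not been chosen yet.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        visible.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(visible)


# Usage: 20 tokens, the first four carry much higher (placeholder) scores.
rng = random.Random(0)
scores = [50.0] * 4 + [0.0] * 16
visible = sample_visible_tokens(scores, mask_ratio=0.8, rng=rng)
print(visible)
```

With a masking ratio of 0.8, four of the twenty tokens remain visible, and the draw concentrates on the high-score tokens; with uniform scores the same function reduces to random patch sampling.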

Method

mask-vis-1

Adaptive mask visualizations from $SSv2$ (samples from the $50th$ epoch):

[Figure: two samples per row, each shown as Video, Pred., Error, CAT, Mask]

Adaptive mask visualizations from $K400$ (samples from the $50th$ epoch):

[Figure: two samples per row, each shown as Video, Pred., Error, CAT, Mask]

A comparison

Comparison of our adaptive masking with existing random patch, tube, and frame masking at a masking ratio of 80%. Our adaptive masking approach selects more tokens from regions with high spatiotemporal information and only a small number of tokens from the background.

mask-type-comp
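
For readers unfamiliar with the three baselines named above, the following is a minimal sketch of how random patch, tube, and frame masks differ on a small token grid. The grid size, the function names, and the convention that `True` means "masked" are assumptions made for illustration; none of them are taken from the AdaMAE or VideoMAE code.

```python
import random


def random_patch_mask(t, h, w, mask_ratio, rng):
    """Mask tokens anywhere in the (time, height, width) grid."""
    total = t * h * w
    masked = set(rng.sample(range(total), round(total * mask_ratio)))
    return [[[((f * h + r) * w + c) in masked for c in range(w)]
             for r in range(h)] for f in range(t)]


def tube_mask(t, h, w, mask_ratio, rng):
    """Mask the same spatial positions in every frame."""
    masked = set(rng.sample(range(h * w), round(h * w * mask_ratio)))
    frame = [[(r * w + c) in masked for c in range(w)] for r in range(h)]
    return [[row[:] for row in frame] for _ in range(t)]


def frame_mask(t, h, w, mask_ratio, rng):
    """Mask whole frames and leave the others fully visible."""
    masked = set(rng.sample(range(t), round(t * mask_ratio)))
    return [[[f in masked] * w for _ in range(h)] for f in range(t)]


def masked_fraction(mask):
    """Fraction of tokens marked True (masked) in a nested-list mask."""
    flat = [cell for frame in mask for row in frame for cell in row]
    return sum(flat) / len(flat)


# Usage: a hypothetical grid of 10 frames with 5 x 4 tokens each, 80% masked.
rng = random.Random(0)
for make_mask in (random_patch_mask, tube_mask, frame_mask):
    mask = make_mask(10, 5, 4, 0.8, rng)
    print(make_mask.__name__, masked_fraction(mask))
```

All three baselines hide the same fraction of tokens and differ only in where the hidden tokens fall; none of them looks at the video content, which is the gap adaptive masking targets.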

Ablation experiments on SSv2 dataset:

We use ViT-Base as the backbone for all experiments. MHA $(D=2, d=384)$ denotes our adaptive token sampling network with a depth of two and embedding dimension of $384$. All pre-trained models are evaluated based on the evaluation protocol described in Sec. 4. The default choice of our AdaMAE is highlighted in gray color. The GPU memory consumption is reported for a batch size of 16 on a single GPU.

ssv2-ablations

Pre-training AdaMAE & fine-tuning:

  • We closely follow the VideoMAE pre-training recipe, but with our adaptive masking in place of tube masking. To pre-train AdaMAE, please follow the steps in DATASET.md and PRETRAIN.md.

  • To check the performance of pre-trained AdaMAE, please follow the steps in DATASET.md and FINETUNE.md.

  • To set up the conda environment, please refer to FINETUNE.md.

Pre-trained model weights

  • Download the pre-trained model weights for SSv2 and K400 datasets here.

Acknowledgement:

Our AdaMAE codebase is based on the implementation from the VideoMAE paper. We thank the VideoMAE authors for making their code publicly available.

Citation:

@article{bandara2022adamae,
  title={AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders},
  author={Bandara, Wele Gedara Chaminda and Patel, Naman and Gholami, Ali and Nikkhah, Mehdi and Agrawal, Motilal and Patel, Vishal M},
  journal={arXiv preprint arXiv:2211.09120},
  year={2022}
}
