Transformer-in-Vision

A list of recent Transformer-based computer vision (CV) papers. If you notice a missing paper, please open an issue or a pull request.

**Last updated: 2022/02/03**

Update log

2021/April - updated with all recent Transformer-in-Vision papers.
2021/May - updated with all recent Transformer-in-Vision papers.
2021/June - updated with all recent Transformer-in-Vision papers.
2021/July - updated with all recent Transformer-in-Vision papers.
2021/August - updated with all recent Transformer-in-Vision papers.
2021/September - updated with all recent Transformer-in-Vision papers.
2021/October - updated with all recent Transformer-in-Vision papers.
2021/November - updated with all recent Transformer-in-Vision papers.
2021/December - updated with all recent Transformer-in-Vision papers.
2022/January - updated with all recent Transformer-in-Vision papers.

Survey

(arXiv 2022.01) Transformers in Medical Imaging: A Survey. [Paper], [Awesome]
(arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks. [Paper]
(arXiv 2022.01) Video Transformers: A Survey. [Paper]
(arXiv 2021.11) A Survey of Visual Transformers. [Paper]
(arXiv 2021.09) Survey: Transformer based Video-Language Pre-training. [Paper]
(arXiv 2021.03) Multi-modal Motion Prediction with Stacked Transformers. [Paper], [Code]
(arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision. [Paper]
(arXiv 2020.09) Efficient Transformers: A Survey. [Paper]
(arXiv 2021.01) Transformers in Vision: A Survey. [Paper]

Hand Gesture

(arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]

HOI

(CVPR'21) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper], [Code]
(arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]
(arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]
(arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]
(arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
(arXiv 2021.08) GTNet: Guided Transformer Network for Detecting Human-Object Interactions, [Paper], [Code]
(arXiv 2021.12) Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer, [Paper], [Code]

Hyperspectral

(arXiv 2021.07) SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers, [Paper], [Code]
(arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]
(arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]
(arXiv 2021.11) Learning A 3D-CNN and Transformer Prior for Hyperspectral Image Super-Resolution, [Paper]

Incremental Learning

(arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]

In-painting

(ECCV'20) Learning Joint Spatial-Temporal Transformations for Video Inpainting, [Paper], [Code]
(arXiv 2021.04) Aggregated Contextual Transformations for High-Resolution Image Inpainting, [Paper], [Code]
(arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]

Instance Segmentation

(CVPR'21) End-to-End Video Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.04) ISTR: End-to-End Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.08) SOTR: Segmenting Objects with Transformers, [Paper], [Code]
(arXiv 2021.12) SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation, [Paper], [Code]
(arXiv 2021.12) A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation, [Paper]
(arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]

Layout

(CVPR'21) Variational Transformer Networks for Layout Generation, [Paper]
(arXiv 2021.10) The Layout Generation Algorithm of Graphic Design Based on Transformer-CVAE, [Paper]
(arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]
(arXiv 2022.02) ATEK: Augmenting Transformers with Expert Knowledge for Indoor Layout Synthesis, [Paper]

Matching

(CVPR'21) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]
(arXiv 2022.02) Local Feature Matching with Transformers for low-end devices, [Paper], [Code]

Medical

(arXiv 2021.02) TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.02) Medical Transformer: Gated Axial-Attention for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.03) SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation, [Paper], [Code]
(arXiv 2021.03) TransBTS: Multimodal Brain Tumor Segmentation Using Transformer, [Paper], [Code]
(arXiv 2021.03) TransMed: Transformers Advance Multi-modal Medical Image Classification, [Paper]
(arXiv 2021.03) U-Net Transformer: Self and Cross Attention for Medical Image Segmentation, [Paper]
(arXiv 2021.03) UNETR: Transformers for 3D Medical Image Segmentation, [Paper]
(arXiv 2021.04) DeepProg: A Multi-modal Transformer-based End-to-end Framework for Predicting Disease Prognosis, [Paper]
(arXiv 2021.04) ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration, [Paper], [Code]
(arXiv 2021.04) Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification, [Paper]
(arXiv 2021.04) Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer, [Paper]
(arXiv 2021.04) Medical Transformer: Universal Brain Encoder for 3D MRI Analysis, [Paper]
(arXiv 2021.04) Crossmodal Matching Transformer for Interventional in TEVAR, [Paper]
(arXiv 2021.04) GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification, [Paper]
(arXiv 2021.04) Pyramid Medical Transformer for Medical Image Segmentation, [Paper]
(arXiv 2021.05) Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy, [Paper]
(arXiv 2021.05) Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.05) Is Image Size Important? A Robustness Comparison of Deep Learning Methods for Multi-scale Cell Image Classification Tasks: from Convolutional Neural Networks to Visual Transformers, [Paper]
(arXiv 2021.05) Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers, [Paper]
(arXiv 2021.05) Medical Image Segmentation using Squeeze-and-Expansion Transformers, [Paper], [Code]
(arXiv 2021.05) POCFormer: A Lightweight Transformer Architecture for Detection of COVID-19 Using Point of Care Ultrasound, [Paper]
(arXiv 2021.05) COTR: Convolution in Transformer Network for End to End Polyp Detection, [Paper]
(arXiv 2021.05) PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer, [Paper]
(arXiv 2021.06) TED-net: Convolution-free T2T Vision Transformer-based Encoder-decoder Dilation network for Low-dose CT Denoising, [Paper]
(arXiv 2021.06) A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation, [Paper]
(arXiv 2021.06) Task Transformer Network for Joint MRI Reconstruction and Super-Resolution, [Paper], [Code]
(arXiv 2021.06) DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation, [Paper]
(arXiv 2021.06) More than Encoder: Introducing Transformer Decoder to Upsample, [Paper]
(arXiv 2021.06) Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image, [Paper]
(arXiv 2021.06) MTrans: Multi-Modal Transformer for Accelerated MR Imaging, [Paper], [Code]
(arXiv 2021.06) Multi-Compound Transformer for Accurate Biomedical Image Segmentation, [Paper], [Code]
(arXiv 2021.07) ResViT: Residual vision transformers for multi-modal medical image synthesis, [Paper]
(arXiv 2021.07) E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception, [Paper]
(arXiv 2021.07) UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation, [Paper]
(arXiv 2021.07) COVID-VIT: Classification of Covid-19 from CT chest images based on vision transformer models, [Paper]
(arXiv 2021.07) RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting, [Paper], [Code]
(arXiv 2021.07) Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation, [Paper]
(arXiv 2021.07) Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries, [Paper]
(arXiv 2021.07) EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification, [Paper]
(arXiv 2021.07) Visual Transformer with Statistical Test for COVID-19 Classification, [Paper]
(arXiv 2021.07) TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation, [Paper]
(arXiv 2021.07) Few-Shot Domain Adaptation with Polymorphic Transformers, [Paper], [Code]
(arXiv 2021.07) TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation, [Paper]
(arXiv 2021.07) Surgical Instruction Generation with Transformers, [Paper]
(arXiv 2021.07) LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.07) TEDS-Net: Enforcing Diffeomorphisms in Spatial Transformers to Guarantee Topology Preservation in Segmentations, [Paper], [Code]
(arXiv 2021.08) Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers, [Paper], [Code]
(arXiv 2021.08) Is it Time to Replace CNNs with Transformers for Medical Images?, [Paper], [Code]
(arXiv 2021.09) nnFormer: Interleaved Transformer for Volumetric Segmentation, [Paper], [Code]
(arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
(arXiv 2021.09) MISSFormer: An Effective Medical Image Segmentation Transformer, [Paper]
(arXiv 2021.09) Eformer: Edge Enhancement based Transformer for Medical Image Denoising, [Paper]
(arXiv 2021.09) Transformer-Unet: Raw Image Processing with Unet, [Paper]
(arXiv 2021.09) BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation, [Paper]
(arXiv 2021.09) GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation, [Paper]
(arXiv 2021.10) Transformer Assisted Convolutional Network for Cell Instance Segmentation, [Paper]
(arXiv 2021.10) A transformer-based deep learning approach for classifying brain metastases into primary organ sites using clinical whole brain MRI images, [Paper]
(arXiv 2021.10) Boundary-aware Transformers for Skin Lesion Segmentation, [Paper], [Code]
(arXiv 2021.10) Vision Transformer based COVID-19 Detection using Chest X-rays, [Paper]
(arXiv 2021.10) Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining, [Paper], [Code]
(arXiv 2021.10) CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans, [Paper], [Code]
(arXiv 2021.10) COVID-19 Detection in Chest X-ray Images Using Swin-Transformer and Transformer in Transformer, [Paper], [Code]
(arXiv 2021.10) Bilateral-ViT for Robust Fovea Localization, [Paper]
(arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]
(arXiv 2021.10) Vision Transformer for Classification of Breast Ultrasound Images, [Paper]
(arXiv 2021.11) Federated Split Vision Transformer for COVID-19 CXR Diagnosis using Task-Agnostic Training, [Paper]
(arXiv 2021.11) Hepatic vessel segmentation based on 3D swin-transformer with inductive biased multi-head self-attention, [Paper]
(arXiv 2021.11) Lymph Node Detection in T2 MRI with Transformers, [Paper]
(arXiv 2021.11) Mixed Transformer U-Net For Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.11) Transformer for Polyp Detection, [Paper]
(arXiv 2021.11) DuDoTrans: Dual-Domain Transformer Provides More Attention for Sinogram Restoration in Sparse-View CT Reconstruction, [Paper], [Code]
(arXiv 2021.11) A Volumetric Transformer for Accurate 3D Tumor Segmentation, [Paper], [Code]
(arXiv 2021.11) Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis, [Paper], [Code]
(arXiv 2021.11) MIST-net: Multi-domain Integrative Swin Transformer network for Sparse-View CT Reconstruction, [Paper]
(arXiv 2021.12) MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification, [Paper], [Code]
(arXiv 2021.12) 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis, [Paper], [Code]
(arXiv 2021.12) Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer, [Paper], [Code]
(arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper], [Code]
(arXiv 2021.12) MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer, [Paper], [Code]
(arXiv 2022.01) D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation, [Paper]
(arXiv 2022.01) Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images, [Paper], [Code]
(arXiv 2022.01) Swin Transformer for Fast MRI, [Paper], [Code]
(arXiv 2022.01) ViTBIS: Vision Transformer for Biomedical Image Segmentation, [Paper]
(arXiv 2022.01) Improving Across-Dataset Brain Tissue Segmentation Using Transformer, [Paper], [Code]
(arXiv 2022.01) SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation, [Paper], [Code]
(arXiv 2022.01) ReconFormer: Accelerated MRI Reconstruction Using Recurrent Transformer, [Paper], [Code]
(arXiv 2022.01) Fast MRI Reconstruction: How Powerful Transformers Are, [Paper]
(arXiv 2022.01) Class-Aware Generative Adversarial Transformers for Medical Image Segmentation, [Paper]
(arXiv 2022.01) RTNet: Relation Transformer Network for Diabetic Retinopathy Multi-lesion Segmentation, [Paper]
(arXiv 2022.01) Joint Liver and Hepatic Lesion Segmentation using a Hybrid CNN with Transformer Layers, [Paper]
(arXiv 2022.01) DSFormer: A Dual-domain Self-supervised Transformer for Accelerated Multi-contrast MRI Reconstruction, [Paper]
(arXiv 2022.01) TransPPG: Two-stream Transformer for Remote Heart Rate Estimate, [Paper]
(arXiv 2022.01) TransBTSV2: Wider Instead of Deeper Transformer for Medical Image Segmentation, [Paper], [Code]

Motion

(arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]
(arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]
(arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]
(arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper]
(arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]
(arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]

Multi-task/modal

(arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Code]
(arXiv 2021.04) MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding, [Paper], [Code]
(arXiv 2021.04) Multi-Modal Fusion Transformer for End-to-End Autonomous Driving, [Paper]
(arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]
(arXiv 2021.06) Scene Transformer: A Unified Multi-task Model for Behavior Prediction and Planning, [Paper]
(arXiv 2021.06) Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation, [Paper]
(arXiv 2021.06) A Transformer-based Cross-modal Fusion Model with Adversarial Training, [Paper]
(arXiv 2021.07) Attention Bottlenecks for Multimodal Fusion, [Paper]
(arXiv 2021.07) Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots, [Paper]
(arXiv 2021.07) Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions, [Paper]
(arXiv 2021.07) Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, [Paper], [Code]
(arXiv 2021.08) StrucTexT: Structured Text Understanding with Multi-Modal Transformers, [Paper]
(arXiv 2021.08) Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations, [Paper]
(arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
(arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
(arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
(arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
(arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
(arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]
(arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper], [Code]
(arXiv 2021.10) VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing, [Paper]
(arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Code]
(arXiv 2021.10) Detecting Dementia from Speech and Transcripts using Transformers, [Paper]
(arXiv 2021.11) MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition, [Paper]
(arXiv 2021.11) VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]
(arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]
(arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]
(arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]
(arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code1], [Code2]
(arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]
(arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]
(arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]
(arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]
(arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]
(arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]
(arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper], [Code]
(arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]
(arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper], [Code]
(arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]
(arXiv 2021.12) VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling, [Paper]
(arXiv 2021.12) VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]
(arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]
(arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]
(arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]
(arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper], [Code]
(arXiv 2022.01) Robust Self-Supervised Audio-Visual Speech Recognition, [Paper], [Code]
(arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]
(arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]
(arXiv 2022.01) Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning, [Paper], [Code]
(arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Code]
(arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Code]
(arXiv 2022.01) A Pre-trained Audio-Visual Transformer for Emotion Recognition, [Paper]
(arXiv 2022.01) Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition, [Paper]
(arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]

Multi-view Stereo

(arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]
(arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]

NAS

(CVPR'21) HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers, [Paper], [Code]
(arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]
(arXiv 2021.03) BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search, [Paper], [Code]
(arXiv 2021.06) Vision Transformer Architecture Search, [Paper], [Code]
(arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [Paper]
(arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper]
(arXiv 2021.10) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [Paper]
(arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]
(arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]

Navigation

(ICLR'21) VTNet: Visual Transformer Network for Object Goal Navigation, [Paper]
(arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]
(arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]
(arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [Paper]
(arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]

OCR

(arXiv 2021.04) Handwriting Transformers, [Paper]
(arXiv 2021.05) I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition, [Paper]
(arXiv 2021.05) Vision Transformer for Fast and Efficient Scene Text Recognition, [Paper]
(arXiv 2021.06) DocFormer: End-to-End Transformer for Document Understanding, [Paper]
(arXiv 2021.08) A Transformer-based Math Language Model for Handwritten Math Expression Recognition, [Paper]
(arXiv 2021.09) TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, [Paper], [Code]
(arXiv 2021.10) Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks, [Paper], [Code]
(arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper]
(arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]
(arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]
(arXiv 2021.12) SPTS: Single-Point Text Spotting, [Paper]

Octree

(arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]

Panoptic Segmentation

(arXiv 2020.12) MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, [Paper]
(arXiv 2021.09) Panoptic SegFormer, [Paper]
(arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]
(arXiv 2021.10) An End-to-End Trainable Video Panoptic Segmentation Method using Transformers, [Paper]
(arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
(arXiv 2021.12) PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation, [Paper], [Code]

Point Cloud

(ICRA'21) NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation, [Paper]
(arXiv 2020.12) Point Transformer, [Paper]
(arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]
(arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]
(arXiv 2021.03) You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module, [Paper], [Code]
(arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]
(arXiv 2021.04) M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers, [Paper]
(arXiv 2021.04) Dual Transformer for Point Cloud Analysis, [Paper]
(arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]
(arXiv 2021.08) SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer, [Paper], [Code]
(arXiv 2021.08) PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds, [Paper], [Code]
(arXiv 2021.08) Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning, [Paper], [Code]
(arXiv 2021.08) PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers, [Paper], [Code]
(arXiv 2021.08) Improving 3D Object Detection with Channel-wise Transformer, [Paper], [Code]
(arXiv 2021.09) PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds, [Paper], [Code]
(arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper]
(arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]
(arXiv 2021.10) PatchFormer: A Versatile 3D Transformer Based on Patch Attention, [Paper]
(arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]
(arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]
(arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]
(arXiv 2021.11) Adaptive Channel Encoding Transformer for Point Cloud Analysis, [Paper], [Code]
(arXiv 2021.11) Fast Point Transformer, [Paper]
(arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]
(arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]

Pose

(arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]
(arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]
(arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]
(arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]
(arXiv 2021.03) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]
(arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper]
(arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]
(arXiv 2021.04) TokenPose: Learning Keypoint Tokens for Human Pose Estimation, [Paper]
(arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]
(arXiv 2021.04) HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction, [Paper]
(arXiv 2021.07) Test-Time Personalization with a Transformer for Human Pose Estimation, [Paper]
(arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
(arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
(arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]
(arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]
(arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]
(arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Code]
(arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]
(arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper], [Code]
(arXiv 2021.12) Towards Deep Learning-based 6D Bin Pose Estimation in 3D Scans, [Paper]
(arXiv 2021.12) End-to-End Learning of Multi-category 3D Pose and Shape Estimation, [Paper]
(arXiv 2022.01) Swin-Pose: Swin Transformer Based Human Pose Estimation, [Paper]
(arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]

Planning

(arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]

Pruning & Quantization

(arXiv 2021.04) Visual Transformer Pruning, [Paper]
(arXiv 2021.06) Post-Training Quantization for Vision Transformer, [Paper]
(arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper], [Code]
(arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper]
(arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]

Recognition

(arXiv 2021.03) Global Self-Attention Networks for Image Recognition, [Paper]
(arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]
(arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision?, [Paper]
(arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [Paper]
(arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [Paper]
(arXiv 2021.08) DPT: Deformable Patch-based Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.10) A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition, [Paper]
(arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper], [Code]
(arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]
(arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]
(arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]

Reconstruction

(arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]
(arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [Paper]
(arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [Paper]
(arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [Paper]
(arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]
(arXiv 2021.11) Reference-based Magnetic Resonance Image Reconstruction Using Texture Transformer, [Paper]
(arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]
(arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]

Re-identification

(arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]
(arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]
(arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]
(arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]
(arXiv 2021.06) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [Paper]
(arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [Paper]
(arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [Paper]
(arXiv 2021.07) Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification, [Paper], [Code]
(arXiv 2021.07) GiT: Graph Interactive Transformer for Vehicle Re-identification, [Paper]
(arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [Paper]
(arXiv 2021.09) Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification, [Paper]
(arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]
(arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]
(arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]
(arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]
(arXiv 2022.01) Short Range Correlation Transformer for Occluded Person Re-Identification, [Paper]

Restoration

(arXiv 2021.06) Uformer: A General U-Shaped Transformer for Image Restoration, [Paper], [Code]
(arXiv 2021.08) SwinIR: Image Restoration Using Swin Transformer, [Paper], [Code]
(arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]
(arXiv 2021.12) U2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper], [Code]
(arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]

Retrieval

(CVPR'21) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]
(arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]
(arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]
(arXiv 2021.03) Instance-level Image Retrieval using Reranking Transformers, [Paper]
(arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]
(arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]
(arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [Paper], [Code]
(arXiv 2021.06) Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features, [Paper]
(arXiv 2021.06) All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers, [Paper], [Code]
(arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]
(arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]

Salient Object Detection

(arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [Paper]
(arXiv 2021.04) Visual Saliency Transformer, [Paper]
(arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [Paper]
(arXiv 2021.08) Unifying Global-Local Representations in Salient Object Detection with Transformer, [Paper], [Code]
(arXiv 2021.08) TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network, [Paper], [Code]
(arXiv 2021.08) Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net, [Paper]
(arXiv 2021.12) Transformer-based Network for RGB-D Saliency Detection, [Paper]
(arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]
(arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]

Scene

(arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]
(arXiv 2021.05) SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation, [Paper]
(arXiv 2021.06) P2T: Pyramid Pooling Transformer for Scene Understanding, [Paper], [Code]
(arXiv 2021.07) Scenes and Surroundings: Scene Graph Generation using Relation Transformer, [Paper]
(arXiv 2021.07) Spatial-Temporal Transformer for Dynamic Scene Graph Generation, [Paper]
(arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]
(arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper]
(arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]
(arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]
(arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper], [Code]

Self-supervised Learning

(arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper], [Code]
(arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]
(arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper], [Code]
(arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper], [Code]
(arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]
(arXiv 2021.05) Self-Supervised Learning with Swin Transformers, [Paper], [Code]
(arXiv 2021.06) MST: Masked Self-Supervised Transformer for Visual Representation, [Paper]
(arXiv 2021.06) Efficient Self-supervised Vision Transformers for Representation Learning, [Paper]
(arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper]
(arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper], [Code]
(arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper], [Code]

Semantic Segmentation

(arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
(arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]
(arXiv 2021.05) Segmenter: Transformer for Semantic Segmentation, [Paper], [Code]
(arXiv 2021.06) SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.06) Fully Transformer Networks for Semantic Image Segmentation, [Paper]
(arXiv 2021.06) Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images, [Paper]
(arXiv 2021.06) OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments, [Paper]
(arXiv 2021.07) Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images, [Paper]
(arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [Paper]
(arXiv 2021.07) A Unified Efficient Pyramid Transformer for Semantic Segmentation, [Paper]
(arXiv 2021.08) Boosting Few-shot Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.08) Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer, [Paper], [Code]
(arXiv 2021.08) Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation, [Paper], [Code]
(arXiv 2021.08) Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance, [Paper], [Code]
(arXiv 2021.08) Evaluating Transformer based Semantic Segmentation Networks for Pathological Image Segmentation, [Paper]
(arXiv 2021.08) Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models, [Paper]
(arXiv 2021.09) Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation, [Paper]
(arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]
(arXiv 2021.11) Dynamically pruning segformer for efficient semantic segmentation, [Paper]
(arXiv 2021.11) APANet: Adaptive Prototypes Alignment Network for Few-Shot Semantic Segmentation, [Paper]
(arXiv 2021.11) Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers, [Paper]
(arXiv 2021.11) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]
(arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]
(arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]
(arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]
(arXiv 2022.01) Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation, [Paper]

Shape

(WACV'21) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]
(arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]

Super-Resolution

(CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]
(arXiv 2021.06) LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation, [Paper]
(arXiv 2021.06) Video Super-Resolution Transformer, [Paper], [Code]
(arXiv 2021.08) Light Field Image Super-Resolution with Transformers, [Paper], [Code]
(arXiv 2021.08) Efficient Transformer for Single Image Super-Resolution, [Paper]
(arXiv 2021.09) Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution, [Paper]
(arXiv 2021.12) Implicit Transformer Network for Screen Content Image Continuous Super-Resolution, [Paper]
(arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]
(arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

Synthesis

(arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]
(arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [Paper]
(arXiv 2021.05) High-Resolution Complex Scene Synthesis with Transformers, [Paper]
(arXiv 2021.06) The Image Local Autoregressive Transformer, [Paper]
(arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Project]
(arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]

Tracking

(EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]
(CVPR'21) Transformer Tracking, [Paper], [Code]
(CVPR'21) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]
(arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]
(arXiv 2021.03) TransCenter: Transformers with Dense Queries for Multiple-Object Tracking, [Paper]
(arXiv 2021.03) Learning Spatio-Temporal Transformer for Visual Tracking, [Paper], [Code]
(arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]
(arXiv 2021.04) Spatial-Temporal Graph Transformer for Multiple Object Tracking, [Paper]
(arXiv 2021.05) MOTR: End-to-End Multiple-Object Tracking with TRansformer, [Paper], [Code]
(arXiv 2021.05) TrTr: Visual Tracking with Transformer, [Paper], [Code]
(arXiv 2021.08) HiFT: Hierarchical Feature Transformer for Aerial Tracking, [Paper], [Code]
(arXiv 2021.10) Siamese Transformer Pyramid Networks for Real-Time UAV Tracking, [Paper], [Code]
(arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]
(arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]
(arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]

Traffic

(arXiv 2021.05) Novelty Detection and Analysis of Traffic Scenario Infrastructures in the Latent Space of a Vision Transformer-Based Triplet Autoencoder, [Paper]
(arXiv 2021.11) DetectorNet: Transformer-enhanced Spatial Temporal Graph Neural Network for Traffic Prediction, [Paper]
(arXiv 2021.11) ProSTformer: Pre-trained Progressive Space-Time Self-attention Model for Traffic Flow Forecasting, [Paper]
(arXiv 2022.01) SwinUNet3D -- A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers, [Paper], [Code]

Texture

(arXiv 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]

Transfer learning

(arXiv 2021.06) Transformer-Based Source-Free Domain Adaptation, [Paper], [Code]
(arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]
(arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]
(arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]

Video

(ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]
(ICLR'21) Support-set bottlenecks for video-text representation learning, [Paper]
(arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]
(arXiv 2021.02) Video Transformer Network, [Paper]
(arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]
(arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]
(arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]
(arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]
(arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]
(arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]
(arXiv 2021.03) ViViT: A Video Vision Transformer, [Paper]
(arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]
(arXiv 2021.04) Temporal Query Networks for Fine-grained Video Understanding, [Paper], [Project]
(arXiv 2021.04) Higher Order Recurrent Space-Time Transformer, [Paper], [Code]
(arXiv 2021.04) VideoGPT: Video Generation using VQ-VAE and Transformers, [Paper], [Code]
(arXiv 2021.04) VidTr: Video Transformer Without Convolutions, [Paper]
(arXiv 2021.05) Local Frequency Domain Transformer Networks for Video Prediction, [Paper]
(arXiv 2021.05) End-to-End Video Object Detection with Spatial-Temporal Transformers, [Paper], [Code]
(arXiv 2021.06) Anticipative Video Transformer, [Paper], [Project]
(arXiv 2021.06) TransVOS: Video Object Segmentation with Transformers, [Paper]
(arXiv 2021.06) Associating Objects with Transformers for Video Object Segmentation, [Paper]
(arXiv 2021.06) Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, [Paper]
(arXiv 2021.06) Space-time Mixing Attention for Video Transformer, [Paper]
(arXiv 2021.06) Video Instance Segmentation using Inter-Frame Communication Transformers, [Paper]
(arXiv 2021.06) Long-Short Temporal Contrastive Learning of Video Transformers, [Paper]
(arXiv 2021.06) Video Swin Transformer, [Paper], [Code]
(arXiv 2021.06) Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection, [Paper]
(arXiv 2021.07) Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation, [Paper], [Code]
(arXiv 2021.07) Generative Video Transformer: Can Objects be the Words? [Paper]
(arXiv 2021.07) Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection, [Paper]
(arXiv 2021.08) Token Shift Transformer for Video Classification, [Paper], [Code]
(arXiv 2021.08) Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering, [Paper]
(arXiv 2021.08) Video Relation Detection via Tracklet based Visual Transformer, [Paper], [Code]
(arXiv 2021.08) MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition, [Paper]
(arXiv 2021.08) ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos, [Paper]
(arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]
(arXiv 2021.09) Hierarchical Multimodal Transformer to Summarize Videos, [Paper]
(arXiv 2021.10) Object-Region Video Transformers, [Paper], [Code]
(arXiv 2021.10) Can't Fool Me: Adversarially Robust Transformer for Video Understanding, [Paper], [Code]
(arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]
(arXiv 2021.11) Sparse Adversarial Video Attacks with Spatial Transformations, [Paper], [Code]
(arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]
(arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]
(arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]
(arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]
(arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]
(arXiv 2021.12) TBN-ViT: Temporal Bilateral Network with Vision Transformer for Video Scene Parsing, [Paper]
(arXiv 2021.12) Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval, [Paper]
(arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper]
(arXiv 2021.12) A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code]
(arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]
(arXiv 2021.12) LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]
(arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]
(arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]
(arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]
(arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]
(arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]
(arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]
(arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]
(arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]

Visual Grounding

(arXiv 2021.04) TransVG: End-to-End Visual Grounding with Transformers, [Paper]
(arXiv 2021.05) Visual Grounding with Transformers, [Paper]
(arXiv 2021.06) Referring Transformer: A One-step Approach to Multi-task Visual Grounding, [Paper]
(arXiv 2021.08) Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding, [Paper]
(arXiv 2021.08) TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding, [Paper]
(arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]

Visual Reasoning

(arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]

Visual Relationship Detection

(arXiv 2021.04) RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory, [Paper]
(arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
(arXiv 2021.08) Discovering Spatial Relationships by Transformers for Domain Generalization, [Paper]

Voxel

(arXiv 2021.05) SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers, [Paper]
(arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]

Weakly Supervised Learning

(arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]
(arXiv 2022.01) CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization, [Paper]

Zero-Shot Learning

(arXiv 2021.08) Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning, [Paper]
(arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]
(arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]

Others

(CVPR'21) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]
(CVPR'21) Pre-Trained Image Processing Transformer, [Paper]
(ICCV'21) PlaneTR: Structure-Guided Transformers for 3D Plane Recovery, [Paper], [Code]
(arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Code]
(arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]
(arXiv 2021.01) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]
(arXiv 2021.04) Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, [Paper]
(arXiv 2021.04) Cloth Interactive Transformer for Virtual Try-On, [Paper], [Code]
(arXiv 2021.04) Fourier Image Transformer, [Paper], [Code]
(arXiv 2021.05) Attention for Image Registration (AiR): an unsupervised Transformer approach, [Paper]
(arXiv 2021.05) IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture, [Paper]
(arXiv 2021.05) CogView: Mastering Text-to-Image Generation via Transformers, [Paper]
(arXiv 2021.06) A Comparison for Anti-noise Robustness of Deep Learning Classification Methods on a Tiny Object Image Dataset: from Convolutional Neural Network to Visual Transformer and Performer, [Paper]
(arXiv 2021.06) Predicting Vehicles Trajectories in Urban Scenarios with Transformer Networks and Augmented Information, [Paper]
(arXiv 2021.06) StyTr2: Unbiased Image Style Transfer with Transformers, [Paper]
(arXiv 2021.06) Semantic Correspondence with Transformers, [Paper]
(arXiv 2021.06) Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue, [Paper]
(arXiv 2021.07) Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation, [Paper], [Code]
(arXiv 2021.07) Image Fusion Transformer, [Paper], [Code]
(arXiv 2021.07) PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution, [Paper]
(arXiv 2021.07) PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion, [Paper]
(arXiv 2021.08) Applications of Artificial Neural Networks in Microorganism Image Analysis: A Comprehensive Review from Conventional Multilayer Perceptron to Popular Convolutional Neural Network and Potential Visual Transformer, [Paper]
(arXiv 2021.08) Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, [Paper], [Code]
(arXiv 2021.08) The Right to Talk: An Audio-Visual Transformer Approach, [Paper], [Code]
(arXiv 2021.08) Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion, [Paper], [Code]
(arXiv 2021.08) Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper], [Code]
(arXiv 2021.08) TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.08) Investigating transformers in the decomposition of polygonal shapes as point collections, [Paper]
(arXiv 2021.08) Convolutional Neural Network (CNN) vs Visual Transformer (ViT) for Digital Holography, [Paper]
(arXiv 2021.08) Construction material classification on imbalanced datasets for construction monitoring automation using Vision Transformer (ViT) architecture, [Paper]
(arXiv 2021.08) Spatial Transformer Networks for Curriculum Learning, [Paper]
(arXiv 2021.09) TransforMesh: A Transformer Network for Longitudinal modeling of Anatomical Meshes, [Paper]
(arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]
(arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]
(arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]
(arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper], [Code]
(arXiv 2021.10) ProTo: Program-Guided Transformer for Program-Guided Tasks, [Paper]
(arXiv 2021.10) TranSalNet: Visual saliency prediction using transformers, [Paper]
(arXiv 2021.10) Development and testing of an image transformer for explainable autonomous driving systems, [Paper]
(arXiv 2021.10) Leveraging redundancy in attention with Reuse Transformers, [Paper]
(arXiv 2021.10) Tensor-to-Image: Image-to-Image Translation with Vision Transformers, [Paper]
(arXiv 2021.10) Accelerating Framework of Transformer by hardware Design and Model Compression Co-Optimization, [Paper]
(arXiv 2021.10) Vis-TOP: Visual Transformer Overlay Processor, [Paper]
(arXiv 2021.10) TNTC: two-stream network with transformer-based complementarity for gait-based emotion recognition, [Paper]
(arXiv 2021.11) The self-supervised channel-spatial attention-based transformer network for automated, accurate prediction of crop nitrogen status from UAV imagery, [Paper]
(arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]
(arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]
(arXiv 2021.11) U-shape Transformer for Underwater Image Enhancement, [Paper]
(arXiv 2021.11) Ice hockey player identification via transformers, [Paper]
(arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper]
(arXiv 2021.11) Attention-based Dual-stream Vision Transformer for Radar Gait Recognition, [Paper]
(arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]
(arXiv 2021.11) BuildFormer: Automatic building extraction with vision transformer, [Paper]
(arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]
(arXiv 2021.12) Transformer based trajectory prediction, [Paper]
(arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]
(arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]
(arXiv 2021.12) 3D Question Answering, [Paper]
(arXiv 2021.12) Light Field Neural Rendering, [Paper], [Project]
(arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]
(arXiv 2021.12) Nonlinear Transform Source-Channel Coding for Semantic Communications, [Paper]
(arXiv 2021.12) APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers, [Paper]
(arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]
(arXiv 2022.01) A Transformer-Based Siamese Network for Change Detection, [Paper], [Code]
(arXiv 2022.01) Learning class prototypes from Synthetic InSAR with Vision Transformers, [Paper]
(arXiv 2022.01) Swin transformers make strong contextual encoders for VHR image road extraction, [Paper]
(arXiv 2022.01) Technical Report for ICCV 2021 Challenge SSLAD-Track3B: Transformers Are Better Continual Learners, [Paper]
(arXiv 2022.01) Spectral Compressive Imaging Reconstruction Using Convolution and Spectral Contextual Transformer, [Paper]
(arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]
(arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]
(arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Code]
(arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Code]
(arXiv 2022.01) A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization, [Paper], [Code]
(arXiv 2022.01) Transformer-based SAR Image Despeckling, [Paper], [Code]
(arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]
(arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]
(arXiv 2022.01) Dual-Tasks Siamese Transformer Framework for Building Damage Assessment, [Paper]
(arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]
(arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]

Contact & Feedback

If you have any suggestions about this project, feel free to contact me.

[e-mail: yzhangcst[at]gmail.com]
