
🔥 [ECCV 2022, ICLR 2023] Powerful Multi-Task Transformer for Dense Scene Understanding


📜 Introduction

🎆 Update 2023.2: TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding has been accepted by ICLR 2023. We will release the code in this repository, including a Cityscapes-3D joint 2D-3D multi-task learning benchmark (segmentation, 3D detection, and depth estimation). Stay tuned!

This repository currently contains the code of our ECCV 2022 paper InvPT:

Hanrong Ye and Dan Xu, Inverted Pyramid Multi-task Transformer for Dense Scene Understanding. The Hong Kong University of Science and Technology (HKUST)

  • InvPT proposes a novel end-to-end Inverted Pyramid multi-task Transformer that models spatial positions and multiple tasks simultaneously in a unified framework.
  • InvPT presents an efficient UP-Transformer block that learns multi-task feature interaction at gradually increasing resolutions; the block also incorporates self-attention message passing and multi-scale feature aggregation to produce task-specific predictions at high resolution (see the conceptual sketch after the framework figure below).
  • InvPT achieves superior performance on the NYUD-v2 and PASCAL-Context datasets and significantly outperforms previous state-of-the-art methods.

Figure: InvPT enables joint learning and inference of global spatial interaction and simultaneous all-task interaction, which is critically important for multi-task dense prediction.

Figure: Framework overview of the proposed Inverted Pyramid Multi-task Transformer (InvPT) for dense scene understanding.
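
For intuition only, here is a minimal conceptual sketch of an UP-Transformer-style stage: upsample the stacked per-task token maps, then let all tasks and spatial positions interact through self-attention. This is not the authors' implementation; it omits the multi-scale feature aggregation and efficiency designs described in the paper, and all shapes and names are illustrative assumptions.

# Conceptual sketch only (NOT the authors' code): one stage that upsamples the
# stacked per-task token maps and applies cross-task, cross-position self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUpTransformerStage(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, hw):
        # x: (B, T*H*W, C) tokens from T tasks stacked along the sequence axis
        h, w = hw
        b, n, c = x.shape
        t = n // (h * w)
        # 2x spatial upsampling of every task's token map
        x = x.reshape(b * t, h, w, c).permute(0, 3, 1, 2)
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        h, w = 2 * h, 2 * w
        x = x.permute(0, 2, 3, 1).reshape(b, t * h * w, c)
        # self-attention message passing across all tasks and spatial positions
        q = self.norm(x)
        y, _ = self.attn(q, q, q)
        return x + y, (h, w)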

😎 Demo

(Demo video: compress_final_demo_small.mp4)

To qualitatively demonstrate the strong performance and generalization ability of our multi-task model InvPT, we further examine its multi-task predictions for dense scene understanding in new scenes. Specifically, we train InvPT on the PASCAL-Context dataset (with 4,998 training images) and generate predictions for video frames from the DAVIS dataset without any fine-tuning. InvPT yields good results on this new dataset despite its distinct data distribution. Watch a clearer version of the demo here!

📺 News

🚩 Updates

  • ✅ July 18, 2022: Updated with InvPT models trained on the PASCAL-Context and NYUD-v2 datasets!

😀 Train your InvPT!

1. Build recommended environment

For easier usage, we re-implement InvPT with a clean training framework. Here is a known-good way to set up the recommended environment:

conda create -n invpt python=3.7
conda activate invpt
pip install tqdm Pillow easydict pyyaml imageio scikit-image tensorboard
pip install opencv-python==4.5.4.60 setuptools==59.5.0

# An example of installing pytorch-1.10.0 with CUDA 11.1
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html

pip install timm==0.5.4 einops==0.4.1
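
Optionally, a quick sanity check (not part of the official setup steps) confirms that the pinned packages are importable and that PyTorch sees your GPU:

# Optional environment sanity check.
import torch, torchvision, timm, einops, cv2

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('torchvision:', torchvision.__version__)
print('timm:', timm.__version__, '| einops:', einops.__version__, '| opencv:', cv2.__version__)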

2. Get data

We use the same data (PASCAL-Context and NYUD-v2) as ATRC. You can download the data by:

wget https://data.vision.ee.ethz.ch/brdavid/atrc/NYUDv2.tar.gz
wget https://data.vision.ee.ethz.ch/brdavid/atrc/PASCALContext.tar.gz

Then extract the datasets with:

tar xfvz NYUDv2.tar.gz
tar xfvz PASCALContext.tar.gz

You need to specify the dataset directory as the db_root variable in configs/mypath.py.
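
As an illustration only (the actual layout of configs/mypath.py may differ), the variable could look like this, with the path replaced by your own dataset directory:

# Illustrative only; check the actual contents of configs/mypath.py in the repo.
db_root = '/path/to/your/data/'   # should contain the extracted PASCALContext/ and NYUDv2/ folders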

3. Train the model

The config files are defined in ./configs; the output directory is also set in your config file.

As an example, we provide the training script for the best-performing InvPT model with a ViT-L backbone. To start training, simply run:

bash run.sh # for training on PASCAL-Context dataset. 

or

bash run_nyud.sh # for training on NYUD-v2 dataset.

after specifying your devices and config in run.sh. This framework supports DDP (DistributedDataParallel) for multi-GPU training.

All models are defined in models/, so it should be easy to plug your own model into this framework (a rough interface sketch follows).
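
The exact interface a model must expose is defined by the existing implementations in models/; purely as a hypothetical illustration, a multi-task model of this kind typically maps an image batch to a dictionary of per-task predictions at the input resolution:

# Hypothetical skeleton of a multi-task model (illustration only; mirror the
# interface of the existing models in models/ when adding your own).
import torch.nn as nn
import torch.nn.functional as F

class ToyMultiTaskModel(nn.Module):
    def __init__(self, num_outputs):
        # num_outputs: dict mapping task name -> number of output channels,
        # e.g. {'semseg': 40, 'depth': 1} for NYUD-v2-style tasks
        super().__init__()
        self.backbone = nn.Conv2d(3, 64, kernel_size=3, stride=4, padding=1)  # stand-in for a ViT backbone
        self.heads = nn.ModuleDict({t: nn.Conv2d(64, c, kernel_size=1) for t, c in num_outputs.items()})

    def forward(self, x):
        feat = self.backbone(x)
        # one prediction per task, upsampled back to the input resolution
        return {t: F.interpolate(head(feat), size=x.shape[-2:], mode='bilinear', align_corners=False)
                for t, head in self.heads.items()}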

4. Evaluate the model

The training script itself includes evaluation. To run inference with pre-trained models, change run_mode in run.sh to infer.

Special evaluation for boundary detection

We follow previous works and use the MATLAB-based SEISM project to compute the optimal dataset F-measure (odsF) scores. The evaluation code saves the boundary detection predictions to disk.

Specifically, identical to ATRC and ASTMT, we use maxDist=0.0075 for PASCAL-Context and maxDist=0.011 for NYUD-v2, with the HED thresholds (under seism/parameters/HED.txt) and read_one_cont_png as the IO function in SEISM.

🥳 Pre-trained InvPT models

To help the community reproduce our state-of-the-art results, we re-trained our best-performing models with the training code in this repository and provide the weights for researchers.

Download pre-trained models

| Version | Dataset | Download | Segmentation (mIoU ↑) | Human parsing (mIoU ↑) | Saliency (maxF ↑) | Normals (mErr ↓) | Boundary (odsF ↑) |
|---|---|---|---|---|---|---|---|
| InvPT* | PASCAL-Context | google drive, onedrive | 79.91 | 68.54 | 84.38 | 13.90 | 72.90 |
| InvPT (our paper) | PASCAL-Context | - | 79.03 | 67.61 | 84.81 | 14.15 | 73.00 |
| ATRC (ICCV 2021) | PASCAL-Context | - | 67.67 | 62.93 | 82.29 | 14.24 | 72.42 |

| Version | Dataset | Download | Segmentation (mIoU ↑) | Depth (RMSE ↓) | Normals (mErr ↓) | Boundary (odsF ↑) |
|---|---|---|---|---|---|---|
| InvPT* | NYUD-v2 | google drive, onedrive | 53.65 | 0.5083 | 18.68 | 77.80 |
| InvPT (our paper) | NYUD-v2 | - | 53.56 | 0.5183 | 19.04 | 78.10 |
| ATRC (ICCV 2021) | NYUD-v2 | - | 46.33 | 0.5363 | 20.18 | 77.94 |

*: reproduced results

Infer with the pre-trained models

Simply set the pre-trained model path in run.sh by adding --trained_model pretrained_model_path. You also need to change run_mode in run.sh to infer.
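
If you want to verify a downloaded checkpoint before pointing run.sh at it, a quick inspection along these lines can help (the internal key names are an assumption and may differ from the actual checkpoint layout):

# Peek inside a downloaded checkpoint (illustrative; key names may differ).
import torch

ckpt = torch.load('pretrained_model_path', map_location='cpu')  # path placeholder from run.sh above
if isinstance(ckpt, dict):
    print('top-level keys:', list(ckpt.keys())[:10])
    state = ckpt.get('model', ckpt.get('state_dict', ckpt))
    print('number of entries:', len(state))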

Generate multi-task predictions from any image

To generate multi-task predictions from an image with the pre-trained InvPT model on PASCAL-Context, please use inference.py. An example command is:

CUDA_VISIBLE_DEVICES=0 python inference.py --image_path=IMAGE_PATH --ckp_path=CKP_PATH --save_dir=SAVE_DIR
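
To run the script over a whole folder of images, a small wrapper like the one below works; it relies only on the --image_path, --ckp_path, and --save_dir flags shown above, and all paths are placeholders:

# Batch wrapper around inference.py (uses only the flags documented above).
import glob, os, subprocess

CKP_PATH = '/path/to/pretrained_model'   # placeholder for CKP_PATH above
SAVE_DIR = '/path/to/save_dir'           # placeholder for SAVE_DIR above
os.makedirs(SAVE_DIR, exist_ok=True)

for image_path in sorted(glob.glob('/path/to/images/*.jpg')):   # placeholder image folder
    subprocess.run(['python', 'inference.py',
                    '--image_path=' + image_path,
                    '--ckp_path=' + CKP_PATH,
                    '--save_dir=' + SAVE_DIR],
                   check=True)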

🤗 Cite

BibTeX:

@InProceedings{invpt2022,
  title={Inverted Pyramid Multi-task Transformer for Dense Scene Understanding},
  author={Ye, Hanrong and Xu, Dan},
  booktitle={ECCV},
  year={2022}
}
@InProceedings{taskprompter2023,
  title={TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding},
  author={Ye, Hanrong and Xu, Dan},
  booktitle={ICLR},
  year={2023}
}

Please consider 🌟 starring our project to share it with your community if you find this repository helpful!

😊 Contact

Please contact Hanrong Ye if you have any questions.

๐Ÿ‘ Acknowledgement

This repository borrows partial code from MTI-Net and ATRC.

๐Ÿ•ด๏ธ License

This project is released under a Creative Commons license that allows personal and research use only.

For commercial use, please contact the authors.
