CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Introduction

This is an official release of the paper CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction.

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction,
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy
BibTeX

TODO

  • Code and models of CLIPSelf
  • Code and models of F-ViT
  • Support F-ViT under the ovdet repo using MMDetection 3.x

Installation

This project is adapted from OpenCLIP-v2.16.0. Run the following command to install the package:

pip install -e . -v
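
A complete setup from scratch might look like the sketch below; the repository URL and the environment name are assumptions, not taken from this README.

# minimal setup sketch; repository URL and environment name are assumptions
git clone https://github.com/wusize/CLIPSelf.git
cd CLIPSelf
conda create -n clipself python=3.9 -y
conda activate clipself
pip install -e . -v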

Data Preparation

The main experiments use images from the COCO and LVIS datasets. Please prepare the datasets and organize them as follows:

CLIPSelf/
├── data
    ├── coco
        ├── annotations
            ├── instances_train2017.json  # the box annotations are not used
            ├── panoptic_val2017.json
            ├── panoptic_val2017     # panoptic masks
        ├── train2017
        ├── val2017
        ├── coco_pseudo_4764.json    # to run RegionCLIP
        ├── coco_proposals.json      # to run CLIPSelf with region proposals
    ├── lvis_v1
        ├── annotations
            ├── lvis_v1_train.json  # the box annotations are not used
        ├── train2017    # the same images as COCO
        ├── val2017      # the same images as COCO

To run CLIPSelf with region proposals or RegionCLIP (which uses region-text pairs), obtain coco_proposals.json or coco_pseudo_4764.json, respectively, from Drive, and put the JSON files under data/coco.
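
As a sketch of the download step: the COCO images and annotations are available from the official COCO mirrors, and since the LVIS split reuses COCO images, train2017 and val2017 can simply be symlinked. The LVIS annotation URL below is an assumption; verify it on the LVIS website.

# official COCO image and annotation zips
wget -P data/coco http://images.cocodataset.org/zips/train2017.zip
wget -P data/coco http://images.cocodataset.org/zips/val2017.zip
wget -P data/coco/annotations http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget -P data/coco/annotations http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip
# assumed LVIS v1 annotation location; check https://www.lvisdataset.org
wget -P data/lvis_v1/annotations https://dl.fbaipublicfiles.com/LVIS/lvis_v1_train.json.zip
# after unzipping the archives above, reuse the COCO images for LVIS
ln -s "$(pwd)/data/coco/train2017" data/lvis_v1/train2017
ln -s "$(pwd)/data/coco/val2017" data/lvis_v1/val2017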

Run

Original Models

To run CLIPSelf, first obtain the original models from EVA-02-CLIP and put them under checkpoints/ as follows:

CLIPSelf/
├── checkpoints
    ├── EVA02_CLIP_B_psz16_s8B.pt
    ├── EVA02_CLIP_L_336_psz14_s6B.pt
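
If the EVA-02-CLIP weights are fetched from the Hugging Face Hub, the commands might look as follows; the QuanSun/EVA-CLIP repository is an assumption, so follow the links on the EVA-02-CLIP page if they differ.

mkdir -p checkpoints
# assumed hosting location; check the EVA-02-CLIP page for the canonical links
wget -P checkpoints https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA02_CLIP_B_psz16_s8B.pt
wget -P checkpoints https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA02_CLIP_L_336_psz14_s6B.pt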
    

Training and Testing

We provide scripts to train CLIPSelf and RegionCLIP under scripts/; they are summarized in the table below:

| #  | Model    | Method     | Proposals | Training Data | Script | Checkpoint |
|----|----------|------------|-----------|---------------|--------|------------|
| 1  | ViT-B/16 | CLIPSelf   | -         | COCO          | script | model      |
| 2  | ViT-B/16 | CLIPSelf   | +         | COCO          | script | model      |
| 3  | ViT-B/16 | RegionCLIP | +         | COCO          | script | model      |
| 4  | ViT-L/14 | CLIPSelf   | -         | COCO          | script | model      |
| 5  | ViT-L/14 | CLIPSelf   | +         | COCO          | script | model      |
| 6  | ViT-L/14 | RegionCLIP | +         | COCO          | script | model      |
| 7  | ViT-B/16 | CLIPSelf   | -         | LVIS          | script | model      |
| 8  | ViT-L/14 | CLIPSelf   | -         | LVIS          | script | model      |

For example, to refine ViT-B/16 with CLIPSelf using only image patches on COCO, simply run:

bash scripts/train_clipself_coco_image_patches_eva_vitb16.sh    # 1

We also provide the checkpoints of the experiments listed above on Drive. They can be organized as follows:

CLIPSelf/
├── checkpoints
    ├── eva_vitb16_coco_clipself_patches.pt     # 1
    ├── eva_vitb16_coco_clipself_proposals.pt   # 2
    ├── eva_vitb16_coco_regionclip.pt           # 3
    ├── eva_vitl14_coco_clipself_patches.pt     # 4
    ├── eva_vitl14_coco_clipself_proposals.pt   # 5
    ├── eva_vitl14_coco_regionclip.pt           # 6
    ├── eva_vitb16_lvis_clipself_patches.pt     # 7
    ├── eva_vitl14_lvis_clipself_patches.pt     # 8

To evaluate a ViT-B/16 model, run:

bash scripts/test_eva_vitb16_macc_boxes_masks.sh name_of_the_test path/to/checkpoint.pt

To evaluate a ViT-L/14 model, run:

bash scripts/test_eva_vitl14_macc_boxes_masks.sh name_of_the_test path/to/checkpoint.pt
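
For example, to evaluate checkpoint #1 from the table above (the first argument is an arbitrary name for the test run):

bash scripts/test_eva_vitb16_macc_boxes_masks.sh coco_clipself_patches checkpoints/eva_vitb16_coco_clipself_patches.pt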

F-ViT

Go to the folder CLIPSelf/F-ViT and follow the instructions in the README there.

License

This project is licensed under NTU S-Lab License 1.0.

Citation

@article{wu2023clipself,
    title={CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction},
    author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Xiangtai Li and Wentao Liu and Chen Change Loy},
    journal={arXiv preprint arXiv:2310.01403},
    year={2023}
}

Acknowledgement

We thank OpenCLIP, EVA-CLIP and MMDetection for their valuable codebases.
