
🦖 OV-DINO

Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang1,2, Pengzhen Ren1, Zequn Jie3, Xiao Dong1, Chengjian Feng3, Yinlong Qian3,

Lin Ma3, Dongmei Jiang2, Yaowei Wang2,4, Xiangyuan Lan2📧, Xiaodan Liang1,2📧

1 Sun Yat-sen University, 2 Pengcheng Lab, 3 Meituan Inc, 4 HIT, Shenzhen

📧 Corresponding author.

[Paper] [HuggingFace] [Demo] [BibTeX]


🔥 Updates

  • 06/08/2024: 🎇 Awesome!!! OV-SAM = OV-DINO + SAM2. We have added OV-SAM, marrying OV-DINO with SAM2, to the online demo.

  • 16/07/2024: We provide the online demo, click and enjoy! NOTE: Your uploaded images will be stored for failure analysis.

  • 16/07/2024: We release the web inference demo; try deploying it yourself.

  • 15/07/2024: We release the fine-tuning code; try fine-tuning on your custom dataset. Feel free to raise an issue if you encounter any problems.

  • 15/07/2024: We release the local inference demo; try deploying OV-DINO on your local machine and running inference on images.

  • 14/07/2024: We release the pre-trained models and the evaluation code.

  • 11/07/2024: We release the OV-DINO paper on arXiv. Code and pre-trained models are coming soon.

🚀 Introduction

This project contains the official PyTorch implementation, pre-trained models, fine-tuning code, and inference demo for OV-DINO.

  • OV-DINO is a novel unified open-vocabulary detection approach that offers superior performance and effectiveness for practical real-world applications.

  • OV-DINO entails a Unified Data Integration pipeline that integrates diverse data sources for end-to-end pre-training, and a Language-Aware Selective Fusion module to improve the vision-language understanding of the model.

  • OV-DINO shows significant performance improvement on COCO and LVIS benchmarks compared to previous methods, achieving relative improvements of +2.5% AP on COCO and +12.7% AP on LVIS compared to G-DINO in zero-shot evaluation.

📄 Overview

✨ Model Zoo

| Model | Pre-Train Data | APmv | APr | APc | APf | APval | APr | APc | APf | APcoco | Weights |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OV-DINO1 | O365 | 24.4 | 15.5 | 20.3 | 29.7 | 18.7 | 9.3 | 14.5 | 27.4 | 49.5 / 57.5 | HF CKPT 🤗 |
| OV-DINO2 | O365,GoldG | 39.4 | 32.0 | 38.7 | 41.3 | 32.2 | 26.2 | 30.1 | 37.3 | 50.6 / 58.4 | HF CKPT 🤗 |
| OV-DINO3 | O365,GoldG,CC1M‡ | 40.1 | 34.5 | 39.5 | 41.5 | 32.9 | 29.1 | 30.4 | 37.4 | 50.2 / 58.2 | HF CKPT 🤗 |

NOTE: APmv denotes zero-shot evaluation results on LVIS MiniVal, APval denotes zero-shot evaluation results on LVIS Val, and APcoco denotes the (zero-shot / fine-tuned) evaluation results on COCO.

๐Ÿ Getting Started

1. Project Structure

OV-DINO
├── datas
│   ├── coco
│   │   ├── annotations
│   │   ├── train2017
│   │   └── val2017
│   ├── lvis
│   │   ├── annotations
│   │   ├── train2017
│   │   └── val2017
│   └── custom
│       ├── annotations
│       ├── train
│       └── val
├── docs
├── inits
│   ├── huggingface
│   ├── ovdino
│   ├── sam2
│   └── swin
├── ovdino
│   ├── configs
│   ├── demo
│   ├── detectron2-717ab9
│   ├── detrex
│   ├── projects
│   ├── scripts
│   └── tools
└── wkdrs
    └── ...

2. Installation

# clone this project
git clone https://github.com/wanghao9610/OV-DINO.git
cd OV-DINO
export root_dir=$(realpath ./)
cd $root_dir/ovdino

# create conda env for ov-dino
conda create -n ovdino -y
conda activate ovdino
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia -y
conda install gcc=9 gxx=9 -c conda-forge -y # Optional: install gcc9
python -m pip install -e detectron2-717ab9
pip install -e ./

# Optional: create a conda env for ov-sam; it may not be compatible with ov-dino, so we create a new env.
# ov-sam = ov-dino + sam2
conda create -n ovsam -y
conda activate ovsam
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# install sam2 by following the sam2 project:
# please refer to https://github.com/facebookresearch/segment-anything-2.git
# download sam2 checkpoints and put them to inits/sam2
python -m pip install -e detectron2-717ab9
pip install -e ./

3. Data Preparation

COCO

  • Download COCO from the official website, and put the archives in the datas/coco folder.
    cd $root_dir
    wget http://images.cocodataset.org/zips/train2017.zip -O datas/coco/train2017.zip
    wget http://images.cocodataset.org/zips/val2017.zip -O datas/coco/val2017.zip
    wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip -O datas/coco/annotations_trainval2017.zip
  • Extract the zipped files, and remove the archives:
    cd $root_dir
    unzip datas/coco/train2017.zip -d datas/coco
    unzip datas/coco/val2017.zip -d datas/coco
    unzip datas/coco/annotations_trainval2017.zip -d datas/coco
    rm datas/coco/train2017.zip datas/coco/val2017.zip datas/coco/annotations_trainval2017.zip
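After extraction, the layout should match the datas/coco entry in the project structure above. As a sanity check, a hypothetical helper (not part of this repo) could report any missing sub-directories:

```python
from pathlib import Path

def missing_coco_dirs(root):
    """Return the expected COCO sub-directories that are absent under root."""
    expected = ["annotations", "train2017", "val2017"]
    return [sub for sub in expected if not (Path(root) / sub).is_dir()]

# An empty list means the layout is complete.
print(missing_coco_dirs("datas/coco"))
```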

LVIS

  • Download LVIS annotation files:
    cd $root_dir
    wget https://huggingface.co/hao9610/OV-DINO/resolve/main/lvis_v1_minival_inserted_image_name.json -O datas/lvis/annotations/lvis_v1_minival_inserted_image_name.json
    wget https://huggingface.co/hao9610/OV-DINO/resolve/main/lvis_v1_val_inserted_image_name.json -O datas/lvis/annotations/lvis_v1_val_inserted_image_name.json
  • Soft-link COCO to LVIS:
    cd $root_dir
    ln -s $(realpath datas/coco/train2017) datas/lvis
    ln -s $(realpath datas/coco/val2017) datas/lvis

4. Evaluation

Download the pre-trained models from the Model Zoo, and put them in the inits/ovdino directory.

cd $root_dir/ovdino
sh scripts/eval.sh path_to_eval_config_file path_to_pretrained_model output_directory

Zero-Shot Evaluation on COCO Benchmark

cd $root_dir/ovdino
sh scripts/eval.sh \
  projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_coco.py \
  ../inits/ovdino/ovdino_swint_og-coco50.6_lvismv39.4_lvis32.2.pth \
  ../wkdrs/eval_ovdino

Zero-Shot Evaluation on LVIS Benchmark

cd $root_dir/ovdino
sh scripts/eval.sh \
  projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_lvismv.py \
  ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
  ../wkdrs/eval_ovdino

sh scripts/eval.sh \
  projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_lvis.py \
  ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
  ../wkdrs/eval_ovdino

5. Fine-Tuning

Fine-Tuning on COCO Dataset

cd $root_dir/ovdino
sh scripts/train.sh \
  projects/ovdino/configs/ovdino_swin_tiny224_bert_base_ft_coco_24ep.py \
  ../inits/ovdino/ovdino_swint_og-coco50.6_lvismv39.4_lvis32.2.pth

Fine-Tuning on Custom Dataset

  • Prepare your custom dataset in the COCO annotation format.

  • Refer to the following command to run fine-tuning.

    cd $root_dir/ovdino
    sh scripts/train.sh \
      projects/ovdino/configs/ovdino_swin_tiny224_bert_base_ft_custom_24ep.py \
      ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth
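For reference, a minimal COCO-format annotation file has three top-level lists; the sketch below shows the required fields, where all file names, class names, and sizes are placeholders rather than values the repo expects:

```python
import json

# Minimal COCO-format annotation sketch; names and sizes are placeholders.
coco_ann = {
    "images": [
        {"id": 1, "file_name": "0001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in absolute pixels; area = w * h.
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 120, 200, 150], "area": 30000, "iscrowd": 0},
    ],
    "categories": [
        {"id": 1, "name": "my_class", "supercategory": "none"},
    ],
}

with open("instances_train.json", "w") as f:
    json.dump(coco_ann, f)
```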

💻 Demo

  • Local inference on an image or a folder, given the category names.

    # for ovdino: conda activate ovdino
    # for ovsam: conda activate ovsam
    cd $root_dir/ovdino
    sh scripts/demo.sh demo_config.py pretrained_model category_names input_images_or_directory output_directory

    Examples:

    cd $root_dir/ovdino
    # single image inference
    sh scripts/demo.sh \
      projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py \
      ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
      "class0 class1 class2 ..." img0.jpg output_dir/img0_vis.jpg
    
    # multi-image inference
    sh scripts/demo.sh \
      projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py \
      ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
      "class0 long_class1 long_class2 ..." "img0.jpg img1.jpg" output_dir
    
    # image folder inference
    sh scripts/demo.sh \
      projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py \
      ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
      "class0 long_class1 long_class2 ..." image_dir output_dir

    NOTE: the input category_names are separated by spaces, and the words of a multi-word class name are joined with underscores (_).

  • Web inference demo.

    cd $root_dir/ovdino
    sh scripts/app.sh \
      projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py \
      ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth

    After the web demo is deployed, you can open it in your browser.

    We also provide the online demo; click and enjoy.
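The category-name convention above (space-separated classes, underscores within a class) can be sketched as follows; parse_category_names is a hypothetical illustration, not a function from this repo:

```python
def parse_category_names(arg):
    """Split the space-separated argument and turn underscores back into spaces."""
    return [name.replace("_", " ") for name in arg.split()]

print(parse_category_names("person traffic_light fire_hydrant"))
# → ['person', 'traffic light', 'fire hydrant']
```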

✅ TODO

  • Release the pre-trained model.
  • Release the fine-tuning and evaluation code.
  • Support the local inference demo.
  • Support the web inference demo.
  • Support OV-DINO in 🤗 transformers.
  • Release the pre-training code.

😊 Acknowledgments

This project references several excellent open-source repos (Detectron2, detrex, GLIP, G-DINO, YOLO-World). Thanks for their wonderful work and contributions to the community.

📌 Citation

If you find OV-DINO helpful for your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@article{wang2024ovdino,
  title={OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion}, 
  author={Hao Wang and Pengzhen Ren and Zequn Jie and Xiao Dong and Chengjian Feng and Yinlong Qian and Lin Ma and Dongmei Jiang and Yaowei Wang and Xiangyuan Lan and Xiaodan Liang},
  journal={arXiv preprint arXiv:2407.07844},
  year={2024}
}
