codet's Introduction

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection,
Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
NeurIPS 2023 (https://arxiv.org/abs/2310.16667)
Project page (https://codet-ovd.github.io)

Features

  • Train an open-vocabulary detector with web-scale image-text pairs
  • Align regions and words by co-occurrence instead of region-text similarity
  • State-of-the-art performance on open-vocabulary LVIS
  • Deployed with modern visual foundation models
  • Integrated with Roboflow to automatically label images for training a small, fine-tuned model

Installation

Setup environment

conda create --name codet python=3.8 -y && conda activate codet
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
git clone https://github.com/CVMI-Lab/CoDet.git

Install Apex and xFormers (you can skip this part if you do not use the EVA-02 backbone)

pip install ninja
pip install -v -U git+https://github.com/facebookresearch/xformers.git@7e05e2caaaf8060c1c6baadc2b04db02d5458a94
git clone https://github.com/NVIDIA/apex && cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./ && cd ..

Install detectron2 and other dependencies

cd CoDet/third_party/detectron2
pip install -e .
cd ../..
pip install -r requirements.txt

Prepare Datasets

We use LVIS and Conceptual Captions (CC3M) for OV-LVIS experiments, COCO for OV-COCO experiments, and Objects365 for cross-dataset evaluation. Before processing, please download the datasets you need from their official websites and place or symlink them under CoDet/datasets/. CoDet/datasets/metadata/ holds the preprocessed metadata (included in the repo). Please refer to DATA.md for more details.

$CoDet/datasets/
    metadata/
    lvis/
    coco/
    cc3m/
    objects365/
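
One way to get this layout without copying data is to symlink each downloaded dataset into CoDet/datasets/. The following is a minimal sketch, assuming the datasets already live somewhere else on disk; the helper name `link_datasets` and the source paths are hypothetical and not part of the repo.

```python
import os
from pathlib import Path

# Dataset folder names expected under CoDet/datasets/ (metadata/ ships with the repo).
EXPECTED = ("lvis", "coco", "cc3m", "objects365")


def link_datasets(sources, datasets_root):
    """Symlink each downloaded dataset into datasets_root.

    sources: dict mapping an expected folder name to its download location.
    Returns the names that were newly linked.
    """
    root = Path(datasets_root)
    root.mkdir(parents=True, exist_ok=True)
    linked = []
    for name in EXPECTED:
        src = sources.get(name)
        if src is None:
            continue  # dataset not selected, skip it
        dst = root / name
        if dst.exists() or dst.is_symlink():
            continue  # leave existing folders or links untouched
        os.symlink(Path(src).resolve(), dst, target_is_directory=True)
        linked.append(name)
    return linked
```

For example, `link_datasets({"coco": "/data/coco"}, "CoDet/datasets")` would create only CoDet/datasets/coco and leave the other entries alone (the path /data/coco is a placeholder).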

Model Zoo

OV-COCO

Backbone   Box AP50   Box AP50_novel   Config                     Model
ResNet50   46.8       30.6             CoDet_OVCOCO_R50_1x.yaml   ckpt

OV-LVIS

Backbone   Mask mAP   Mask mAP_novel   Config                             Model
ResNet50   31.3       23.7             CoDet_OVLVIS_R5021k_4x_ft4x.yaml   ckpt
Swin-B     39.2       29.4             CoDet_OVLVIS_SwinB_4x_ft4x.yaml    ckpt
EVA02-L    44.7       37.0             CoDet_OVLVIS_EVA_4x.yaml           ckpt

Inference

To test with custom images/videos, run

python demo.py --config-file [config_file] --input [your_image_file] --output [output_file_path] --vocabulary lvis --opts MODEL.WEIGHTS [model_weights]

Or you can customize the test vocabulary, e.g.,

python demo.py --config-file [config_file] --input [your_image_file] --output [output_file_path] --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffee --confidence-threshold 0.3 --opts MODEL.WEIGHTS [model_weights]
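
The value passed to --custom_vocabulary is a comma-separated list of class names. The sketch below shows one way such a string could be normalized before use. `parse_vocabulary` is a hypothetical helper, not a function from demo.py, and the parsing in the repo may differ.

```python
def parse_vocabulary(raw):
    """Split a comma-separated vocabulary string into clean, unique class names."""
    names = []
    for part in raw.split(","):
        name = part.strip()
        if name and name not in names:  # drop empty entries and duplicates
            names.append(name)
    return names


print(parse_vocabulary("headphone, webcam,paper,,coffee"))
# ['headphone', 'webcam', 'paper', 'coffee']
```

If a class name contains spaces, quote the whole list on the shell so it reaches the script as a single argument.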

To evaluate a pre-trained model, run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config --eval-only MODEL.WEIGHTS /path/to/ckpt

To evaluate a pre-trained model on Objects365 (cross-dataset evaluation), run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config --eval-only MODEL.WEIGHTS /path/to/ckpt DATASETS.TEST "('objects365_v2_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/o365_clip_a+cnamefix.npy',)" MODEL.TEST_NUM_CLASSES "(365,)" MODEL.MASK_ON False

Training

Training configurations used by the paper are listed in CoDet/configs. Most config files require pre-trained model weights for initialization (indicated by MODEL.WEIGHTS in the config file). Please train or download the corresponding pre-trained models and place them under CoDet/models/ before training.

Name                                        Model
resnet50_miil_21k.pkl                       ResNet50-21K pretrain from MIIL
swin_base_patch4_window7_224_22k.pkl        SwinB-21K pretrain from Swin-Transformer
eva02_L_pt_m38m_p14to16.pt                  EVA02-L mixed 38M pretrain from EVA
BoxSup_OVCOCO_CLIP_R50_1x.pth               ResNet50 COCO base class pretrain from Detic
BoxSup-C2_Lbase_CLIP_R5021k_640b64_4x.pth   ResNet50 LVIS base class pretrain from Detic
BoxSup-C2_Lbase_CLIP_SwinB_896b32_4x.pth    SwinB LVIS base class pretrain from Detic
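
Before launching a run, it can help to confirm that the weights you need are actually in CoDet/models/. This is a minimal sketch, assuming the file names listed above; `missing_weights` is a hypothetical helper, not part of the repo. A given config needs only the weights named by its own MODEL.WEIGHTS entry, so pass the relevant subset as `required`.

```python
from pathlib import Path

# File names taken from the list of pre-trained models above.
PRETRAINED = (
    "resnet50_miil_21k.pkl",
    "swin_base_patch4_window7_224_22k.pkl",
    "eva02_L_pt_m38m_p14to16.pt",
    "BoxSup_OVCOCO_CLIP_R50_1x.pth",
    "BoxSup-C2_Lbase_CLIP_R5021k_640b64_4x.pth",
    "BoxSup-C2_Lbase_CLIP_SwinB_896b32_4x.pth",
)


def missing_weights(models_dir, required=PRETRAINED):
    """Return the required weight files that are not present in models_dir."""
    root = Path(models_dir)
    return [name for name in required if not (root / name).is_file()]
```

For example, `missing_weights("CoDet/models", required=["resnet50_miil_21k.pkl"])` returns an empty list once that file is in place.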

To train on a single node, run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config

Note: By default, we use 8 V100 GPUs for training with ResNet50 or SwinB, and 16 A100 GPUs for training with EVA02-L. Please remember to re-scale the learning rate accordingly if you train with a different number of GPUs.
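
A common convention for this re-scaling is the linear scaling rule: change the learning rate in proportion to the total batch size. The sketch below assumes that rule and assumes the per-GPU batch size stays fixed, so the total batch size follows the GPU count. The README does not say which rule the authors intend, `rescale_solver` is a hypothetical helper, and the reference values in the usage lines are placeholders rather than values from the CoDet configs.

```python
def rescale_solver(base_lr, ims_per_batch, ref_gpus, num_gpus):
    """Scale learning rate and total batch size linearly with the GPU count."""
    if ims_per_batch % ref_gpus != 0:
        raise ValueError("reference batch size must divide evenly across reference GPUs")
    per_gpu = ims_per_batch // ref_gpus  # images per GPU in the reference setup
    scale = num_gpus / ref_gpus
    return base_lr * scale, per_gpu * num_gpus


# Placeholder reference values: lr 0.02 and batch size 16 on 8 GPUs, rerun on 2 GPUs.
lr, batch = rescale_solver(0.02, 16, ref_gpus=8, num_gpus=2)
print(lr, batch)  # 0.005 4
```

In detectron2 the corresponding config keys are SOLVER.BASE_LR and SOLVER.IMS_PER_BATCH; read the reference values from the config file you are training with.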

Citation

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{ma2023codet,
  title={CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection},
  author={Ma, Chuofan and Jiang, Yi and Wen, Xin and Yuan, Zehuan and Qi, Xiaojuan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}

Acknowledgment

CoDet is built upon the awesome works Detic and EVA.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

codet's Issues

AttributeError: 'Linear' object has no attribute 'linear'

I hit an error while training, at CoDet-main/codet/modeling/roi_heads/codet_roi_heads.py, line 255, in _run_stage:
boxes_features_copy = self.box_predictor[stage].cls_score.linear(box_features)
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Linear' object has no attribute 'linear'

I'm trying to run the CoDet training code, but I encountered this problem. I hope someone can help me.

About training cost.

Thank you for the open source. How long will it take to train with 8 GPUs on the LVIS+CC3M training set?

Where is 'datasets/coco/annotations/captions_train2017_tags_634_allcaps.json'?

Thank you for your work. Where can I find captions_train2017_tags_634_allcaps.json, or how do I generate it? When I train with CoDet_OVCOCO_R50_1x.yaml, I get the error [Errno 2] No such file or directory: 'datasets/coco/annotations/captions_train2017_tags_634_allcaps.json'. I am looking forward to your reply!

lvis_v1_train_norare_cat_info.json

Hello, when I tried to train with the config CoDet_OVLVIS_SwinB_4x_ft4x.yaml, I got an error saying that the file lvis_v1_train_norare_cat_info.json was missing. Can you tell me how to obtain this file? Thank you very much.

How to run CoDet on custom dataset?

Can you give me some guidance on preparing my own dataset? For example, how can I generate cococap_clip_a+cname.npy and
coco_clip_a+cname.npy for my own dataset?

Questions about reproducing OV-COCO results

Thank you for your awesome work and open source code!

I'm trying to reproduce the OV-COCO experiment, the results in Table 2 in your paper.

Unfortunately, there is only one 1080Ti available, so I divided the base learning rate by 8 (as mentioned in the repo), but the results are quite different from the data in your paper (The experimental results are shown in the figure below, unseen AP50=19, seen AP50=50).

I tried doubling the training iterations, and only got unseen AP50=24, seen AP50=50.

Can you provide some suggestions?

[Figure: screenshot of the experimental results]

Question about the two-layer MLP applied to the similarity matrix (i.e., `weight_transform` applied to `simi_scores`)

Thank you for your work! I really appreciate the ideas you've presented in your paper and find them interesting and useful. I have a question about the paper and was hoping you could provide some clarity.

In the paper and in the method _concept_grouping_loss, a two-layer MLP is applied to the similarity matrix. It seems to me that for the MLP to learn patterns in the similarity matrix, there needs to be some inherent order in the last dimension of the matrix (mn). From the paper, it seems that the mini-group is randomly sampled, so no specific order is maintained in m, and since n depends on the proposals (based on the content of the image), there is no inherent order in n either.

How does this MLP learn useful features from the similarity matrix? Is there some order maintained during sampling of the mini-group? Is there an ablation study with and without the MLP?

About the license

Hi,

Thanks for this great work!
I wonder whether there is a license for this repo? Currently I cannot find a license file for it.

Bests.

Where is captions_train2017_tags_634_allcaps.json?

@machuofan Thank you for your reply! I've downloaded the parsed caption annotation. DATA.md lists captions_train2017.json, instances_train2017.json and instances_val2017.json, but there is no captions_train2017_tags_634_allcaps.json. I don't know how to generate captions_train2017_tags_634_allcaps.json.

Request for annotations

For some reason I cannot generate the correct instances_train2017_seen_2.json and instances_val2017_all_2.json. Can you provide a download link? Thank you for your help!

Mismatches between image filenames in CC3M.

Hello, we want to reproduce the work of your paper and we are stuck on the files. Basically the file "train_image_info_tags_4706.json" is pointing to the incorrect images of the cc3m dataset. We find that they are not the correct match.
