(NeurIPS2023) CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection,
Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
NeurIPS 2023 (https://arxiv.org/abs/2310.16667)
Project page (https://codet-ovd.github.io)

Features

Train an open-vocabulary detector with web-scale image-text pairs
Align regions and words by co-occurrence instead of region-text similarity
State-of-the-art performance on open-vocabulary LVIS
Deployed with modern visual foundation models
Integrated with Roboflow to automatically label images for training a small, fine-tuned model

Installation

Setup environment

conda create --name codet python=3.8 -y && conda activate codet
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
git clone https://github.com/CVMI-Lab/CoDet.git

Install Apex and xFormers (you can skip this part if you do not use the EVA-02 backbone)

pip install ninja
pip install -v -U git+https://github.com/facebookresearch/xformers.git@7e05e2caaaf8060c1c6baadc2b04db02d5458a94
git clone https://github.com/NVIDIA/apex && cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./ && cd ..

Install detectron2 and other dependencies

cd CoDet/third_party/detectron2
pip install -e .
cd ../..
pip install -r requirements.txt
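
After the steps above, a quick check can confirm that the packages are visible to the active interpreter. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it only tests whether each package can be located, not whether its CUDA extensions were built correctly.

```python
import importlib.util
import sys

# Packages the installation steps above are expected to provide.
REQUIRED = ("torch", "torchvision", "detectron2")
# Only needed when training or testing with the EVA-02 backbone.
OPTIONAL = ("xformers", "apex")


def is_installed(name):
    """Return True if the top-level package `name` can be located (it is not imported)."""
    return importlib.util.find_spec(name) is not None


def check_environment(required=REQUIRED, optional=OPTIONAL):
    """Summarize the Python version and any packages that cannot be found."""
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "missing_required": [p for p in required if not is_installed(p)],
        "missing_optional": [p for p in optional if not is_installed(p)],
    }


if __name__ == "__main__":
    print(check_environment())
```

Run it inside the activated codet environment; an empty missing_required list suggests the core installation succeeded.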

Prepare Datasets

We use LVIS and Conceptual Captions (CC3M) for OV-LVIS experiments, COCO for OV-COCO experiments, and Objects365 for cross-dataset evaluation. Before processing, please download the datasets you need from their official websites and place or symlink them under CoDet/datasets/. CoDet/datasets/metadata/ contains the preprocessed metadata (included in the repo). Please refer to DATA.md for more details.

$CoDet/datasets/
    metadata/
    lvis/
    coco/
    cc3m/
    objects365/
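
The layout above can be verified before launching a job. This is a small sketch with a hypothetical helper name; it assumes it is run from the CoDet repository root and only checks that the top-level directories exist, not that their contents match DATA.md.

```python
from pathlib import Path

# Sub-directories listed in the layout above.
EXPECTED_DIRS = ("metadata", "lvis", "coco", "cc3m", "objects365")


def missing_dataset_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`.

    Path.is_dir() follows symbolic links, so symlinked datasets count as present.
    """
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # "datasets" assumes the current directory is the repository root.
    missing = missing_dataset_dirs("datasets")
    print("missing:", missing or "none")
```

Only the datasets needed for the chosen experiment have to be present, so a non-empty result is not necessarily an error.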

Model Zoo

OV-COCO

Backbone	Box AP50	Box AP50_novel	Config	Model
ResNet50	46.8	30.6	CoDet_OVCOCO_R50_1x.yaml	ckpt

OV-LVIS

Backbone	Mask mAP	Mask mAP_novel	Config	Model
ResNet50	31.3	23.7	CoDet_OVLVIS_R5021k_4x_ft4x.yaml	ckpt
Swin-B	39.2	29.4	CoDet_OVLVIS_SwinB_4x_ft4x.yaml	ckpt
EVA02-L	44.7	37.0	CoDet_OVLVIS_EVA_4x.yaml	ckpt

Inference

To test with custom images/videos, run

python demo.py --config-file [config_file] --input [your_image_file] --output [output_file_path] --vocabulary lvis --opts MODEL.WEIGHTS [model_weights]

Or you can customize the test vocabulary, e.g.,

python demo.py --config-file [config_file] --input [your_image_file] --output [output_file_path] --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffee --confidence-threshold 0.3 --opts MODEL.WEIGHTS [model_weights]
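
When the demo is called from a script, assembling the arguments as a list avoids quoting mistakes. The sketch below uses only the flags shown in the two commands above; the helper name and all file paths are hypothetical placeholders.

```python
import shlex


def build_demo_command(config_file, weights, input_path, output_path,
                       vocabulary="lvis", custom_vocabulary=None,
                       confidence_threshold=None):
    """Assemble the argument list for demo.py from the flags shown above."""
    args = ["python", "demo.py",
            "--config-file", config_file,
            "--input", input_path,
            "--output", output_path,
            "--vocabulary", vocabulary]
    if custom_vocabulary:
        # The class names are passed as one comma-separated string.
        args += ["--custom_vocabulary", ",".join(custom_vocabulary)]
    if confidence_threshold is not None:
        args += ["--confidence-threshold", str(confidence_threshold)]
    # --opts is kept last, as in the commands above.
    args += ["--opts", "MODEL.WEIGHTS", weights]
    return args


if __name__ == "__main__":
    # Placeholder paths; substitute your own config, checkpoint, and images.
    cmd = build_demo_command(
        "path/to/config.yaml", "path/to/ckpt.pth", "desk.jpg", "out.jpg",
        vocabulary="custom",
        custom_vocabulary=["headphone", "webcam", "paper", "coffee"],
        confidence_threshold=0.3,
    )
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run, which skips the shell entirely.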

To evaluate a pre-trained model, run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config --eval-only MODEL.WEIGHTS /path/to/ckpt

To evaluate a pre-trained model on Objects365 (cross-dataset evaluation), run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config --eval-only MODEL.WEIGHTS /path/to/ckpt DATASETS.TEST "('objects365_v2_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/o365_clip_a+cnamefix.npy',)" MODEL.TEST_NUM_CLASSES "(365,)" MODEL.MASK_ON False
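
The tuple-valued overrides in the Objects365 command contain parentheses and quotes, which are easy to get wrong in a shell. As a sketch, the same command can be built as an argument list; the override keys and values are copied from the command above, while the helper name and the example paths are hypothetical.

```python
import shlex

# Config overrides copied from the Objects365 evaluation command above.
O365_OVERRIDES = {
    "DATASETS.TEST": "('objects365_v2_val',)",
    "MODEL.RESET_CLS_TESTS": "True",
    "MODEL.TEST_CLASSIFIERS": "('datasets/metadata/o365_clip_a+cnamefix.npy',)",
    "MODEL.TEST_NUM_CLASSES": "(365,)",
    "MODEL.MASK_ON": "False",
}


def build_eval_args(config_file, checkpoint, num_gpus, overrides=None):
    """Build the train_net.py --eval-only argument list with optional overrides."""
    args = ["python", "train_net.py",
            "--num-gpus", str(num_gpus),
            "--config-file", config_file,
            "--eval-only",
            "MODEL.WEIGHTS", checkpoint]
    # Each override is passed as a KEY VALUE pair after the fixed arguments.
    for key, value in (overrides or {}).items():
        args += [key, value]
    return args


if __name__ == "__main__":
    # Placeholder paths; substitute a real config and checkpoint.
    args = build_eval_args("path/to/config.yaml", "path/to/ckpt.pth", 8,
                           O365_OVERRIDES)
    # shlex.join shows a shell-safe form for copy and paste.
    print(shlex.join(args))
```

Passing the list to subprocess.run hands each value to the script unchanged, so no manual escaping is needed.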

Training

Training configurations used by the paper are listed in CoDet/configs. Most config files require pre-trained model weights for initialization (indicated by MODEL.WEIGHTS in the config file). Please train or download the corresponding pre-trained models and place them under CoDet/models/ before training.

Name	Model
resnet50_miil_21k.pkl	ResNet50-21K pretrain from MIIL
swin_base_patch4_window7_224_22k.pkl	SwinB-21K pretrain from Swin-Transformer
eva02_L_pt_m38m_p14to16.pt	EVA02-L mixed 38M pretrain from EVA
BoxSup_OVCOCO_CLIP_R50_1x.pth	ResNet50 COCO base class pretrain from Detic
BoxSup-C2_Lbase_CLIP_R5021k_640b64_4x.pth	ResNet50 LVIS base class pretrain from Detic
BoxSup-C2_Lbase_CLIP_SwinB_896b32_4x.pth	SwinB LVIS base class pretrain from Detic

To train on a single node, run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config

Note: By default, we use 8 V100 GPUs for training with ResNet50 or SwinB, and 16 A100 GPUs for training with EVA02-L. Please remember to rescale the learning rate accordingly if you train with a different number of GPUs.
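
The note above does not say which rescaling rule to use. A common convention is the linear scaling rule, which keeps the ratio of learning rate to total batch size constant and stretches the schedule so the number of images seen stays the same. The sketch below assumes that rule; the function name and the reference values in the example are hypothetical and should be replaced with the values from the chosen config (in detectron2-style configs these are typically SOLVER.BASE_LR, SOLVER.IMS_PER_BATCH, and SOLVER.MAX_ITER).

```python
def rescale_schedule(ref_lr, ref_batch, ref_iters, batch):
    """Apply the linear scaling rule (an assumption, not stated in the README).

    ref_lr, ref_batch, ref_iters: learning rate, total batch size, and
        iteration count of the reference configuration.
    batch: total batch size that fits on the available GPUs.
    """
    if ref_batch <= 0 or batch <= 0:
        raise ValueError("batch sizes must be positive")
    scale = batch / ref_batch
    return {
        # Learning rate shrinks or grows in proportion to the batch size.
        "lr": ref_lr * scale,
        # Iterations change inversely so the total images seen is unchanged.
        "max_iter": int(round(ref_iters / scale)),
    }


if __name__ == "__main__":
    # Hypothetical reference: lr 0.02, batch 16, 90000 iterations, run at batch 2.
    print(rescale_schedule(0.02, 16, 90000, 2))
```

Very small batches may still behave differently from the reference run, so the rescaled values are a starting point rather than a guarantee of matching the reported numbers.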

Citation

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{ma2023codet,
  title={CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection},
  author={Ma, Chuofan and Jiang, Yi and Wen, Xin and Yuan, Zehuan and Qi, Xiaojuan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}

Acknowledgment

CoDet is built upon the awesome works Detic and EVA.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

codet's Issues

How to infer on own data?

@machuofan
Thank you for your work!
I wonder whether it is possible to provide an inference script to test on my own videos/images?

AttributeError: 'Linear' object has no attribute 'linear'

I have a problem while training, at CoDet-main/codet/modeling/roi_heads/codet_roi_heads.py, line 255, in _run_stage:
boxes_features_copy = self.box_predictor[stage].cls_score.linear(box_features)
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Linear' object has no attribute 'linear'

I'm trying to run the CoDet training code, but I encountered this problem. I hope someone can help me.

About training cost.

Thank you for open-sourcing this work. How long does it take to train with 8 GPUs on the LVIS+CC3M training set?

Where is instances_train2017_seen_2_oriorder_cat_info.json?

Thank you for your work. Where can I find instances_train2017_seen_2_oriorder_cat_info.json, or how can I generate it? I am looking forward to your reply!

Where is 'datasets/coco/annotations/captions_train2017_tags_634_allcaps.json'?

Thank you for your work. Where can I find captions_train2017_tags_634_allcaps.json, or how can I generate it? When I train with CoDet_OVCOCO_R50_1x.yaml, I get the error [Errno 2] No such file or directory: 'datasets/coco/annotations/captions_train2017_tags_634_allcaps.json'. I am looking forward to your reply!

lvis_v1_train_norare_cat_info.json

Hello, when I tried to train with the config CoDet_OVLVIS_SwinB_4x_ft4x.yaml, I got an error saying that the file lvis_v1_train_norare_cat_info.json was missing. Can you tell me how to obtain this file? Thank you very much.

How to run CoDet on custom dataset?

Can you give me some guidance on preparing my own dataset? For example, how can I generate cococap_clip_a+cname.npy and
coco_clip_a+cname.npy for my own dataset?

Questions about reproducing OV-COCO results

Thank you for your awesome work and open source code!

I'm trying to reproduce the OV-COCO experiment, the results in Table 2 in your paper.

Unfortunately, I only have one 1080Ti available, so I divided the base learning rate by 8 (as mentioned in the repo), but the results differ considerably from those in your paper (the experimental results are shown in the figure below: unseen AP50=19, seen AP50=50).

I tried doubling the training iterations, and only got unseen AP50=24, seen AP50=50.

Can you provide some suggestions?

Question about the two-layer MLP applied to the similarity matrix (i.e., `weight_transform` applied to `simi_scores`)

Thank you for your work! I really appreciate the ideas you've presented in your paper and find them interesting and useful. I have a question about the paper and was hoping you could provide some clarity.

In the paper and in the method _concept_grouping_loss, a two-layer MLP is applied to the similarity matrix. It seems to me that for the MLP to learn patterns in the similarity matrix, there needs to be some inherent order in the last dimension of the matrix (mn). From the paper, it seems that the mini-group is randomly sampled, so no specific order is maintained in m, and since n depends on the proposals (which are based on the content of the image), there is no inherent order in n either.

How does this MLP learn useful features from the similarity matrix? Is there some order maintained during sampling of the mini-group? Is there an ablation study with and without the MLP?

About the license

Hi,

Thanks for this great work!
I wonder whether this repo has a license. I currently cannot find a license file for it.

Bests.

How to convert Torch to ONNX and then TensorRT

Thank you for your great work.

Do you have any plans to support model conversion to ONNX?

Where is captions_train2017_tags_634_allcaps.json?

@machuofan Thank you for your reply! I've downloaded the parsed caption annotations. DATA.md lists captions_train2017.json, instances_train2017.json, and instances_val2017.json, but not captions_train2017_tags_634_allcaps.json. I don't know how to generate captions_train2017_tags_634_allcaps.json.

Request for annotations

For some reason I cannot generate the correct instances_train2017_seen_2.json and instances_val2017_all_2.json. Can you provide a download link? Thank you for your help!

Mismatches between image filenames in CC3M.

Hello, we want to reproduce the work of your paper and we are stuck on the files. Basically the file "train_image_info_tags_4706.json" is pointing to the incorrect images of the cc3m dataset. We find that they are not the correct match.