
open-groundingdino's Introduction

This is a third-party implementation of the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, written by Zuwei Long and Wei Li.

You can use this code to fine-tune a model on your own dataset, or start pretraining a model from scratch.

Supported Features

The version we replicated supports all of the following (the official release only covers inference):

  • Inference
  • Train (Object Detection data)
  • Train (Grounding data)
  • Slurm multi-machine support
  • Training acceleration strategy

Setup

We tested with Python 3.7.11, PyTorch 1.11.0, and CUDA 11.3. Other versions may also work.

  1. Clone this repository.
git clone https://github.com/longzw1997/Open-GroundingDino.git && cd Open-GroundingDino/
  2. Install the required dependencies.
pip install -r requirements.txt 
cd models/GroundingDINO/ops
python setup.py build install
python test.py
cd ../../..
  3. Download the pre-trained model and BERT weights, then modify the corresponding paths in the train/test scripts.

Dataset

For training, we use the odvg data format to support both OD data and VG data.
Before training begins, you need to convert your dataset into the odvg format; see data_format.md | datasets_mixed_odvg.json | coco2odvg.py | grit2odvg for more details.

For testing, we use the COCO format, which currently only supports OD datasets.

mixed dataset
{
  "train": [
    {
      "root": "path/V3Det/",
      "anno": "path/V3Det/annotations/v3det_2023_v1_all_odvg.jsonl",
      "label_map": "path/V3Det/annotations/v3det_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/LVIS/train2017/",
      "anno": "path/LVIS/annotations/lvis_v1_train_odvg.jsonl",
      "label_map": "path/LVIS/annotations/lvis_v1_train_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/Objects365/train/",
      "anno": "path/Objects365/objects365_train_odvg.json",
      "label_map": "path/Objects365/objects365_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/coco_2017/train2017/",
      "anno": "path/coco_2017/annotations/coco2017_train_odvg.jsonl",
      "label_map": "path/coco_2017/annotations/coco2017_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/GRIT-20M/data/",
      "anno": "path/GRIT-20M/anno/grit_odvg_620k.jsonl",
      "dataset_mode": "odvg"
    }, 
    {
      "root": "path/flickr30k/images/flickr30k_images/",
      "anno": "path/flickr30k/annotations/flickr30k_entities_odvg_158k.jsonl",
      "dataset_mode": "odvg"
    }
  ],
  "val": [
    {
      "root": "path/coco_2017/val2017",
      "anno": "config/instances_val2017.json",
      "label_map": null,
      "dataset_mode": "coco"
    }
  ]
}
example for odvg dataset
# For OD
{"filename": "000000391895.jpg", "height": 360, "width": 640, "detection": {"instances": [{"bbox": [359.17, 146.17, 471.62, 359.74], "label": 3, "category": "motorcycle"}, {"bbox": [339.88, 22.16, 493.76, 322.89], "label": 0, "category": "person"}, {"bbox": [471.64, 172.82, 507.56, 220.92], "label": 0, "category": "person"}, {"bbox": [486.01, 183.31, 516.64, 218.29], "label": 1, "category": "bicycle"}]}}
{"filename": "000000522418.jpg", "height": 480, "width": 640, "detection": {"instances": [{"bbox": [382.48, 0.0, 639.28, 474.31], "label": 0, "category": "person"}, {"bbox": [234.06, 406.61, 454.0, 449.28], "label": 43, "category": "knife"}, {"bbox": [0.0, 316.04, 406.65, 473.53], "label": 55, "category": "cake"}, {"bbox": [305.45, 172.05, 362.81, 249.35], "label": 71, "category": "sink"}]}}

# For VG
{"filename": "014127544.jpg", "height": 400, "width": 600, "grounding": {"caption": "Homemade Raw Organic Cream Cheese for less than half the price of store bought! It's super easy and only takes 2 ingredients!", "regions": [{"bbox": [5.98, 2.91, 599.5, 396.55], "phrase": "Homemade Raw Organic Cream Cheese"}]}}
{"filename": "012378809.jpg", "height": 252, "width": 450, "grounding": {"caption": "naive : Heart graphics in a notebook background", "regions": [{"bbox": [93.8, 47.59, 126.19, 77.01], "phrase": "Heart graphics"}, {"bbox": [2.49, 1.44, 448.74, 251.1], "phrase": "a notebook background"}]}}

Config

config/cfg_odvg.py                   # for backbone, batch size, LR, freeze layers, etc.
config/datasets_mixed_odvg.json      # support mixed dataset for both OD and VG

Training

  • Datasets: before starting training, modify config/datasets_mixed_example.json according to data_format.md.
  • Configs: evaluation defaults to coco_val2017.
    • If you are evaluating on your own test set, convert the test data to COCO format (not the odvg format) and set use_coco_eval = False in the config (the COCO dataset has 80 classes used for training but 90 category IDs in total, so there is a built-in mapping in the code).
    • Also, add (or update) the label_list in the config with your own class names, e.g. label_list = ['dog', 'cat', 'person']:
- use_coco_eval = True
+ use_coco_eval = False
+ label_list=['dog', 'cat', 'person']
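
If your own OD data uses a label_map file (as in the dataset JSON above), a matching map for these example classes could be written out like this; the output filename is a placeholder and the index-to-name pairing is an assumption based on data_format.md:

import json

label_list = ['dog', 'cat', 'person']
# Hypothetical label_map: contiguous string indices mapped to class names,
# matching the {"0": "...", "1": "..."} style used by the odvg label_map files.
label_map = {str(i): name for i, name in enumerate(label_list)}

with open("custom_label_map.json", "w") as f:
    json.dump(label_map, f, indent=2)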
  • Train/Eval:
# train/eval on torch.distributed.launch:
bash train_dist.sh  ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
bash test_dist.sh  ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}

# train/eval on slurm cluster:
bash train_slurm.sh  ${PARTITION} ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
bash test_slurm.sh  ${PARTITION} ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
# e.g.  check train_slurm.sh for more details
# bash train_slurm.sh v100_32g 32 config/cfg_odvg.py config/datasets_mixed_odvg.json ./logs
# bash train_slurm.sh v100_32g 8 config/cfg_coco.py config/datasets_od_example.json ./logs

Results and Models

Name                          Pretrain data                                   Task              mAP on COCO        Ckpt   Misc
GroundingDINO-T (official)    O365, GoldG, Cap4M                              zero-shot         48.4 (zero-shot)   model  -
GroundingDINO-T (fine-tune)   O365, GoldG, Cap4M                              finetune w/ COCO  57.3 (fine-tune)   model  cfg | log
GroundingDINO-T (pretrain)    COCO, O365, LVIS, V3Det, GRIT-200K, Flickr30k   zero-shot         55.1 (zero-shot)   model  cfg | log
                              (total 1.8M)

Inference

Because the model architecture has not changed, you only need to install the GroundingDINO library and then run inference_on_a_image.py to run inference on your images.

python tools/inference_on_a_image.py \
  -c tools/GroundingDINO_SwinT_OGC.py \
  -p path/to/your/ckpt.pth \
  -i ./figs/dog.jpeg \
  -t "dog" \
  -o output
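
If you prefer to call the model from Python rather than the script, a rough sketch using the inference helpers from the IDEA-Research GroundingDINO package looks like this (paths and thresholds are placeholders; the helper names come from that package's util.inference module, not from this repo):

import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model("tools/GroundingDINO_SwinT_OGC.py", "path/to/your/ckpt.pth")
image_source, image = load_image("./figs/dog.jpeg")

# box_threshold / text_threshold control which query boxes and phrase tokens are kept
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="dog",
    box_threshold=0.35,
    text_threshold=0.25,
)

annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("output/annotated_dog.jpg", annotated)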
Qualitative comparison for the prompts "dog" and "cat" across the official ckpt, the COCO fine-tuned ckpt, and the 1.8M-pretrained ckpt (result images omitted).

Acknowledgments

The provided code was adapted from:

Citation

@misc{OpenGroundingDino,
  author = {Zuwei Long and Wei Li},
  title = {Open Grounding Dino: The third party implementation of the paper Grounding DINO},
  howpublished = {\url{https://github.com/longzw1997/Open-GroundingDino}},
  year = {2023}
}

Contact

  • longzuwei at sensetime.com
  • liwei1 at sensetime.com

Feel free to contact us if you have any suggestions or questions. Bug reports are also welcome. Please create a pull request if you find any bugs or want to contribute code.


open-groundingdino's Issues

Partial and Unlabeled Predictions

After fine-tuning on a small set of classes, I get a bunch of predictions that are unlabeled or only contain part of the label name. For example, I have a class called 'mvr' and it is displayed as 'mv'. I have attached an image of the output below.

[attached output image omitted]

Error in ms_deformable_im2col_cuda

After following the installation instructions, everything installs successfully and I'm able to run test.py in models/GroundingDINO/ops:

* True check_forward_equal_with_pytorch_double: max_abs_err 8.67e-19 max_rel_err 2.35e-16
* True check_forward_equal_with_pytorch_float: max_abs_err 4.66e-10 max_rel_err 1.13e-07
* True check_gradient_numerical(D=30)
* True check_gradient_numerical(D=32)
* True check_gradient_numerical(D=64)
* True check_gradient_numerical(D=71)

However, when doing inference with the model (running the model with CUDA_LAUNCH_BLOCKING=1), I get the following error:

error in ms_deformable_im2col_cuda: an illegal memory access was encountered

and also:

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Do you have any idea what may be causing this issue? I'll continue exploring and I'll update the question when I have more information.

Additionally, I can run inference with the model on cpu without any errors. Here is the output of python -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.17

Python version: 3.7.11 (default, Jul 27 2021, 14:32:16)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-163-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090

Nvidia driver version: 525.125.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.12.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.3.1               h2bc3f7f_2
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640
[conda] mkl-service               2.4.0            py37h7f8727e_0
[conda] mkl_fft                   1.3.1            py37hd3c417c_0
[conda] mkl_random                1.2.2            py37h51133e4_0
[conda] numpy                     1.21.5           py37h6c91a56_3
[conda] numpy-base                1.21.5           py37ha15fc14_3
[conda] pytorch                   1.11.0          py3.7_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.11.0               py37_cu113    pytorch
[conda] torchvision               0.12.0               py37_cu113    pytorch

Discrepancy between the model's predictions and the confidence scores

Thank you so much for the amazing work!

I used your implementation to train a model on a custom dataset consisting of only 10 images for 500 epochs, during which I expected the model to be able to memorize the provided images. I then passed the same image I used for training and the weight obtained to the official grounding dino inference script to test its performance.

The model exhibited promising results by correctly drawing bounding boxes and accurately predicting the class. However, I observed a notable discrepancy in the confidence scores (as shown in the attached image). Despite the model's correct predictions, the confidence scores were unexpectedly low.

I am wondering if you could kindly provide any guidance or suggestions on why there might be such a difference between the model's predictions and the confidence scores. Any insights would be greatly appreciated. Thank you so much for your time and support :))

[annotated image omitted]

Assertion fails for tgt_bbox

I tried giving the input jsonl in multiple bbox formats like xywh, xywhn, cxcywh, and cxcywhn, but nothing worked.

File "/content/Open-GroundingDino/models/GroundingDINO/matcher.py", line 101, in forward
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
File "/content/Open-GroundingDino/util/box_ops.py", line 53, in generalized_box_iou
assert (boxes2[:, 2:] >= boxes2[:, :2]).all(), f"{boxes2}"
AssertionError: tensor([[0.4659, 0.3646, 0.1392, 0.4132],
[0.5540, 0.4931, 0.2415, 0.7604],
[0.2244, 0.4861, 0.2869, 0.8021],
[0.7528, 0.7708, 0.2898, 0.4514],
[0.3778, 0.5556, 0.1449, 0.2743],
[0.8295, 0.6632, 0.1250, 0.2361],
[0.2528, 0.8681, 0.0739, 0.2326],
[0.6165, 0.2014, 0.0852, 0.1736],
[0.0199, 0.1389, 0.0398, 0.1111]], device='cuda:0')

Please let me know what the issue could be, and whether odvg bboxes should be normalized or in pixel coordinates.
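
For comparison, the odvg examples earlier in this README store boxes as absolute [x1, y1, x2, y2] pixel coordinates. A small helper like the following (hypothetical, for illustration only) converts normalized cxcywh boxes, such as those in the tensor above, into that form:

def cxcywhn_to_xyxy(box, img_w, img_h):
    """Convert a normalized [cx, cy, w, h] box to absolute [x1, y1, x2, y2] pixels."""
    cx, cy, w, h = box
    return [(cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h]

# e.g. cxcywhn_to_xyxy([0.4659, 0.3646, 0.1392, 0.4132], img_w=640, img_h=480)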

Simple questions regarding fine-tuning the model on custom data.

Hi, I would like to express my infinite gratitude for sharing the training code of GroundingDINO 🙌.

I want to fine-tune the model using gdinot-1.8m-odvg.pth on a custom dataset. Could you provide some advice on how to set the freeze layer?

Alternatively, if it's not too much trouble, could you let me know which layers were frozen during the fine-tuning of your GroundingDINO-T(fine-tune)?

Thank you!

In /tools/inference_on_a_image.py, is there a difference between IDEA-Research's code and longzw1997's code?

Some functions call the code here:
https://github.com/longzw1997/Open-GroundingDino/
Others call the code here:
https://github.com/IDEA-Research/GroundingDINO/

Directory structure (I moved it to the root directory):
[screenshot omitted]

# These call IDEA-Research's code
import groundingdino.datasets.transforms as T
from groundingdino.models import build_model
# These call longzw1997's code
from groundingdino.util import box_ops
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict, get_phrases_from_posmap
from groundingdino.util.vl_utils import create_positive_map_from_span

In this code, it works normally. However, when I uninstalled rf-groundingdino (IDEA-Research's code) and tried to make this code use longzw1997's code as follows:

import datasets.transforms as T
from models import build_model_inference
from groundingdino.util import box_ops
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict, get_phrases_from_posmap
from groundingdino.util.vl_utils import create_positive_map_from_span

It raised AttributeError: 'ConfigDict' object has no attribute 'coco_val_path'.

Of course, I saw the comment about installing IDEA-Research's package. But I'm a little confused: is there a difference between these two codebases?

Questions regarding fine-tuning the model on custom data.

Hi. I'm trying to fine-tune a model on custom data. I have a few questions during the process, and it would be really helpful if you could answer them. Thank you in advance.

How is the dictionary called 'id_map' defined in tools/coco2odvg.py?
I don't understand how the id_map = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16, 15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31, 27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43, 39: 44, 40: 46, 41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56, 51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72, 63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85, 75: 86, 76: 87, 77: 88, 78: 89, 79: 90} is defined in the file.

  • When using custom data, how should 'id_map' be defined?
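
For illustration, the id_map above is just the 80 COCO training classes enumerated against their original (non-contiguous) category ids. A hedged sketch of building an equivalent id_map and label_map for a custom COCO-format dataset (the file path is a placeholder, and this is not the repo's coco2odvg.py):

import json

with open("instances_train.json") as f:          # hypothetical custom COCO annotation file
    categories = json.load(f)["categories"]

cat_ids = sorted(c["id"] for c in categories)
names = {c["id"]: c["name"] for c in categories}

id_map = {i: cid for i, cid in enumerate(cat_ids)}             # contiguous index -> original id
label_map = {str(i): names[cid] for i, cid in id_map.items()}  # contiguous index -> class name

print(id_map)      # for COCO this reproduces {0: 1, 1: 2, ..., 79: 90}
print(label_map)   # e.g. {"0": "person", "1": "bicycle", ...}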

If the val set in config/datasets_mixed_odvg.json is not coco, how should the 'label_map' be set in the json file?

  • If I define the label_list in cfg_odvg.py, is it okay to set the label_map in the json file to null?

Thank you🙌

Evaluate

Hello Again !

I trained the model using this code but had an issue when evaluating.
Note: I'm using it for visual grounding.

  • When putting the model in eval mode with --save_results set, I got the following error:
    File "/content/Open-GroundingDino/engine.py", line 232, in evaluate res_info = torch.cat((_res_bbox, _res_prob.unsqueeze(-1), _res_label.unsqueeze(-1)), 1) RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 900 but got size 300 for tensor number 1 in the list.

I think line 226 in engine.py, _res_bbox = outbbox, should be replaced with _res_bbox = res['boxes']; this made the code work by matching the sizes.

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

The latest version fixes the CUDA issue but produces the following traceback. It looks like the label_map is not being moved to the GPU. To fine-tune on COCO, I created a new JSON file with the label map as described in data_format.md.

Traceback (most recent call last):
  File "/content/Open-GroundingDino/main.py", line 372, in <module>
    main(args)
  File "/content/Open-GroundingDino/main.py", line 285, in main
    train_stats = train_one_epoch(
  File "/content/Open-GroundingDino/engine.py", line 48, in train_one_epoch
    loss_dict = criterion(outputs, targets, cap_list, captions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/Open-GroundingDino/models/GroundingDINO/groundingdino.py", line 553, in forward
    inds = self.matcher(for_match, [targets[j]], label_map_list[j])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/Open-GroundingDino/models/GroundingDINO/matcher.py", line 80, in forward
    new_label_map=label_map[tgt_ids]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I changed datasets_od_example.json as follows:

{ "train": [ { "root": "/content/dataset_folder/train2017", "anno": "/content/drive/MyDrive/coco/annotations/instances_train2017.jsonl", "label_map": "/content/drive/MyDrive/coco/coco2017_label_map.json", "dataset_mode": "odvg" } ], "val": [ { "root": "/content/dataset_folder/val2017", "anno": "/content/dataset_folder/annotations/instances_val2017.json", "label_map": null, "dataset_mode": "coco" } ] }

The impact of the order of elements (i.e., text) within "cat_list" used in model evaluation

Thank you sincerely for sharing your training code. 🙌😥

I have a few questions while conducting evaluations on a model fine-tuned with custom object detection data.

In short, is there a definitive answer on how to structure cat_list for accurate model evaluation? Specifically, is there a correct order for the text elements within cat_list (i.e., label_list in cfg.py)? (The class mapping between the predictions of the model trained on the trainset and the validset has been correctly completed.)

When evaluating a model fine-tuned on custom od data using another custom dataset, I observed that organizing cat_list based on the categories of the training set yields an mAP of approximately 0.45, whereas organizing it based on the categories of the validset yields a result of 0.15. (This pattern persists for different training sets with the same evaluation dataset.)

However, when using coco_val2017 for evaluation, I confirmed through the code that cat_list is organized based on the order of categories in the coco annotations (e.g., cat_list = ['person', 'bicycle', ...]). Furthermore, this cat_list, established in this manner, remains fixed and is utilized throughout the evaluation process.

Considering this process, it appears that the order of text elements in cat_list may not be crucial, is that correct? Alternatively, is organizing cat_list based on the categories of the validset a clear model evaluation method?

Thank you for reading the question, and I appreciate any insights you can provide.

Detection task with a one-class custom dataset: are 15 epochs not enough?

Hello, thank you for your code.
I want to fine-tune this network on my custom dataset, which has just one class for detecting a LOGO (30k images with positive and negative data), and I use most of the default params in the training config (cfg_odvg.py) and dataset config (dvt_COCO_odvg.json).
In fact, mAP is always 0. I tried twice to find problems in the config, but failed.

training cmd
bash train_dist.sh 4 ./config/cfg_odvg.py ./config/dvt_COCO_odvg.json ./logs

odvg_dataset
{"filename": "base/raptor_evt1_pile_common_2022-08-03-14-35-04_selected/946685562836786.jpg", "height": 720, "width": 1280, "detection": {"instances": [{"bbox": [542.55, 333.65, 649.04, 411.78], "label": 0, "category": "charger"}]}} {"filename": "base/raptor_evt1_pile_common_2022-08-03-14-35-04_selected/946685182538767.jpg", "height": 720, "width": 1280, "detection": {"instances": [{"bbox": [837.73, 225.0, 933.94, 301.97], "label": 0, "category": "charger"}]}} {"filename": "/dta/yanx/Dataset/charger_detect/dvt_charger/data/training/base/raptor_evt2_negative_eur_3148/1691843820252.jpg", "height": 720, "width": 1280, "detection": {"instances": []}} {"filename": "/dta/yanx/Dataset/charger_detect/dvt_charger/data/training/base/raptor_evt2_negative_eur_3148/1691690205131.jpg", "height": 720, "width": 1280, "detection": {"instances": []}} ...

labelmap
{"0": "charger"}

dvt_COCO_odvg.json
cfg_odvg.py.txt

pre-training

Hello! Thank you for your code. Could you please help me understand whether pre-training is possible with your code?

I have data in COCO format, and I understand I need to convert it to the odvg format to use your code. But from what I understand, I still need the bbox annotations for the VG data as well, right? Is it possible to pre-train Grounding DINO with only the image-text pairs?

thank you

Tokenizer decoder makes up class names during inference

Hello,

I have finetuned a model on my custom dataset using your implementation of grounding DINO. I am currently testing its performance by calling the inference function on unseen data. However, I noticed that the prediction function sometimes makes up non-existent class names that are not in the caption text input.

For example, when I used the caption "cadiere forceps . needle driver .", the results returned included classes like "cad forceps" or "##ps" as shown in the figure. I'm curious if you have any insights into why this might be happening. Thank you so much!

[screenshot omitted]

about train dataset questions

There was an issue converting my local COCO dataset to a .jsonl file using the script you provided. My dataset has just one category.
I made two modifications as follows:
[screenshots of the two modifications omitted]

This is the information output after running the script:
[script output screenshot omitted]

The output log of the training process prompts the following issues:

[training log screenshot omitted]

Looking forward to your reply, thanks!

Question about dataset format

The official Flickr30k dataset only has sentence descriptions but no object detections. So how could I get the Annotations used in ./tools/flickr30ke2odvg.py?
sentence_list = os.path.join(args.root, "Sentences")
annotation_list = os.path.join(args.root, "Annotations")
thanks!

Question about training accuracy differences caused by different text prompts

Hello! While training on my own data I found a problem: when the text contains multiple categories that share the same words, e.g. "truck . truck mixer . heavy truck" or "insulator . dirty insulator . damadge insulator", many of the predictions come out as "truck truck mixer", "insulator dirty insulator", and so on. When I changed the category definitions so they no longer share words, e.g. "truck . concrete mixer . heavy", the recognition rate improved a lot.

At first I thought the model was simply bad at distinguishing the features of two categories, so it decided an object belonged to both at once. Later I realized the text feature extraction module probably also plays a role: a model like YOLO, which has no text branch, achieves a relatively higher recognition rate on the same training and validation sets.

Errors of distributed training

Hi, using the config file you recommended (cfg_odvg.py) to fine-tune on the COCO dataset, I found that distributed training does not work. A screenshot of the distributed training error is below:
[screenshot omitted]

Can i train on windows?

I tried to run
bash train_dist.sh 1 config/cfg_odvg.py config/datasets_mixed_odvg.json ./logs
but I get:

Traceback (most recent call last):
File "E:\VSProject\Open-GroundingDino\main.py", line 372, in
main(args)
File "E:\VSProject\Open-GroundingDino\main.py", line 88, in main
utils.setup_distributed(args)
File "E:\VSProject\Open-GroundingDino\util\misc.py", line 553, in setup_distributed
torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
File "D:\python\lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
File "D:\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
default_pg, _ = _new_process_group_helper(
File "D:\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in

I found that Windows doesn't seem to support NCCL. Are there any other ways to train on Windows?

Label_list

Hi, great work, thanks for sharing!

I would like to know: if I want to train my own dataset mixed with other datasets, how should I initialize label_list in the config file?

Thanks for your reply!

UnpicklingError: invalid load key, '<'. Unable to use newly trained weights for detection or inference

Open-GroundingDino (this repo) didn't seem to have modules for running inference with newly trained weights (.pth).

I tried using the IDEA-Research/GroundingDINO (official implementation) repo for detection by providing the path of the newly trained weights, which gave me an error like:

UnpicklingError Traceback (most recent call last)
in <cell line: 6>()
4 import time
5
----> 6 model = load_model("/content/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py", "/content/GroundingDINO/weights/groundingdino_swint_ogc.pth")

2 frames
/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
1244 "functionality.")
1245
-> 1246 magic_number = pickle_module.load(f, **pickle_load_args)
1247 if magic_number != MAGIC_NUMBER:
1248 raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.

Also, I need clarity on the multiple weight files being generated. I understand there is a weight file for each epoch, plus a file named "checkpoint_best_regular.pth": is that the checkpoint for the epoch with the lowest loss or the highest accuracy? The eval folder also has two weight files, latest.pth and 000.pth; what are those for?

How can I fix being unable to feed a batch of data into Grounding-DINO?

First of all, thanks to both of you for providing the training method!

model_checkpoint_path = "./weights/groundingdino_swint_ogc.pth"

model = load_model(model_config_path, model_checkpoint_path)
img = torch.ones([16, 3, 256, 256])
prompt = 'building .'
o1, o2 = model(img, captions=[prompt])

It reports the following error:
  File "D:\Install\anaconda\envs\clip-py3.7\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "e:\project\groundingdino\groundingdino\models\GroundingDINO\fuse_modules.py", line 163, in forward
    key_states = self._shape(self.l_proj(l), -1, bsz)
  File "e:\project\groundingdino\groundingdino\models\GroundingDINO\fuse_modules.py", line 130, in _shape
    return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
RuntimeError: shape '[16, -1, 4, 256]' is invalid for input of size 4096
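
A hedged guess based only on the shape error above (16 images in the batch but a single caption, so the text branch ends up with batch size 1): replicating the prompt once per image may avoid the reshape failure. Continuing the snippet above:

# Untested sketch: pass one caption per image in the batch.
o1, o2 = model(img, captions=[prompt] * img.shape[0])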

After training the model, the original classification accuracy decreases

Hello, I want to add a new category on top of the categories the original model already recognizes.
For example, in this image, I want to add recognition of the corner text, using the label "OSD".
[example image omitted]
After training I found that, although the model could now effectively recognize the text targets, "car" could no longer be recognized, and the recognition rate for "person" also dropped significantly.
[example image omitted]
The following are the training and validation sets generated by my own script according to the project guidelines.
Training set format:
[screenshot omitted]
Validation set format:
[screenshot omitted]
New feature list:
[screenshot omitted]
Where might the problem be?

Training-Data-Prep

Hello , first of all thank you for releasing your training code!
I was trying to run it , but facing some issues in the format of my data.
I'm using a dataset whose annotations are in PASCAL VOC format (each image has its own .xml annotation file).
So I tried converting them to COCO format using the voc2coco.py script, and then used your coco2odvg tool.
Unfortunately, in the generated odvg json file the instances list is empty for all samples.
examples:

This is after using coco2odvg :
{"filename": "00003.jpg", "height": 800, "width": 800, "detection": {"instances": []}}
{"filename": "00004.jpg", "height": 800, "width": 800, "detection": {"instances": []}}

Thanks in advance :)

Predicted labels

Hello ,

@aghand0ur and I used your code to train on a custom dataset (20 classes), and everything went fine.
I modified the evaluate function to suit this specific task. When testing on my test dataset (converted to COCO format), the COCO results are really low, although visualizing samples showed impressive results.
I printed out the labels being predicted during evaluation; it never returns the correct prediction, while the bounding boxes are quite good.
I placed label_list containing the categories in cfg_odvg.py.
Any ideas/tips on where the source of the problem could be?

Visualization code?

Hi, thank you very much for your contribution.
I'd like to ask whether you have any code for visualizing results for debugging?

IndexError: index 909 is out of bounds for dimension 0 with size 900

I am training on my own dataset, and the full error is as follows:
Traceback (most recent call last):
File "/mnt/lvm_data/project/xyguo/code_dmx/Open-GroundingDino/main.py", line 372, in
main(args)
File "/mnt/lvm_data/project/xyguo/code_dmx/Open-GroundingDino/main.py", line 284, in main
train_stats = train_one_epoch(
File "/mnt/lvm_data/project/xyguo/code_dmx/Open-GroundingDino/engine.py", line 48, in train_one_epoch
loss_dict = criterion(outputs, targets, cap_list, captions)
File "/mnt/lvm_data/public/package/anaconda3/envs/groundingdino_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/lvm_data/project/xyguo/code_dmx/Open-GroundingDino/models/GroundingDINO/groundingdino.py", line 596, in forward
tgt_ids[i]=tgt_ids[i][indices[i][1]]
IndexError: index 909 is out of bounds for dimension 0 with size 900

Looking through the code, the two lines shown in the image below seem to produce index values that exceed the length of tgt_ids[i] in the for loop. How should I change this?
[screenshot omitted]

question about training VG data

I've noticed an issue during training: when using my custom dataset (in VG format), the model's performance significantly degrades when different objects share the same description. How can I address this problem?
If I treat the description sentences as class names and convert the custom dataset to an object detection (od) format, would that address this issue? Looking forward to your reply, thanks!

Loss inf problem occurs when training tiny model, but base and large are normal

model: grounding_dino
weight: groundingdino_swint_ogc.pth
backbone: swin_T_224_1k
train env: 8x V100 32GB
config: cfg_odvg.py
dataset: objects365

log:
Epoch: [0] [ 0/41483] eta: 14 days, 20:50:49 lr: 0.000100 loss: 32.9808 (32.9808) loss_bbox: 1.8663 (1.8663) loss_bbox_0: 2.4479 (2.4479) loss_bbox_1: 2.6284 (2.6284) loss_bbox_2: 1.6473 (1.6473) loss_bbox_3: 1.6328 (1.6328) loss_bbox_4: 1.8657 (1.8657) loss_bbox_interm: 2.6962 (2.6962) loss_ce: 0.7780 (0.7780) loss_ce_0: 2.6423 (2.6423) loss_ce_1: 2.4399 (2.4399) loss_ce_2: 2.5211 (2.5211) loss_ce_3: 2.4680 (2.4680) loss_ce_4: 2.5602 (2.5602) loss_ce_interm: 2.1035 (2.1035) loss_giou: 0.3789 (0.3789) loss_giou_0: 0.3840 (0.3840) loss_giou_1: 0.3836 (0.3836) loss_giou_2: 0.3773 (0.3773) loss_giou_3: 0.3759 (0.3759) loss_giou_4: 0.3781 (0.3781) loss_giou_interm: 0.4054 (0.4054) loss_bbox_unscaled: 0.3733 (0.3733) loss_bbox_0_unscaled: 0.4896 (0.4896) loss_bbox_1_unscaled: 0.5257 (0.5257) loss_bbox_2_unscaled: 0.3295 (0.3295) loss_bbox_3_unscaled: 0.3266 (0.3266) loss_bbox_4_unscaled: 0.3731 (0.3731) loss_bbox_interm_unscaled: 0.5392 (0.5392) loss_ce_unscaled: 0.3890 (0.3890) loss_ce_0_unscaled: 1.3212 (1.3212) loss_ce_1_unscaled: 1.2199 (1.2199) loss_ce_2_unscaled: 1.2605 (1.2605) loss_ce_3_unscaled: 1.2340 (1.2340) loss_ce_4_unscaled: 1.2801 (1.2801) loss_ce_interm_unscaled: 1.0518 (1.0518) loss_giou_unscaled: 0.1895 (0.1895) loss_giou_0_unscaled: 0.1920 (0.1920) loss_giou_1_unscaled: 0.1918 (0.1918) loss_giou_2_unscaled: 0.1886 (0.1886) loss_giou_3_unscaled: 0.1879 (0.1879) loss_giou_4_unscaled: 0.1891 (0.1891) loss_giou_interm_unscaled: 0.2027 (0.2027) loss_hw_unscaled: 0.2596 (0.2596) loss_hw_0_unscaled: 0.3445 (0.3445) loss_hw_1_unscaled: 0.3645 (0.3645) loss_hw_2_unscaled: 0.2324 (0.2324) loss_hw_3_unscaled: 0.2289 (0.2289) loss_hw_4_unscaled: 0.2603 (0.2603) loss_hw_interm_unscaled: 0.3786 (0.3786) loss_xy_unscaled: 0.1137 (0.1137) loss_xy_0_unscaled: 0.1451 (0.1451) loss_xy_1_unscaled: 0.1611 (0.1611) loss_xy_2_unscaled: 0.0970 (0.0970) loss_xy_3_unscaled: 0.0976 (0.0976) loss_xy_4_unscaled: 0.1128 (0.1128) loss_xy_interm_unscaled: 0.1606 (0.1606) time: 30.9681 data: 5.9649 max mem: 9660
Loss is inf, stopping training
{'loss_bbox': tensor(inf, device='cuda:0'), 'loss_bbox_0': tensor(inf, device='cuda:0'), 'loss_bbox_1': tensor(inf, device='cuda:0'), 'loss_bbox_2': tensor(inf, device='cuda:0'), 'loss_bbox_3': tensor(inf, device='cuda:0'), 'loss_bbox_4': tensor(inf, device='cuda:0'), 'loss_bbox_interm': tensor(inf, device='cuda:0'), 'loss_ce': tensor(0.4675, device='cuda:0'), 'loss_ce_0': tensor(0.6388, device='cuda:0'), 'loss_ce_1': tensor(0.6114, device='cuda:0'), 'loss_ce_2': tensor(0.6029, device='cuda:0'), 'loss_ce_3': tensor(0.5809, device='cuda:0'), 'loss_ce_4': tensor(0.5935, device='cuda:0'), 'loss_ce_interm': tensor(0.6336, device='cuda:0'), 'loss_giou': tensor(0.1472, device='cuda:0'), 'loss_giou_0': tensor(0.1555, device='cuda:0'), 'loss_giou_1': tensor(0.1515, device='cuda:0'), 'loss_giou_2': tensor(0.1504, device='cuda:0'), 'loss_giou_3': tensor(0.1504, device='cuda:0'), 'loss_giou_4': tensor(0.1466, device='cuda:0'), 'loss_giou_interm': tensor(0.1693, device='cuda:0'), 'loss_hw': tensor(inf, device='cuda:0'), 'loss_hw_0': tensor(inf, device='cuda:0'), 'loss_hw_1': tensor(inf, device='cuda:0'), 'loss_hw_2': tensor(inf, device='cuda:0'), 'loss_hw_3': tensor(inf, device='cuda:0'), 'loss_hw_4': tensor(inf, device='cuda:0'), 'loss_hw_interm': tensor(inf, device='cuda:0'), 'loss_xy': tensor(inf, device='cuda:0'), 'loss_xy_0': tensor(inf, device='cuda:0'), 'loss_xy_1': tensor(inf, device='cuda:0'), 'loss_xy_2': tensor(inf, device='cuda:0'), 'loss_xy_3': tensor(inf, device='cuda:0'), 'loss_xy_4': tensor(inf, device='cuda:0'), 'loss_xy_interm': tensor(inf, device='cuda:0')}
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1809199 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1809205 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1809211 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1809213 closing signal SIGTERM

PS: training the tiny model on a single V100 also works normally.

Thanks in advance for looking into this problem.

Dataset preparation

Thank you for your great work,

I have several questions w.r.t. dataset preparation for pre-training.

  1. How can I get the label map of COCO?

below might be the one, but I'm not quite sure...
https://github.com/longzw1997/Open-GroundingDino/blob/main/data_format.md#label_map

  2. How can I get the annotation and label map of Object365 and LVIS?

I guess I could easily get the label maps of Objects365 and LVIS by referring to this format,
but the code for generating the annotations (objects365_train_odvg.json, lvis_v1_train_odvg.jsonl) does not seem to exist.
For this, should I modify some part of the COCO annotation generation script, or are there any alternatives?

  3. How can I get the label map of GRIT-200k and flickr30k?

For obtaining grit_odvg_2m.json, am I right to execute grit2odvg.py with the --random_samples option set to 200000 (200k)?
I'm a bit confused about whether I should set this value to 200k or 2m.

Also for flickr30k, when running flickr30ke2odvg.py, am I right to pass --osoi=False to generate flickr30k_entities_odvg_158k.json? I'm also a bit confused about the meaning of 158k in the suffix.

I really appreciate your response in advance and your valuable work!

AttributeError: 'Namespace' object has no attribute 'label_list'

Training with a custom dataset, I get the following error. Changing 'cat_list=args.label_list' to 'cat_list=[list, of, my, classes]' seems to work.

Traceback (most recent call last):
File "/content/Open-GroundingDino/main.py", line 372, in
main(args)
File "/content/Open-GroundingDino/main.py", line 144, in main
model, criterion, postprocessors = build_model_main(args)
File "/content/Open-GroundingDino/main.py", line 81, in build_model_main
model, criterion, postprocessors = build_func(args)
File "/content/Open-GroundingDino/models/GroundingDINO/groundingdino.py", line 802, in build_groundingdino
postprocessors = {'bbox': PostProcess(num_select=args.num_select , text_encoder_type=args.text_encoder_type,nms_iou_threshold=args.nms_iou_threshold,args=args)}
File "/content/Open-GroundingDino/models/GroundingDINO/groundingdino.py", line 652, in init
cat_list=args.label_list
AttributeError: 'Namespace' object has no attribute 'label_list'

Fine Tune on COCO

I am attempting to get fine-tuning on COCO working before I use my own dataset.

I use the following command to run it in a Colab notebook; a truncated version of the output follows. Any thoughts on next steps for debugging?

'''
!python /content/Open-GroundingDino/main.py
--output_dir ./logs
-c /content/Open-GroundingDino/config/cfg_coco.py
--datasets /content/Open-GroundingDino/config/datasets_od_example.json
--pretrain_model_path /content/Open-GroundingDino/groundingdino_swint_ogc.pth
'''

Not using distributed mode
Loading config file from /content/Open-GroundingDino/config/cfg_coco.py
INFO 2023-10-18 17:57:30,403 | git:
sha: 9036724, status: has uncommited changes, branch: main
INFO 2023-10-18 17:57:30,403 | Command: /content/Open-GroundingDino/main.py --output_dir ./logs -c /content/Open-GroundingDino/config/cfg_coco.py --datasets /content/Open-GroundingDino/config/datasets_od_example.json --pretrain_model_path /content/Open-GroundingDino/groundingdino_swint_ogc.pth
INFO 2023-10-18 17:57:30,404 | Full config saved to ./logs/config_args_all.json
INFO 2023-10-18 17:57:30,405 | world size: 1
INFO 2023-10-18 17:57:30,405 | rank: 0
INFO 2023-10-18 17:57:30,405 | local_rank: 0
........
DEBUG 2023-10-18 17:57:30,406 | build model ... ...
/content/Open-GroundingDino/models/GroundingDINO/ms_deform_attn.py:31: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
warnings.warn("Failed to load custom C++ ops. Running on CPU mode Only!")
/usr/local/lib/python3.10/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
load tokenizer done.
........
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:905: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/content/Open-GroundingDino/main.py", line 371, in
main(args)
File "/content/Open-GroundingDino/main.py", line 284, in main
train_stats = train_one_epoch(
File "/content/Open-GroundingDino/engine.py", line 47, in train_one_epoch
outputs = model(samples, captions=captions)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/Open-GroundingDino/models/GroundingDINO/groundingdino.py", line 315, in forward
hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/Open-GroundingDino/models/GroundingDINO/transformer.py", line 258, in forward
memory, memory_text = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/Open-GroundingDino/models/GroundingDINO/transformer.py", line 580, in forward
output = checkpoint.checkpoint(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/Open-GroundingDino/models/GroundingDINO/transformer.py", line 793, in forward
src2 = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/Open-GroundingDino/models/GroundingDINO/ms_deform_attn.py", line 338, in forward
output = MultiScaleDeformableAttnFunction.apply(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/content/Open-GroundingDino/models/GroundingDINO/ms_deform_attn.py", line 53, in forward
output = _C.ms_deform_attn_forward(
NameError: name '_C' is not defined

Usage of arguement " --options text_encoder_type=/path/to/bert-base-uncased"

Hi, I am wondering if you could elaborate a little more on the purpose and functionality of the argument --options text_encoder_type=/path/to/bert-base-uncased. Specifically, under what circumstances should one use it? I tried fine-tuning on an object detection dataset without visual grounding and noticed that even without this argument, the training loss still converges.

Thank you so much!

vg dataset test

May I ask how to evaluate the vg model if the training supports vg format but the test does not?
