foundationvision / glee

[CVPR2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale

Home Page: https://glee-vision.github.io/

License: MIT License

Python 94.59% Shell 0.28% C++ 1.73% Cuda 3.32% Dockerfile 0.07% CMake 0.01%
foundation-model object-detection open-world tracking open-vocabulary-detection open-vocabulary-segmentation open-vocabulary-video-segmentation referring-expression-comprehension referring-expression-segmentation video-instance-segmentation

glee's Introduction

GLEE: General Object Foundation Model for Images and Videos at Scale

Junfeng Wu*, Yi Jiang*, Qihao Liu, Zehuan Yuan, Xiang Bai, and Song Bai

* Equal Contribution, Correspondence

[Project Page] [Paper] [HuggingFace Demo] [Video Demo]


(figure: data demo)

Highlight:

  • GLEE is accepted by CVPR 2024 as a Highlight!
  • GLEE is a general object foundation model jointly trained on over ten million images from various benchmarks with diverse levels of supervision.
  • GLEE is capable of addressing a wide range of object-centric tasks simultaneously while maintaining SOTA performance.
  • GLEE demonstrates remarkable versatility and robust zero-shot transferability across a spectrum of object-level image and video tasks, and is able to serve as a foundational component for enhancing other architectures or models.

We will release the following for GLEE:

  • Demo Code

  • Model Zoo

  • Comprehensive User Guide

  • Training Code and Scripts

  • Detailed Evaluation Code and Scripts

  • Tutorial for Zero-shot Testing or Fine-tuning GLEE on New Datasets

Getting started

  1. Installation: Please refer to INSTALL.md for more details.
  2. Data preparation: Please refer to DATA.md for more details.
  3. Training: Please refer to TRAIN.md for more details.
  4. Testing: Please refer to TEST.md for more details.
  5. Model zoo: Please refer to MODEL_ZOO.md for more details.

Run the demo APP

Try our online demo app on [HuggingFace Demo] or use it locally:

git clone https://github.com/FoundationVision/GLEE
cd GLEE
# the demo supports both CPU and GPU
python app.py
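
For reference, app.py builds the model roughly as follows. This is a minimal sketch assembled from the app.py excerpts quoted in the issues further down; the config path and constructor arguments are taken from those excerpts and may differ between releases.

import torch
from detectron2.config import get_cfg
from projects.GLEE.glee.models.glee_model import GLEE_Model
from projects.GLEE.glee.config import add_glee_config

# build a detectron2 config and add the GLEE-specific options
cfg_r50 = get_cfg()
add_glee_config(cfg_r50)                           # assumed to follow the usual detectron2 add_*_config(cfg) pattern
cfg_r50.merge_from_file("GLEE/configs/R50.yaml")   # config path is an assumption; see the tracebacks below

device = "cuda" if torch.cuda.is_available() else "cpu"
# constructor call mirrors the one quoted in the issues: GLEE_Model(cfg_r50, None, device, None, True)
GLEEmodel = GLEE_Model(cfg_r50, None, device, None, True).to(device)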

Introduction

GLEE has been trained on over ten million images from 16 datasets, fully harnessing both existing annotated data and cost-effective automatically labeled data to construct a diverse training set. This extensive training regime endows GLEE with formidable generalization capabilities.

(figure: data demo)

GLEE consists of an image encoder, a text encoder, a visual prompter, and an object decoder, as illustrated in the figure below. The text encoder processes arbitrary descriptions related to the task, including 1) object category lists, 2) object names in any form, 3) captions about objects, and 4) referring expressions. The visual prompter encodes user inputs such as 1) points, 2) bounding boxes, and 3) scribbles during interactive segmentation into corresponding visual representations of the target objects. These are then integrated into the detector, which extracts objects from images according to the textual and visual input.

(figure: GLEE pipeline)

Based on the above designs, GLEE can be used to seamlessly unify a wide range of object perception tasks in images and videos, including object detection, instance segmentation, grounding, multi-target tracking (MOT), video instance segmentation (VIS), video object segmentation (VOS), interactive segmentation and tracking, and supports open-world/large-vocabulary image and video detection and segmentation tasks.
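
At inference time these tasks all go through a single forward call of the model. Below is a minimal sketch of the text-prompted image path only, modeled on the call quoted in the issues further down; infer_image, prompt_list, and the category names are illustrative assumptions, so treat this as the call shape rather than an exact API.

# illustrative open-vocabulary category list; infer_image and prompt_list are
# prepared by the demo's preprocessing code and are not reproduced here
batch_category_name = ["person", "dog", "frisbee"]
outputs = GLEEmodel(infer_image, prompt_list, task="coco",
                    batch_name_list=batch_category_name, is_train=False)
# other capabilities (referring expressions, point/box/scribble prompts, video tasks)
# are selected by changing task and the prompt inputs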

Results

Image-level tasks

(figure: image-level task results)

(figure: ODinW results)

Video-level tasks

(figure: video-level task results)

(figures: VIS, VOS, and RVOS results)

Citing GLEE

@misc{wu2023GLEE,
  author={Junfeng Wu and Yi Jiang and Qihao Liu and Zehuan Yuan and Xiang Bai and Song Bai},
  title={General Object Foundation Model for Images and Videos at Scale},
  year={2023},
  eprint={2312.09158},
  archivePrefix={arXiv}
}

Acknowledgments

  • Thanks to UNINEXT for the implementation of multi-dataset training and data processing.

  • Thanks to VNext for providing experience with Video Instance Segmentation (VIS).

  • Thanks to SEEM for providing the implementation of the visual prompter.

  • Thanks to MaskDINO for providing a powerful detector and segmenter.

glee's People

Contributors

wjf5203

glee's Issues

Question about reproducing the COCO metrics

Hello, I tested your model on the COCO dataset and found that the evaluation metrics I obtained fall short of those reported in the paper. I'd like to know whether there are differences between the currently provided demo and the full testing pipeline?

too many values to unpack (expected 2)

The following problem occurs at runtime; how can I resolve it?
File "E:\hezt\vis\lib\site-packages\gradio\queueing.py", line 527, in process_events
response = await route_utils.call_process_api(
File "E:\hezt\vis\lib\site-packages\gradio\route_utils.py", line 270, in call_process_api
output = await app.get_blocks().process_api(
File "E:\hezt\vis\lib\site-packages\gradio\blocks.py", line 1847, in process_api
result = await self.call_function(
File "E:\hezt\vis\lib\site-packages\gradio\blocks.py", line 1433, in call_function
prediction = await anyio.to_thread.run_sync(
File "E:\hezt\vis\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "E:\hezt\vis\lib\site-packages\anyio_backends_asyncio.py", line 2144, in run_sync_in_worker_thread
return await future
File "E:\hezt\vis\lib\site-packages\anyio_backends_asyncio.py", line 851, in run
result = context.run(func, *args)
File "E:\hezt\vis\lib\site-packages\gradio\utils.py", line 805, in wrapper
response = f(*args, **kwargs)
File "E:\tool\GLEE-new\app.py", line 169, in segment_image
(outputs,_) = GLEEmodel(infer_image, prompt_list, task="coco", batch_name_list=batch_category_name, is_train=False)
ValueError: too many values to unpack (expected 2)

About visual prompts

Hello, GLEE is excellent work. I have some questions about a few details of the algorithm and would appreciate it if you could reply when you have time. Thanks!

  1. When I use points as visual prompts, does GLEE support negative clicks? Can a target be refined with multiple clicks, as in SAM?
  2. Looking at your implementation, it seems that points are converted into a box before being used as the prompt. Why is this done? I couldn't find an explanation in the paper.
  3. Can the topk_instance returned for a visual prompt only be 1? Can it segment multiple parts of an occluded target?
    Thanks!

Cross-image detection

Hello, I've tried GLEE, excellent work! I'd like to ask whether GLEE supports cross-image detection: specifically, providing a scribble or bounding box on a first image and then detecting the referenced target in a second image. I saw a similar capability for video; is it also supported for static images?

Detail about object detection decoder.

Hi there. I believe GLEE is great work, thanks for open-sourcing it!
I have a question about object detection: what is the input to the decoder when GLEE is used as an object detector?
Does it need object queries that include box positions from anchor boxes as input?
If I'm not mistaken, MaskDINO feeds box positions from anchors and masks as the object queries.
So, what does the object query look like in GLEE when it is used as an object detector?
Looking forward to your reply, thanks a lot!

No module named 'projects.GLEE'

Hello: when running python app.py on Windows, I hit the following at GLEE\app.py, line 18:
from projects.GLEE.glee.models.glee_model import GLEE_Model
ModuleNotFoundError: No module named 'projects.GLEE'
The projects folder is right there in the current directory, so why can't it be loaded? Looking forward to a reply.
The main error is as follows: Traceback (most recent call last):
File "E:\tool\GLEE\app.py", line 18, in
from projects.GLEE.glee.models.glee_model import GLEE_Model
ModuleNotFoundError: No module named 'projects.GLEE'
The relevant part of app.py is:
import gradio as gr
import numpy as np
import cv2
import torch

from detectron2.config import get_cfg
import sys
#sys.path.insert(0, 'E:/tool/GLEE-main')
#sys.path.append('E:\tool\GLEE\projects\')
from projects.GLEE.glee.models.glee_model import GLEE_Model
from projects.GLEE.glee.config import add_glee_config
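
For what it's worth, this ModuleNotFoundError usually means the repository root (the folder containing projects/) is not on Python's import path; the commented-out sys.path lines in the snippet above hint at the usual workaround. A minimal sketch, assuming the reporter's paths:

import sys
# make the repository root importable before the projects.GLEE imports
sys.path.insert(0, r"E:\tool\GLEE")
from projects.GLEE.glee.models.glee_model import GLEE_Model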

How to run the demo locally and correctly?

Hi! I've read the README and tried to run the demo on my server, but I think a lot of the code is out of sync or missing, and the guides are incomplete.

Here are my steps:

  1. I installed the dependencies according to INSTALL.md
  2. I ran the app.py in this repo
  3. A lot of models are missing:
    • I need the GLEE models, so I downloaded them from the HuggingFace demo directory
    • Misc models are missing, so I found the ones mentioned in TRAIN.md and downloaded them
  4. There are bugs in app.py on lines like (outputs,_) = GLEEmodel(...), which should be ((outputs, _), _, _) = GLEEmodel(...) (see the sketch after this issue)
  5. Then I ran the app.py in this repo again, but the results are just random like below.
(attached screenshot)

Did I do anything wrong? Should I just clone the huggingface repo instead?
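
For reference, the unpacking adjustment mentioned in step 4 would look roughly like this (a sketch based only on the calls quoted in these issues; the actual return structure may differ between GLEE versions):

# original call in app.py, as quoted in the traceback above:
# (outputs, _) = GLEEmodel(infer_image, prompt_list, task="coco", batch_name_list=batch_category_name, is_train=False)
# adjustment suggested in step 4 (the model appears to return a nested tuple):
((outputs, _), _, _) = GLEEmodel(infer_image, prompt_list, task="coco",
                                 batch_name_list=batch_category_name, is_train=False)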

Error when running locally

Hello author, thank you very much for your work. When I run it locally, I get a missing-file error:
Config '/mnt/yrfs/userdata/hsp/projects/GLEE/app/GLEE/configs/R50.yaml' has no VERSION. Assuming it to be compatible with latest v2.
Traceback (most recent call last):
File "app.py", line 90, in
GLEEmodel_r50 = GLEE_Model(cfg_r50, None, device, None, True).to(device)
File "/mnt/yrfs/userdata/hsp/projects/GLEE/app/GLEE/glee/models/glee_model.py", line 67, in init
self.text_encoder = CLIPTextModel.from_pretrained('GLEE/clip_vit_base_patch32')
File "/home/hsp/anaconda3/envs/GLEE/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3206, in from_pretrained
raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory GLEE/clip_vit_base_patch32.
Where should I download the corresponding files?

The training code

Hi, thanks for the solid work. Could you let me know when you'll release the training code?

About a Plus (SwinL) version of the video-task models

The GitHub page only provides two models for image tasks (R50 and SwinL), and in the files of the HuggingFace demo I found an R50 version for video tasks (visual prompt, GLEE_vos_r50.pth). Could the authors also open-source a SwinL version for the video tasks? Is it missing because the GPU used on HuggingFace can't run the SwinL version?
Also, regarding the usage experience, I found that the model performs poorly on language prompts it hasn't seen: for example, with a custom list it doesn't recognize "head", and only "human head" has a chance of giving a (fairly poor) result.
