
segment-everything-everywhere-all-at-once's Introduction

👀SEEM: Segment Everything Everywhere All at Once

🍇 [Read our arXiv Paper]   🍎 [Try our Demo]

We introduce SEEM, a model that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types, including visual prompts (points, marks, boxes, scribbles, and image segments) and language prompts (text and audio). It can also work with any combination of prompts or generalize to custom prompts!

by Xueyan Zou*, Jianwei Yang*, Hao Zhang*, Feng Li*, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao^, Yong Jae Lee^, in NeurIPS 2023.

A brief introduction to all the generic and interactive segmentation tasks we can do!

SEEM design

🚀 Updates

  • [2023.11.2] SEEM is applied in LLaVA-Interactive: an all-in-one demo for Image Chat, Segmentation, Generation and Editing. Experience the future of interactive image editing with visual chat. [Project Page] [Demo] [Code] [Paper]
  • [2023.10.23] SEEM is used in Set-of-Mark Prompting: a brand-new visual prompting technique for GPT-4V! It totally unleashes the extraordinary visual grounding power of GPT-4V! [Project Page] [Code] [Paper]
  • [2023.10.10] We release the training logs for SEEM-Large-v1 and SEEM-Tiny-v1!
  • [2023.10.04] We are excited to release ✅ training/evaluation/demo code, ✅ new checkpoints, and ✅ comprehensive readmes for both X-Decoder and SEEM!
  • [2023.09.25] Our work has been accepted to NeurIPS 2023!
  • [2023.07.27] We are excited to release our X-Decoder training code! We will release its descendant SEEM training code very soon!
  • [2023.07.10] We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoint are available!
  • [2023.05.02] We have released the SEEM Focal-L and X-Decoder Focal-L checkpoints and configs!
  • [2023.04.28] We have updated the arXiv paper; it shows better interactive segmentation results than SAM, which was trained on 50x more data than ours!
  • [2023.04.26] We have released the Demo Code and SEEM-Tiny Checkpoint! Please try the one-line demo to get started!
  • [2023.04.20] SEEM Referring Video Segmentation is out! Please try the Video Demo and take a look at the NeRF examples.

📑 Catalog

We release the following contents for both SEEM and X-Decoder:

  • Demo Code
  • Model Checkpoint
  • Comprehensive User Guide
  • Training Code
  • Evaluation Code

👉 One-Line SEEM Demo with Linux:

git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && sh assets/scripts/run_demo.sh

📍 [New] Getting Started:

📍 [New] Latest Checkpoints and Numbers:

| Method | Checkpoint | Backbone | COCO PQ ↑ | COCO mAP ↑ | COCO mIoU ↑ | Ref-COCOg cIoU ↑ | Ref-COCOg mIoU ↑ | Ref-COCOg AP50 ↑ | VOC NoC85 ↓ | VOC NoC90 ↓ | SBD NoC85 ↓ | SBD NoC90 ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X-Decoder | ckpt | Focal-T | 50.8 | 39.5 | 62.4 | 57.6 | 63.2 | 71.6 | - | - | - | - |
| X-Decoder-oq201 | ckpt | Focal-L | 56.5 | 46.7 | 67.2 | 62.8 | 67.5 | 76.3 | - | - | - | - |
| SEEM_v0 | ckpt | Focal-T | 50.6 | 39.4 | 60.9 | 58.5 | 63.5 | 71.6 | 3.54 | 4.59 | * | * |
| SEEM_v0 | - | Davit-d3 | 56.2 | 46.8 | 65.3 | 63.2 | 68.3 | 76.6 | 2.99 | 3.89 | 5.93 | 9.23 |
| SEEM_v0 | ckpt | Focal-L | 56.2 | 46.4 | 65.5 | 62.8 | 67.7 | 76.2 | 3.04 | 3.85 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-B | 52.0 | 43.5 | 60.2 | 54.1 | 62.2 | 69.3 | 2.53 | 3.23 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-L | 49.0 | 41.6 | 58.2 | 53.8 | 62.2 | 69.5 | 2.40 | 2.96 | * | * |
| SEEM_v1 | ckpt/log | Focal-T | 50.8 | 39.4 | 60.7 | 58.5 | 63.7 | 72.0 | 3.19 | 4.13 | * | * |
| SEEM_v1 | ckpt/log | Focal-L | 56.1 | 46.3 | 65.8 | 62.4 | 67.8 | 76.0 | 2.66 | 3.44 | * | * |

SEEM_v0: supports training and inference with a single interactive object.
SEEM_v1: supports training and inference with multiple interactive objects.
NoC85/NoC90 denote the average number of clicks needed to reach 85%/90% IoU.

🔥 Related projects:

  • FocalNet and DaViT: We used FocalNet and DaViT as the vision backbones.
  • UniCL: We used the unified contrastive learning technique for learning image-text representations.
  • X-Decoder: We built SEEM on top of X-Decoder, a generalist decoder that can perform multiple tasks with a single model.

🔥 Other projects you may find interesting:

  • Semantic-SAM: a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity.
  • OpenSeed : Strong open-set segmentation methods.
  • Grounding SAM : Combining Grounding DINO and Segment Anything; Grounding DINO: A strong open-set detection model.
  • X-GPT : Conversational Visual Agent supported by X-Decoder.
  • LLaVA : Large Language and Vision Assistant.

💡 Highlights

Inspired by the appealing universal interface of LLMs, we advocate a universal, interactive, multi-modal interface for any type of segmentation with ONE SINGLE MODEL. We emphasize four important features of SEEM below.

  1. Versatility: works with various types of prompts, for example clicks, boxes, polygons, scribbles, texts, and referring images;
  2. Compositionality: deals with any composition of prompts;
  3. Interactivity: interacts with the user over multiple rounds, thanks to SEEM's memory prompt for storing the session history;
  4. Semantic awareness: gives a semantic label to any predicted mask.

🦄 How to use the demo

  • Try our default examples first;
  • Upload an image;
  • Select at least one type of prompt (if you want to use a referred region from another image, check "Example" and upload another image in the referring-image panel);
  • Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
  • By default, our model supports the vocabulary of the 80 COCO categories; other objects will be classified as 'others' or misclassified. If you want to segment using open-vocabulary labels, include the text label via the 'Text' button after drawing scribbles.
  • Click "Submit" and wait for a few seconds.

🌋 An interesting example

An example with Transformers. The referring image is the truck form of Optimus Prime. Our model can always segment Optimus Prime in target images no matter which form it is in. Thanks to Hongyang Li for this fun example.

assets/images/transformers_gh.png

🌷 NeRF Examples

  • Inspired by the example in SA3D, we tried SEEM on NeRF examples and it works well :)

🏕️ Click, scribble to mask

With a simple click or stroke from the user, we can generate the masks and the corresponding category labels for them.

SEEM design

🏔️ Text to mask

SEEM can generate masks from the user's text input, providing multi-modal interaction with humans.

example

🕌 Referring image to mask

With a simple click or stroke on the referring image, the model is able to segment the objects with similar semantics on the target images. example

SEEM understands spatial relationships very well. Look at the three zebras! The segmented zebras have positions similar to those of the referred zebras. For example, when the leftmost zebra is referred to in the upper row, the leftmost zebra in the bottom row is segmented. example

🌼 Referring image to video mask

With no training on video data, SEEM works perfectly for segmenting videos with whatever queries you specify! example

🌻 Audio to mask

We use Whisper to turn audio into a text prompt and then segment the object. Try it in our demo!

assets/images/audio.png
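A minimal sketch of the audio-to-text step, assuming the openai-whisper package and a hypothetical audio path; the transcription is then used like any other text prompt:

```python
# Sketch only: assumes `pip install openai-whisper`; the file path below is hypothetical.
import whisper

asr = whisper.load_model("base")                     # small model, for illustration
result = asr.transcribe("assets/audio/example.wav")  # hypothetical audio file
text_prompt = result["text"].strip()
print(text_prompt)  # the text is then passed to SEEM as a regular text prompt
```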

🌳 Examples of different styles

An example of segmenting a meme.

assets/images/emoj.png

An example of segmenting trees in cartoon style.

assets/images/trees_text.png

An example of segmenting a Minecraft image.

assets/images/minecraft.png
An example of using a referring image on a popular teddy bear.

example

Model

SEEM design

Comparison with SAM

In the figure below, we compare the levels of interaction and semantics of three segmentation tasks (edge detection, open-set, and interactive segmentation). Open-set segmentation usually requires a high level of semantics and no interaction. Compared with SAM, SEEM covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types such as points and boxes, and it misses high-semantic tasks since it does not output semantic labels itself. There are two reasons: first, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space, so it supports more general usage and has the potential to extend to custom prompts; second, SEEM works very well on text-to-mask (grounding segmentation) and outputs semantics-aware predictions.

assets/images/compare.jpg
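To make the "joint representation space" point above concrete, here is a purely conceptual sketch (not the actual SEEM code; the class, dimensions, and projections are illustrative assumptions) of projecting heterogeneous prompts into one embedding space before a shared decoder consumes them:

```python
import torch
import torch.nn as nn

class UnifiedPromptEncoder(nn.Module):
    """Conceptual sketch: map point/box/text prompts into one joint prompt space."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.point_proj = nn.Linear(2, dim)    # (x, y) click coordinates
        self.box_proj = nn.Linear(4, dim)      # (x1, y1, x2, y2) boxes
        self.text_proj = nn.Linear(768, dim)   # e.g. language-encoder embeddings

    def forward(self, prompts: dict) -> torch.Tensor:
        tokens = []
        if "points" in prompts:
            tokens.append(self.point_proj(prompts["points"]))
        if "boxes" in prompts:
            tokens.append(self.box_proj(prompts["boxes"]))
        if "text" in prompts:
            tokens.append(self.text_proj(prompts["text"]))
        return torch.cat(tokens, dim=0)        # one sequence of prompt queries

encoder = UnifiedPromptEncoder()
queries = encoder({"points": torch.rand(3, 2), "text": torch.rand(1, 768)})
print(queries.shape)  # torch.Size([4, 512])
```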

💘 Acknowledgements

  • We thank Hugging Face for the GPU support for our demo!

segment-everything-everywhere-all-at-once's People

Contributors

eltociear, fengli-ust, haozhang534, jwyang, linjieli222, maureenzou, simonebonato


segment-everything-everywhere-all-at-once's Issues

Would you mind adding a link to SA3D?

Hi there,

This is Lingxi Xie, one of the co-authors of SA3D.

I read the paper and code and found the project quite interesting and inspiring!

I saw that you referred to SA3D and completed the segmentation on NeRF. If the mentioned SA3D in the repo was our project (sorry if it was not), would you mind adding a link to the text in the readme file so that readers can get the message?

Best,
Lingxi

code for: Referring image to mask

Hello,
Is this repo only for a demo, or do you plan to release the code, at least for the referring-image-to-mask feature?
Thanks

about the testing results

image
Hello, authors! Nice work. Does your project provide the code to obtain the results mentioned in the paper?

Code release

This is great work. When do you estimate the code will be released? Thanks for your contribution to the community.

Suggestion - Integrate MobileSAM into the pipeline for lightweight and faster inference

Reference: https://github.com/ChaoningZhang/MobileSAM

Our project performs on par with the original SAM and keeps exactly the same pipeline as the original SAM except for a change to the image encoder; therefore, it is easy to integrate into any project.

MobileSAM is around 60 times smaller and around 50 times faster than the original SAM, and it is around 7 times smaller and around 5 times faster than the concurrent FastSAM. The comparison of the whole pipeline is summarized as follows:

image

image

Best Wishes,

Qiao

Question about the cost function

Dear author,

Thanks so much for your contribution and the inference code.

I couldn't find the explanation of your loss function in the paper; could you please briefly describe it? Is it based on a linear combination of the focal and dice losses, like the one in MaskFormer?

Kind Regards,
Yuyuan
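For readers unfamiliar with the combination the question refers to, below is a minimal, illustrative sketch of a MaskFormer-style mix of sigmoid focal loss and dice loss; this is not the authors' implementation, and the loss weights are hypothetical:

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-pixel focal loss on mask logits (targets are 0/1 masks)."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    loss = ce * ((1 - p_t) ** gamma)
    loss = loss * (alpha * targets + (1 - alpha) * (1 - targets))
    return loss.mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss over flattened masks."""
    prob = logits.sigmoid().flatten(1)
    targets = targets.flatten(1)
    inter = 2 * (prob * targets).sum(-1)
    union = prob.sum(-1) + targets.sum(-1)
    return (1 - (inter + eps) / (union + eps)).mean()

mask_logits = torch.randn(4, 64, 64)              # predicted mask logits
gt_masks = (torch.rand(4, 64, 64) > 0.5).float()  # ground-truth binary masks
loss = 20.0 * sigmoid_focal_loss(mask_logits, gt_masks) + 1.0 * dice_loss(mask_logits, gt_masks)
print(loss.item())
```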

Demo effectiveness

In some scenarios, the results do not seem very good...

ValueError: RGBA values should be within 0-1 range

Sometimes color values go beyond 1.

File "/home/aadalarasan/coda/Segment-Everything-Everywhere-All-At-Once/demo_code/tasks/interactive.py", line 113, in interactive_infer_image
demo = visual.draw_panoptic_seg(pano_seg.cpu(), pano_seg_info) # rgb Image
File "/home/aadalarasan/coda/Segment-Everything-Everywhere-All-At-Once/demo_code/utils/visualizer.py", line 543, in draw_panoptic_seg
self.overlay_instances(masks=masks, labels=labels, assigned_colors=colors, alpha=alpha)
File "/home/aadalarasan/coda/Segment-Everything-Everywhere-All-At-Once/demo_code/utils/visualizer.py", line 745, in overlay_instances
self.draw_text(
File "/home/aadalarasan/coda/Segment-Everything-Everywhere-All-At-Once/demo_code/utils/visualizer.py", line 890, in draw_text
color = np.maximum(list(mplc.to_rgb(color)), 0.2)
File "/home/aadalarasan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/matplotlib/colors.py", line 496, in to_rgb
return to_rgba(c)[:3]
File "/home/aadalarasan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/matplotlib/colors.py", line 299, in to_rgba
rgba = _to_rgba_no_colorcycle(c, alpha)
File "/home/aadalarasan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/matplotlib/colors.py", line 395, in _to_rgba_no_colorcycle
raise ValueError("RGBA values should be within 0-1 range")
ValueError: RGBA values should be within 0-1 range
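Until the root cause is fixed, one possible workaround (an assumption, not an official patch) is to clip the color triple into [0, 1] before matplotlib sees it, mirroring the failing line in draw_text with a small hypothetical helper:

```python
import numpy as np
import matplotlib.colors as mplc

def safe_rgb(color, floor=0.2):
    """Clip an (r, g, b) triple into [0, 1] before conversion, then apply the brightness floor."""
    rgb = tuple(np.clip(np.asarray(color, dtype=float), 0.0, 1.0))
    return np.maximum(list(mplc.to_rgb(rgb)), floor)

print(safe_rgb((1.03, -0.01, 0.6)))  # -> [1.  0.2 0.6]
```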

Losses used in training

Hi! Wonderful project you have here :)
Just wondering if you could share all the losses you use in the training step, since they are not mentioned in the paper.
thanks a lot!

Questions about water surfaces and mirrors

Thanks for your great work! I noticed that your demo can identify water surfaces and mirrors well without misclassifying them due to reflections. Is there any specific design for this?

Code Release

Thanks for the great work! The demo is very inspiring, and we are excited about its potential applications in downstream tasks, especially in robotics. Do you have an ETA for when the code will be released?

We would love to play with the thresholds (there seem to be many false positives in the demo) to see if this is useful out of the box. Please let me know! Happy to reach out privately as well.

Thanks
Dhruv (UC Berkeley)

Differences between SEEM Focal-L and the Huggingface Demo model?

The demo outputs a warning:

The current model is run on SEEM Focal-L, for best performance refer to our demo.

The performance of the model then seems to be worse than that of the SEEM demo on Hugging Face. In particular, I've noticed that segmentations with referring text are worse; they "splash" onto neighboring objects. What are the differences between the official demo on Hugging Face and the published SEEM Focal-L checkpoint and config?

Video demo with Referring Text

As far as I understand, the video demo only supports drawing on the referring image.
Is there any limitation here? Can we use text or audio prompts?

Some question about the paper

Thank you for the great work! I am having some trouble understanding Sec. 3, subsection "Compositional", of the paper. What is the function "Match" in Eq. (5) referring to? Are there papers I can turn to?

Question about training with non-prompt, visual-prompt, and text-prompt

Thanks for sharing the cool results! I have one more detailed question.
When you train with no prompt, text prompts, and visual prompts, how do you train them all together?
Does it mean that during training, for every batch, you randomly pick one task out of the three (no prompt, visual prompt, text prompt)?
It is not clear how you can train them together when the given prompts are not aligned with each other; for example, the text prompt covers everything (e.g., a caption of the whole image describing all the objects, say 10), while the visual prompt only has 1 or 2 points representing 1 or 2 objects.

Thank you!

Results on open-vocabulary panoptic segmentation

In your paper, you mention

"...strong performance on many segmentation tasks including closed-set and open-set
panoptic segmentation, ...

I cannot seem to find the section on open-set panoptic segmentation. Do you have results on this task?

Checkpoint for Gradio Demo

Thanks for open-sourcing your great work!

I have tried your Gradio demo and it works very well. I wonder whether your released checkpoint, i.e., seem_focalt_v1.pt, is the same one that was used in the demo. Thanks!

run error

When I run the app.py script, I get the error below. How can I run it successfully? Thank you.

Traceback (most recent call last):
  File "D:\Projects\Segment-Everything-Everywhere-All-At-Once\demo_code\app.py", line 18, in <module>
    import whisper
  File "D:\Applications\Anaconda3\envs\seem\lib\site-packages\whisper.py", line 69, in <module>
    libc = ctypes.CDLL(libc_name)
  File "D:\Applications\Anaconda3\envs\seem\lib\ctypes\__init__.py", line 364, in __init__
    if '/' in name or '\\' in name:
TypeError: argument of type 'NoneType' is not iterable
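A likely cause (an assumption based on the traceback, which points at a single site-packages/whisper.py) is that the PyPI package whisper (the Graphite time-series library) was installed instead of openai-whisper. A quick diagnostic sketch:

```python
# openai-whisper installs a whisper/ package that exposes load_model();
# the unrelated Graphite "whisper" package is a single whisper.py without it.
import whisper

print(whisper.__file__)                # a lone whisper.py suggests the wrong package
print(hasattr(whisper, "load_model"))  # False here => pip uninstall whisper, then pip install openai-whisper
```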

Code release and some typos in the paper

Hi authors,

Thanks for the amazing work!
I really like this paper.
I'd like to know when you plan to open-source the code and checkpoints.

There are some typos in this paper. (I will update this list if I find more.)

  1. line 3 of Fig 3 caption: "object".
  2. First paragraph of Section 3: the red marks for the three types of prompts are not consistent. Also, they are not reflected in Eq. (3), but in Eq. (1).

No such file or directory:Segment-Everything-Everywhere-All-At-Once-main/demo_code/null

I have a problem with this, can you explain it to me?

ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/nts1/miniconda3/envs/SEEM/lib/python3.9/site-packages/starlette/responses.py", line 335, in call
stat_result = await anyio.to_thread.run_sync(os.stat, self.path)
File "/home/nts1/miniconda3/envs/SEEM/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/nts1/miniconda3/envs/SEEM/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/nts1/miniconda3/envs/SEEM/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
FileNotFoundError: [Errno 2] No such file or directory: '/home/nts1/users/tungdd7/Segment-Everything-Everywhere-All-At-Once-main/demo_code/null'

AttributeError: module 'enum' has no attribute 'IntFlag'

When I created the environment in Anaconda and ran "pip install -r requirements.txt", this error appeared. How can I solve it?

File Not Found: seem_focall_v1.pt

When I try to run app.py I get this error:
FileNotFoundError: [Errno 2] No such file or directory: 'seem_focall_v1.pt'

The link (https://projects4jw.blob.core.windows.net/x-decoder/release/seem_focall_v1.pt) seems to go to this page:
This XML file does not appear to have any style information associated with it. The document tree is shown below.

PublicAccessNotPermitted
Public access is not permitted on this storage account. RequestId:b7249e77-701e-0067-6579-968f11000000 Time:2023-06-04T00:17:46.1304179Z

Any assistance would be appreciated.

RuntimeError: "upsample_bilinear2d_channels_last" not implemented for 'Byte'

I encountered an error when using the sample with the video task. How can I solve this problem?

Traceback (most recent call last):
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
output = await app.get_blocks().process_api(
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/gradio/blocks.py", line 1302, in process_api
result = await self.call_function(
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/gradio/blocks.py", line 1025, in call_function
prediction = await anyio.to_thread.run_sync(
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "app.py", line 65, in inference
return interactive_infer_video(model, audio, image, task, *args, **kwargs)
File "/data2/nkm/Segment-Everything-Everywhere-All-At-Once-main/demo_code/tasks/interactive.py", line 226, in interactive_infer_video
refimg_mask = (F.interpolate(refimg_mask, (_height, _width), mode='bilinear', align_corners=True) > 0)
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/torch/nn/functional.py", line 3731, in interpolate
return torch._C._nn.upsample_bilinear2d(input, output_size, align_corners, scale_factors)
RuntimeError: "upsample_bilinear2d_channels_last" not implemented for 'Byte'

One or All prompts?

May I know whether each model corresponds to a specific prompt, or whether one model can handle all the prompts you have listed at the same time (meaning you use a massive mix of all of these prompts, in different fashions, at the same time to train a single model)?

About Ref-COCO dataset overlapping.

I'd like to express my appreciation for your excellent work; it is both engaging and insightful. However, I have some confusion regarding the experiment detailed in Chapter 4.

In the paper, you mentioned using a combination of Ref-COCO, Ref-COCOg, and Ref-COCO+ for COCO image annotations in the referring segmentation task. Then, you report your evaluation on Ref-COCOg. While I find this approach interesting, I'm not quite sure what you mean by "combination." Additionally, I am concerned about the potential for data leakage since Ref-COCO, Ref-COCOg, and Ref-COCO+ are three types of annotations on the same image dataset, which might lead to overlap between the training and test sets of different annotations. Could you please provide further clarification on this experimental part? Thank you!

can't download the released model

I couldn't download the released models.
How can I get them?

SEEM Focal-L and X-Decoder Focal-L checkpoints.

Below is the error message.

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>Public access is not permitted on this storage account. RequestId:5c2d36f8-601e-0012-2ca8-93d64b000000 Time:2023-05-31T10:11:10.4213753Z</Message>
</Error>

Always segments person when using text

Hello!

Thank you for releasing this great work! I'm trying out the demo on some fashion images and I primarily get the entire person in the segmentation when using text. I have provided some examples below. Thank you!

image

image

the code doesn't work for me

Following the guide, I managed to install all the requirements in a brand-new conda env. I tried to run the zebra example (which I am most interested in) but got no segmentation results. I tried other examples as well, but in vain (no segmentation results at all).
image

The output from my terminal console seemed all right, with no error messages except some warnings. Here is the package list I installed; would you be so kind as to tell me why I failed:

absl-py 1.4.0
accelerate 0.19.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.0.1
antlr4-python3-runtime 4.9.3
anyio 3.7.0
appdirs 1.4.4
astunparse 1.6.3
async-timeout 4.0.2
attrs 23.1.0
black 21.4b2
cachetools 5.3.1
certifi 2023.5.7
charset-normalizer 3.1.0
cityscapesScripts 2.2.2
click 8.1.3
cloudpickle 2.2.1
cmake 3.26.3
coloredlogs 15.0.1
contourpy 1.0.7
cycler 0.11.0
detectron2 0.6
diffdist 0.1
diffusers 0.11.1
einops 0.6.1
exceptiongroup 1.1.1
fastapi 0.95.2
ffmpy 0.3.0
filelock 3.12.0
flatbuffers 23.5.26
fonttools 4.39.4
frozenlist 1.3.3
fsspec 2023.5.0
ftfy 6.1.1
future 0.18.3
fvcore 0.1.5.post20221221
gast 0.4.0
google-auth 2.19.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
gradio 3.31.0
gradio_client 0.2.5
grpcio 1.54.2
h11 0.14.0
h5py 3.8.0
httpcore 0.17.2
httpx 0.24.1
huggingface-hub 0.14.1
humanfriendly 10.0
hydra-core 1.3.2
idna 3.4
imageio 2.30.0
importlib-metadata 6.6.0
importlib-resources 5.12.0
invisible-watermark 0.1.5
iopath 0.1.9
Jinja2 3.1.2
joblib 1.2.0
json-tricks 3.17.0
jsonschema 4.17.3
keras 2.11.0
kiwisolver 1.4.4
kornia 0.6.4
lazy_loader 0.2
libclang 16.0.0
linkify-it-py 2.0.2
lit 16.0.5
llvmlite 0.40.0
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
mdit-py-plugins 0.3.3
mdurl 0.1.2
more-itertools 9.1.0
mpmath 1.3.0
multidict 6.0.4
mup 1.0.0
mypy-extensions 1.0.0
networkx 3.1
nltk 3.8.1
numba 0.57.0
numpy 1.23.5
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.2
omegaconf 2.3.0
onnx 1.12.0
onnxruntime 1.15.0
openai 0.27.7
openai-whisper 20230314
opencv-python 4.7.0.72
opt-einsum 3.3.0
orjson 3.8.14
packaging 23.1
pandas 2.0.2
pathspec 0.11.1
Pillow 9.5.0
pip 23.0.1
pkgutil_resolve_name 1.3.10
portalocker 2.7.0
protobuf 3.19.6
psutil 5.9.5
pyarrow 12.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycocotools 2.0.4
pydantic 1.10.8
pydot 1.4.2
pydub 0.25.1
Pygments 2.15.1
pyparsing 3.0.9
pyquaternion 0.9.9
pyrsistent 0.19.3
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyWavelets 1.4.1
PyYAML 6.0
regex 2023.5.5
requests 2.31.0
requests-oauthlib 1.3.1
rsa 4.9
scann 1.2.9
scikit-image 0.20.0
scikit-learn 1.2.2
scipy 1.9.1
seaborn 0.12.2
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 67.8.0
shapely 2.0.1
six 1.16.0
sniffio 1.3.0
starlette 0.27.0
sympy 1.12
tabulate 0.9.0
tenacity 8.2.2
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.11.1
tensorflow-estimator 2.11.0
tensorflow-io-gcs-filesystem 0.32.0
termcolor 2.3.0
threadpoolctl 3.1.0
tifffile 2023.4.12
tiktoken 0.3.3
timm 0.4.12
tokenizers 0.12.1
toml 0.10.2
toolz 0.12.0
torch 2.0.1
torchmetrics 0.6.0
torchvision 0.15.2
tqdm 4.65.0
transformers 4.19.2
triton 2.0.0
typing 3.7.4.3
typing_extensions 4.6.2
tzdata 2023.3
uc-micro-py 1.0.2
urllib3 1.26.16
uvicorn 0.22.0
vision-datasets 0.2.2
wcwidth 0.2.6
websockets 11.0.3
Werkzeug 2.3.4
wheel 0.38.4
wrapt 1.15.0
yacs 0.1.8
yarl 1.9.2
zipp 3.15.0

This needs a license

Current licensing is ambiguous because the repo has no license. It makes it hard to adopt this work for further R&D, commercial applications, etc.

Demo code installation

I'm using Windows WSL and trying to run the demo code.

I get an error while installing the requirements. For some reason, installation starts with the GitHub repositories linked in requirements.txt.

While installing from https://github.com/MaureenZOU/detectron2-xyz.git you get an error saying ModuleNotFoundError: No module named 'torch'.

By manually installing the torch module the error is solved, but I don't think this is ideal.

Another issue comes from run_demo.sh; at least in my case, it does not recognize the sudo apt commands, which makes it fail. Again, if I run the commands manually, it works.

How can I suppress the "other" label

I would like to know how to suppress the "other" label in the predicted video.
Also, how can I use my own COCO-format dataset to replace the default one?
Thank you!

Visual referring data process

Hi,
Thank you for your work!
I am very interested in the visual referring part. There is little information in the paper about the data processing for training, so could you please share some key points of the visual-referring data processing for training? Thanks!
