
segment-everything-everywhere-all-at-once's Introduction

👀SEEM: Segment Everything Everywhere All at Once

🍇 [Read our arXiv Paper]   🍎 [Try our Demo]

We introduce SEEM, a model that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types, including visual prompts (points, marks, boxes, scribbles, and image segments) and language prompts (text and audio). It can also work with any combination of prompts or generalize to custom prompts!

by Xueyan Zou*, Jianwei Yang*, Hao Zhang*, Feng Li*, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao^, Yong Jae Lee^, in NeurIPS 2023.

A brief introduction to all the generic and interactive segmentation tasks we can do!

SEEM design

🚀 Updates

  • [2023.11.2] SEEM is applied in LLaVA-Interactive: an all-in-one demo for Image Chat, Segmentation, Generation and Editing. Experience the future of interactive image editing with visual chat. [Project Page] [Demo] [Code] [Paper]
  • [2023.10.23] SEEM is used in Set-of-Mark Prompting: a brand-new visual prompting technique for GPT-4V! It totally unleashes the extraordinary visual grounding power of GPT-4V! [Project Page] [Code] [Paper]
  • [2023.10.10] We release the training logs for SEEM-Large-v1 and SEEM-Tiny-v1!
  • [2023.10.04] We are excited to release ✅ training/evaluation/demo code, ✅ new checkpoints, and ✅ comprehensive readmes for both X-Decoder and SEEM!
  • [2023.09.25] Our work has been accepted to NeurIPS 2023!
  • [2023.07.27] We are excited to release our X-Decoder training code! We will release its descendant SEEM training code very soon!
  • [2023.07.10] We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoint are available!
  • [2023.05.02] We have released the SEEM Focal-L and X-Decoder Focal-L checkpoints and configs!
  • [2023.04.28] We have updated the arXiv paper; it shows better interactive segmentation results than SAM, which was trained on 50x more data than ours!
  • [2023.04.26] We have released the Demo Code and SEEM-Tiny Checkpoint! Please try the one-line demo to get started!
  • [2023.04.20] SEEM Referring Video Segmentation is out! Please try the Video Demo and take a look at the NeRF examples.

📑 Catalog

We release the following contents for both SEEM and X-Decoder:

  • Demo Code
  • Model Checkpoint
  • Comprehensive User Guide
  • Training Code
  • Evaluation Code

👉 One-Line SEEM Demo with Linux:

git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && sh assets/scripts/run_demo.sh

📍 [New] Getting Started:

📍 [New] Latest Checkpoints and Numbers:

| Method | Checkpoint | Backbone | COCO PQ ↑ | COCO mAP ↑ | COCO mIoU ↑ | Ref-COCOg cIoU ↑ | Ref-COCOg mIoU ↑ | Ref-COCOg AP50 ↑ | VOC NoC85 ↓ | VOC NoC90 ↓ | SBD NoC85 ↓ | SBD NoC90 ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X-Decoder | ckpt | Focal-T | 50.8 | 39.5 | 62.4 | 57.6 | 63.2 | 71.6 | - | - | - | - |
| X-Decoder-oq201 | ckpt | Focal-L | 56.5 | 46.7 | 67.2 | 62.8 | 67.5 | 76.3 | - | - | - | - |
| SEEM_v0 | ckpt | Focal-T | 50.6 | 39.4 | 60.9 | 58.5 | 63.5 | 71.6 | 3.54 | 4.59 | * | * |
| SEEM_v0 | - | Davit-d3 | 56.2 | 46.8 | 65.3 | 63.2 | 68.3 | 76.6 | 2.99 | 3.89 | 5.93 | 9.23 |
| SEEM_v0 | ckpt | Focal-L | 56.2 | 46.4 | 65.5 | 62.8 | 67.7 | 76.2 | 3.04 | 3.85 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-B | 52.0 | 43.5 | 60.2 | 54.1 | 62.2 | 69.3 | 2.53 | 3.23 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-L | 49.0 | 41.6 | 58.2 | 53.8 | 62.2 | 69.5 | 2.40 | 2.96 | * | * |
| SEEM_v1 | ckpt/log | Focal-T | 50.8 | 39.4 | 60.7 | 58.5 | 63.7 | 72.0 | 3.19 | 4.13 | * | * |
| SEEM_v1 | ckpt/log | Focal-L | 56.1 | 46.3 | 65.8 | 62.4 | 67.8 | 76.0 | 2.66 | 3.44 | * | * |

SEEM_v0: supports training and inference with a single interactive object.
SEEM_v1: supports training and inference with multiple interactive objects.
NoC85/NoC90 denote the average number of clicks needed to reach 85%/90% IoU.

🔥 Related projects:

  • FocalNet and DaViT: We used FocalNet and DaViT as the vision backbones.
  • UniCL: We used the unified contrastive learning technique for learning image-text representations.
  • X-Decoder: We built SEEM on top of X-Decoder, a generalist decoder that can perform multiple tasks with a single model.

🔥 Other projects you may find interesting:

  • Semantic-SAM: a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity.
  • OpenSeed : Strong open-set segmentation methods.
  • Grounding SAM : Combining Grounding DINO and Segment Anything; Grounding DINO: A strong open-set detection model.
  • X-GPT : Conversational Visual Agent supported by X-Decoder.
  • LLaVA : Large Language and Vision Assistant.

💡 Highlights

Inspired by the appealing universal interface of LLMs, we advocate a universal, interactive, multi-modal interface for any type of segmentation with ONE SINGLE MODEL. We emphasize four important features of SEEM below.

  1. Versatility: works with various types of prompts, for example clicks, boxes, polygons, scribbles, texts, and referring images;
  2. Compositionality: deals with any composition of prompts;
  3. Interactivity: interacts with the user over multiple rounds, thanks to SEEM's memory prompt for storing the session history;
  4. Semantic awareness: gives a semantic label to any predicted mask.

🦄 How to use the demo

  • Try our default examples first;
  • Upload an image;
  • Select at least one type of prompt (if you want to use a referred region from another image, check "Example" and upload another image in the referring-image panel);
  • Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
  • By default, our model supports the vocabulary of the 80 COCO categories; other objects will be classified as 'others' or misclassified. If you want to segment using open-vocabulary labels, include the text label via the 'Text' button after drawing scribbles.
  • Click "Submit" and wait for a few seconds.

🌋 An interesting example

An example with Transformers. The referring image is the truck form of Optimus Prime. Our model can always segment Optimus Prime in target images no matter which form it is in. Thanks to Hongyang Li for this fun example.

assets/images/transformers_gh.png

🌷 NeRF Examples

  • Inspired by the example in SA3D, we tried SEEM on NeRF examples and it works well :)

🏕️ Click, scribble to mask

With a simple click or stroke from the user, we can generate the masks and the corresponding category labels for them.

SEEM design

🏔️ Text to mask

SEEM can generate masks from the user's text input, providing multi-modal interaction with humans.

example

🕌 Referring image to mask

With a simple click or stroke on the referring image, the model is able to segment the objects with similar semantics on the target images. example

SEEM understands spatial relationships very well. Look at the three zebras! The segmented zebras have positions similar to those of the referred zebras. For example, when the leftmost zebra is referred to in the upper row, the leftmost zebra in the bottom row is segmented. example

🌼 Referring image to video mask

With no training on video data, SEEM works perfectly for segmenting videos with whatever queries you specify! example

🌻 Audio to mask

We use Whisper to turn audio into a text prompt and then segment the object. Try it in our demo!

assets/images/audio.png
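A minimal sketch of the audio-to-text step, assuming the openai-whisper package and a hypothetical audio path; the transcription is then used like any other text prompt:

```python
# Sketch only: assumes `pip install openai-whisper`; the file path below is hypothetical.
import whisper

asr = whisper.load_model("base")                     # small model, for illustration
result = asr.transcribe("assets/audio/example.wav")  # hypothetical audio file
text_prompt = result["text"].strip()
print(text_prompt)  # the text is then passed to SEEM as a regular text prompt
```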

🌳 Examples of different styles

An example of segmenting a meme.

assets/images/emoj.png

An example of segmenting trees in cartoon style.

assets/images/trees_text.png

An example of segmenting a Minecraft image.

assets/images/minecraft.png
An example of using a referring image on a popular teddy bear.

example

Model

SEEM design

Comparison with SAM

In the figure below, we compare the levels of interaction and semantics of three segmentation tasks (edge detection, open-set, and interactive segmentation). Open-set segmentation usually requires a high level of semantics and no interaction. Compared with SAM, SEEM covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types such as points and boxes, and it misses high-semantic tasks since it does not output semantic labels itself. There are two reasons: first, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space, so it supports more general usage and has the potential to extend to custom prompts; second, SEEM works very well on text-to-mask (grounding segmentation) and outputs semantics-aware predictions.

assets/images/compare.jpg
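To make the "joint representation space" point above concrete, here is a purely conceptual sketch (not the actual SEEM code; the class, dimensions, and projections are illustrative assumptions) of projecting heterogeneous prompts into one embedding space before a shared decoder consumes them:

```python
import torch
import torch.nn as nn

class UnifiedPromptEncoder(nn.Module):
    """Conceptual sketch: map point/box/text prompts into one joint prompt space."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.point_proj = nn.Linear(2, dim)    # (x, y) click coordinates
        self.box_proj = nn.Linear(4, dim)      # (x1, y1, x2, y2) boxes
        self.text_proj = nn.Linear(768, dim)   # e.g. language-encoder embeddings

    def forward(self, prompts: dict) -> torch.Tensor:
        tokens = []
        if "points" in prompts:
            tokens.append(self.point_proj(prompts["points"]))
        if "boxes" in prompts:
            tokens.append(self.box_proj(prompts["boxes"]))
        if "text" in prompts:
            tokens.append(self.text_proj(prompts["text"]))
        return torch.cat(tokens, dim=0)        # one sequence of prompt queries

encoder = UnifiedPromptEncoder()
queries = encoder({"points": torch.rand(3, 2), "text": torch.rand(1, 768)})
print(queries.shape)  # torch.Size([4, 512])
```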

💘 Acknowledgements

  • We thank Hugging Face for the GPU support for our demo!

segment-everything-everywhere-all-at-once's People

Contributors

eltociear, fengli-ust, haozhang534, jwyang, linjieli222, maureenzou, simonebonato


segment-everything-everywhere-all-at-once's Issues

Would you mind adding a link to SA3D?

Hi there,

This is Lingxi Xie, one of the co-authors of SA3D.

I read the paper and code and found the project quite interesting and inspiring!

I saw that you referred to SA3D and completed the segmentation on NeRF. If the mentioned SA3D in the repo was our project (sorry if it was not), would you mind adding a link to the text in the readme file so that readers can get the message?

Best,
Lingxi

code for: Referring image to mask

Hello,
Is this repo only for a demo, or do you plan to release the code, at least for the referring-image-to-mask feature?
Thanks

about the testing results

image
Hello, authors! Nice work. Does your project provide the code to obtain the results mentioned in the paper?

Code release

This is great work. When do you estimate the code will be released? Thanks for your contribution to the community.

Suggestion - Integrate MobileSAM into the pipeline for lightweight and faster inference

Reference: https://github.com/ChaoningZhang/MobileSAM

Our project performs on par with the original SAM and keeps exactly the same pipeline as the original SAM except for a change to the image encoder; therefore, it is easy to integrate into any project.

MobileSAM is around 60 times smaller and around 50 times faster than the original SAM, and it is around 7 times smaller and around 5 times faster than the concurrent FastSAM. The comparison of the whole pipeline is summarized as follows:

image

image

Best Wishes,

Qiao

Question about the cost function

Dear author,

Thanks so much for your contribution and the inference code.

I couldn't find the explanation of your loss function in the paper; could you please briefly describe it? Is it based on a linear combination of the focal and dice losses, like the one in MaskFormer?

Kind Regards,
Yuyuan
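For readers unfamiliar with the combination the question refers to, below is a minimal, illustrative sketch of a MaskFormer-style mix of sigmoid focal loss and dice loss; this is not the authors' implementation, and the loss weights are hypothetical:

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-pixel focal loss on mask logits (targets are 0/1 masks)."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    loss = ce * ((1 - p_t) ** gamma)
    loss = loss * (alpha * targets + (1 - alpha) * (1 - targets))
    return loss.mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss over flattened masks."""
    prob = logits.sigmoid().flatten(1)
    targets = targets.flatten(1)
    inter = 2 * (prob * targets).sum(-1)
    union = prob.sum(-1) + targets.sum(-1)
    return (1 - (inter + eps) / (union + eps)).mean()

mask_logits = torch.randn(4, 64, 64)              # predicted mask logits
gt_masks = (torch.rand(4, 64, 64) > 0.5).float()  # ground-truth binary masks
loss = 20.0 * sigmoid_focal_loss(mask_logits, gt_masks) + 1.0 * dice_loss(mask_logits, gt_masks)
print(loss.item())
```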

Demo effectiveness

In some scenarios, the results do not seem very good...

ValueError: RGBA values should be within 0-1 range

Sometimes color values go beyond 1.

File "/home/aadalarasan/coda/Segment-Everything-Everywhere-All-At-Once/demo_code/tasks/interactive.py", line 113, in interactive_infer_image
demo = visual.draw_panoptic_seg(pano_seg.cpu(), pano_seg_info) # rgb Image
File "/home/aadalarasan/coda/Segment-Everything-Everywhere-All-At-Once/demo_code/utils/visualizer.py", line 543, in draw_panoptic_seg
self.overlay_instances(masks=masks, labels=labels, assigned_colors=colors, alpha=alpha)
File "/home/aadalarasan/coda/Segment-Everything-Everywhere-All-At-Once/demo_code/utils/visualizer.py", line 745, in overlay_instances
self.draw_text(
File "/home/aadalarasan/coda/Segment-Everything-Everywhere-All-At-Once/demo_code/utils/visualizer.py", line 890, in draw_text
color = np.maximum(list(mplc.to_rgb(color)), 0.2)
File "/home/aadalarasan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/matplotlib/colors.py", line 496, in to_rgb
return to_rgba(c)[:3]
File "/home/aadalarasan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/matplotlib/colors.py", line 299, in to_rgba
rgba = _to_rgba_no_colorcycle(c, alpha)
File "/home/aadalarasan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/matplotlib/colors.py", line 395, in _to_rgba_no_colorcycle
raise ValueError("RGBA values should be within 0-1 range")
ValueError: RGBA values should be within 0-1 range
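Until the root cause is fixed, one possible workaround (an assumption, not an official patch) is to clip the color triple into [0, 1] before matplotlib sees it, mirroring the failing line in draw_text with a small hypothetical helper:

```python
import numpy as np
import matplotlib.colors as mplc

def safe_rgb(color, floor=0.2):
    """Clip an (r, g, b) triple into [0, 1] before conversion, then apply the brightness floor."""
    rgb = tuple(np.clip(np.asarray(color, dtype=float), 0.0, 1.0))
    return np.maximum(list(mplc.to_rgb(rgb)), floor)

print(safe_rgb((1.03, -0.01, 0.6)))  # -> [1.  0.2 0.6]
```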

Losses used in training

Hi! Wonderful project you have here :)
Just wondering if you could share all the losses you use in the training step, since they are not mentioned in the paper.
thanks a lot!

Questions about water surfaces and mirrors

Thanks for your great work! I noticed that your demo can identify water surfaces and mirrors well without misclassifying them due to reflections. Is there any specific design for this?

Code Release

Thanks for the great work! The demo is very inspiring, and we are excited about its potential applications in downstream tasks, especially in robotics. Do you have an ETA for when the code will be released?

We would love to play with the thresholds (there seem to be many false positives in the demo) to see if this is useful out of the box. Please let me know! Happy to reach out privately as well.

Thanks
Dhruv (UC Berkeley)

Differences between SEEM Focal-L and the Huggingface Demo model?

The demo outputs a warning:

The current model is run on SEEM Focal-L, for best performance refer to our demo.

The performance of the model then seems to be worse than that of the SEEM demo on Hugging Face. In particular, I've noticed that segmentations with referring text are worse; they "splash" onto neighboring objects. What are the differences between the official demo on Hugging Face and the published SEEM Focal-L checkpoint and config?

Video demo with Referring Text

As far as I understand, the video demo only supports drawing on the referring image.
Is there any limitation here? Can we use text or audio prompts?

Some question about the paper

Thank you for the great work! I am having some trouble understanding Sec. 3, subsection "Compositional", of the paper. What is the function "Match" in Eq. (5) referring to? Are there papers I can turn to?

Question about training with non-prompt, visual-prompt, and text-prompt

Thanks for sharing the cool results! I have one more detailed question.
When you train with no prompt, text prompts, and visual prompts, how do you train them all together?
Does it mean that during training, for every batch, you randomly pick one task out of the three (no prompt, visual prompt, text prompt)?
It is not clear how you can train them together when the given prompts are not aligned with each other; for example, the text prompt covers everything (e.g., a caption of the whole image describing all the objects, say 10), while the visual prompt only has 1 or 2 points representing 1 or 2 objects.

Thank you!

Results on open-vocabulary panoptic segmentation

In your paper, you mention

"...strong performance on many segmentation tasks including closed-set and open-set
panoptic segmentation, ...

I cannot seem to find the section on open-set panoptic segmentation. Do you have results on this task?

Checkpoint for Gradio Demo

Thanks for open-sourcing your great work!

I have tried your Gradio demo and it works very well. I wonder whether your released checkpoint, i.e., seem_focalt_v1.pt, is the same one that was used in the demo. Thanks!

run error

When I run the app.py script, I get the error below. How can I run it successfully? Thank you.

Traceback (most recent call last):
  File "D:\Projects\Segment-Everything-Everywhere-All-At-Once\demo_code\app.py", line 18, in <module>
    import whisper
  File "D:\Applications\Anaconda3\envs\seem\lib\site-packages\whisper.py", line 69, in <module>
    libc = ctypes.CDLL(libc_name)
  File "D:\Applications\Anaconda3\envs\seem\lib\ctypes\__init__.py", line 364, in __init__
    if '/' in name or '\\' in name:
TypeError: argument of type 'NoneType' is not iterable
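A likely cause (an assumption based on the traceback, which points at a single site-packages/whisper.py) is that the PyPI package whisper (the Graphite time-series library) was installed instead of openai-whisper. A quick diagnostic sketch:

```python
# openai-whisper installs a whisper/ package that exposes load_model();
# the unrelated Graphite "whisper" package is a single whisper.py without it.
import whisper

print(whisper.__file__)                # a lone whisper.py suggests the wrong package
print(hasattr(whisper, "load_model"))  # False here => pip uninstall whisper, then pip install openai-whisper
```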

Code release and some typos in the paper

Hi authors,

Thanks for the amazing work!
I really like this paper.
I'd like to know when you plan to open-source the code and checkpoints.

There are some typos in this paper. (I will update this list if I find more.)

  1. line 3 of Fig 3 caption: "object".
  2. First paragraph of Section 3: the red marks for the three types of prompts are not consistent. Also, they are not reflected in Eq. (3), but in Eq. (1).

No such file or directory:Segment-Everything-Everywhere-All-At-Once-main/demo_code/null

I have a problem with this, can you explain it to me?

ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/nts1/miniconda3/envs/SEEM/lib/python3.9/site-packages/starlette/responses.py", line 335, in call
stat_result = await anyio.to_thread.run_sync(os.stat, self.path)
File "/home/nts1/miniconda3/envs/SEEM/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/nts1/miniconda3/envs/SEEM/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/nts1/miniconda3/envs/SEEM/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
FileNotFoundError: [Errno 2] No such file or directory: '/home/nts1/users/tungdd7/Segment-Everything-Everywhere-All-At-Once-main/demo_code/null'

AttributeError: module 'enum' has no attribute 'IntFlag'

When I created the environment in Anaconda and ran "pip install -r requirements.txt", this error appeared. How can I solve it?

File Not Found: seem_focall_v1.pt

When I try to run app.py I get this error:
FileNotFoundError: [Errno 2] No such file or directory: 'seem_focall_v1.pt'

The link (https://projects4jw.blob.core.windows.net/x-decoder/release/seem_focall_v1.pt) seems to go to this page:
This XML file does not appear to have any style information associated with it. The document tree is shown below.

PublicAccessNotPermitted
Public access is not permitted on this storage account. RequestId:b7249e77-701e-0067-6579-968f11000000 Time:2023-06-04T00:17:46.1304179Z

Any assistance would be appreciated.

RuntimeError: "upsample_bilinear2d_channels_last" not implemented for 'Byte'

I encountered an error when using the sample with the video task. How can I solve this problem?

Traceback (most recent call last):
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
output = await app.get_blocks().process_api(
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/gradio/blocks.py", line 1302, in process_api
result = await self.call_function(
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/gradio/blocks.py", line 1025, in call_function
prediction = await anyio.to_thread.run_sync(
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "app.py", line 65, in inference
return interactive_infer_video(model, audio, image, task, *args, **kwargs)
File "/data2/nkm/Segment-Everything-Everywhere-All-At-Once-main/demo_code/tasks/interactive.py", line 226, in interactive_infer_video
refimg_mask = (F.interpolate(refimg_mask, (_height, _width), mode='bilinear', align_corners=True) > 0)
File "/data1/anaconda3/envs/nkm2/lib/python3.8/site-packages/torch/nn/functional.py", line 3731, in interpolate
return torch._C._nn.upsample_bilinear2d(input, output_size, align_corners, scale_factors)
RuntimeError: "upsample_bilinear2d_channels_last" not implemented for 'Byte'

One or All prompts?

May I know whether each model corresponds to a specific prompt, or whether one model can handle all the prompts you have listed at the same time (meaning you use a massive mix of all of these prompts, in different fashions, at the same time to train a single model)?

About Ref-COCO dataset overlapping.

I'd like to express my appreciation for your excellent work; it is both engaging and insightful. However, I have some confusion regarding the experiment detailed in Chapter 4.

In the paper, you mentioned using a combination of Ref-COCO, Ref-COCOg, and Ref-COCO+ for COCO image annotations in the referring segmentation task. Then, you report your evaluation on Ref-COCOg. While I find this approach interesting, I'm not quite sure what you mean by "combination." Additionally, I am concerned about the potential for data leakage since Ref-COCO, Ref-COCOg, and Ref-COCO+ are three types of annotations on the same image dataset, which might lead to overlap between the training and test sets of different annotations. Could you please provide further clarification on this experimental part? Thank you!

can't download the released model

I couldn't download the released models.
How can I get them?

SEEM Focal-L and X-Decoder Focal-L checkpoints.

Below is the error message.

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>Public access is not permitted on this storage account. RequestId:5c2d36f8-601e-0012-2ca8-93d64b000000 Time:2023-05-31T10:11:10.4213753Z</Message>
</Error>

Always segments person when using text

Hello!

Thank you for releasing this great work! I'm trying out the demo on some fashion images and I primarily get the entire person in the segmentation when using text. I have provided some examples below. Thank you!

image

image

the code doesn't work for me

Following the guide, I managed to install all the requirements in a brand-new conda env. I tried to run the zebra example (which I am most interested in) but got no segmentation results. I tried other examples as well, but in vain (no segmentation results at all).
image

The output from my terminal console seemed all right, with no error messages except some warnings. Here is the package list I installed; would you be so kind as to tell me why I failed:

absl-py 1.4.0
accelerate 0.19.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.0.1
antlr4-python3-runtime 4.9.3
anyio 3.7.0
appdirs 1.4.4
astunparse 1.6.3
async-timeout 4.0.2
attrs 23.1.0
black 21.4b2
cachetools 5.3.1
certifi 2023.5.7
charset-normalizer 3.1.0
cityscapesScripts 2.2.2
click 8.1.3
cloudpickle 2.2.1
cmake 3.26.3
coloredlogs 15.0.1
contourpy 1.0.7
cycler 0.11.0
detectron2 0.6
diffdist 0.1
diffusers 0.11.1
einops 0.6.1
exceptiongroup 1.1.1
fastapi 0.95.2
ffmpy 0.3.0
filelock 3.12.0
flatbuffers 23.5.26
fonttools 4.39.4
frozenlist 1.3.3
fsspec 2023.5.0
ftfy 6.1.1
future 0.18.3
fvcore 0.1.5.post20221221
gast 0.4.0
google-auth 2.19.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
gradio 3.31.0
gradio_client 0.2.5
grpcio 1.54.2
h11 0.14.0
h5py 3.8.0
httpcore 0.17.2
httpx 0.24.1
huggingface-hub 0.14.1
humanfriendly 10.0
hydra-core 1.3.2
idna 3.4
imageio 2.30.0
importlib-metadata 6.6.0
importlib-resources 5.12.0
invisible-watermark 0.1.5
iopath 0.1.9
Jinja2 3.1.2
joblib 1.2.0
json-tricks 3.17.0
jsonschema 4.17.3
keras 2.11.0
kiwisolver 1.4.4
kornia 0.6.4
lazy_loader 0.2
libclang 16.0.0
linkify-it-py 2.0.2
lit 16.0.5
llvmlite 0.40.0
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
mdit-py-plugins 0.3.3
mdurl 0.1.2
more-itertools 9.1.0
mpmath 1.3.0
multidict 6.0.4
mup 1.0.0
mypy-extensions 1.0.0
networkx 3.1
nltk 3.8.1
numba 0.57.0
numpy 1.23.5
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.2
omegaconf 2.3.0
onnx 1.12.0
onnxruntime 1.15.0
openai 0.27.7
openai-whisper 20230314
opencv-python 4.7.0.72
opt-einsum 3.3.0
orjson 3.8.14
packaging 23.1
pandas 2.0.2
pathspec 0.11.1
Pillow 9.5.0
pip 23.0.1
pkgutil_resolve_name 1.3.10
portalocker 2.7.0
protobuf 3.19.6
psutil 5.9.5
pyarrow 12.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycocotools 2.0.4
pydantic 1.10.8
pydot 1.4.2
pydub 0.25.1
Pygments 2.15.1
pyparsing 3.0.9
pyquaternion 0.9.9
pyrsistent 0.19.3
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyWavelets 1.4.1
PyYAML 6.0
regex 2023.5.5
requests 2.31.0
requests-oauthlib 1.3.1
rsa 4.9
scann 1.2.9
scikit-image 0.20.0
scikit-learn 1.2.2
scipy 1.9.1
seaborn 0.12.2
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 67.8.0
shapely 2.0.1
six 1.16.0
sniffio 1.3.0
starlette 0.27.0
sympy 1.12
tabulate 0.9.0
tenacity 8.2.2
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.11.1
tensorflow-estimator 2.11.0
tensorflow-io-gcs-filesystem 0.32.0
termcolor 2.3.0
threadpoolctl 3.1.0
tifffile 2023.4.12
tiktoken 0.3.3
timm 0.4.12
tokenizers 0.12.1
toml 0.10.2
toolz 0.12.0
torch 2.0.1
torchmetrics 0.6.0
torchvision 0.15.2
tqdm 4.65.0
transformers 4.19.2
triton 2.0.0
typing 3.7.4.3
typing_extensions 4.6.2
tzdata 2023.3
uc-micro-py 1.0.2
urllib3 1.26.16
uvicorn 0.22.0
vision-datasets 0.2.2
wcwidth 0.2.6
websockets 11.0.3
Werkzeug 2.3.4
wheel 0.38.4
wrapt 1.15.0
yacs 0.1.8
yarl 1.9.2
zipp 3.15.0

This needs a license

Current licensing is ambiguous because the repo has no license. It makes it hard to adopt this work for further R&D, commercial applications, etc.

Demo code installation

I'm using Windows WSL and trying to run the demo code.

I get an error while installing the requirements. For some reason, installation starts with the GitHub repositories linked in requirements.txt.

While installing from https://github.com/MaureenZOU/detectron2-xyz.git you get an error saying ModuleNotFoundError: No module named 'torch'.

By manually installing the torch module the error is solved, but I don't think this is ideal.

Another issue comes from run_demo.sh; at least in my case, it does not recognize the sudo apt commands, which makes it fail. Again, if I run the commands manually, it works.

How can I suppress the "other" label

I would like to know how to suppress the "other" label in the predicted video.
Also, how can I use my own COCO-format dataset to replace the default one?
Thank you!

Visual referring data process

Hi,
Thank you for your work!
I am very interested in the visual referring part. There is little information in the paper about the data processing for training, so could you please share some key points of the visual-referring data processing for training? Thanks!
