
X-Decoder: Generalized Decoding for Pixel, Image, and Language

[Project Page] [Paper] [HuggingFace All-in-One Demo] [HuggingFace Instruct Demo] [Video]

by Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee^, Jianfeng Gao^ in CVPR 2023.

🌢️ Getting Started

We release the following for both SEEM and X-Decoder❗

  • Demo Code
  • Model Checkpoint
  • Comprehensive User Guide
  • Training Code
  • Evaluation Code

πŸ‘‰ One-Line SEEM Demo with Linux:

git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && sh assets/scripts/run_demo.sh

πŸ“ [New] Getting Started:

πŸ“ [New] Latest Checkpoints and Numbers:

| Method | Checkpoint | Backbone | COCO PQ↑ | COCO mAP↑ | COCO mIoU↑ | Ref-COCOg cIoU↑ | Ref-COCOg mIoU↑ | Ref-COCOg AP50↑ | VOC NoC85↓ | VOC NoC90↓ | SBD NoC85↓ | SBD NoC90↓ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| X-Decoder | ckpt | Focal-T | 50.8 | 39.5 | 62.4 | 57.6 | 63.2 | 71.6 | - | - | - | - |
| X-Decoder-oq201 | ckpt | Focal-L | 56.5 | 46.7 | 67.2 | 62.8 | 67.5 | 76.3 | - | - | - | - |
| SEEM_v0 | ckpt | Focal-T | 50.6 | 39.4 | 60.9 | 58.5 | 63.5 | 71.6 | 3.54 | 4.59 | * | * |
| SEEM_v0 | - | Davit-d3 | 56.2 | 46.8 | 65.3 | 63.2 | 68.3 | 76.6 | 2.99 | 3.89 | 5.93 | 9.23 |
| SEEM_v0 | ckpt | Focal-L | 56.2 | 46.4 | 65.5 | 62.8 | 67.7 | 76.2 | 3.04 | 3.85 | * | * |
| SEEM_v1 | ckpt | Focal-T | 50.8 | 39.4 | 60.7 | 58.5 | 63.7 | 72.0 | 3.19 | 4.13 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-B | 52.0 | 43.5 | 60.2 | 54.1 | 62.2 | 69.3 | 2.53 | 3.23 | * | * |
| SEEM_v1 | ckpt | SAM-ViT-L | 49.0 | 41.6 | 58.2 | 53.8 | 62.2 | 69.5 | 2.40 | 2.96 | * | * |

SEEM_v0: supports training and inference with a single interactive object
SEEM_v1: supports training and inference with multiple interactive objects
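The NoC85/NoC90 columns report the interactive-segmentation metric "Number of Clicks": the average number of user clicks needed before the predicted mask reaches 85%/90% IoU (lower is better). A minimal per-instance sketch of how such a number is computed; the function name and the cap of 20 clicks are illustrative assumptions, not the repo's evaluation code:

```python
def noc(per_click_ious, threshold, max_clicks=20):
    """Number of Clicks: index of the first click whose IoU reaches
    the threshold; returns max_clicks if it is never reached."""
    for click, iou in enumerate(per_click_ious, start=1):
        if iou >= threshold:
            return click
    return max_clicks

# IoU of the predicted mask after each successive click
clicks = [0.62, 0.78, 0.87, 0.91]
noc(clicks, 0.85)  # -> 3
noc(clicks, 0.90)  # -> 4
```

The dataset-level NoC85/NoC90 values in the table are then averages of this count over all evaluated instances.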

πŸ”₯ News

  • [2023.10.04] We are excited to release βœ… training/evaluation/demo code, βœ… new checkpoints, and βœ… comprehensive readmes for both X-Decoder and SEEM!
  • [2023.09.24] We are providing new demo command/code for inference (DEMO.md)!
  • [2023.07.19] 🎒 We are excited to release the x-decoder training code (INSTALL.md, DATASET.md, TRAIN.md, EVALUATION.md)!
  • [2023.07.10] We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoints are available!
  • [2023.04.14] We are releasing SEEM, a new universal interactive interface for image segmentation! You can use it for any segmentation tasks, way beyond what X-Decoder can do!

  • [2023.03.20] Inspired by X-Decoder, we developed OpenSeeD ([Paper][Code]) to enable open-vocabulary segmentation and detection with a single model. Check it out!
  • [2023.03.14] We release X-GPT, a conversational version of our X-Decoder built with GPT-3 and LangChain!
  • [2023.03.01] The Segmentation in the Wild Challenge has been launched and is ready for submissions!
  • [2023.02.28] We released the SGinW benchmark for our challenge. You are welcome to build your own models on the benchmark!
  • [2023.02.27] Our X-Decoder has been accepted by CVPR 2023!
  • [2023.02.07] We combine X-Decoder (strong image understanding), GPT-3 (strong language understanding), and Stable Diffusion (strong image generation) into an instructional image editing demo. Check it out!
  • [2022.12.21] We release inference code of X-Decoder.
  • [2022.12.21] We release Focal-T pretrained checkpoint.
  • [2022.12.21] We release open-vocabulary segmentation benchmark.

πŸ–ŒοΈ DEMO

🫐 [X-GPT]   πŸ“[Instruct X-Decoder]

demo

🎢 Introduction

github_figure

X-Decoder is a generalized decoding model that can generate pixel-level segmentation and token-level texts seamlessly!

It achieves:

  • State-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets;
  • Better or competitive finetuned performance compared to generalist and specialist models on segmentation and VL tasks;
  • Friendly for efficient finetuning and flexible for novel task composition.

It supports:

  • One suite of parameters pretrained for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, and Image-Text Retrieval;
  • One model architecture finetuned for Semantic/Instance/Panoptic Segmentation, Referring Segmentation, Image Captioning, Image-Text Retrieval and Visual Question Answering (with an extra cls head);
  • Zero-shot task composition for Region Retrieval, Referring Captioning, Image Editing.

Acknowledgement

  • We appreciate the constructive discussion with Haotian Zhang
  • We build our work on top of Mask2Former
  • We build our demos on HuggingFace πŸ€— with sponsored GPUs
  • We appreciate the discussion with Xiaoyu Xiang during rebuttal

Citation

@article{zou2022xdecoder,
  author      = {Zou*, Xueyan and Dou*, Zi-Yi and Yang*, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee*, Yong Jae and Gao*, Jianfeng},
  title       = {Generalized Decoding for Pixel, Image and Language},
  publisher   = {arXiv},
  year        = {2022},
}

x-decoder's People

Contributors

eltociear, jwyang, maureenzou


x-decoder's Issues

PQ evaluation result on ADE20K

Hi, thanks for your great work!
I tried to run an evaluation on the ADE20K dataset with the BestSeg Tiny checkpoint. I found that the mIoU and AP are consistent with your results, but PQ is not aligned. Could you help me identify the problem?

INFO:datasets.evaluation.segmentation_evaluation:OrderedDict([('sem_seg', {'mIoU': 25.138934771544356, 'fwIoU': 56.54702259148684, 'IoU-wall': 66.66384375012157, 'IoU-building': 61.80618454872132, 'IoU-sky': 93.03627440691773, 'IoU-floor': 54.56389385541954, 'IoU-tree': 64.13564186259205, 'IoU-ceiling': 77.29479183734789, 'IoU-road, route': 79.45689179433187, 'IoU-bed': 72.96180819185307, 'IoU-window ': 49.276455812945194, 'IoU-grass': 62.04998439647351, 'IoU-cabinet': 46.98513878773616, 'IoU-sidewalk, pavement': 55.164837905552744, 'IoU-person': 81.96654332757166, 'IoU-earth, ground': 0.0, 'IoU-door': 36.46062723798975, 'IoU-table': 28.028372398862402, 'IoU-mountain, mount': 40.54859414005354, 'IoU-plant': 17.151210866614175, 'IoU-curtain': 68.51877388456774, 'IoU-chair': 40.64549598364446, 'IoU-car': 75.92264676089647, 'IoU-water': 18.77329669322049, 'IoU-painting, picture': 8.316185301015338, 'IoU-sofa': 57.89617118568877, 'IoU-shelf': 29.76269778595836, 'IoU-house': 18.14527867328128, 'IoU-sea': 43.37431317680453, 'IoU-mirror': 59.25143146822302, 'IoU-rug': 21.110705561460193, 'IoU-field': 20.738062589728905, 'IoU-armchair': 27.52600064000787, 'IoU-seat': 20.41313024837476, 'IoU-fence': 27.28278320109175, 'IoU-desk': 6.503097748798531, 'IoU-rock, stone': 25.970516263564562, 'IoU-wardrobe, closet, press': 0.1846807453809998, 'IoU-lamp': 0.7250267466787069, 'IoU-tub': 52.04585362344179, 'IoU-rail': 8.451170913691197, 'IoU-cushion': 0.8200596407011419, 'IoU-base, pedestal, stand': 0.0, 'IoU-box': 13.988375531656741, 'IoU-column, pillar': 0.0, 'IoU-signboard, sign': 15.923987465222256, 'IoU-chest of drawers, chest, bureau, dresser': 4.93644486794556, 'IoU-counter': 3.323037256225233, 'IoU-sand': 39.24199391657301, 'IoU-sink': 53.1492515051619, 'IoU-skyscraper': 17.558166889843616, 'IoU-fireplace': 25.26774697327011, 'IoU-refrigerator, icebox': 55.05294079040842, 'IoU-grandstand, covered stand': 31.396726806009056, 'IoU-path': 6.703153377491354, 'IoU-stairs': 
25.672768197014335, 'IoU-runway': 45.43841992927305, 'IoU-case, display case, showcase, vitrine': 0.0, 'IoU-pool table, billiard table, snooker table': 80.27526367967762, 'IoU-pillow': 0.13484984987869275, 'IoU-screen door, screen': 0.0, 'IoU-stairway, staircase': 0.3789820949091252, 'IoU-river': 11.185766890263418, 'IoU-bridge, span': 26.68986668917918, 'IoU-bookcase': 1.635284656019348, 'IoU-blind, screen': 27.46825264385997, 'IoU-coffee table': 1.7962906383315793, 'IoU-toilet, can, commode, crapper, pot, potty, stool, throne': 82.81076010282526, 'IoU-flower': 26.89008628784697, 'IoU-book': 35.18320239666976, 'IoU-hill': 0.0, 'IoU-bench': 31.55542601768192, 'IoU-countertop': 15.173809141969999, 'IoU-stove': 12.98205015198985, 'IoU-palm, palm tree': 3.563747397914385, 'IoU-kitchen island': 18.49397313213103, 'IoU-computer': 21.26985664127961, 'IoU-swivel chair': 20.40526039736884, 'IoU-boat': 71.28289053362035, 'IoU-bar': 0.04609290003238961, 'IoU-arcade machine': 13.801956630762666, 'IoU-hovel, hut, hutch, shack, shanty': 6.769738693649287, 'IoU-bus': 71.51411629003796, 'IoU-towel': 52.67571260631483, 'IoU-light': 10.874940000210218, 'IoU-truck': 18.308968818456865, 'IoU-tower': 3.8846859382760988, 'IoU-chandelier': 1.039063125144151, 'IoU-awning, sunshade, sunblind': 8.464720109362403, 'IoU-street lamp': 0.0, 'IoU-booth': 0.0, 'IoU-tv': 21.251801174226408, 'IoU-plane': 49.31738594020662, 'IoU-dirt track': 0.4271491891808845, 'IoU-clothes': 19.80153487917838, 'IoU-pole': 19.48645788874662, 'IoU-land, ground, soil': 0.2774159463904913, 'IoU-bannister, banister, balustrade, balusters, handrail': 0.6005131816565449, 'IoU-escalator, moving staircase, moving stairway': 0.0, 'IoU-ottoman, pouf, pouffe, puff, hassock': 21.542985891204715, 'IoU-bottle': 26.710336877651663, 'IoU-buffet, counter, sideboard': 0.09555932713171633, 'IoU-poster, posting, placard, notice, bill, card': 20.477183085814193, 'IoU-stage': 3.6712931019231108, 'IoU-van': 29.00911106040356, 'IoU-ship': 
7.458659731394134, 'IoU-fountain': 16.93792762129684, 'IoU-conveyer belt, conveyor belt, conveyer, conveyor, transporter': 8.341539715134479, 'IoU-canopy': 0.0, 'IoU-washer, automatic washer, washing machine': 48.10438207779419, 'IoU-plaything, toy': 4.378879596393649, 'IoU-pool': 21.915560445632035, 'IoU-stool': 19.11406844106464, 'IoU-barrel, cask': 4.771511771759197, 'IoU-basket, handbasket': 15.027955178141156, 'IoU-falls': 34.95419449012886, 'IoU-tent': 71.33059398347683, 'IoU-bag': 15.193300948116296, 'IoU-minibike, motorbike': 62.37326771941579, 'IoU-cradle': 8.939156718393596, 'IoU-oven': 13.716478380604702, 'IoU-ball': 20.336170440554575, 'IoU-food, solid food': 43.168141456019306, 'IoU-step, stair': 0.09271339970316758, 'IoU-tank, storage tank': 17.602110706421065, 'IoU-trade name': 0.0, 'IoU-microwave': 80.83249143309624, 'IoU-pot': 8.95837787354705, 'IoU-animal': 65.3558342925042, 'IoU-bicycle': 47.88619345844731, 'IoU-lake': 0.0, 'IoU-dishwasher': 5.388336798677167, 'IoU-screen': 0.0, 'IoU-blanket, cover': 0.0, 'IoU-sculpture': 16.587339263108802, 'IoU-hood, exhaust hood': 0.0, 'IoU-sconce': 0.4015924011733482, 'IoU-vase': 23.626310871145257, 'IoU-traffic light': 24.812425480213438, 'IoU-tray': 6.6554194644082285, 'IoU-trash can': 19.14883613779226, 'IoU-fan': 6.708352822431747, 'IoU-pier': 27.35937044593413, 'IoU-crt screen': 5.049604189252767, 'IoU-plate': 21.94244018437383, 'IoU-monitor': 1.7734684944451007, 'IoU-bulletin board': 14.728702703084497, 'IoU-shower': 0.4926401990256038, 'IoU-radiator': 28.439220471239995, 'IoU-glass, drinking glass': 22.272511653951643, 'IoU-clock': 21.36579342404726, 'IoU-flag': 40.49251138551105, 'mACC': 40.569499444810965, 'pACC': 68.73159223690485, 'ACC-wall': 75.00736634097991, 'ACC-building': 68.57669111441295, 'ACC-sky': 95.63273189197791, 'ACC-floor': 59.757976224767575, 'ACC-tree': 92.34077795588281, 'ACC-ceiling': 89.35127228636448, 'ACC-road, route': 87.81631900386382, 'ACC-bed': 96.07369829673503, 
'ACC-window ': 70.58632284693935, 'ACC-grass': 82.43561432559335, 'ACC-cabinet': 73.53039116272248, 'ACC-sidewalk, pavement': 73.39709750648018, 'ACC-person': 93.47207883771334, 'ACC-earth, ground': 0.0, 'ACC-door': 59.03325176174202, 'ACC-table': 49.13000241166902, 'ACC-mountain, mount': 55.726524253558, 'ACC-plant': 19.260336571944435, 'ACC-curtain': 88.2108816110385, 'ACC-chair': 52.78305442082364, 'ACC-car': 84.76829270902667, 'ACC-water': 23.58851803147177, 'ACC-painting, picture': 8.464140875846445, 'ACC-sofa': 83.19727622430966, 'ACC-shelf': 62.45517008717844, 'ACC-house': 82.99261107811003, 'ACC-sea': 65.22998750814483, 'ACC-mirror': 75.9656331517485, 'ACC-rug': 87.16420063917124, 'ACC-field': 44.32838478466386, 'ACC-armchair': 61.51674279239121, 'ACC-seat': 25.294529795433164, 'ACC-fence': 68.33004233403108, 'ACC-desk': 9.217060578819886, 'ACC-rock, stone': 80.14134453358818, 'ACC-wardrobe, closet, press': 0.18750462726011477, 'ACC-lamp': 0.7313907751753096, 'ACC-tub': 54.90131018701956, 'ACC-rail': 17.063962634011762, 'ACC-cushion': 0.8297668224687995, 'ACC-base, pedestal, stand': 0.0, 'ACC-box': 16.42992975333477, 'ACC-column, pillar': 0.0, 'ACC-signboard, sign': 16.680974773986584, 'ACC-chest of drawers, chest, bureau, dresser': 7.298404728266865, 'ACC-counter': 4.522503597325615, 'ACC-sand': 78.5309654979637, 'ACC-sink': 76.42303360469313, 'ACC-skyscraper': 18.593973495505224, 'ACC-fireplace': 38.415542283296865, 'ACC-refrigerator, icebox': 80.36443917573747, 'ACC-grandstand, covered stand': 48.92070676299068, 'ACC-path': 7.661385180390852, 'ACC-stairs': 41.16969922793024, 'ACC-runway': 61.60456632679726, 'ACC-case, display case, showcase, vitrine': 0.0, 'ACC-pool table, billiard table, snooker table': 85.14147567769025, 'ACC-pillow': 0.14709102357123177, 'ACC-screen door, screen': 0.0, 'ACC-stairway, staircase': 0.3908804987683872, 'ACC-river': 60.18685279307111, 'ACC-bridge, span': 91.48940963227, 'ACC-bookcase': 1.7295716153615495, 'ACC-blind, 
screen': 45.49583725804772, 'ACC-coffee table': 2.171676544142192, 'ACC-toilet, can, commode, crapper, pot, potty, stool, throne': 88.57000639791931, 'ACC-flower': 62.576209367420034, 'ACC-book': 53.07705546372753, 'ACC-hill': 0.0, 'ACC-bench': 62.2090848766363, 'ACC-countertop': 53.97093238884925, 'ACC-stove': 14.81976881160654, 'ACC-palm, palm tree': 3.90129710231016, 'ACC-kitchen island': 23.658562423857195, 'ACC-computer': 23.590959401135642, 'ACC-swivel chair': 54.51053688359166, 'ACC-boat': 84.5157004760545, 'ACC-bar': 0.05370413376683698, 'ACC-arcade machine': 14.322053974702028, 'ACC-hovel, hut, hutch, shack, shanty': 30.552171846217842, 'ACC-bus': 95.6193035949202, 'ACC-towel': 69.97914975225157, 'ACC-light': 64.74776622744498, 'ACC-truck': 77.98814563928875, 'ACC-tower': 6.134903306742333, 'ACC-chandelier': 1.1626590262105216, 'ACC-awning, sunshade, sunblind': 8.870710016787092, 'ACC-street lamp': 0.0, 'ACC-booth': 0.0, 'ACC-tv': 87.94082314414074, 'ACC-plane': 67.62691661931636, 'ACC-dirt track': 21.395178822216664, 'ACC-clothes': 46.82647250095726, 'ACC-pole': 24.199969080930664, 'ACC-land, ground, soil': 1.342942047443026, 'ACC-bannister, banister, balustrade, balusters, handrail': 1.4065128280502344, 'ACC-escalator, moving staircase, moving stairway': 0.0, 'ACC-ottoman, pouf, pouffe, puff, hassock': 30.51551038508793, 'ACC-bottle': 44.12124899262045, 'ACC-buffet, counter, sideboard': 0.2823001368343289, 'ACC-poster, posting, placard, notice, bill, card': 39.77516660258875, 'ACC-stage': 10.280593761417894, 'ACC-van': 34.128791328900626, 'ACC-ship': 8.036438615467821, 'ACC-fountain': 20.769331093975826, 'ACC-conveyer belt, conveyor belt, conveyer, conveyor, transporter': 35.06400740119194, 'ACC-canopy': 0.0, 'ACC-washer, automatic washer, washing machine': 48.34456072987988, 'ACC-plaything, toy': 17.207446042128925, 'ACC-pool': 39.923903234763294, 'ACC-stool': 48.73579710706946, 'ACC-barrel, cask': 57.04858345659347, 'ACC-basket, handbasket': 
23.651472142144463, 'ACC-falls': 99.14987892214951, 'ACC-tent': 97.1923852633308, 'ACC-bag': 22.070884678834986, 'ACC-minibike, motorbike': 88.75300666809125, 'ACC-cradle': 26.168406573666324, 'ACC-oven': 74.7178956403201, 'ACC-ball': 22.763769713820047, 'ACC-food, solid food': 86.51076453274894, 'ACC-step, stair': 0.09411371821290193, 'ACC-tank, storage tank': 74.70981373983683, 'ACC-trade name': 0.0, 'ACC-microwave': 93.46128726097432, 'ACC-pot': 12.047977627389498, 'ACC-animal': 85.93243604308462, 'ACC-bicycle': 79.06879414072718, 'ACC-lake': 0.0, 'ACC-dishwasher': 8.238783926705175, 'ACC-screen': 0.0, 'ACC-blanket, cover': 0.0, 'ACC-sculpture': 65.21635914143977, 'ACC-hood, exhaust hood': 0.0, 'ACC-sconce': 0.45374864016323674, 'ACC-vase': 68.9828183382218, 'ACC-traffic light': 33.1904184354154, 'ACC-tray': 20.66102428737029, 'ACC-trash can': 22.464727360977815, 'ACC-fan': 9.924247942311895, 'ACC-pier': 39.18706427507897, 'ACC-crt screen': 10.554895203564191, 'ACC-plate': 32.65891890057357, 'ACC-monitor': 2.2339792399656915, 'ACC-bulletin board': 51.96720392257036, 'ACC-shower': 9.204919620375751, 'ACC-radiator': 29.92570733348865, 'ACC-glass, drinking glass': 27.533586416405427, 'ACC-clock': 33.470510146026925, 'ACC-flag': 51.60076343642118})])
INFO:datasets.evaluation.panoptic_evaluation:Writing all panoptic predictions to /tmp/panoptic_eval614xvyra ...
INFO:datasets.evaluation.panoptic_evaluation:Panoptic Evaluation Results:
|        |   PQ   |   SQ   |   RQ   |  #categories  |
|:------:|:------:|:------:|:------:|:-------------:|
|  All   | 18.975 | 54.343 | 23.494 |      150      |
| Things | 16.644 | 56.361 | 21.067 |      100      |
| Stuff  | 23.637 | 50.306 | 28.347 |      50       |
INFO:detectron2.evaluation.coco_evaluation:Preparing results for COCO format ...
INFO:detectron2.evaluation.coco_evaluation:Saving results to ../../data/output/test/coco_instances_results.json
INFO:detectron2.evaluation.coco_evaluation:Evaluating predictions with unofficial COCO API...
Loading and preparing results...
DONE (t=0.07s)
creating index...
index created!
INFO:detectron2.evaluation.fast_eval_api:Evaluate annotation type *bbox*
INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.evaluate() finished in 3.38 seconds.
INFO:detectron2.evaluation.fast_eval_api:Accumulating evaluation results...
INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.accumulate() finished in 0.45 seconds.
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
INFO:detectron2.evaluation.coco_evaluation:Evaluation results for bbox: 
|  AP   |  AP50  |  AP75  |  APs  |  APm  |  APl  |
|:-----:|:------:|:------:|:-----:|:-----:|:-----:|
| 0.000 | 0.000  | 0.000  | 0.000 | 0.000 | 0.000 |
INFO:detectron2.evaluation.coco_evaluation:Per-category bbox AP: 
| category                   | AP    | category                                                 | AP    | category                                  | AP    |
|:---------------------------|:------|:---------------------------------------------------------|:------|:------------------------------------------|:------|
| bed                        | 0.000 | window                                                   | 0.000 | cabinet                                   | 0.000 |
| person                     | 0.000 | door                                                     | 0.000 | table                                     | 0.000 |
| curtain                    | 0.000 | chair                                                    | 0.000 | car                                       | 0.000 |
| painting, picture          | 0.000 | sofa                                                     | 0.000 | shelf                                     | 0.000 |
| mirror                     | 0.000 | armchair                                                 | 0.000 | seat                                      | 0.000 |
| fence                      | 0.000 | desk                                                     | 0.000 | wardrobe, closet, press                   | 0.000 |
| lamp                       | 0.000 | tub                                                      | 0.000 | rail                                      | 0.000 |
| cushion                    | 0.000 | box                                                      | 0.000 | column, pillar                            | 0.000 |
| signboard, sign            | 0.000 | chest of drawers, chest, bureau, dresser                 | 0.000 | counter                                   | 0.000 |
| sink                       | 0.000 | fireplace                                                | 0.000 | refrigerator, icebox                      | 0.000 |
| stairs                     | 0.000 | case, display case, showcase, vitrine                    | 0.000 | pool table, billiard table, snooker table | 0.000 |
| pillow                     | 0.000 | screen door, screen                                      | 0.000 | bookcase                                  | 0.000 |
| coffee table               | 0.000 | toilet, can, commode, crapper, pot, potty, stool, throne | 0.000 | flower                                    | 0.000 |
| book                       | 0.000 | bench                                                    | 0.000 | countertop                                | 0.000 |
| stove                      | 0.000 | palm, palm tree                                          | 0.000 | kitchen island                            | 0.000 |
| computer                   | 0.000 | swivel chair                                             | 0.000 | boat                                      | 0.000 |
| arcade machine             | 0.000 | bus                                                      | 0.000 | towel                                     | 0.000 |
| light                      | 0.000 | truck                                                    | 0.000 | chandelier                                | 0.000 |
| awning, sunshade, sunblind | 0.000 | street lamp                                              | 0.000 | booth                                     | 0.000 |
| tv                         | 0.000 | plane                                                    | 0.000 | clothes                                   | 0.000 |
| pole                       | 0.000 | bannister, banister, balustrade, balusters, handrail     | 0.000 | ottoman, pouf, pouffe, puff, hassock      | 0.000 |
| bottle                     | 0.000 | van                                                      | 0.000 | ship                                      | 0.000 |
| fountain                   | 0.000 | washer, automatic washer, washing machine                | 0.000 | plaything, toy                            | 0.000 |
| stool                      | 0.000 | barrel, cask                                             | 0.000 | basket, handbasket                        | 0.000 |
| bag                        | 0.000 | minibike, motorbike                                      | 0.000 | oven                                      | 0.000 |
| ball                       | 0.000 | food, solid food                                         | 0.000 | step, stair                               | 0.000 |
| trade name                 | 0.000 | microwave                                                | 0.000 | pot                                       | 0.000 |
| animal                     | 0.000 | bicycle                                                  | 0.000 | dishwasher                                | 0.000 |
| screen                     | 0.000 | sculpture                                                | 0.000 | hood, exhaust hood                        | 0.000 |
| sconce                     | 0.000 | vase                                                     | 0.000 | traffic light                             | 0.000 |
| tray                       | 0.000 | trash can                                                | 0.000 | fan                                       | 0.000 |
| plate                      | 0.000 | monitor                                                  | 0.000 | bulletin board                            | 0.000 |
| radiator                   | 0.000 | glass, drinking glass                                    | 0.000 | clock                                     | 0.000 |
| flag                       | 0.000 |                                                          |       |                                           |       |
Loading and preparing results...
DONE (t=0.99s)
creating index...
index created!
INFO:detectron2.evaluation.fast_eval_api:Evaluate annotation type *segm*
INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.evaluate() finished in 3.73 seconds.
INFO:detectron2.evaluation.fast_eval_api:Accumulating evaluation results...
INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.accumulate() finished in 0.46 seconds.
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.101
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.180
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.099
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.031
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.119
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.211
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.152
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.231
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.236
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.082
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.254
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.397
INFO:detectron2.evaluation.coco_evaluation:Evaluation results for segm: 
|   AP   |  AP50  |  AP75  |  APs  |  APm   |  APl   |
|:------:|:------:|:------:|:-----:|:------:|:------:|
| 10.118 | 18.028 | 9.900  | 3.135 | 11.858 | 21.123 |
INFO:detectron2.evaluation.coco_evaluation:Per-category segm AP: 
| category                   | AP     | category                                                 | AP     | category                                  | AP     |
|:---------------------------|:-------|:---------------------------------------------------------|:-------|:------------------------------------------|:-------|
| bed                        | 40.379 | window                                                   | 14.416 | cabinet                                   | 5.196  |
| person                     | 25.115 | door                                                     | 13.388 | table                                     | 4.216  |
| curtain                    | 19.579 | chair                                                    | 13.900 | car                                       | 30.626 |
| painting, picture          | 2.294  | sofa                                                     | 34.544 | shelf                                     | 2.526  |
| mirror                     | 36.965 | armchair                                                 | 17.870 | seat                                      | 1.049  |
| fence                      | 2.379  | desk                                                     | 1.640  | wardrobe, closet, press                   | 0.069  |
| lamp                       | 3.607  | tub                                                      | 17.333 | rail                                      | 0.478  |
| cushion                    | 3.489  | box                                                      | 2.804  | column, pillar                            | 0.845  |
| signboard, sign            | 5.112  | chest of drawers, chest, bureau, dresser                 | 6.025  | counter                                   | 0.291  |
| sink                       | 22.024 | fireplace                                                | 4.248  | refrigerator, icebox                      | 53.591 |
| stairs                     | 6.187  | case, display case, showcase, vitrine                    | 0.015  | pool table, billiard table, snooker table | 45.625 |
| pillow                     | 0.395  | screen door, screen                                      | 0.246  | bookcase                                  | 0.242  |
| coffee table               | 12.635 | toilet, can, commode, crapper, pot, potty, stool, throne | 50.129 | flower                                    | 5.641  |
| book                       | 1.696  | bench                                                    | 3.321  | countertop                                | 3.413  |
| stove                      | 12.236 | palm, palm tree                                          | 2.441  | kitchen island                            | 13.246 |
| computer                   | 0.816  | swivel chair                                             | 4.634  | boat                                      | 12.942 |
| arcade machine             | 13.251 | bus                                                      | 35.889 | towel                                     | 13.597 |
| light                      | 0.553  | truck                                                    | 10.119 | chandelier                                | 0.892  |
| awning, sunshade, sunblind | 3.094  | street lamp                                              | 0.022  | booth                                     | 0.206  |
| tv                         | 52.412 | plane                                                    | 21.707 | clothes                                   | 1.826  |
| pole                       | 0.347  | bannister, banister, balustrade, balusters, handrail     | 0.048  | ottoman, pouf, pouffe, puff, hassock      | 10.297 |
| bottle                     | 8.790  | van                                                      | 10.889 | ship                                      | 14.938 |
| fountain                   | 1.374  | washer, automatic washer, washing machine                | 5.893  | plaything, toy                            | 0.103  |
| stool                      | 4.042  | barrel, cask                                             | 15.875 | basket, handbasket                        | 3.537  |
| bag                        | 3.126  | minibike, motorbike                                      | 19.403 | oven                                      | 11.351 |
| ball                       | 7.071  | food, solid food                                         | 0.624  | step, stair                               | 0.469  |
| trade name                 | 1.155  | microwave                                                | 55.900 | pot                                       | 0.650  |
| animal                     | 15.783 | bicycle                                                  | 8.503  | dishwasher                                | 3.732  |
| screen                     | 0.165  | sculpture                                                | 5.731  | hood, exhaust hood                        | 0.000  |
| sconce                     | 0.672  | vase                                                     | 15.819 | traffic light                             | 5.216  |
| tray                       | 0.168  | trash can                                                | 7.170  | fan                                       | 0.648  |
| plate                      | 0.314  | monitor                                                  | 6.052  | bulletin board                            | 0.687  |
| radiator                   | 16.973 | glass, drinking glass                                    | 7.016  | clock                                     | 18.014 |
| flag                       | 7.855  |                                                          |        |                                           |        |
INFO:__main__:{'ade20k_panoptic_val/sem_seg/mIoU': 25.138934771544356, 'ade20k_panoptic_val/panoptic_seg/PQ': 18.97506503772162, 'ade20k_panoptic_val/panoptic_seg/SQ': 54.342604302857666, 'ade20k_panoptic_val/panoptic_seg/RQ': 23.493577018336495, 'ade20k_panoptic_val/bbox/AP': 0.0, 'ade20k_panoptic_val/segm/AP': 10.118232565750764}

Small typo in the README

  • [2022.02.07] We combine X-Decoder (strong image understanding), GPT-3 (strong language understanding) and Stable Diffusion (strong image generation) to make an instructional image editing demo, check it out!

The date should be 2023.02.07 instead of 2022, right?

About "X-Decoder-Seg+"

Hi, thanks for this nice work!

Could you specify the process behind "we take the heuristic way to extract noun phrases from COCO captions and use them as extra supervision on top of the matched decoder outputs"?

  1. What do you mean by "matched decoder outputs"?
  2. How is a "noun phrase" matched to a decoder output?

Looking forward to your reply!

Using all thing/stuff classes for inference

I want to run inference over all the thing/stuff classes X-Decoder was trained on.
How do I construct the MetadataCatalog (used at inference) so that it contains all of these classes?
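One lightweight approach is to union the per-dataset class lists into a single vocabulary and register that under a new metadata entry. A minimal sketch of the merge step (the class lists below are placeholder subsets, and the metadata name in the comment is hypothetical; in detectron2 you would attach the result via `MetadataCatalog.get(...).set(...)`):

```python
# Sketch: build one combined vocabulary from several per-dataset class lists,
# preserving first-seen order and dropping duplicates across datasets.
# You could then register it, e.g.:
#   MetadataCatalog.get("xdecoder_all_classes").set(stuff_classes=combined)

def merge_class_lists(*class_lists):
    seen, combined = set(), []
    for classes in class_lists:
        for name in classes:
            if name not in seen:
                seen.add(name)
                combined.append(name)
    return combined

coco_things = ["person", "car", "chair"]    # placeholder subset
ade_stuff = ["sky", "road, route", "car"]   # "car" repeats across datasets
combined = merge_class_lists(coco_things, ade_stuff)
print(combined)  # ['person', 'car', 'chair', 'sky', 'road, route']
```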

Sampling strategy

Is the sampling strategy exactly the same as UniCL's? Could you explain it in more detail? Thanks.

Training code

Thanks for the great work!
Will all of the training code be released soon?

Strange results for instance segmentation

Hello, thank you for your great work! I encountered some strange results when running the code on COCO images. I used the BestSegTiny model for open-vocabulary instance segmentation, and most examples worked very well. However, for categories like "man", "woman", "boy", "girl", and "guy", strange results were generated; in particular, "man" is always recognized as "sky".
I'm not sure what caused this, and I'm wondering whether a larger model would correct the results. I'm looking forward to the release of new checkpoints.
(two example images attached)

Wrong description in the HTML project page

When posting a link to your project page on Discord, the embedded text is wrong.

(screenshot of the Discord link embed attached)

This seems to be a copy-paste of the description of "Nerfies: Deformable Neural Radiance Fields".

You might want to edit the description in the HTML code of your project:

  <meta name="description"
        content="Deformable Neural Radiance Fields creates free-viewpoint portraits (nerfies) from casually captured videos.">
  <meta name="keywords" content="Nerfies, D-NeRF, NeRF">
  <title>X-Decoder: Generalized Decoding for Pixel, Image and Language</title>
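For reference, a corrected block might look like the following (the description and keyword wording here is only a suggestion based on the paper title; keep your own phrasing):

```html
<meta name="description"
      content="X-Decoder: a generalized decoding model for pixel-level segmentation and language tokens.">
<meta name="keywords" content="X-Decoder, segmentation, vision-language">
<title>X-Decoder: Generalized Decoding for Pixel, Image and Language</title>
```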

How to download xdecoder_focalt_vqa.pt

The files can't be downloaded from these URLs:

wget https://projects4jw.blob.core.windows.net/x-decoder/release/xdecoder_focalt_last_novg.pt
wget https://projects4jw.blob.core.windows.net/x-decoder/release/xdecoder_focalt_vqa.pt

PQ result during inference

Hi,
Thank you for your consideration.
I am running the evaluation code with the BestSeg Tiny checkpoint on the ADE20K dataset, but I cannot see where PQ appears in the output. Could you show me how to get that value?
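For what it's worth, in the ADE20K log earlier on this page the panoptic numbers appear in the evaluator's final summary dict, under keys like 'ade20k_panoptic_val/panoptic_seg/PQ' (PQ is only produced when the panoptic evaluator actually runs). A small sketch for pulling them out of such a dict (the values below are illustrative, copied from that log):

```python
# Sketch: the evaluator's final summary is a flat dict keyed like
# "<dataset>/panoptic_seg/PQ". Helper to collect the panoptic metrics.

def panoptic_metrics(results):
    return {k.rsplit("/", 1)[-1]: v
            for k, v in results.items()
            if "/panoptic_seg/" in k}

results = {
    "ade20k_panoptic_val/sem_seg/mIoU": 25.14,
    "ade20k_panoptic_val/panoptic_seg/PQ": 18.98,
    "ade20k_panoptic_val/panoptic_seg/SQ": 54.34,
    "ade20k_panoptic_val/panoptic_seg/RQ": 23.49,
}
print(panoptic_metrics(results))  # {'PQ': 18.98, 'SQ': 54.34, 'RQ': 23.49}
```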

mpirun -n 8 python eval.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml --overrides WEIGHT xdecoder_focalt_best_openseg.pt  # BestSeg Tiny: https://projects4jw.blob.core.windows.net/x-decoder/release/xdecoder_focalt_best_openseg.pt
INFO:datasets.evaluation.segmentation_evaluation:OrderedDict([('sem_seg', {'mIoU': 24.906070100032558, 'fwIoU': 56.48813787121508, 'IoU-wall': 66.61977529364381, 'IoU-building': 61.686172646915516
, 'IoU-sky': 93.01439625679751, 'IoU-floor': 55.50932632362232, 'IoU-tree': 64.54442464927908, 'IoU-ceiling': 77.49496143151943, 'IoU-road, route': 77.31548965512836, 'IoU-bed': 77.4902292301069, 
'IoU-window ': 48.1756515295317, 'IoU-grass': 60.422415485190086, 'IoU-cabinet': 48.013506766288764, 'IoU-sidewalk, pavement': 55.61478885979676, 'IoU-person': 82.24137142446519, 'IoU-earth, groun
d': 0.0, 'IoU-door': 36.11607509482755, 'IoU-table': 27.384548258439782, 'IoU-mountain, mount': 40.955413211111924, 'IoU-plant': 18.663105611819013, 'IoU-curtain': 64.62422235351946, 'IoU-chair': 
39.0646160710763, 'IoU-car': 75.91038694971013, 'IoU-water': 21.279867350426827, 'IoU-painting, picture': 9.535676742284622, 'IoU-sofa': 57.18693982433452, 'IoU-shelf': 29.684473915323572, 'IoU-ho
use': 19.03444434625438, 'IoU-sea': 43.15172790035995, 'IoU-mirror': 60.23490728806231, 'IoU-rug': 21.813911723878732, 'IoU-field': 20.382156190607574, 'IoU-armchair': 24.99894621869992, 'IoU-seat
': 18.71646580465292, 'IoU-fence': 26.003060872301774, 'IoU-desk': 7.8400777016298075, 'IoU-rock, stone': 25.894778455588952, 'IoU-wardrobe, closet, press': 0.1436282418813819, 'IoU-lamp': 0.54134
76360545296, 'IoU-tub': 54.26845967428996, 'IoU-rail': 8.755158777335478, 'IoU-cushion': 0.41486664629697106, 'IoU-base, pedestal, stand': 0.0, 'IoU-box': 14.633044429384977, 'IoU-column, pillar':
 0.01473894282814901, 'IoU-signboard, sign': 17.25362379657421, 'IoU-chest of drawers, chest, bureau, dresser': 3.856919525909023, 'IoU-counter': 3.5828783721346875, 'IoU-sand': 42.102149254481276
, 'IoU-sink': 57.40884944322997, 'IoU-skyscraper': 0.00863472303898623, 'IoU-fireplace': 22.032838709976183, 'IoU-refrigerator, icebox': 56.98516552782412, 'IoU-grandstand, covered stand': 25.7227
0214487473, 'IoU-path': 7.399178947435783, 'IoU-stairs': 21.38567914570656, 'IoU-runway': 33.59013043788163, 'IoU-case, display case, showcase, vitrine': 0.0, 'IoU-pool table, billiard table, snoo
ker table': 74.53544749221741, 'IoU-pillow': 0.07518131113896317, 'IoU-screen door, screen': 0.0, 'IoU-stairway, staircase': 0.43446658348730305, 'IoU-river': 10.924773698806375, 'IoU-bridge, span
': 28.482600059873786, 'IoU-bookcase': 0.5151970123771235, 'IoU-blind, screen': 27.547533785121125, 'IoU-coffee table': 0.8817475730284592, 'IoU-toilet, can, commode, crapper, pot, potty, stool, t
hrone': 82.89734111449071, 'IoU-flower': 28.097456972058275, 'IoU-book': 36.279829668107304, 'IoU-hill': 0.041143133735352846, 'IoU-bench': 33.56947155364012, 'IoU-countertop': 16.015611690381427,
 'IoU-stove': 10.272521332419522, 'IoU-palm, palm tree': 2.3869788714500717, 'IoU-kitchen island': 21.81746243524508, 'IoU-computer': 17.14787349332021, 'IoU-swivel chair': 20.157961725297216, 'Io
U-boat': 68.79855577096204, 'IoU-bar': 0.02561358846374377, 'IoU-arcade machine': 10.70786438145052, 'IoU-hovel, hut, hutch, shack, shanty': 3.0488305908362476, 'IoU-bus': 70.62583672672505, 'IoU-
towel': 52.92953584348092, 'IoU-light': 11.821621424549436, 'IoU-truck': 17.873127725791846, 'IoU-tower': 4.224330548274466, 'IoU-chandelier': 1.1220895092091254, 'IoU-awning, sunshade, sunblind':
 6.616024058269303, 'IoU-street lamp': 0.0, 'IoU-booth': 0.15086258505263087, 'IoU-tv': 23.65220188276453, 'IoU-plane': 56.773936346924756, 'IoU-dirt track': 0.43521387670587325, 'IoU-clothes': 21
.258650644694562, 'IoU-pole': 19.32663959381968, 'IoU-land, ground, soil': 0.46559736720594347, 'IoU-bannister, banister, balustrade, balusters, handrail': 0.560122796151464, 'IoU-escalator, movin
g staircase, moving stairway': 0.0162891978031097, 'IoU-ottoman, pouf, pouffe, puff, hassock': 23.25492985064977, 'IoU-bottle': 28.12634833884654, 'IoU-buffet, counter, sideboard': 0.2758246120797
785, 'IoU-poster, posting, placard, notice, bill, card': 19.13703697778062, 'IoU-stage': 1.77526462298315, 'IoU-van': 25.709683301926933, 'IoU-ship': 7.068624870371017, 'IoU-fountain': 3.101805718
290426, 'IoU-conveyer belt, conveyor belt, conveyer, conveyor, transporter': 16.533787213711964, 'IoU-canopy': 0.0, 'IoU-washer, automatic washer, washing machine': 22.37631128950778, 'IoU-playthi
ng, toy': 5.314420570382705, 'IoU-pool': 20.959738781790747, 'IoU-stool': 16.827886868101345, 'IoU-barrel, cask': 4.834328636287863, 'IoU-basket, handbasket': 11.583300088020923, 'IoU-falls': 32.7
7802080487416, 'IoU-tent': 71.40640707727364, 'IoU-bag': 13.473183014587715, 'IoU-minibike, motorbike': 63.639934684823466, 'IoU-cradle': 10.330452727976267, 'IoU-oven': 12.40630869557509, 'IoU-ba
ll': 27.72555385976339, 'IoU-food, solid food': 44.03799235138715, 'IoU-step, stair': 0.11994717922382803, 'IoU-tank, storage tank': 14.484766361076412, 'IoU-trade name': 1.4821179250349048, 'IoU-
microwave': 81.64071768382925, 'IoU-pot': 11.439637772080083, 'IoU-animal': 65.39664495514317, 'IoU-bicycle': 55.0969895488581, 'IoU-lake': 0.0035481019133139567, 'IoU-dishwasher': 2.4782123291820
644, 'IoU-screen': 0.0, 'IoU-blanket, cover': 0.09492631909517851, 'IoU-sculpture': 14.610162806171632, 'IoU-hood, exhaust hood': 0.0, 'IoU-sconce': 0.3579455589777741, 'IoU-vase': 21.379527228890
403, 'IoU-traffic light': 23.907536415452817, 'IoU-tray': 5.549347607095087, 'IoU-trash can': 21.94514409763293, 'IoU-fan': 6.584765209604298, 'IoU-pier': 52.210612766345754, 'IoU-crt screen': 7.6
8999518999519, 'IoU-plate': 14.452253891471493, 'IoU-monitor': 19.805141076205974, 'IoU-bulletin board': 15.188605496920054, 'IoU-shower': 1.1203642318951137, 'IoU-radiator': 22.429703078072507, '
IoU-glass, drinking glass': 22.770563080904573, 'IoU-clock': 19.53432728011467, 'IoU-flag': 42.22801707703967, 'mACC': 40.35234468002594, 'pACC': 68.7597599281644, 'ACC-wall': 75.15670120826825, '
ACC-building': 68.67683393707412, 'ACC-sky': 95.78909199997325, 'ACC-floor': 60.22084554380375, 'ACC-tree': 92.80746256443825, 'ACC-ceiling': 89.93663735376995, 'ACC-road, route': 87.8153535576437
9, 'ACC-bed': 94.38398201483389, 'ACC-window ': 70.65793138930107, 'ACC-grass': 84.02449503110103, 'ACC-cabinet': 75.40193167798917, 'ACC-sidewalk, pavement': 74.67516786509663, 'ACC-person': 93.4
3694710780233, 'ACC-earth, ground': 0.0, 'ACC-door': 58.60643747791655, 'ACC-table': 48.870157085166504, 'ACC-mountain, mount': 54.60010698573554, 'ACC-plant': 21.0733259943901, 'ACC-curtain': 87.
95054658274954, 'ACC-chair': 50.591434334198006, 'ACC-car': 84.48022360668509, 'ACC-water': 27.84542279761975, 'ACC-painting, picture': 9.731982703479733, 'ACC-sofa': 83.30536364230761, 'ACC-shelf
': 62.63653966371074, 'ACC-house': 81.32707087266269, 'ACC-sea': 65.19502955657389, 'ACC-mirror': 78.17358618008736, 'ACC-rug': 88.76685399337744, 'ACC-field': 40.696410424907555, 'ACC-armchair': 
58.069813501995185, 'ACC-seat': 24.152624100075183, 'ACC-fence': 68.61097175507057, 'ACC-desk': 10.22930868729243, 'ACC-rock, stone': 78.59270114544158, 'ACC-wardrobe, closet, press': 0.1457117503
9727374, 'ACC-lamp': 0.5445746343245543, 'ACC-tub': 57.88049209449019, 'ACC-rail': 20.92687386168558, 'ACC-cushion': 0.4199210621996574, 'ACC-base, pedestal, stand': 0.0, 'ACC-box': 16.91968602077
615, 'ACC-column, pillar': 0.014741265376664746, 'ACC-signboard, sign': 18.173210119568388, 'ACC-chest of drawers, chest, bureau, dresser': 6.056264461843328, 'ACC-counter': 5.369140951577996, 'AC
C-sand': 79.06253722594334, 'ACC-sink': 77.63702513641833, 'ACC-skyscraper': 0.008999024316310968, 'ACC-fireplace': 33.242078468315626, 'ACC-refrigerator, icebox': 80.61806003729038, 'ACC-grandsta
nd, covered stand': 55.012662215203775, 'ACC-path': 8.455863083593265, 'ACC-stairs': 34.94661706879935, 'ACC-runway': 45.469493363822686, 'ACC-case, display case, showcase, vitrine': 0.0, 'ACC-poo
l table, billiard table, snooker table': 79.12364012022634, 'ACC-pillow': 0.08182200100365528, 'ACC-screen door, screen': 0.0, 'ACC-stairway, staircase': 0.45854961938294114, 'ACC-river': 53.40964
6868303604, 'ACC-bridge, span': 91.17194986492365, 'ACC-bookcase': 0.5256641175594662, 'ACC-blind, screen': 44.68027965107038, 'ACC-coffee table': 1.1000662146558753, 'ACC-toilet, can, commode, cr
apper, pot, potty, stool, throne': 89.06724591073926, 'ACC-flower': 58.852205974377355, 'ACC-book': 57.24606762924744, 'ACC-hill': 0.041354229382179974, 'ACC-bench': 67.33395668697592, 'ACC-counte
rtop': 54.03231228704103, 'ACC-stove': 11.937893768857206, 'ACC-palm, palm tree': 2.63794537266041, 'ACC-kitchen island': 27.909905774199434, 'ACC-computer': 18.984995644741925, 'ACC-swivel chair'
: 49.960279406736305, 'ACC-boat': 84.16026197565539, 'ACC-bar': 0.031770913970170823, 'ACC-arcade machine': 14.247488261725652, 'ACC-hovel, hut, hutch, shack, shanty': 14.606730445461208, 'ACC-bus
': 95.69015807634807, 'ACC-towel': 70.05251673009853, 'ACC-light': 67.63898063484336, 'ACC-truck': 77.54445385266723, 'ACC-tower': 5.937523893669983, 'ACC-chandelier': 1.1684525706219897, 'ACC-awn
ing, sunshade, sunblind': 6.683245602112896, 'ACC-street lamp': 0.0, 'ACC-booth': 0.31014948361165906, 'ACC-tv': 88.33659716622509, 'ACC-plane': 65.70826800415259, 'ACC-dirt track': 20.84110890936
7247, 'ACC-clothes': 44.47192964329344, 'ACC-pole': 24.358429311277728, 'ACC-land, ground, soil': 2.5005435964340075, 'ACC-bannister, banister, balustrade, balusters, handrail': 1.188901409370764,
 'ACC-escalator, moving staircase, moving stairway': 0.01704667895314093, 'ACC-ottoman, pouf, pouffe, puff, hassock': 32.68594650789151, 'ACC-bottle': 49.44455857023213, 'ACC-buffet, counter, side
board': 0.575008821879276, 'ACC-poster, posting, placard, notice, bill, card': 43.11282519543765, 'ACC-stage': 6.299271112652419, 'ACC-van': 28.385769962751855, 'ACC-ship': 7.534140075716604, 'ACC
-fountain': 3.7697330071891897, 'ACC-conveyer belt, conveyor belt, conveyer, conveyor, transporter': 57.181953140127796, 'ACC-canopy': 0.0, 'ACC-washer, automatic washer, washing machine': 22.4823
30654615268, 'ACC-plaything, toy': 20.16468316860214, 'ACC-pool': 39.71261974177426, 'ACC-stool': 40.52768837010897, 'ACC-barrel, cask': 56.09284332688588, 'ACC-basket, handbasket': 20.20389454035
3666, 'ACC-falls': 97.21638040459206, 'ACC-tent': 97.23345227556878, 'ACC-bag': 20.450235714746995, 'ACC-minibike, motorbike': 90.5424878500474, 'ACC-cradle': 30.83022818957288, 'ACC-oven': 75.076
36381995809, 'ACC-ball': 34.179809213364656, 'ACC-food, solid food': 91.2132115723329, 'ACC-step, stair': 0.12409349135330214, 'ACC-tank, storage tank': 47.44675391326661, 'ACC-trade name': 1.5337
997519713598, 'ACC-microwave': 94.3721055965209, 'ACC-pot': 15.866938656916988, 'ACC-animal': 86.82721077890477, 'ACC-bicycle': 84.62376841921701, 'ACC-lake': 0.003810539953511413, 'ACC-dishwasher
': 3.5840118152860265, 'ACC-screen': 0.0, 'ACC-blanket, cover': 0.13857919853293602, 'ACC-sculpture': 64.88891435681658, 'ACC-hood, exhaust hood': 0.0, 'ACC-sconce': 0.4362750900451494, 'ACC-vase': 71.10418893006829, 'ACC-traffic light': 32.50758035172832, 'ACC-tray': 16.195304534852863, 'ACC-trash can': 26.094582519773635, 'ACC-fan': 9.99769344696142, 'ACC-pier': 79.61677079999443, 'ACC-crt screen': 17.499422785849276, 'ACC-plate': 25.26266284611618, 'ACC-monitor': 25.13872113987642, 'ACC-bulletin board': 47.330646015438575, 'ACC-shower': 17.920782490799922, 'ACC-radiator': 23.469
234552561414, 'ACC-glass, drinking glass': 28.009329682587975, 'ACC-clock': 32.82381945761426, 'ACC-flag': 53.67568311405477})])
INFO:detectron2.evaluation.coco_evaluation:Preparing results for COCO format ...                                                                                                                    
INFO:detectron2.evaluation.coco_evaluation:Saving results to ../data/train_outputs/xdecoder/test/coco_instances_results.json                                                                        
INFO:detectron2.evaluation.coco_evaluation:Evaluating predictions with unofficial COCO API...                                                                                                       
Loading and preparing results...                                                                  
DONE (t=0.07s)                                                                                    
creating index...                                                                                 
index created!                                                                                    
INFO:detectron2.evaluation.fast_eval_api:Evaluate annotation type *bbox*                                                                                                                            
INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.evaluate() finished in 4.56 seconds.                                                                                                          
INFO:detectron2.evaluation.fast_eval_api:Accumulating evaluation results...                                                                                                                         
INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.accumulate() finished in 0.65 seconds.                                                                                                        
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000                                                                                                                     
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000                                                                                                                     
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000                                                                                                                     
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000                                                                                                                     
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000                                                                                                                     
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000                                                                                                                     
INFO:detectron2.evaluation.coco_evaluation:Evaluation results for bbox:                                                                                                                             
|  AP   |  AP50  |  AP75  |  APs  |  APm  |  APl  |                                                                                                                                                 
|:-----:|:------:|:------:|:-----:|:-----:|:-----:|                                                                                                                                                 
| 0.000 | 0.000  | 0.000  | 0.000 | 0.000 | 0.000 |                                                                                                                                                 
INFO:detectron2.evaluation.coco_evaluation:Per-category bbox AP:                                                                                                                                    
| category                   | AP    | category                                                 | AP    | category                                  | AP    |
|:---------------------------|:------|:---------------------------------------------------------|:------|:------------------------------------------|:------|
| bed                        | 0.000 | window                                                   | 0.000 | cabinet                                   | 0.000 |
| person                     | 0.000 | door                                                     | 0.000 | table                                     | 0.000 |
| curtain                    | 0.000 | chair                                                    | 0.000 | car                                       | 0.000 |
| painting, picture          | 0.000 | sofa                                                     | 0.000 | shelf                                     | 0.000 |
| mirror                     | 0.000 | armchair                                                 | 0.000 | seat                                      | 0.000 |
| fence                      | 0.000 | desk                                                     | 0.000 | wardrobe, closet, press                   | 0.000 |
| lamp                       | 0.000 | tub                                                      | 0.000 | rail                                      | 0.000 |
| cushion                    | 0.000 | box                                                      | 0.000 | column, pillar                            | 0.000 |
| signboard, sign            | 0.000 | chest of drawers, chest, bureau, dresser                 | 0.000 | counter                                   | 0.000 |
| sink                       | 0.000 | fireplace                                                | 0.000 | refrigerator, icebox                      | 0.000 |
| stairs                     | 0.000 | case, display case, showcase, vitrine                    | 0.000 | pool table, billiard table, snooker table | 0.000 |
| pillow                     | 0.000 | screen door, screen                                      | 0.000 | bookcase                                  | 0.000 |
| coffee table               | 0.000 | toilet, can, commode, crapper, pot, potty, stool, throne | 0.000 | flower                                    | 0.000 |
| book                       | 0.000 | bench                                                    | 0.000 | countertop                                | 0.000 |
| stove                      | 0.000 | palm, palm tree                                          | 0.000 | kitchen island                            | 0.000 |
| computer                   | 0.000 | swivel chair                                             | 0.000 | boat                                      | 0.000 |
| arcade machine             | 0.000 | bus                                                      | 0.000 | towel                                     | 0.000 |
| light                      | 0.000 | truck                                                    | 0.000 | chandelier                                | 0.000 |
| awning, sunshade, sunblind | 0.000 | street lamp                                              | 0.000 | booth                                     | 0.000 |
| tv                         | 0.000 | plane                                                    | 0.000 | clothes                                   | 0.000 |
| pole                       | 0.000 | bannister, banister, balustrade, balusters, handrail     | 0.000 | ottoman, pouf, pouffe, puff, hassock      | 0.000 |
| bottle                     | 0.000 | van                                                      | 0.000 | ship                                      | 0.000 |
| fountain                   | 0.000 | washer, automatic washer, washing machine                | 0.000 | plaything, toy                            | 0.000 |
| stool                      | 0.000 | barrel, cask                                             | 0.000 | basket, handbasket                        | 0.000 |
| bag                        | 0.000 | minibike, motorbike                                      | 0.000 | oven                                      | 0.000 |
| ball                       | 0.000 | food, solid food                                         | 0.000 | step, stair                               | 0.000 |
| trade name                 | 0.000 | microwave                                                | 0.000 | pot                                       | 0.000 |                                       
| animal                     | 0.000 | bicycle                                                  | 0.000 | dishwasher                                | 0.000 |                                       
| screen                     | 0.000 | sculpture                                                | 0.000 | hood, exhaust hood                        | 0.000 |                                       
| sconce                     | 0.000 | vase                                                     | 0.000 | traffic light                             | 0.000 |                                       
| tray                       | 0.000 | trash can                                                | 0.000 | fan                                       | 0.000 |                                       
| plate                      | 0.000 | monitor                                                  | 0.000 | bulletin board                            | 0.000 |                                       
| radiator                   | 0.000 | glass, drinking glass                                    | 0.000 | clock                                     | 0.000 |                                       
| flag                       | 0.000 |                                                          |       |                                           |       |                                       
Loading and preparing results...                                                                                                                                                                    
DONE (t=0.87s)                                                                                                                                                                                      
creating index...                                                                                                                                                                                   
index created!                                                                                                                                                                                      
INFO:detectron2.evaluation.fast_eval_api:Evaluate annotation type *segm*                                                                                                                            
INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.evaluate() finished in 4.98 seconds.                                                                                                          
INFO:detectron2.evaluation.fast_eval_api:Accumulating evaluation results...                                                                                                                         
INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.accumulate() finished in 0.60 seconds.                                                                                                        
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.100                                                                                                                     
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.180                                                                                                                     
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.095                                                                                                                     
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.034                                                                                                                     
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.118                                                                                                                     
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.210                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.153                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.238                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.246                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.092                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.265                                                                                                                     
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.399                                                                                                                     
INFO:detectron2.evaluation.coco_evaluation:Evaluation results for segm:                                                                                                                             
|  AP   |  AP50  |  AP75  |  APs  |  APm   |  APl   |                                                                                                                                               
|:-----:|:------:|:------:|:-----:|:------:|:------:|                                                                                                                                               
| 9.953 | 17.999 | 9.532  | 3.372 | 11.822 | 20.981 |                                                                                                                                               
INFO:detectron2.evaluation.coco_evaluation:Per-category segm AP:                                                                                                                                    
| category                   | AP     | category                                                 | AP     | category                                  | AP     |
|:---------------------------|:-------|:---------------------------------------------------------|:-------|:------------------------------------------|:-------|
| bed                        | 39.526 | window                                                   | 14.031 | cabinet                                   | 4.851  |
| person                     | 25.881 | door                                                     | 13.307 | table                                     | 3.983  |
| curtain                    | 19.278 | chair                                                    | 13.749 | car                                       | 30.366 |
| painting, picture          | 1.841  | sofa                                                     | 35.950 | shelf                                     | 2.820  |
| mirror                     | 35.698 | armchair                                                 | 16.265 | seat                                      | 1.160  |
| fence                      | 2.201  | desk                                                     | 1.411  | wardrobe, closet, press                   | 0.025  |
| lamp                       | 3.738  | tub                                                      | 19.004 | rail                                      | 0.470  |
| cushion                    | 3.202  | box                                                      | 2.820  | column, pillar                            | 1.295  |
| signboard, sign            | 5.334  | chest of drawers, chest, bureau, dresser                 | 9.151  | counter                                   | 0.226  |
| sink                       | 21.508 | fireplace                                                | 4.796  | refrigerator, icebox                      | 54.810 |
| stairs                     | 6.297  | case, display case, showcase, vitrine                    | 0.019  | pool table, billiard table, snooker table | 45.733 |
| pillow                     | 0.389  | screen door, screen                                      | 0.183  | bookcase                                  | 0.071  |
| coffee table               | 12.674 | toilet, can, commode, crapper, pot, potty, stool, throne | 51.184 | flower                                    | 6.356  |
| book                       | 1.527  | bench                                                    | 3.421  | countertop                                | 3.035  |
| stove                      | 13.380 | palm, palm tree                                          | 1.796  | kitchen island                            | 18.354 |
| computer                   | 0.644  | swivel chair                                             | 6.743  | boat                                      | 12.381 |
| arcade machine             | 8.560  | bus                                                      | 36.174 | towel                                     | 13.658 |
| light                      | 0.660  | truck                                                    | 9.730  | chandelier                                | 2.660  |
| awning, sunshade, sunblind | 4.058  | street lamp                                              | 0.047  | booth                                     | 0.246  |
| tv                         | 50.874 | plane                                                    | 20.882 | clothes                                   | 0.990  |
| pole                       | 0.214  | bannister, banister, balustrade, balusters, handrail     | 0.093  | ottoman, pouf, pouffe, puff, hassock      | 9.956  |
| bottle                     | 9.767  | van                                                      | 11.846 | ship                                      | 13.218 |
| fountain                   | 0.703  | washer, automatic washer, washing machine                | 4.326  | plaything, toy                            | 0.099  |
| stool                      | 3.533  | barrel, cask                                             | 3.285  | basket, handbasket                        | 3.355  |
| bag                        | 2.679  | minibike, motorbike                                      | 18.870 | oven                                      | 9.317  |
| ball                       | 7.168  | food, solid food                                         | 0.943  | step, stair                               | 0.691  |
| trade name                 | 0.839  | microwave                                                | 56.907 | pot                                       | 0.579  |
| animal                     | 14.792 | bicycle                                                  | 7.610  | dishwasher                                | 2.745  |
| screen                     | 0.264  | sculpture                                                | 5.385  | hood, exhaust hood                        | 0.000  |
| sconce                     | 0.851  | vase                                                     | 15.081 | traffic light                             | 4.156  |
| tray                       | 0.703  | trash can                                                | 8.347  | fan                                       | 0.497  |
| plate                      | 0.300  | monitor                                                  | 5.606  | bulletin board                            | 0.783  |
| radiator                   | 14.912 | glass, drinking glass                                    | 7.721  | clock                                     | 17.529 |
| flag                       | 8.235  |                                                          |        |                                           |        |
INFO:__main__:{'ade20k_panoptic_val/sem_seg/mIoU': 24.906070100032558, 'ade20k_panoptic_val/bbox/AP': 0.0, 'ade20k_panoptic_val/segm/AP': 9.953305772030166}

'refcocog_umd_val.json' file not found

Hi, when I try to evaluate X-Decoder using the suggested refcocog dataset, I get an error saying the 'refcocog_umd_VAL.json' file is not found. After checking the scripts, 'refcoco2json.py' only produces the file 'refcocog_umd_TRAIN.json'. However, if I simply change the directory in 'refcoco2json.py' from train2017 to val2017, the script does not work. How can I create the required 'refcocog_umd_val.json' file?

coco_caption.zip download link is invalid

When I run install_cococapeval.sh, I cannot download coco_caption.zip; it returns a PublicAccessNotPermitted error. How can I solve this? Thanks for your work!

HuggingFace Spaces demo not working

The HF Spaces demo keeps building for a long time and shows no sign of progress. The Instruct X-Decoder demo is similarly broken: I get a stack trace complaining about communication with OpenAI. Ideally both, or at least one, of these demos would be usable and testable from the user's perspective.

image

This is a screenshot of stack trace for the second demo.

image

Clarification on Referring Segmentation

Based on the code:

texts_grd.append([x['raw'].lower() for x in ann['sentences']])

# grounding text embeddings, transposed for the matrix product below
t_emb = getattr(self.sem_seg_head.predictor.lang_encoder, "{}_text_embeddings".format('grounding')).t()
# per-mask visual embeddings (last entry dropped), L2-normalized
v_emb = caption_pred_result[:-1]
v_emb = v_emb / (v_emb.norm(dim=-1, keepdim=True) + 1e-7)
# pick the mask whose embedding best matches the grounding text
vt_sim = v_emb @ t_emb
max_id = vt_sim.max(0)[1][0]
grd_masks += [mask_pred_result[max_id]]

Is it true that referring segmentation in X-Decoder is done by segmentation -> classification (matching mask with highest similarity)?
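From the snippet above, that does appear to be the mechanism: the decoder first proposes masks, then the grounding text embedding is matched against per-mask visual embeddings and the best-scoring mask is returned. A minimal numpy sketch of just the matching step (names and shapes here are illustrative, not the repo's actual API):

```python
import numpy as np

def pick_referred_mask(mask_embeds, text_embed, masks):
    """Select the mask whose L2-normalized embedding is most similar
    to the grounding-text embedding: segmentation -> matching."""
    v = mask_embeds / (np.linalg.norm(mask_embeds, axis=-1, keepdims=True) + 1e-7)
    t = text_embed / (np.linalg.norm(text_embed) + 1e-7)
    sim = v @ t                          # one similarity score per proposed mask
    return masks[int(np.argmax(sim))]

# toy example: 3 proposed masks, embedding dim 4
rng = np.random.default_rng(0)
masks = np.stack([np.full((2, 2), i) for i in range(3)])
embeds = rng.normal(size=(3, 4))
text = embeds[1].copy()                  # text embedding closest to mask 1
picked = pick_referred_mask(embeds, text, masks)
```

Note the repo snippet takes the argmax over the mask dimension for the first text embedding; with several referring phrases you would argmax per column of the similarity matrix.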

About X-Decoder

Thank you for your great work.

I’m trying to run X-Decoder in a local environment.

For training and evaluation, what PC specs do you expect when running on a single GPU?

Also, for DATASET.md, do I need to download the entire dataset?

Finally, regarding the demo, when do you plan to release it?

I look forward to hearing from you.

GPU cost during training

Thanks for sharing the awesome work!

I have a minor question.

How many GPU hours does the model training process need on the COCO and ADE20K datasets?

The loss of referring segmentation

Thanks for the great work,

In Section 4.1, you mention that the model was pre-trained on panoptic segmentation, image-text pairs (itp), and referring segmentation. I can't find the details of how you use the referring segmentation data in Section 3.4. Would you mind providing more details about the referring segmentation loss in the pre-training phase? Or did I miss it?

Thanks

FileNotFoundError: [Errno 2] No such file or directory: 'caption_class_similarity.pth'

Thank you for your great work.
When I try to run the training script, I get an error; can you give me some help?
The error occurs when I run this command:

CUDA_VISIBLE_DEVICES=0 python entry.py train \
            --conf_files configs/xdecoder/segvlp_focalt_lang.yaml \
            --overrides \
            COCO.INPUT.IMAGE_SIZE 1024 \
            MODEL.DECODER.CAPTIONING.ENABLED True \
            MODEL.DECODER.RETRIEVAL.ENABLED True \
            MODEL.DECODER.GROUNDING.ENABLED True \
            MODEL.DECODER.CAPTIONING_WEIGHT 8 \
            MODEL.DECODER.RETRIEVAL_WEIGHT 8 \
            MODEL.DECODER.TOP_CAPTIONING_LAYERS 3 \
            MODEL.DECODER.TOP_RETRIEVAL_LAYERS 3 \
            MODEL.DECODER.TOP_GROUNDING_LAYERS 6 \
            COCO.TEST.BATCH_SIZE_TOTAL 1 \
            COCO.TRAIN.BATCH_SIZE_TOTAL 1 \
            COCO.TRAIN.BATCH_SIZE_PER_GPU 1 \
            VLP.TEST.BATCH_SIZE_TOTAL 32 \
            VLP.TRAIN.BATCH_SIZE_TOTAL 32 \
            VLP.TRAIN.BATCH_SIZE_PER_GPU 32 \
            MODEL.DECODER.HIDDEN_DIM 512 \
            MODEL.ENCODER.CONVS_DIM 512 \
            MODEL.ENCODER.MASK_DIM 512 \
            FP16 True

and got:
FileNotFoundError: [Errno 2] No such file or directory: 'caption_class_similarity.pth'
Where can I download this .pth file?

sh install_cococapeval.sh : [ERROR 409: Public access is not permitted on this storage account..]

Thank you for your great work.

I ran install_cococapeval.sh and got the following error.
Error Description:
--2023-07-21 18:39:40--
Resolving projects4jw.blob.core.windows.net (projects4jw.blob.core.windows.net)... 20.60.153.33
Connecting to projects4jw.blob.core.windows.net (projects4jw.blob.core.windows.net)|20.60.153.33|:443... connected.
HTTP request sent, awaiting response... 409 Public access is not permitted on this storage account.
2023-07-21 18:39:40 ERROR 409: Public access is not permitted on this storage account..

Probably because this URL [https://projects4jw.blob.core.windows.net/x-decoder/release/coco_caption.zip] is not publicly available.
Can you give us public access to your resources?

The shape mismatch in evaluation

**I encountered a simple bug when evaluating. How can this problem be solved? It seems like 817920 is exactly three times 272640.**


/home/user/anaconda3/envs/xdecoder/lib/python3.9/site-packages/detectron2/structures/image_list.py:88: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  max_size = (max_size + (stride - 1)) // stride * stride
/nvme/user/project/X-Decoder/xdecoder/modules/position_encoding.py:41: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
/nvme/user/project/X-Decoder/xdecoder/architectures/xdecoder_model.py:899: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  topk_indices = (topk_indices // self.sem_seg_head.num_classes)
Traceback (most recent call last):
  File "/nvme/user/project/X-Decoder/entry.py", line 53, in <module>
    main()
  File "/nvme/user/project/X-Decoder/entry.py", line 48, in main
    trainer.eval()
  File "/nvme/user/project/X-Decoder/trainer/default_trainer.py", line 82, in eval
    results = self._eval_on_set(self.save_folder)
  File "/nvme/user/project/X-Decoder/trainer/default_trainer.py", line 87, in _eval_on_set
    results = self.pipeline.evaluate_model(self, save_folder)
  File "/nvme/user/project/X-Decoder/./pipeline/XDecoderPipeline.py", line 162, in evaluate_model
    self.evaluator.process(batch, outputs)
  File "/home/user/anaconda3/envs/xdecoder/lib/python3.9/site-packages/detectron2/evaluation/evaluator.py", line 88, in process
    evaluator.process(inputs, outputs)
  File "/nvme/user/project/X-Decoder/datasets/evaluation/segmentation_evaluation.py", line 113, in process
    (self._num_classes + 4) * pred.reshape(-1) + gt.reshape(-1),
ValueError: operands could not be broadcast together with shapes (272640,) (817920,) 
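For context, the failing line builds a joint pred/gt histogram by encoding each (pred, gt) pair into a single integer, which requires both flattened arrays to have equal length; since 817920 = 3 Γ— 272640, pred likely still carries an extra factor-of-three dimension (e.g. channels) that should be reduced before flattening. A hedged numpy sketch of the same histogram trick (the evaluator adds a `+ 4` offset for special labels, omitted here):

```python
import numpy as np

def confusion_hist(pred, gt, num_classes):
    """Joint (pred, gt) histogram via the encode-pairs-then-bincount trick
    used by many segmentation evaluators. Both arrays must flatten to the
    same length, or the elementwise encoding below cannot broadcast."""
    pred, gt = pred.reshape(-1), gt.reshape(-1)
    assert pred.shape == gt.shape, f"shape mismatch: {pred.shape} vs {gt.shape}"
    pairs = num_classes * pred + gt      # unique integer id per (pred, gt) pair
    hist = np.bincount(pairs, minlength=num_classes ** 2)
    return hist.reshape(num_classes, num_classes)

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
cm = confusion_hist(pred, gt, num_classes=2)
```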

Question about the Segmentation task

Hi @MaureenZOU ,
Thank you for your consideration

I have a question: if I run this with the BestSeg Tiny checkpoint (https://projects4jw.blob.core.windows.net/x-decoder/release/xdecoder_focalt_best_openseg.pt):

mpirun -n 8 python eval.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml --overrides WEIGHT xdecoder_focalt_best_openseg.pt

restricted to only the ADE panoptic dataset:
image

Which of the following two models would it be?
image

Referring captioning demo not using grounding mask

Hi, thanks for the great work! Quick question about demo/demo_refcap.py: the grounding mask is zeroed out at this line, which seems counterintuitive if we want to pass it to the cross-attention layers. Should the line be removed for proper behavior?

Unable to reproduce open segmentation results for Pascal VOC

In your paper you report the mIoU of the X-Decoder (T) model to be 96.2. I tried to reproduce these results. I did not find the appropriate evaluation script, so I implemented it myself based on the demo_semseg.py file. I'm using the BestSeg Tiny model, and for every image I input the labels present in that image's ground-truth segmentation.

The mIoU I get this way is 51.6. In many cases the model does not find the target class and segments everything as "background".

Here is the function that I use to segment the image, based on demo_semseg.py.

def segment_image(model, image_ori, classes):
    with torch.no_grad():
        model.model.sem_seg_head.predictor.lang_encoder.get_text_embeddings(classes + ["background"], is_eval=True)
        metadata = MetadataCatalog.get('demo')
        model.model.metadata = metadata
        model.model.sem_seg_head.num_classes = len(classes)

        t = [transforms.Resize(512, interpolation=Image.BICUBIC)]
        transform = transforms.Compose(t)

        width = image_ori.size[-2]
        height = image_ori.size[-1]
        image = transform(image_ori)
        image = np.asarray(image)
        image = torch.from_numpy(image.copy()).permute(2, 0, 1).cuda()

        batch_inputs = [{'image': image.squeeze(), 'height': height, 'width': width}]
        outputs = model.forward(batch_inputs)
        sem_seg = outputs[-1]['sem_seg'].max(0)[1]
        classes_detected = sem_seg.unique()
        classes_detected = [classes[i] for i in classes_detected]

    return sem_seg, classes_detected

Am I doing something wrong here? Could you maybe share the code you used to obtain the reported result on VOC?
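For anyone debugging the same gap, a minimal per-class IoU/mIoU routine to run over the returned sem_seg maps might look like this (a sketch with my own class-indexing assumptions, not the repo's evaluator):

```python
import numpy as np

def miou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over the classes present in pred or gt, skipping pixels
    labeled ignore_index in the ground truth."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = (p | g).sum()
        if union == 0:
            continue                     # class absent in both -> skip
        ious.append((p & g).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
score = miou(pred, gt, num_classes=2)
```

One caveat: official benchmarks usually accumulate intersections and unions over the whole dataset before dividing, rather than averaging per-image IoUs, and that choice alone can shift mIoU noticeably.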

Typo in Paper

In Figure 4, there is a typo in the word "green". Great work team!
image

Generic segmentation

Hi @MaureenZOU ,
I have a quick question: where in the code is the arrow from the Text Encoder to the Semantic Output implemented at inference time?

image

Attempt to run X-Decoder demo in google colab

Hi there! The hugging face demo is super cool! Thanks for sharing! :)

Today I tried running inference in google colab.

I installed following your instructions with:

!git clone https://github.com/microsoft/X-Decoder.git
!pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
!python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
!pip install git+https://github.com/cocodataset/panopticapi.git
!curl -O https://projects4jw.blob.core.windows.net/x-decoder/release/xdecoder_focalt_last_novg.pt
!python -m pip install -r X-Decoder/requirements.txt

And then ran with:

%cd /content/X-Decoder
!python demo/demo_captioning.py evaluate --conf_files configs/xdecoder/svlp_focalt_lang.yaml  --overrides WEIGHT ../xdecoder_focalt_last_novg.pt

And got the following segmentation fault error:

/content/X-Decoder
$UNUSED$ criterion.empty_weight, Ckpt Shape: torch.Size([134])
/usr/local/lib/python3.8/dist-packages/torchvision/transforms/transforms.py:329: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
  warnings.warn(
2023-01-03 10:20:33.905664: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-03 10:20:35.257457: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-01-03 10:20:35.257602: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-01-03 10:20:35.257631: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[c639da4ebb3c:07861] *** Process received signal ***
[c639da4ebb3c:07861] Signal: Segmentation fault (11)
[c639da4ebb3c:07861] Signal code: Address not mapped (1)
[c639da4ebb3c:07861] Failing at address: 0x7f282066f20d
[c639da4ebb3c:07861] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f2822f11980]
[c639da4ebb3c:07861] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f2822b50775]
[c639da4ebb3c:07861] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f28233bbe44]
[c639da4ebb3c:07861] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f2822b51605]
[c639da4ebb3c:07861] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f28233b9cb3]
[c639da4ebb3c:07861] *** End of error message ***

Do you have any idea what might be going on?

A little bug in `vlpencoder.py`

code:

clss_embeddings.append(extract_mean_emb(class_names))

I think this should be:

if prompt:
    for clss in class_names:
        txts = [template.format(clss.replace('-other','').replace('-merged','').replace('-stuff','')) for template in templates]
        clss_embeddings.append(extract_mean_emb(txts))
else:
    for clss in class_names:
        clss_embeddings.append(extract_mean_emb([clss]))
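For reference, the prompt branch strips COCO-panoptic suffixes from the class name before filling each template; a small self-contained illustration of that string handling (the template list here is made up, not the repo's template set):

```python
templates = ["a photo of a {}.", "a photo of the {}."]

def build_prompts(clss, templates):
    # strip COCO panoptic suffixes before filling the templates
    clean = clss.replace('-other', '').replace('-merged', '').replace('-stuff', '')
    return [t.format(clean) for t in templates]

prompts = build_prompts('wall-other-merged', templates)
```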

How to get pred bboxes

Hi there,

I was playing with inference demo.
I was wondering from the output of the model how to get the predicted bboxes in addition to the segmentation.
The pred_boxes field is empty although it detects segments. Any help is much appreciated.

ipdb> outputs[-1].keys()
dict_keys(['sem_seg', 'panoptic_seg', 'instances', 'captions', 'masks'])
ipdb> outputs[-1]['instances']
Instances(num_instances=0, image_height=408, image_width=612, fields=[pred_masks: tensor([], device='cuda:0', size=(0, 408, 612)), pred_boxes: Boxes(tensor([], size=(0, 4))), scores: tensor([], device='cuda:0'), pred_classes: tensor([], device='cuda:0', dtype=torch.int64)])

Thanks
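Until pred_boxes is populated by the demo pipeline, one workaround is to derive tight boxes from pred_masks directly (a sketch assuming binary HΓ—W masks and the XYXY convention):

```python
import numpy as np

def mask_to_box(mask):
    """Tight XYXY box around the nonzero region of a binary mask,
    or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    return [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]

mask = np.zeros((5, 5), dtype=bool)
mask[1:3, 2:5] = True
box = mask_to_box(mask)
```

The same idea works on the tensors in `outputs[-1]['instances'].pred_masks` after moving them to CPU and thresholding.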

is the model can segment all types of classes?

Hi,thanks for your great work!
It seems X-Decoder is a zero-shot segmentation model, but when I test it on my own data, which includes floor and bed classes, it does not work well. Could you please explain why? Thanks!

Question about caption finetune

Hello, thanks for your great work!

I would like to reproduce the results of caption finetuning. On which dataset did you perform the caption finetuning?
And does the current code support finetuning?

Text Encoder src code

Hi,
Thank you for your consideration.
Could you show me where I could find this part in the code?
image

Thank you.

Error when running the demo

Your great work is awesome! When I try to run the demo_semseg.py script, I get the error below. How can I run it successfully?

Traceback (most recent call last):
  File "D:\Projects\X-Decoder\demo\demo_instseg.py", line 97, in <module>
    main()
  File "D:\Projects\X-Decoder\demo\demo_instseg.py", line 40, in main
    opt, cmdline_args = load_opt_command(args)
  File "D:\Projects\X-Decoder\utils\arguments.py", line 69, in load_opt_command
    assert len(cmdline_args.overrides) % 2 == 0, "overrides arguments is not paired, required: key value"
AssertionError: overrides arguments is not paired, required: key value

How can I input the correct overrides arguments? Thank you.

About Training Dataset

Hello,

Thank you so much for updating the file related to training.

I would like to request one correction regarding the Training Dataset.

When three json files were downloaded by wget,

captions_train2017_filtrefgumdval_filtvlp.json
grounding_train2017_filtrefgumdval_filtvlp.json
panoptic_train2017_filtrefgumdval_filtvlp.json

I found that the file format is not json but html.

So, I changed it to the address listed on the hugging face.

wget -P ../xdecoder_data https://huggingface.co/xdecoder/X-Decoder/resolve/main/captions_train2017_filtrefgumdval_filtvlp.json
wget -P ../xdecoder_data https://huggingface.co/xdecoder/X-Decoder/resolve/main/grounding_train2017_filtrefgumdval_filtvlp.json
wget -P ../xdecoder_data https://huggingface.co/xdecoder/X-Decoder/resolve/main/panoptic_train2017_filtrefgumdval_filtvlp.json

Please confirm and thank you again.

Question about DaViT-L pretrained checkpoint.

Hi, thanks for your excellent work!

I noticed you used DaViT-L as a backbone in your experiments. However, the original repo does not contain a pretrained checkpoint for DaViT-L. Do you plan on releasing that anytime soon in the future?

Additionally, do you have any numbers for DaViT-L Mask2Former or Swin-L X-Decoder on the ADE20K dataset, as that would be a fairer comparison of the decoder architectures (same backbone)? In particular, I am interested in your experimental setting for the 52.4 PQ (SOTA) result on ADE20K.

Thanks!

Where to download checkpoints with "novg"

Thanks for your work!

I notice the code in "demo_captioning.py ":
if 'novg' not in pretrained_pth: assert False, "Using the ckpt without visual genome training data will be much better."

But I cannot find where to download the checkpoints with "novg".
Have you already uploaded this checkpoint?

Thanks!

Extension to Video Datasets

How do we extend x-decoder to video datasets? In the appendix, it is mentioned that the model generalizes to generic segmentation and referring segmentation on videos.
