Semantic Segment Anything
Jiaqi Chen, Zeyu Yang, and Li Zhang
Zhang Vision Group, Fudan University

The Semantic Segment Anything (SSA) project enhances the Segment Anything dataset (SA-1B) with a dense category annotation engine. SSA is an automated annotation engine that provides initial semantic labels for the SA-1B dataset, although human review and refinement may be required for more accurate labeling. Thanks to its combined architecture of close-set segmentation and open-vocabulary segmentation, SSA produces satisfactory labels for most samples and can provide more detailed annotations by using an image captioning method. This tool fills the gap left by SA-1B's limited fine-grained semantic labeling, while also significantly reducing the need for manual annotation and its associated costs. It has the potential to serve as a foundation for training large-scale visual perception models and more fine-grained CLIP models.

🤔 Why do we need SSA?

  • SA-1B is the largest image segmentation dataset to date, providing fine mask segmentation annotations. However, it does not provide category annotations for each mask, which are essential for training a semantic segmentation model.
  • Advanced close-set segmenters like OneFormer, open-set segmenters like CLIPSeg, and image captioning methods like BLIP can provide rich semantic annotations. However, their mask segmentation predictions may not be as comprehensive and accurate as the mask annotations in SA-1B.
  • Therefore, by combining the fine image segmentation annotations of SA-1B with the rich semantic annotations provided by these advanced models, we can provide a more densely categorized image segmentation dataset.

👍 What can SSA do?

  • SSA + SA-1B: SSA provides open-vocabulary, dense, mask-level category annotations for the large-scale SA-1B dataset. After manual review and refinement, these annotations can be used to train segmentation models or fine-grained CLIP models.
  • SSA + SAM: This combination can provide detailed segmentation masks and category labels for new data, while keeping manual labor costs relatively low. Users can first run SAM to obtain mask annotations, and then input the image and mask annotation files into SSA to obtain category labels.

🚄 Semantic Segment Anything engine

The SSA engine consists of three components:

  • (I) Close-set semantic segmentor (green). Two close-set semantic segmentation models, trained on the COCO and ADE20K datasets respectively, are used to segment the image and obtain rough category information. The predicted categories only include simple and basic categories to ensure that each mask receives a relevant label.
  • (II) Open-vocabulary classifier (blue). An image captioning model is utilized to describe the cropped image patch corresponding to each mask. Nouns or phrases are then extracted as candidate open-vocabulary categories. This process provides more diverse category labels.
  • (III) Final decision module (orange). The SSA engine uses a Class proposal filter (i.e., a CLIP model) to select the top-k most reasonable predictions from the mixed class list. Finally, the Open-vocabulary Segmentor predicts the most suitable category within the mask region based on the top-k classes and the image patch.
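
The flow through the three components can be sketched roughly as below. This is a minimal, library-free illustration and not the project's implementation: the function names (majority_label, merge_proposals, top_k_proposals) and the toy scores are hypothetical. In the real engine the scores would come from a CLIP model, and the final choice among the top-k classes is made by the Open-vocabulary Segmentor, which is not modelled here.

```python
from collections import Counter
from typing import Callable, List, Sequence


def majority_label(label_map: Sequence[Sequence[str]],
                   mask: Sequence[Sequence[bool]]) -> str:
    """Return the most frequent close-set label among pixels inside the mask."""
    votes = Counter(
        label
        for label_row, mask_row in zip(label_map, mask)
        for label, inside in zip(label_row, mask_row)
        if inside
    )
    if not votes:
        raise ValueError("mask is empty")
    return votes.most_common(1)[0][0]


def merge_proposals(*sources: Sequence[str]) -> List[str]:
    """Merge candidate class names from several sources.

    Names are lower-cased and stripped; duplicates are dropped while
    keeping first-seen order.
    """
    seen = set()
    merged = []
    for source in sources:
        for name in source:
            key = name.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


def top_k_proposals(candidates: Sequence[str],
                    score_fn: Callable[[str], float],
                    k: int = 3) -> List[str]:
    """Keep the k highest-scoring candidates (stand-in for the CLIP filter)."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]


# Toy data: a 2x2 close-set prediction and one mask covering three pixels.
label_map = [["person", "person"], ["sky", "person"]]
mask = [[True, True], [False, True]]
coarse = majority_label(label_map, mask)

# Nouns that a captioning model might have produced for the masked patch.
caption_nouns = ["face", "sun glasses", "Person"]
candidates = merge_proposals([coarse], caption_nouns)

# Hypothetical similarity scores standing in for CLIP image-text scores.
toy_scores = {"face": 0.7, "person": 0.2, "sun glasses": 0.1}
top = top_k_proposals(candidates, lambda name: toy_scores.get(name, 0.0), k=2)
```

Keeping the scorer as a plain callable makes it straightforward to swap the toy dictionary for a real image-text similarity model.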

📖 News

🔥 2023/04/10: Semantic Segment Anything is released.
🔥 2023/04/05: SA-1B is released.

Examples

  • Additional examples of open-vocabulary annotations

💻 Requirements

  • Python 3.7+
  • CUDA 11.1+

🛠️ Installation

conda env create -f environment.yaml
conda activate ssa
python -m spacy download en_core_web_sm

🚀 Quick Start

1. Download SA-1B dataset

Download the SA-1B dataset and unzip it to the data/sa_1b folder.

Folder structure:

├── Semantic-Segment-Anything
├── data
│   ├── sa_1b
│   │   ├── sa_223775.jpg
│   │   ├── sa_223775.json
│   │   ├── ...
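
Before launching the engine, it can help to check that every image has a matching annotation file. The sketch below assumes the flat layout shown above, where each .jpg sits next to a .json with the same stem; the helper name find_pairs is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List, Tuple


def find_pairs(data_dir: str) -> List[Tuple[Path, Path]]:
    """Return (image, annotation) pairs that share a file stem.

    Images without a matching .json file are skipped. Results are
    sorted by image file name.
    """
    root = Path(data_dir)
    pairs = []
    for image_path in sorted(root.glob("*.jpg")):
        json_path = image_path.with_suffix(".json")
        if json_path.is_file():
            pairs.append((image_path, json_path))
    return pairs
```

Calling find_pairs("data/sa_1b") would list the pairs found in the folder above; an empty result suggests the archive was unzipped to a different location.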

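Run our semantic annotation engine with 8 GPUs. The command below reads from data/examples; point --data_dir at data/sa_1b to annotate the downloaded SA-1B files: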

python scripts/main.py --data_dir=data/examples --out_dir=output --world_size=8 --save_img

For each mask, we add two new fields (e.g. 'class_name': 'face' and 'class_proposals': ['face', 'person', 'sun glasses']). The class name is the most likely category for the mask, and the class proposals are the top-k most likely categories from the Class proposal filter. By default, k is set to 3.

{
    'bbox': [81, 21, 434, 666],
    'area': 128047,
    'segmentation': {
        'size': [1500, 2250],
        'counts': 'kYg38l[18oeN8mY14aeN5\\Z1>'
    }, 
    'predicted_iou': 0.9704002737998962,
    'point_coords': [[474.71875, 597.3125]],
    'crop_box': [0, 0, 1381, 1006],
    'id': 1229599471,
    'stability_score': 0.9598413705825806,
    'class_name': 'face',
    'class_proposals': ['face', 'person', 'sun glasses']
}
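
To get a quick overview of the labels, one might tally the predicted categories across masks. The sketch below assumes that the output keeps SA-1B's JSON layout, with a top-level "annotations" list whose entries look like the record above; the helper name count_classes and the inline sample are hypothetical.

```python
import json
from collections import Counter
from typing import Any, Dict


def count_classes(annotation_file: Dict[str, Any]) -> Counter:
    """Count how many masks were assigned each class_name.

    Entries without a class_name field (e.g. unlabeled masks) are ignored.
    """
    return Counter(
        ann["class_name"]
        for ann in annotation_file.get("annotations", [])
        if "class_name" in ann
    )


# Inline sample standing in for a file produced by the engine.
sample_text = """
{
  "annotations": [
    {"id": 1, "class_name": "face",
     "class_proposals": ["face", "person", "sun glasses"]},
    {"id": 2, "class_name": "face",
     "class_proposals": ["face", "head", "person"]},
    {"id": 3, "class_name": "person",
     "class_proposals": ["person", "shirt", "face"]},
    {"id": 4}
  ]
}
"""
counts = count_classes(json.loads(sample_text))
most_common_class, n_masks = counts.most_common(1)[0]
```

For a file on disk, the same function can be applied to the result of json.load on the opened file.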

📈 Future work

We hope that excellent researchers in the community can come up with new improvements and ideas to do more work based on SSA. Some of our ideas are as follows:

  • (I) The masks in SA-1B often come at three levels: whole, part, and subpart. SSA often cannot provide accurate descriptions for very small part or subpart regions and falls back to broad categories instead. For example, SSA may predict "person" for body parts such as the neck or a hand. Therefore, an architecture for more detailed semantic prediction is needed.
  • (II) SSA is an ensemble of multiple models, which makes inference slower than with end-to-end models. We look forward to more efficient designs in the future.

😄 Acknowledgement

📜 Citation

If you find this work useful for your research, please cite our GitHub repo:

@misc{chen2023semantic,
    title = {Semantic Segment Anything},
    author = {Chen, Jiaqi and Yang, Zeyu and Zhang, Li},
    howpublished = {\url{https://github.com/fudan-zvg/Semantic-Segment-Anything}},
    year = {2023}
}

