hi-sam's Introduction

Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

[arXiv preprint]

This is the official repository for Hi-SAM, a unified hierarchical text segmentation model. Refer to our paper for more details.

✨ Highlight

[Figure: overview]

  • Hierarchical Text Segmentation. Hi-SAM unifies text segmentation across stroke, word, text-line, and paragraph levels, and achieves layout analysis as a by-product.

  • Automatic and Interactive. Hi-SAM supports both automatic mask generation and interactive promptable mode. Given a single-point prompt, Hi-SAM provides word, text-line, and paragraph masks.

  • High-Quality Text Stroke Segmentation & Stroke Labeling Assistant. Hi-SAM achieves high-quality text stroke segmentation by introducing a mask feature at 1024×1024 resolution with minimal modification to SAM's original mask decoder. Our contributed stroke-level annotations for HierText can be downloaded following data_preparation.md. Some examples are displayed here:

[Figure: examples]

💡 Overview of Hi-SAM

[Figure: Hi-SAM overview]

🔥 News

  • [2024/03/24]: Added Efficient Hi-SAM-S, leveraging EfficientSAM.
  • [2024/02/23]: Inference and evaluation codes are released. Checkpoints are available. Some applications are provided.

🛠️ Install

Recommended environment: Linux, Python 3.8, PyTorch 1.10, CUDA 11.1

conda create --name hi_sam python=3.8 -y
conda activate hi_sam
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
git clone https://github.com/ymy-k/Hi-SAM.git
cd Hi-SAM
pip install -r requirements.txt

📌 Checkpoints

You can download the following model weights and put them in pretrained_checkpoint/.

  • SAM-TSS (only for text stroke segmentation)

| Model | Used Dataset | Weights | fgIOU | F-score |
|:---:|:---:|:---:|:---:|:---:|
| SAM-TSS-B | Total-Text | OneDrive | 80.93 | 86.25 |
| SAM-TSS-L | Total-Text | OneDrive | 84.59 | 88.69 |
| SAM-TSS-H | Total-Text | OneDrive | 84.86 | 89.68 |

| Model | Used Dataset | Weights | fgIOU | F-score |
|:---:|:---:|:---:|:---:|:---:|
| SAM-TSS-B | TextSeg | OneDrive | 87.15 | 92.81 |
| SAM-TSS-L | TextSeg | OneDrive | 88.77 | 93.79 |
| SAM-TSS-H | TextSeg | OneDrive | 88.96 | 93.87 |

| Model | Used Dataset | Weights | fgIOU | F-score |
|:---:|:---:|:---:|:---:|:---:|
| SAM-TSS-B | HierText | OneDrive | 73.39 | 81.34 |
| SAM-TSS-L | HierText | OneDrive | 78.37 | 84.99 |
| SAM-TSS-H | HierText | OneDrive | 79.27 | 85.63 |
  • Hi-SAM

| Model | Used Dataset | Weights | Stroke F-score | Word F-score | Text-Line F-score | Paragraph F-score |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Efficient Hi-SAM-S | HierText | OneDrive | 75.60 | waiting results | | |
| Hi-SAM-B | HierText | OneDrive | 79.78 | 78.34 | 82.15 | 71.15 |
| Hi-SAM-L | HierText | OneDrive | 82.90 | 81.83 | 84.85 | 74.49 |
| Hi-SAM-H | HierText | OneDrive | 83.36 | 82.86 | 85.30 | 75.97 |

The results of Hi-SAM on the test set are reported here.

Note: for faster downloading and to save storage, the above checkpoints do not contain the parameters of SAM's ViT image encoder. Please follow segment-anything to obtain sam_vit_b_01ec64.pth, sam_vit_l_0b3195.pth, and sam_vit_h_4b8939.pth, and put them in pretrained_checkpoint/ so that the frozen ViT image encoder parameters can be loaded.
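
For reference, loading such a split checkpoint amounts to merging two state dicts before calling load_state_dict. A minimal sketch, assuming both files are plain PyTorch state dicts and that Hi-SAM reuses SAM's image_encoder. key prefix (the repository's own loading code is authoritative):

```python
import torch

# Hi-SAM checkpoint, shipped without the ViT image encoder parameters.
hisam_sd = torch.load("pretrained_checkpoint/hi_sam_l.pth", map_location="cpu")
# Official SAM checkpoint, used only for the frozen image encoder.
sam_sd = torch.load("pretrained_checkpoint/sam_vit_l_0b3195.pth", map_location="cpu")

# Copy the encoder weights over; the "image_encoder." prefix is SAM's
# naming and is assumed here to match Hi-SAM's module layout.
for key, value in sam_sd.items():
    if key.startswith("image_encoder."):
        hisam_sd[key] = value

# model.load_state_dict(hisam_sd)  # `model` built by the repository's code
```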

▶️ Usage

1. Visualization Demo

1.1 Text stroke segmentation (for SAM-TSS & Hi-SAM):

python demo_hisam.py --checkpoint pretrained_checkpoint/sam_tss_l_hiertext.pth --model-type vit_l --input demo/2e0cb33320757201.jpg --output demo/
  • --checkpoint: the model path.
  • --model-type: the backbone type. Use vit_b for the ViT-Base backbone, vit_l for ViT-Large, vit_h for ViT-Huge, and vit_s for ViT-Small (Efficient Hi-SAM-S).
  • --input: the input image path.
  • --output: the output image path or folder.

To obtain better quality on small text using sliding-window inference, run the following script:

python demo_hisam.py --checkpoint pretrained_checkpoint/sam_tss_l_hiertext.pth --model-type vit_l --input demo/2e0cb33320757201.jpg --output demo/2e0cb33320757201_sliding.png --patch_mode
  • --patch_mode: enables sliding-window inference (sketched below). The default patch size is 512×512 and the stride is 384 (chosen for HierText); you can adjust these settings for better results on your data.
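
For intuition, sliding-window inference tiles the image into overlapping patches, predicts each patch, and averages the predictions where patches overlap. A minimal sketch of the tiling logic, independent of the repository's implementation (predict_patch is a hypothetical per-patch inference function returning a stroke probability map in [0, 1]):

```python
import numpy as np

def patch_starts(size, patch, stride):
    # Start offsets covering the full extent, including the far edge.
    if size <= patch:
        return [0]
    starts = list(range(0, size - patch, stride))
    starts.append(size - patch)
    return starts

def sliding_window_predict(image, predict_patch, patch=512, stride=384):
    """Tile `image` (H, W, 3), run `predict_patch` on each tile, and
    average the probability maps over the overlapping regions."""
    h, w = image.shape[:2]
    prob = np.zeros((h, w), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for y in patch_starts(h, patch, stride):
        for x in patch_starts(w, patch, stride):
            y1, x1 = min(y + patch, h), min(x + patch, w)
            prob[y:y1, x:x1] += predict_patch(image[y:y1, x:x1])
            count[y:y1, x:x1] += 1.0
    return prob / np.maximum(count, 1.0)
```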

1.2 Word, Text-line, and Paragraph Segmentation (for Hi-SAM)

Run the following script for promptable segmentation on demo/img293.jpg:

python demo_hisam.py --checkpoint pretrained_checkpoint/hi_sam_l.pth --model-type vit_l --input demo/img293.jpg --output demo/ --hier_det
  • --hier_det: enables hierarchical segmentation. Hi-SAM predicts a word mask, a text-line mask, and a paragraph mask for each single-point prompt; the prompt format is sketched below, and demo_hisam.py contains the point position and details.
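
For reference, the single-point prompt follows SAM's convention of pixel-coordinate points with foreground labels; a small illustrative sketch (the coordinates are hypothetical, and the actual interface lives in demo_hisam.py):

```python
import numpy as np

# One foreground click in (x, y) pixel coordinates; label 1 marks foreground,
# following SAM's prompt convention.
point_coords = np.array([[125.0, 275.0]])  # hypothetical click position
point_labels = np.array([1])
# demo_hisam.py feeds such a point to Hi-SAM, which returns a word mask,
# a text-line mask, and a paragraph mask for that single click.
```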

2. Evaluation

Please follow data_preparation.md to prepare the datasets first.

2.1 Text Stroke Segmentation (for SAM-TSS & Hi-SAM)

To evaluate only the performance of the text stroke segmentation part, run the following script:

python -m torch.distributed.launch --nproc_per_node=8 train.py --checkpoint <saved_model_path> --model-type <select_vit_type> --val_datasets hiertext_test --eval
  • --nproc_per_node: you can use multiple GPUs for faster evaluation.
  • --val_datasets: the selected dataset for evaluation. Use totaltext_test for evaluation on the test set of Total-Text, textseg_test for the test set of TextSeg, hiertext_test for the test set of HierText.
  • --eval: use evaluation mode.

If you want to evaluate the performance on HierText with sliding window inference, run the following scripts:

mkdir img_eval
python demo_hisam.py --checkpoint <saved_model_path> --model-type <select_vit_type> --input datasets/HierText/test/ --output img_eval/ --patch_mode
python eval_img.py

Sliding-window inference takes a relatively long time. For faster inference, you can divide the test images into multiple folders and run inference on each folder with a separate GPU, for example as sketched below.
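
One simple way to shard a test set across GPUs is to copy the images into per-GPU subfolders; a hypothetical helper (not part of the repository):

```python
import os
import shutil

def shard_folder(src, dst_root, num_shards):
    """Copy the images in `src` into `num_shards` subfolders so each
    folder can be processed by a separate GPU."""
    for i, name in enumerate(sorted(os.listdir(src))):
        dst = os.path.join(dst_root, f"shard_{i % num_shards}")
        os.makedirs(dst, exist_ok=True)
        shutil.copy(os.path.join(src, name), dst)

# shard_folder("datasets/HierText/test", "img_shards", num_shards=4)
```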

2.2 Hierarchical Text Segmentation (for Hi-SAM)

For stroke-level performance, please follow Section 2.1. For word, text-line, and paragraph-level performance on HierText, please follow the steps below.

Step 1: run the following scripts to get the required jsonl file:

python demo_amg.py --checkpoint <saved_model_path> --model-type <select_vit_type> --input datasets/HierText/test/ --total_points 1500 --batch_points 100 --eval
cd hiertext_eval
python collect_results.py --saved_name res_1500pts.jsonl

For faster inference, you can divide the test or validation images into multiple folders and run inference on each folder with a separate GPU (see the sharding sketch in Section 2.1).

  • --input: the test or validation image folder of HierText.
  • --total_points: the number of foreground points per image; 1500 is the default setting.
  • --batch_points: the number of points processed by the H-Decoder per batch; adjust it according to your GPU memory.
  • --eval: use evaluation mode. For each image, the prediction results are saved as a jsonl file in the folder hiertext_eval/res_per_img/.
  • --saved_name: the name of the merged jsonl file used for website submission or offline evaluation; the per-image jsonl files are merged into this single file, conceptually as in the sketch below.
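
Conceptually, the merge performed by collect_results.py is a concatenation of the per-image JSON lines; a minimal sketch under that assumption, with paths taken from the flags above:

```python
import glob

# Merge the per-image .jsonl predictions into one file for submission
# or offline evaluation.
with open("res_1500pts.jsonl", "w") as out:
    for path in sorted(glob.glob("res_per_img/*.jsonl")):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    out.write(line + "\n")
```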

Step 2: if you ran inference on the test set of HierText, submit the final jsonl file to the official website to obtain the evaluation metrics. If you ran inference on the validation set: (1) follow the HierText repo to download the validation ground truth validation.jsonl and put it in hiertext_eval/gt/; (2) run the following script, borrowed from the HierText repo, to compute the evaluation metrics:

python eval.py --gt=gt/validation.jsonl --result=res_1500pts.jsonl --output=score.txt --mask_stride=1 --eval_lines --eval_paragraphs
cd ..

The evaluation process takes about 20 minutes. The evaluation metrics are saved to the txt file specified by --output.

👁️ Applications

1. Promptable Multi-granularity Text Erasing and Inpainting

Hi-SAM can be combined with Stable-Diffusion-inpainting for interactive text erasing and inpainting: click a single point to erase and inpaint a word, text-line, or paragraph. See this project for an implementation that combines Hi-SAM and Stable-Diffusion.
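
A minimal sketch of the inpainting side using the diffusers library; the model id, file names, and prompt are assumptions, and the mask is whatever word, text-line, or paragraph mask Hi-SAM produced for the clicked point (see the linked project for a complete integration):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed inpainting checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("demo/img293.jpg").convert("RGB").resize((512, 512))
# White where Hi-SAM's mask marks text to erase, black elsewhere.
mask = Image.open("demo/img293_mask.png").convert("L").resize((512, 512))

# A background-describing prompt steers the model toward erasing the text.
result = pipe(prompt="clean background", image=image, mask_image=mask).images[0]
result.save("demo/img293_erased.png")
```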

2. Text Detection

Word-level or text-line-level text detection only. Hi-SAM directly segments the intact text instance region instead of a shrunk text kernel region.

[Figure: detection examples]

Two demo models are provided here: word_detection_totaltext.pth (trained on Total-Text, only for word detection) and line_detection_ctw1500.pth (trained on CTW1500, only for text-line detection). Put them in pretrained_checkpoint/. Then, for example, run the following script for word detection (only for the detection demo on Total-Text):

python demo_text_detection.py --checkpoint pretrained_checkpoint/word_detection_totaltext.pth --model-type vit_h --input demo/img643.jpg --output demo/ --dataset totaltext

For text-line detection (only for the detection demo on CTW1500):

python demo_text_detection.py --checkpoint pretrained_checkpoint/line_detection_ctw1500.pth --model-type vit_h --input demo/1165.jpg --output demo/ --dataset ctw1500

3. Promptable Scene Text Spotting

Hi-SAM can be combined with a single-point scene text spotter such as SPTSv2. SPTSv2 recognizes scene text but predicts only a single-point position per instance; providing that point as a prompt to Hi-SAM yields the intact text mask. Some demo figures are provided below; the green stars indicate the point prompts. The masks are generated by the word detection model from Section 2 (Text Detection).

[Figure: spotting examples]

🏷️ TODO

  • Release inference and evaluation codes.
  • Release model weights.
  • Release Efficient Hi-SAM.
  • Release training codes.
  • Release online demo.

💗 Acknowledgement

✒️ Citation

If you find Hi-SAM helpful in your research, please consider giving this repository a ⭐ and citing:

@article{ye2024hi-sam,
  title={Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation},
  author={Ye, Maoyuan and Zhang, Jing and Liu, Juhua and Liu, Chenyu and Yin, Baocai and Liu, Cong and Du, Bo and Tao, Dacheng},
  journal={arXiv preprint arXiv:2401.17904},
  year={2024}
}


hi-sam's Issues

Are there any plans for performance optimization?

This is very cool work. I had long wanted to strengthen SAM's text segmentation ability without success, and you have done it!

I use Hi-SAM to extract lyrics from music videos; although some background pixels still interfere, it beats my previous methods overall.

The only issue is that inference is currently somewhat slow, which is probably SAM's fault. There is already a lot of work on speeding up SAM inference, and I look forward to Hi-SAM becoming a faster Hi-SAM!

A Hi-SAM with fast enough inference could become a production-ready, powerful foundation component!

File "vit_h_maskdecoder.pth" is missing

I tried to run the Hi-SAM visualization demo of hierarchical segmentation using the following command:

python demo_hisam.py --checkpoint pretrained_checkpoint/hi_sam_l.pth --model-type vit_l --input demo/img293.jpg --output demo/ --hier_det

but got the error
FileNotFoundError: [Errno 2] No such file or directory: 'pretrained_checkpoint/vit_h_maskdecoder.pth'

I did not find any reference to this file in the README. Do you provide this file for download anywhere?

Poor text-line detection quality

Hi, first of all, nice job.
I found that the quality of the text-line detection model is poor. To be more precise, many lines are simply not segmented.

To reproduce:
python demo_text_detection.py --checkpoint pretrained_checkpoint/line_detection_ctw1500.pth --model-type vit_h --input demo/1.jpg --output demo/ --dataset ctw1500

[Figures: input images and detection results]

How is the output token in Fig. 3 obtained?

Thanks for your splendid work, but I didn't see any description of how the output token before the S-Decoder and the output tokens before the H-Decoder are obtained. The paper says "Let t^s_out ∈ R^{1×256} denote the inherited output token, which is the first slice of SAM's output tokens," but it is still a little ambiguous.
