
CLIP_Surgery: Introduction

Changes:

  • Setting e.g. clipmodel = "ViT-L/14@336px" in demo.py now works -> the input resolution is derived automatically via the input_dims variable
  • Small change to clip_model.py to accept this variable
  • In clip.py, SHA256 checksum verification is bypassed -> you can put your fine-tune in place of .cache/clip/<original_model>.pt
  • Includes model.py from OpenAI/CLIP -> provides the model config for fine-tuned torch.save .pt files without an inbuilt model config
  • Plots are saved to disk rather than shown via plt.show()
  • ⚠️ Note: No changes were made to demo.ipynb - use demo.py from the console!
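The automatic input-resolution handling can be sketched like this (a minimal sketch; the actual variable name and parsing logic in demo.py may differ):

```python
import re

def derive_input_dims(clipmodel: str, default: int = 224) -> int:
    """Derive the input resolution from a CLIP model name.

    Names like "ViT-L/14@336px" encode the input size after '@';
    names without a suffix (e.g. "ViT-B/32") use the 224 px default.
    """
    match = re.search(r"@(\d+)px$", clipmodel)
    return int(match.group(1)) if match else default

# derive_input_dims("ViT-L/14@336px") -> 336
# derive_input_dims("ViT-B/32")       -> 224
```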

Advanced:

  • Use runall.py (run "python runall.py --help" for options). It batch-processes images and performs CLIP Surgery fully automatically:
    1. Obtains CLIP's "opinions" via gradient ascent -> the model's own texts (labels) about the images.
    2. Performs CLIP Surgery with whatever CLIP "saw" in the images.
  • ⚠️ You can use large models, but from CLIP ViT-L/14 upward, this requires >24 GB of memory.
  • FUN: After the above, run FUN_word-world-records.py to get a list of CLIP's craziest predicted long words.
  • ℹ️ Requires the original OpenAI/CLIP: "pip install git+https://github.com/openai/CLIP.git"

  • The original CLIP gradient ascent script is used with permission from @advadnoun (Twitter / X).
  • CLIP's 'opinions' may contain biased rants, slurs, and profanity. For more information, refer to the CLIP Model Card.
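The gradient-ascent step can be illustrated with a toy sketch. This is not the actual runall.py code (which optimizes text token embeddings against CLIP's image-text similarity); it only shows the general loop of stepping an input in the direction that increases a score:

```python
import numpy as np

def gradient_ascent(score_grad, x0, lr=0.1, steps=100):
    """Generic gradient-ascent loop: repeatedly move x along the
    gradient of the score to maximize it."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + lr * score_grad(x)
    return x

# Toy score -(x - 3)^2 is maximized at x = 3; its gradient is -2 * (x - 3).
x_opt = gradient_ascent(lambda x: -2.0 * (x - 3.0), x0=[0.0])
```

In the CLIP case, the score is the image-text similarity and the optimized variable is the text embedding, decoded back into tokens to yield CLIP's "opinions" about an image.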

(example image)


ORIGINAL README.md:

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks (arxiv)

Introduction

This work focuses on the explainability of CLIP via its raw predictions. We identify two problems with CLIP's explainability: opposite visualization and noisy activations. We then propose CLIP Surgery, which does not require any fine-tuning or additional supervision. It greatly improves the explainability of CLIP and enhances downstream open-vocabulary tasks such as multi-label recognition, semantic segmentation, interactive segmentation (specifically the Segment Anything Model, SAM), and multimodal visualization. Currently, we offer a simple demo for interpretability analysis and for converting text to point prompts for SAM. The remaining code, including evaluation and other tasks, will be released later.

Opposite visualization is due to wrong relations in self-attention: (figure)

Noisy activations are caused by redundant features across labels: (figure)

Our visualization results: (figure)

Text2Points to guide SAM: (figure)

Multimodal visualization: (figure)

Segmentation results: (figure)

Multilabel results: (figure)

Demo

First, install SAM and download the model checkpoint:

pip install git+https://github.com/facebookresearch/segment-anything.git
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Then explain CLIP via the Jupyter demo "demo.ipynb", or use the Python file:

python demo.py

(Note: the demo's results differ slightly from the experimental code; specifically, apex amp fp16 is omitted for ease of use.)
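The text-to-points step for SAM can be sketched as follows. This is a simplified heuristic, not the repo's exact selection code: it picks the highest-scoring locations of a CLIP Surgery similarity map as foreground point prompts in the (x, y) format SAM expects:

```python
import numpy as np

def heatmap_to_points(sim_map: np.ndarray, num_points: int = 3):
    """Turn a 2-D similarity map into SAM point prompts by picking
    the top-scoring pixels (simplified; the demo may differ)."""
    flat = sim_map.ravel()
    top = np.argsort(flat)[-num_points:][::-1]   # indices of highest scores
    ys, xs = np.unravel_index(top, sim_map.shape)
    coords = np.stack([xs, ys], axis=1)          # SAM expects (x, y) order
    labels = np.ones(len(coords), dtype=int)     # 1 = foreground point
    return coords, labels
```

These prompts would then be fed to SAM's predictor, e.g. `masks, scores, _ = predictor.predict(point_coords=coords, point_labels=labels)`.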

Cite

@misc{li2023clip,
      title={CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks}, 
      author={Yi Li and Hualiang Wang and Yiqun Duan and Xiaomeng Li},
      year={2023},
      eprint={2304.05653},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

