
MiniClip: A quick demo to explore text descriptions and saliency maps for CLIP models

GitHub: https://github.com/HendrikStrobelt/miniClip

This demo app uses the OpenAI CLIP (ResNet50) model. You can upload an image and test your text descriptions to observe:

  1. How similar are the embeddings of your descriptions to the image embedding under the CLIP model?
  2. What does the saliency map look like when the text embeddings of your descriptions are used as logits for the image part of CLIP?
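For question 1, the underlying similarity computation can be sketched in a few lines with OpenAI's clip package. This is a minimal sketch rather than the demo's actual code; the image path and descriptions are placeholders:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # the ResNet50 variant used here

image = preprocess(Image.open("room.jpg")).unsqueeze(0).to(device)
tokens = clip.tokenize(["a barrel", "a chair", "some clothes"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(tokens)

# cosine similarity between image and text embeddings, softmaxed like a classifier
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
logits = 100.0 * image_features @ text_features.T
print(logits.softmax(dim=-1))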

Playing around with the demo yields several observations that make the CLIP model more tangible.
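Question 2, the saliency map, can be approximated with Grad-CAM: treat the image-text similarity as a scalar score and backpropagate it to the last convolutional stage. The demo itself uses torchray; the following is a plain-PyTorch sketch, under the assumption that model.visual.layer4 is a suitable saliency layer for CLIP's modified ResNet50:

import torch
import clip
from PIL import Image

model, preprocess = clip.load("RN50", device="cpu")  # fp32 on CPU simplifies backward
model.float()

image = preprocess(Image.open("room.jpg")).unsqueeze(0).requires_grad_(True)
tokens = clip.tokenize(["a barrel"])

feats = {}

def save_act(module, inputs, output):
    # keep the activation and register a hook to catch its gradient
    feats["act"] = output
    output.register_hook(lambda grad: feats.update(grad=grad))

layer = model.visual.layer4  # last conv stage of CLIP's modified ResNet50
handle = layer.register_forward_hook(save_act)

score = torch.cosine_similarity(model.encode_image(image),
                                model.encode_text(tokens)).sum()
score.backward()
handle.remove()

# Grad-CAM: weight channels by their mean gradient, sum, ReLU, normalize
weights = feats["grad"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((weights * feats["act"]).sum(dim=1)).squeeze(0)
cam = cam / cam.max()  # a 7x7 map; upsample onto the image for display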

Table of Contents

  • Insight: Descriptions are not labels
  • Insight: A prime/prefix can influence results
  • Insight: Text adversaries are dominant
  • It's fun. Try it for yourself
  • Author, Cite, and Thanks
  • Licenses

Insight: Descriptions are not labels

Recently, zero-shot classification has been established as a use case for CLIP. Here we want to make the case that once you control the description text, it becomes apparent how closely CLIP is related to language models. Synonyms or properties of objects produce similar responses in the text embeddings, which results in similar probabilities against the same image embedding. The example below shows how "wood" and "barrel" seem to trigger a similar response - not only in logits/softmax but also in saliency. The room below is full of items that can all be identified by saliency maps:
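This proximity can be checked directly in text-embedding space, without involving an image at all. A minimal sketch; the three descriptions are hypothetical inputs, not taken from the repo:

import torch
import clip

model, _ = clip.load("RN50", device="cpu")
tokens = clip.tokenize(["wood", "barrel", "a dog"])

with torch.no_grad():
    emb = model.encode_text(tokens)
emb = emb / emb.norm(dim=-1, keepdim=True)

# pairwise cosine similarities; "wood" vs. "barrel" should score
# noticeably higher than either does against "a dog"
print(emb @ emb.T)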

Another example shows how well saliency maps can capture the objects of interest. Even the abstract description "covid 19 protection" seems to point towards the mask of the person in front.

Insight: A prime/prefix can influence results

Using a prime like "an image of" can change the similarity between close descriptions and the image embedding. The phrase "a barrel and a chair" without the prefix seems slightly more similar to the image embedding. The difference for "some clothes" with and without the prefix is more significant. Independent of the prefix, the saliency maps seem to point to the same objects.
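To reproduce this kind of comparison outside the app, one can score the same image against prefixed and unprefixed descriptions side by side. A sketch; the image path is a placeholder:

import torch
import clip
from PIL import Image

model, preprocess = clip.load("RN50", device="cpu")

image = preprocess(Image.open("room.jpg")).unsqueeze(0)
texts = ["a barrel and a chair", "an image of a barrel and a chair",
         "some clothes", "an image of some clothes"]

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(clip.tokenize(texts))

img_f = img_f / img_f.norm(dim=-1, keepdim=True)
txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)

# print the raw cosine similarity next to each description
for text, sim in zip(texts, (img_f @ txt_f.T).squeeze(0)):
    print(f"{sim.item():.3f}  {text}")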

Insight: Text adversaries are dominant

It seems that text labels on top of an image dominate over the image content. In the following example, we observe a very strong focus on the word "pharao", which is completely outside of the image context. Even small sub-phrases like "ph" or "pha" already guide the saliency strictly towards the text label.

Here is an example that shows how dominant text is - even in the presence of visual objects of the same type in a scene image:
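Such a typographic probe is easy to construct yourself: render a word onto a copy of any photo and feed the result back through the sketches above. The file names here are placeholders:

from PIL import Image, ImageDraw

img = Image.open("scene.jpg").convert("RGB")
draw = ImageDraw.Draw(img)
draw.text((10, 10), "pharao", fill="white")  # the adversarial text label
img.save("scene_with_label.jpg")
# scoring and Grad-CAM on scene_with_label.jpg will typically snap
# the saliency onto the rendered word rather than the scene content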

It's fun. Try it for yourself

Trying to uncover how visual and textual embeddings merge into one amalgamation of modalities can be truly fascinating. Please clone the repo and try the demo on your GPU (or even CPU) machine.

To prepare the environment and install dependencies:

$ git clone https://github.com/HendrikStrobelt/miniClip.git
$ cd miniClip
$ conda create -n miniclip python=3.8
$ conda activate miniclip
$ pip install -r requirements.txt

and then run:

$ streamlit run app.py

Author, Cite, and Thanks

Demo and text were created by Hendrik Strobelt, who is the Explainability Lead at the MIT-IBM Watson AI Lab. Thanks go to David Bau, who helped through great conversations, and to the creators of streamlit and torchray for great open source software.

Please cite if you used the demo to generate novel hypotheses and ideas:

@misc{HenMiniClip2021,
  author = {Strobelt, Hendrik},
  title = {MiniClip},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/HendrikStrobelt/miniClip}}
}

Licenses

MiniClip is released under the Apache 2 license. A list of dependencies can be found here.


