alphaclip's Introduction

Hi there 👋

alphaclip's People

Contributors

aleafy, sunzey

alphaclip's Issues

Poor performance on COCO dataset.

Hi! I tested the performance of Alpha-CLIP on the COCO dataset by feeding the whole image, together with the ground-truth mask of each object, to the model, but the classification accuracy is only around 30%. Is this result normal?
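
For reference, here is a minimal sketch of how such an evaluation could be set up with Alpha-CLIP and ground-truth masks from pycocotools, assuming alpha_clip exposes the same tokenize helper as OpenAI CLIP. The annotation/image paths, checkpoint name, and prompt template are illustrative assumptions, not the repository's official evaluation code:

import alpha_clip
import numpy as np
import torch
from PIL import Image
from pycocotools.coco import COCO
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = alpha_clip.load(
    "ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
model = model.float().to(device)

mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26),
])

coco = COCO("annotations/instances_val2017.json")  # illustrative path
cat_ids = coco.getCatIds()
class_names = [c["name"] for c in coco.loadCats(cat_ids)]
text = alpha_clip.tokenize([f"a photo of a {n}" for n in class_names]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text).float()
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

correct = total = 0
for ann in coco.loadAnns(coco.getAnnIds()):
    img_info = coco.loadImgs([ann["image_id"]])[0]
    img = Image.open(f"val2017/{img_info['file_name']}").convert("RGB")  # illustrative path
    mask = coco.annToMask(ann).astype(np.uint8)  # H x W binary ground-truth mask

    image = preprocess(img).unsqueeze(0).to(device)             # whole image
    alpha = mask_transform(mask * 255).unsqueeze(0).to(device)  # object mask as the alpha map

    with torch.no_grad():
        img_feat = model.visual(image, alpha).float()
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        pred = (img_feat @ text_feat.T).argmax(dim=-1).item()

    correct += int(cat_ids[pred] == ann["category_id"])
    total += 1

print(f"top-1 accuracy over {total} instances: {correct / total:.3f}")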

Annotations of the generated ImageNet

Thank you for your great work!
Will the annotations (masks, captions) of ImageNet generated by the data-generation pipeline proposed in this paper be open-sourced soon? I would appreciate it if you could provide them.

Can you provide an official demo? I have written one.

import alpha_clip
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
alpha_clip_model, alpha_clip_preprocess = alpha_clip.load(
    "ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
alpha_clip_model = alpha_clip_model.float().to(device)

# Alpha-CLIP expects the binary mask normalized with mean 0.5 and std 0.26.
alpha_clip_mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26)
])

img_path = 'assets/maxi_dress.jpg'
mask_path = 'assets/1.png'

raw_img = Image.open(img_path)

# Load the mask (not the image), keep a single channel, and binarize to {0, 1}.
mask = np.array(Image.open(mask_path))
if mask.ndim == 3:
    mask = mask[:, :, 0]
mask = (mask > 0).astype(np.uint8)[:, :, None]  # H x W x 1

alpha_map = alpha_clip_mask_transform(mask * 255).unsqueeze(0).to(device, dtype=torch.float)
image = alpha_clip_preprocess(raw_img).unsqueeze(0).to(device, dtype=torch.float)

image_features = alpha_clip_model.visual(image, alpha_map)  # works: image_features has shape [1, 768]
print(image_features.shape)

ViT-H/14 Model

Will you consider training Alpha-CLIP based on ViT-H/14?

Use a text prompt to get the SAM mask

Great work!

I see the attached figure. Does it mean that using a text prompt effectively generates the SAM mask via Alpha-CLIP?

Text prompt? #4

We believe that directly using the text features from CLIP is not a good choice, because the explainability of CLIP is poor: the text features can match semantically opposite regions and lead to wrong results (see the attached figure).

As shown there, the original approach of deriving the SAM mask exclusively from the text prompt did not produce the desired outcome.

Does Alpha-CLIP have reduced zero-shot ability compared to the original CLIP?

I tested Alpha-CLIP and CLIP (OpenAI) on one image for zero-shot recognition; the test image is maxi_dress.jpg together with the mask 1.png (both attached). The results are below.
Using the same texts: "purple long sleeve dress", "purple sleeveless dress", "a flower"; the second text is the correct one.
clip : tensor([[ 0.0174, 0.9826, 0.0000]], device='cuda:0')
alpha-clip : tensor([[ 0.2263, 0.7735, 0.0003]], device='cuda:0')

Code for CLIP:

import clip
import torch
from PIL import Image

torch.set_printoptions(precision=4, sci_mode=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-L/14', device)

img_path = 'assets/maxi_dress.jpg'
raw_img = Image.open(img_path)
img = preprocess(raw_img).unsqueeze(0).to(device)

text = clip.tokenize(["purple long sleeve dress", "purple sleeveless dress", "a flower"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(img).to(dtype=torch.float)
    text_features = model.encode_text(text).to(dtype=torch.float)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(text_probs)  # tensor([[ 0.0174, 0.9826, 0.0000]], device='cuda:0')
    values, indices = torch.topk(text_probs, 1)

Code for Alpha-CLIP:

import alpha_clip
import clip
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

torch.set_printoptions(precision=4, sci_mode=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
alpha_clip_model, alpha_clip_preprocess = alpha_clip.load(
    "ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
alpha_clip_model = alpha_clip_model.float().to(device)

alpha_clip_mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26)
])

img_path = 'assets/maxi_dress.jpg'
mask_path = 'assets/1.png'

raw_img = Image.open(img_path)

# Load the mask (not the image), keep one channel, and binarize to {0, 1}.
mask = np.array(Image.open(mask_path))
if mask.ndim == 3:
    mask = mask[:, :, 0]
mask = (mask > 0).astype(np.uint8)[:, :, None]  # H x W x 1

alpha_map = alpha_clip_mask_transform(mask * 255).unsqueeze(0).to(device, dtype=torch.float)
image = alpha_clip_preprocess(raw_img).unsqueeze(0).to(device, dtype=torch.float)

# The original CLIP ViT-L/14 text encoder is used for the text side.
model, preprocess = clip.load('ViT-L/14', device)
text = clip.tokenize(["purple long sleeve dress", "purple sleeveless dress", "a flower"]).to(device)

with torch.no_grad():
    image_features = alpha_clip_model.visual(image, alpha_map)  # [1, 768]
    text_features = model.encode_text(text).to(dtype=torch.float)

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(text_probs)  # tensor([[ 0.2263, 0.7735, 0.0003]], device='cuda:0')
    values, indices = torch.topk(text_probs, 1)

Could you release the code for integrating BLIP-2 with Alpha-CLIP?

(Screenshot of the relevant figure attached.)

Amazing paper, I had a pleasant experience reading it.
I have a few doubts about using Alpha-CLIP with BLIP-2: are you using the frozen Alpha-CLIP model as the image encoder, feeding it the mask and the image, and then doing BLIP-2 training? Or, as the screenshot suggests, are you somehow integrating the mask and the image with the Q-Former? I didn't quite understand it.
Moreover, did you compare zero-shot image captioning abilities, similar to BLIP-2, on the LLaVA bench?
Thanks

Alpha-CLIP with LLM demo error

When I run app.py in demo/with_llm, the "send" button does nothing. I have downloaded all the required checkpoints. This is my terminal error (screenshots attached).

Similarity (cosine dist.) calculation

Hi guys,

Great work !!! Appreciate that you publish your code as well.

I have a short question: How can I calculate the image-text similarity as in the normal CLIP?

The image part is pretty clear; I have managed to get img_features, but I can't figure out how to get text_features.

I tried the following:

import alpha_clip
from alpha_clip.simple_tokenizer import SimpleTokenizer
import torch
from torchvision import transforms
import PIL.Image

tokenizer = SimpleTokenizer()
device = "cuda" if torch.cuda.is_available() else "cpu"

alpha_clip_modelname = "ViT-L/14@336px"
alpha_clip_model, alpha_clip_preprocess = alpha_clip.load(alpha_clip_modelname, device='cpu', lora_adapt=False, rank=-1)
alpha_clip_model = alpha_clip_model.float().to(device)
alpha_clip_mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((336, 336)),
    transforms.Normalize(0.5, 0.26)
])


raw_img = PIL.Image.open(img_path)
mask = get_mask()
alpha_map = alpha_clip_mask_transform(mask * 255).to(device, dtype=torch.float)
image = alpha_clip_preprocess(raw_img).to(device, dtype=torch.float)
image_features = alpha_clip_model.visual(image, alpha_map) # This works, I have image_features vector of size [1,768]

# I tried the following to extract the text features:
text_tokenized = tokenizer.encode('my great text')
text_tokenized = torch.Tensor(text_tokenized).type(torch.int).to(device)
text_features = alpha_clip_model.encode_text(text_tokenized)

But then I got the following exception:

Traceback (most recent call last):
  File "/home/brian/miniconda3/envs/vipergpt/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-40-d3f32f6440c5>", line 1, in <module>
    alpha_clip_model.encode_text(z)
  File "/home/brian/code/grounding_ge/third_party/AlphaCLIP/alpha_clip/model.py", line 488, in encode_text
    x = x + self.positional_embedding.type(self.dtype)
RuntimeError: The size of tensor a (2) must match the size of tensor b (77) at non-singleton dimension 0

Note: I see that the pre-trained CLIP text encoder is not usable here, since the embedding size is 768 and not 512 as in CLIP.

Can you help me with this issue? I am sure I am missing something very easy to fix.

Thanks!!!
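
In case it helps anyone hitting the same error: the positional-embedding size mismatch suggests encode_text expects token ids padded to the 77-token context length, not the raw output of SimpleTokenizer.encode. A minimal sketch, continuing from the snippet above and assuming alpha_clip re-exports the same tokenize helper as OpenAI's clip package (otherwise clip.tokenize produces the same padded tensor):

import alpha_clip
import torch

# encode_text expects a [batch, 77] tensor of padded token ids (with SOT/EOT tokens),
# not a bare token-id list from SimpleTokenizer.encode.
text = alpha_clip.tokenize(["my great text"]).to(device)  # shape [1, 77]

with torch.no_grad():
    text_features = alpha_clip_model.encode_text(text).float()  # [1, 768] for ViT-L/14@336px
    img_feat = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = img_feat @ text_features.T  # cosine similarity
print(similarity)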

Captions in GRIT

Thank you for your work.

The captions in the GRIT dataset consist solely of noun phrases like "berries", "person", and so on.
Did you use templates to expand the captions, such as "a photo of a xxx"?
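
For reference, the common prompt-ensembling trick for bare noun labels looks roughly like the sketch below; the template list, the averaging scheme, and the use of alpha_clip.tokenize are illustrative assumptions, not a claim about what was actually used for GRIT:

import alpha_clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A few illustrative templates; OpenAI's CLIP evaluation uses a much longer list.
templates = ["a photo of a {}.", "a close-up photo of a {}.", "a photo of the {}."]
labels = ["berries", "person"]

def encode_label(model, label):
    """Embed one noun label by averaging its text features over all templates."""
    tokens = alpha_clip.tokenize([t.format(label) for t in templates]).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens).float()
        feats = feats / feats.norm(dim=-1, keepdim=True)
    mean = feats.mean(dim=0)
    return mean / mean.norm()

# Usage (with a model loaded as in the snippets above):
# text_bank = torch.stack([encode_label(alpha_clip_model, l) for l in labels])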

Demo error

The demo I deployed locally didn't work; there is no output in the terminal (screenshot attached).

ViT-B/32 Model

Do you have plans to release a model that uses ViT-B/32 embeddings?

Question about Alpha-CLIP combined with LLaVA-7B

Sorry to bother you during your busy time; I am in a hurry to get Alpha-CLIP running with LLaVA-7B.
I followed the instructions here and changed a few things.
The input images and masks are as follows (screenshots attached), and the rewritten forward is as follows (screenshot attached).
But the generated text doesn't seem to focus on the specified area of concern, and I am wondering why. Can you give me some useful advice?
Thank you

Encoding Images with Alpha Channel?

Hello, my dataset is composed entirely of images with an alpha channel. My question is whether encoding these would already work in a way where only the non-transparent region (the point of interest) is taken into account.
Currently I'm using CLIP, but the results are bad when working with an alpha channel.
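
If it helps, a minimal sketch of one way to do this with Alpha-CLIP: split the alpha channel off the RGBA image and feed it as the alpha map, while the RGB part goes through the normal preprocess. The checkpoint path, file name, and normalization constants follow the snippets in the other issues and are assumptions:

import alpha_clip
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = alpha_clip.load(
    "ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
model = model.float().to(device)

mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26),
])

rgba = Image.open("example.png").convert("RGBA")        # hypothetical input file
alpha = (np.array(rgba)[:, :, 3] > 0).astype(np.uint8)  # 1 where the pixel is opaque

image = preprocess(rgba.convert("RGB")).unsqueeze(0).to(device)
alpha_map = mask_transform(alpha * 255).unsqueeze(0).to(device, dtype=torch.float)

with torch.no_grad():
    image_features = model.visual(image, alpha_map)  # features focused on the opaque region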

Masked image for classification

Thanks for your great work.

Given the mask of interest and the image, the intuitive idea is to mask out the unrelated regions of the image before classification, which should already be useful. Have you tried this?
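
For comparison, the masked-image baseline described above can be sketched like this with plain CLIP (file names and the black fill value are illustrative assumptions):

import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device)

img = np.array(Image.open("assets/maxi_dress.jpg").convert("RGB"))
mask = np.array(Image.open("assets/1.png"))[:, :, 0] > 0  # H x W boolean mask of interest

# Black out the unrelated regions before classification.
masked = (img * mask[:, :, None]).astype(np.uint8)
masked_img = Image.fromarray(masked)

with torch.no_grad():
    image_features = model.encode_image(preprocess(masked_img).unsqueeze(0).to(device))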

Model Release

Great job! By the way, do you have a timeline for when Alpha-CLIP with LLaVA and the OpenCLIP-based model will be released?

Request for Alpha-CLIP with LLaVA Web Demo and Local Demo

I would like to request the addition of a web demo and instructions for a local demo for Alpha-CLIP with LLaVA. Having both options would greatly enhance accessibility and usability for users interested in exploring Alpha-CLIP. Thank you!

Question: Can you provide some guidance for fine-tuning an MLLM with the Alpha-CLIP vision encoder?

Fine-tuning LLaVA-1.5 with Alpha-CLIP is mentioned in Section 4.2 of the paper (Alpha-CLIP in MLLM). I wonder about the detailed settings. Do you train the model by just following stage 1 and stage 2 of the GPT4RoI paper?

And if I want to train a general version of an Alpha-CLIP MLLM, can you provide some guidance for fine-tuning? Should we completely follow the settings of LLaVA-1.5 and train the MLP and LLM from scratch, or should we just fine-tune the MLP layer on part of the data to align the original CLIP space and the Alpha-CLIP space?
