sunzey / alphaclip Goto Github PK

View Code? Open in Web Editor NEW

509.0 10.0 27.0 173.46 MB

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Home Page: https://aleafy.github.io/alpha-clip

License: Apache License 2.0

Python 5.68% Jupyter Notebook 94.19% Shell 0.14%

deep-learning machine-learning vision-language vision-language-model vision-transformer vision-and-language

alphaclip's Introduction

Hi there 👋

alphaclip's People

Contributors

Stargazers

Watchers

alphaclip's Issues

Poor performance on COCO dataset.

Hi! I test the performance of Alpha-CLIP on the COCO datasets by feeding the whole image with the ground truth mask corresponding to each object to the model, but the classification accuracy is only around 30%, is the result normal?

Table 6: Performance of Alpha-CLIP in region level captioning

Great work!
I am confused with Tab .6 result, the performance is Alpha-CLIP with LLaVA-1.5 or fine-tune this model with vicuna-7b on these datasets(RefCOCOg or VG)？

Fail to download clip_l14_grit+mim_fultune_6xe.pth

Annotations of the generated Imagenet

Thank you for your great work!
Will the annotations (mask, caption) of ImageNet generated by the data generation pipeline proposed in this paper be open source soon? I would appreciate it if you could provide it.

Can you provide an offical demo? I have write one.

`import alpha_clip
from alpha_clip.simple_tokenizer import SimpleTokenizer
import torch
from torchvision import transforms

from PIL import Image
import clip
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
alpha_clip_modelname = "ViT-L/14"
alpha_clip_model, alpha_clip_preprocess = alpha_clip.load("ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
alpha_clip_model = alpha_clip_model.float().to(device)

alpha_clip_mask_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Resize((224, 224)),
transforms.Normalize(0.5, 0.26)
])

img_path = 'assets/maxi_dress.jpg'
mask_path = 'assets/1.png'

raw_img = Image.open(img_path)
mask = Image.open(img_path)
mask = np.array(mask)
mask = torch.tensor(mask)
mask = mask[:,:,0]
mask = mask.unsqueeze(2)
#print(mask.shape)
mask = np.array(mask)

#print(mask.shape)
alpha_map = alpha_clip_mask_transform(mask * 255).to(device,dtype=torch.float).unsqueeze(0)
#print(alpha_map.shape)
#alpha_map = alpha_map[0,:,:]
#alpha_map = alpha_map.unsqueeze(0)
#print(alpha_map.shape)
image = alpha_clip_preprocess(raw_img).unsqueeze(0).to(device, dtype=torch.float)
#print(image.shape)
image_features = alpha_clip_model.visual(image, alpha_map) # This works, I have image_features vector of size [1,768]
print(image_features.shape)`

ViT-H/14 Model

will you consider training alpha-clip based on ViT-H/14?

when will release alphaclip with ViT-H/14

Do you have plans to release the training code based on openclip?

use Text prompt to get SAM mask

great work!

I see the graph

Does it mean that using a text prompt effectively generates the SAM mask by AlphaCLIP ?

Text prompt? #4

We believe directly using the text from text features from CLIP is not good. Because the explainability of CLIP is bad. Using the text features matched with opposite semantic regions lead to wrong results.

As it shown the original approach of exclusively deriving the SAM mask from the text prompt did not produce desired outcomes.

"502 Bad Gateway" Error in Demo

"502 Bad Gateway" Error in Huggingface Demo and OpenXLab Demo

Alpha clip has reduced zero shooting ability compared to the original clip？

I test the alpha-clip and clip(openai) ability in one image for zero-shot recognition.The results are below:

using same text as : purple long sleeve dress", "purple sleeveless dress", "a flower", the second text is right.
clip : tensor([[ 0.0174, 0.9826, 0.0000]], device='cuda:0')
alpha-clip : tensor([[ 0.2263, 0.7735, 0.0003]], device='cuda:0')

code:
clip:

import clip

import torch
from PIL import Image
`torch.set_printoptions(precision=4,sci_mode=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-L/14', device)

img_path = 'assets/maxi_dress.jpg'
raw_img = Image.open(img_path)
img = preprocess(raw_img).unsqueeze(0).to(device)
image_features = model.encode_image(img).to(dtype=torch.float)
text = clip.tokenize(["purple long sleeve dress", "purple sleeveless dress", "a flower"]).to(device)
text_features = model.encode_text(text).to(dtype=torch.float)

with torch.no_grad():
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(text_probs)#tensor([[    0.0174,     0.9826,     0.0000]], device='cuda:0')
    values, indices = torch.topk(text_probs, 1)

alpha-clip:

import alpha_clip
from alpha_clip.simple_tokenizer import SimpleTokenizer
import torch
from torchvision import transforms
torch.set_printoptions(precision=4,sci_mode=False)
from PIL import Image
import clip
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
alpha_clip_modelname = "ViT-L/14"
alpha_clip_model, alpha_clip_preprocess = alpha_clip.load("ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
alpha_clip_model = alpha_clip_model.float().to(device)

alpha_clip_mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26)
])

img_path = 'assets/maxi_dress.jpg'
mask_path = 'assets/1.png'

raw_img = Image.open(img_path)
mask = Image.open(img_path)
mask = np.array(mask)
mask = torch.tensor(mask)
mask = mask[:,:,0]
mask = mask.unsqueeze(2)
#print(mask.shape)
mask = np.array(mask)

#print(mask.shape)
alpha_map = alpha_clip_mask_transform(mask * 255).to(device,dtype=torch.float).unsqueeze(0)
#print(alpha_map.shape)
#alpha_map = alpha_map[0,:,:]
#alpha_map = alpha_map.unsqueeze(0)
#print(alpha_map.shape)
image = alpha_clip_preprocess(raw_img).unsqueeze(0).to(device, dtype=torch.float)
#print(image.shape)
image_features = alpha_clip_model.visual(image, alpha_map) # This works, I have image_features vector of size [1,768]
print(image_features.shape)
print(image_features.type())
model, preprocess = clip.load('ViT-L/14', device)
# I tried the follow to extract the text features:
#text_tokenized = tokenizer.encode('my great text')
#text_tokenized = torch.Tensor(text_tokenized ).type(torch.int).to(device)
text = clip.tokenize(["purple long sleeve dress", "purple sleeveless dress", "a flower"]).to(device)
text_features = model.encode_text(text).to(dtype=torch.float)
print(text_features.shape)
print(text_features.type())
with torch.no_grad():
    #image_features = model.encode_image(img)  # Input tensor should have shape (b,c,h,w)
    #text_features = model.encode_text(tokenized_sleevelength_prompt)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(text_probs)#tensor([[    0.2263,     0.7735,     0.0003]], device='cuda:0')
    values, indices = torch.topk(text_probs, 1)

What data enhancements were used in AlphaCLIP?

How to perform geometric transformations on mask and image simultaneously/

Could you release the code of integrating blip2 with alpha clip?

Amazing paper , had a pleasant experience reading it.
So had a few doubts for using alpha clip with BLIP 2 are u using the frozen alpha clip model as the image encoder and then sending the mask and the image and then doing blip2 training or as seen in the image u are somehow integrating mask with qformer and image with qformer , i didnt quite understand it.
Moreover did u compare zero shot image captioning ablites similar to BLIP2 with the llava bench ?
Thanks

AhphaCLIP with llm Demo error

when I run the app.py in demo/with_llm, I find the "send" button is useless. I have download all the required checkpoints. This is my terminal error:

Similarity (cosine dist.) calculation

Hi guys,

Great work !!! Appreciate that you publish your code as well.

I have a short question: How can I calculate the image-text similarity as in the normal CLIP?

The image part is pretty clear, I have managed to get img_features but I can't figure out how to get text_features.

I tried the following:

import alpha_clip
from alpha_clip.simple_tokenizer import SimpleTokenizer
tokenizer = SimpleTokenizer()

alpha_clip_modelname = "ViT-L/14@336px"
alpha_clip_model, alpha_clip_preprocess = alpha_clip.load(alpha_clip_modelname, device='cpu', lora_adapt=False, rank=-1)
alpha_clip_model = alpha_clip_model.float().to(device)
alpha_clip_mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((336, 336)),
    transforms.Normalize(0.5, 0.26)
])


raw_img = PIL.Image.open(img_path)
mask = get_mask()
alpha_map = alpha_clip_mask_transform(mask * 255).to(device, dtype=torch.float)
image = alpha_clip_preprocess(raw_img).to(device, dtype=torch.float)
image_features = alpha_clip_model.visual(image, alpha_map) # This works, I have image_features vector of size [1,768]

# I tried the follow to extract the text features:
text_tokenized = tokenizer.encode('my great text')
text_tokenized = torch.Tensor(text_tokenized ).type(torch.int).to(device)
text_features = alpha_clip_model.encode_text(text_tokenized)

But then I got the following exception:

Traceback (most recent call last):
  File "/home/brian/miniconda3/envs/vipergpt/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-40-d3f32f6440c5>", line 1, in <module>
    alpha_clip_model.encode_text(z)
  File "/home/brian/code/grounding_ge/third_party/AlphaCLIP/alpha_clip/model.py", line 488, in encode_text
    x = x + self.positional_embedding.type(self.dtype)
RuntimeError: The size of tensor a (2) must match the size of tensor b (77) at non-singleton dimension 0

Note: I see that the pre-trained CLIP text encoder is not useful as the embedding size is 768 and not 512 as in CLIP

Can you help me with this issue? I am sure I am missing something very easy to fix.

Thanks!!!

Captions in GRIT

Thank you for your work.

Considering the captions in the GRIT dataset consist solely of noun words like berries, person ...
Did you use Templates to expand the captions, such as "a photo of a xxx"?

can you provided the mask of Imagenet ？

Demo error

Demo i deployed locally didn't work. There is no output in the terminal.

ViT-B/32 Model

Do you have plans to release a model that uses ViT-B/32 embeddings?

Web demo of Alpha-CLIP with Stable Diffusion doesn't work?

The magic number of 1.9231 and 6

Dear author @SunzeY,

Thanks a lot for this great work! By going through the code, may I ask what is the role of the magic number 6 and 1.9231 when constructing the mask?

AlphaCLIP/demo/with_llm/app.py

Line 151 in 3e7cdd7

mask_torch = (masks > 6) * 1.9231 + (masks <= 6) * (-1.9231)

for one image，regardless of how the alpha channel is modified，feature similarity is consistently above 0.97 （even between mask=0 and mask=1)

question about the alpha-clip combined with LLaVA-7b

Sorry to bother you in your busy time and i am hurry to cary out alpha-clip with LLaVA-7b-clip.
I followed the instructions in here and changed something.
The input images and masks are as follows:

and the rewrited forward are as follows:

But the generated text doesn't seems to be a specific area of concern and i am wondering why. can you give me some useful advice?
Thank you

Encoding Images with Alpha Channel?

Hello, my Dataset is completely composed of Images with an alpha channel, my question is if encoding those would already do it in a fashion where only the non-alpha region (the point of interest) is being regarded?
Currently im using CLIP but the results are bad when working with an alpha channel.

Do you consider trying Alpha-DINOv2?

DINOv2 seems powerful in some applications like CLIP, maybe a similar Alpha-DINOv2 is very useful too

Masked image for classification

Thanks for your great work.

Given the mask of interest and the image, the intuitive idea is to mask the unrelated regions in the image for classification, which should be useful. Have you tried this?

Model Release

Great job! By the way, do you have a timeline for when the Alpha-CLIP with LLaVA and Open-clip model will be released?

will you consider using the eva-clip-g model to get an eva-clip-g version of Alpha-CLIP?

Excellent work, congratulations, will you consider using the eva-clip-g model to get an eva-clip-g version of Alpha-CLIP?

Will you provide code for the data generation process?

We would like to reproduce your results and are wondering when the data synthesis code will be opened up.

Request for Alpha-CLIP with LLaVA Web Demo and Local Demo

I would like to request the addition of a web demo and instructions for a local demo for Alpha-CLIP with LLaVA. Having both options would greatly enhance accessibility and usability for users interested in exploring Alpha-CLIP. Thank you!

Question: Can you provide some guidance for finetuning MLLM with alpha-clip vision encoder?

Fine-tuning alpha-clip LLaVA-1.5 is mentioned in paper 4.2. Alpha-CLIP in MLLM. I wonder the detail settings. Do you train the model just following stage 1, stage 2 in Paper GPT4ROI?

And if i want to train a general version of Alpha-CLIP MLLM, can you provide some guidance for fine tuning? Should we completely follow the settings of LLaVA-1.5 to train MLP and LLM from scratch, or should we just fine-tune the MLP layer using part of the data to align the original CLIP space and Alpha-CLIP space?