sunzey / alphaclip
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Home Page: https://aleafy.github.io/alpha-clip
License: Apache License 2.0
Hi! I tested the performance of Alpha-CLIP on the COCO dataset by feeding the whole image together with the ground-truth mask of each object to the model, but the classification accuracy is only around 30%. Is this result normal?
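For reference, here is a minimal sketch of the evaluation loop this seems to describe, so the setup is unambiguous. `coco_samples` (an iterable of image path, binary object mask, and category index) and `category_names` are hypothetical placeholders for your own COCO loading code; the checkpoint path and mask transform follow the other snippets in this thread.

```python
import alpha_clip
import clip
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = alpha_clip.load(
    "ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
model = model.float().to(device)
mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26),
])

# The 80 COCO category names, wrapped in a simple prompt template.
text = clip.tokenize([f"a photo of a {c}" for c in category_names]).to(device)

correct = total = 0
for img_path, binary_mask, label in coco_samples:  # hypothetical loader
    image = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0).to(device)
    # binary_mask: (H, W) array of 0/1 for one ground-truth object
    alpha = mask_transform((binary_mask * 255).astype(np.uint8)).unsqueeze(0).to(device).float()
    with torch.no_grad():
        img_feat = model.visual(image, alpha)
        txt_feat = model.encode_text(text).float()
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        pred = (img_feat @ txt_feat.T).argmax(dim=-1).item()
    correct += int(pred == label)
    total += 1
print(f"accuracy: {correct / total:.3f}")
```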
Great work!
I am confused by the Table 6 results: is the reported performance from Alpha-CLIP with LLaVA-1.5, or from fine-tuning this model with Vicuna-7B on these datasets (RefCOCOg or VG)?
Thank you for your great work!
Will the ImageNet annotations (masks, captions) generated by the data generation pipeline proposed in this paper be open-sourced soon? I would appreciate it if you could provide them.
```python
import alpha_clip
import torch
from torchvision import transforms
from PIL import Image
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"

alpha_clip_model, alpha_clip_preprocess = alpha_clip.load(
    "ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
alpha_clip_model = alpha_clip_model.float().to(device)

alpha_clip_mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26)
])

img_path = 'assets/maxi_dress.jpg'
mask_path = 'assets/1.png'
raw_img = Image.open(img_path)
mask = Image.open(mask_path)  # fixed: the original opened img_path here
mask = np.array(mask)[:, :, 0:1]  # keep one channel, shape (H, W, 1)

# the mask is assumed to be binary (0/1), hence the * 255 rescaling
alpha_map = alpha_clip_mask_transform(mask * 255).to(device, dtype=torch.float).unsqueeze(0)
image = alpha_clip_preprocess(raw_img).unsqueeze(0).to(device, dtype=torch.float)

with torch.no_grad():
    image_features = alpha_clip_model.visual(image, alpha_map)  # [1, 768]
print(image_features.shape)
```
Will you consider training Alpha-CLIP based on ViT-H/14?
Great work!
I see the graph. Does it mean that using a text prompt effectively generates the SAM mask via Alpha-CLIP?
We believe directly using text features from CLIP is not a good approach, because CLIP's explainability is poor: text features can match the opposite semantic regions, which leads to wrong results.
As shown, the original approach of deriving the SAM mask exclusively from the text prompt did not produce the desired outcomes.
"502 Bad Gateway" Error in Huggingface Demo and OpenXLab Demo
I tested the abilities of Alpha-CLIP and CLIP (OpenAI) on one image for zero-shot recognition. The results are below,
using the same texts: "purple long sleeve dress", "purple sleeveless dress", "a flower" (the second text is the correct one).
CLIP: tensor([[0.0174, 0.9826, 0.0000]], device='cuda:0')
Alpha-CLIP: tensor([[0.2263, 0.7735, 0.0003]], device='cuda:0')
Code for CLIP:

```python
import clip
import torch
from PIL import Image

torch.set_printoptions(precision=4, sci_mode=False)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-L/14', device)

img_path = 'assets/maxi_dress.jpg'
raw_img = Image.open(img_path)
img = preprocess(raw_img).unsqueeze(0).to(device)
text = clip.tokenize(["purple long sleeve dress", "purple sleeveless dress", "a flower"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(img).to(dtype=torch.float)
    text_features = model.encode_text(text).to(dtype=torch.float)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(text_probs)  # tensor([[0.0174, 0.9826, 0.0000]], device='cuda:0')
values, indices = torch.topk(text_probs, 1)
```
Code for Alpha-CLIP:

```python
import alpha_clip
import clip
import torch
from torchvision import transforms
from PIL import Image
import numpy as np

torch.set_printoptions(precision=4, sci_mode=False)
device = "cuda" if torch.cuda.is_available() else "cpu"

alpha_clip_model, alpha_clip_preprocess = alpha_clip.load(
    "ViT-L/14", alpha_vision_ckpt_pth="checkpoints/clip_l14_grit+mim_fultune_6xe.pth")
alpha_clip_model = alpha_clip_model.float().to(device)

alpha_clip_mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26)
])

img_path = 'assets/maxi_dress.jpg'
mask_path = 'assets/1.png'
raw_img = Image.open(img_path)
mask = Image.open(mask_path)  # fixed: the original opened img_path here
mask = np.array(mask)[:, :, 0:1]  # keep one channel, shape (H, W, 1)

# the mask is assumed to be binary (0/1), hence the * 255 rescaling
alpha_map = alpha_clip_mask_transform(mask * 255).to(device, dtype=torch.float).unsqueeze(0)
image = alpha_clip_preprocess(raw_img).unsqueeze(0).to(device, dtype=torch.float)

# text features come from the original CLIP text encoder
model, preprocess = clip.load('ViT-L/14', device)
text = clip.tokenize(["purple long sleeve dress", "purple sleeveless dress", "a flower"]).to(device)

with torch.no_grad():
    image_features = alpha_clip_model.visual(image, alpha_map)  # [1, 768]
    text_features = model.encode_text(text).to(dtype=torch.float)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(text_probs)  # tensor([[0.2263, 0.7735, 0.0003]], device='cuda:0')
values, indices = torch.topk(text_probs, 1)
```
How can I perform geometric transformations on the mask and the image simultaneously?
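One common pattern, sketched here as an assumption rather than the repo's own answer: sample the transform parameters once and apply them to both tensors with torchvision's functional API, so the image and mask stay aligned.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision import transforms

def paired_crop_and_flip(image, mask, size=(224, 224)):
    """Apply the same random crop and horizontal flip to image and mask.

    image: float tensor (3, H, W); mask: tensor (1, H, W) with the same H, W.
    """
    i, j, h, w = transforms.RandomResizedCrop.get_params(
        image, scale=(0.8, 1.0), ratio=(3 / 4, 4 / 3))
    image = TF.resized_crop(image, i, j, h, w, size)
    # nearest interpolation keeps a binary mask binary
    mask = TF.resized_crop(mask, i, j, h, w, size,
                           interpolation=TF.InterpolationMode.NEAREST)
    if torch.rand(1) < 0.5:  # one coin flip shared by both
        image = TF.hflip(image)
        mask = TF.hflip(mask)
    return image, mask
```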
Amazing paper; I had a pleasant experience reading it.
I have a few doubts about using Alpha-CLIP with BLIP-2: are you using the frozen Alpha-CLIP model as the image encoder, feeding it the mask and the image, and then doing BLIP-2 training, or, as shown in the image, are you somehow integrating the mask and the image with the Q-Former? I didn't quite understand it.
Moreover, did you compare zero-shot image captioning abilities, similar to BLIP-2, on the LLaVA bench?
Thanks
Hi guys,
Great work!!! I appreciate that you published your code as well.
I have a short question: how can I calculate the image-text similarity as in normal CLIP?
The image part is pretty clear (I managed to get image_features), but I can't figure out how to get text_features.
I tried the following:
```python
import alpha_clip
from alpha_clip.simple_tokenizer import SimpleTokenizer
import torch
from torchvision import transforms
import PIL.Image

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = SimpleTokenizer()
alpha_clip_modelname = "ViT-L/14@336px"
alpha_clip_model, alpha_clip_preprocess = alpha_clip.load(alpha_clip_modelname, device='cpu', lora_adapt=False, rank=-1)
alpha_clip_model = alpha_clip_model.float().to(device)
alpha_clip_mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((336, 336)),
    transforms.Normalize(0.5, 0.26)
])
raw_img = PIL.Image.open(img_path)
mask = get_mask()
alpha_map = alpha_clip_mask_transform(mask * 255).to(device, dtype=torch.float)
image = alpha_clip_preprocess(raw_img).to(device, dtype=torch.float)
image_features = alpha_clip_model.visual(image, alpha_map)  # this works; I get an image_features vector of size [1, 768]

# I tried the following to extract the text features:
text_tokenized = tokenizer.encode('my great text')
text_tokenized = torch.Tensor(text_tokenized).type(torch.int).to(device)
text_features = alpha_clip_model.encode_text(text_tokenized)
```
But then I got the following exception:
```
Traceback (most recent call last):
  File "/home/brian/miniconda3/envs/vipergpt/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-40-d3f32f6440c5>", line 1, in <module>
    alpha_clip_model.encode_text(z)
  File "/home/brian/code/grounding_ge/third_party/AlphaCLIP/alpha_clip/model.py", line 488, in encode_text
    x = x + self.positional_embedding.type(self.dtype)
RuntimeError: The size of tensor a (2) must match the size of tensor b (77) at non-singleton dimension 0
```
Note: I see that the pre-trained CLIP text encoder is not useful here, as the embedding size is 768 and not 512 as in CLIP.
Can you help me with this issue? I am sure I am missing something very easy to fix.
Thanks!!!
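If it helps others hitting the same error: the traceback suggests the token sequence must be padded to CLIP's fixed 77-token context length before `encode_text` adds the positional embedding. Assuming Alpha-CLIP keeps the same tokenizer helper as OpenAI CLIP (an `alpha_clip.tokenize` that pads and adds the start/end tokens), something like the following should work, reusing `alpha_clip_model` and `device` from the snippet above:

```python
import alpha_clip
import torch

# tokenize() returns a (batch, 77) int tensor with start/end tokens and
# padding, which is what encode_text()'s positional embedding expects;
# SimpleTokenizer.encode() alone returns only the raw token ids.
text = alpha_clip.tokenize(['my great text']).to(device)
with torch.no_grad():
    text_features = alpha_clip_model.encode_text(text)  # (1, 768) for ViT-L/14@336px
```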
Thank you for your work.
Considering that the captions in the GRIT dataset consist solely of noun words like "berries", "person", ...
did you use templates to expand the captions, such as "a photo of a xxx"?
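For concreteness, this is the kind of template expansion being asked about (the standard CLIP prompting trick; whether the authors used it here is the question):

```python
nouns = ["berries", "person"]
prompts = [f"a photo of a {noun}" for noun in nouns]
# ['a photo of a berries', 'a photo of a person']
```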
Do you have plans to release a model that uses ViT-B/32 embeddings?
Dear author @SunzeY,
Thanks a lot for this great work! Going through the code, may I ask what the role of the magic numbers 6 and 1.9231 is when constructing the mask?
AlphaCLIP/demo/with_llm/app.py
Line 151 in 3e7cdd7
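Not an authoritative answer, but one observation that may explain 1.9231: the mask transform used throughout this repo is transforms.Normalize(0.5, 0.26), under which the extreme alpha values map to roughly ±1.9231.

```python
# (alpha - mean) / std with mean=0.5, std=0.26
print((1.0 - 0.5) / 0.26)  # 1.9230769... -> foreground value
print((0.0 - 0.5) / 0.26)  # -1.9230769... -> background value
```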
Sorry to bother you during your busy time; I am in a hurry to get Alpha-CLIP running with LLaVA-7B-CLIP.
I followed the instructions here and changed a few things.
The input images and masks are as follows:
and the rewritten forward is as follows:
But the generated text doesn't seem to attend to the specific area of concern, and I am wondering why. Can you give me some useful advice?
Thank you
Hello, my dataset is composed entirely of images with an alpha channel. My question is whether encoding those would already work in a fashion where only the non-alpha region (the point of interest) is regarded.
Currently I'm using CLIP, but the results are bad when working with an alpha channel.
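If I understand the setup, here is a minimal sketch of how such a dataset could be fed to Alpha-CLIP (reusing `alpha_clip_model`, `alpha_clip_preprocess`, `alpha_clip_mask_transform`, and `device` from the snippets above; the file path is hypothetical): split each RGBA image into its RGB content and its alpha channel, and pass the alpha channel as the attention map instead of discarding it.

```python
import numpy as np
import torch
from PIL import Image

rgba = Image.open("my_image.png").convert("RGBA")  # hypothetical path
rgb = rgba.convert("RGB")
alpha = np.array(rgba)[:, :, 3:4]  # alpha channel, shape (H, W, 1), values 0-255

image = alpha_clip_preprocess(rgb).unsqueeze(0).to(device, dtype=torch.float)
# the alpha channel already spans 0-255, so no * 255 rescaling here
alpha_map = alpha_clip_mask_transform(alpha).unsqueeze(0).to(device, dtype=torch.float)
with torch.no_grad():
    image_features = alpha_clip_model.visual(image, alpha_map)
```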
DINOv2 seems as powerful as CLIP in some applications; maybe a similar Alpha-DINOv2 would be very useful too.
Thanks for your great work.
Given the mask of interest and the image, an intuitive idea is to mask out the unrelated regions of the image before classification, which should be useful. Have you tried this?
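For reference, the crude baseline being described, zeroing out everything outside the mask and running plain CLIP, is only a few lines (a sketch; `binary_mask` is assumed to be a 0/1 array at the image's resolution, and `model`/`preprocess`/`img_path`/`device` come from the CLIP snippet earlier in this thread):

```python
import numpy as np
import torch
from PIL import Image

img = np.array(Image.open(img_path).convert("RGB"))
masked = img * binary_mask[:, :, None]           # keep only the region of interest
masked_img = Image.fromarray(masked.astype(np.uint8))
with torch.no_grad():
    image_features = model.encode_image(preprocess(masked_img).unsqueeze(0).to(device))
```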
Great job! By the way, do you have a timeline for when the Alpha-CLIP with LLaVA and the OpenCLIP model will be released?
Excellent work, congratulations! Will you consider using the EVA-CLIP-g model to get an EVA-CLIP-g version of Alpha-CLIP?
We would like to reproduce your results and are wondering when the data synthesis code will be released.
I would like to request the addition of a web demo and instructions for a local demo for Alpha-CLIP with LLaVA. Having both options would greatly enhance accessibility and usability for users interested in exploring Alpha-CLIP. Thank you!
Fine-tuning Alpha-CLIP with LLaVA-1.5 is mentioned in Section 4.2 of the paper (Alpha-CLIP in MLLM). I wonder about the detailed settings: do you train the model just following stage 1 and stage 2 of the GPT4RoI paper?
And if I want to train a general version of an Alpha-CLIP MLLM, can you provide some guidance for fine-tuning? Should we completely follow the settings of LLaVA-1.5 and train the MLP and LLM from scratch, or should we just fine-tune the MLP layer on part of the data to align the original CLIP space with the Alpha-CLIP space?
When I use diffusion, the error `AttributeError: 'NoneType' object has no attribute 'from_pretrained'` occurs on this call:

```python
model, vis_preprocess, txt_preprocess = load_model_and_preprocess("blip_diffusion", "base", device=device, is_eval=True)
```

What is the problem?