Comments (2)
Sorry for the delay, but I'm preparing for my final examinations right now and don't have time to work on this. Also, our OpenXLab resources are too limited to run inference with LLaVA-1.5-13B, so deploying LLaVA there is difficult. If you are in a hurry, you can build on the LLaVA code and replace the CLIP weights it uses in llava/serve/model_worker.py with Alpha-CLIP. We will release the official implementation after my final examinations. A rough sketch of the patch:
```python
import collections
import types

import torch

# Global set by the serving code; holds the normalized alpha mask for the
# current request (None falls back to a whole-image alpha below).
mask_torch = None


def rewritten_forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
    print("[Warning] using rewritten alpha forward")
    global mask_torch
    batch_size = pixel_values.shape[0]
    patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
    if mask_torch is None:
        print("[Warning] no mask specified!")
        # Whole-image alpha: 1.9231 is an all-ones mask after normalization.
        alpha = torch.ones_like(pixel_values[:, [0], :, :]) * 1.9231
    else:
        alpha = mask_torch
    # Add the alpha-channel embedding on top of the RGB patch embedding.
    patch_embeds = patch_embeds + self.patch_embedding_alpha(alpha)
    patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
    class_embeds = self.class_embedding.expand(batch_size, 1, -1)
    embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
    embeddings = embeddings + self.position_embedding(self.position_ids)
    return embeddings
```
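For context (my addition, not part of the original snippet): `mask_torch` must already be normalized the way Alpha-CLIP normalizes its alpha channel during training. A minimal sketch of preparing it, assuming that normalization is mean 0.5 and std 0.26 (which is exactly what makes an all-ones mask come out as (1.0 - 0.5) / 0.26 ≈ 1.9231, the fallback above); `make_mask_torch` is a hypothetical helper name:

```python
import torch
import torch.nn.functional as F

def make_mask_torch(binary_mask: torch.Tensor, size: int = 336) -> torch.Tensor:
    # Hypothetical helper: turn an (H, W) binary mask into the normalized
    # [1, 1, size, size] alpha tensor that rewritten_forward expects.
    alpha = binary_mask.float()[None, None]                        # [1, 1, H, W]
    alpha = F.interpolate(alpha, size=(size, size), mode="nearest")
    return (alpha - 0.5) / 0.26   # mask=1 -> ~1.9231, mask=0 -> ~-1.9231

# usage: mask_torch = make_mask_torch(seg_mask).half().cuda()
```

The rest of the snippet grafts the alpha convolution onto LLaVA's vision tower, rebinds the forward, and converts the Alpha-CLIP checkpoint keys to HuggingFace naming: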
```python
visual_encoder = self.model.model.vision_tower.vision_tower.vision_model
# New 1-channel conv for the alpha mask, mirroring the RGB patch embedding.
visual_encoder.embeddings.patch_embedding_alpha = torch.nn.Conv2d(
    in_channels=1,
    out_channels=visual_encoder.embeddings.patch_embedding.out_channels,
    kernel_size=visual_encoder.embeddings.patch_embedding.kernel_size,
    stride=visual_encoder.embeddings.patch_embedding.stride,
    bias=False,
)
# MethodType rebinds forward on this embeddings instance only.
visual_encoder.embeddings.forward = types.MethodType(rewritten_forward, visual_encoder.embeddings)

# Convert the Alpha-CLIP checkpoint (OpenAI CLIP naming) to HuggingFace
# CLIPVisionModel naming.
state_dict = torch.load('clip_l14@336_grit1m_fultune_8xe.pth')
converted_dict = collections.OrderedDict()
for k, v in state_dict.items():
    if 'transformer.resblocks' in k:
        new_key = (k.replace('transformer.resblocks', 'encoder.layers')
                    .replace('attn', 'self_attn')
                    .replace('ln_1', 'layer_norm1')
                    .replace('ln_2', 'layer_norm2')
                    .replace('c_fc', 'fc1')
                    .replace('c_proj', 'fc2'))
        if ('self_attn' in new_key) and ('out' not in new_key):
            # Split the fused in_proj (q, k, v stacked along dim 0;
            # 1024 is the ViT-L/14 embedding width).
            if 'weight' in new_key:
                converted_dict[new_key.replace('in_proj', 'q_proj')] = v[:1024, :]
                converted_dict[new_key.replace('in_proj', 'k_proj')] = v[1024:2048, :]
                converted_dict[new_key.replace('in_proj', 'v_proj')] = v[2048:, :]
            else:
                assert 'bias' in new_key
                converted_dict[new_key.replace('in_proj', 'q_proj')] = v[:1024]
                converted_dict[new_key.replace('in_proj', 'k_proj')] = v[1024:2048]
                converted_dict[new_key.replace('in_proj', 'v_proj')] = v[2048:]
        else:
            converted_dict[new_key] = v
    else:
        # 'pre_layrnorm' (sic) is the actual attribute name in HuggingFace's
        # CLIPVisionTransformer.
        new_key = (k.replace('class_embedding', 'embeddings.class_embedding')
                    .replace('conv1.weight', 'embeddings.patch_embedding.weight')
                    .replace('positional_embedding', 'embeddings.position_embedding.weight')
                    .replace('conv1_alpha.weight', 'embeddings.patch_embedding_alpha.weight')
                    .replace('ln_pre.weight', 'pre_layrnorm.weight')
                    .replace('ln_pre.bias', 'pre_layrnorm.bias')
                    .replace('ln_post.weight', 'post_layernorm.weight')
                    .replace('ln_post.bias', 'post_layernorm.bias'))
        converted_dict[new_key] = v
# strict=False tolerates checkpoint keys with no counterpart in the HF module.
visual_encoder.load_state_dict(converted_dict, strict=False)
visual_encoder = visual_encoder.half().cuda()
```
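A quick smoke test (my addition, not part of the original comment) to confirm the patched tower runs end to end; shapes assume ViT-L/14 at 336 px, i.e. 24 × 24 = 576 patches plus one class token:

```python
mask_torch = None  # exercises the all-ones fallback in rewritten_forward
dummy = torch.randn(1, 3, 336, 336).half().cuda()
with torch.no_grad():
    tokens = visual_encoder.embeddings(dummy)
print(tokens.shape)  # expected: torch.Size([1, 577, 1024])
```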
Thanks for your response, and good luck with your final examinations. Appreciate your help!