facebookresearch / imagebind

ImageBind One Embedding Space to Bind Them All

License: Other

Python 100.00%

imagebind's Issues

3rd party dependencies.

What does third-party dependencies refer to and what is the relevant version?

How to load and transform depth data?

How can I do action recognition or scene classification using IMU data?

Could you give us more examples showing how IMU data translates to text or video in applications? Thanks.

Text/Audio/Image > Video/3D

Great Job!

Will it support Text/Audio/Image > Video/3D conversion, and if so, approximately when?

IMU Input Dimensions are Unclear - Missing Information on Data Prep

Hello,

What is the required format for IMU input embeddings? Or rather, why does T have to be 2000?
I've tried to run the code using sample embeddings as specified in the appendix of the paper.

For IMU we use a 6×T tensor to represent the sequence of IMU sensor readings over time.

Initially I tried to use the sample from the Ego-4D dataset: https://ego4d-data.org/docs/data/imu/

but this kept throwing size mismatch errors

I am trying to create a joint embedding for a single

RGB Image -> (1,3, 224, 224)
Depth Image -> (1,1, 224, 224)
IMU reading -> (1, 6, T) where T represents the number of time steps. I tried a few different options for T, but found 2000 to be the only one that doesn't throw a size mismatch and produces embeddings for each

Does this mean the model requires a minimum of 2000 time steps for IMU sensors?

Thank you for your help
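
A minimal sketch of one way to bring an IMU recording to a fixed length, assuming (as observed above, not confirmed by the authors) that only T = 2000 is accepted. The helper name `fit_imu_length` and the zero-padding choice are illustrative assumptions; zero-padding may not match how the training data was prepared.

```python
from typing import List


def fit_imu_length(channels: List[List[float]], target_len: int = 2000) -> List[List[float]]:
    """Crop or zero-pad each IMU channel to exactly target_len time steps.

    channels is a 6 x T nested list of sensor readings.
    target_len=2000 is the value reported to work in this issue (an assumption).
    """
    if len(channels) != 6:
        raise ValueError(f"expected 6 channels, got {len(channels)}")
    fitted = []
    for ch in channels:
        if len(ch) >= target_len:
            # Keep only the first target_len samples.
            fitted.append(list(ch[:target_len]))
        else:
            # Pad the tail with zeros up to target_len.
            fitted.append(list(ch) + [0.0] * (target_len - len(ch)))
    return fitted


short = [[0.1] * 500 for _ in range(6)]
long = [[0.1] * 3000 for _ in range(6)]
print(len(fit_imu_length(short)[0]), len(fit_imu_length(long)[0]))
```

The resulting 6 x 2000 list could then be converted to a tensor and given a batch dimension to reach the (1, 6, 2000) shape mentioned above.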

What is the minimum GPU requirement?

The model file is larger than 4 GB, so what is the minimum GPU requirement? I only have a 3060.

[Help] How can I generate images or audio?

Hey, could someone explain to me (no AI/ML background) how this model could be used to generate images or audio?
I can generate 3 x 3 tensors in code, no problem, but what's the next step to leverage these tensors?

I'm pretty sure I'm not the only one who will stand here and think to himself: "what now?"
I would appreciate a hint or anything that would explain how I could use these tensors without having to read the paper (which I tried but didn't really grasp).
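
As far as the README example shows, the model returns embeddings rather than images or audio, and the 3 x 3 tensors are similarity scores between inputs. A minimal sketch of reading such a matrix, using the "Vision x Text" values from the README's expected output; the helper name `best_matches` is hypothetical.

```python
from typing import List


def best_matches(scores: List[List[float]], labels: List[str]) -> List[str]:
    """For each row (one query), return the label with the highest score."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in scores]


text_list = ["A dog.", "A car", "A bird"]
# Rows are images (dog, car, bird); columns are the three text prompts.
vision_x_text = [
    [9.9761e-01, 2.3694e-03, 1.8612e-05],
    [3.3836e-05, 9.9994e-01, 2.4118e-05],
    [4.7997e-05, 1.3496e-02, 9.8646e-01],
]
print(best_matches(vision_x_text, text_list))
```

This is retrieval or classification, not generation. Producing images or audio would need a separate generative model conditioned on the embeddings, which this example does not cover.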

Confused about ImageNet1k results

Wonderful work!
In Table 2, the top-1 accuracy on ImageNet1k is 77.7%, which is higher than CLIP (OpenCLIP) by 2.2% (2.0%). But ImageBind did not train the vision encoder or the text encoder, so what makes the results different? Or is there anything I missed?

Any plans to release smaller checkpoints?

Do you have any plans to release smaller model checkpoints other than imagebind_huge?

imageBind

meta image-audio AI

ImageBind with SAM Simple Demo: Segment with Different Modalities

Thanks a lot for releasing such amazing work!

We implemented a simple and interesting demo by combining ImageBind with SAM: ImageBind-SAM, which can segment things with different modalities. The project is still under development.

The basic idea follows IEA: Image Editing Anything and CLIP-SAM, which generate the referring mask with the following steps:

Step 1: Generate auto masks with SamAutomaticMaskGenerator
Step 2: Crop all the box region from the masks
Step 3: Compute the similarity with cropped images and different modalities
Step 4: Merge the highest similarity mask region
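
Steps 3 and 4 can be sketched as follows. This is an illustration only: the embeddings are toy vectors standing in for ImageBind outputs, the masks are tiny binary grids standing in for SAM masks, and `select_and_merge` and its `threshold` parameter are hypothetical names.

```python
import math
from typing import List

Mask = List[List[int]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def select_and_merge(masks: List[Mask], crop_embs: List[List[float]],
                     query_emb: List[float], threshold: float) -> Mask:
    """Union of all masks whose crop embedding is similar enough to the query."""
    height, width = len(masks[0]), len(masks[0][0])
    merged = [[0] * width for _ in range(height)]
    for mask, emb in zip(masks, crop_embs):
        if cosine(emb, query_emb) >= threshold:      # Step 3: similarity test
            for r in range(height):                  # Step 4: pixel-wise OR
                for c in range(width):
                    merged[r][c] |= mask[r][c]
    return merged


masks = [[[1, 0], [0, 0]], [[0, 1], [0, 0]], [[0, 0], [1, 1]]]
crop_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy crop embeddings
query_emb = [1.0, 0.0]                             # toy audio or text embedding
merged = select_and_merge(masks, crop_embs, query_emb, threshold=0.8)
print(merged)
```

Raising `threshold` keeps fewer regions, which is one way the sensitivity to the threshold shows up.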

And the result is shown as:

Input Model	Modality	Generate Mask
	car audio
	"A car"

The threshold for each box strongly influences the final result; we will do more tests on it!

Generating a video and text as output given a input voice

Hi @likethesky @Celebio @neuhaus @colesbury ,
Thanks for the great work and for paving the way for multimodal AI research. I am new to multimodal AI; I have only worked on computer vision before. I have a small query: how can we use ImageBind to create a video and video captions (subtitles) as outputs, given an input audio in another language? I am just curious about applying ImageBind in different applications.

selective modality finetune

Thanks for the awesome work!
Suppose I have my own audio-text dataset and want to finetune only the audio-text modalities. How can I achieve that?

Our demo (InternGPT with ImageBind) has been released! Welcome to try it.

Our InternGPT has supported the ImageBind officially. This online demo provides an easy way to access this awesome work.

Welcome to try our demo! We are looking forward to your suggestions and PRs.

The Video demo is here for your reference:

video_demo_with_imagebind.mp4

cuda 11.3 is deprecated for pytorch 1.13, cu116 or cu117 is recommended

If torch 1.13 is intended, the requirements file may need to be updated to change pip wheel to cu116 or cu117. See the links below.
https://pytorch.org/blog/PyTorch-1.13-release/#cuda10.2
https://pytorch.org/get-started/previous-versions/

How to use ImageBind to generate image or audio?

I can run the example code. But how do I run the model to generate images and audio?

Question about the no. of video clip frames

I'm not sure how many clips are fed into the model. In data.py load_and_transform_video_data loads 5 clips by default, whereas in the paper, it says 2 clips are sampled in 2 second videos (sec. 3.3). Are these referring to the same thing?

Google colab link for generating images or audio

I tried running the code in the README, and it ran successfully.

But how can I generate images or audio from a prompt like "cat meow"?

How could I train ImageBind completely from scratch?

Thank you for the very cool work!
I'm having trouble finding your implementation of NCE loss, however. I know @fabawi has implemented a version of this for his LoRA fine-tuning version (kudos). However, if I wanted to train the original ImageBind model completely from scratch how would I do this?
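
For orientation, a minimal pure-Python sketch of a symmetric InfoNCE loss over a batch of paired embeddings, the kind of contrastive objective this question is about. This is not the authors' training code; the function name `info_nce`, the temperature value, and the toy inputs are assumptions, and a real implementation would use batched tensor operations.

```python
import math
from typing import List

Vectors = List[List[float]]


def _one_direction(q: Vectors, k: Vectors, tau: float) -> float:
    """Mean over i of -log softmax(q_i . k_j / tau)[i], where k_i is the positive."""
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / tau for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(q)


def info_nce(q: Vectors, k: Vectors, tau: float = 0.07) -> float:
    """Symmetric InfoNCE: average of the q->k and k->q losses."""
    return 0.5 * (_one_direction(q, k, tau) + _one_direction(k, q, tau))


aligned_q = [[1.0, 0.0], [0.0, 1.0]]
aligned_k = [[1.0, 0.0], [0.0, 1.0]]
swapped_k = [[0.0, 1.0], [1.0, 0.0]]
# Correctly paired embeddings should give a much lower loss than mismatched ones.
print(info_nce(aligned_q, aligned_k) < info_nce(aligned_q, swapped_k))
```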

No module named 'models'

from models import imagebind_model

ModuleNotFoundError: No module named 'models'

I tried using Jupyter Notebook and Spyder. I've also already tried changing environments.

Any idea?
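
One likely cause is that Python is not started from the repository root, so the top-level `models` package is not on the import path. A minimal sketch of adding a directory to `sys.path` before importing; the helper name `add_to_import_path` is hypothetical, and the temporary package below only stands in for a real clone.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def add_to_import_path(repo_root: str) -> None:
    """Put repo_root at the front of sys.path so its top-level packages import."""
    root = str(Path(repo_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)


# Stand-in for a cloned repository containing a top-level package.
fake_repo = tempfile.mkdtemp()
pkg_dir = Path(fake_repo) / "standin_models_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("NAME = 'stand-in'\n")

add_to_import_path(fake_repo)
importlib.invalidate_caches()
module = importlib.import_module("standin_models_pkg")
print(module.NAME)
```

With a real clone, the same call would take the path of the cloned repository before `from models import imagebind_model`. Changing into the repository directory before starting Python or Jupyter should have the same effect.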

raise RuntimeError("No audio I/O backend is available.")

error info:
D:\soft\anaconda3\envs\ImageBind\lib\site-packages\torchvision\transforms_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Traceback (most recent call last):
File "E:\github\ImageBind\test.py", line 21, in
ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
File "E:\github\ImageBind\data.py", line 135, in load_and_transform_audio_data
waveform, sr = torchaudio.load(audio_path)
File "D:\soft\anaconda3\envs\ImageBind\lib\site-packages\torchaudio\backend\no_backend.py", line 16, in load
raise RuntimeError("No audio I/O backend is available.")
RuntimeError: No audio I/O backend is available.
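
This error generally means torchaudio could not find an installed I/O backend. On Windows, a commonly reported fix is installing the `soundfile` package; that is an assumption about this setup, not something confirmed in the issue. A minimal sketch for checking whether a package is importable in the active environment; `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


for name in ("torchaudio", "soundfile"):
    print(name, "installed" if is_installed(name) else "missing")
```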

This is awesome

Thanks for building it and releasing it opensource!

Such a simple idea in hindsight. It's great it works.

import data error in Google Colab

When I try to run the demo in Google Colab, I got the error:

import data

ModuleNotFoundError Traceback (most recent call last)
in <cell line: 2>()
1 import torch
----> 2 import data
3 from models import imagebind_model
4 from models.imagebind_model import ModalityType

5 frames
/content/ImageBind/data.py in
17 from pytorchvideo import transforms as pv_transforms
18 from pytorchvideo.data.clip_sampling import ConstantClipsPerVideoSampler
---> 19 from pytorchvideo.data.encoded_video import EncodedVideo
20
21 from torchvision import transforms

/usr/local/lib/python3.8/site-packages/pytorchvideo/data/__init__.py in
1 # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
2
----> 3 from .ava import Ava # noqa
4 from .charades import Charades # noqa
5 from .clip_sampling import ( # noqa; noqa

/usr/local/lib/python3.8/site-packages/pytorchvideo/data/ava.py in
10 from iopath.common.file_io import g_pathmgr
11 from pytorchvideo.data.clip_sampling import ClipInfo, ClipSampler
---> 12 from pytorchvideo.data.labeled_video_dataset import LabeledVideoDataset
13
14

/usr/local/lib/python3.8/site-packages/pytorchvideo/data/labeled_video_dataset.py in
12
13 from .labeled_video_paths import LabeledVideoPaths
---> 14 from .utils import MultiProcessSampler
15
16

/usr/local/lib/python3.8/site-packages/pytorchvideo/data/utils.py in
14 from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
15
---> 16 import av
17 import numpy as np
18 import torch

/usr/local/lib/python3.8/site-packages/av/__init__.py in
18 # MUST import the core before anything else in order to initalize the underlying
19 # library that is being wrapped.
---> 20 from av._core import time_base, library_versions
21
22 # Capture logging (by importing it).

ModuleNotFoundError: No module named 'av._core'

NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

I tried: !pip install av
But the problem was not resolved.

Thermal x Vision Support

Following issue 14, I created a small example for thermal embedding. While the Vision x Text and Thermal x Text are working properly, it seems the Vision x Thermal does not yield the correct result.

def load_and_transform_thermal_data(thermal_paths, device):
    if thermal_paths is None:
        return None

    thermal_ouputs = []
    for thermal_path in thermal_paths:
        data_transform = transforms.Compose(
            [
                transforms.Resize(
                    224, interpolation=transforms.InterpolationMode.BICUBIC
                ),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
#                 transforms.Normalize(
#                     mean=(0.5),
#                     std=(0.5),
#                 ),
            ]
        )
        with open(thermal_path, "rb") as fopen:
            thermal = Image.open(fopen).convert("L")
        thermal = data_transform(thermal).to(device)
        thermal_ouputs.append(thermal)
    return torch.stack(thermal_ouputs, dim=0)

And the results are:

Vision x Text: 
 [[9.9997604e-01 2.3943641e-05]
 [6.0792509e-06 9.9999392e-01]]
Thermal x Text x : 
 [[1.0000000e+00 1.2433221e-11]
 [2.8220674e-02 9.7177935e-01]]
Vision x Thermal Cosine: 
 [[0.1554441  0.02945926]
 [0.16725276 0.03671783]]
Vision x Thermal Softmax: 
 [[0.7789999  0.22100005]
 [0.7867338  0.21326624]]

Could not find a version that satisfies the requirement decord==0.6.0

Question about the paper / training

Hi Authors,
Maybe I missed this while reading the paper: how did you tackle the dataset imbalance problem across modalities? For example, you'll have a lot more Image-Text pairs than Image-Depth or Image-IMU pairs.

help with embedding arithmetic and image retrieval

Hi,
Thanks for your great work.
I am interested in the embedding arithmetic and image retrieval, as the example shown in Figure 4 of the paper.

In the paper, the embedding arithmetic is described as follows:

For arithmetic, we again use the
embedding features after temperature scaling. We ℓ2 normalize the features and sum the embeddings after scaling
them by 0.5. We use the combined feature to perform nearest neighbor retrieval using cosine distance, as described
above.

To obtain the embedding features after temperature scaling can I just use the following code?:

########## - step 1 - ########## 
# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

which applies normalization and temperature scaling for each modality (except for the image modality, where it applies only normalization)? Or should I modify the way the embeddings are returned by removing the normalization part and applying only temperature scaling? https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#LL422C1-L424C10

After obtaining the embedding features after temperature scaling, do I need to apply another ℓ2 normalization, something like:

########## - step 2 - ########## 
img_embedding = embeddings[ModalityType.VISION]
txt_embedding = embeddings[ModalityType.TEXT]

img_embedding = img_embedding / torch.norm(img_embedding, dim=-1, keepdim=True)
txt_embedding = txt_embedding / torch.norm(txt_embedding, dim=-1, keepdim=True)

and then combine the embeddings of the two modalities?:

combined_embs = 0.5* img_embedding + 0.5* txt_embedding

Then, I just use the combined_embs and compute the cosine similarity with the embeddings of a set of images (extracted with step-1) that I want to retrieve images from?
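
The recipe quoted from the paper (ℓ2-normalize, scale by 0.5, sum, then nearest-neighbor retrieval by cosine similarity) can be sketched in plain Python. The vectors are toy values and the function names are hypothetical. Whether the model's returned embeddings need re-normalizing first is exactly the open question above.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine(a: List[float], b: List[float], weight: float = 0.5) -> List[float]:
    """Normalize both embeddings, scale each by weight, and sum them."""
    a, b = l2_normalize(a), l2_normalize(b)
    return [weight * x + weight * y for x, y in zip(a, b)]


def retrieve(query: List[float], gallery: List[List[float]]) -> int:
    """Index of the gallery vector with the highest cosine similarity to query."""
    q = l2_normalize(query)
    sims = [sum(x * y for x, y in zip(q, l2_normalize(g))) for g in gallery]
    return max(range(len(sims)), key=sims.__getitem__)


img_embedding = [2.0, 0.0, 0.0]   # toy image embedding
txt_embedding = [0.0, 3.0, 0.0]   # toy text embedding
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

combined_embs = combine(img_embedding, txt_embedding)
print(retrieve(combined_embs, gallery))
```

In this toy gallery the best match is the vector that mixes both directions, which is the behavior embedding arithmetic is meant to produce.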

I apologize for the long post.
I greatly appreciate any tips and advice on how to approach this issue.

Many thanks!

Suggestion - Integrate MobileSAM into the pipeline for lightweight and faster inference

Reference: https://github.com/ChaoningZhang/MobileSAM

Our project performs on par with the original SAM and keeps exactly the same pipeline as the original SAM except for a change to the image encoder; therefore, it is easy to integrate into any project.

MobileSAM is around 60 times smaller and around 50 times faster than the original SAM, and it is around 7 times smaller and around 5 times faster than the concurrent FastSAM. The comparison of the whole pipeline is summarized as follows:

pip install -r requirements.txt

I guess it should be
pip install -r requirement.txt
in readme.md

unable to get past gzip

Something seems to be wrong with bpe_simple_vocab_16e6.txt.gz. I get this error on execution and am stuck on it. Any help will be appreciated, as I am unable to move further.

ModalityType.TEXT: data.load_and_transform_text(text_list, device),

File "/Users/FD00199/Downloads/data.py", line 109, in load_and_transform_text
tokenizer = SimpleTokenizer(bpe_path=BPE_PATH)
File "/Users/FD00199/Downloads/models/multimodal_preprocessors.py", line 505, in __init__
merges = gzip.open(bpe_bytes).read().decode("utf-8").split("\n")
File "/Users/FD00199/miniconda3/envs/imagebind/lib/python3.8/gzip.py", line 292, in read
return self._buffer.read(size)
File "/Users/FD00199/miniconda3/envs/imagebind/lib/python3.8/gzip.py", line 479, in read
if not self._read_gzip_header():
File "/Users/FD00199/miniconda3/envs/imagebind/lib/python3.8/gzip.py", line 427, in _read_gzip_header
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'\n\n')
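
The magic bytes in the message (b'\n\n') suggest the file on disk is not actually gzip data, for example a truncated or wrongly downloaded copy. A minimal sketch that checks the two-byte gzip signature before loading; `looks_like_gzip` is a hypothetical helper, and the temporary files stand in for the real vocabulary file.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path: str) -> bool:
    """Return True if the file starts with the gzip signature."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, "good.txt.gz")
bad = os.path.join(tmp_dir, "bad.txt.gz")

with gzip.open(good, "wb") as fh:
    fh.write(b"merge rules\n")          # a real gzip file
with open(bad, "wb") as fh:
    fh.write(b"\n\n<html>not gzip</html>")  # plain bytes with a .gz name

print(looks_like_gzip(good), looks_like_gzip(bad))
```

If the check fails for the vocabulary file, downloading it again from the repository is a reasonable next step.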

The issue about Audio to Image Generation

An amazing work!!!

It's well known that https://github.com/lucidrains/DALLE2-pytorch and https://github.com/LAION-AI/dalle2-laion used open-clip as the pretrained text and image encoder. However, I have noticed that you used a private DALLE-2 to generate the image conditioned on audio.

Is it possible to use an open-source DALLE-2 instead of the private reimplemented counterpart? Are there any problems with the open-source DALLE-2? I would appreciate it if you could share your experience.

In my view, if it were possible to adapt an open-source DALLE-2 to ImageBind, it could directly enable some very interesting applications and increase the impact of this work!

Varying the sound length

Fantastic work! I have been evaluating the model using sound files of different lengths. For sounds shorter (500ms in this example) than the 2 second audio clips used to train, I get the following warning:
WARNING:root:Large gap between audio n_frames(48) and target_length (204). Is the audio_target_length setting correct?

My question is how do sound clips of varying length affect the embedding output? In other words, can I still use embeddings from shorter clips, or should I duplicate shorter sounds to approximate the 2 seconds expected by the model?
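
What "duplicating shorter sounds" could look like is sketched below: the waveform is repeated until it reaches a target number of samples. Whether this gives better embeddings than the default handling of short clips is the open question here; `tile_to_length` and the 16 kHz sample rate are assumptions for illustration.

```python
from typing import List


def tile_to_length(samples: List[float], target_len: int) -> List[float]:
    """Repeat the waveform until it has target_len samples, then crop the excess."""
    if not samples:
        raise ValueError("cannot tile an empty waveform")
    repeats = -(-target_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target_len]


sample_rate = 16_000                      # assumed sample rate
clip = [0.0, 0.5, -0.5, 0.25] * 2000      # 8000 samples, i.e. 0.5 s at 16 kHz
two_seconds = tile_to_length(clip, 2 * sample_rate)
print(len(clip), len(two_seconds))
```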

Can someone teach me how to use this model to generate some images? Thanks

At least give some scripts.

IMU input and asset example

Really great work! I'm particularly interested by the IMU and audio modalities. Can you guys add some IMU data examples? I don't see any in the .assets folder. It would really be great to know more about the expected format so people can play around with this and explore new possibilities.

Thanks!

FileNotFoundError: [Errno 2] No such file or directory: 'bpe/bpe_simple_vocab_16e6.txt.gz'

Could not find a version that satisfies the requirement decord==0.6.0 (from versions: none)

ERROR: Could not find a version that satisfies the requirement decord==0.6.0 (from versions: none)
ERROR: No matching distribution found for decord==0.6.0
WARNING: You are using pip version 21.3.1; however, version 23.1.2 is available.

If I use decord==0.6.1, the same exception is thrown.

The link to the paper is marked as "TBD" instead of providing a valid link

The issue is that in the introduction section of the document, the link to the paper is marked as "TBD" instead of providing a valid link. This should be fixed by adding the correct link to the paper.

**[ImageBind: One Embedding Space To Bind Them All](TBD)**

Finetuning ImageBind with LoRA

I created a simple ImageBind finetuning example using LoRA:
https://github.com/fabawi/ImageBind-LoRA

Make sure you clone it recursively to include the example dataset:
git clone --recurse-submodules -j8 git@github.com:fabawi/ImageBind-LoRA.git

Install the requirements following the instructions provided in this repo, and run train.py

This should log your checkpoints, as well as separate LoRA if you'd like to update the original model without saving all the model params. More examples and finer control to be added soon

What's the easiest way to test this code?

I cloned the repository into PyCharm and copied the initial file. When I ran "python file.py",
it began downloading about 5 GB of data. Did I do something wrong, or is this what it is supposed to do?
Thanks for helping out.

import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Text: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Audio x Text: ",
    torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Vision x Audio: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

Expected output:

Vision x Text:
tensor([[9.9761e-01, 2.3694e-03, 1.8612e-05],
        [3.3836e-05, 9.9994e-01, 2.4118e-05],
        [4.7997e-05, 1.3496e-02, 9.8646e-01]])

Audio x Text:
tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])

Vision x Audio:
tensor([[0.8070, 0.1088, 0.0842],
        [0.1036, 0.7884, 0.1079],
        [0.0018, 0.0022, 0.9960]])

ModuleNotFoundError: No module named 'models'

I've been getting this error when trying out the model:

ModuleNotFoundError Traceback (most recent call last)
in <cell line: 3>()
1 import data
2 import torch
----> 3 from models import imagebind_model
4 from models.imagebind_model import ModalityType
5

ModuleNotFoundError: No module named 'models'

NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

Suggest to create a setup.py or something else to ease install

Evaluation code available?

Are there any plans to release the codes used to evaluate the model in the experiments described in your paper?

How to use Depth embedding.

Thanks for great work!
I want to use depth embeddings in ImageBind, but I cannot get good results.
Please explain how to use depth embeddings.

・depth estimator and create depth image

from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

text = "bird"
image = Image.open(f"/content/ImageBind/.assets/{text}_image.jpg")

encoding = feature_extractor(image, return_tensors="pt")
    
# forward pass
with torch.no_grad():
  outputs = model(**encoding)
  predicted_depth = outputs.predicted_depth
    
# interpolate to original size
prediction = torch.nn.functional.interpolate(
                        predicted_depth.unsqueeze(1),
                        size=image.size[::-1],
                        mode="bicubic",
                        align_corners=False,
    ).squeeze()
output = prediction.cpu().numpy()
formatted = (output * 255 / np.max(output)).astype('uint8')
img = Image.fromarray(formatted)
img.save(f"/content/ImageBind/.assets/{text}_depth.jpg")

・after that, inference with the following code

from torchvision import transforms
from PIL import Image
def load_and_transform_depth_data(depth_paths, device):
    if depth_paths is None:
        return None

    depth_ouputs = []
    for depth_path in depth_paths:
        data_transform = transforms.Compose(
            [
                transforms.Resize(
                    224, interpolation=transforms.InterpolationMode.BICUBIC
                ),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                # transforms.Normalize((0.5, ), (0.5, ))  # if I use this normalization, I cannot get good results...
            ]
        )
        with open(depth_path, "rb") as fopen:
            image = Image.open(fopen).convert("L")

        image = data_transform(image).to(device)
        depth_ouputs.append(image)
    return torch.stack(depth_ouputs, dim=0)


import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]
depth_paths = [".assets/dog_depth.jpg", ".assets/car_depth.jpg", ".assets/bird_depth.jpg"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.DEPTH: load_and_transform_depth_data(depth_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Depth: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Text x Depth: ",
    torch.softmax(embeddings[ModalityType.TEXT] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Depth x Audio: ",
    torch.softmax(embeddings[ModalityType.DEPTH] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

・output

Vision x Depth:  tensor([[0.3444, 0.3040, 0.3516],
        [0.3451, 0.2363, 0.4186],
        [0.3517, 0.3634, 0.2849]], device='cuda:0')
Text x Depth:  tensor([[9.5571e-01, 4.4270e-02, 1.5210e-05],
        [5.6266e-01, 4.3734e-01, 9.7014e-10],
        [4.6230e-06, 1.0000e+00, 7.2704e-15]], device='cuda:0')
Depth x Audio:  tensor([[1.9618e-01, 1.4769e-02, 7.8905e-01],
        [1.5248e-02, 4.6171e-03, 9.8014e-01],
        [1.5896e-04, 1.8075e-02, 9.8177e-01]], device='cuda:0')

Please reply!

Training resources

Thanks for your wonderful work.
I am very excited about your idea. May I ask what computation budget was used to train the largest ImageBind model? How many GPU hours did you use?

License mismatch

The license stated in the model card file disagrees with the other locations (README file, LICENSE file).

See pull request #4.

Extend ImageBind to 3D Point Cloud domain: Point-Bind

Thanks very much for releasing such insightful work!

We developed Point-Bind, a project based on ImageBind that aligns the 3D point cloud modality with image, text, and audio. Our project has four main characteristics:

Align 3D with ImageBind. With a joint embedding space, 3D objects can be aligned with their corresponding 2D images, textual descriptions, and audio.
3D LLM via LLaMA-Adapter. In Multi-modal LLaMA-Adapter (ImageBind-LLM), we introduce an LLM following 3D instructions in English/中文.
3D Zero-shot Classify/Seg/Det. Point-Bind achieves state-of-the-art performance for 3D zero-shot tasks, including classification, segmentation, and detection.
Embedding Arithmetic with 3D. We observe that 3D features from Point-Bind can be added with other modalities to compose their semantics.

The Multi-modality LLaMA-Adapter (ImageBind-LLM) with Point-Bind's 3D embeddings is as follows:

Thanks!

CUDA version

PyTorch 1.13.1 + CUDA 11.6:

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

Is it OK to use cosine_similarity instead of softmax for VISION x TEXT?

Hey,

I just want to know if sklearn's cosine_similarity can replace the softmax.

Thanks
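
The two are not interchangeable in meaning: cosine similarity scores each pair independently in [-1, 1], while softmax turns one row of scores into a probability distribution over the candidates. A minimal pure-Python sketch on toy vectors (hypothetical values, not ImageBind outputs):

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(row: List[float]) -> List[float]:
    """Convert a row of scores into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


image = [0.9, 0.1]                                   # toy image embedding
texts = {"A dog.": [1.0, 0.0], "A car": [0.0, 1.0]}  # toy text embeddings

cos_scores = [cosine_similarity(image, t) for t in texts.values()]
probs = softmax(cos_scores)
print([round(s, 3) for s in cos_scores])
print([round(p, 3) for p in probs])
```

For picking the best match, both give the same winner in this example. They differ in what the numbers mean, so thresholds chosen for one do not carry over to the other.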

Vision x Vision NOT what we want

As you can see above, I use the original assets (text, image, audio) from the main branch and find that Vision x Vision is not what I expect: dog_image x dog_image is not 1, while the other two are 1.
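
A possible explanation, based on the remark elsewhere in these issues that the image modality is only normalized while other modalities are also temperature-scaled: unit-length vectors have dot products of at most 1, so a softmax over such small logits cannot approach 1. A toy sketch of the effect; the similarity values and the scale of 20 are made up for illustration.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Convert a row of logits into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities of one image against three images (itself first).
row = [1.0, 0.2, 0.1]
unscaled = softmax(row)
scaled = softmax([20.0 * x for x in row])  # 20.0 is an arbitrary illustrative scale
print(round(unscaled[0], 3), round(scaled[0], 3))
```

Without scaling, the self-match gets only a moderate probability even though its cosine similarity is exactly 1. With a larger scale the same row becomes nearly one-hot.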

Cartopy install fails on Ubuntu

The cartopy install fails with the following error.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for cartopy
Failed to build cartopy
ERROR: Could not build wheels for cartopy, which is required to install pyproject.toml-based projects

The fix is to install the dependency.

sudo apt -y install libgeos-dev

Logging this here so that it becomes part of the documentation.
