facebookresearch / imagebind
ImageBind: One Embedding Space to Bind Them All
License: Other
What do the third-party dependencies refer to, and which versions are relevant?
Could you give us more examples showing how IMU data translates to text or video in applications? Thanks.
Great job!
Will it support Text/Audio/Image > Video/3D conversion, and approximately when?
Hello,
What is the required format for IMU input embeddings? Or rather, why does T have to be 2000?
I've tried to run the code using sample embeddings as specified in the appendix of the paper:
"For IMU we use a 6×T tensor to represent the sequence of IMU sensor readings over time."
Initially I tried to use a sample from the Ego-4D dataset (https://ego4d-data.org/docs/data/imu/), but this kept throwing size mismatch errors.
I am trying to create a joint embedding for a single
Does this mean the model requires a minimum of 2000 time steps for IMU sensors?
Thank you for your help
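The size mismatch described above comes down to the sequence length T. The following is a minimal sketch, assuming (as the poster observed, not as confirmed by the authors) that the IMU branch expects exactly 6 channels and T = 2000 time steps. The helper `fit_imu_length` is hypothetical, it works on plain Python lists for illustration, and zero-padding is only one possible choice; whether padded inputs give meaningful embeddings is untested here.

```python
from typing import List

TARGET_STEPS = 2000  # assumption: the length the poster reports the model requires
NUM_CHANNELS = 6     # the paper describes IMU input as a 6×T tensor


def fit_imu_length(imu: List[List[float]], target: int = TARGET_STEPS) -> List[List[float]]:
    """Crop or zero-pad a 6×T list of IMU readings to exactly 6×target."""
    if len(imu) != NUM_CHANNELS:
        raise ValueError(f"expected {NUM_CHANNELS} channels, got {len(imu)}")
    fitted = []
    for channel in imu:
        cropped = channel[:target]                  # drop samples past the target length
        padding = [0.0] * (target - len(cropped))   # zero-fill sequences that are too short
        fitted.append(cropped + padding)
    return fitted


short_clip = [[0.1] * 500 for _ in range(NUM_CHANNELS)]
long_clip = [[0.1] * 3000 for _ in range(NUM_CHANNELS)]
short_fitted = fit_imu_length(short_clip)
long_fitted = fit_imu_length(long_clip)
print(len(short_fitted[0]), len(long_fitted[0]))
```

The resulting nested list could then be converted to a tensor (for example with `torch.tensor`) before being passed to the model.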
The model file is more than 4 GB, so what is the minimum GPU requirement? I only have a 3060.
Hey, could someone explain to me (no AI/ML background) how this model could be used to generate images or audio?
I can generate 3 x 3 tensors in code, no problem, but what's the next step to leverage these tensors?
I'm pretty sure I'm not the only one who will stand here and think to himself: "what now?"
I would appreciate a hint or anything that would explain how I could use these tensors without having to read the paper (which I tried but didn't really grasp).
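As a possible answer to "what now?": the model outputs embeddings rather than images or audio, so one direct use is comparing them, for example finding which image is closest to a sound. The sketch below uses made-up 3-dimensional vectors in place of real embeddings, and the file names and the helper `best_match` are hypothetical. Generating new images or audio would need a separate generative model on top of the embeddings.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query: List[float], candidates: Dict[str, List[float]]) -> str:
    """Return the name of the candidate whose vector is most similar to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Toy vectors standing in for real image embeddings (values are invented).
image_embeddings = {
    "dog_image.jpg": [0.9, 0.1, 0.0],
    "car_image.jpg": [0.1, 0.9, 0.1],
    "bird_image.jpg": [0.0, 0.2, 0.9],
}
audio_query = [0.8, 0.2, 0.1]  # stand-in for the embedding of a barking sound

match = best_match(audio_query, image_embeddings)
print(match)
```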
Wonderful work!
In Table 2, the top-1 accuracy on ImageNet1k is 77.7%, which is higher than CLIP (OpenCLIP) by 2.2% (2.0%). But ImageBind did not train the vision encoder or the text encoder, so what makes the results different? Is there anything I missed?
Do you have any plans to release smaller model checkpoints other than imagebind_huge?
meta image-audio AI
Thanks a lot for releasing such amazing work!
We implemented a simple and interesting demo by combining ImageBind with SAM here: ImageBind-SAM, which can segment things with different modalities. The project is still under development.
The basic idea follows IEA: Image Editing Anything and CLIP-SAM, which generate the referring mask with the following steps:
SamAutomaticMaskGenerator
The result is shown below:
Input Model | Modality | Generate Mask |
---|---|---|
car audio | | |
"A car" | | |
The threshold for each box strongly influences the final result; we will do more tests on it!
Hi @likethesky @Celebio @neuhaus @colesbury ,
Thanks for the great work and for paving the way for multimodal AI research. I am new to multimodal AI; I have only worked on computer vision before. I have a small query: how can we use ImageBind to create a video and video captions (subtitles) as outputs, given input audio in another language? I am just curious about applying ImageBind in different applications.
Thanks for the awesome work!
Suppose I have my own audio-text dataset and want to fine-tune only the audio-text modality. How can I achieve that?
If torch 1.13 is intended, the requirements file may need to be updated to change pip wheel to cu116 or cu117. See the links below.
https://pytorch.org/blog/PyTorch-1.13-release/#cuda10.2
https://pytorch.org/get-started/previous-versions/
I'm not sure how many clips are fed into the model. In data.py, load_and_transform_video_data loads 5 clips by default, whereas the paper says 2 clips are sampled in 2 second videos (sec. 3.3). Are these referring to the same thing?
I tried to run the code in the README, and it ran successfully.
But how can I generate images or audio from a prompt like "cat meow"?
Thank you for the very cool work!
I'm having trouble finding your implementation of NCE loss, however. I know @fabawi has implemented a version of this for his LoRA fine-tuning version (kudos). However, if I wanted to train the original ImageBind model completely from scratch how would I do this?
from models import imagebind_model
ModuleNotFoundError: No module named 'models'
I tried using Jupyter Notebook and Spyder. I've also already tried changing environments.
Any idea?
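This error typically appears when Python is not started from the repository root, so the `models` package is not on the import path. A minimal sketch of one workaround follows, assuming the repository was cloned to the hypothetical location `~/ImageBind`; the helper `add_repo_to_path` is illustrative and not part of the repository.

```python
import sys
from pathlib import Path

# Hypothetical clone location; replace with wherever the repository actually lives.
REPO_ROOT = Path.home() / "ImageBind"


def add_repo_to_path(repo_root: Path) -> bool:
    """Put repo_root at the front of sys.path; return True if it was newly added."""
    root = str(repo_root)
    if root in sys.path:
        return False
    sys.path.insert(0, root)
    return True


newly_added = add_repo_to_path(REPO_ROOT)
# After this, `from models import imagebind_model` can resolve,
# provided REPO_ROOT really contains the repository's models/ directory.
print(sys.path[0])
```

Changing the working directory to the repository root before launching Jupyter or Spyder should have the same effect.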
error info:
D:\soft\anaconda3\envs\ImageBind\lib\site-packages\torchvision\transforms\_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Traceback (most recent call last):
File "E:\github\ImageBind\test.py", line 21, in
ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
File "E:\github\ImageBind\data.py", line 135, in load_and_transform_audio_data
waveform, sr = torchaudio.load(audio_path)
File "D:\soft\anaconda3\envs\ImageBind\lib\site-packages\torchaudio\backend\no_backend.py", line 16, in load
raise RuntimeError("No audio I/O backend is available.")
RuntimeError: No audio I/O backend is available.
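On Windows, this error is commonly reported when no package that torchaudio can use for file I/O is installed. The sketch below only checks whether candidate packages are importable; it assumes, without confirmation from this thread, that the installed torchaudio version relies on the `soundfile` package on Windows. If `soundfile` shows as missing, installing it with pip is the usual first thing to try.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: torchaudio of this era falls back to the "soundfile" package on Windows.
report = {name: is_installed(name) for name in ("torchaudio", "soundfile")}
for name, found in report.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```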
Thanks for building it and releasing it opensource!
Such a simple idea in hindsight. It's great it works.
When I try to run the demo in Google Colab, I got the error:
import data
ModuleNotFoundError Traceback (most recent call last)
in <cell line: 2>()
1 import torch
----> 2 import data
3 from models import imagebind_model
4 from models.imagebind_model import ModalityType
5 frames
/content/ImageBind/data.py in
17 from pytorchvideo import transforms as pv_transforms
18 from pytorchvideo.data.clip_sampling import ConstantClipsPerVideoSampler
---> 19 from pytorchvideo.data.encoded_video import EncodedVideo
20
21 from torchvision import transforms
/usr/local/lib/python3.8/site-packages/pytorchvideo/data/__init__.py in
1 # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
2
----> 3 from .ava import Ava # noqa
4 from .charades import Charades # noqa
5 from .clip_sampling import ( # noqa; noqa
/usr/local/lib/python3.8/site-packages/pytorchvideo/data/ava.py in
10 from iopath.common.file_io import g_pathmgr
11 from pytorchvideo.data.clip_sampling import ClipInfo, ClipSampler
---> 12 from pytorchvideo.data.labeled_video_dataset import LabeledVideoDataset
13
14
/usr/local/lib/python3.8/site-packages/pytorchvideo/data/labeled_video_dataset.py in
12
13 from .labeled_video_paths import LabeledVideoPaths
---> 14 from .utils import MultiProcessSampler
15
16
/usr/local/lib/python3.8/site-packages/pytorchvideo/data/utils.py in
14 from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
15
---> 16 import av
17 import numpy as np
18 import torch
/usr/local/lib/python3.8/site-packages/av/__init__.py in
18 # MUST import the core before anything else in order to initalize the underlying
19 # library that is being wrapped.
---> 20 from av._core import time_base, library_versions
21
22 # Capture logging (by importing it).
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
I tried running !pip install av, but that did not resolve the problem.
Following issue 14, I created a small example for thermal embedding. While the Vision x Text and Thermal x Text are working properly, it seems the Vision x Thermal does not yield the correct result.
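import torch
from PIL import Image
from torchvision import transforms


def load_and_transform_thermal_data(thermal_paths, device):
    # Nothing to load for this modality.
    if thermal_paths is None:
        return None
    # The same resize/crop pipeline applies to every image, so build it once.
    data_transform = transforms.Compose(
        [
            transforms.Resize(
                224, interpolation=transforms.InterpolationMode.BICUBIC
            ),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            # transforms.Normalize(
            #     mean=(0.5),
            #     std=(0.5),
            # ),
        ]
    )
    thermal_outputs = []
    for thermal_path in thermal_paths:
        with open(thermal_path, "rb") as fopen:
            # Thermal images are loaded as single-channel grayscale.
            thermal = Image.open(fopen).convert("L")
        thermal = data_transform(thermal).to(device)
        thermal_outputs.append(thermal)
    # Stack into a single (batch, 1, 224, 224) tensor.
    return torch.stack(thermal_outputs, dim=0)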
And the results are:
Vision x Text:
[[9.9997604e-01 2.3943641e-05]
[6.0792509e-06 9.9999392e-01]]
Thermal x Text:
[[1.0000000e+00 1.2433221e-11]
[2.8220674e-02 9.7177935e-01]]
Vision x Thermal Cosine:
[[0.1554441 0.02945926]
[0.16725276 0.03671783]]
Vision x Thermal Softmax:
[[0.7789999 0.22100005]
[0.7867338 0.21326624]]
Hi Authors,
Maybe I missed this while reading the paper: how did you tackle the dataset imbalance problem across modalities? For example, you'll have many more image-text pairs than image-depth or image-IMU pairs.
Hi,
Thanks for your great work.
I am interested in the embedding arithmetic and image retrieval, as the example shown in Figure 4 of the paper.
In the paper, the embedding arithmetic is described as follows:
"For arithmetic, we again use the embedding features after temperature scaling. We ℓ2 normalize the features and sum the embeddings after scaling them by 0.5. We use the combined feature to perform nearest neighbor retrieval using cosine distance, as described above."
To obtain the embedding features after temperature scaling, can I just use the following code?
########## - step 1 - ##########
# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)
This applies normalization and temperature scaling for each modality (except for the image modality, where it only applies normalization). Or should I modify the way the embeddings are returned by removing the normalization part and only doing temperature scaling? https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#LL422C1-L424C10
After obtaining the embedding features after temperature scaling, do I need to apply another ℓ2 normalization, something like:
########## - step 2 - ##########
img_embedding = embeddings[ModalityType.VISION]
txt_embedding = embeddings[ModalityType.TEXT]
img_embedding = img_embedding / torch.norm(img_embedding, dim=-1, keepdim=True)
txt_embedding = txt_embedding / torch.norm(txt_embedding, dim=-1, keepdim=True)
and then combine the embeddings of the two modalities?
combined_embs = 0.5 * img_embedding + 0.5 * txt_embedding
Then, do I just use combined_embs and compute the cosine similarity with the embeddings of the set of images (extracted with step 1) that I want to retrieve from?
I apologize for the long post.
I greatly appreciate any tips and advice on how to approach this issue.
Many thanks!
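The procedure quoted from the paper (ℓ2-normalize, scale by 0.5, sum, then nearest-neighbor retrieval by cosine similarity) can be sketched as below. This is an illustration on plain Python lists with invented vectors, and the helper names are hypothetical. It does not settle whether the embeddings returned by the model need to be re-normalized first, which only the authors can confirm.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine(a: Sequence[float], b: Sequence[float], weight: float = 0.5) -> List[float]:
    """Normalize both vectors, scale each by `weight`, and sum them."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return [weight * x + weight * y for x, y in zip(a_n, b_n)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the dot product of unit vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> int:
    """Index of the gallery vector with the highest cosine similarity to the query."""
    scores = [cosine(query, g) for g in gallery]
    return scores.index(max(scores))


# Toy vectors standing in for an image embedding, a text embedding, and a gallery.
img_embedding = [3.0, 0.0, 0.0]
txt_embedding = [0.0, 2.0, 0.0]
gallery = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

combined_embs = combine(img_embedding, txt_embedding)
best_index = retrieve(combined_embs, gallery)
print(combined_embs, best_index)
```

In this toy case the combined query points halfway between the image and text directions, so the gallery item containing both components is retrieved.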
Reference: https://github.com/ChaoningZhang/MobileSAM
Our project performs on par with the original SAM and keeps exactly the same pipeline as the original SAM except for a change to the image encoder; therefore, it is easy to integrate into any project.
MobileSAM is around 60 times smaller and around 50 times faster than the original SAM, and it is around 7 times smaller and around 5 times faster than the concurrent FastSAM. The comparison of the whole pipeline is summarized as follows:
I guess it should be
pip install -r requirement.txt
in readme.md
Something seems to be wrong with bpe_simple_vocab_16e6.txt.gz. I get this error upon execution and am stuck on it. Any help will be appreciated, as I am unable to move further.
ModalityType.TEXT: data.load_and_transform_text(text_list, device),
File "/Users/FD00199/Downloads/data.py", line 109, in load_and_transform_text
tokenizer = SimpleTokenizer(bpe_path=BPE_PATH)
File "/Users/FD00199/Downloads/models/multimodal_preprocessors.py", line 505, in __init__
merges = gzip.open(bpe_bytes).read().decode("utf-8").split("\n")
File "/Users/FD00199/miniconda3/envs/imagebind/lib/python3.8/gzip.py", line 292, in read
return self._buffer.read(size)
File "/Users/FD00199/miniconda3/envs/imagebind/lib/python3.8/gzip.py", line 479, in read
if not self._read_gzip_header():
File "/Users/FD00199/miniconda3/envs/imagebind/lib/python3.8/gzip.py", line 427, in _read_gzip_header
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'\n\n')
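The magic bytes b'\n\n' reported in the traceback suggest the file on disk is not gzip data at all; one plausible cause (an assumption, not confirmed in this thread) is that a web page was saved instead of the raw file. A quick way to check is to compare the first two bytes against the gzip signature. The helper `looks_like_gzip` and the temporary files below are illustrative only.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_like_gzip(path: str) -> bool:
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.txt.gz")
    bad = os.path.join(tmp, "bad.txt.gz")
    with gzip.open(good, "wb") as handle:
        handle.write(b"merges\n")               # a real gzip file
    with open(bad, "wb") as handle:
        handle.write(b"\n\n<!DOCTYPE html>")    # e.g. a saved web page with a .gz name
    results = (looks_like_gzip(good), looks_like_gzip(bad))

print(results)
```

If the check fails for bpe_simple_vocab_16e6.txt.gz, downloading the raw file again from the repository is the likely remedy.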
An amazing work!!!
It's well known that https://github.com/lucidrains/DALLE2-pytorch and https://github.com/LAION-AI/dalle2-laion use open-clip as the pretrained text and image encoder. However, I noticed that you used a private DALLE-2 to generate images conditioned on audio.
Is it possible to use an open-source DALLE-2 instead of the private reimplementation? Are there any problems with the open-source DALLE-2? I would appreciate it if you could share your experience.
In my view, if it were possible to adapt an open-source DALLE-2 to ImageBind, it could directly enable some very interesting applications and increase the impact of this work!
Fantastic work! I have been evaluating the model using sound files of different lengths. For sounds shorter (500ms in this example) than the 2 second audio clips used to train, I get the following warning:
WARNING:root:Large gap between audio n_frames(48) and target_length (204). Is the audio_target_length setting correct?
My question is how do sound clips of varying length affect the embedding output? In other words, can I still use embeddings from shorter clips, or should I duplicate shorter sounds to approximate the 2 seconds expected by the model?
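One of the options raised above, duplicating a short sound until it fills the expected duration, can be sketched as follows. The 2-second target comes from the post; the 16 kHz sample rate and the helper `tile_to_length` are assumptions for illustration. Whether tiling gives better embeddings than using the short clip as-is is not established here.

```python
from typing import List

SAMPLE_RATE = 16_000   # assumption: samples per second after resampling
CLIP_SECONDS = 2.0     # clip duration mentioned in the post


def tile_to_length(samples: List[float], target_len: int) -> List[float]:
    """Repeat a waveform end to end, then trim it to exactly target_len samples."""
    if not samples:
        raise ValueError("cannot tile an empty waveform")
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


target_len = int(SAMPLE_RATE * CLIP_SECONDS)
short = [float(i % 7) for i in range(8000)]   # a synthetic 0.5 s clip
tiled = tile_to_length(short, target_len)
print(len(short), len(tiled))
```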
At least give some scripts.
Really great work! I'm particularly interested in the IMU and audio modalities. Could you add some IMU data examples? I don't see any in the .assets folder. It would be great to know more about the expected format so people can play around with this and explore new possibilities.
Thanks!
The issue is that in the introduction section of the document, the link to the paper is marked as "TBD" instead of providing a valid link. This should be fixed by adding the correct link to the paper.
**[ImageBind: One Embedding Space To Bind Them All](TBD)**
I created a simple ImageBind finetuning example using LoRA:
https://github.com/fabawi/ImageBind-LoRA
Make sure you clone it recursively to include the example dataset:
git clone --recurse-submodules -j8 git@github.com:fabawi/ImageBind-LoRA.git
Install the requirements following the instructions provided in this repo, and run train.py.
This should log your checkpoints, as well as separate LoRA weights in case you'd like to update the original model without saving all the model parameters. More examples and finer control will be added soon.
I cloned this repo into PyCharm and copied the initial file. When I ran "python file.py", it began downloading 5 GB of data. Did I do something wrong, or is this what it's supposed to do?
Thanks for helping out!
import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list = ["A dog.", "A car", "A bird"]
image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths = [".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Text: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Audio x Text: ",
    torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Vision x Audio: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)
I've been getting this error when trying out the model:
ModuleNotFoundError Traceback (most recent call last)
in <cell line: 3>()
1 import data
2 import torch
----> 3 from models import imagebind_model
4 from models.imagebind_model import ModalityType
5
ModuleNotFoundError: No module named 'models'
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
To view examples of installing some common dependencies, click the
"Open Examples" button below.
Are there any plans to release the code used to evaluate the model in the experiments described in your paper?
Thanks for great work!
I want to use depth embeddings in ImageBind, but I cannot get good results.
Please explain how to use depth embeddings.
・Run a depth estimator and create a depth image
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
text = "bird"
image = Image.open(f"/content/ImageBind/.assets/{text}_image.jpg")
encoding = feature_extractor(image, return_tensors="pt")
# forward pass
with torch.no_grad():
    outputs = model(**encoding)
    predicted_depth = outputs.predicted_depth
# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()
output = prediction.cpu().numpy()
formatted = (output * 255 / np.max(output)).astype('uint8')
img = Image.fromarray(formatted)
img.save(f"/content/ImageBind/.assets/{text}_depth.jpg")
・after that, inference with the following code
from torchvision import transforms
from PIL import Image

def load_and_transform_depth_data(depth_paths, device):
    if depth_paths is None:
        return None
    depth_ouputs = []
    for depth_path in depth_paths:
        data_transform = transforms.Compose(
            [
                transforms.Resize(
                    224, interpolation=transforms.InterpolationMode.BICUBIC
                ),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                # transforms.Normalize((0.5, ), (0.5, ))  # if I use this normalization, I cannot get good results...
            ]
        )
        with open(depth_path, "rb") as fopen:
            image = Image.open(fopen).convert("L")
        image = data_transform(image).to(device)
        depth_ouputs.append(image)
    return torch.stack(depth_ouputs, dim=0)
import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType
text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]
depth_paths = [".assets/dog_depth.jpg", ".assets/car_depth.jpg", ".assets/bird_depth.jpg"]
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)
# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.DEPTH: load_and_transform_depth_data(depth_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)
print(
    "Vision x Depth: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Text x Depth: ",
    torch.softmax(embeddings[ModalityType.TEXT] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Depth x Audio: ",
    torch.softmax(embeddings[ModalityType.DEPTH] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)
・output
Vision x Depth: tensor([[0.3444, 0.3040, 0.3516],
[0.3451, 0.2363, 0.4186],
[0.3517, 0.3634, 0.2849]], device='cuda:0')
Text x Depth: tensor([[9.5571e-01, 4.4270e-02, 1.5210e-05],
[5.6266e-01, 4.3734e-01, 9.7014e-10],
[4.6230e-06, 1.0000e+00, 7.2704e-15]], device='cuda:0')
Depth x Audio: tensor([[1.9618e-01, 1.4769e-02, 7.8905e-01],
[1.5248e-02, 4.6171e-03, 9.8014e-01],
[1.5896e-04, 1.8075e-02, 9.8177e-01]], device='cuda:0')
Please reply!
Thanks for your wonderful work.
I am very excited about your idea. May I ask what computation budget was used to train the largest ImageBind model? How many GPU hours did you use?
The license stated in the model card file disagrees with the other locations (README file, LICENSE file).
See pull request #4.
Thanks very much for releasing such insightful work!
We developed Point-Bind, a project based on ImageBind that aligns the 3D point cloud modality with image, text, and audio. Our project has four main characteristics:
The multi-modality LLaMA-Adapter (ImageBind-LLM) with Point-Bind's 3D embeddings is as follows:
Thanks!
pytorch1.13.1+cuda11.6 :
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)
Hey,
I just want to know whether sklearn's cosine_similarity can replace the softmax.
Thanks
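The two operations do different jobs, so one is not a drop-in replacement for the other: cosine similarity gives a raw score for each pair, while softmax turns a row of scores into values that sum to 1. The sketch below uses plain Python and invented scores rather than sklearn, purely to show the relationship. Because softmax preserves ordering, the best match is the same either way.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into positive values that sum to 1."""
    peak = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Invented cosine scores for one query against three candidates.
similarities = [0.9, 0.2, 0.1]
probabilities = softmax(similarities)

best_by_similarity = similarities.index(max(similarities))
best_by_probability = probabilities.index(max(probabilities))
print(best_by_similarity, best_by_probability, round(sum(probabilities), 6))
```

For ranking or retrieval, the raw similarities are therefore sufficient; softmax is mainly useful when a normalized distribution over candidates is wanted.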
The cartopy install fails with the following error.
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for cartopy
Failed to build cartopy
ERROR: Could not build wheels for cartopy, which is required to install pyproject.toml-based projects
The fix is to install the dependency.
sudo apt -y install libgeos-dev
Logging this here so that it becomes part of the documentation.