
anything2image's Introduction

Anything To Image


Generate images from anything using ImageBind's unified latent space and stable-diffusion-2-1-unclip.

TODO: Currently, we only support ImageBind-Huge with a 1024-dimensional latent space. However, it might be possible to use StableDiffusionImageVariation for the 768-dimensional latent space.

The demo requires at least 22 GB of GPU memory, so the Gradio and Colab online demos may need a Pro account to obtain enough GPU memory to run.
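
If memory is tight, diffusers offers pipeline-level memory savers. Below is a minimal sketch, not the project's own code; enable_model_cpu_offload requires accelerate, and availability of these methods depends on your diffusers version.

import torch
from diffusers import StableUnCLIPImg2ImgPipeline

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
)
# move each submodule to the GPU only while it runs; trades speed for memory
pipe.enable_model_cpu_offload()
# compute attention in slices to lower peak memory
pipe.enable_attention_slicing()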

Supported Tasks

  • Audio to Image
  • Audio+Text to Image
  • Audio+Image to Image
  • Image to Image
  • Text to Image
  • Thermal to Image
  • Depth to Image: coming soon.

Update

[2023/5/19]:

  • Anything2Image has been integrated into InternGPT.
  • [v1.1.4]: Support fusing audio and text in ImageBind latent space and UI improvements.

[2023/5/18]:

  • [v1.1.3]: Support thermal to image.
  • [v1.1.0]: Gradio GUI - add options for controlling image size and noise scheduler.
  • [v1.0.8]: Gradio GUI - add options for controlling noise level, audio-image embedding arithmetic strength, and number of inference steps.
Demo video: anything2image.mp4

Getting Started

Requirements

Ensure you have PyTorch installed.

  • Python >= 3.8
  • PyTorch >= 1.13

Then install anything2image:

# from pypi
pip install anything2image
# or install locally via git clone
git clone git@github.com:Zeqiang-Lai/Anything2Image.git
cd Anything2Image
pip install .
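
To sanity-check the installation, importing the package should succeed:

python -c "import anything2image"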

Usage

# launch gradio demo
python -m anything2image.app
# command line demo; see also the task examples below
python -m anything2image.cli --audio assets/wav/cat.wav

Audio to Image

Sample audio inputs: bird_audio.wav, dog_audio.wav, cattle.wav, cat.wav, fire_engine.wav, train.wav, motorcycle.wav, plane.wav.
python -m anything2image.cli --audio assets/wav/cat.wav

See also audio2img.py.

import anything2image.imagebind as ib
import torch
from diffusers import StableUnCLIPImg2ImgPipeline

# construct models
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
model = ib.imagebind_huge(pretrained=True).eval().to(device)

# generate image
with torch.no_grad():
    audio_paths=["assets/wav/bird_audio.wav"]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(audio_paths, device),
    })
    embeddings = embeddings[ib.ModalityType.AUDIO]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("audio2img.png")

Audio+Text to Image

Examples: cat.wav and bird_audio.wav, each paired with the prompts "A painting" and "A photo".
python -m anything2image.cli --audio assets/wav/cat.wav --prompt "a painting"

See also audiotext2img.py.

with torch.no_grad():
    audio_paths=["assets/wav/bird_audio.wav"]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(audio_paths, device),
    })
    embeddings = embeddings[ib.ModalityType.AUDIO]
    images = pipe(prompt='a painting', image_embeds=embeddings.half()).images
    images[0].save("audiotext2img.png")
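
The v1.1.4 update above also mentions fusing audio and text directly in ImageBind's latent space rather than through the pipeline's prompt. A minimal sketch of what such fusion could look like, reusing the model and pipe constructed earlier; the equal weights and the output filename are assumptions, not the project's actual implementation:

with torch.no_grad():
    # hypothetical fusion weights; the project's actual scheme may differ
    w_audio, w_text = 0.5, 0.5
    audio_emb = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(["assets/wav/cat.wav"], device),
    })[ib.ModalityType.AUDIO]
    text_emb = model.forward({
        ib.ModalityType.TEXT: ib.load_and_transform_text(["a painting"], device),
    })[ib.ModalityType.TEXT]
    # fuse the two modalities in ImageBind's shared latent space
    fused = w_audio * audio_emb + w_text * text_emb
    images = pipe(image_embeds=fused.half()).images
    images[0].save("audiotext_fused.png")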

Audio+Image to Image

Examples: wave.wav fused with different input images (audio & image inputs shown with their outputs).
python -m anything2image.cli --audio assets/wav/wave.wav --image "assets/image/bird.png"

with torch.no_grad():
    # image embedding (normalized by default)
    embeddings = model.forward({
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(["assets/image/bird.png"], device),
    })
    img_embeddings = embeddings[ib.ModalityType.VISION]
    # audio embedding, kept unnormalized (normalize=False)
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(["assets/wav/wave.wav"], device),
    }, normalize=False)
    audio_embeddings = embeddings[ib.ModalityType.AUDIO]
    # average the two embeddings, then decode with the unCLIP pipeline
    embeddings = (img_embeddings + audio_embeddings) / 2
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("audioimg2img.png")

Image to Image

Top: Input Images. Bottom: Generated Images.

python -m anything2image.cli --image "assets/image/bird.png"

See also img2img.py.

with torch.no_grad():
    paths=["assets/image/room.png"]
    embeddings = model.forward({
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(paths, device),
    }, normalize=False)
    embeddings = embeddings[ib.ModalityType.VISION]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("img2img.png")

Text to Image

Example prompts: "A photo of a car.", "A sunset over the ocean.", "A bird's-eye view of a cityscape.", "A close-up of a flower."

It is not necessary to use ImageBind for text to image. Nevertheless, we use it here to show the alignment between ImageBind's text latent space and its image latent space.

python -m anything2image.cli --text "A sunset over the ocean."

See also text2img.py.

with torch.no_grad():
    embeddings = model.forward({
        ib.ModalityType.TEXT: ib.load_and_transform_text(['A photo of a car.'], device),
    }, normalize=False)
    embeddings = embeddings[ib.ModalityType.TEXT]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("text2img.png")

Thermal to Image


Top: Input Images. Bottom: Generated Images.

python -m anything2image.cli --thermal "assets/thermal/030419.jpg"

See also thermal2img.py.

with torch.no_grad():
    thermal_paths =['assets/thermal/030419.jpg']
    embeddings = model.forward({
        ib.ModalityType.THERMAL: ib.load_and_transform_thermal_data(thermal_paths, device),
    }, normalize=True)
    embeddings = embeddings[ib.ModalityType.THERMAL]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("thermal2img.png")

Citation

Latent Diffusion

@InProceedings{Rombach_2022_CVPR,
    author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
    title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {10684-10695}
}

ImageBind

@inproceedings{girdhar2023imagebind,
  title={ImageBind: One Embedding Space To Bind Them All},
  author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
  booktitle={CVPR},
  year={2023}
}

anything2image's People

Contributors

zeqiang-lai


anything2image's Issues

Status on depth2image

Hi,
I've been working on a project that involves depth2image and thought it would be interesting to try your idea with ImageBind.
So I wonder when you will release the depth/thermal part.
Furthermore, if you have not started developing them yet, maybe I can contribute.
Looking forward to your reply!


How to generate image from Image+Text?

Hi.
Thanks for the great work you have provided.
In the readme I saw that there are several supported tasks:

Audio to Image
Audio+Text to Image
Audio+Image to Image
Image to Image
Text to Image
Thermal to Image
Depth to Image: Coming soon.

I am new to this type of application, so I was wondering if it is possible to generate an image from image + text? For example, given an image of a dog and the text "pink flowers", I would like to generate an image that contains a dog and pink flowers.
If so, could you provide the code for an example? I was looking at the code in api.py and I am a bit confused about the use of the prompt and text. Moreover, do I need to normalize the embeddings of the image and text before summing them together, or should I normalize the summed embedding?

I greatly appreciate your help.
Thanks.
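
By analogy with the Audio+Image example in the README above, here is a minimal sketch of one way to fuse image and text embeddings, reusing the model and pipe setup from the README. The plain average, the normalization defaults, and the file names are assumptions, not a confirmed answer from the maintainer:

with torch.no_grad():
    img_emb = model.forward({
        # "dog.png" is a hypothetical path for illustration
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(["dog.png"], device),
    })[ib.ModalityType.VISION]
    text_emb = model.forward({
        ib.ModalityType.TEXT: ib.load_and_transform_text(["pink flowers"], device),
    })[ib.ModalityType.TEXT]
    # plain average, mirroring the Audio+Image example; the weighting is a guess
    embeddings = (img_emb + text_emb) / 2
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("imgtext2img.png")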

ModuleNotFoundError: No module named 'tensorboard' / ModuleNotFoundError: No module named 'google'

When running the command python -m anything2image.app, I met this problem:


Traceback (most recent call last):
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/accelerate/tracking.py", line 43, in <module>
    from torch.utils import tensorboard
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py", line 1, in <module>
    import tensorboard
ModuleNotFoundError: No module named 'tensorboard'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dell/anaconda3/envs/i/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dell/anaconda3/envs/i/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dell/桌面/Anything2Image/anything2image/cli.py", line 3, in <module>
    from anything2image.api import Anything2Image
  File "/home/dell/桌面/Anything2Image/anything2image/api.py", line 4, in <module>
    from diffusers import StableUnCLIPImg2ImgPipeline
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/diffusers/__init__.py", line 3, in <module>
    from .configuration_utils import ConfigMixin
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/diffusers/configuration_utils.py", line 34, in <module>
    from .utils import (
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/diffusers/utils/__init__.py", line 21, in <module>
    from .accelerate_utils import apply_forward_hook
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/diffusers/utils/accelerate_utils.py", line 24, in <module>
    import accelerate
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/accelerate/accelerator.py", line 41, in <module>
    from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
  File "/home/dell/anaconda3/envs/i/lib/python3.8/site-packages/accelerate/tracking.py", line 45, in <module>
    import tensorboardX as tensorboard
  File "/home/dell/.local/lib/python3.8/site-packages/tensorboardX-2.6.2-py3.8.egg/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/home/dell/.local/lib/python3.8/site-packages/tensorboardX-2.6.2-py3.8.egg/tensorboardX/torchvis.py", line 10, in <module>
    from .writer import SummaryWriter
  File "/home/dell/.local/lib/python3.8/site-packages/tensorboardX-2.6.2-py3.8.egg/tensorboardX/writer.py", line 16, in <module>
    from .comet_utils import CometLogger
  File "/home/dell/.local/lib/python3.8/site-packages/tensorboardX-2.6.2-py3.8.egg/tensorboardX/comet_utils.py", line 5, in <module>
    from google.protobuf.json_format import MessageToJson
ModuleNotFoundError: No module named 'google'
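
A plausible fix (untested here): both failing imports are ordinary dependencies, so installing them directly may resolve the error; the stray tensorboardX egg under ~/.local may also need a clean reinstall.

pip install tensorboard protobuf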

AttributeError: module 'jax.tree_util' has no attribute 'register_pytree_with_keys_class'

Hello!

I've tried to run the code on Google Colab but encountered the following issue:

AttributeError: module 'jax.tree_util' has no attribute 'register_pytree_with_keys_class'

This code produced the error:

! git clone https://github.com/Zeqiang-Lai/Anything2Image.git
%cd Anything2Image
! pip install -r requirements.txt
import anything2image.imagebind as ib
import torch
from diffusers import StableUnCLIPImg2ImgPipeline

# construct models
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
model = ib.imagebind_huge(pretrained=True).eval().to(device)

This was however on the free Colab (T4 GPU). Not sure if switching to Pro would get rid of that error?

I tried adding this line of code:
!pip install "jax<=0.3.16" "jaxlib<=0.3.16"
but it didn't resolve the issue.

Any suggestions?
Thanks!!
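
A plausible workaround (untested here): this error usually points to a version mismatch between jax and packages built against a newer jax, so upgrading jax and jaxlib in the Colab runtime, rather than pinning them to 0.3.16, may help.

!pip install -U jax jaxlib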


App is not working fully yet?

Steps to reproduce:

  • Start the app
  • Click dog_audio.wav
  • Click Submit

It processes for a while and then shows <PIL.Image.Image image mode=RGB size=768x768 at 0x176753EF700> in the output. No errors are shown on the command line, and no image is created.
