
zero123plus's People

Contributors

0armaan025, colin97, dustinpro, eliphatfs, eltociear, harshhere905, jd7h, kushal34712, ootts, pentesterpriyanshu, rfeinman, sanyam-2026, shivam250702, yvrjsharma


zero123plus's Issues

camera pose

Are the novel-view images rendered in Blender? Could you provide the specific rotation matrices to facilitate computing the camera poses?
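Not an official answer, but for anyone who wants to turn published elevation/azimuth angles into a pose matrix, a minimal look-at sketch follows. The axis conventions (Z-up world, OpenGL-style camera looking down its own -Z axis) are assumptions and need to be matched to your renderer (Blender, COLMAP, etc.).

import numpy as np

def camera_pose(azimuth_deg, elevation_deg, radius=1.5):
    # Camera-to-world pose for a camera on a sphere around the origin, looking at the origin.
    # Assumed conventions: world +Z is up, azimuth measured in the XY plane from +X,
    # elevation measured upward from the XY plane, camera looks down its own -Z axis.
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    cam_pos = radius * np.array([np.cos(el) * np.cos(az),
                                 np.cos(el) * np.sin(az),
                                 np.sin(el)])
    forward = -cam_pos / np.linalg.norm(cam_pos)          # points from camera to origin
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))  # degenerate at el = +-90 deg
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, :3] = np.stack([right, up, -forward], axis=1)  # columns: camera x, y, z axes in world
    pose[:3, 3] = cam_pos
    return pose

print(camera_pose(azimuth_deg=30, elevation_deg=30))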

How to complete 3D reconstruction

How do we complete the 3D reconstruction?
The generated images cannot be reconstructed using COLMAP.
Could you share the preset camera poses?

Some questions regarding the input and output latent scaling

I noticed in the code that when generating from Gaussian noise, we need to first apply unscale_latents, then divide by vae.config.scaling_factor, then VAE-decode, and finally unscale the images to get the final image. However, when I try to directly denoise an input image, how should I apply the scaling operations during the encoding and decoding process? I tried the following code:

renderings = scale_images(images)
particles = vae.encode(renderings).latent_dist.sample() * vae.config.scaling_factor
particles = scale_latents(particles)
t = torch.tensor([1], device=device).long()
n = torch.randn_like(particles)
y_noisy = scheduler.add_noise(particles, n, t)
n_pred = predict_noise0_diffuser(Diff_pre, y_noisy, text_embeddings, t, guidance_scale=args.cfg, cross_attention_kwargs=cross_attention_kwargs, scheduler=scheduler)

predict_x = (y_noisy - (1-scheduler.alphas_cumprod.to(device)[t].reshape(-1, 1, 1, 1))**0.5 * n_pred) / scheduler.alphas_cumprod.to(device)[t].reshape(-1, 1, 1, 1)**0.5
predict_x = unscale_latents(predict_x)
image = pipe.vae.decode(predict_x / vae.config.scaling_factor, return_dict=False)[0]
image = unscale_images(image)
result = pipe.image_processor.postprocess(image, output_type='pil')

However, the denoised output (left) has a different color than the input image (right):

If I delete all the scale/unscale functions, the results seem to be correct, so I am confused about how these scale/unscale functions are supposed to be used.
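Not an official answer, but one way to sanity-check this is to write the encode/decode round trip so that every scaling applied at encode time is undone in reverse order at decode time, mirroring the order described above for generation. A minimal sketch under that assumption (scale_image/scale_latents are taken to be exact inverses of the pipeline's unscale_image/unscale_latents):

import torch

def encode_to_latents(vae, images, scale_image, scale_latents):
    # Forward path: image scaling -> VAE encode -> SD scaling factor -> latent scaling.
    images = scale_image(images)
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    return scale_latents(latents)

def decode_to_images(vae, latents, unscale_latents, unscale_image):
    # Inverse path: undo every step of encode_to_latents() in reverse order.
    latents = unscale_latents(latents)
    images = vae.decode(latents / vae.config.scaling_factor, return_dict=False)[0]
    return unscale_image(images)

# With matching scale/unscale pairs, decode_to_images(encode_to_latents(x)) should
# reproduce x up to VAE reconstruction error; a global color shift usually means one
# of the steps is applied on only one side of the round trip, or in the wrong order.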

What is the "scaled-linear noise scheduler"?

Great work! You mention in the paper that the original noise schedule in Stable Diffusion is a "scaled-linear" schedule.

Could you please provide more details about the "scaled-linear schedule"?

I found that Stable Diffusion did not release its training scripts. Both the latent diffusion model (https://github.com/CompVis/latent-diffusion) and Zero123 (https://github.com/cvlab-columbia/zero123) seem to use the default linear noise schedule. Did I miss something?
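For reference (a diffusers detail rather than something specific to this repo): "scaled_linear" is the beta schedule Stable Diffusion uses, where the square roots of the betas are spaced linearly and then squared, rather than the betas themselves. A minimal sketch comparing the two, with the usual SD beta range assumed:

import torch

num_train_timesteps, beta_start, beta_end = 1000, 0.00085, 0.012  # typical SD values

# "linear": the betas themselves are linearly spaced (original DDPM-style schedule).
betas_linear = torch.linspace(beta_start, beta_end, num_train_timesteps)

# "scaled_linear": sqrt(beta) is linearly spaced, then squared (Stable Diffusion's schedule).
betas_scaled = torch.linspace(beta_start ** 0.5, beta_end ** 0.5, num_train_timesteps) ** 2

# The resulting alpha_bar curves (and hence the terminal SNR the paper discusses) differ.
alpha_bar_linear = torch.cumprod(1.0 - betas_linear, dim=0)
alpha_bar_scaled = torch.cumprod(1.0 - betas_scaled, dim=0)
print(alpha_bar_linear[-1].item(), alpha_bar_scaled[-1].item())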

Thank you.

Proposal for Collaborative 3D Reconstruction Between Two Projects

Hello,

I stumbled upon your project on GitHub (https://github.com/SUDO-AI-3D/zero123plus) and found it to be quite fascinating. An idea struck me about orchestrating a collaborative effort with another project (https://github.com/dreamgaussian/dreamgaussian) for 3D reconstruction. While I might not fully grasp the technical intricacies, I believe this collaboration could potentially enhance the quality of the output.

Specifically, the idea entails utilizing your tool to convert a single image into six images from different perspectives, then processing each of these images with the DreamGaussian tool for generation. Subsequently, these six generated models could be integrated to reconstruct a refined 3D model.

If this idea is feasible and holds the potential to improve the quality of the project, I would be thrilled if you could consider its implementation. I hope that this proposal brings value to your project, and should it pique your interest, I look forward to delving further into discussions.

Thank you for your consideration.

Warm regards,

[katyukuki]

Training details

Hello, thank you for the great work.
I would kindly ask you for some training details. In particular:

  1. How many and which type of GPUs did you use for training, and how long did the training take?
  2. Which exact SD checkpoint was fine-tuned? The report says it is the SD 2 v-model (Sec. 2.5), but it would be helpful to have a link to the exact SD checkpoint from which the fine-tuning was initialized.

Thanks in advance

Adding Contributors section to the readme.md

Why a Contributors section: A "Contributors" section in a repo gives credit to and acknowledges the people who have helped with the project, fosters a sense of community, and helps others know who to contact for questions or issues related to the project.


about FOV for rendering dataset

Firstly, thank you so much for sharing such an impressive work!

I'd like to know the FOV you used when rendering the training data. This way, I can quickly integrate my rendered content into new viewpoint generation, thereby achieving texture mapping.

Thank you very much!

Question about the input single image

Hi, thank you for your outstanding work.

I have always had a question regarding the 'Single Image to 3D' tasks. I've noticed that almost all images input into these systems are in a front-facing and upright position. I'm uncertain if I would get equally perfect results when inputting images that are viewed from the bottom or upside down. I believe these latter cases could play a crucial role in many completion tasks.

getting normals/depth

Hey,

I wanted to ask: is there a way in the current implementation to get the normals or depth for the generated views (or the input image)?

data of depth ControlNet

Dear authors,
I want to ask: must the data for the depth ControlNet be in the 6-sub-image format? I think it is hard to obtain multi-view depth images.

PytorchStreamReader failed reading zip archive: failed finding central directory

When I run app.py, the following error occurs:


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.


  You can now view your Streamlit app in your browser.

  Network URL: http://10.119.70.148:8501
  External URL: http://144.48.107.18:8501

2023-10-24 13:47:51.466322: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
text_encoder/model.safetensors not found
Loading pipeline components...: 100%|███████████████████████████████| 8/8 [00:03<00:00,  2.25it/s]
2023-10-24 13:48:09.295 Uncaught app exception
Traceback (most recent call last):
onda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 245, in _get_or_create_cached_value
    cached_result = cache.read_result(value_key)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 447, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/homaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 293, in _handle_cache_miss
    cached_result = cache.read_result(value_key)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/homeanaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 447, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "/home//download/zero123plus/app.py", line 206, in <module>
    SAMAPI.get_instance()
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 194, in wrapper
    return cached_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 223, in __call__
    return self._get_or_create_cached_value(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 248, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 302, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//download/zero123plus/app.py", line 41, in get_instance
    sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/segment_anything/build_sam.py", line 15, in build_sam_vit_h
    return _build_sam(
           ^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/segment_anything/build_sam.py", line 105, in _build_sam
    state_dict = torch.load(f)
                 ^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/torch/serialization.py", line 797, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/torch/serialization.py", line 283, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
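Not part of the original report, but this particular RuntimeError usually means the checkpoint file on disk is truncated or corrupted (e.g. an interrupted download) rather than a code problem. A minimal sketch to validate and re-fetch the weights, assuming app.py points sam_checkpoint at the default ViT-H SAM checkpoint:

import os
import urllib.request
import zipfile

# Assumption: the default ViT-H checkpoint from the segment_anything release is used.
SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
sam_checkpoint = "sam_vit_h_4b8939.pth"

def ensure_sam_checkpoint(path=sam_checkpoint):
    # Recent torch.save checkpoints are zip containers; "failed finding central directory"
    # is the error raised when the archive is incomplete.
    if not os.path.exists(path) or not zipfile.is_zipfile(path):
        print("Checkpoint missing or corrupted, downloading again...")
        urllib.request.urlretrieve(SAM_URL, path)

ensure_sam_checkpoint()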

training data

Hi! I understand that you used Objaverse for training. I was wondering if you have the training data with already-rendered views in the depth-grid format, to avoid redoing the entire rendering and depth estimation.

Discussion/question: What to do with the generated images ? (texturing 3d models)

Hi, the tool works great. Thanks a lot!

Do you know of any good workflow for texturing my 3D model, though? I already projected the initial texture from the front, but now I need to texture the rest of my monster's body, at least from the side and back.

I thought maybe I could use the stencil projection mode in texture painting -> which camera settings would I need to set in Blender or Substance Painter to exactly match the generated images from the different angles?

If you know of any Python code to do this automatically, that would be nice too. zero12345 wouldn't work, as it generates its own mesh; I already have my mesh, from which I made the first original texture (front projection).

Maybe this could be partly scripted in Blender to become a useful and outstanding texture-pipeline tool.

EDIT: Now that I've thought about it: would it be possible to send an arbitrary (or fixed) angle to Zero123++?
Like, we move around the object in Blender; when we find a good angle (this might differ between objects), we send the camera position and angle (and maybe even an image for ControlNet) to Zero123++, generate the next image, receive it... then move around again, send to Zero123++ again, and receive the next image.
The intermediate steps of the diffusion could be kept in the meantime so we always get the consistent style and shading from the first/initial picture (similar to how ComfyUI only re-evaluates changed parts of the diffusion pipeline), just like it does now.
Maybe ComfyUI would be the tool to make this all a lot easier, as it can already communicate with other programs such as Blender, has a really good codebase, and makes integrating new modules "easy" without having to write everything from scratch.

Is there an approach to cleanly remove the gray background?

Thanks for your valuable work! But I'm wondering why Zero123++ generates images with a gray background, which causes a lot of inconvenience when feeding the synthesized multi-view images to image-to-3D reconstruction pipelines such as OpenLRM or NeuS (also Instant3D, if it's open-sourced in the future), since most of these methods require a white background for the input images. Using the rembg package to remove the gray background leads to inconsistent and wrong segmentations, especially when there are image regions close to the gray color. So I'm wondering: is there an approach to cleanly remove the gray background and generate consistent white-background images? Also, I'm curious whether you directly feed the gray-background images to the 3D diffusion model as conditions, as described in the One-2-3-45++ paper.
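Not an official answer, but a common workaround is to cut the object out and composite it over white rather than trying to key out the gray directly; refining the rembg mask with SAM (as proposed in the "Add SAM for gradio" issue below) helps with the gray-ish regions. A minimal sketch of the compositing step, assuming the generated grid is available as a PIL image:

from PIL import Image
import rembg

def to_white_background(image):
    # rembg returns an RGBA image whose alpha channel masks out the (gray) background.
    rgba = rembg.remove(image)
    white = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
    return Image.alpha_composite(white, rgba).convert("RGB")

# result = pipeline(cond, num_inference_steps=28).images[0]
# white_bg = to_white_background(result)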

could camera parameters be changed?

Thank you for your excellent work!
I'm impressed by the quality of this model.
I was wondering if we could change the camera parameters.
Thanks

Adding License section in Readme

Update the LICENSE section in the README file to enhance its presentation: reformat the text more cleanly and include a hyperlink to the LICENSE file.
I would like to work on this.

Increasing the image size in the diffuser pipeline causes the output image to collapse

The default size for the ControlNet images is 640x960, but this makes each sub-image coarse.
So we changed the pipeline to output at 1024x1536, and the resulting image fell apart.

Image output at 640x960 (attached)
Image output at 1024x1536 (attached)

Is there any way to increase the size of the output image?

Here is the code we used

import torch
import requests
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler,ControlNetModel
import rembg

# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1", custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16
)
pipeline.add_controlnet(ControlNetModel.from_pretrained(
    "sudo-ai/controlnet-zp11-depth-v1", torch_dtype=torch.float16
), conditioning_scale=0.75)
# Feel free to tune the scheduler
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing='trailing'
)
pipeline.to('cuda:0')
# Run the pipeline
cond = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/0_cond.png", stream=True).raw)
depth = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/0_depth.png", stream=True).raw)

cond = cond.resize((512, 512))
depth = depth.resize((1024, 1536))
result = pipeline(cond, width=1024, height=1536, depth_image=depth, num_inference_steps=28).images[0]
#result = pipeline(cond,depth_image=depth, num_inference_steps=28).images[0]
result.show()
result.save("output.png")

training resolution 320^2 instead of 512^2?

Dear Authors,

thanks for the great effort and open-sourcing the model.

I have a question regarding the inference resolution of the images. Basically, the model diffuses one image consisting of a 2x3 grid of sub-images for 6 views. I see that at inference time a resolution of 640x960 is used, which means the resolution of each view is 320x320. Are these also the images you used in One-2-3-45++ to construct the feature volume?

I also tried inference at a higher resolution (512x2, 512x3), and it generates the following image, which has 3x5 views. Is this expected? The middle column, as well as the second and fourth rows, look a bit like interpolated camera poses compared to the (320x2, 320x3) output, which has 2x3 views:
(outputs attached)

depth control for v1.2 ?

Looking at the documentation for version 1.2, I am not seeing a depth/normal ControlNet for the initial view generation. If I understand correctly, the normal ControlNet only outputs normals instead of images in this version.
Does the old v1 depth ControlNet still work with the new model? Thanks

About the scheduler

Is it necessary to use the EulerAncestralDiscreteScheduler? What is its advantage over other schedulers? And why does the RefOnlyUnet use a DDPMScheduler?

camera FOV

What field of view (FOV; fovy) do you use to render the 6 camera views? I see the elevation and azimuth parameters, but I cannot find the FOV. Thanks!

How do camera viewpoints work at training and inference?

Thanks for releasing the code!

I am trying to understand how the camera viewpoints are sampled and used, and I have a few questions:

  1. How exactly does the model take in the camera viewpoint? Is it the same as zero123 conditional latent diffusion architecture where the input view (x) and a relative viewpoint transformation (R,T) are used as conditional information? If so, are you using the same conditional info encoder as zero123?

  2. The report says Zero123++ uses a fixed set of 6 poses (relative azimuth and absolute elevation angles) as the prediction target.
    a. zero123 uses a dataset of paired images and their relative camera extrinsics {(x, x_(R,T), R, T)} for training; is the equivalent notation for zero123++ {(x, x_(tiled 6 images), R_{1..6}, T_{1..6})}?
    b. Tying back to Q1, does this mean that instead of taking (x) and (R, T) as conditional input, zero123++ takes (x_{1..6}) and (R_{1..6}, T_{1..6}) as conditional input?

  3. I hope to explicitly pass in a randomly sampled camera viewpoint at inference time; is that possible? I couldn't seem to find the exact part of the code that would allow this.
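Not an official answer, but regarding questions 2 and 3: the report describes a fixed target pose set rather than a free viewpoint condition, with elevations interleaving roughly 30 degrees downward and 20 degrees upward and relative azimuths starting at 30 degrees and stepping by 60 degrees. A sketch that just enumerates that set; the exact sign convention for elevation is an assumption and should be checked against the rendering setup:

# Hypothetical enumeration of the 6 fixed target viewpoints described in the report.
# The elevation sign convention (which direction counts as "downward") is an assumption;
# azimuths are relative to the azimuth of the input view.
elevations = [30, -20, 30, -20, 30, -20]        # degrees, alternating
azimuths = [30 + 60 * i for i in range(6)]      # 30, 90, 150, 210, 270, 330 degrees

for i, (el, az) in enumerate(zip(elevations, azimuths)):
    print(f"view {i}: elevation {el:+d} deg, relative azimuth {az} deg")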

Adding a code of conduct to the repo!

Why a code of conduct?

We propose adding a comprehensive Code of Conduct to our repository to ensure a safe, respectful, and inclusive environment for all contributors and users. This code will serve as a guideline for behavior, promoting diversity, reducing conflicts, and attracting a wider range of perspectives.

Issue type:-

  • [✅] Docs

@eliphatfs, kindly assign this issue to me! I would love to work on it! Thank you!

Lens distortion

Hi there,
Firstly, congratulations on this project, and thank you for all your efforts.
I have a question about the output. A use case I'd like to explore is creating assets for an isometric game from predominantly front-on images. Zero123++ output seems to generate a fitting camera angle, but due to perspective lens distortion, objects appear irregular compared to isometric environments.
E.g. a sofa appears to have different heights at each end, because objects look larger the closer they are to the camera.

Do you happen to foresee options for Zero123++ which would try to generate more isometric-friendly imagery in the future?

Thank you again.

Question about ReferenceOnlyAttnProc

Hello, I'm studying pipeline.py. The ReferenceOnlyAttnProc is the implementation of "appending the self-attention K and V", right?
I wonder what mode == 'm' is for, since I found that mode == 'w' stores encoder_hidden_states and mode == 'r' appends them. I suspect it is there to ensure the completeness of the computation graph for backpropagation.
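Not the repository's actual code, but here is how I read the write/read pattern: in 'w' mode the reference pass stores its self-attention keys/values (or hidden states) into a shared bank, and in 'r' mode the denoising pass concatenates the stored tensors onto its own keys/values before attention. A toy sketch with hypothetical names, just to illustrate the data flow:

import torch

class ToyReferenceAttention(torch.nn.Module):
    # Illustrative only: a single self-attention layer with a write/read reference mode.
    def __init__(self, dim, bank, name):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)
        self.bank, self.name = bank, name

    def forward(self, x, mode):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        if mode == "w":                       # reference pass: remember this layer's K/V
            self.bank[self.name] = (k, v)
        elif mode == "r":                     # denoising pass: append the stored reference K/V
            ref_k, ref_v = self.bank[self.name]
            k = torch.cat([k, ref_k], dim=1)  # concatenate along the token dimension
            v = torch.cat([v, ref_v], dim=1)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

bank = {}
layer = ToyReferenceAttention(dim=32, bank=bank, name="block0")
ref, noisy = torch.randn(1, 4, 32), torch.randn(1, 4, 32)
_ = layer(ref, mode="w")      # run on the reference image first (write)
out = layer(noisy, mode="r")  # denoising step attends to [self; reference] (read)
print(out.shape)              # torch.Size([1, 4, 32])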

How to get the textured mesh?

Hi, great work!
How do I get the final mesh from the generated multi-view images? Like One-2-3-45, using SparseNeuS?

Reference attention

Nice work! But I have a question regarding the reference attention.

As mentioned in your paper, zero123 concatenates the condition image to the noisy input along the feature dimension for local conditioning. This does impose an incorrect pixel-wise alignment between the input and the condition image. However, the noisy input is also guided by the condition image via cross-attention.

I am confused about how you implement your reference attention. Do you apply self-attention to the input and condition image independently and then concatenate their K and V matrices? Would you mind providing some advice?

v-prediction

Great work! The paper says "Therefore, we have opted to utilize the Stable Diffusion 2 v-prediction model as our base model for fine-tuning", but the code uses the same call function as Stable Diffusion; moreover, the lambdalabs/sd-image-variations-diffusers model defaults to epsilon prediction. How was the model converted?

I have tried to verify whether the released model is v-prediction or eps-prediction by adding noise and computing the loss against the ground-truth eps and v; I found that the eps loss is smaller. Is that expected?
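Not an official answer, but two quick checks are possible: the released scheduler config records the prediction type directly, and when comparing losses the v target should be formed from x0 and eps with the standard identity v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0. A sketch under those assumptions:

import torch

# 1) The released config states what the checkpoint was trained with, e.g.:
#    print(pipeline.scheduler.config.prediction_type)   # 'epsilon' or 'v_prediction'

# 2) When comparing losses, build the v target from x0 and the sampled noise:
def v_target(x0, eps, alphas_cumprod, t):
    a = alphas_cumprod[t].reshape(-1, 1, 1, 1)   # assumes 4D latents
    return a.sqrt() * eps - (1.0 - a).sqrt() * x0

# Compare the UNet output against eps (epsilon model) and against v_target(...)
# (v-prediction model) for the same noisy latent and timestep; the smaller loss
# indicates the parameterization the checkpoint actually uses.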

scaling about reference attention

Hello, I am trying to fine-tune the model. I have some questions; could you please help me answer them?

  1. In pipeline.py#L403, there are unscale_latents and unscale_image. So during training, do I need to apply scale_image and scale_latents to get the noisy latents? If so, why are the condition images not scaled in pipeline.py, since the two branches of the Reference Attention model use the same UNet?
  2. The report says the model achieves the highest consistency with the conditioning image when the reference latent is scaled by a factor of 5, but I haven't seen this implemented in pipeline.py. Does it mean using 5x the condition-image latents?

multiple views as input

Results are crazy good, good job!!

I was wondering, is it possible somehow to input more than 1 image (front, side, 45 degrees)?

Thank you!

Add SAM for gradio.

Segmentation quality can be drastically improved by calling SAM on the mask from rembg

Following #6.

How can I change the camera pose and the size of the image?

Hello, thank you very much for releasing your code. I couldn't find a specific part in the code that defines the camera pose. May I ask how you trained or inferred to get images of the corresponding 6 poses? Did you create such a dataset yourself? If I want to generate images from other angles, do I need to create my own dataset? Additionally, I would like to know how to change the size of the generated images. I look forward to your reply, thank you!

Reference Only Local Conditioning Clarification

Great work! I have a question regarding the reference-only attention implementation. In the paper it is written that 'Reference Attention refers to the operation of running the denoising UNet model on an extra reference image and appending the self-attention key and value matrices from the reference image to the corresponding attention layers when denoising the model input'.
I would kindly ask whether the "append" operation should be understood as a concatenation of the tensors along the token dimension. I mean: if, for example, both the conditioning and input latents are [1, 4, 32] and Q, K, V each project to e.g. [1, 4, 5], then the concatenation should give [1, 4, 5] for the query and [1, 8, 5] for the key and value, and the self-attention matrix should finally be [1, 4, 8].

Is this the intended computation?
Thanks in advance
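For what it's worth, the shape arithmetic above can be checked directly in a couple of lines, treating "append" as concatenation along the token axis:

import torch

q = torch.randn(1, 4, 5)                         # queries come from the input latents only
k_in, v_in = torch.randn(1, 4, 5), torch.randn(1, 4, 5)
k_ref, v_ref = torch.randn(1, 4, 5), torch.randn(1, 4, 5)

k = torch.cat([k_in, k_ref], dim=1)              # [1, 8, 5]
v = torch.cat([v_in, v_ref], dim=1)              # [1, 8, 5]
attn = torch.softmax(q @ k.transpose(1, 2) / 5 ** 0.5, dim=-1)
print(attn.shape)                                # torch.Size([1, 4, 8])
print((attn @ v).shape)                          # torch.Size([1, 4, 5]), same token count as the input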

Is the training based on LoRA, or are the original model parameters tuned directly?

Excellent work! I was inspired by it and want to build on it in some other work. I understand that you have not released the training code, but I would appreciate it if you could share more details.

  1. Is the fine-tuning done with LoRA, or are the original parameters tuned directly? Would using LoRA give relatively worse results?
  2. How could I train the ControlNet if I have some other conditions? Is it an off-the-shelf one or trained from scratch? Can I use a T2I-Adapter instead? Does the ControlNet training happen before or after the training of the denoising UNet?
  3. How much does the noise schedule matter? Is it possible to use \epsilon-prediction and the original noise schedule and still get satisfying results?

Thank you!

Depth conditioning is not working

I am trying the following example with a given depth prior.

(input image and depth prior attached)

The results are quite far from what I expected.

(result with depth conditioning attached)

On the other hand, if no depth conditioning is used, the results are somewhat closer:

(result without depth conditioning attached)

Do you have a clue what's wrong with the depth conditioning part?

Not generating input image's colors when depth maps passed

Hello authors, first of all thanks so much for the amazing work.
I am trying to run the zero123plus pipeline with the depth ControlNet. When I pass depth_image, the input image's pixels/colors are completely disregarded by the model. The model still generates (kind of) depth-consistent 6 views, but the colors of the initial input image are nowhere to be seen.
I tested with your blue-and-yellow kids' chair and one of my own test models. I am sharing the pipeline as well as the input and output results. It would be a great help if you could look into this.
import torch
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler, ControlNetModel

pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1", custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16
)
pipeline.add_controlnet(ControlNetModel.from_pretrained(
    "sudo-ai/controlnet-zp11-depth-v1", torch_dtype=torch.float16
), conditioning_scale=0.75)

pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing='trailing'
)
pipeline.to('cuda:0')

# cImage: input condition image (PIL), dImage: 6-view depth image (PIL), loaded elsewhere.
result = pipeline(cImage, depth_image=dImage, num_inference_steps=35).images[0]

(Attached: the condition images, depth maps, and output screenshots for both test cases.)

Error with current diffusers (0.22.0.dev0)

When running the example code, an error occurs:

...
result = pipeline(cond, num_inference_steps=75).images[0] # it is Zero123PlusPipeline

Error coming up:

File ".../.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File ".../.cache/huggingface/modules/diffusers_modules/local/sudo-ai--zero123plus-pipeline/5006f82f085a8a5e1131440c327cdfde09b9c41a/pipeline.py", line 364, in call
encoder_hidden_states = self._encode_prompt(
File ".../.local/lib/python3.8/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 249, in _encode_prompt
prompt_embeds = torch.cat([prompt_embeds_tuple[1], prompt_embeds_tuple[0]])

TypeError: expected Tensor as element 0 in argument 0, but got NoneType

I think this is due to changes in the diffusers codebase. However, the supported diffusers version (0.20) is somewhat outdated; diffusers is under heavy development and will advance further. So I kindly ask you for a hint on how to fix the code (I could do it on my own), or please supply the fix.

Thank you
