sudo-ai-3d / zero123plus Goto Github PK

View Code? Open in Web Editor NEW

1.6K 30.0 114.0 2.44 MB

Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.

License: Apache License 2.0

Python 97.70% Dockerfile 2.30%

3d 3d-graphics aigc diffusers diffusion-models image-to-3d research-project text-to-3d

zero123plus's Introduction

Zero123++: A Single Image to Consistent Multi-view Diffusion Base Model

[Report] [Official Demo] [Demo by @yvrjsharma] [Google Colab] [Replicate demo]

UPDATES v1.2

We are thrilled to release Zero123++ v1.2! Main changes:

Camera intrinsics are handled more delibrately. The v1.2 model is more robust to a wider range of input field of views, croppings and unifies the output field of view to 30° to better reflect that of realistic close-up views.
The fixed set of elevations are changed from 30° and -20° to 20° and -10°.
In contrast with novel-view synthesis, the model focuses more for 3D generation. The model always outputs a set of views assuming a normalized object size instead of changing w.r.t. the input.

Additionally, we have a normal generator ControlNet that can generate view-space normal images. The output can also be used to obtain a more accurate mask than the SAM-based approach. Validation metrics on our validation set from Objaverse: alpha (before matting) IoU 98.81%, mean normal angular error 10.75°, normal PSNR 26.93 dB.

Usage

Use of the v1.2 base model is unchanged. Please see the sections below for usage.

Use of the normal generator: See examples/normal_gen.py.

For alpha mask generation from the normal images, please see examples/matting_postprocess.py and examples/normal_gen.py.

License

The code is released under Apache 2.0 and the model weights are released under CC-BY-NC 4.0.

This means that you cannot use the model (or its derivatives) in a commercial product pipeline, but you can still use the outputs from the model freely. And, you are accountable for the output you generate and its subsequent uses.

Get Started

You will need torch (recommended 2.0 or higher), diffusers (recommended 0.20.2), and transformers to start. If you are using torch 1.x, it is recommended to install xformers to compute attention in the model efficiently. The code also runs on older versions of diffusers, but you may see a decrease in model performance.

And you are all set! We provide a custom pipeline for diffusers, so no extra code is required.

To generate multi-view images from a single input image, you can run the following code (also see examples/img_to_mv.py):

import torch
import requests
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1", custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16
)

# Feel free to tune the scheduler!
# `timestep_spacing` parameter is not supported in older versions of `diffusers`
# so there may be performance degradations
# We recommend using `diffusers==0.20.2`
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing='trailing'
)
pipeline.to('cuda:0')

# Download an example image.
cond = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/lysol.png", stream=True).raw)

# Run the pipeline!
result = pipeline(cond, num_inference_steps=75).images[0]
# for general real and synthetic images of general objects
# usually it is enough to have around 28 inference steps
# for images with delicate details like faces (real or anime)
# you may need 75-100 steps for the details to construct

result.show()
result.save("output.png")

The above example requires ~5GB VRAM to run. The input image needs to be square, and the recommended image resolution is >=320x320.

By default, Zero123++ generates opaque images with a gray background (the zero for Stable Diffusion VAE). You may run an extra background removal pass like rembg to remove the gray background.

# !pip install rembg
import rembg
result = rembg.remove(result)
result.show()

To run the depth ControlNet, you can use the following example (also see examples/depth_controlnet.py):

import torch
import requests
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler, ControlNetModel

# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1", custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16
)
pipeline.add_controlnet(ControlNetModel.from_pretrained(
    "sudo-ai/controlnet-zp11-depth-v1", torch_dtype=torch.float16
), conditioning_scale=0.75)
# Feel free to tune the scheduler
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing='trailing'
)
pipeline.to('cuda:0')
# Run the pipeline
cond = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/0_cond.png", stream=True).raw)
depth = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/0_depth.png", stream=True).raw)
result = pipeline(cond, depth_image=depth, num_inference_steps=36).images[0]
result.show()
result.save("output.png")

This example requires ~5.7GB VRAM to run.

Models

The models are available at https://huggingface.co/sudo-ai:

sudo-ai/zero123plus-v1.1, base Zero123++ model release (v1.1).
sudo-ai/controlnet-zp11-depth-v1, depth ControlNet checkpoint release (v1) for Zero123++ (v1.1).
sudo-ai/zero123plus-v1.2, base Zero123++ model release (v1.2).
sudo-ai/controlnet-zp12-normal-gen-v1, normal generation ControlNet checkpoint release (v1) for Zero123++ (v1.2).

The source code for the diffusers custom pipeline is available in the diffusers-support directory.

Camera Parameters

Output views are a fixed set of camera poses:

Azimuth (relative to input view): 30, 90, 150, 210, 270, 330.
v1.1 Elevation (absolute): 30, -20, 30, -20, 30, -20.
v1.2 Elevation (absolute): 20, -10, 20, -10, 20, -10.
v1.2 Field of View (absolute): 30°.

Running Demo Locally

You will need to install extra dependencies:

pip install -r requirements.txt

Then run streamlit run app.py.

For Gradio Demo, you can run python gradio_app.py.

Related Work

[One-2-3-45] [One-2-3-45++] [Zero123]

Citation

If you found Zero123++ helpful, please cite our report:

@misc{shi2023zero123plus,
      title={Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model}, 
      author={Ruoxi Shi and Hansheng Chen and Zhuoyang Zhang and Minghua Liu and Chao Xu and Xinyue Wei and Linghao Chen and Chong Zeng and Hao Su},
      year={2023},
      eprint={2310.15110},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

zero123plus's People

Contributors

Stargazers

Watchers

Forkers

eltociear alakia yvrjsharma techthiyanes wangxiaolin99 x-ck-x natlamir tanglespace universewill neomatrix25 datakami-models skyheros001 ailabteam olegjakushkin harshhere905 ootts sorokinvld dustinpro pranparth alexandor91 zfd1 liamzebedee kushal34712 saferabbit aryan4884 di-dimmasik asdlei99 edgency ripl 0x35 feixuedudiao thangdt277 aditya7302 tomchapin shashidhar-techolution chiyee ilovejs 4jeff4 guonetnet51 0armaan025 v0xie mohitd404 saulocatharino dalvishruti14 pentesterpriyanshu shiyao-huang alienishi hengyumeng peterdays yizhangliu vatsalya-vyas stevetuking bjollans leftomelas jaedukseo toandreyhse renatomedev rohan7958 arkboy1224 jags111 smartjoy-tech undercontroller hengle dicky-iskandar tonywhite11 adambear marearts sumerudataanalaytics rozgo chenhuayou steveefemsc liamlangli weykon assassindesign rfeinman ron41913 dingdang2024 alan-baylis zhianlin pranavdulepet atkallie steven-xiong mengxuyigit navezjt yuxuansnow atlantixjj kmswin1 deyh2020 baochaozhu hiyyg esmaelsaleh moonryul xuweiyichen scriptkittyx jerrypiglet shimomurakei panziqiai guozanhua218 pofice bruinxiong

zero123plus's Issues

data of depth ControlNet

Dear author:
I want to ask the format of data of depth ControlNet must be 6 subfigures? I think it is hard to get the multiview depth image.

runtime error in huggingface demo

It says "Memory limit exceeded".

depth control for v1.2 ?

Looking at the documentation for version 1.2, I am not seeing a depth/normal controlnet for the initial view generation. If I understand correctly, the normal controlnet only outputs normals instead of images in this version.
Does the old v1 depth controlnet still work with the new model ? Thanks

v-prediction

great work! I see the paper said Therefore, we have opted to utilize the Stable Diffusion 2 v-prediction model as our base model for fine-tuning, but the code uses the sample call function with stablediffusion, moreover the lambdalabs/sd-image-variations-diffusers model is default as an eps prediction model, how the model is transferred?

I have tried to verify whether the release model is v-prediction or eps-prediction by adding noise and computing loss with the ground truth of eps and v, I found that the loss of eps is smaller, is that true?

Erorr with current diffusers (0.22.0.dev0)

When running the example code, an error occurs:

...
result = pipeline(cond, num_inference_steps=75).images[0] # it is Zero123PlusPipeline

Error coming up:

File ".../.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File ".../.cache/huggingface/modules/diffusers_modules/local/sudo-ai--zero123plus-pipeline/5006f82f085a8a5e1131440c327cdfde09b9c41a/pipeline.py", line 364, in call
encoder_hidden_states = self._encode_prompt(
File ".../.local/lib/python3.8/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 249, in _encode_prompt
prompt_embeds = torch.cat([prompt_embeds_tuple[1], prompt_embeds_tuple[0]])

TypeError: expected Tensor as element 0 in argument 0, but got NoneType

I think this is due to evolved code in diffusers. However, the supported diffusers version (0.20) is sort of outdated. Diffusers is under heavy development and will andvance further. Thus, I kindly ask you for a hint how to fix the code (I could do on my own), or supply the fix.

Thank you

training data

Hi! I understand that you used Objaverse for training - I was wondering if you had a training data for already rendered view in the depth grid format to avoid redoing the entire rendering and depth estimation

Question about ReferenceOnlyAttnProc

Hello, I'm studying the pipeline.py. The ReferenceOnlyAttnProc is the implementation of “appending the self-attention K and V”, right?
I wonder what is mode == 'm' for, since I found mode == 'w' is for storing encoder_hidden_states, and mode == 'r' is for appending. I suspect this is to ensure the completeness of the computation graph for the backward propagation.

Questions of camera settings when rendering the Objaverse dataset

Thanks for your great work! But I have questions of rendering the Objaverse dataset. What's the camera intrinsic matrix? And what is the camera distance?

getting normals/depth

Hey,

I wanted to ask, is there a way in the current implantations to get the normals or depth for the generated view (or the inserted image) ?

你们是来搞笑的吗？

冰墩墩给弄成了这样，哈哈哈哈

Not generating input image's colors when depth maps passed

Hello Authors, first of all thanks so much for the amazing work.
I am trying to run zero123plus pipeline with depth control net. When I use the depth_image, then the input image's pixels/colors are completely disregarded by the model. The model still generate (kind of) consistent 6 images with respect to depth, but the initial input image colors are no where to be seen.
I tested with your blue and yellow kid chair and one of my test models. I am sharing the pipeline and as well as the input and output results. It'd be a great help for me if you can look into this.
`
pipeline = DiffusionPipeline.from_pretrained(
"sudo-ai/zero123plus-v1.1", custom_pipeline="sudo-ai/zero123plus-pipeline",
torch_dtype=torch.float16
)
pipeline.add_controlnet(ControlNetModel.from_pretrained(
"sudo-ai/controlnet-zp11-depth-v1", torch_dtype=torch.float16
), conditioning_scale=0.75)

pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
pipeline.scheduler.config, timestep_spacing='trailing'
)
pipeline.to('cuda:0')

result = pipeline(cImage, depth_image=dImage, num_inference_steps=35).images[0]

could camera parameters be changed?

Thank you for your excellent work!
I'm impressed by the quality of this model.
I was wondering if we could change the camera parameter.
Thanks

camera FOV

What field of view (FOV; fovy) do you use to render the 6 camera views? I see the elevation and azimuth parameters, but I cannot find the FOV. Thanks!

That‘s a so great work ,but i have a questions

Whether the final generated image can only be a collection of 6 images with a resolution of 640*960, can a higher resolution image be output?

Objaverse Dataloader: Seeking Input/Output Details for Fine-tuning

Great work! I
I'm currently facing challenges with the Objaverse dataloader while attempting to fine-tune your model. Could you please provide details on the input and output specifications for this model?

the labels on the product will be changed and unreadable

when i upload an original product image ,then run the script and the generation makes the labels like on a bottle of wine in a mess and unreadable anymore

about FOV for rendering dataset

Firstly, thank you so much for sharing such an impressive work!

I'd like to know the FOV you used when rendering the training data. This way, I can quickly integrate my rendered content into new viewpoint generation, thereby achieving texture mapping.

Thank you very much!

Question about the input single image

Hi, thank you for your outstanding work.

I have always had a question regarding the 'Single Image to 3D' tasks. I've noticed that almost all images input into these systems are in a front-facing and upright position. I'm uncertain if I would get equally perfect results when inputting images that are viewed from the bottom or upside down. I believe these latter cases could play a crucial role in many completion tasks.

something about trainning detail..

how to embedding Camera_pos(elevation,azimuth) into model?
so I can fine tune it for my own task

Reference attention

Nice work! But I have a question regarding to the reference attention.

As mentioned in your paper, in zero123 it concats the condition image to the noisy input in the feature dimension for local conditioning. This does impose an incorrect pixel wise alignment between the input and condition image. But the noisy input is also guided with the condition image via cross attention.

I am confused about how you implement your reference attention. Do you apply the self attention on the input and condition image independently and then concats their K+V matrices? Do you mind providing some advices?

Adding Contributors section to the readme.md

Why Contributors section:- A "Contributors" section in a repo gives credit to and acknowledges
the people who have helped with the project, fosters a sense of community, and helps others
know who to contact for questions or issues related to the project.

example:-

Add SAM for gradio.

Segmentation quality can be drastically improved by calling SAM on the mask from rembg

Following #6.

Discussion/question: What to do with the generated images ? (texturing 3d models)

Hi , the tool works great. Thanx a lot!

Do you know of any good workflow for texturing my 3d model though ? i already projected the initial texture from the front, but now i need to texture the rest of my monsters body, from the side and back at least.

I thought maybe i can use stencil projection mode in texture painting -> which camera settings would i need to set in blender or substance painter to exactly match the generated images from the different angles?

if you know of any python code to do this automatically would be nice too. zero12345 wouldnt work as they generate their own mesh , i already have my mesh from which i made the first original texture (front projection).

maybe this could be partly scripted in blender to become a useful and outstanding texture pipeline tool.

EDIT: Now that i thought about it. Would it be possible to send an arbritrary (or fixed) angle to zero123+ ??
Like, we move around the object in blender, when we find a good angle (this might be different for different objects) , then send the camera position and angle (and maybe even an image for controlnet) to zero123+, generate the next image, receive the image ... then move around it again , and again send to zero123+ and again receive the image.
the intermediate steps in the diffusion could be kept in the mean time so we always get the consistent style and shading from the first/initial picture (similar to how comfyui only evaluates changed parts of the diffusion pipeline) just like it does now..
Maybe comfyUI would be the tool to make this all alot easier as it can already communicate with other progs such as blender and has a really good codebase and makes integrating new modules "easy", without having to write everything from scratch.

Increasing the image size in the diffuser pipeline causes the output image to collapse

The default size for ControlNet images is 640x960, but this makes each image coarse.
So we changed the image to be able to output at 1024x1536 and the image was out of order.

Image output at 640x960

Image output at 1024x1536

Is there any way to increase the size of the output image?

Here is the code we used

import torch
import requests
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler,ControlNetModel
import rembg

# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1", custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16
)
pipeline.add_controlnet(ControlNetModel.from_pretrained(
    "sudo-ai/controlnet-zp11-depth-v1", torch_dtype=torch.float16
), conditioning_scale=0.75)
# Feel free to tune the scheduler
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing='trailing'
)
pipeline.to('cuda:0')
# Run the pipeline
cond = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/0_cond.png", stream=True).raw)
depth = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/0_depth.png", stream=True).raw)

cond = cond.resize((512, 512))
depth = depth.resize((1024,1536))
result = pipeline(cond,width = 1024,height=1536,depth_image=depth, num_inference_steps=28).images[0]
#result = pipeline(cond,depth_image=depth, num_inference_steps=28).images[0]
result.show()
result.save("output.png")

question about forward_cond

Hi, thank you for the great work! I have a question about the forward_cond function under the RefOnlyNoisedUNet class. Why is it necessary to call it in the forward function? What does it achieve?
Thank you!

https://github.com/SUDO-AI-3D/zero123plus/blob/main/diffusers-support/pipeline.py#L134

About the scheduler

Is it necessary to use the EulerAncestralDiscreteScheduler? What's its advantage over other schedulers? Why the RefOnlyUnet used a DDPMScheduler?

Some questions regarding the input and output latent scaling

I noticed in the code that when performing generation from gaussian noises, we need to first unscale_latents -> divide vae.config.scaling_factor -> vae decode -> unscale images to get the final image. However, when I tried to directly denoise an input image, how should I apply the scale operations during the encoding and decoding process? I tried the following code:

renderings = scale_images(images)
particles = vae.encode(renderings).latent_dist.sample() * vae.config.scaling_factor
particles = scale_latents(particles)
t = torch.tensor([1], device=device).long()
n = torch.randn_like(particles)
y_noisy = scheduler.add_noise(particles, n, t)
n_pred = predict_noise0_diffuser(Diff_pre, y_noisy, text_embeddings, t, guidance_scale=args.cfg, cross_attention_kwargs=cross_attention_kwargs, scheduler=scheduler)

predict_x = (y_noisy - (1-scheduler.alphas_cumprod.to(device)[t].reshape(-1, 1, 1, 1))**0.5 * n_pred) / scheduler.alphas_cumprod.to(device)[t].reshape(-1, 1, 1, 1)**0.5
predict_x = unscale_latents(predict_x)
image = pipe.vae.decode(predict_x / vae.config.scaling_factor, return_dict=False)[0]
image = unscale_images(image)
result = pipe.image_processor.postprocess(image, output_type='pil')

However, the denoised output (left) has different color than the input image (right):

If I deleted all the scale, unscale functions, the results seem to be correct. So I am confused how to use these scale, unscale functions?

What dataset was used?

Was it regular objaverse or objaverse-xl?

Is the training based on LoRA or just tune the original model parameters!

Excellent work! I was ispired by it and want to try it with some other work. I understand the you for not release the training code, but I will appreciate if you could tell more about the details.

The fine-tuning is with the help of LoRA or just the original parameters? will use LoRA get a relatively bad results?
How could I train the controlnet if I have some other conditions? Is it a off-the-shelf one or trained one? Can I use a T2I-Adapter instead? Is the training process before or after the training of denoising UNet?
How far could the noise schedule effect? Is it possible that I use \epsilon-prediction and original noise-schedule but still get a satisfying results?

Thank you!

Is there an approach to remove the gray background purely?

Thanks for your valuable work! But I'm wondering why zero123++ generates images with a gray background, which leads to much inconvience when feeding the synthesized multiview images to image-to-3D reconstruction pipelines such as OpenLRM or NeuS (Also, Instant3D if it's open-sourced in the future), since most of these methods require a white background for the input images. Using the rembg package to remove the gray background leads to inconsistent and wrong segmentations especially when there are image regions close to the gray color. So I'm wondering if there is an approch to remove the gray background purely and generate consistent white-background images? Also, I'm curious about whether you directly feed the gray-background images to the 3D diffusion model as conditions as described in the One2345++ paper?

Proposal for Collaborative 3D Reconstruction Between Two Projects

Hello,

I stumbled upon your project on GitHub (https://github.com/SUDO-AI-3D/zero123plus) and found it to be quite fascinating. An idea struck me about orchestrating a collaborative effort with another project (https://github.com/dreamgaussian/dreamgaussian) for 3D reconstruction. While I might not fully grasp the technical intricacies, I believe this collaboration could potentially enhance the quality of the output.

Specifically, the idea entails utilizing your tool to convert a single image into six images from different perspectives, then processing each of these images with the DreamGaussian tool for generation. Subsequently, these six generated models could be integrated to reconstruct a refined 3D model.

If this idea is feasible and holds the potential to improve the quality of the project, I would be thrilled if you could consider its implementation. I hope that this proposal brings value to your project, and should it pique your interest, I look forward to delving further into discussions.

Thank you for your consideration.

Warm regards,

[katyukuki]

How to control the output image to be in those six viewpoints without showing the input camera position？

Thanks for releasing the code!
I would like to know how to control the output image to be in those six viewpoints without showing the input camera position.

Any way to do actual 3d preview in gui so you can rotate and export mesh?

how to control what angles it generates ?

Lens distortion

Hi there,
Firstly, congratulations on this project, and thank you for all your efforts.
I have a question about the output. A use case I'd like to explore is creating assets for a isometric game from predominantly front-on images. Z++ output seems to generate a fitting camera angle, but due to perspective lens distortion, objects appear irregular comparatively to isometric environments.
eg. sofa which appears different heights on each end due to objects appearing larger the closer they are to the camera.

Do you happen to foresee options for Zero123++ which would try to generate more isometric-friendly imagery in the future?

Thank you again.

PytorchStreamReader failed reading zip archive: failed finding central directory

when I run app.py


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.


  You can now view your Streamlit app in your browser.

  Network URL: http://10.119.70.148:8501
  External URL: http://144.48.107.18:8501

2023-10-24 13:47:51.466322: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
text_encoder/model.safetensors not found
Loading pipeline components...: 100%|███████████████████████████████| 8/8 [00:03<00:00,  2.25it/s]
2023-10-24 13:48:09.295 Uncaught app exception
Traceback (most recent call last):
onda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 245, in _get_or_create_cached_value
    cached_result = cache.read_result(value_key)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 447, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/homaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 293, in _handle_cache_miss
    cached_result = cache.read_result(value_key)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/homeanaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 447, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "/home//download/zero123plus/app.py", line 206, in <module>
    SAMAPI.get_instance()
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 194, in wrapper
    return cached_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 223, in __call__
    return self._get_or_create_cached_value(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 248, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 302, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//download/zero123plus/app.py", line 41, in get_instance
    sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/segment_anything/build_sam.py", line 15, in build_sam_vit_h
    return _build_sam(
           ^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/segment_anything/build_sam.py", line 105, in _build_sam
    state_dict = torch.load(f)
                 ^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/torch/serialization.py", line 797, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/torch/serialization.py", line 283, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Reference Only Local Conditioning Clarification

Great work! I have a question regarding the reference-only attention implementation. In the paper it is written that 'Reference Attention refers to the operation of running the denoising UNet model on an extra reference image and appending the self-attention key and value matrices from the reference image to the corresponding attention layers when denoising the model input'.
I would kindly ask if the "append" operation should be intended as a concatenation between the tensors. I mean: if for example both conditioning and input latents are [1, 4, 32] and the Q,K,V project both in e.g. [1, 4, 5], then the concatenation along first dimension should give us a result of [1,4,5] for Query and [1,8,5] for Key and Value. Self-attention matrix should finally result in [1,4,8].

Is this the intended computation?
Thanks in advance

How to complete 3D reconstruction

How to complete 3D reconstruction？
Image results cannot be reconstructed using colmap。
Can you mention some of the fake gestures?

Adding code-of-conduct to the repo !

why coc ?

code-of-conduct:- We propose adding a comprehensive Code of Conduct to our repository to ensure
a safe, respectful, and inclusive environment for all contributors and users. This code will
serve as a guideline for behavior, promoting diversity, reducing conflicts, and attracting a
wider range of perspectives.

Issue type:-

[✅] Docs

@eliphatfs kindly assign this issue to me ! I would love to work on it ! Thank you !

Adding License section in Readme

Update the LICENSE section in the README file to enhance the presentation. To update the text with a cleaner format, including a hyperlink to the LICENSE file.
I would like to work on this.

How do camera viewpoints work at training and inference?

Thanks for releasing the code!

I am trying to understand how the camera viewpoints are sampled and used, and I have a few questions:

How exactly does the model take in the camera viewpoint? Is it the same as zero123 conditional latent diffusion architecture where the input view (x) and a relative viewpoint transformation (R,T) are used as conditional information? If so, are you using the same conditional info encoder as zero123?
The report says Zero123++ uses a fixed set of 6 poses (relative azimuth and absolute elevation angles) as the prediction target.
a. zero123 uses a dataset of paired images and their relative camera extrinsics {(x, x_(R,T) , R, T)} for training, is the equivalent notation for zero123++ {(x, x_(tiled 6 images) , R_{1,...6}, T_{1..6})}
b. Tying back to Q1, does this mean instead of taking in (x) and (R,T) as conditional input, zero123++ takes in (x_{1...6}) and (R_{1...6}, T_{1..6}) as conditional input?
I hope to explicitly pass in a randomly sampled camera viewpoint at inference time, is that possible? I couldn't seem to find the exact part in the code that will allow this.

How to get the textured mesh?

hi, great work!
I want to know how to get the final mesh from the generated multi view image? Like One-2-3-45, use SparseNeuS?

Will the training code be published?

I would like to make a model by having it learn in an original camera pose.
Hopefully the training code will be made available.

Depth conditioning is not working

I am trying with the following example with a given depth prior.

The results are quite out of expectation.

On the other hand, if no depth conditioning is used, the results are somewhat closer:

Do you have a clue what's wrong with the depth conditioning part?

How can I change the camera pose and the size of the image?

Hello, thank you very much for releasing your code. I couldn't find a specific part in the code that defines the camera pose. May I ask how you trained or inferred to get images of the corresponding 6 poses? Did you create such a dataset yourself? If I want to generate images from other angles, do I need to create my own dataset? Additionally, I would like to know how to change the size of the generated images. I look forward to your reply, thank you!

Training details

Hello, thank you for the great work.
I would kindly ask you for some training details. In particular:

How many/which type of GPUs did you use for training? How much did it take to train?
Which is the exact SD version checkpoint that has been fine-tuned? It is known from the report that it is the SD 2 v-model (Sec. 2.5) but it would be helpful to have the link to the exact SD checkpoint from which the fine-tuning has been initiated.

Thanks in advance

scaling about reference attention

hello, I am trying to fine-tune the model. I have some questions, could you please help me answer them?

in pipeline.py#L403, there are unscale_latents and unscale_image. So in training, I need to do scale_image and scale_latents to get noisy latents? If so, why condition Images are not scaled in pipeline.py, since the two branches in Reference Attention Model use the same unet.
in the report, there is the model achieves the highest consistency with the conditioning image when the reference latent is scaled by a factor of 5. But I haven't seen it implemented in the pipeline.py. Does it mean using 5xCondition Image Latents?

training resolution 320^2 instead of 512^2?

Dear Authors,

thanks for the great effort and open-sourcing the model.

I have a question regarding the inference resolution of the image. Basically, model diffuses one image, which is 2x3 sub images for 6 views. I see that at the inference time, resolution (640x960) is used, which means the resolution of each view is 320x320. Is that also the images you used in One-2-3-45++ to construct the feature volume?

I also tried to infer higher resolution (512x2, 512x3), and it generate following image, which has 3x5 views. Is this expected? The middle column as well as second and fourth row looks a bit as interpolated camera poses, compared to the (320x2, 320x3), which has 2x3 views:

What is the "scaled-linear noise scheduler"?

Great work! You mention in the paper that the original noise schedule in Stable Diffusion is "scaled-linear schedule".

Could you please provide more details about the "scaled-linear schedule"?

I found that Stable Diffusion didn't release their training scripts. Both latent diffusion model (https://github.com/CompVis/latent-diffusion) and Zero123 (https://github.com/cvlab-columbia/zero123) seem to use the default linear noise schedule. Did I miss something?

Thank you.

multiple views as input

Results are crazy good, good job!!

I was wondering, is it possible somehow to input more than 1 image (front, side, 45 degrees)?