sudo-ai-3d / zero123plus
Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.
License: Apache License 2.0
Are the new perspective images rendered in Blender? Can you provide the specific rotation matrices to make it easier to compute the camera poses?
Can the final output only be a grid of 6 images at a resolution of 640*960, or can a higher-resolution image be generated?
How to complete 3D reconstruction?
Image results cannot be reconstructed using COLMAP.
Can you mention some of the fake gestures?
Thanks for releasing the code!
I would like to know how to control the output image to be in those six viewpoints without showing the input camera position.
I noticed in the code that when performing generation from gaussian noises, we need to first unscale_latents -> divide vae.config.scaling_factor -> vae decode -> unscale images to get the final image. However, when I tried to directly denoise an input image, how should I apply the scale operations during the encoding and decoding process? I tried the following code:
renderings = scale_images(images)
particles = vae.encode(renderings).latent_dist.sample() * vae.config.scaling_factor
particles = scale_latents(particles)
t = torch.tensor([1], device=device).long()
n = torch.randn_like(particles)
y_noisy = scheduler.add_noise(particles, n, t)
n_pred = predict_noise0_diffuser(Diff_pre, y_noisy, text_embeddings, t, guidance_scale=args.cfg, cross_attention_kwargs=cross_attention_kwargs, scheduler=scheduler)
predict_x = (y_noisy - (1-scheduler.alphas_cumprod.to(device)[t].reshape(-1, 1, 1, 1))**0.5 * n_pred) / scheduler.alphas_cumprod.to(device)[t].reshape(-1, 1, 1, 1)**0.5
predict_x = unscale_latents(predict_x)
image = pipe.vae.decode(predict_x / vae.config.scaling_factor, return_dict=False)[0]
image = unscale_images(image)
result = pipe.image_processor.postprocess(image, output_type='pil')
However, the denoised output (left) has a different color from the input image (right):
If I delete all the scale and unscale functions, the results seem to be correct. So I am confused about how to use these scale and unscale functions.
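One way to sanity-check the ordering described above is a toy round trip: whatever is applied on the encode side has to be undone in exactly the reverse order on the decode side. The sketch below is only an illustration under stated assumptions. The affine constants, the `fake_vae_encode` / `fake_vae_decode` stand-ins, and the `VAE_SCALING_FACTOR` value are hypothetical placeholders, not values confirmed by the authors; the real ones should be read from pipeline.py and `vae.config.scaling_factor`.

```python
# Hypothetical constants for illustration only; read the real ones from pipeline.py.
LATENT_SHIFT = 0.22
LATENT_GAIN = 0.75
IMAGE_GAIN = 0.5 / 0.8
VAE_SCALING_FACTOR = 0.18215  # stand-in for vae.config.scaling_factor


def scale_image(x):
    return x * IMAGE_GAIN


def unscale_image(x):
    return x / IMAGE_GAIN


def scale_latents(z):
    return (z - LATENT_SHIFT) * LATENT_GAIN


def unscale_latents(z):
    return z / LATENT_GAIN + LATENT_SHIFT


def fake_vae_encode(x):
    # Toy invertible map standing in for vae.encode(...).latent_dist.sample()
    return 2.0 * x + 1.0


def fake_vae_decode(z):
    # Exact inverse of fake_vae_encode, standing in for vae.decode(...)
    return (z - 1.0) / 2.0


def encode(pixel):
    # scale image -> VAE encode -> multiply by scaling factor -> scale latents
    return scale_latents(fake_vae_encode(scale_image(pixel)) * VAE_SCALING_FACTOR)


def decode(latent):
    # Reverse order: unscale latents -> divide by scaling factor -> VAE decode -> unscale image
    return unscale_image(fake_vae_decode(unscale_latents(latent) / VAE_SCALING_FACTOR))


def decode_wrong_order(latent):
    # Same operations with the first two swapped; the round trip no longer closes
    return unscale_image(fake_vae_decode(unscale_latents(latent / VAE_SCALING_FACTOR)))
```

With these toy functions `decode(encode(x))` returns `x` up to floating-point error, while `decode_wrong_order(encode(x))` does not. If the real pipeline round trip closes without any noise added but the colors still shift after denoising, the mismatch is more likely to come from the denoising step than from the scaling order.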
Great work! You mention in the paper that the original noise schedule in Stable Diffusion is "scaled-linear schedule".
Could you please provide more details about the "scaled-linear schedule"?
I found that Stable Diffusion didn't release their training scripts. Both latent diffusion model (https://github.com/CompVis/latent-diffusion) and Zero123 (https://github.com/cvlab-columbia/zero123) seem to use the default linear noise schedule. Did I miss something?
Thank you.
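For readers comparing the two schedules: in the diffusers library, the `scaled_linear` option interpolates linearly in the square root of beta and then squares the result, whereas `linear` interpolates beta directly. As far as I recall, the "linear" option in the CompVis latent-diffusion code is also implemented with the square-root form, which may explain the naming confusion, but that is worth verifying against the source. The sketch below uses plain Python; the `beta_start` and `beta_end` values are the ones commonly used with Stable Diffusion and are an assumption here, not something stated in this thread.

```python
def linear_betas(beta_start, beta_end, n):
    # Evenly spaced betas between beta_start and beta_end (inclusive)
    step = (beta_end - beta_start) / (n - 1)
    return [beta_start + i * step for i in range(n)]


def scaled_linear_betas(beta_start, beta_end, n):
    # Evenly spaced in sqrt(beta), then squared
    roots = linear_betas(beta_start ** 0.5, beta_end ** 0.5, n)
    return [r * r for r in roots]


# Assumed values, commonly used with Stable Diffusion
BETA_START, BETA_END, NUM_STEPS = 0.00085, 0.012, 1000

betas_linear = linear_betas(BETA_START, BETA_END, NUM_STEPS)
betas_scaled = scaled_linear_betas(BETA_START, BETA_END, NUM_STEPS)
```

Both schedules share the same endpoints; the scaled-linear one stays below the linear one in between, so it adds noise more slowly at intermediate timesteps.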
When I upload an original product image and run the script, the generated views turn labels (such as the one on a bottle of wine) into an unreadable mess.
Great work!
I'm currently facing challenges with the Objaverse dataloader while attempting to fine-tune your model. Could you please provide details on the input and output specifications for this model?
Hello,
I stumbled upon your project on GitHub (https://github.com/SUDO-AI-3D/zero123plus) and found it to be quite fascinating. An idea struck me about orchestrating a collaborative effort with another project (https://github.com/dreamgaussian/dreamgaussian) for 3D reconstruction. While I might not fully grasp the technical intricacies, I believe this collaboration could potentially enhance the quality of the output.
Specifically, the idea entails utilizing your tool to convert a single image into six images from different perspectives, then processing each of these images with the DreamGaussian tool for generation. Subsequently, these six generated models could be integrated to reconstruct a refined 3D model.
If this idea is feasible and holds the potential to improve the quality of the project, I would be thrilled if you could consider its implementation. I hope that this proposal brings value to your project, and should it pique your interest, I look forward to delving further into discussions.
Thank you for your consideration.
Warm regards,
[katyukuki]
Hello, thank you for the great work.
I would kindly ask you for some training details. In particular:
Thanks in advance
Firstly, thank you so much for sharing such an impressive work!
I'd like to know the FOV you used when rendering the training data. This way, I can quickly integrate my rendered content into new viewpoint generation, thereby achieving texture mapping.
Thank you very much!
Hi, thank you for your outstanding work.
I have always had a question regarding the 'Single Image to 3D' tasks. I've noticed that almost all images input into these systems are in a front-facing and upright position. I'm uncertain if I would get equally perfect results when inputting images that are viewed from the bottom or upside down. I believe these latter cases could play a crucial role in many completion tasks.
Hey,
I wanted to ask: is there a way in the current implementation to get the normals or depth for the generated views (or for the input image)?
Dear author:
Must the depth input for the depth ControlNet consist of 6 sub-images? I think it is hard to obtain multi-view depth images.
When I run app.py, I get:
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.
You can now view your Streamlit app in your browser.
Network URL: http://10.119.70.148:8501
External URL: http://144.48.107.18:8501
2023-10-24 13:47:51.466322: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
text_encoder/model.safetensors not found
Loading pipeline components...: 100%|███████████████████████████████| 8/8 [00:03<00:00, 2.25it/s]
2023-10-24 13:48:09.295 Uncaught app exception
Traceback (most recent call last):
onda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 245, in _get_or_create_cached_value
cached_result = cache.read_result(value_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "aconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 447, in read_result
raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/homaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 293, in _handle_cache_miss
cached_result = cache.read_result(value_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/homeanaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 447, in read_result
raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.__dict__)
File "/home//download/zero123plus/app.py", line 206, in <module>
SAMAPI.get_instance()
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 194, in wrapper
return cached_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 223, in __call__
return self._get_or_create_cached_value(args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 248, in _get_or_create_cached_value
return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 302, in _handle_cache_miss
computed_value = self._info.func(*func_args, **func_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home//download/zero123plus/app.py", line 41, in get_instance
sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/segment_anything/build_sam.py", line 15, in build_sam_vit_h
return _build_sam(
^^^^^^^^^^^
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/segment_anything/build_sam.py", line 105, in _build_sam
state_dict = torch.load(f)
^^^^^^^^^^^^^
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/torch/serialization.py", line 797, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home//anaconda3/envs/train_sd/lib/python3.11/site-packages/torch/serialization.py", line 283, in __init__
super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
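The final error above ("failed finding central directory") usually indicates that the checkpoint file passed to `torch.load` is not a complete zip archive, which commonly happens when a download was interrupted. A minimal check, assuming the checkpoint was saved in PyTorch's zip-based format (older legacy-format checkpoints are not zip files and would fail this test even when intact), could look like the sketch below. The function name is hypothetical.

```python
import os
import zipfile


def looks_like_complete_checkpoint(path):
    """Return True if `path` exists, is non-empty, and ends with a readable zip directory."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    # is_zipfile looks for the end-of-central-directory record,
    # which is missing from truncated downloads
    return zipfile.is_zipfile(path)
```

If this returns False for the SAM checkpoint, re-downloading the file and comparing its size with the published one is a reasonable next step.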
Hi! I understand that you used Objaverse for training. I was wondering if you could share training data with the views already rendered in the depth grid format, to avoid redoing the entire rendering and depth estimation.
Hi, the tool works great. Thanks a lot!
Do you know of any good workflow for texturing my 3D model, though? I already projected the initial texture from the front, but now I need to texture the rest of my monster's body, from the side and back at least.
I thought maybe I could use stencil projection mode in texture painting. Which camera settings would I need to set in Blender or Substance Painter to exactly match the generated images from the different angles?
If you know of any Python code to do this automatically, that would be nice too. zero12345 wouldn't work, as it generates its own mesh; I already have my mesh, from which I made the first original texture (front projection).
Maybe this could be partly scripted in Blender to become a useful and outstanding texture pipeline tool.
EDIT: Now that I think about it, would it be possible to send an arbitrary (or fixed) angle to zero123+?
For example, we move around the object in Blender, and when we find a good angle (this might be different for different objects), we send the camera position and angle (and maybe even an image for ControlNet) to zero123+, generate the next image, and receive it. Then we move around the object again, send to zero123+ again, and receive the next image.
The intermediate steps of the diffusion could be kept in the meantime, so we always get the consistent style and shading from the first/initial picture, just like it does now (similar to how ComfyUI only evaluates changed parts of the diffusion pipeline).
Maybe ComfyUI would be the tool to make all of this a lot easier, as it can already communicate with other programs such as Blender, has a really good codebase, and makes integrating new modules "easy" without having to write everything from scratch.
Was it regular objaverse or objaverse-xl?
Thanks for your valuable work! But I'm wondering why zero123++ generates images with a gray background. This is very inconvenient when feeding the synthesized multi-view images to image-to-3D reconstruction pipelines such as OpenLRM or NeuS (and Instant3D, if it is open-sourced in the future), since most of these methods require a white background for the input images. Using the rembg package to remove the gray background leads to inconsistent and wrong segmentations, especially when there are image regions close to the gray color. So I'm wondering if there is an approach that removes the gray background cleanly and generates consistent white-background images. Also, I'm curious whether you directly feed the gray-background images to the 3D diffusion model as conditions, as described in the One2345++ paper.
Thank you for your excellent work!
I'm impressed by the quality of this model.
I was wondering if we could change the camera parameter.
Thanks
Thanks for your great work! But I have questions of rendering the Objaverse dataset. What's the camera intrinsic matrix? And what is the camera distance?
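While the actual rendering parameters have to come from the authors, the relation between a vertical field of view and a pinhole intrinsic matrix is standard. The sketch below assumes square pixels and a principal point at the image center; the function name is hypothetical, and any FOV value plugged in is a placeholder rather than the value used for the training renders.

```python
import math


def intrinsics_from_fov(fov_y_deg, width, height):
    """Pinhole intrinsics K for a given vertical FOV, assuming square pixels
    and a principal point at the image center."""
    focal = (height / 2.0) / math.tan(math.radians(fov_y_deg) / 2.0)
    cx, cy = width / 2.0, height / 2.0
    return [
        [focal, 0.0, cx],
        [0.0, focal, cy],
        [0.0, 0.0, 1.0],
    ]
```

For a 320x320 view, a wider FOV gives a shorter focal length in pixels, so the same object has to sit closer to the camera to fill the frame; FOV and camera distance therefore need to be known together.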
It says "Memory limit exceeded".
Update the LICENSE section in the README file to improve its presentation: use a cleaner text format and include a hyperlink to the LICENSE file.
I would like to work on this.
The default size for ControlNet images is 640x960, but this makes each image coarse.
So we changed the output size to 1024x1536, but the resulting image was out of order.
Image output at 640x960
Image output at 1024x1536
Is there any way to increase the size of the output image?
Here is the code we used:
import torch
import requests
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler, ControlNetModel
import rembg
# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1", custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16
)
pipeline.add_controlnet(ControlNetModel.from_pretrained(
    "sudo-ai/controlnet-zp11-depth-v1", torch_dtype=torch.float16
), conditioning_scale=0.75)
# Feel free to tune the scheduler
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing='trailing'
)
pipeline.to('cuda:0')
# Run the pipeline
cond = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/0_cond.png", stream=True).raw)
depth = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/0_depth.png", stream=True).raw)
cond = cond.resize((512, 512))
depth = depth.resize((1024, 1536))
result = pipeline(cond, width=1024, height=1536, depth_image=depth, num_inference_steps=28).images[0]
# result = pipeline(cond, depth_image=depth, num_inference_steps=28).images[0]
result.show()
result.save("output.png")
Dear Authors,
thanks for the great effort and open-sourcing the model.
I have a question regarding the inference resolution. Basically, the model diffuses one image that tiles 2x3 sub-images for the 6 views. I see that at inference time a resolution of 640x960 is used, which means each view is 320x320. Are those also the images you used in One-2-3-45++ to construct the feature volume?
I also tried inference at a higher resolution (512x2, 512x3), and it generated the following image, which has 3x5 views. Is this expected? The middle column, as well as the second and fourth rows, look a bit like interpolated camera poses, compared to the (320x2, 320x3) output, which has 2x3 views:
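To make the tiling arithmetic above concrete, here is a small sketch that computes the crop box of each view in a tiled output. It assumes the grid has 2 columns and 3 rows (consistent with a 640-wide, 960-tall image made of 320x320 views) and that views are stored in row-major order; the ordering and the function name are assumptions, not something confirmed in this thread.

```python
def tile_boxes(width, height, cols=2, rows=3):
    """Return (left, upper, right, lower) boxes for each tile, row by row."""
    tile_w, tile_h = width // cols, height // rows
    return [
        (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
        for r in range(rows)
        for c in range(cols)
    ]


# Six 320x320 boxes for the default 640x960 output
default_boxes = tile_boxes(640, 960)
```

Each tuple uses the (left, upper, right, lower) convention expected by PIL's `Image.crop`, so the boxes can be passed to it directly to split a generated grid into individual views.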
Looking at the documentation for version 1.2, I am not seeing a depth/normal controlnet for the initial view generation. If I understand correctly, the normal controlnet only outputs normals instead of images in this version.
Does the old v1 depth controlnet still work with the new model ? Thanks
Is it necessary to use the EulerAncestralDiscreteScheduler? What is its advantage over other schedulers? Why does the RefOnlyUnet use a DDPMScheduler?
What field of view (FOV; fovy) do you use to render the 6 camera views? I see the elevation and azimuth parameters, but I cannot find the FOV. Thanks!
Thanks for releasing the code!
I am trying to understand how the camera viewpoints are sampled and used, and I have a few questions:
1. How exactly does the model take in the camera viewpoint? Is it the same as the zero123 conditional latent diffusion architecture, where the input view (x) and a relative viewpoint transformation (R,T) are used as conditional information? If so, are you using the same conditional info encoder as zero123?
2. The report says Zero123++ uses a fixed set of 6 poses (relative azimuth and absolute elevation angles) as the prediction target.
   a. zero123 uses a dataset of paired images and their relative camera extrinsics {(x, x_(R,T) , R, T)} for training. Is the equivalent notation for zero123++ {(x, x_(tiled 6 images) , R_{1,...6}, T_{1..6})}?
   b. Tying back to Q1, does this mean that instead of taking in (x) and (R,T) as conditional input, zero123++ takes in (x_{1...6}) and (R_{1...6}, T_{1..6}) as conditional input?
3. I hope to explicitly pass in a randomly sampled camera viewpoint at inference time. Is that possible? I couldn't seem to find the exact part of the code that would allow this.
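For anyone who wants to turn the fixed azimuth/elevation pairs into camera positions, the sketch below converts spherical angles into a point on a sphere around the object. Everything specific here is an assumption: the angle lists are recalled from the project README for v1.1 and should be verified there, the radius is a placeholder (the camera distance is not stated in this thread), and the axis convention (z up, azimuth measured in the x-y plane) is just one common choice.

```python
import math


def camera_position(azimuth_deg, elevation_deg, radius=1.0):
    """Point on a sphere of given radius; z is up, azimuth is measured in the x-y plane."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        radius * math.cos(el) * math.cos(az),
        radius * math.cos(el) * math.sin(az),
        radius * math.sin(el),
    )


# Assumed values (from memory of the v1.1 README); azimuth is relative to the input view
ASSUMED_AZIMUTHS = [30, 90, 150, 210, 270, 330]
ASSUMED_ELEVATIONS = [30, -20, 30, -20, 30, -20]

positions = [
    camera_position(az, el)
    for az, el in zip(ASSUMED_AZIMUTHS, ASSUMED_ELEVATIONS)
]
```

Because the azimuths are relative to the input view while the elevations are absolute, the input view's own azimuth would have to be added before comparing these positions with cameras in an external scene.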
why coc ?
code-of-conduct:- We propose adding a comprehensive Code of Conduct to our repository to ensure a safe, respectful, and inclusive environment for all contributors and users. This code will serve as a guideline for behavior, promoting diversity, reducing conflicts, and attracting a wider range of perspectives.
Issue type:-
@eliphatfs kindly assign this issue to me ! I would love to work on it ! Thank you !
Hi there,
Firstly, congratulations on this project, and thank you for all your efforts.
I have a question about the output. A use case I'd like to explore is creating assets for an isometric game from predominantly front-on images. Zero123++ output seems to generate a fitting camera angle, but due to perspective lens distortion, objects appear irregular compared to isometric environments.
E.g., a sofa appears to have a different height at each end, because objects appear larger the closer they are to the camera.
Do you happen to foresee options for Zero123++ which would try to generate more isometric-friendly imagery in the future?
Thank you again.
Hello, I'm studying the pipeline.py. The ReferenceOnlyAttnProc is the implementation of “appending the self-attention K and V”, right?
I wonder what mode == 'm' is for, since I found that mode == 'w' is for storing encoder_hidden_states and mode == 'r' is for appending. I suspect this is to ensure the completeness of the computation graph for backward propagation.
hi, great work!
I want to know how to get the final mesh from the generated multi view image? Like One-2-3-45, use SparseNeuS?
Nice work! But I have a question regarding the reference attention.
As mentioned in your paper, zero123 concatenates the condition image to the noisy input in the feature dimension for local conditioning. This does impose an incorrect pixel-wise alignment between the input and the condition image. But the noisy input is also guided by the condition image via cross attention.
I am confused about how you implement your reference attention. Do you apply self-attention on the input and the condition image independently and then concatenate their K and V matrices? Would you mind providing some advice?
Great work! I see the paper says, "Therefore, we have opted to utilize the Stable Diffusion 2 v-prediction model as our base model for fine-tuning", but the code uses the sample call function with Stable Diffusion. Moreover, the lambdalabs/sd-image-variations-diffusers model is by default an eps-prediction model. How was the model transferred?
I tried to verify whether the released model is v-prediction or eps-prediction by adding noise and computing the loss against the ground-truth eps and v. I found that the eps loss is smaller. Is that correct?
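For reference when running such a check, the standard relations between the two parameterizations are sketched below on scalars. Given the cumulative alpha product at a timestep (called `alpha_bar` here), the noisy sample, the eps target, and the v target are all linear combinations of the clean sample and the noise. This uses the usual v-prediction definition from the progressive distillation literature; whether the released checkpoint follows it is exactly the open question above, so treat this as a sketch for building the two ground-truth targets, not as a statement about the model.

```python
import math


def add_noise(x0, eps, alpha_bar):
    # Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps


def v_target(x0, eps, alpha_bar):
    # v = sqrt(a) * eps - sqrt(1 - a) * x0
    return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0


def x0_from_eps(x_t, eps, alpha_bar):
    # Invert the forward process when the model predicts eps
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)


def x0_from_v(x_t, v, alpha_bar):
    # Invert the forward process when the model predicts v
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v
```

One caveat for the comparison: at timesteps where `alpha_bar` is close to 1, the v target is close to eps, so the two losses are only clearly distinguishable at noisier timesteps.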
Hi, thank you for the great work! I have a question about the forward_cond function under the RefOnlyNoisedUNet class. Why is it necessary to call it in the forward function? What does it achieve?
Thank you!
https://github.com/SUDO-AI-3D/zero123plus/blob/main/diffusers-support/pipeline.py#L134
How can I embed Camera_pos (elevation, azimuth) into the model,
so that I can fine-tune it for my own task?
Hello, I am trying to fine-tune the model. I have some questions; could you please help me answer them?
1. The inference code applies unscale_latents and unscale_image. So in training, do I need to apply scale_image and scale_latents to get the noisy latents? If so, why are the condition images not scaled in pipeline.py, since the two branches in the Reference Attention model use the same UNet?
2. "the model achieves the highest consistency with the conditioning image when the reference latent is scaled by a factor of 5". But I haven't seen this implemented in pipeline.py. Does it mean using 5x the condition image latents?
Results are crazy good, good job!!
I was wondering, is it possible somehow to input more than 1 image (front, side, 45 degrees)?
Thank you!
I would like to train a model on my own camera poses.
Hopefully the training code will be made available.
Segmentation quality can be drastically improved by calling SAM on the mask from rembg
Following #6.
Hello, thank you very much for releasing your code. I couldn't find a specific part in the code that defines the camera pose. May I ask how you trained or inferred to get images of the corresponding 6 poses? Did you create such a dataset yourself? If I want to generate images from other angles, do I need to create my own dataset? Additionally, I would like to know how to change the size of the generated images. I look forward to your reply, thank you!
Great work! I have a question regarding the reference-only attention implementation. In the paper it is written that 'Reference Attention refers to the operation of running the denoising UNet model on an extra reference image and appending the self-attention key and value matrices from the reference image to the corresponding attention layers when denoising the model input'.
I would kindly ask if the "append" operation should be intended as a concatenation between the tensors. I mean: if for example both conditioning and input latents are [1, 4, 32] and the Q,K,V project both in e.g. [1, 4, 5], then the concatenation along first dimension should give us a result of [1,4,5] for Query and [1,8,5] for Key and Value. Self-attention matrix should finally result in [1,4,8].
Is this the intended computation?
Thanks in advance
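The shape arithmetic in the question can be checked with a tiny pure-Python attention, under the assumption that "appending" means concatenating the reference keys and values to the self-attention keys and values along the token axis (the axis of size 4 in the example, not the batch axis). This is a sketch of that reading only; the function and variable names are hypothetical, and the authors' actual implementation lives in ReferenceOnlyAttnProc in pipeline.py.

```python
import math


def attention(q, k, v):
    """q: [Nq][d]; k, v: [Nk][d]. Returns (weights [Nq][Nk], output [Nq][d])."""
    d = len(q[0])
    weights = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights.append([e / total for e in exps])
    out = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return weights, out


def toy_tokens(n, d, seed):
    # Deterministic toy projections standing in for Q, K, or V of n tokens
    return [[math.sin(seed + i * d + j) for j in range(d)] for i in range(n)]


q = toy_tokens(4, 5, seed=0)                                  # queries from the noisy input
k_self, v_self = toy_tokens(4, 5, 1), toy_tokens(4, 5, 2)     # input's own K and V
k_ref, v_ref = toy_tokens(4, 5, 3), toy_tokens(4, 5, 4)       # K and V from the reference

# Append reference K/V along the token axis: 4 + 4 = 8 keys and values
weights, out = attention(q, k_self + k_ref, v_self + v_ref)
```

Here `weights` has 4 rows of 8 entries and `out` has 4 rows of 5 entries, matching the [1,4,8] attention matrix and unchanged [1,4,5] output shape described in the question once a batch dimension of 1 is added.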
Excellent work! I was inspired by it and want to try it with some other work. I understand your decision not to release the training code, but I would appreciate it if you could share more of the details.
Thank you!
Hello Authors, first of all thanks so much for the amazing work.
I am trying to run the zero123plus pipeline with the depth ControlNet. When I use the depth_image, the input image's pixels/colors are completely disregarded by the model. The model still generates (kind of) consistent 6 images with respect to depth, but the initial input image colors are nowhere to be seen.
I tested with your blue and yellow kid chair and one of my test models. I am sharing the pipeline and as well as the input and output results. It'd be a great help for me if you can look into this.
`
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1", custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16
)
pipeline.add_controlnet(ControlNetModel.from_pretrained(
    "sudo-ai/controlnet-zp11-depth-v1", torch_dtype=torch.float16
), conditioning_scale=0.75)
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing='trailing'
)
pipeline.to('cuda:0')
result = pipeline(cImage, depth_image=dImage, num_inference_steps=35).images[0]
`
How can I control which angles it generates?
When running the example code, an error occurs:
...
result = pipeline(cond, num_inference_steps=75).images[0] # it is Zero123PlusPipeline
File ".../.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File ".../.cache/huggingface/modules/diffusers_modules/local/sudo-ai--zero123plus-pipeline/5006f82f085a8a5e1131440c327cdfde09b9c41a/pipeline.py", line 364, in call
encoder_hidden_states = self._encode_prompt(
File ".../.local/lib/python3.8/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 249, in _encode_prompt
prompt_embeds = torch.cat([prompt_embeds_tuple[1], prompt_embeds_tuple[0]])
I think this is due to changes in the diffusers code. However, the supported diffusers version (0.20) is somewhat outdated. Diffusers is under heavy development and will advance further. Thus, I kindly ask you for a hint on how to fix the code (I could do it on my own), or to supply the fix.
Thank you