
marigold's Introduction

Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

CVPR 2024 (Oral, Best Paper Award Candidate)

This repository represents the official implementation of the paper titled "Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation".

Website | Paper | Hugging Face (LCM) Space | Hugging Face (LCM) Model | Open In Colab | License

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler

We present Marigold, a diffusion model, and associated fine-tuning protocol for monocular depth estimation. Its core principle is to leverage the rich visual knowledge stored in modern generative image models. Our model, derived from Stable Diffusion and fine-tuned with synthetic data, can zero-shot transfer to unseen data, offering state-of-the-art monocular depth estimation results.

teaser

📢 News

2024-05-28: Training code is released.
2024-03-23: Added LCM v1.0 for faster inference - try it out in the online demo.
2024-03-04: Accepted to CVPR 2024.
2023-12-22: Contributed to Diffusers community pipeline.
2023-12-19: Updated license to Apache License, Version 2.0.
2023-12-08: Added the Hugging Face online demo - try it out with your images for free!
2023-12-05: Added the Google Colab notebook - dive deeper into our inference pipeline!
2023-12-04: Added paper and inference code (this repository).

🚀 Usage

We offer several ways to interact with Marigold:

  1. We integrated Marigold Pipelines into diffusers 🧨. Check out many exciting usage scenarios in this diffusers tutorial; a minimal usage sketch is also given after this list.

  2. A free online interactive demo is available on Hugging Face Spaces (kudos to the HF team for the GPU grant).

  3. Run the demo locally (requires a GPU and nvidia-docker2; see the Installation Guide):

    1. Paper version: docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all registry.hf.space/toshas-marigold:latest python app.py
    2. LCM version: docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all registry.hf.space/prs-eth-marigold-lcm:latest python app.py
  4. An extended demo is available on Google Colab.

  5. If you just want to see the examples, visit our gallery.

  6. Finally, local development instructions with this codebase are given below.
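
A minimal sketch of the diffusers route mentioned in item 1 above. This assumes diffusers >= 0.28 (which ships MarigoldDepthPipeline) and the prs-eth/marigold-depth-lcm-v1-0 checkpoint; the image path is a placeholder, and the diffusers tutorial remains the authoritative reference:

import diffusers
import torch

# Load the Marigold depth pipeline (checkpoint id and fp16 variant are assumptions).
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("path/to/your_image.jpg")  # placeholder path
depth = pipe(image)  # affine-invariant depth prediction in [0, 1]

# Colorize the prediction and save it.
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("depth_colored.png")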

🛠️ Setup

The inference code was tested on:

  • Ubuntu 22.04 LTS, Python 3.10.12, CUDA 11.7, GeForce RTX 3090 (pip, Mamba)
  • CentOS Linux 7, Python 3.10.4, CUDA 11.7, GeForce RTX 4090 (pip)
  • Windows 11 22H2, Python 3.10.12, CUDA 12.3, GeForce RTX 3080 (Mamba)
  • macOS 14.2, Python 3.10.12, Apple M1 16G (pip)

🪧 A Note for Windows users

We recommend running the code in WSL2:

  1. Install WSL following the installation guide.
  2. Install CUDA support for WSL following the installation guide.
  3. Find your drives in /mnt/<drive letter>/; check the WSL FAQ for more details. Navigate to the working directory of your choice.

📦 Repository

Clone the repository (requires git):

git clone https://github.com/prs-eth/Marigold.git
cd Marigold

💻 Dependencies

We provide several ways to install the dependencies.

  1. Using Mamba, which can be installed together with Miniforge3.

    Windows users: Install the Linux version into the WSL.

    After the installation, Miniforge needs to be activated first: source /home/$USER/miniforge3/bin/activate.

    Create the environment and install dependencies into it:

    mamba env create -n marigold --file environment.yaml
    conda activate marigold
  2. Using pip: Alternatively, create a Python native virtual environment and install dependencies into it:

    python -m venv venv/marigold
    source venv/marigold/bin/activate
    pip install -r requirements.txt

Keep the environment activated before running the inference script. Activate the environment again after restarting the terminal session.

🏃 Testing on your images

📷 Prepare images

  1. Use selected images from our paper:

    bash script/download_sample_data.sh
  2. Or place your images in a directory, for example, under input/in-the-wild_example, and run the following inference command.

🚀 Run inference with LCM (faster)

The LCM checkpoint is distilled from our original checkpoint for faster inference (by reducing the number of denoising steps). The inference steps can be as few as 1 (default) and up to 4. Run with the default LCM setting:

python run.py \
    --input_rgb_dir input/in-the-wild_example \
    --output_dir output/in-the-wild_example_lcm

🎮 Run inference with DDIM (paper setting)

This setting corresponds to our paper; for academic comparison, please run with this setting.

python run.py \
    --checkpoint prs-eth/marigold-v1-0 \
    --denoise_steps 50 \
    --ensemble_size 10 \
    --input_rgb_dir input/in-the-wild_example \
    --output_dir output/in-the-wild_example

You can find all results in output/in-the-wild_example. Enjoy!

⚙️ Inference settings

The default settings are optimized for the best result. However, the behavior of the code can be customized:

  • Trade-offs between accuracy and speed (for both options, larger values result in better accuracy at the cost of slower inference):

    • --ensemble_size: Number of inference passes in the ensemble. For LCM, ensemble_size is more important than denoise_steps. Default: 10; 5 for LCM.
    • --denoise_steps: Number of denoising steps of each inference pass. For the original (DDIM) version, it's recommended to use 10-50 steps, while for LCM 1-4 steps. When unassigned (None), the default setting is read from the model config. Default: None (formerly 10, or 4 for LCM).
  • By default, the inference script resizes input images to the processing resolution, and then resizes the prediction back to the original resolution. This gives the best quality, as Stable Diffusion, from which Marigold is derived, performs best at 768x768 resolution.

    • --processing_res: the processing resolution; set to 0 to process the input resolution directly. When unassigned (None), the default setting is read from the model config. Default: None (formerly 768).
    • --output_processing_res: produce output at the processing resolution instead of upsampling it to the input resolution. Default: False.
    • --resample_method: the resampling method used to resize images and depth predictions. This can be one of bilinear, bicubic, or nearest. Default: bilinear.
  • --half_precision or --fp16: Run with half-precision (16-bit float) for faster speed and reduced VRAM usage, but this might lead to suboptimal results.

  • --seed: Random seed can be set to ensure additional reproducibility. Default: None (unseeded). Note: forcing --batch_size 1 helps to increase reproducibility. To ensure full reproducibility, deterministic mode needs to be used.

  • --batch_size: Batch size of repeated inference. Default: 0 (best value determined automatically).

  • --color_map: Colormap used to colorize the depth prediction. Default: Spectral. Set to None to skip colored depth map generation.

  • --apple_silicon: Use Apple Silicon MPS acceleration.
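
For example, several of the above options can be combined in a single call; the values below and the LCM checkpoint name are illustrative only:

python run.py \
    --checkpoint prs-eth/marigold-lcm-v1-0 \
    --denoise_steps 4 \
    --ensemble_size 5 \
    --half_precision \
    --resample_method bicubic \
    --seed 2024 \
    --input_rgb_dir input/in-the-wild_example \
    --output_dir output/in-the-wild_example_custom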

⬇ Checkpoint cache

By default, the checkpoint is stored in the Hugging Face cache. The HF_HOME environment variable defines its location and can be overridden, e.g.:

export HF_HOME=$(pwd)/cache

Alternatively, use the following script to download the checkpoint weights locally:

bash script/download_weights.sh marigold-v1-0
# or LCM checkpoint
bash script/download_weights.sh marigold-lcm-v1-0

At inference, specify the checkpoint path:

python run.py \
    --checkpoint checkpoint/marigold-v1-0 \
    --denoise_steps 50 \
    --ensemble_size 10 \
    --input_rgb_dir input/in-the-wild_example \
    --output_dir output/in-the-wild_example

🦿 Evaluation on test datasets

Install additional dependencies:

pip install -r requirements+.txt -r requirements.txt

Set the data directory variable (also needed in the evaluation scripts) and download the evaluation datasets into the corresponding subfolders:

export BASE_DATA_DIR=<YOUR_DATA_DIR>  # Set target data directory

wget -r -np -nH --cut-dirs=4 -R "index.html*" -P ${BASE_DATA_DIR} https://share.phys.ethz.ch/~pf/bingkedata/marigold/evaluation_dataset/

Run inference and evaluation scripts, for example:

# Run inference
bash script/eval/11_infer_nyu.sh

# Evaluate predictions
bash script/eval/12_eval_nyu.sh

Note: although the seed has been set, the results might still be slightly different on different hardware.

🏋️ Training

Based on the previously created environment, install extended requirements:

pip install -r requirements++.txt -r requirements+.txt -r requirements.txt

Set environment parameters for the data directory:

export BASE_DATA_DIR=YOUR_DATA_DIR  # directory of training data
export BASE_CKPT_DIR=YOUR_CHECKPOINT_DIR  # directory of pretrained checkpoint

Download the Stable Diffusion v2 checkpoint into ${BASE_CKPT_DIR}.
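
One possible way to do this is with the Hugging Face CLI; this is a hedged sketch, and the exact target subfolder expected by the training config is an assumption:

huggingface-cli download stabilityai/stable-diffusion-2 --local-dir ${BASE_CKPT_DIR}/stable-diffusion-2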

Prepare the Hypersim and Virtual KITTI 2 datasets and save them into ${BASE_DATA_DIR}. Please refer to this README for Hypersim preprocessing.

Run the training script:

python train.py --config config/train_marigold.yaml

Resume from a checkpoint, e.g.

python train.py --resume_run output/marigold_base/checkpoint/latest

Evaluating results

Only the U-Net is updated and saved during training. To use the inference pipeline with your training result, replace the unet folder in the Marigold checkpoint with the one from your training output folder. Then refer to this section for evaluation.
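
A hedged shell sketch of that swap (the paths are illustrative and depend on where the checkpoint was downloaded and on your run name):

# Copy the base checkpoint, then replace its U-Net with the trained one (illustrative paths)
cp -r checkpoint/marigold-v1-0 checkpoint/marigold-finetuned
rm -rf checkpoint/marigold-finetuned/unet
cp -r output/marigold_base/checkpoint/latest/unet checkpoint/marigold-finetuned/unet

Then pass --checkpoint checkpoint/marigold-finetuned to run.py.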

Note: Although random seeds have been set, the training result might be slightly different on different hardware. It's recommended to train without interruption.

✏️ Contributing

Please refer to this instruction.

🤔 Troubleshooting

  • Problem: (Windows) Invalid DOS bash script on WSL
    Solution: Run dos2unix <script_name> to convert the script format.
  • Problem: (Windows) Error on WSL: Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
    Solution: Run export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH

🎓 Citation

Please cite our paper:

@InProceedings{ke2023repurposing,
      title={Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation},
      author={Bingxin Ke and Anton Obukhov and Shengyu Huang and Nando Metzger and Rodrigo Caye Daudt and Konrad Schindler},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2024}
}

🎫 License

This work is licensed under the Apache License, Version 2.0 (as defined in the LICENSE).

By downloading and using the code and model you agree to the terms in the LICENSE.


marigold's People

Contributors

markkua, nandometzger, toshas


marigold's Issues

Even the LCM is slow; is that expected, or is something wrong?

Hello dear Marigold developers!

This is my first time doing this

I ran the following command on my PC with an RTX 3090:

python run.py \
    --denoise_steps 4 \
    --ensemble_size 5 \
    --input_rgb_dir input/in-the-wild_example \
    --output_dir output/in-the-wild_example_lcm

I prepared a PNG sequence from a 1-minute video (24 fps) at 768 x 768 px, and it will take about 3 hours to process all the frames. Is this expected, or did I do something wrong? I tried the LCM demo on Hugging Face and it was many times faster.

Train a ControlNet plugin instead of full-scale fine-tuning?

This work is very inspiring and exciting. Marigold makes huge progress in discriminative diffusion models by showing that general-purpose pre-training can benefit later fine-tuning for discrimination, so that we no longer need to train discriminative diffusion models from scratch.
Now the problem is the FULL-SCALE fine-tuning. In fact, there are alternative approaches in generative diffusion models. For example, ControlNet keeps the backbone U-Net frozen and trains a plugin instead, where the plugin can steer the behavior of the backbone towards certain purposes. This approach is more efficient and more flexible.
So I wonder if you could train a plugin-Marigold with all the other settings unchanged? If this approach can be demonstrated feasible (or even infeasible), the community would gain very useful insights.

Colab NameError: name 'pipe' is not defined

This is the error I am getting while using the Google Colab notebook. Please help:

---> 29         pipeline_output = pipe(
     30             input_image,
     31             denoising_steps=10,     # optional

NameError: name 'pipe' is not defined

About evaluation protocol

Hi, I'm new to this field, and I am impressed by your outstanding work. Thank you for sharing your code! I have a question regarding the evaluation protocol details.

In the paper's evaluation protocol section, the evaluation method of Marigold is described as follows:
"When we first align the estimated merged prediction m to the ground truth d with the least squares fitting, this step gives us the absolute aligned depth map a = m × s + t, in the same units as the ground truth."

Does this mean that the process described above proceeds in the following sequence?

(1) Obtain the depth map 'm' from the Diffusion model.
(2) Refine 'm' through least squares fitting with the ground truth.
(3) Estimate s and t to obtain the aligned depth map, a = m × s + t.

Thank you for reading my question. I am confused about whether the equation a = m × s + t is directly derived through least squares fitting, or if it requires separate calculation. If it's not too much trouble, could you please share the code for the evaluation protocol? Thank you!
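
For reference, a minimal sketch of the scale-and-shift least-squares alignment described above; this is an illustration under the stated assumptions, not the official evaluation code:

import numpy as np

def align_depth_least_squares(m, d, valid):
    # Fit a = m * s + t to the ground truth d over valid pixels via least squares.
    m_v, d_v = m[valid], d[valid]
    A = np.stack([m_v, np.ones_like(m_v)], axis=1)    # [N, 2] design matrix
    (s, t), *_ = np.linalg.lstsq(A, d_v, rcond=None)  # closed-form least-squares solution
    return m * s + t                                  # aligned depth in ground-truth units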

"--seed" not making results reproducible

I did a test: I duplicated a JPEG in a directory, so there are just two image files, the same file with different names. I ran this with --seed set, and each output was different. It seems a random seed is being generated for each image regardless of this argument, so specifying a seed does NOT make the results reproducible.

How to convert depth to density

Thanks for your great work! This is currently the best and most effective model I have ever seen. I am not familiar with depth estimation work, so how can I convert a depth map ranging from 0 to 1 into a density map ranging from 0 to 1 in order to match the output of other models?

Out of memory when training with RTX4090, seeking guidance on training details

I am currently attempting to reproduce the training process described in your paper using Stable Diffusion v2. However, my RTX 4090 ran out of memory when training with batch size 32, as mentioned in the paper. I use a resolution of 768x768 (the same as Stable Diffusion v2) and am uncertain whether this setting is appropriate.

torchvision is missing from requirements.txt

In order to use the provided run.py, torchvision is required, but it is missing from requirements.txt. The repo settings forbid me from submitting any branches or pull requests, so I guess you'll fix it yourself at some point.

Fine-tuning on another domain; the validation is a bit noisy

Awesome work!! I'm trying to use the same fine-tuning protocol on another domain.

However, I am getting noisy validation results. The training lasted for 2 days on an A100.

Is there any chance I can get some insights to improve the results?

Best

Any plan to train specialized ControlNet/ControlLora?

Thanks for the work you've done on Marigold. This is a revolutionary depth map estimator with extreme detail compared to previous methods. IMO, one noticeable application of it is as a ControlNet preprocessor. The problem, though, is that the current depth ControlNet was only trained on a less detailed dataset, so it can't capture all the details from depth maps generated by Marigold.

Predicted Depth to Colorful Point Cloud

Hi,

Nice work! I always wanted to compare the predicted depth with the colorful unprojected point cloud. I compared Marigold, ZoeDepth, and OmniDatav2 on the following image.

example_5

Marigold:
image

OmniDatav2:
image

ZoeN:
image

It's interesting that the previous methods tend to predict flatter geometry, while Marigold preserves it better :-)

Try to use it in a real-world scenario

Thank you for sharing. I tried to use it in a real scene, but the depth map I got is very strange after converting it into a point cloud. I would like to ask how you converted the depth map into a point cloud file.

Video depth map deflickering

Depth maps generated from video frames by Marigold are obviously flickering. While this can be partially fixed by setting a high n_repeat and a fixed seed, such a method is inefficient IMO. Is there any way to use the previous video frame/depth map to condition the diffusion process?

How to write depthmap in pfm

Thanks for your amazing contributions.
By the way, my code that generates 3D point clouds from a depth map needs the depth map in PFM format, so I want to write depth_pred as a PFM file. How can I do this?

Stable Diffusion fine-tuning problem

During fine-tuning for depth estimation conditioned on the input image, how do you deal with the text prompt required by the original pre-trained text-to-image Stable Diffusion model?

Acknowledgment and Concerns: Training Code and License Ambiguities

First and foremost, I'd like to express my sincere appreciation for the remarkable work you've done with Marigold. I view this project as a significant breakthrough in the field of computer graphics.

This endeavor has the potential to extend beyond its current capabilities with depth generation, opening doors for diverse applications given the right dataset. However, there is a critical need for training code to adapt this model for tasks such as generating normal, displacement, metallic maps, and more.

Regrettably, the project's license is presently unclear, posing a hindrance to its utilization in our projects. The ambiguity surrounding whether the generated depth maps can be employed commercially, coupled with the restriction on the commercial use of the code, renders the project impractical for our needs.

Maxing out VRAM at 24GB

I was told that Marigold maxing out my 24 GB of VRAM isn't supposed to happen and was told to post my settings here.

I have a bat file that I run with the following:

@echo off
call venv/marigold/Scripts/activate.bat
python run.py --checkpoint checkpoint/Marigold_v1_merged --input_rgb_dir Input --output_dir Output
pause

How to generate the Point Clouds shown in the paper ?

Can you share how the depth maps were used to generate the point clouds?
I checked this issue (#6), but it doesn't specify any method for generating the point clouds and visualizing them.

Also, is it possible to use multiple images showing different individual parts of a scene and generate depth maps from them, and then use those depth maps to connect the different point clouds to form the whole scene without stitching?

problems about training dataset

Hi authors! I noticed that during training you probabilistically choose either the KITTI or the Hypersim dataset and draw the mini-batch from it. Could you explain the reason for that? Why don't you resize or crop the training data to the same size and mix the two datasets when drawing the mini-batch?

About the predictions of reproduced model

Hi, we reproduced Marigold according to the paper. The first figure is the depth predicted by the model we reproduced. The second figure is the prediction of Marigold's official weights. Compared with Marigold's prediction, our prediction is not smooth; in fact, there is a lot of noise. Besides, the model seems to only focus on the foreground, and the depth prediction range is small. Do you have any suggestions on this?
ec90ff252012665d3f7d0df48cfd5785
7d6cae4ec70b3bb90c42c001029e1e2d

Question on the output

Is the npy file a numpy array? If so, what units are the values in? If the model is trained on synthetic data, and if the npy is what I think it is, does that mean it'll be in real-world units or some other linearly scaled value? BTW it works great! It takes a long time though.

Here's my face:
Bryan_1k_pred

Pretty amazing it's able to figure out all the hair. Still trying to fully understand from the paper how it works.

About training on real images

Thanks for sharing your code and model. The depth visualization is really awesome, especially the sharp edges.

I noticed that both training datasets (Hypersim and Virtual KITTI) are synthetic. Have you ever tried training on real datasets?

Request: Need models in onnx and .pt format (not just .bin and .config)

I am requesting the model in another format because I cannot convert it without the proper model configuration file (I've tried). I need it in ONNX or .pt format, specifically for a Unity application called Depthviewer. Can we make this happen?

Here is a list of models and their available formats; as you can see, depth-anything has ONNX, and I was hoping Marigold could provide this as well: https://airtable.com/appjWiS91OlaXXtf0/shrchKmROzpsq0HFw/tblviBOLphAw5Befd

Can Marigold be used as segmentation?

Awesome work!
The foreground effect of the depth estimation is very good, even better than SAM. Have you considered applying it to the field of saliency detection?
image (5)

LCM support

I was wondering if LCM support would be possible.

How should I set up the dataset

I set up a new folder in the root, named kitti_data, and placed the KITTI dataset in it.
But when I run:
bash script/eval/21_infer_kitti.sh
there is a KeyError: "filename './2011_09_26/2011_09_26_drive_0002_sync/image_02/data/0000000069.png' not found".

I'm a rookie, you know, and I can't figure this out.

Possible Erroneous Depth Map Normalization at Inference Time

Hello everybody,

I very much appreciate the work you all have done on Marigold. Leveraging a strong diffusion prior, like Stable Diffusion v2, to fine-tune the model on the task of monocular depth estimation with exclusively synthetic data indeed allows for strong zero-shot generalization to other domains.

I was experimenting with the Marigold model and its components, and seem to have stumbled upon a slight error in the normalization operation at the end of the inference function ‘single_infer’. After generating the latent encoding depth_latent, decoding, and clipping the values to the usual diffusion ranges [-1,1], the depth map is finally normalized to [0,1]. However, it appears that the wrong operation was applied. The formula depth = depth * 2.0 - 1.0 is meant to normalize from ranges [0,1] to [-1,1]. Instead, it should be depth = (depth+1) / 2.0.

This causes some generated depth maps to be in the ranges of [-3,1] before applying the ensemble optimization step. With regards to the complete inference pipeline, I presume that this doesn't harm the model's performance since the ensembling step normalizes the aggregated depth map to [0,1]. However, using ‘single_infer’ on its own may lead to undesired behavior.
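
To illustrate the two mappings discussed above (this snippet is illustrative, not repository code):

import torch

depth = torch.tensor([-1.0, 0.0, 1.0])  # decoded depth, clipped to [-1, 1]
current = depth * 2.0 - 1.0             # maps [0, 1] -> [-1, 1]; here it yields [-3, -1, 1]
proposed = (depth + 1.0) / 2.0          # maps [-1, 1] -> [0, 1] as intended: [0, 0.5, 1]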

Please let me know if I am missing something.

License

Does the license mean that the software itself can't be used for commercial purposes (as in, I can't sell it or sell products using the repo), or does it mean the depth maps themselves can't be used within a project (as in, the depth maps couldn't be used to create an asset in VFX or gaming for commercial purposes)?

The main difference between this paper and other dense visual prediction methods utilizing the diffusion model

Hello, thank you for your excellent work. I've noticed several papers on dense visual prediction utilizing the diffusion model, and I'm interested in understanding how your paper differs from them. Is the main distinction that the other papers do not utilize the pretrained SD? The discussion in your paper seems a bit brief. If you could provide further clarification, it would be greatly appreciated. Thank you!

Different results from huggingface demo and github repo

Hi, I am trying to produce a depth map from a single image using Marigold. I have tried the Hugging Face demo and also used the GitHub repo to run the model locally with the same single photo, but I got different results:
114_depth_colored
114_pred_colored
The former is from the Hugging Face demo, while the latter is from the GitHub repo running locally. I want to know if there is any difference, such as the config, between them.

Broken Arguments

https://github.com/prs-eth/Marigold/blob/cc78ff3033f5804cadf8523ed11b6bbf0d025077/run.py#L64C16-L64C16
The bitwise NOT operator ~ should not be used with boolean values the same way ! is used in other languages.
It turns it into a number and doesn't work right when checked later.
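
A small illustration of the gotcha described above:

flag = True
print(~flag)        # -2: bitwise NOT coerces the bool to an int
print(bool(~flag))  # True: a check like "if ~flag:" is truthy even when flag is True
print(not flag)     # False: "not" is the correct boolean negation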

Worse than that:
When you fix the above issue and it actually returns false when checking resize_input, "image" never actually gets assigned and will throw an error.

I'm too lazy to make a pull request though.

Question about applying SDS Loss using Marigold

Hi, thanks for your interesting work!
I'm trying to apply an SDS loss (a loss used in text/image-to-3D) using Marigold in my work, and I have some questions about the training depth data details.
First, what is the numerical range of the depth map? Is nearer depth a smaller number? Did you normalize the training depth to [-1,1]?
Second, how do you process the single-channel depth into three channels for the VAE encoder, just by repeating it?
Below is my code for the SDS loss; it currently has some problems.

    def sds_loss(self, pred_depth, rgb_in, bs, view_num, guidance_scale=100, as_latent=False, grad_scale=1,
                 save_guidance_path=None):
        if self.alphas is None:
            self.alphas = self.scheduler.alphas_cumprod.to(self.device)
        """
        pred_depth: the predicted depth, normalized to [-1,1], and nearer is smaller, size of (bs,1,h,w)
        rgb_in: the conditioned image, normalized to [-1,1], size of (bs,3,h,w)
        bs: batch size
        view_num: used for reshaping
        """
        device = self.device
        # Encode image
        pred_depth = pred_depth.repeat(1, 3, 1, 1)
        rgb_latent = self.encode_rgb(rgb_in)
        depth_latent = self.encode_rgb(pred_depth)
        # Set timesteps
        t = torch.randint(self.min_step, self.max_step + 1, (bs,), dtype=torch.long,
                          device=self.device)
        t = t.unsqueeze(-1).repeat(1, view_num).view(-1)

        with torch.no_grad():
            # Initial depth map (noise)
            latent_noise = torch.randn(
                rgb_latent.shape,
                device=device,
                dtype=self.dtype,
                generator=None,
            )  # [B, 4, h, w]

            latents_noisy = self.scheduler.add_noise(depth_latent, latent_noise, t)
            # pred noise
            uncon_latent=torch.cat([torch.zeros_like(rgb_latent).to(rgb_latent), latents_noisy], dim=1)
            con_latent = torch.cat([rgb_latent, latents_noisy], dim=1)
            latent_model_input=torch.cat([uncon_latent,con_latent],dim=0)
            tt = torch.cat([t] * 2)
            # Batched empty text embedding
            if self.empty_text_embed is None:
                self.encode_empty_text()
            batch_empty_text_embed = self.empty_text_embed.repeat(
                (latent_model_input.shape[0], 1, 1)
            ).to(device)  # [B, 2, 1024]

            noise_pred = self.unet(
                latent_model_input, tt, encoder_hidden_states=batch_empty_text_embed
            ).sample  # [B, 4, h, w]

            # perform guidance (high scale from paper!)
            noise_pred_uncond, noise_pred_pos = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_pos - noise_pred_uncond)

        # w(t), sigma_t^2
        w = (1 - self.alphas[t])
        grad = grad_scale * w[:, None, None, None] * (noise_pred - latent_noise)
        grad = torch.nan_to_num(grad)

        targets = (depth_latent - grad).detach()
        loss = 0.5 * F.mse_loss(depth_latent.float(), targets, reduction='sum') / depth_latent.shape[0]

        return loss

Clipping is Removing Valuable Depth Estimation Values, Resulting in Squished Depth Maps

Hello everybody,

I have come across this issue while experimenting with the VAE depth decoder ‘decode_depth’ and the single inference function ‘single_infer’. The VAE decoder is not bound to the ranges of [-1,1]. In many instances, for a given image (normalized to the Stable Diffusion v2 native resolution), its decoded latent results in min-max values of around [-1.5, 1.4]. These ranges differ with respect to the image contents, aspect ratio, and in the case of inference, the initial isotropic noise.

At the end of the inference function ‘single_infer’, the decoded generated depth map is simply clipped to [-1,1]. This effectively removes valuable depth information from the generated value distribution, and thus assigns the depth value of 0 (or 1, respectively) to all values outside of [-1,1]. Intuitively, clipping results in a squished depth map. Instead, to retain the complete generated depth value distribution, it is best to swap the clipping and shifting operations for min-max normalization to [0,1]:
min_depth = torch.min(depth)
max_depth = torch.max(depth)
depth = (depth - min_depth) / (max_depth - min_depth)
depth = torch.clamp(depth, 0, 1)

This squishing also affects the final aggregated depth map, as some generated depth maps have decoded ranges closer to [-1,1], retaining these extreme depth values, while others do not. Usually, min-max normalization is not a fix in these kinds of situations. However, since the task is monocular depth estimation, the closest and farthest points must be associated with the values 0 and 1 respectively.

Please let me know if I am missing something.
Best.

About Recovering the Depth with Metric Scale

Thank you for your outstanding work! It is very impressive to deploy a diffusion pipeline into monocular depth estimation.

As stated in the paper, the model performs affine-invariant depth estimation. Since the depth normalization is not revertible, I wonder what I can do if I want to recover depth with a metric scale.

In other words, all affine-invariant depth has a global scale or offset factor. According to Eq. 3 of your paper, the d2 and d98 depth values come from the given image, which is instance-independent I guess. Is there any method to recover the true depth with the assistance of extra information, like camera intrinsics or a stereo baseline?

Try to convert depth maps to normal maps

Hello, thank you for your brilliant work!
I am a student from USTC and a novice in 3D vision. I ran your code on my own cases and it works well. Building on your impressive results, I want to do some exploration related to normal maps. In your paper, I notice that you mention "colored as normals" in Figure 5. I wonder whether those are authentic surface normals of the kind we usually use?
If yes, how can I get those normals based on your code?
Thanks a lot!!! :)

Request: Make model available in Onnx and .pt format.

I am working with a custom version of a program called depthviewer and am currently trying to help the dev integrate the ONNX version of depth-anything (the TikTok model that was just released). He now has Marigold working, but it runs outside of Unity.

I have found Marigold's results to be much superior to depth-anything for single images when converting to 3D. The issue is that I have tried Python conversion scripts to convert to a .pt and they do not work due to the missing config.

Is there any way you can release the model in ONNX and also .pt format?

Any plan of releasing training code?

Thank you for the great work.

I am planning to train this model on depth images from different domains and also try training it on other problem statements, like semantic segmentation.

So it would be very helpful if you released the training code as well.

Regarding the Stochastic Nature of the Stable Diffusion v2 VAE's Encoder

Hello everybody,

The Stable Diffusion v2 VAE encoder outputs a mean and log variance of a Gaussian distribution, from which the latent encoding is drawn. In the field of generative AI, this process adds another stochastic element to the sampling process, resulting in a greater variety of generated images.

For the case of Marigold, when applying the Stable Diffusion Encoder, the reparameterization trick is made deterministic by directly taking the mean. I presume this is a step to remove randomness from the sampling process, as the task is to estimate depth maps, and we are interested in minimizing the variance of the generated maps as much as possible. On the other hand, Marigold already performs an optimization ensemble step, which might benefit from a variety of feasible estimates.

Was this a deliberate change, or would it have been something like rgb_latent = (mean + torch.exp(0.5 * logvar)*torch.randn(mean.shape).to(self.device)) * self.rgb_latent_scale_factor?

Thanks
