isl-org / midas Goto Github PK

Code for robust monocular depth estimation described in "Ranftl et. al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"

License: MIT License

Python 40.16% Dockerfile 0.29% Java 35.07% Swift 18.10% Ruby 0.13% Shell 1.15% CMake 2.05% C++ 3.06%

deeplearning monocular-depth-estimation single-image-depth-prediction

midas's Introduction

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

This repository contains code to compute depth from a single image. It accompanies our paper:

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, Vladlen Koltun

and our preprint:

Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

For the latest release MiDaS 3.1, a technical report and video are available.

MiDaS was trained on up to 12 datasets (ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, NYU Depth V2) with multi-objective optimization. The original model that was trained on 5 datasets (MIX 5 in the paper) can be found here. The figure below shows an overview of the different MiDaS models; the bubble size scales with number of parameters.

Setup

Pick one or more models and download the corresponding weights to the weights folder:

MiDaS 3.1

For highest quality: dpt_beit_large_512
For moderately less quality, but better speed-performance trade-off: dpt_swin2_large_384
For embedded devices: dpt_swin2_tiny_256, dpt_levit_224
For inference on Intel CPUs, OpenVINO may be used for the small legacy model: openvino_midas_v21_small .xml, .bin

MiDaS 3.0: Legacy transformer models dpt_large_384 and dpt_hybrid_384

MiDaS 2.1: Legacy convolutional models midas_v21_384 and midas_v21_small_256

Set up dependencies:

conda env create -f environment.yaml
conda activate midas-py310

optional

For the Next-ViT model, execute

git submodule add https://github.com/isl-org/Next-ViT midas/external/next_vit

For the OpenVINO model, install

pip install openvino

Usage

Place one or more input images in the folder input.
Run the model with
```
python run.py --model_type <model_type> --input_path input --output_path output
```
where <model_type> is chosen from dpt_beit_large_512, dpt_beit_large_384, dpt_beit_base_384, dpt_swin2_large_384, dpt_swin2_base_384, dpt_swin2_tiny_256, dpt_swin_large_384, dpt_next_vit_large_384, dpt_levit_224, dpt_large_384, dpt_hybrid_384, midas_v21_384, midas_v21_small_256, openvino_midas_v21_small_256.
The resulting depth maps are written to the output folder.

optional

By default, the inference resizes the height of input images to the size of a model to fit into the encoder. This size is given by the numbers in the model names of the accuracy table. Some models do not only support a single inference height but a range of different heights. Feel free to explore different heights by appending the extra command line argument --height. Unsupported height values will throw an error. Note that using this argument may decrease the model accuracy.
By default, the inference keeps the aspect ratio of input images when feeding them into the encoder if this is supported by a model (all models except for Swin, Swin2, LeViT). In order to resize to a square resolution, disregarding the aspect ratio while preserving the height, use the command line argument --square.

via Camera

If you want the input images to be grabbed from the camera and shown in a window, leave the input and output paths away and choose a model type as shown above:

python run.py --model_type <model_type> --side

The argument --side is optional and causes both the input RGB image and the output depth map to be shown side-by-side for comparison.

via Docker

Make sure you have installed Docker and the NVIDIA Docker runtime.
Build the Docker image:
```
docker build -t midas .
```
Run inference:
```
docker run --rm --gpus all -v $PWD/input:/opt/MiDaS/input -v $PWD/output:/opt/MiDaS/output -v $PWD/weights:/opt/MiDaS/weights midas
```
This command passes through all of your NVIDIA GPUs to the container, mounts the input and output directories and then runs the inference.

via PyTorch Hub

The pretrained model is also available on PyTorch Hub

via TensorFlow or ONNX

See README in the tf subdirectory.

Currently only supports MiDaS v2.1.

via Mobile (iOS / Android)

See README in the mobile subdirectory.

via ROS1 (Robot Operating System)

See README in the ros subdirectory.

Currently only supports MiDaS v2.1. DPT-based models to be added.

Accuracy

We provide a zero-shot error $\epsilon_d$ which is evaluated for 6 different datasets (see paper). Lower error values are better. $\color{green}{\textsf{Overall model quality is represented by the improvement}}$ (Imp.) with respect to MiDaS 3.0 DPT_L-384. The models are grouped by the height used for inference, whereas the square training resolution is given by the numbers in the model names. The table also shows the number of parameters (in millions) and the frames per second for inference at the training resolution (for GPU RTX 3090):

MiDaS Model	DIW ^WHDR	Eth3d ^AbsRel	Sintel ^AbsRel	TUM ^δ1	KITTI ^δ1	NYUv2 ^δ1	$\color{green}{\textsf{Imp.}}$ ^%	Par. ^M	FPS
Inference height 512
v3.1 BEiT_L-512	0.1137	0.0659	0.2366	6.13	11.56*	1.86*	$\color{green}{\textsf{19}}$	345	5.7
v3.1 BEiT_L-512$\tiny{\square}$	0.1121	0.0614	0.2090	6.46	5.00*	1.90*	$\color{green}{\textsf{34}}$	345	5.7

Inference height 384
v3.1 BEiT_L-512	0.1245	0.0681	0.2176	6.13	6.28*	2.16*	$\color{green}{\textsf{28}}$	345	12
v3.1 Swin2_L-384$\tiny{\square}$	0.1106	0.0732	0.2442	8.87	5.84*	2.92*	$\color{green}{\textsf{22}}$	213	41
v3.1 Swin2_B-384$\tiny{\square}$	0.1095	0.0790	0.2404	8.93	5.97*	3.28*	$\color{green}{\textsf{22}}$	102	39
v3.1 Swin_L-384$\tiny{\square}$	0.1126	0.0853	0.2428	8.74	6.60*	3.34*	$\color{green}{\textsf{17}}$	213	49
v3.1 BEiT_L-384	0.1239	0.0667	0.2545	7.17	9.84*	2.21*	$\color{green}{\textsf{17}}$	344	13
v3.1 Next-ViT_L-384	0.1031	0.0954	0.2295	9.21	6.89*	3.47*	$\color{green}{\textsf{16}}$	72	30
v3.1 BEiT_B-384	0.1159	0.0967	0.2901	9.88	26.60*	3.91*	$\color{green}{\textsf{-31}}$	112	31
v3.0 DPT_L-384	0.1082	0.0888	0.2697	9.97	8.46	8.32	$\color{green}{\textsf{0}}$	344	61
v3.0 DPT_H-384	0.1106	0.0934	0.2741	10.89	11.56	8.69	$\color{green}{\textsf{-10}}$	123	50
v2.1 Large₃₈₄	0.1295	0.1155	0.3285	12.51	16.08	8.71	$\color{green}{\textsf{-32}}$	105	47

Inference height 256
v3.1 Swin2_T-256$\tiny{\square}$	0.1211	0.1106	0.2868	13.43	10.13*	5.55*	$\color{green}{\textsf{-11}}$	42	64
v2.1 Small₂₅₆	0.1344	0.1344	0.3370	14.53	29.27	13.43	$\color{green}{\textsf{-76}}$	21	90

Inference height 224
v3.1 LeViT₂₂₄$\tiny{\square}$	0.1314	0.1206	0.3148	18.21	15.27*	8.64*	$\color{green}{\textsf{-40}}$	51	73

* No zero-shot error, because models are also trained on KITTI and NYU Depth V2
$\square$ Validation performed at square resolution, either because the transformer encoder backbone of a model does not support non-square resolutions (Swin, Swin2, LeViT) or for comparison with these models. All other validations keep the aspect ratio. A difference in resolution limits the comparability of the zero-shot error and the improvement, because these quantities are averages over the pixels of an image and do not take into account the advantage of more details due to a higher resolution.
Best values per column and same validation height in bold

Improvement

The improvement in the above table is defined as the relative zero-shot error with respect to MiDaS v3.0 DPT_L-384 and averaging over the datasets. So, if $\epsilon_d$ is the zero-shot error for dataset $d$, then the $\color{green}{\textsf{improvement}}$ is given by $100(1-(1/6)\sum_d\epsilon_d/\epsilon_{d,\rm{DPT_{L-384}}})$%.

Note that the improvements of 10% for MiDaS v2.0 → v2.1 and 21% for MiDaS v2.1 → v3.0 are not visible from the improvement column (Imp.) in the table but would require an evaluation with respect to MiDaS v2.1 Large₃₈₄ and v2.0 Large₃₈₄ respectively instead of v3.0 DPT_L-384.

Depth map comparison

Zoom in for better visibility

Speed on Camera Feed

Test configuration

Windows 10
11th Gen Intel Core i7-1185G7 3.00GHz
16GB RAM
Camera resolution 640x480
openvino_midas_v21_small_256

Speed: 22 FPS

Applications

MiDaS is used in the following other projects from Intel Labs:

ZoeDepth (code available here): MiDaS computes the relative depth map given an image. For metric depth estimation, ZoeDepth can be used, which combines MiDaS with a metric depth binning module appended to the decoder.
LDM3D (Hugging Face model available here): LDM3D is an extension of vanilla stable diffusion designed to generate joint image and depth data from a text prompt. The depth maps used for supervision when training LDM3D have been computed using MiDaS.

Changelog

[Dec 2022] Released MiDaS v3.1:
- New models based on 5 different types of transformers (BEiT, Swin2, Swin, Next-ViT, LeViT)
- Training datasets extended from 10 to 12, including also KITTI and NYU Depth V2 using BTS split
- Best model, BEiT_{Large 512}, with resolution 512x512, is on average about 28% more accurate than MiDaS v3.0
- Integrated live depth estimation from camera feed
[Sep 2021] Integrated to Huggingface Spaces with Gradio. See Gradio Web Demo.
[Apr 2021] Released MiDaS v3.0:
- New models based on Dense Prediction Transformers are on average 21% more accurate than MiDaS v2.1
- Additional models can be found here
[Nov 2020] Released MiDaS v2.1:
- New model that was trained on 10 datasets and is on average about 10% more accurate than MiDaS v2.0
- New light-weight model that achieves real-time performance on mobile platforms.
- Sample applications for iOS and Android
- ROS package for easy deployment on robots
[Jul 2020] Added TensorFlow and ONNX code. Added online demo.
[Dec 2019] Released new version of MiDaS - the new model is significantly more accurate and robust
[Jul 2019] Initial release of MiDaS (Link)

Citation

Please cite our paper if you use this code or any of the models:

@ARTICLE {Ranftl2022,
    author  = "Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun",
    title   = "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer",
    journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
    year    = "2022",
    volume  = "44",
    number  = "3"
}

If you use a DPT-based model, please also cite:

@article{Ranftl2021,
	author    = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
	title     = {Vision Transformers for Dense Prediction},
	journal   = {ICCV},
	year      = {2021},
}

Please cite the technical report for MiDaS 3.1 models:

@article{birkl2023midas,
      title={MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation},
      author={Reiner Birkl and Diana Wofk and Matthias M{\"u}ller},
      journal={arXiv preprint arXiv:2307.14460},
      year={2023}
}

For ZoeDepth, please use

@article{bhat2023zoedepth,
  title={Zoedepth: Zero-shot transfer by combining relative and metric depth},
  author={Bhat, Shariq Farooq and Birkl, Reiner and Wofk, Diana and Wonka, Peter and M{\"u}ller, Matthias},
  journal={arXiv preprint arXiv:2302.12288},
  year={2023}
}

and for LDM3D

@article{stan2023ldm3d,
  title={LDM3D: Latent Diffusion Model for 3D},
  author={Stan, Gabriela Ben Melech and Wofk, Diana and Fox, Scottie and Redden, Alex and Saxton, Will and Yu, Jean and Aflalo, Estelle and Tseng, Shao-Yen and Nonato, Fabio and Muller, Matthias and others},
  journal={arXiv preprint arXiv:2305.10853},
  year={2023}
}

Acknowledgements

Our work builds on and uses code from timm and Next-ViT. We'd like to thank the authors for making these libraries available.

License

MIT License

midas's People

Contributors

Stargazers

Watchers

Forkers

leonzfa umariqb mtlong dmechea dstarer lbawmy bingjietang songya perrywu1989 zebrajack deblinaml romanovmikev peterzs liuguoyou southwestjiaotonguniversity yinxuping sharypovandrey surfndez whigg peterzhousz stalin18 anthonydickson roxanneluo m3at ideaplexus filippoaleotti jchetboun goudanstar raghavnagpal safijari klmj3214 jjandnn zherenx confidentjun mrkulk adrianmargin jiatianwu riwaly tuskaw ack4 miangoleh chomolungma alexeyab wolflion521 trevol vxgu86 luiz-monad peternara freedomtan zigchang kopetri aic5 satoshirobatofujimoto deepandy huynhlam assassindesign songhwanjun instaboi slamplus anthonyrs06 daolinma yangtong1989 kuersatp ccj5351 wang-kx cock-puncher n1ckfg namseungyoon roar-robotics cymkd nikshrimali sarvan0506 tamwaiban anubhabpanda viniciusguigo firekind jadewang123 gochuchamchi colorfulcloud nagapavan525 mfkiwl apollokit nmeva navidsx xubin1994 gaopeng91 hs0805 phamhieu lulu1315 ak9250 zeta1999 joyce725 izzetemre droter hzy5000 disorn-inc bediryilmaz mixedworld abecadel jlocke33

midas's Issues

how to train the groudtrue disparity from PWC

Hello,

Thanks for releasing the code.What an amazing project you did!
There I have some questions. I can not get perfect groundtrue disparity maps as you did. I hope to have your help.

how to modify the pwc-net code. replace the 2dcorrealtion with 1dcorrlation or not?
train the pwc-net code in supervised mode or unsupervised mode?
if train pwc-net code in unsupervised mode, what is the unsupervised loss you used?
can you release the trained model of your pwc-net?
Thanks very much.

Perry

Question: what will be new in v2?

I see the v2 pre-release, but it's the same code and the same pretrained model as before. What new stuff to expect?

What does run.py script return?

Hi! I am trying to get your repository working in a simple inference mode to be able to estimate the quality on NYUv2 dataset. As far as I understand, your run.py script returns inverse logarithm of depth scaled by some coefficient. Am I right? (At least that gives the best metrics, though I saw that you said that you predict the inverse depth)

Also, I have yet another question: you take as a backbone ResNet network. But, as far as I understand, you use it for the unnormalized images (i.e, images that do not have 0 mean intensities and unit stds), while the ResNet was trained on normalized images. Is this right and why do you do that?

How to covert RedWeb dataset label to disparity [0, 1]?

Hi,
In redweb dataset, the label is gived by a png file. nearest object depth is 0 and background depth is 255 (sky or something).
How to convert it to disparity [0, 1] as suggested in the paper?
What about other dataset like MegaDepth?

Is my code correct?

eps = 0.1
label = cv2.imread(label_path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
sky_mask = (label == 255)
disparity = 1 / (label + eps)
disparity[sky_mask] = 0
disparity= (disparity - disparity.min()) / (disparity.max() - disparity.min())

Pytorch Errors

I got this error after running MiDaS 2.1

initialize
device: cpu
Loading weights: model-f6b98070.pt
Using cache found in C:\Users\gregb/.cache\torch\hub\facebookresearch_WSL-Images_master
Traceback (most recent call last):
File "run.py", line 151, in
run(args.input_path, args.output_path, args.model_weights, args.model_type, args.optimize)
File "run.py", line 32, in run
model = MidasNet(model_path, non_negative=True)
File "C:\Users\gregb\Documents\Python\MiDaS\midas\midas_net.py", line 47, in init
self.load(path)
File "C:\Users\gregb\Documents\Python\MiDaS\midas\base_model.py", line 11, in load
parameters = torch.load(path, map_location=torch.device('cpu'))
File "C:\Users\gregb\anaconda3\envs\3DP\lib\site-packages\torch\serialization.py", line 527, in load
with _open_zipfile_reader(f) as opened_zipfile:
File "C:\Users\gregb\anaconda3\envs\3DP\lib\site-packages\torch\serialization.py", line 224, in init
super(_open_zipfile_reader, self).init(torch.C.PyTorchFileReader(name_or_buffer))
RuntimeError: version <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at ..\caffe2\serialize\inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at ..\caffe2\serialize\inline_container.cc:132)
(no backtrace available)

Will you release the dataset?

Great job! I wonder if the dataset will be released. It would help me a lot. Thanks very much.

About getting results in meters unit

@ranftlr Thank you for the work. I'm trying to apply it with Myriad X VPU.
So I would like to ask whether the unknown scale and shift mentioned in #36 are linear parameters?
For example, in each frame, I can find a linear equation like "P = D * scale + shift" to project the values of depth maps "D" to the physical absolute measurements "P" according to putting a known scale ruler in the view, right ?

Can't run script from another folder

when running Midas from another folder:

python ../MiDaS/run.py 
initialize
device: cuda
Loading weights:  model-f46da743.pt
Using cache found in /home/3dsf/.cache/torch/hub/facebookresearch_WSL-Images_master
Traceback (most recent call last):
  File "../MiDaS/run.py", line 105, in <module>
    run(INPUT_PATH, OUTPUT_PATH, MODEL_PATH)
  File "../MiDaS/run.py", line 29, in run
    model = MidasNet(model_path, non_negative=True)
  File "/home/3dsf/MiDaS/midas/midas_net.py", line 47, in __init__
    self.load(path)
  File "/home/3dsf/MiDaS/midas/base_model.py", line 11, in load
    parameters = torch.load(path)
  File "/home/3dsf/MiDaS/envs/lib/python3.7/site-packages/torch/serialization.py", line 381, in load
    f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'model-f46da743.pt'

This could be a feature, I guess. Anyways, great job, I've tested it several times and here is a magicEye Video made using MiDas

The code of scale- and shift- invariant loss

Hi! Thanks for your excellent paper!
I want to repeat your training procedure, but there's no ssi-loss implementation here.
Can I ask for your pytorch implementation for scale- and shift-invariant loss?

How could I convert it to onnx model?

Trying to convert the model to onnx model, but got error

File "to_onnx.py", line 72, in
export_model(model, img_input, export_model_name)
File "to_onnx.py", line 30, in export_model
torch.onnx.export(model, input, export_model_name, verbose=False, export_params=True, opset_version=11)
File "C:\Users\yyyy\Anaconda3\envs\torchreid\lib\site-packages\torch\onnx_init_.py", line 148, in export
strip_doc_string, dynamic_axes, keep_initializers_as_inputs)
File "C:\Users\yyyy\Anaconda3\envs\torchreid\lib\site-packages\torch\onnx\utils.py", line 66, in export
dynamic_axes=dynamic_axes, keep_initializers_as_inputs=keep_initializers_as_inputs)
File "C:\Users\yyyy\Anaconda3\envs\torchreid\lib\site-packages\torch\onnx\utils.py", line 416, in _export
fixed_batch_size=fixed_batch_size)
File "C:\Users\yyyy\Anaconda3\envs\torchreid\lib\site-packages\torch\onnx\utils.py", line 279, in _model_to_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args, training)
File "C:\Users\yyyy\Anaconda3\envs\torchreid\lib\site-packages\torch\onnx\utils.py", line 236, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(model, args, _force_outplace=True, return_inputs_states=True)
File "C:\Users\yyyy\Anaconda3\envs\torchreid\lib\site-packages\torch\jit_init.py", line 277, in _get_trace_graph
outs = ONNXTracedModule(f, _force_outplace, return_inputs, return_inputs_states)(*args, **kwargs)
File "C:\Users\yyyy\Anaconda3\envs\torchreid\lib\site-packages\torch\nn\modules\module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "C:\Users\yyyy\Anaconda3\envs\torchreid\lib\site-packages\torch\jit_init.py", line 332, in forward
in_vars, in_desc = _flatten(args)
RuntimeError: Only tuples, lists and Variables supported as JIT inputs/outputs. Dictionaries and strings are also accepted but their usage is not recommended. But got unsupported type numpy.ndarray

to_onnx.py

import os
import glob
import torch
import utils
import cv2

from torchvision.transforms import Compose
from models.midas_net import MidasNet
from models.transforms import Resize, NormalizeImage, PrepareForNet

import onnx
import onnxruntime

def test_model_accuracy(export_model_name, raw_output, input):    
    ort_session = onnxruntime.InferenceSession(export_model_name)

    def to_numpy(tensor):
        return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

    # compute ONNX Runtime output prediction
    ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(input)}
    ort_outs = ort_session.run(None, ort_inputs)	

    # compare ONNX Runtime and PyTorch results
    np.testing.assert_allclose(to_numpy(raw_output), ort_outs[0], rtol=1e-03, atol=1e-05)

    print("Exported model has been tested with ONNXRuntime, and the result looks good!")		

def export_model(model, input, export_model_name):
    torch.onnx.export(model, input, export_model_name, verbose=False, export_params=True, opset_version=11)	
    onnx_model = onnx.load(export_model_name)    
    onnx.checker.check_model(onnx_model)
    graph_output = onnx.helper.printable_graph(onnx_model.graph)
    with open("graph_output.txt", mode="w") as fout:
        fout.write(graph_output)
		
device = torch.device("cpu")

 # load network
model_path = "model.pt"
model = MidasNet(model_path, non_negative=True)

transform = Compose(
        [
            Resize(
                384,
                384,
                resize_target=None,
                keep_aspect_ratio=True,
                ensure_multiple_of=32,
                resize_method="lower_bound",
                image_interpolation_method=cv2.INTER_CUBIC,
            ),
            NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
            PrepareForNet(),
        ]
)

model.to(device)
model.eval()

img = utils.read_image("input/line_up_00.jpg")
img_input = transform({"image": img})["image"]

# compute
#with torch.no_grad():
sample = torch.from_numpy(img_input).to(device).unsqueeze(0)
print("sample type = ", type(sample), ", shape of sample = ", sample.shape)
print(sample)	
prediction = model.forward(sample)
export_model_name = "midas.onnx"	
export_model(model, img_input, export_model_name)

Environment:

pytorch 1.4.0(installed by anaconda)
os is windows 10 64bits

Slow image transformation

Hej,
I dont think this is a issue, sorry for posting like this.
But the image that goes through you model is really slow. Do you have a method for speeding it up?
Sorry again for posting it as a issue, but dont know how else to make contact.

Nice! Would you release your implementation of the loss function?

I'm working on a relative article that use your great work, and would like to use your loss function.

Thanks in advance!

What kind of activation function do you use during training?

Hi,
I see that relu is used during evaluation(this repo), What about training?
other question is that we do not include invalid pixels such as the sky when calculating the loss.
How can we ensure that the network outputs the correct depth (zero or negative) for the background such as the sky?

Loss implementations

Hi,

I read in #43 that you did not plan on releasing the training code. Can you still share the implementations you used for the losses?

Thanks!

How to

Camera Intrinsics

Here is a visualization of the combined rgb image taken with my phone and depth inferred by MiDaS.

I used the intrinsic values from the phone camera but I'm not sure that makes sense to do since the depth projects out to much greater than the ground truth.

Is the depth just relative within the image or should I expect to be able to achieve a realistic rough depth value?

How to generate the training data?

Hi,

Could you release your script to generate the training data?

Fine tuning

Hi, i want to fine tune your model for some specific purposes. I will be very grateful, if you provide your train scripts with metrics or if you give a peace of advice about what part of model I can successfully fine tune. Thanks!

FileNotFoundError: [Errno 2] No such file or directory: 'MiDaS/model.pt'

Hi, everyone!

I like to see this great product. But I have faced with that issue.
It happens on start after python main.py --config argument.yml
This files really doesn`t exist. So next question, where I can find this file or how I can create it ?

The full console message

(3DP) D:\Projects\Python\3d_photo>python main.py --config argument.yml
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]Current Source ==>  moon
initialize
device: cpu
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 47, in <module>
    config['MiDaS_model_ckpt'], MonoDepthNet, MiDaS_utils, target_w=640)
  File "D:\Projects\Python\3d_photo\MiDaS\run.py", line 29, in run_depth
    model = Net(model_path)
  File "D:\Projects\Python\3d_photo\MiDaS\monodepth_net.py", line 52, in __init__
    self.load(path)
  File "D:\Projects\Python\3d_photo\MiDaS\monodepth_net.py", line 88, in load
    parameters = torch.load(path)
  File "D:\Programs\miniconda3\envs\3DP\lib\site-packages\torch\serialization.py", line 525, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "D:\Programs\miniconda3\envs\3DP\lib\site-packages\torch\serialization.py", line 212, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "D:\Programs\miniconda3\envs\3DP\lib\site-packages\torch\serialization.py", line 193, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'MiDaS/model.pt'

Did you fill the holes in the groundtruth before training midasnet?

I tried to reproduce the experimental result of MIDAS V2 but failed. The edge of instances produced by my models is not as clear as the official model.

I figure that because of the procedure of getting groundtruth depth, in these 6 datasets there are a lot of areas that are masked out when calculating the loss. Most of these areas covers the edge of objects or scenes. I thought this might be the reason.

So did you preprocessed in the groundtruth depth map to fill these holes? If not, how did you deal with these holes that are important to produce depth maps with clear edges? Did you simply masked these holes out?

Thanks.

What does the predicted depth signify?

A "prediction" gives the following:

[[2496.0127 2495.973 2495.9888 ... 855.7698 855.57666 856.0468 ]
[2495.9575 2495.9158 2495.9329 ... 855.4036 855.20917 855.68256]
[2495.9797 2495.9387 2495.9556 ... 855.55426 855.36035 855.83234]
...
[3245.7551 3245.7756 3245.7664 ... 2852.5774 2852.4922 2852.702 ]
[3245.7275 3245.7478 3245.739 ... 2852.4827 2852.397 2852.6072 ]
[3245.7974 3245.8179 3245.809 ... 2852.7156 2852.6309 2852.8398 ]]

What are the units of these numbers m, mm, ft? Of course those numbers aren't disparities (since the images aren't that wide). So what do these numbers represent? How to convert this prediction to actual depth given camera intrinsics?
Thanks

Input size during inference

Hello!

You write that during training images are "randomly cropped and resized to 384×384". If we leave augmentation aside, images are essentially resized so that the shorter dimension becomes 384. However, during inference we resize so that the longer dimension is 384.

For example, suppose all the inputs are 1,280x720. During training, if we do the central crop, then the image is downsampled by the factor 720 / 384. If the square crop is not 720x720, but rather, say, 672x672, then the image is downsampled by the factor 672 / 384. During inference, however, we take the 1,280x720 image and downsample it by the much higher factor 1280 / 384, so it becomes 384x224 (there's a tiny distortion because the shorter dimension must be divisible by 32). Instead, we could go for something like 672x384 and get input features approximately of the same size as during training.

I don't know how important this is, but I would much appreciate if you could shed some light on the reasoning behind your choice.

Training code

Do you plan to release your training code sometime in the future? It would be really helpful to advance the research on monocular depth estimation!

If not, can you explain how the Pareto optimatility is ensured during training? It seems like there will also have to be an undo step in the training pipeline such that whenever the Pareto optimum is reached and the next backpropagation update disturbs this state, this update will have to be reversed.

MiDaS-v2 was trained using the median + MAE

Thanks for your work! In the paper proposeMiDaS-v2 was trained using the median + MAD,What's special about the implementation details?I implementationit but the training result is terrible than using the former loss.( I remove the relu at the end
of the net )Maybe I make some mistakes.Could you release your implementation of the loss function?

Non-cuda compatibility

Please add a switch for running the code without CUDA. It's being a real pain to refactor the code to run it on my MacBook.

Scale of model outputs

Hi! Great work! My question is about model outputs. You mentioned that you learn disparity values which are shifted in [0,1] range. However, the inference values of ~10000. Where does this differnce come from?

Will you release the script for processing the movies?

The part described in Supplementary Material A. 3D Movies Dataset.

How important is a consistent loss to the generalizability?

Do you think the same multi-objective training with different loss functions would yield the same type of generalizable performance. In MiDaS, all training was directly supervised with the same loss function, regardless of dataset. Have you considered or evaluated what the impact of blending different losses would be? For example, take a SSIM loss like Monodepth for training on stereo image pairs or video frames while also training on the datasets you analyzed in this work. From the original paper, it seems like it should work (they evaluate in a truly multi-task setting), but I am curious if the cross-dataset generalizability would hold up as well.

Unused weights?

Are there unused weights (resConfUnit1) in the first block of the decoder?

https://github.com/intel-isl/MiDaS/blob/ad9c260d7e5ac713fce6554729155d93142ed9b2/models/blocks.py#L144

output range of network is not consistent with report in the paper?

Hi, thx for your great work!
I'm running your code, but network output ranges from 0 to 5000 which is not same as what you've mentioned in the paper.
Could you please tell me if I miss some details?

why use inverse depth ?

Hi, @dvdhfnr I can't understand why use inverse depth has difference with using original depth? or the inverse depth has some unique process? thanks~

Function loss or training code

Hi!

Thanks for sharing the source code, this is a great contribution!
I have searched in the repository, but I didn't find the loss function that you have described in the paper (considering the normalization procedure for disparity maps).
Do you have plans to share it, or even to share the training code?

Thanks.

Wrong predictions?

I did a test run with a book on a flat surface. On the output the background depth isn't uniform. Is there something I'm missing?

How to get the mask to input the loss function

Hi,

I want to ask how I can get the mask to input the loss function? Cloud you release the training code and data generation script?

Thank you!

Pre-trained model on Pytorch hub not updated

Loading the pre-trained model from Pytorch hub tries to use this link github.com/intel-isl/MiDaS/releases/download/v2_1/model-f6b98070.pt and fails. This file does not exist as the model file seems to have been updated. Please make relevant changes so that the model can be loaded from Pytorch hub.

How can I visualize the point cloud?

Hi,

Thanks for your great work. I want to know how I can visualize the point cloud from the depth image by Open3D. According the Camera Intrinsics, I don't know the content of the intrinsics.json file. Could you give me an example of this file?
Another question, when do you plan to release the scripts that used to produce the data set?

Thank you very much!

Scale and shift in inference

Hello!

Suppose during inference you get values in some interval [a,b]. Then for visualization you scale and shift them into some region, say [0, 1]. Now, there are two ways to do this: (x - a) / (b - a) and (b - x) / (b - a). Naturally, you choose the first way; however, I do not actually see how we are guaranteed that the scale must be positive.

Looking at your loss function: https://gist.github.com/dvdhfnr/732c26b61a0e63a0abc8a5d769dbebd0 - you just use least squares and can easily get positive or negative scale.

Since the network has to learn ordinal relationships (near vs. far), it seems intuitive that the scale would be positive for all images if it is positive for some; however, I am not sure we are guaranteed even that. Or is it something you did during training?

Also, I see that in the paper propose using the median and mean absolute deviation instead. So, what did you end up using?

Could you share the script or some pseudocode for the loss computation

I am not clear on the implementation of the multi-objective loss in your paper. Can you give me some details？

Loss functions

Hi, the loss functions when training midas are very simple, i.e., ptrim(l1) and gradient loss. Have you tried other loss functions like normal loss or BerHu? Or have you tried these loss functions but they didn't work well?

Thanks.

How to get the clear edge

Hi，thank you for your work！
I trained the model following your job. I found the released model has more clear object-edges. Which part does benefit the clear edges?
Thanks in advance!

Great improvement. Training code release?

The new model generates a more accurate depthmap than its previous version. Can you share the change you made? Also, is there a timeline for releasing the training code and the training data processing pipeline?

Depth in float32 in meters units

Hello! Thanks for you work!

I have two questions:

What is the .pmf format and what is it used for?
While opening .png depth maps how to convert them into float32 in meters units?

KITTI Numbers

The KITTI set reported in the paper is said to have 161 images.
(from Supplementary Material Section C,

"For KITTI we used the intersection of the official validation set for depth estimation (with improved ground-truth depth [69]) and the Eigen test split [60] (161 images)".)

I assume that is a mix of Depth Benchmark Val and Eigen Test. These seem to have only 145 common images. Could you please point me to the exact datasets that were used?

Here's the code I wrote to find the intersection

#!/usr/bin/python
import re

benchmark_val_file = "splits/benchmark/val_files.txt"
eigen_test_file = "splits/eigen/test_files.txt"

zfill=True

with open(benchmark_val_file, 'r') as f:
    val_set= set()
    for line in f.readlines():
        dir_name, img_num = line.split()[:2]
        if zfill:
            val_set.add((dir_name, img_num.zfill(10)))
        else:
            val_set.add((dir_name, img_num))

with open(eigen_test_file, 'r') as f:
    eigen_set = set() 
    for line in f.readlines():
        dir_name, img_num = line.split()[:2]
        eigen_set.add((dir_name, img_num))

print(len(val_set.intersection(eigen_set)))

how do I get the real depth value in inference mode?

may I know how to get the real depth value in inference mode? such as with the unit meter.

No such file or directory: 'model-f46da743.pt'

Hello

I got this error. Did I put the .pt file in wrong location? (see attachment)

initialize
device: cuda
Loading weights: model-f46da743.pt
Using cache found in C:\Users\gregb/.cache\torch\hub\facebookresearch_WSL-Images_master
Traceback (most recent call last):
File "run.py", line 105, in
run(INPUT_PATH, OUTPUT_PATH, MODEL_PATH)
File "run.py", line 29, in run
model = MidasNet(model_path, non_negative=True)
File "C:\Users\gregb\Documents\Python\MiDaS\midas\midas_net.py", line 47, in init
self.load(path)
File "C:\Users\gregb\Documents\Python\MiDaS\midas\base_model.py", line 11, in load
parameters = torch.load(path)
File "C:\Users\gregb\anaconda3\envs\3DP\lib\site-packages\torch\serialization.py", line 525, in load
with _open_file_like(f, 'rb') as opened_file:
File "C:\Users\gregb\anaconda3\envs\3DP\lib\site-packages\torch\serialization.py", line 212, in _open_file_like
return _open_file(name_or_buffer, mode)
File "C:\Users\gregb\anaconda3\envs\3DP\lib\site-packages\torch\serialization.py", line 193, in init
super(_open_file, self).init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'model-f46da743.pt'

Testing code

Hi,
I'm having some problems in obtaining your same results at testing time. Could you share also an example of a script (e.g, NYU or TUM would be great) to test your network, please?

how much gpus did you use?

Hi, since the MidasNet is a very large model, how much gpus did you use and how long did it take to train the model? Since the batchsize is not larget ( 8 for each dataset), would multi-gpu training hurt the performance? Since there are a lot of batch normalization layers in the encoder.

Thanks.

Depth map estimated distance

Hello, the project I am working on requires a monocular camera to estimate the distance of the object. The general method is to obtain the depth value from the depth camera and then calculate the distance according to the camera_factor. But I did not see this related variable in the paper. I know if I can use the depth generated by this network to estimate the distance. I am a student who has just learned knowledge in this field. Hope to get help. Thank you

Measure distance by the depth map

How could I measure the distance by the output depth map? What is the unit of it?
From test_simple.py,

# PREDICTION
input_image = input_image.to(device)
features = encoder(input_image)
outputs = depth_decoder(features)

disp = outputs[("disp", 0)]

How could I measure the absolute distance of each pixels by the disp(disp mean disparity, or it is depth?) tensor?Thanks

Got unexpected keyword argument "groups"

python3 run.py
initialize
device: cuda
Loading weights: model-f46da743.pt
Downloading: "https://github.com/facebookresearch/WSL-Images/archive/master.zip" to /home/prikshet/.cache/torch/hub/master.zip
Traceback (most recent call last):
File "run.py", line 105, in
run(INPUT_PATH, OUTPUT_PATH, MODEL_PATH)
File "run.py", line 29, in run
model = MidasNet(model_path, non_negative=True)
File "/home/prikshet/midas/midas/midas_net.py", line 30, in init
self.pretrained, self.scratch = _make_encoder(features, use_pretrained)
File "/home/prikshet/midas/midas/blocks.py", line 6, in _make_encoder
pretrained = _make_pretrained_resnext101_wsl(use_pretrained)
File "/home/prikshet/midas/midas/blocks.py", line 26, in _make_pretrained_resnext101_wsl
resnet = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl")
File "/usr/local/lib/python3.6/dist-packages/torch/hub.py", line 354, in load
model = entry(*args, **kwargs)
File "/home/prikshet/.cache/torch/hub/facebookresearch_WSL-Images_master/hubconf.py", line 39, in resnext101_32x8d_wsl
return _resnext('resnext101_32x8d', Bottleneck, [3, 4, 23, 3], True, progress, **kwargs)
File "/home/prikshet/.cache/torch/hub/facebookresearch_WSL-Images_master/hubconf.py", line 23, in _resnext
model = ResNet(block, layers, **kwargs)
TypeError: init() got an unexpected keyword argument 'groups'

isl-org / midas Goto Github PK

midas's Introduction

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

Setup

optional

Usage

optional

via Camera

via Docker

via PyTorch Hub

via TensorFlow or ONNX

via Mobile (iOS / Android)

via ROS1 (Robot Operating System)

Accuracy

Improvement

Depth map comparison

Speed on Camera Feed

Applications

Changelog

Citation

Acknowledgements

License

midas's People

Contributors

Stargazers

Watchers

Forkers

midas's Issues

Recommend Projects

Recommend Topics

Recommend Org