This repository contains inference code for models trained on the Mannequin Challenge dataset introduced in the CVPR 2019 paper "Learning the Depths of Moving People by Watching Frozen People."
This is not an officially supported Google product.
The code is based on PyTorch and has been tested with PyTorch 1.1 and Python 3.6.
We recommend setting up a virtualenv environment for installing PyTorch and the other necessary Python packages. The TensorFlow installation guide may be helpful (follow steps 1 and 2), or follow the virtualenv documentation.
Once your environment is set up and activated, install the necessary packages:
(pytorch)$ pip install torch torchvision scikit-image h5py
The model checkpoints are stored on Google Cloud and may be retrieved by running:
(pytorch)$ ./fetch_checkpoints.sh
Our test set for single-view inference is the DAVIS 2016 dataset. Download and unzip it by running:
(pytorch)$ ./fetch_davis_data.sh
Then run the DAVIS inference script:
(pytorch)$ python test_davis_videos.py --input=single_view
Once the run completes, visualizations of the output should be available in test_data/viz_predictions.
The full model described in the paper requires several additional inputs: the human segmentation mask, the depth-from-parallax buffer, and (optionally) a human keypoint buffer. We provide a preprocessed version of the TUM RGBD dataset that includes these inputs. Download (~9GB) and unzip it using the script:
(pytorch)$ ./fetch_tum_data.sh
To reproduce the numbers in Table 2 of the paper, run:
(pytorch)$ python test_tum.py --input=single_view
(pytorch)$ python test_tum.py --input=two_view
(pytorch)$ python test_tum.py --input=two_view_k
where single_view is the variant I from the paper, two_view is the variant IDCM, and two_view_k is the variant IDCMK. The script prints running averages of the various error metrics as it runs. When the script completes, the final error metrics are shown.
If you find the code or results useful, please cite the following paper:
@inproceedings{li2019learning,
title={Learning the Depths of Moving People by Watching Frozen People},
author={Li, Zhengqi and Dekel, Tali and Cole, Forrester and Tucker, Richard
and Snavely, Noah and Liu, Ce and Freeman, William T},
booktitle={Proc. Computer Vision and Pattern Recognition (CVPR)},
year={2019}
}
mannequinchallenge's Issues
Clarifications on training pipeline
Hi guys! Congrats on the great work.
I have been trying to implement the single-view training pipeline, following the details in the paper's supplemental material, and I have a few questions regarding the implementation:
- In the scale-invariant MSE loss (equation 8), the second term is divided by N. Shouldn't that be N^2? Is this just a typo? (See the sketch after this list.)
- During training you state that you normalize the GT log depths by subtracting a random value from the 40-60th percentile. So you don't actually "normalise" but rather center the map around 0. Since the losses take into consideration the relative distance between pixel pairs, how does this normalisation affect performance? Moreover, I have generated the MC dataset following the instructions in #14, and I have noticed that the absolute and log values of the GT depth maps (usually in the range [5-50]) are significantly larger than those of the depth maps generated from your single-view pretrained model (usually in the range [0.2-1.5]). Is there some other kind of normalisation that you also perform?
- Finally, regarding the "paired" MSE loss, do you actually compute the distances between all possible pairs, or do you do some sub-sampling? This can become really computationally intensive, even for small resolutions: there are potentially n*(n-1)/2 possible pairs for an image with n pixels.
Thanks!
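For reference when discussing equation 8, here is a minimal sketch of an Eigen-style scale-invariant log-depth loss with the second term divided by N^2, as the question suggests. The tensor names (pred_log, gt_log, mask) are assumptions for illustration, not the repo's own variables.

```python
import torch

def scale_invariant_loss(pred_log, gt_log, mask):
    """Scale-invariant MSE on log depths; mask marks valid pixels."""
    d = (pred_log - gt_log) * mask      # per-pixel log-depth residuals, invalid pixels zeroed
    n = mask.sum().clamp(min=1)         # number of valid pixels
    return (d ** 2).sum() / n - (d.sum() ** 2) / (n ** 2)

# Toy usage with random tensors.
p = torch.rand(1, 1, 8, 8).log()
g = torch.rand(1, 1, 8, 8).log()
m = torch.ones_like(p)
print(scale_invariant_loss(p, g, m))
```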
Thank you for your work! When will you provide the training code?
Code to generate depth maps from YouTube videos
Hi! Thanks for the work! Will you release the code to generate depth maps from YouTube videos?
How to generate HDF5 data?
Hi! Thanks for your work! What if I have a dataset that includes RGB, depth, and human masks? How can I generate a dataset similar to your TUM HDF5 dataset?
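As a rough sketch (assuming h5py, which the setup instructions already install), the arrays could be packed into an HDF5 file like this. The dataset names below are placeholders and would need to match whatever keys this repo's loader actually expects.

```python
import h5py
import numpy as np

# Placeholder arrays standing in for real RGB, depth, and human-mask data.
rgb   = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.zeros((480, 640), dtype=np.float32)
mask  = np.zeros((480, 640), dtype=np.uint8)

with h5py.File('my_sequence.h5', 'w') as f:
    f.create_dataset('rgb', data=rgb, compression='gzip')
    f.create_dataset('depth', data=depth, compression='gzip')
    f.create_dataset('human_mask', data=mask, compression='gzip')
```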
Will you share the MannequinChallenge dataset?
I also wonder about the details of training the 3-input model.
training data
Hi! Thanks for the work! Will you release the training data (the 170K valid image-depth pairs)?
Would you like to share the initial depth estimation methods?
Hello, would you like to release the code for the initial estimation of the depth map? It would help us evaluate the method on other datasets (for example, the KITTI dataset). Thank you very much.
Colmap Patch matching settings
Hi,
I am trying to replicate your training method, but my results from COLMAP look denser (though also noisier) than yours (example at the end). I assume this is caused by a different patch-matching setting (I am just using the defaults) and not only by the MegaDepth post-processing steps. Could you provide further details about the hyperparameters you used?
Thanks!
Images Downscaled
Hey, great work you all did! I really like it!
I have a problem though. After running python test_davis_videos.py --input=single_view, the result is downscaled to 288x512 (or 288x1024 with the original and the result next to each other). But the original has a much higher resolution!
I found resized_width and resized_height in the DAVISImageFolder class, but adjusting these made me run out of memory (CUDA) immediately. I have a GTX 1070.
Is it really that intense? Or am I doing something wrong?
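One possible workaround, sketched here under stated assumptions rather than as the repo's own approach: keep the network input at 288x512 and upsample the predicted depth map back to the original frame size afterwards. pred and the target resolution below are stand-ins.

```python
import torch
import torch.nn.functional as F

pred = torch.rand(1, 1, 288, 512)          # stand-in for a predicted depth map
orig_h, orig_w = 1080, 1920                # example original frame resolution
pred_full = F.interpolate(pred, size=(orig_h, orig_w),
                          mode='bilinear', align_corners=False)
print(pred_full.shape)                     # torch.Size([1, 1, 1080, 1920])
```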
typo in instruction
I guess "single-view" should be "single_view" in your below instruction.
(pytorch)$ python test_davis_videos.py --input=single-view
Inference Performance
Hi
Thanks for sharing this promising work. I am planning to run this locally once I have gotten past all the install/config hiccups. I did not see performance data in the paper, the blog, or this repo, but I may have missed it. I will share my results, but it would be nice if there were a list of performance numbers for different hardware specs.
In the meantime, do you have some performance benchmarks to share?
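Since no official benchmarks appear to be published, here is a minimal sketch of how one could time the forward pass on their own hardware. The stand-in network and input size are assumptions, not the repo's model.

```python
import time
import torch

def benchmark(model, sample, n_runs=20):
    """Average forward-pass time in seconds over n_runs (after one warm-up)."""
    model.eval()
    with torch.no_grad():
        model(sample)                          # warm-up pass
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_runs):
            model(sample)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.time() - start) / n_runs

# Example with a stand-in network and a DAVIS-sized input (288x512).
net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
print(benchmark(net, torch.rand(1, 3, 288, 512)))
```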
Some of the videos in the dataset have become unavailable
Some videos, for example https://www.youtube.com/watch?v=UP6dBjsmg3U, are unavailable because the user has removed the video, making it impossible to fully reproduce the original dataset and results. Would you be willing to share the full original videos or the processed sequences?
RuntimeError: unexpected EOF, expected 109330 more bytes. The file might be corrupted.
When I run python test_tum.py --input=two_view_k, I get an error like this:
batchSize: 8
beta1: 0.5
checkpoints_dir: ./checkpoints/
continue_train: False
display_freq: 100
display_id: 1
display_winsize: 256
fineSize: 256
gpu_ids: [0, 1, 2, 3]
human_data_term: 0
identity: 0.0
input: two_view_k
isTrain: True
lambda_A: 10.0
lambda_B: 10.0
loadSize: 286
lr: 0.0004
lr_decay_epoch: 8
lr_policy: step
max_dataset_size: inf
mode: Ours_Bilinear
model: pix2pix
nThreads: 2
name: test_local
ndf: 64
ngf: 64
niter: 100
niter_decay: 100
no_flip: False
no_html: False
no_lsgan: False
norm: instance
output_nc: 3
phase: train
pool_size: 50
print_freq: 100
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
simple_keypoints: 0
use_dropout: False
which_epoch: latest
which_model_netG: unet_256
-------------- End ----------------
========================= TUM evaluation #images = 1815 =========
====================================== DIW NETWORK TRAIN FROM Ours_Bilinear=======================
===================Loading Pretrained Model OURS ===================================
Traceback (most recent call last):
File "test_tum.py", line 36, in
model = pix2pix_model.Pix2PixModel(opt, False)
File "D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd\models\pix2pix_model.py", line 96, in init
new_model, 'G', 'best_depth_Ours_Bilinear_inc_7')
File "D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd\models\base_model.py", line 68, in load_network
model = torch.load(save_path)
File "C:\Users\awww_.DESKTOP-82VA5F8\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 386, in load
return load(f, map_location, pickle_module, **pickle_load_args)
File "C:\Users\awww.DESKTOP-82VA5F8\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 580, in _load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 109330 more bytes. The file might be corrupted.
(pytorch) PS D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd>
--
And I checked the file size: it is 8.14 GB in the .tgz. How can I fix this issue? I am on Windows 10 and succeeded with these 2 commands:
(pytorch)$ python test_tum.py --input=single_view
(pytorch)$ python test_tum.py --input=two_view
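An error like this usually means the checkpoint download was truncated, so re-running ./fetch_checkpoints.sh is the simplest thing to try. As a minimal sketch (the path below is only a placeholder; point it at the file named in the traceback), you can check whether a checkpoint deserializes cleanly before launching the full evaluation:

```python
import torch

# Placeholder path: replace with the checkpoint file referenced in the traceback.
path = 'checkpoints/test_local/some_checkpoint.pth'

try:
    torch.load(path, map_location='cpu')
    print('checkpoint deserializes cleanly')
except (RuntimeError, FileNotFoundError) as err:
    print('checkpoint missing or truncated:', err)
```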
How to get the camera intrinsic parameters from the original YouTube wild videos?
I see that you have supplied the camera intrinsics in the dataset, but I do not know how you obtained them when creating the dataset from the original YouTube videos. Could you please explain, or tell me what method you adopted to solve this? Thank you very much!
Camera pose and intrinsics estimation at inference time
Hi, it's mentioned in your paper that ORB-SLAM2 and COLMAP are used to obtain camera poses/intrinsics for the training dataset.
I'm wondering if the same procedure has to be used at inference time. Would there be a more efficient way to obtain camera parameters? Many thanks!
Output depth data format
Hi,
Thanks for sharing the inference code. When the model infers depth from a single image, is the estimated depth in meters? What format is it exactly? Since the ground-truth info is not there, I am not able to figure this out directly. Thank you.
Could I get more detail about object insertion from depth-based visual effect part?
Hi, I am very interested in how you generate the object insertion and ground reconstruction shown in Figure 7(c). Could you give me some pointers or tips on it? Thank you so much! My email address is [email protected].
CPU Support
I'd like to run the inference pass on macOS (Mojave), a system that runs AMD/Radeon rather than NVIDIA/CUDA.
Would there be a quick fix to run the inference pass without GPU acceleration but on CPU resources instead?
Being rather new to PyTorch, I'm starting at the stack trace and looking at my first error in models/pix2pix.py, in the line that declares a CUDA model.
new_model = torch.nn.parallel.DataParallel( new_model.cuda(), device_ids=range(torch.cuda.device_count()))
I can try to make the appropriate changes, but I might need help figuring out what to replace this with.
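A minimal sketch of one way to make that wrapping device-agnostic; new_model below is a stand-in for the network the repo constructs, so this is an illustration rather than the project's own fix:

```python
import torch

new_model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for the repo's network

if torch.cuda.is_available():
    new_model = torch.nn.parallel.DataParallel(
        new_model.cuda(), device_ids=range(torch.cuda.device_count()))
else:
    new_model = new_model.cpu()  # skip DataParallel entirely on CPU

# Checkpoints saved on GPU also need an explicit map_location when loaded on CPU, e.g.:
# state = torch.load(save_path, map_location=torch.device('cpu'))
```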
Inference of the multi-view model - guidance required
OK, so thanks for all this, but if I want to run your advanced model on something other than the preprocessed TUM dataset you provide, using the two views, masks, optical flow, and keypoints, does that mean I have to generate all those inputs in the appropriate formats and cast them properly into the expected HDF5 file structure, all without any specification?
For the RGB images I can manage, but for the flow input, what is the expected format?
For the optional keypoints, what is the expected data format? Can your model support more or fewer keypoints than the ones currently in your dictionary lists?
Reverse-engineering your code is not fun.
In the TUM HDF5 files there are more than 5 data subsets (I expected 2 RGB images, one binary mask image, one RGB flow image, and one vector of keypoint pairs), but there are also some strange low-resolution matrices in the file; I do not know if those are only necessary for training or for inference as well.
Any chance you can give some further guidance on the required inputs and required formats?
Thanks
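As a starting point for reverse-engineering the format, a minimal sketch (assuming h5py and a placeholder path) that lists every dataset, shape, and dtype inside one of the provided TUM HDF5 files:

```python
import h5py

# Placeholder: point at one of the files downloaded by ./fetch_tum_data.sh.
path = 'test_data/tum_hdf5/example.h5'

def show(name, obj):
    # Print only datasets (groups have no shape/dtype).
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File(path, 'r') as f:
    f.visititems(show)
```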
Running on custom videos
Hi Authors,
Is it possible to run the code on custom videos when extrinsic calibration is available?
Regards
Aakash