mannequinchallenge's Introduction

Mannequin Challenge Code and Trained Models

This repository contains inference code for models trained on the Mannequin Challenge dataset introduced in the CVPR 2019 paper "Learning the Depths of Moving People by Watching Frozen People."

This is not an officially supported Google product.

Setup

The code is based on PyTorch and has been tested with PyTorch 1.1 and Python 3.6.

We recommend setting up a virtualenv environment for installing PyTorch and the other necessary Python packages. Steps 1 and 2 of the TensorFlow installation guide may be helpful; alternatively, follow the virtualenv documentation.

Once your environment is set up and activated, install the necessary packages:

(pytorch)$ pip install torch torchvision scikit-image h5py

The model checkpoints are stored on Google Cloud and may be retrieved by running:

(pytorch)$ ./fetch_checkpoints.sh

Single-View Inference

Our test set for single-view inference is the DAVIS 2016 dataset. Download and unzip it by running:

(pytorch)$ ./fetch_davis_data.sh

Then run the DAVIS inference script:

(pytorch)$ python test_davis_videos.py --input=single_view

Once the run completes, visualizations of the output should be available in test_data/viz_predictions.

Full Model Inference

The full model described in the paper requires several additional inputs: the human segmentation mask, the depth-from-parallax buffer, and (optionally) a human keypoint buffer. We provide a preprocessed version of the TUM RGBD dataset that includes these inputs. Download (~9GB) and unzip it using the script:

(pytorch)$ ./fetch_tum_data.sh

To reproduce the numbers in Table 2 of the paper, run:

(pytorch)$ python test_tum.py --input=single_view
(pytorch)$ python test_tum.py --input=two_view
(pytorch)$ python test_tum.py --input=two_view_k

Here single_view corresponds to variant I in the paper, two_view to variant IDCM, and two_view_k to variant IDCMK. The script prints running averages of the error metrics as it runs and shows the final error metrics when it completes.

Acknowledgements

If you find the code or results useful, please cite the following paper:

@inproceedings{li2019learning,
  title={Learning the Depths of Moving People by Watching Frozen People},
  author={Li, Zhengqi and Dekel, Tali and Cole, Forrester and Tucker, Richard
    and Snavely, Noah and Liu, Ce and Freeman, William T},
  booktitle={Proc. Computer Vision and Pattern Recognition (CVPR)},
  year={2019}
}

mannequinchallenge's Issues

RuntimeError: unexpected EOF, expected 109330 more bytes. The file might be corrupted.

When I run python test_tum.py --input=two_view_k, I get the following error:

batchSize: 8
beta1: 0.5
checkpoints_dir: ./checkpoints/
continue_train: False
display_freq: 100
display_id: 1
display_winsize: 256
fineSize: 256
gpu_ids: [0, 1, 2, 3]
human_data_term: 0
identity: 0.0
input: two_view_k
isTrain: True
lambda_A: 10.0
lambda_B: 10.0
loadSize: 286
lr: 0.0004
lr_decay_epoch: 8
lr_policy: step
max_dataset_size: inf
mode: Ours_Bilinear
model: pix2pix
nThreads: 2
name: test_local
ndf: 64
ngf: 64
niter: 100
niter_decay: 100
no_flip: False
no_html: False
no_lsgan: False
norm: instance
output_nc: 3
phase: train
pool_size: 50
print_freq: 100
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
simple_keypoints: 0
use_dropout: False
which_epoch: latest
which_model_netG: unet_256
-------------- End ----------------
========================= TUM evaluation #images = 1815 =========
====================================== DIW NETWORK TRAIN FROM Ours_Bilinear=======================
===================Loading Pretrained Model OURS ===================================
Traceback (most recent call last):
  File "test_tum.py", line 36, in <module>
    model = pix2pix_model.Pix2PixModel(opt, False)
  File "D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd\models\pix2pix_model.py", line 96, in __init__
    new_model, 'G', 'best_depth_Ours_Bilinear_inc_7')
  File "D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd\models\base_model.py", line 68, in load_network
    model = torch.load(save_path)
  File "C:\Users\awww_.DESKTOP-82VA5F8\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 386, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\awww_.DESKTOP-82VA5F8\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 580, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 109330 more bytes. The file might be corrupted.
(pytorch) PS D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd>

--

I checked the file size, and the .tgz is 8.14 GB. How can I fix this issue? I am on Windows 10, and these two commands succeed:

(pytorch)$ python test_tum.py --input=single_view
(pytorch)$ python test_tum.py --input=two_view
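
An "unexpected EOF" from torch.load usually points to a truncated file, so one way to narrow this down is to fingerprint the checkpoint and compare it with a fresh download. The sketch below is only an illustration: the helper name file_fingerprint is hypothetical, and the source does not mention any published checksums, so the comparison is assumed to be between two copies you fetched yourself.

```python
import hashlib
import os
import sys


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`.

    The file is read in chunks so that multi-gigabyte checkpoints or
    archives do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


if __name__ == "__main__":
    # Usage: python fingerprint.py <file> [<other copy of the same file>]
    for candidate in sys.argv[1:]:
        size, sha = file_fingerprint(candidate)
        print(f"{candidate}: {size} bytes, sha256={sha}")
```

If two copies of the same checkpoint differ in size or hash, at least one download or extraction was incomplete.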

Clarifications on training pipeline

Hi guys! Congrats for the great work.

I have been trying to implement the single-view training pipeline, following the details in the paper's supplemental material, and I have a few questions regarding the implementation:

  1. In the scale-invariant MSE loss (equation 8), the second term is divided by N. Shouldn't that be N^2? Is this just a typo?

  2. During training you state that you normalize the GT log depths by subtracting a random value from the 40th-60th percentile. So you don't actually "normalise" but rather center the map around 0. Since the losses take into consideration the relative distance between pixel pairs, how does this normalisation affect performance?
    Moreover, I have generated the MC dataset following the instructions in #14, and I have noticed that the absolute and log values of the GT depth maps (usually in the range [5-50]) are significantly larger than those of the depth maps generated from your single-view pretrained model (usually in the range [0.2-1.5]). Is there some other kind of normalisation that you also perform?

  3. Finally, regarding the "paired" MSE loss, do you actually compute the distances between all possible pairs, or do you do some sub-sampling? This can become really computationally intensive, even for small resolutions: there are potentially n*(n-1)/2 possible pairs for an image with n pixels.

Thanks!
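
For reference on questions 1 to 3, here is a minimal sketch that assumes the loss has the commonly used scale-invariant form on log-depth differences d_i = log(pred_i) - log(gt_i), that is, (1/N) * sum(d_i^2) - (1/N^2) * (sum(d_i))^2. Whether equation 8 of the supplemental material matches this form exactly is not confirmed by the source. Under that assumption, the sum over all pixel pairs collapses to two single sums, and adding a constant to the GT log depths leaves the value unchanged. All function names are hypothetical.

```python
import math
import random


def pairwise_loss(diff):
    """Brute force: average squared difference over all ordered pairs, O(N^2)."""
    n = len(diff)
    total = 0.0
    for a in diff:
        for b in diff:
            total += (a - b) ** 2
    return total / (2 * n * n)


def closed_form_loss(diff):
    """Same quantity in O(N): (1/N) * sum(d^2) - (1/N^2) * (sum(d))^2."""
    n = len(diff)
    sum_sq = sum(d * d for d in diff)
    sum_d = sum(diff)
    return sum_sq / n - (sum_d * sum_d) / (n * n)


def log_diff(log_pred, log_gt, gt_shift=0.0):
    """Per-pixel d_i, optionally subtracting a constant from the GT log depth."""
    return [p - (g - gt_shift) for p, g in zip(log_pred, log_gt)]


random.seed(0)
log_pred = [random.uniform(-1.0, 1.0) for _ in range(50)]
log_gt = [random.uniform(1.5, 4.0) for _ in range(50)]

plain = log_diff(log_pred, log_gt)
centered = log_diff(log_pred, log_gt, gt_shift=2.7)

print(math.isclose(pairwise_loss(plain), closed_form_loss(plain), rel_tol=1e-9))
print(math.isclose(closed_form_loss(plain), closed_form_loss(centered), abs_tol=1e-9))
```

Both checks rely on the identity sum over i,j of (d_i - d_j)^2 = 2N * sum(d_i^2) - 2 * (sum(d_i))^2, which is why the second term carries 1/N^2 in this form and why no explicit enumeration of pairs is needed for it.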

Images Downscaled

Hey, great work you all did! I really like it!
I have a problem though. After running python test_davis_videos.py --input=single_view, the result is downscaled to 288x512 (or 288x1024, with the original and the result next to each other). But the original has a much higher resolution!
I found the resized_width and _height in the DAVISImageFolder class, but adjusting these made me run out of memory (CUDA) immediately. I have a GTX 1070.
Is it really that intense? Or am I doing something wrong?

training data

Hi! Thanks for the work! Will you release the training data (170K valid image-depth pairs)?

Running on custom videos

Hi Authors,

Is it possible to run code on custom videos with extrinsic calibration available?

Regards
Aakash

Inference of the multi-view model - guidance required

OK, thanks for everything, but if I want to run your full model on something other than the preprocessed TUM dataset you provide, using the two views, masks, optical flow, and keypoints, do I have to generate all of those inputs in the appropriate formats and pack them into the expected HDF5 file structure, all without any specification?
For the RGB images, I can manage, but what is the expected format for the flow input?
For the optional keypoints, what is the expected data format? Can your model support more or fewer keypoints than the ones currently in your dictionary lists?
Reverse-engineering your code is not fun.
The TUM HDF5 files contain more than 5 datasets (I expected 2 RGB images, one binary mask image, one RGB flow image, and one vector of keypoint pairs), and there are also some strange low-resolution matrices in the file. I do not know whether those are needed only for training or for inference as well.
Is there any chance you can give some further guidance on the required inputs and their formats?
Thanks

Colmap Patch matching settings

Hi,
I am trying to replicate your training method, but my results from colmap look more dense (though also noisier) than yours (example at the end). I assume this is caused by a different setting on the patch matching (I am just using the default one) and not only the MegaDepth post processing steps. Could you provide further details about the hyperparameters you used?
Thanks!

IggIqNXfu_U_119519400:
image
Yours:
image
Mine:
image

Inference Performance

Hi
Thanks for sharing this promising work. I am planning to run this locally once I have cleared all the install/config hoops. I did not see performance data in the paper, the blog, or this repo, but I may have missed it. I will share my results, but it would be nice to have a list of performance figures for different hardware specs.
In the meantime, do you have some performance benchmarks to share?

Camera pose and intrinsics estimation at inference time

Hi, it's mentioned in your paper that ORB-SLAM2 and COLMAP are used to obtain camera poses/intrinsics for the training dataset.

I'm wondering if the same procedure has to be used in inference stage. Would there be a more efficient way to obtain camera parameters? Many thanks!

Output depth data format

Hi,
Thanks for sharing the inference code. When the model infers depth from a single image, is the estimated depth in meters? What format is it in exactly? Since the ground truth is not available, I am not able to figure this out directly. Thank you.

How to generate HDF5 data?

Hi! Thanks for your work! What if I have a dataset that includes RGB, depth, and human masks? How can I generate a dataset similar to your TUM-HDF5 dataset?
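
As a starting point only, the sketch below writes one RGB/depth/mask sample to an HDF5 file with h5py and lists the datasets in any HDF5 file. The dataset names "rgb", "depth", and "human_mask", and the helper names, are hypothetical: the layout that the repository's loaders actually expect is not documented in the source, so running describe on one of the provided TUM files and mirroring its names, shapes, and dtypes is assumed to be the necessary next step.

```python
import h5py
import numpy as np


def write_sample(path, rgb, depth, human_mask):
    """Store one sample in an HDF5 file (dataset names are placeholders)."""
    with h5py.File(path, "w") as handle:
        handle.create_dataset("rgb", data=rgb, compression="gzip")
        handle.create_dataset("depth", data=depth, compression="gzip")
        handle.create_dataset("human_mask", data=human_mask, compression="gzip")


def describe(path):
    """Return (name, shape, dtype) for every dataset in an HDF5 file."""
    entries = []

    def visit(name, obj):
        # visititems walks groups and datasets; only datasets carry arrays.
        if isinstance(obj, h5py.Dataset):
            entries.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return entries


if __name__ == "__main__":
    height, width = 288, 512
    write_sample(
        "sample.h5",
        rgb=np.zeros((height, width, 3), dtype=np.uint8),
        depth=np.ones((height, width), dtype=np.float32),
        human_mask=np.zeros((height, width), dtype=np.uint8),
    )
    for name, shape, dtype in describe("sample.h5"):
        print(name, shape, dtype)
```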

typo in instruction

I think "single-view" should be "single_view" in the instruction below.
(pytorch)$ python test_davis_videos.py --input=single-view

CPU Support

I'd like to run the inference pass on macOS (Mojave), on a system with an AMD/Radeon GPU rather than NVIDIA/CUDA.

Would there be a quick fix to run the inference pass without GPU acceleration but on CPU resources instead?

Being rather new to PyTorch, I'm staring at the stack trace and looking at my first error in models/pix2pix.py, in the line that declares a CUDA model.

new_model = torch.nn.parallel.DataParallel( new_model.cuda(), device_ids=range(torch.cuda.device_count())) 

I can try to make the appropriate changes, but I might need help figuring out what to replace this with.
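
One common pattern for this, sketched below under assumptions, is to choose the device at runtime, wrap the model in DataParallel only when CUDA is present, and pass map_location to torch.load so that tensors saved on a GPU are remapped. The helper names are hypothetical, and it is an assumption (not confirmed by the source) that the checkpoint is a state dict saved from a DataParallel-wrapped model, which is what the "module." prefix handling is for.

```python
import torch


def pick_device():
    """Use CUDA when PyTorch can see a GPU, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def place_model(model, device):
    """Move the model to `device`; use DataParallel only on CUDA machines."""
    model = model.to(device)
    if device.type == "cuda":
        model = torch.nn.DataParallel(
            model, device_ids=list(range(torch.cuda.device_count()))
        )
    return model


def load_state(path, device):
    """Load a checkpoint, remapping any GPU tensors onto `device`."""
    return torch.load(path, map_location=device)


def strip_module_prefix(state_dict):
    """Drop the "module." prefix that DataParallel adds to parameter names.

    Assumption: only needed when the checkpoint came from a wrapped model
    and is being loaded into an unwrapped one.
    """
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```

On a CPU-only machine, the quoted line would then become something like new_model = place_model(new_model, pick_device()), with any other .cuda() calls on inputs replaced by .to(device). Expect CPU inference to be considerably slower than on a GPU.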
