This repository contains inference code for models trained on the Mannequin Challenge dataset introduced in the CVPR 2019 paper "Learning the Depths of Moving People by Watching Frozen People."
This is not an officially supported Google product.
The code is based on PyTorch and has been tested with PyTorch 1.1 and Python 3.6.
We recommend setting up a virtualenv environment for installing PyTorch and the other necessary Python packages. The TensorFlow installation guide may be helpful (follow steps 1 and 2), or follow the virtualenv documentation.
Once your environment is set up and activated, install the necessary packages:
(pytorch)$ pip install torch torchvision scikit-image h5py
The model checkpoints are stored on Google Cloud and may be retrieved by running:
(pytorch)$ ./fetch_checkpoints.sh
Our test set for single-view inference is the DAVIS 2016 dataset. Download and unzip it by running:
(pytorch)$ ./fetch_davis_data.sh
Then run the DAVIS inference script:
(pytorch)$ python test_davis_videos.py --input=single_view
Once the run completes, visualizations of the output should be available in test_data/viz_predictions.
The full model described in the paper requires several additional inputs: the human segmentation mask, the depth-from-parallax buffer, and (optionally) a human keypoint buffer. We provide a preprocessed version of the TUM RGBD dataset that includes these inputs. Download (~9GB) and unzip it using the script:
(pytorch)$ ./fetch_tum_data.sh
To reproduce the numbers in Table 2 of the paper, run:
(pytorch)$ python test_tum.py --input=single_view
(pytorch)$ python test_tum.py --input=two_view
(pytorch)$ python test_tum.py --input=two_view_k
where single_view is the variant I from the paper, two_view is the variant IDCM, and two_view_k is the variant IDCMK. The script prints running averages of the various error metrics as it runs. When the script completes, the final error metrics are shown.
If you find the code or results useful, please cite the following paper:
@inproceedings{li2019learning,
title={Learning the Depths of Moving People by Watching Frozen People},
author={Li, Zhengqi and Dekel, Tali and Cole, Forrester and Tucker, Richard
and Snavely, Noah and Liu, Ce and Freeman, William T},
booktitle={Proc. Computer Vision and Pattern Recognition (CVPR)},
year={2019}
}
mannequinchallenge's Issues
Clarifications on training pipeline
Hi guys! Congrats on the great work.
I have been trying to implement the single-view training pipeline, following the details in the paper's supplemental material, and I have a few questions regarding the implementation:
- In the scale-invariant MSE loss (equation 8), the second term is divided by N. Shouldn't that be N^2? Is this just a typo? (See the sketch after this list.)
- During training you state that you normalize the GT log depths by subtracting a random value from the 40-60th percentile. So you don't actually "normalise" but rather center the map around 0. Since the losses take into consideration the relative distance between pixel pairs, how does this normalisation affect performance? Moreover, I have generated the MC dataset following the instructions in #14, and I have noticed that the absolute and log values of the GT depth maps (usually in the range [5-50]) are significantly larger than those of the depth maps generated from your single-view pretrained model (usually in the range [0.2-1.5]). Is there some other kind of normalisation that you also perform?
- Finally, regarding the "paired" MSE loss, do you actually compute the distances between all possible pairs, or do you do some sub-sampling? This can become really computationally intensive, even for small resolutions: there are potentially n*(n-1)/2 possible pairs for an image with n pixels.
Thanks!
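For reference when discussing equation 8, here is a minimal sketch of an Eigen-style scale-invariant log-depth loss with the second term divided by N^2, as the question suggests. The tensor names (pred_log, gt_log, mask) are assumptions for illustration, not the repo's own variables.

```python
import torch

def scale_invariant_loss(pred_log, gt_log, mask):
    """Scale-invariant MSE on log depths; mask marks valid pixels."""
    d = (pred_log - gt_log) * mask      # per-pixel log-depth residuals, invalid pixels zeroed
    n = mask.sum().clamp(min=1)         # number of valid pixels
    return (d ** 2).sum() / n - (d.sum() ** 2) / (n ** 2)

# Toy usage with random tensors.
p = torch.rand(1, 1, 8, 8).log()
g = torch.rand(1, 1, 8, 8).log()
m = torch.ones_like(p)
print(scale_invariant_loss(p, g, m))
```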
Thank you for your work! When will you provide the training code?
Code to generate depth maps from YouTube videos
Hi! Thanks for the work! Will you release the code to generate depth maps from YouTube videos?
How to generate HDF5 data?
Hi! Thanks for your work! What if I have a dataset that includes RGB, depth, and human masks? How can I generate a dataset similar to your TUM HDF5 dataset?
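As a rough sketch (assuming h5py, which the setup instructions already install), the arrays could be packed into an HDF5 file like this. The dataset names below are placeholders and would need to match whatever keys this repo's loader actually expects.

```python
import h5py
import numpy as np

# Placeholder arrays standing in for real RGB, depth, and human-mask data.
rgb   = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.zeros((480, 640), dtype=np.float32)
mask  = np.zeros((480, 640), dtype=np.uint8)

with h5py.File('my_sequence.h5', 'w') as f:
    f.create_dataset('rgb', data=rgb, compression='gzip')
    f.create_dataset('depth', data=depth, compression='gzip')
    f.create_dataset('human_mask', data=mask, compression='gzip')
```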
Will you share the MannequinChallenge dataset?
I also wonder about the details of training the 3-input model.
training data
Hi! Thanks for the work! Will you release the training data (the 170K valid image-depth pairs)?
Would you like to share the initial depth estimation methods?
Hello, would you like to release the code for the initial estimation of the depth map? It would help us evaluate the method on other datasets (for example, the KITTI dataset). Thank you very much.
Colmap Patch matching settings
Hi,
I am trying to replicate your training method, but my results from COLMAP look denser (though also noisier) than yours (example at the end). I assume this is caused by a different patch-matching setting (I am just using the defaults) and not only by the MegaDepth post-processing steps. Could you provide further details about the hyperparameters you used?
Thanks!
Images Downscaled
Hey, great work you all did! I really like it!
I have a problem though. After running python test_davis_videos.py --input=single_view, the result is downscaled to 288x512 (or 288x1024 with the original and the result next to each other). But the original has a much higher resolution!
I found resized_width and resized_height in the DAVISImageFolder class, but adjusting these made me run out of memory (CUDA) immediately. I have a GTX 1070.
Is it really that intense? Or am I doing something wrong?
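One possible workaround, sketched here under stated assumptions rather than as the repo's own approach: keep the network input at 288x512 and upsample the predicted depth map back to the original frame size afterwards. pred and the target resolution below are stand-ins.

```python
import torch
import torch.nn.functional as F

pred = torch.rand(1, 1, 288, 512)          # stand-in for a predicted depth map
orig_h, orig_w = 1080, 1920                # example original frame resolution
pred_full = F.interpolate(pred, size=(orig_h, orig_w),
                          mode='bilinear', align_corners=False)
print(pred_full.shape)                     # torch.Size([1, 1, 1080, 1920])
```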
typo in instruction
I guess "single-view" should be "single_view" in your below instruction.
(pytorch)$ python test_davis_videos.py --input=single-view
Inference Performance
Hi
Thanks for sharing this promising work. I am planning to run this locally once I have gotten past all the install/config hiccups. I did not see performance data in the paper, the blog, or this repo, but I may have missed it. I will share my results, but it would be nice if there were a list of performance numbers for different hardware specs.
In the meantime, do you have some performance benchmarks to share?
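Since no official benchmarks appear to be published, here is a minimal sketch of how one could time the forward pass on their own hardware. The stand-in network and input size are assumptions, not the repo's model.

```python
import time
import torch

def benchmark(model, sample, n_runs=20):
    """Average forward-pass time in seconds over n_runs (after one warm-up)."""
    model.eval()
    with torch.no_grad():
        model(sample)                          # warm-up pass
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_runs):
            model(sample)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.time() - start) / n_runs

# Example with a stand-in network and a DAVIS-sized input (288x512).
net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
print(benchmark(net, torch.rand(1, 3, 288, 512)))
```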
Some of the videos in the dataset have become unavailable
Some videos, for example https://www.youtube.com/watch?v=UP6dBjsmg3U, are unavailable because the user has removed the video, making it impossible to fully reproduce the original dataset and results. Would you be willing to share the full original videos or the processed sequences?
RuntimeError: unexpected EOF, expected 109330 more bytes. The file might be corrupted.
When I run python test_tum.py --input=two_view_k, I get an error like this:
batchSize: 8
beta1: 0.5
checkpoints_dir: ./checkpoints/
continue_train: False
display_freq: 100
display_id: 1
display_winsize: 256
fineSize: 256
gpu_ids: [0, 1, 2, 3]
human_data_term: 0
identity: 0.0
input: two_view_k
isTrain: True
lambda_A: 10.0
lambda_B: 10.0
loadSize: 286
lr: 0.0004
lr_decay_epoch: 8
lr_policy: step
max_dataset_size: inf
mode: Ours_Bilinear
model: pix2pix
nThreads: 2
name: test_local
ndf: 64
ngf: 64
niter: 100
niter_decay: 100
no_flip: False
no_html: False
no_lsgan: False
norm: instance
output_nc: 3
phase: train
pool_size: 50
print_freq: 100
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
simple_keypoints: 0
use_dropout: False
which_epoch: latest
which_model_netG: unet_256
-------------- End ----------------
========================= TUM evaluation #images = 1815 =========
====================================== DIW NETWORK TRAIN FROM Ours_Bilinear=======================
===================Loading Pretrained Model OURS ===================================
Traceback (most recent call last):
File "test_tum.py", line 36, in
model = pix2pix_model.Pix2PixModel(opt, False)
File "D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd\models\pix2pix_model.py", line 96, in init
new_model, 'G', 'best_depth_Ours_Bilinear_inc_7')
File "D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd\models\base_model.py", line 68, in load_network
model = torch.load(save_path)
File "C:\Users\awww_.DESKTOP-82VA5F8\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 386, in load
return load(f, map_location, pickle_module, **pickle_load_args)
File "C:\Users\awww.DESKTOP-82VA5F8\Anaconda3\envs\pytorch\lib\site-packages\torch\serialization.py", line 580, in _load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 109330 more bytes. The file might be corrupted.
(pytorch) PS D:\DeepLearning\autotrace_motion\mannequinchallenge-vmd>
--
And I checked the file size: it is 8.14 GB in the .tgz. How can I fix this issue? I am on Windows 10 and succeeded with these 2 commands:
(pytorch)$ python test_tum.py --input=single_view
(pytorch)$ python test_tum.py --input=two_view
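An error like this usually means the checkpoint download was truncated, so re-running ./fetch_checkpoints.sh is the simplest thing to try. As a minimal sketch (the path below is only a placeholder; point it at the file named in the traceback), you can check whether a checkpoint deserializes cleanly before launching the full evaluation:

```python
import torch

# Placeholder path: replace with the checkpoint file referenced in the traceback.
path = 'checkpoints/test_local/some_checkpoint.pth'

try:
    torch.load(path, map_location='cpu')
    print('checkpoint deserializes cleanly')
except (RuntimeError, FileNotFoundError) as err:
    print('checkpoint missing or truncated:', err)
```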
How to get the camera intrinsic parameters from the original YouTube wild videos?
I see that you have supplied the camera intrinsics in the dataset, but I do not know how you obtained them when creating the dataset from the original YouTube videos. Could you please explain, or tell me what method you adopted to solve this? Thank you very much!
Camera pose and intrinsics estimation at inference time
Hi, it's mentioned in your paper that ORB-SLAM2 and COLMAP are used to obtain camera poses/intrinsics for the training dataset.
I'm wondering if the same procedure has to be used at inference time. Would there be a more efficient way to obtain camera parameters? Many thanks!
Output depth data format
Hi,
Thanks for sharing the inference code. When the model infers depth from a single image, is the estimated depth in meters? What format is it exactly? Since the ground-truth info is not there, I am not able to figure this out directly. Thank you.
Could I get more detail about object insertion from depth-based visual effect part?
Hi, I am very interested in how you generate the object insertion and ground reconstruction shown in Figure 7(c). Could you give me some pointers or tips on it? Thank you so much! My email address is [email protected].
CPU Support
I'd like to run the inference pass on macOS (Mojave), a system that runs AMD/Radeon rather than NVIDIA/CUDA.
Would there be a quick fix to run the inference pass without GPU acceleration but on CPU resources instead?
Being rather new to PyTorch, I'm starting at the stack trace and looking at my first error in models/pix2pix.py, in the line that declares a CUDA model.
new_model = torch.nn.parallel.DataParallel( new_model.cuda(), device_ids=range(torch.cuda.device_count()))
I can try to make the appropriate changes, but I might need help figuring out what to replace this with.
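A minimal sketch of one way to make that wrapping device-agnostic; new_model below is a stand-in for the network the repo constructs, so this is an illustration rather than the project's own fix:

```python
import torch

new_model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for the repo's network

if torch.cuda.is_available():
    new_model = torch.nn.parallel.DataParallel(
        new_model.cuda(), device_ids=range(torch.cuda.device_count()))
else:
    new_model = new_model.cpu()  # skip DataParallel entirely on CPU

# Checkpoints saved on GPU also need an explicit map_location when loaded on CPU, e.g.:
# state = torch.load(save_path, map_location=torch.device('cpu'))
```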
Inference of the multi-view model - guidance required
OK, so thanks for all this, but if I want to run your advanced model on something other than the preprocessed TUM dataset you provide, using the two views, masks, optical flow, and keypoints, does that mean I have to generate all those inputs in the appropriate formats and cast them properly into the expected HDF5 file structure, all without any specification?
For the RGB images I can manage, but for the flow input, what is the expected format?
For the optional keypoints, what is the expected data format? Can your model support more or fewer keypoints than the ones currently in your dictionary lists?
Reverse-engineering your code is not fun.
In the TUM HDF5 files there are more than 5 data subsets (I expected 2 RGB images, one binary mask image, one RGB flow image, and one vector of keypoint pairs), but there are also some strange low-resolution matrices in the file; I do not know if those are only necessary for training or for inference as well.
Any chance you can give some further guidance on the required inputs and required formats?
Thanks
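As a starting point for reverse-engineering the format, a minimal sketch (assuming h5py and a placeholder path) that lists every dataset, shape, and dtype inside one of the provided TUM HDF5 files:

```python
import h5py

# Placeholder: point at one of the files downloaded by ./fetch_tum_data.sh.
path = 'test_data/tum_hdf5/example.h5'

def show(name, obj):
    # Print only datasets (groups have no shape/dtype).
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File(path, 'r') as f:
    f.visititems(show)
```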
Running on custom videos
Hi Authors,
Is it possible to run the code on custom videos when extrinsic calibration is available?
Regards
Aakash