videopose3d's Introduction

3D human pose estimation in video with temporal convolutions and semi-supervised training

This is the implementation of the approach described in the paper:

Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

More demos are available at https://dariopavllo.github.io/VideoPose3D

Results on Human3.6M

Under Protocol 1 (mean per-joint position error) and Protocol 2 (mean per-joint position error after rigid alignment).

| 2D Detections | BBoxes       | Blocks | Receptive Field | Error (P1) | Error (P2) |
|---------------|--------------|--------|-----------------|------------|------------|
| CPN           | Mask R-CNN   | 4      | 243 frames      | 46.8 mm    | 36.5 mm    |
| CPN           | Ground truth | 4      | 243 frames      | 47.1 mm    | 36.8 mm    |
| CPN           | Ground truth | 3      | 81 frames       | 47.7 mm    | 37.2 mm    |
| CPN           | Ground truth | 2      | 27 frames       | 48.8 mm    | 38.0 mm    |
| Mask R-CNN    | Mask R-CNN   | 4      | 243 frames      | 51.6 mm    | 40.3 mm    |
| Ground truth  | --           | 4      | 243 frames      | 37.2 mm    | 27.2 mm    |

Quick start

To get started as quickly as possible, follow the instructions in this section. This should allow you to train a model from scratch, test our pretrained models, and produce basic visualizations. For more detailed instructions, please refer to DOCUMENTATION.md.

Dependencies

Make sure you have the following dependencies installed before proceeding:

  • Python 3+ distribution
  • PyTorch >= 0.4.0

Optional:

  • Matplotlib, if you want to visualize predictions. Additionally, you need ffmpeg to export MP4 videos, and imagemagick to export GIFs.
  • MATLAB, if you want to experiment with HumanEva-I (you need this to convert the dataset).

Dataset setup

You can find the instructions for setting up the Human3.6M and HumanEva-I datasets in DATASETS.md. For this short guide, we focus on Human3.6M. You are not required to set up HumanEva-I unless you want to experiment with it.

In order to proceed, you must also copy CPN detections (for Human3.6M) and/or Mask R-CNN detections (for HumanEva).

Evaluating our pretrained models

The pretrained models can be downloaded from AWS. Put pretrained_h36m_cpn.bin (for Human3.6M) and/or pretrained_humaneva15_detectron.bin (for HumanEva) in the checkpoint/ directory (create it if it does not exist).

mkdir checkpoint
cd checkpoint
wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_cpn.bin
wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_humaneva15_detectron.bin
cd ..

These models allow you to reproduce our top-performing baselines, which are:

  • 46.8 mm for Human3.6M, using fine-tuned CPN detections, bounding boxes from Mask R-CNN, and an architecture with a receptive field of 243 frames.
  • 33.0 mm for HumanEva-I (on 3 actions), using pretrained Mask R-CNN detections, and an architecture with a receptive field of 27 frames. This is the multi-action model trained on 3 actions (Walk, Jog, Box).

To test on Human3.6M, run:

python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin

To test on HumanEva, run:

python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -a Walk,Jog,Box --by-subject -c checkpoint --evaluate pretrained_humaneva15_detectron.bin

DOCUMENTATION.md provides a precise description of all command-line arguments.

Inference in the wild

We have introduced an experimental feature to run our model on custom videos. See INFERENCE.md for more details.

Training from scratch

If you want to reproduce the results of our pretrained models, run the following commands.

For Human3.6M:

python run.py -e 80 -k cpn_ft_h36m_dbb -arc 3,3,3,3,3

By default the application runs in training mode. This will train a new model for 80 epochs, using fine-tuned CPN detections. Expect a training time of 24 hours on a high-end Pascal GPU. If you feel that this is too much, or your GPU is not powerful enough, you can train a model with a smaller receptive field, e.g.

  • -arc 3,3,3,3 (81 frames) should require 11 hours and achieve 47.7 mm.
  • -arc 3,3,3 (27 frames) should require 6 hours and achieve 48.8 mm.

You could also lower the number of epochs from 80 to 60 with a negligible impact on the result.
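As a rough check, the receptive field is simply the product of the -arc filter widths: 3,3,3 gives 27 frames, 3,3,3,3 gives 81, and 3,3,3,3,3 gives 243. A minimal sketch (my own helper, not part of the repository):

import math

def receptive_field(arc):
    """Receptive field, in frames, for a given -arc string such as '3,3,3,3,3'."""
    return math.prod(int(w) for w in arc.split(','))

for arc in ('3,3,3', '3,3,3,3', '3,3,3,3,3'):
    print(arc, '->', receptive_field(arc), 'frames')  # 27, 81, 243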

For HumanEva:

python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -b 128 -e 1000 -lrd 0.996 -a Walk,Jog,Box --by-subject

This will train for 1000 epochs, using Mask R-CNN detections and evaluating each subject separately. Since HumanEva is much smaller than Human3.6M, training should require about 50 minutes.

Semi-supervised training

To perform semi-supervised training, you just need to add the --subjects-unlabeled argument. In the example below, we use ground-truth 2D poses as input, and train supervised on just 10% of Subject 1 (specified by --subset 0.1). The remaining subjects are treated as unlabeled data and are used for semi-supervision.

python run.py -k gt --subjects-train S1 --subset 0.1 --subjects-unlabeled S5,S6,S7,S8 -e 200 -lrd 0.98 -arc 3,3,3 --warmup 5 -b 64

This should give you an error around 65.2 mm. By contrast, if we only train supervised

python run.py -k gt --subjects-train S1 --subset 0.1 -e 200 -lrd 0.98 -arc 3,3,3 -b 64

we get around 80.7 mm, which is significantly higher.

Visualization

If you have the original Human3.6M videos, you can generate nice visualizations of the model predictions. For instance:

python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin --render --viz-subject S11 --viz-action Walking --viz-camera 0 --viz-video "/path/to/videos/S11/Videos/Walking.54138969.mp4" --viz-output output.gif --viz-size 3 --viz-downsample 2 --viz-limit 60

The script can also export MP4 videos, and supports a variety of parameters (e.g. downsampling/FPS, size, bitrate). See DOCUMENTATION.md for more details.

License

This work is licensed under CC BY-NC. See LICENSE for details. Third-party datasets are subject to their respective licenses. If you use our code/models in your research, please cite our paper:

@inproceedings{pavllo:videopose3d:2019,
  title={3D human pose estimation in video with temporal convolutions and semi-supervised training},
  author={Pavllo, Dario and Feichtenhofer, Christoph and Grangier, David and Auli, Michael},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019}
}

videopose3d's Issues

Normalization during testing

Hi,
I am using your library to predict 3D keypoints for a dataset of individuals walking. The camera I am using is ceiling mounted with the individuals walking toward the camera. I use the 243 pt pretrained model you provide but I notice that the predicted 3D points are less accurate as the individual is at a steeper angle with respect to the camera. There are no issues with the 2D keypoints.
I set the camera resolution to match that of my camera but was wondering if I missed any normalization steps that require the camera matrix. From my understanding, the intrinsic camera parameters are only required for semi-supervised training, and the extrinsic matrix is only used during supervised training. Could you advise if any of these steps are required if I am using the provided pretrained model for testing on my own data?

Training on hourglass ft input, single frame

Hi, thank you for the excellent work and shared code.
I encountered a mismatch between my training result for single-frame, fine-tuned hourglass input and the reported one.
The input is taken from Martinez et al. (ICCV'17) and the training parameters are set as -k sh_ft_h36m -arc 1,1,1. I end up with MPJPE = 58.8 mm, while Table 1(a) reports MPJPE = 51.8 mm, which is where the mismatch is.
Did I miss some training details?

HumanEva preprocessing error

Hello Everyone!

What is meant by the 4th step, "download the critical dataset update and apply it"?
I have downloaded the critical update and added it to the root of the source tree (Release_Code_v1_1_beta/).
I'm getting the following error when trying to run ConvertHumanEva.m:

ConvertHumanEva
Converting...
	Split: Train
	Subject: S1
	Action: Walking
	Trial: 1
Not enough input arguments.

Error in sync_stream (line 160)
for I = 1:length(file_ofs)

Error in ConvertHumanEva (line 100)
                                    = sync_stream(CurrentDataset(SEQ));

Matlab : R2017b
Version : 9.3.0,681905

Thanks in Advance!

Doubts

For in-the-wild videos, I have some doubts:

1) All the predictions are in camera space, right? How can I get them in a real-world coordinate system?

2) What is the location of the root joint, given that all predictions are made with it as the origin?

3) I have the intrinsic and extrinsic camera parameters for my in-the-wild video. Where should I supply them so that they contribute to the result, or how should I pre-process my video with these parameters?

4) I see that the root joint is always fixed, and you said to regress it to recover movement. Could you please explain in a little more detail what you mean?

5) What should the default FPS of the video be? I want results at 200 fps; is that possible?
Thanks

Multi-angle

It is great work!

Could you please tell me whether it supports displaying the 3D pose from multiple angles?

Test in the wild

Thanks for open-sourcing the code. Are you planning to release code for testing on an arbitrary input video, like the one shown in your "demo in the wild"?

Maybe MPJPE Protocol #2 is different

According to your paper, MPJPE protocol 2 is the error after rigid alignment with the ground truth. The definition of rigid alignment only involves rotation and translation, not scaling.

However, I think that your P-MPJPE implementation aligns with rotation, translation and also scaling.

Could you tell me which one is correct?

Thanks in advance.
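For concreteness, here is a minimal numpy sketch of both variants for a single pose pair (my own illustration, not the repository's implementation): with allow_scale=False the alignment is purely rigid (rotation + translation), while with allow_scale=True it is the similarity/Procrustes alignment that most P-MPJPE implementations appear to use.

import numpy as np

def aligned_error(predicted, target, allow_scale=True):
    """Mean per-joint error after aligning `predicted` (J, 3) to `target` (J, 3)."""
    mu_p, mu_t = predicted.mean(axis=0), target.mean(axis=0)
    P, T = predicted - mu_p, target - mu_t

    # Optimal rotation via orthogonal Procrustes / Kabsch.
    M = P.T @ T                                  # 3x3 cross-covariance
    U, s, Vt = np.linalg.svd(M)
    d = np.sign(np.linalg.det(U @ Vt))           # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                               # rows of P are mapped by P @ R

    trace = (s * np.diag(D)).sum()
    scale = trace / (P ** 2).sum() if allow_scale else 1.0

    aligned = scale * P @ R + mu_t
    return np.linalg.norm(aligned - target, axis=1).mean()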

How to map Detectron and H36M skeletons?

I'm trying to test an in-the-wild video, and I already have the 2D keypoints from Detectron (see image below). My question is: since Detectron predicts keypoints that are different from the H36M keypoints (i.e., spine and thorax are not predicted by Detectron, and the 5 MS-COCO face keypoints are not in H36M), how do I map these? Do I have to change something in Detectron in order to map to the skeleton defined in h36m_dataset.py? Thanks!

[attached image: Detectron 2D keypoint output]
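One common heuristic (to be clear, not the repository's official mapping) is to reorder the COCO joints and synthesise the joints COCO lacks as midpoints of their neighbours. The sketch below assumes the standard Detectron/COCO keypoint order; the H36M-style target order is an assumption and should be checked against h36m_dataset.py:

import numpy as np

# Standard COCO order from Detectron:
# 0 nose, 1 l_eye, 2 r_eye, 3 l_ear, 4 r_ear, 5 l_shoulder, 6 r_shoulder,
# 7 l_elbow, 8 r_elbow, 9 l_wrist, 10 r_wrist, 11 l_hip, 12 r_hip,
# 13 l_knee, 14 r_knee, 15 l_ankle, 16 r_ankle

def coco_to_h36m_like(coco):
    """Heuristic remapping of (17, 2) COCO keypoints to an H36M-style skeleton.

    Assumed H36M-style order: 0 hip, 1-3 right leg, 4-6 left leg, 7 spine,
    8 thorax, 9 nose, 10 head, 11-13 left arm, 14-16 right arm.
    """
    h36m = np.zeros((17, 2), dtype=float)
    h36m[0] = (coco[11] + coco[12]) / 2      # pelvis = mid-hip
    h36m[1:4] = coco[[12, 14, 16]]           # right hip, knee, ankle
    h36m[4:7] = coco[[11, 13, 15]]           # left hip, knee, ankle
    h36m[8] = (coco[5] + coco[6]) / 2        # thorax = mid-shoulder
    h36m[7] = (h36m[0] + h36m[8]) / 2        # spine = pelvis/thorax midpoint
    h36m[9] = coco[0]                        # nose
    h36m[10] = (coco[1] + coco[2]) / 2       # rough "head" from the eyes
    h36m[11:14] = coco[[5, 7, 9]]            # left shoulder, elbow, wrist
    h36m[14:17] = coco[[6, 8, 10]]           # right shoulder, elbow, wrist
    return h36m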

Creating my own Dataset based on H36M

Hi Guys,

I'm rather new to machine learning, and this looks like it would fit a project of mine very well.
I'd like to create a dataset based on H3.6M and expand it with my own data for my project. Basically, I would like to add or exchange one subject (my own data) and train semi-supervised for that subject. Does that make sense, or is at least a small percentage of labeled data required for every subject? Is there a guide on how to create such a dataset, or could you give me a few hints on what to do and what kind of software to use?

Thanks a lot for any help in advance!

How to organize the 2D pose data from Detectron + CPN for run.py on a wild video?

Thanks for your attention.

Here are the details of my in-the-wild test.

  1. I used this method to get the Detectron result, like this:
     [attached image: Detectron output]

     It contains two useful entries, pose and det; pose is an ndarray of shape (17, 4): (x, y, logit score before softmax, probability score after softmax).

  2. Then I try to transform the data into VideoPose3D pose data:
import pickle

import numpy as np


def load_detr_res(detr_res):
    """Load the Detectron keypoint detection result (note: open in 'rb' mode).

    Each keypoint has 4 values (x, y, logit score before softmax,
    probability score after softmax), so only indices 0 and 1 are kept below.
    :param detr_res: path to a pickled dict {det, seg, pose}
    """
    with open(detr_res, 'rb') as f:
        return pickle.load(f)


def detr_to_h36m(detr, prefix='frames_1s'):
    frame_len = len(detr.keys())
    h36m_list = [0 for _ in range(frame_len)]
    for frame_item in detr.keys():
        val = detr[frame_item]
        pose = val['pose'][1][0]

        # Keep only the (x, y) coordinates of the 17 joints
        pose_xy = [[0, 0] for _ in range(17)]
        for i in range(2):
            for j in range(17):
                pose_xy[j][i] = pose[i][j]

        # The frame index is encoded in the last digits of the file name
        frame_ind_int = int(frame_item[-7:-4])
        h36m_list[frame_ind_int] = pose_xy

    action = {'S1': {
        'Walking 1': [np.array(h36m_list)]  # shape = (frame_num, 17, 2)
    }}
    return action


def save_as_npz(frame_ns_h36m, name='frame_1s_h36m'):
    np.savez(f'{name}.npz', positions_2d=frame_ns_h36m, positions_3d={})


def dict_to_h36m_npz(dict_path, out_path='frame_0p05s_h36m'):
    detr = load_detr_res(dict_path)
    frame_1s_h36m = detr_to_h36m(detr)
    save_as_npz(frame_1s_h36m, name=out_path)

Here dict_path points to the Detectron result file.

  3. Then I get frame_0p05s_h36m.npz:
     [attached image: contents of the .npz file]

     p2d = frame_0p005_p2d['positions_2d'].item()
     p3d = frame_0p005_p2d['positions_3d'].item()

     The structure is the same as data/data_2d_h36m_cpn_ft_h36m_dbb.npz, but it only has the S1 and 'Walking 1' keys.

  4. Then I change run.py. The original line (followed by the keypoints_symmetry handling)

     keypoints = np.load('data/data_2d_' + args.dataset + '_' + args.keypoints + '.npz')
     keypoints_symmetry = keypoints['metadata'].item()['keypoints_symmetry']
     kps_left, kps_right = list(keypoints_symmetry[0]), list(keypoints_symmetry[1])

     is changed to load my own file:

     keypoints = np.load('data/wild/frame_0p05s/frame_0p05s_h36m.npz')

  5. This method fails.

overfitting

Hello,
When I use my own 2D keypoints to predict 3D keypoints, I run into a serious overfitting problem. My training set has 290 thousand samples and my test set has 50 thousand. Thank you for your assistance!

Why is "qinverse" needed when going from world coordinates to camera coordinates?

Hi, I really like your work and thank you for providing the source code.

I have a question about the conversion from the world coordinate to the camera coordinate.
in common/camera.py

def world_to_camera(X, R, t):
    Rt = wrap(qinverse, R) # Invert rotation
    return wrap(qrot, np.tile(Rt, (*X.shape[:-1], 1)), X - t) # Rotate and translate

Why do the quaternions need to be inverted before converting from world coordinates to camera coordinates?
Can you tell me what the meaning of inverting the quaternions is?

Best regards.
Fabro
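For what it's worth, the snippet above suggests the stored rotation describes the camera's orientation in the world frame (i.e., the camera-to-world transform), with t being the camera centre in world coordinates; mapping a world point into the camera frame therefore has to undo that pose, which is why the rotation is inverted first. A tiny self-contained sketch with rotation matrices (the numbers are made up for illustration):

import numpy as np

# Toy extrinsics: R_cw rotates camera coordinates into world coordinates,
# t is the camera centre expressed in world coordinates.
theta = np.pi / 6
R_cw = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0,            0.0,           1.0]])
t = np.array([1.0, 2.0, 1.5])

x_world = np.array([0.5, 0.3, 1.0])

# world -> camera must undo the camera pose, hence the inverse (transpose) rotation
x_cam = R_cw.T @ (x_world - t)

# camera -> world applies the pose directly, mirroring camera_to_world
assert np.allclose(R_cw @ x_cam + t, x_world)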

error

When I run:
#!/bin/bash
~/anaconda3/bin/python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -c checkpoint --evaluate pretrained_humaneva15_detectron.bin --render --viz-subject Validate/S2 --viz-action "Walking 1 chunk0" --viz-camera 0 --viz-output output_he.gif --viz-size 3 --viz-downsample 2 --viz-video "/home/wangmeng/humaneva/S2/Image_Data/Walking_1_(C1).avi" --viz-skip 115 --viz-limit 100

I get an assertion error at:

assert keypoints[subject][action][cam_idx].shape[0] >= mocap_length

bad performance on the same wild video

hello

  1. I downloaded the same skating video, at 1920×1080 resolution, from YouTube.
  2. I predicted 2D COCO joints for this video with the model you provided in #2.
  3. I made a dataset file and replaced res_w and res_h in h36m_dataset.py.
  4. Then I get the following result with d-pt-243.bin:
    [attached image: predicted 3D pose]

It is obviously worse than your result:
[attached image: reference result]

I noticed that your video has a higher resolution and much more accurate 2D joints. Could you please release the original skating video and the in-the-wild test code?

train GT 3D data

Hi, Thanks for your great job.

Recently I looked at the training process, specifically the section that pre-processes the H36M ground-truth 3D joints. It seems to first convert the units from mm to m, then transform the coordinates from world to camera space, and finally subtract the hip point from every frame's 3D coordinates.
Is that right?
I don't understand why we need to transform the coordinates from world to camera; by the way, does H3.6M provide the 3D joints in world coordinates?
Could you please tell me about this? Thank you in advance.

Pretrained model for detectron keypoints (d-pt-243.bin) no longer available for download?

I had been testing pose estimation on "in the wild" videos following instructions in another issue and using a pretrained model for keypoints generated by Detectron.

The model had been downloadable from here:

https://s3.amazonaws.com/video-pose-3d/d-pt-243.bin

I noticed the change away from S3 storage, and I was hoping the same model would be available here:

https://dl.fbaipublicfiles/video-pose-3d/d-pt-243.bin

...but it seems like it didn't make it along with the migration.

Can that model be made available again?

How to normalize cameras for testing a wild video?

Hi,
I was testing a wild video using keypoints extracted from Detectron.
In run.py, for evaluation, there is a lot of camera normalization code (specific to the four H36M subjects).
My doubt is: if we are testing a wild video like the one in your demo, how do we do that?

What I understand is that I should load the keypoints from the saved .npz file and evaluate directly after loading the model, skipping everything else in run.py.

Mapping of model output to 3d key point

Hello,

I'm performing research for the University of New Brunswick on view invariance of 3D keypoint extraction. We've run your algorithm on our collected data and it seems to work quite well. I need to create a class to store all the 3D keypoints as a JSON file with human-readable keys, and I need to figure out which indices of the output map to which keypoints (i.e., the left knee is index 4 of the array, etc.). I can't seem to find this mapping in the docs at all.

Thank you

Matthew Sampson

HumanEva

Can you please provide data_3d_humaneva15.npz for the 3D poses, and data_2d_humaneva15_gt.npz? I do not really plan to buy MATLAB just for this project, since my interest is casual.

Failed to reproduce result reported on the paper

I ran the following command to test your model with ground truth 2D keypoints inputs.
python run.py -k gt -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin
However, the program prints Protocol #1 (MPJPE) action-wise average: 42.6 mm, which is worse than the reported result (37.2 mm).
Is the released model only intended to reproduce the result with 2D keypoints from CPN? In other words, are the results in Table 3 of the paper obtained with models trained separately on each source of 2D keypoints, rather than with a unified model?

Recommended machine and Google Colab

Hi,

I would like to know the minimal hardware needed to run this model, perhaps with a little extra training, and whether it is possible to do it in Google Colab.

I will be doing this for a martial-arts group (Kung Fu, Aikido). However, my machine is not up to it...

Thank you for your attention.

Can the data be provided?

Hi,

Is it possible for you to upload the data file (dataset_path = 'data/data_3d_' + args.dataset + '.npz') for convenience?

Thanks

HumanEva erroneous annotation

I ran the code in this repo to generate the 15-joint skeleton for the HumanEva-I dataset. I noticed that the head joint of subject S3 in Walking_1 is wrong for most of the video (the whole training part and some of the validation): it sits on (in front of) the chest rather than on the head.

Here is the skeleton I get:
[attached screenshot: generated skeleton with misplaced head joint]

Did you deal with this in any way or did you just evaluate the method against the wrong head position in the validation set?

Run for own video

Hi,
I love the work you have done.
How would I run your pretrained models on my own set of videos to get 3D joint locations as output?

Unable to execute run.py

aki@Ak-Inspiron-3542:~/VideoPose3D-master$ python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin --render --viz-subject S11 --viz-action Walking --viz-camera 0 --viz-video /home/ak/VideoPose3D-master/private/video/walk.mp4 --viz-output output.gif --viz-size 3 --viz-downsample 2 --viz-limit 60
Namespace(actions='*', architecture='3,3,3,3,3', batch_size=1024, bone_length_term=True, by_subject=False, causal=False, channels=1024, checkpoint='checkpoint', checkpoint_frequency=10, data_augmentation=True, dataset='h36m', dense=False, disable_optimizations=False, downsample=1, dropout=0.25, epochs=60, evaluate='pretrained_h36m_cpn.bin', export_training_curves=False, keypoints='cpn_ft_h36m_dbb', learning_rate=0.001, linear_projection=False, lr_decay=0.95, no_eval=False, no_proj=False, render=True, resume='', stride=1, subjects_test='S9,S11', subjects_train='S1,S5,S6,S7,S8', subjects_unlabeled='', subset=1, test_time_augmentation=True, viz_action='Walking', viz_bitrate=3000, viz_camera=0, viz_downsample=2, viz_limit=60, viz_no_ground_truth=False, viz_output='output.gif', viz_size=3, viz_skip=0, viz_subject='S11', viz_video='/home/ak/VideoPose3D-master/private/video/walk.mp4', warmup=1)
Loading dataset...
Preparing data...
Loading 2D detections...
INFO: Receptive field: 243 frames
INFO: Trainable parameter count: 16952371
Loading checkpoint checkpoint/pretrained_h36m_cpn.bin
This model was trained for 80 epochs
INFO: Testing on 543344 frames
Rendering...
ffprobe: error while loading shared libraries: libx264.so.138: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "run.py", line 742, in
input_video_skip=args.viz_skip)
File "/home/ak/VideoPose3D-master/common/visualization.py", line 100, in render_animation
for f in read_video(input_video_path, skip=input_video_skip):
File "/home/ak/VideoPose3D-master/common/visualization.py", line 26, in read_video
w, h = get_resolution(filename)
TypeError: 'NoneType' object is not iterable

Can someone please tell me why I am getting this error?

the run.py problem

In run.py, I think the code at line 756 should be 'if action_name not in all_actions_by_subject[subject]:'. I'm just a little confused.

joint name of each 3d coordinates

First of all, thanks for your wonderful work.

I chose a random YouTube video and extracted the 3D coordinates of all 17 joints for every frame in the video. Since I have 253 frames in total, I end up with a numpy.ndarray of shape [253, 17, 3].

But the thing is, I want to know which index matches which joint, e.g. which one is the left knee.

Sorry for the rudimentary question. If there is any part of the code that would help me understand this, please let me know.

Thanks!

Causality and video padding

Hi and thanks a lot for releasing your code!

I have a question regarding the padding of videos in the case of causal convolutions. Sorry if that's answered elsewhere or in the paper, it's not very clear to me :)

Looking at Figure 9a of the paper (the symmetric case), it's clear that for the first and last frames of the video (depending on the receptive field), the network "sees" duplicate frames due to padding. In the example in 9a, the most extreme cases for symmetric convolutions are the 2 first and 2 last frames, where the left and right context respectively consists only of duplicated frames.

While in the extreme cases for symmetric convolutions the samples still include some non-duplicate frames (future or past, depending on the position in the video), for causal convolutions we have more extreme cases, where for the first 2 frames the network sees samples that are pretty much only padding, i.e. the same frame duplicated receptive_field - 1 times.

So my questions for the causal case are:

  • Is the assumption that the network is robust enough to differentiate between these cases
    (all duplicated past frames VS actual past frames)?
  • Did you run any experiments for causal convolutions where you discard these extreme cases and never present them as samples to the network? If yes, was there any change in performance? I guess it's hard to evaluate whole videos in that case, since you can only get predictions for frames where frame_id >= receptive_field if you discard all cases with padding.

Hope my questions make sense :D

Thanks!

detector missing one or more keypoint

Hi,
Thanks for your work. I use OpenPose to detect keypoints; sometimes the detector fails for some reason and misses one or more body keypoints. Could this still work in that situation?

Dataset setup

To make a dataset with new 3D points, do the camera parameters need to match? Or can any 3D points be converted to your camera space using the camera parameters you provide? For example, I now have my own 3D point dataset: do I need my own camera parameters, or can I build the training dataset directly with your camera parameters?

Cannot draw the GIF correctly

[attached image: incorrectly rendered GIF]

The run script is here:

#!/usr/bin/env bash
python3 run.py \
-k cpn_ft_h36m_dbb \
-arc 3,3,3,3,3 \
-c checkpoint \
--evaluate pretrained_h36m_cpn.bin \
--render \
--viz-subject S11 \
--viz-action Walking \
--viz-camera 0 \
--viz-video "./Walking.54138969.mp4" \
--viz-output output.gif \
--viz-size 3 \
--viz-downsample 2 \
--viz-limit 60

Thank you very much.

Prediction before 'camera_to_world' transformation is very different

Hello again, I was hoping you could help me better understand the raw predictions of the cpn-dt-243 model. I'm testing the model on some videos/pictures in the wild and (as mentioned in this thread: https://github.com//issues/38) I finally got it to output some quite accurate 3D keypoints.

What I'm trying to understand is when working with some in the wild data, we apply the camera_to_world transformation on predictions, where camera rotation is taken at "random" from the h36m_3d dataset - what does this actually do? Furthermore, I tried to plot predictions pre- and post- transformation expecting to see the same skeleton but at different angle/size etc. Unfortunately, what I get are two distinctly different poses, see below:

[attached screenshot: pre- and post-transformation pose plots]

It's pretty clear the proportions are different, and in the pre-transformation plot the forearms are up (which is not the case in the actual image), whereas the post-transformation plot represents the actual image very accurately.

Does this make sense?

How to keep same sequence size for input 2D and output 3D

Hello,
Thanks for sharing this great work. I was wondering how you get 3D poses of the same temporal size as the input 2D poses. Say my input sequence length is 88 and the -arc argument is 3,3,3; in this case the network will output a sequence of length 62. How did you handle this during loss calculation: did you take only 62 time steps, or did you pad the input sequence to get an output sequence of 88 3D poses?
Thanks
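One way to handle this, consistent with the edge padding the paper describes (see Figure 9a), is to replicate half the receptive field at each end of the 2D input before running the model, so the valid output covers every input frame. A minimal numpy sketch (the function name and defaults are my own):

import numpy as np

def pad_for_receptive_field(keypoints_2d, filter_widths=(3, 3, 3)):
    """Edge-pad a (T, J, 2) sequence so the valid output covers all T frames.

    The receptive field is the product of the filter widths (27 for -arc 3,3,3),
    so (receptive_field - 1) // 2 frames are replicated at each end.
    """
    receptive_field = int(np.prod(filter_widths))
    pad = (receptive_field - 1) // 2
    return np.pad(keypoints_2d, ((pad, pad), (0, 0), (0, 0)), mode='edge')

padded = pad_for_receptive_field(np.zeros((88, 17, 2)))
print(padded.shape)  # (114, 17, 2): 88 + 2 * 13, which maps back to 88 output frames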

reconstruction error result

I use Detectron to generate 2D keypoints from a wild video.
Then I run run.py to get the 3D pose, but the reconstruction result is weird, as follows:

[attached image: distorted 3D reconstruction]

Should MPJPE be averaged on 16 joints or 17 joints?

Hi, firstly thanks for your code.
I have a question about the way you calculate MPJPE. According to your code, the root joint position of the target 3D poses is set to [0, 0, 0], and MPJPE is then computed against the predicted poses.

# line 428 in run.py
inputs_3d[:, :, 0] = 0

# Predict 3D poses
predicted_3d_pos = model_pos(inputs_2d)
loss_3d_pos = mpjpe(predicted_3d_pos, inputs_3d)

This means that your MPJPE is averaged over 17 joints.
However, in some other evaluations, for example https://github.com/rayat137/Pose_3D, MPJPE is averaged over 16 joints and the root is not involved.

I want to know which one is correct, 16 or 17?
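For reference, a small numpy sketch of the metric as the snippet above implies it is computed: the target root is zeroed and the mean is taken over every joint, root included (averaging over 16 non-root joints instead is simply a different convention):

import numpy as np

def mpjpe_all_joints(predicted, target):
    """MPJPE over (T, J, 3) arrays, averaged over all joints including the root.

    Mirrors the evaluation snippet above: the target root is zeroed, so joint 0
    contributes the error of the (root-relative) predicted root.
    """
    target = target.copy()
    target[:, 0] = 0
    return np.linalg.norm(predicted - target, axis=-1).mean()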

About evaluation on datasets

Hello,
I have one issue with your paper: you only evaluate on Human3.6M and HumanEva, which I think is not enough to prove the generalization ability of the model. Previous papers usually also evaluate on the MPI-INF-3DHP dataset. Have you evaluated on it? Otherwise, the model may just be overfitting Human3.6M.

2D input format dependency

Hi, could you provide any specifics about the expected 2D keypoint input? Specifically, does the network expect a vector of keypoints with joints in a specific order? Looking at the test data you supplied, they stay consistent, i.e. left arm (11, 12, 13), etc. Perhaps the input keypoints also have to be rescaled?

I tried feeding keypoints generated with CPN and AlphaPose on my own dataset, but I seem to get a random set of 3D points.
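On the rescaling question: the normalisation quoted elsewhere in these issues ("normalize the 2D pose to the longer edge of the frame and keep the aspect ratio") can be sketched as below. This is my own minimal version, not necessarily identical to the repository's helper; it assumes pixel coordinates of shape (..., 2) and a landscape frame (w >= h):

import numpy as np

def normalize_screen_coordinates(keypoints, w, h):
    """Map pixel coordinates so that x spans roughly [-1, 1] and y keeps the aspect ratio."""
    return keypoints / w * 2.0 - np.array([1.0, h / w])

kps = np.array([[640.0, 360.0], [0.0, 0.0]])
print(normalize_screen_coordinates(kps, w=1280, h=720))
# [[ 0.      0.    ]   image centre
#  [-1.     -0.5625]]  top-left corner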

Back-projection Idea

Hi, thanks for open sourcing your code for the paper. Leveraging temporal convolutions is a great idea and the results are promising. However, I just want to point out an issue about the back-projection idea. The idea was visited for 3D human pose estimation by 2 previous works (maybe others that I'm not aware of):

  • Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision (link) at CVPR 2017.
  • Can 3D Pose be Learned from 2D Projections Alone? (link) by @dylandrover et al. at ECCVW 2018.

It'd be nice to mention those great works in your preprint.

Thanks!

Model works only with square frames

Hi,

I realised that the cpn-pt-243 checkpoint produces good results when the pose is estimated on square images. Whenever a rectangular image is used (e.g. 1280 × 720), the resulting 3D pose is somewhat skewed to one side.

For example here's the original 1280 by 720 frame:
[attached image: result on the original 1280 × 720 frame]

And here's the same video/frame but cropped to 720 by 720 just before extracting the 2D pose coordinates:
[attached image: result on the 720 × 720 crop]

As you can see, the second one is a lot more accurate. However, it seems like it shouldn't really matter: although the training data was square(?), it shouldn't affect the results as long as I (quoting the author of the repo) normalize the 2D pose to the longer edge of the frame and keep the aspect ratio, which I do in both cases.

Could you please advise whether the model indeed expects a square input image, or whether that's just a fluke?

Thanks

hip fix

Hi, I get the 3D pose result for a wild video and find that the central hip point is fixed: when a person sits down, the hip point stays the same, like this:

[attached image: predicted pose with a fixed hip]

I think it may come from the training data preparation. Could you please tell me what the function of inputs_3d[:, :, 0] = 0 is?
https://github.com/facebookresearch/VideoPose3D/blob/master/run.py#L312
Why does the central hip coordinate need to be set to (0, 0, 0)? Is this meant to fix the hip point?

I appreciate your reply. Thank you in advance.

problem about Hip.

Hello, my name is Long.
I have generated 3D points from 2D points for my own video.
Question:
My hip point is fixed, but in your video it seems to be moving. How can I do that?
Thanks.

Ground Truth and Limb Lengths?

Hello,
Congrats on VideoPose - a magnificent piece of work.

I am assessing the accuracy of the already processed video.
I figured a good way to do this would be to see how much the limb lengths vary from frame to frame.

As per https://github.com/facebookresearch/VideoPose3D, I ran:
python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin
But I added some code to def update_video(i): in visualization.py to print out the 3D coordinates of the skeleton points for every frame.

Then, for each frame, I calculated the limb lengths, e.g. the distance from ankle to knee, elbow to wrist, etc.

For Ground Truth, the limb lengths are basically identical for each frame - great!
For Reconstruction, there is a lot of variation.
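A small numpy sketch of that per-frame limb-length check, for anyone who wants to reproduce it; the parents list below is illustrative only and should be taken from the skeleton definition in h36m_dataset.py:

import numpy as np

def bone_lengths(poses_3d, parents):
    """Per-frame bone lengths for a (T, J, 3) array; parents[j] is j's parent (-1 for the root)."""
    bones = [(j, p) for j, p in enumerate(parents) if p >= 0]
    return np.stack(
        [np.linalg.norm(poses_3d[:, j] - poses_3d[:, p], axis=-1) for j, p in bones],
        axis=1)  # (T, num_bones)

# Illustrative parents array for a 17-joint skeleton (check h36m_dataset.py for the real one)
parents = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]
lengths = bone_lengths(np.random.randn(100, 17, 3), parents)
print(lengths.std(axis=0) / lengths.mean(axis=0))  # relative variation of each bone over time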

This leads me to two questions (pardon my ignorance; I have looked these up, but without much success):

What exactly is ground truth, and why does it improve accuracy so much?

What determines the length of the limbs / the size of the skeleton?

Oh, one more question:
Some of the video results show the person moving through space / over the ground, whereas others show them moving on the spot. For example, on https://github.com/facebookresearch/VideoPose3D, the results show the walker moving across the ground, but the skater moving on the spot.
What determines this difference?

Thank you
