
pose-hg-train's Introduction

Stacked Hourglass Networks for Human Pose Estimation (Training Code)

This is the training pipeline used for:

Alejandro Newell, Kaiyu Yang, and Jia Deng, Stacked Hourglass Networks for Human Pose Estimation, arXiv:1603.06937, 2016.

A pretrained model is available on the project site. You can use the option -loadModel path/to/model to try fine-tuning.

To run this code, make sure the following are installed:

  • Torch7
  • hdf5
  • cudnn

Getting Started

Download the full MPII Human Pose dataset, and place the images directory in data/mpii. From there, it is as simple as running th main.lua -expID test-run (the experiment ID is arbitrary). To run on FLIC, again place the images in a directory data/flic/images, then call th main.lua -dataset flic -expID test-run.

Most of the command line options are pretty self-explanatory, and can be found in src/opts.lua. The -expID option will be used to save important information in a directory like pose-hg-train/exp/mpii/test-run. This directory will include snapshots of the trained model, training/validation logs with loss and accuracy information, and details of the options set for that particular experiment.

Running experiments

There are a couple of features to make running experiments a bit easier:

  • An experiment can be continued with th main.lua -expID example-exp -continue; it will pick up where the experiment left off with all of the same options set. If you want to change an option like the learning rate, make the same call as above but add, for example, -LR 1e-5, and it will preserve all of the old options except for the new learning rate.

  • In addition, the -branch option initializes a new experiment directory while leaving the original experiment intact. For example, if you have trained for a while and want to drop the learning rate but don't know what to change it to, you can do something like the following: th main.lua -branch old-exp -expID new-exp-01 -LR 1e-5, and then compare against a separate experiment: th main.lua -branch old-exp -expID new-exp-02 -LR 5e-5.

In src/misc there's a simple script for monitoring a set of experiments to visualize and compare training curves.

Getting final predictions

To generate final test set predictions for MPII, you can call:

th main.lua -branch your-exp -expID final-preds -finalPredictions -nEpochs 0

This assumes there is an experiment that has already been run. If you just want to provide a pre-trained model, that's fine too, and you can call:

th main.lua -expID final-preds -finalPredictions -nEpochs 0 -loadModel /path/to/model

Training accuracy metric

For convenience during training, the accuracy function evaluates PCK by comparing the output heatmap of the network to the ground truth heatmap. The normalization in this case will be slightly different from the normalization done when officially evaluating on FLIC or MPII. So there will be some discrepancy between the numbers, but the heatmap-based accuracy still provides a good picture of how well the network is learning during training.
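To make this concrete, here is a rough sketch of what heatmap-based PCK looks like (illustrative only; the real logic lives in src/util/eval.lua and differs in its details):

require 'torch'

-- Illustrative sketch, not the exact code in src/util/eval.lua.
-- A joint counts as "correct" if the predicted heatmap peak lands
-- within some fraction of the heatmap size of the ground-truth peak.
local function argmax2D(hm)
    local colMax, rowIdx = torch.max(hm, 1)   -- max over rows, per column
    local _, colIdx = torch.max(colMax, 2)    -- max over those column maxima
    local x = colIdx[1][1]
    local y = rowIdx[1][x]
    return x, y
end

local function heatmapPCK(predHM, gtHM, thr)
    local px, py = argmax2D(predHM)
    local gx, gy = argmax2D(gtHM)
    local dist = math.sqrt((px - gx)^2 + (py - gy)^2)
    return dist / predHM:size(1) <= thr       -- e.g. thr = 0.5 on a 64x64 map
end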

Final notes

In the paper, the training time reported was with an older version of cuDNN; after switching to cuDNN 4, training time was cut in half. Now, with an NVIDIA Titan X GPU, training time from scratch is under 3 days for MPII, and about 1 day for FLIC.

pypose/

Included in this repository is a folder with some old Python code that I used. It hasn't been updated in a while, and might not be totally functional at the moment. There are a number of useful functions for doing evaluation and analysis on pose predictions, so it is worth digging into. It will be updated and cleaned up soon.

Questions?

I am sure there is a lot not covered in the README at the moment, so please get in touch if you run into any issues or have any questions!

Acknowledgements

Thanks to Soumith Chintala, this pipeline is largely built on his example ImageNet training code available at: https://github.com/soumith/imagenet-multiGPU.torch


pose-hg-train's Issues

Would you please also share your train.log and valid.log for reference?

Hi, would you share the train.log and valid.log for reference? They would be a great help for monitoring the training process.

P.S. Did you train the model for only 100 epochs? The paper mentions that you drop the learning rate several times; if so, at which epochs did you drop it? Thank you.

Ask for TensorFlow Version

Hi, I recently read your publications "Hourglass" and "Associative Embedding", and noticed that you used TensorFlow to implement "Associative Embedding".

Is it possible for you to release the TensorFlow version in the future, including both works? After all, Lua and Torch are not as popular as TF, and it is quite inconvenient for other people to reproduce the results.

Head and Shoulder are not included in computing the accuracy

Hi @anewell ,

I am reading your code and found that the accuracy on the MPII dataset is computed based only on these predefined joints: self.accIdxs = {1,2,3,4,5,6,11,12,15,16}. In your paper, results for the Head, Shoulder, Elbow, Wrist, Hip, Knee, and Ankle joints are reported. It seems Head and Shoulder are missing from self.accIdxs. Could you explain why?
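For readers cross-referencing this, the indices are consistent with the standard MPII joint ordering (this annotation is mine, not from the repo):

-- Standard MPII joint ordering (1-indexed), for reference:
-- 1 r-ankle,    2 r-knee,      3 r-hip,    4 l-hip,
-- 5 l-knee,     6 l-ankle,     7 pelvis,   8 thorax,
-- 9 upper neck, 10 head top,   11 r-wrist, 12 r-elbow,
-- 13 r-shoulder, 14 l-shoulder, 15 l-elbow, 16 l-wrist
-- So accIdxs = {1,2,3,4,5,6,11,12,15,16} covers the ankles, knees,
-- hips, wrists, and elbows only.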

Difference between *.h5 annotations and official .mat annotation

Hi, I noticed that for some of the images there is a difference between the annotations in the h5 files and the official mpii_human_pose_v1_u12_1.mat file.

For the first entry, for example, the order of the parts is different (not really that big of a deal), and the values for scale (3.7763 vs 3.02) and center ((594,257) vs (594,302)) are different. I made sure that both refer to the same image and the same person ID within the image.
Is there any particular reason for this? Does it make a difference which one I use?

Final prediction doesn't seem to work with a trained model

Hi!

I have trained a model, and during training I got 0.84/0.80 accuracy. However, when I use that trained model to produce final predictions, it reports 0.02 accuracy. Did I use the final prediction incorrectly?

Thanks a lot!

th main.lua  -expID thanksgiving  -trainBatch 3 -continue -LR 1e-8 -LRdecay 1e-2 -nEpochs 10
Saving everything to: /home/ming/pose-hg-train/exp/mpii/thanksgiving
Input is a tensor with dimensions: 3 x 256 x 256
Output is a table
         Entry 1 is a tensor with dimensions: 16 x 64 x 64
         Entry 2 is a tensor with dimensions: 16 x 64 x 64
         Entry 3 is a tensor with dimensions: 16 x 64 x 64
         Entry 4 is a tensor with dimensions: 16 x 64 x 64
         Entry 5 is a tensor with dimensions: 16 x 64 x 64
         Entry 6 is a tensor with dimensions: 16 x 64 x 64
         Entry 7 is a tensor with dimensions: 16 x 64 x 64
         Entry 8 is a tensor with dimensions: 16 x 64 x 64
==> Loading model from: /home/ming/pose-hg-train/exp/mpii/thanksgiving/final_model.t7
==> Converting model to CUDA
==> Starting epoch: 181/190
 [======================================== 8000/8000 ==================================>]  Tot: 30m53s | Step: 232ms    
      train : Loss: 0.0048804 Acc: 0.8402
 [======================================== 1000/1000 ==================================>]  Tot: 1m13s | Step: 72ms      
      valid : Loss: 0.0057696 Acc: 0.8002

th main.lua -expID finalp -finalPredictions -nEpochs 0 -loadModel ../exp/mpii/thanksgiving/final_model.t7 
Saving everything to: /home/ming/pose-hg-train/exp/mpii/finalp
Input is a tensor with dimensions: 3 x 256 x 256
Output is a table
         Entry 1 is a tensor with dimensions: 16 x 64 x 64
         Entry 2 is a tensor with dimensions: 16 x 64 x 64
         Entry 3 is a tensor with dimensions: 16 x 64 x 64
         Entry 4 is a tensor with dimensions: 16 x 64 x 64
         Entry 5 is a tensor with dimensions: 16 x 64 x 64
         Entry 6 is a tensor with dimensions: 16 x 64 x 64
         Entry 7 is a tensor with dimensions: 16 x 64 x 64
         Entry 8 is a tensor with dimensions: 16 x 64 x 64
==> Loading model from: ../exp/mpii/thanksgiving/final_model.t7
==> Converting model to CUDA
==> Generating predictions...
 [======================================== 11731/11731 ================================>]  Tot: 14m9s | Step: 71ms      
      test : Loss: 0.0122433 Acc: 0.0002

Meaning of imgname in annot.h5

Hi,

great work!

I don't fully understand what you store as imgname in annot.h5. I am training with my own dataset and am not quite sure what should be specified there.

Right now, in your code, it looks like:
DATASET "imgname" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 40614, 16 ) / ( 40614, 16 ) }
DATA {
(0,0): 48, 51, 55, 52, 53, 52, 48, 49, 50, 46, 106, 112, 103, 0, 0, 0,

which does not look like the names of the images. How should these values be interpreted?

thanks in advance
Javier
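The values shown are consistent with ASCII character codes padded with zeros. Under that assumption, a row can be decoded with a sketch like this (decodeImgname is a hypothetical helper, not part of the repo):

-- Hypothetical helper, assuming each row of 'imgname' stores the
-- ASCII codes of the filename, zero-padded to 16 entries.
local function decodeImgname(row)
    local chars = {}
    for i = 1, row:size(1) do
        local c = row[i]
        if c == 0 then break end           -- stop at the zero padding
        table.insert(chars, string.char(c))
    end
    return table.concat(chars)
end
-- The row shown above, {48,51,55,52,53,52,48,49,50,46,106,112,103,0,0,0},
-- decodes to "037454012.jpg".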

Official PCKh on the validation set?

Hi Alejandro,

Thanks for sharing your codes!

Since you have noted that the PCKh in the log files is not the official one, could you tell me how to obtain the official PCKh on the MPII validation set using your code, just like the PCKh results reported in your paper?

Thanks,
Wei

Question on the transform function in img.lua

function transform(pt, center, scale, rot, res, invert)
    local pt_ = torch.ones(3)
    pt_[1],pt_[2] = pt[1]-1,pt[2]-1

    local t = getTransform(center, scale, rot, res)
    if invert then
        t = torch.inverse(t)
    end
    local new_point = (t*pt_):sub(1,2):add(1e-4)

    return new_point:int():add(1)
end
  1. Is there any special reason why you add 1e-4 to new_point at the end?
  2. Why do you subtract 1 from pt at the start and then add 1 back at the end (instead of just using the original pt)?

Structure of .h5 file

Hi, anewell,

Thank you for sharing the code! It's a great help for understanding the network structure.

Under '/data/mpii/annot/', the 'Datasets' in test.h5, valid.h5 and train.h5 are different. What they have in common are:
1-center
2-imgname
3-index
4-normalize
5-part
6-person
7-scale
8-torsoangle
9-visible

I'm trying to fine-tune with my own data. Which datasets should I provide in the .h5 file, and what do they mean?

bad argument #6 to 'sub'

Hi, anewell! I've hit this issue during training from scratch, at different epochs (4, 5, 12), after the latest commits.
==> Starting epoch: 12/100 torch/install/bin/luajit: /opt/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 3 callback] pose-hg-train/src/util/img.lua:115: bad argument #6 to 'sub' (out of range at torch/pkg/torch/generic/Tensor.c:330)

Any ideas about what caused it and how it could be fixed?
Thank you in advance.

The amplitude of predicted maps

Since only ReLU is used in the network, how do you ensure that the amplitude of the predicted maps is in [0, 1], the same as the Gaussian maps?

What does the 'modelArgs' mean?

modelArgs appears in createModel(modelArgs) in the file model.lua. However, I cannot find its definition.
By the way, how can I create a model and run it on fake input data from scratch? I cannot figure it out.
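As a rough sketch of probing the network with fake input (whether createModel needs explicit arguments once the options in src/opts.lua are loaded is an assumption here):

-- Sketch only: probe the network with a fake input to inspect shapes.
-- The argument-free createModel() call is an assumption, not repo fact.
local model = createModel()
local fakeInput = torch.randn(1, 3, 256, 256):cuda()  -- batch of one RGB 256x256 image
local output = model:forward(fakeInput)
print(#output)  -- with nStack = 8, expect a table of eight 16x64x64 heatmaps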

Image crop out of range during MPII training

Thanks for releasing the code! It is really helpful.

I tried the training on MPII and after 7 epochs it gave an error for the image cropping being out of range:

==> Starting epoch: 8/100
... pose-hg-train/src/util/img.lua:107: bad argument #6 to 'sub' (out of range...

Do you have any guess why this would happen?

Some questions about the intermediate loss

Reading the paper, it seems the network is trained end-to-end. I wonder how the intermediate loss is used: is all of the loss added together at the last loss layer before backpropagation, or is every single hourglass trained independently? I cannot find the code for the intermediate loss; can you help me? Thanks!
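For context, this kind of intermediate supervision can be expressed with nn.ParallelCriterion (which does appear in this repo's stack traces elsewhere on this page); a minimal sketch, not the repo's exact code:

require 'nn'

-- Minimal sketch: one MSE term per hourglass output, summed into a
-- single loss, so the whole stack trains end-to-end with one backward
-- pass rather than each hourglass training independently.
local nStack = 8
local criterion = nn.ParallelCriterion()
for i = 1, nStack do
    criterion:add(nn.MSECriterion())
end
-- outputs and labels are tables of nStack heatmap tensors:
-- local loss = criterion:forward(outputs, labels)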

When will you update the src/dataset/flic.lua file?

Hello

I want to test on the FLIC dataset. Can you upload the file?

Another question: is "Associative Embedding: End-to-end Learning for Joint Detection and Grouping" a CVPR paper or not? I think the idea is as novel as Part Affinity Fields, but the results are a bit weak. Anyway, very interesting work.

Gaussian kernel size

The sigma of the Gaussian used as the target of the MSE criterion might be 0.25, despite the 1 px described in the article.

See src/util/img.lua, line 214:

local g = image.gaussian(size)

In the image package, the default values of the horizontal and vertical standard deviations sigma_horz and sigma_vert of the Gaussian kernel are sigma, and sigma itself defaults to 0.25 (relative to the kernel size).

Heatmap gaussian standard deviation

We're trying to implement a model similar to yours, but are running into some confusion when generating the heatmap targets.

In your paper, you say that you use 2D gaussians with a standard deviation of 1 px for targets. However, looking at your drawGaussian function: https://github.com/anewell/pose-hg-train/blob/c25da4b48c74a7b314384b07e8c09a15b0343e7f/src/util/img.lua#L204-L224, you set size to 6 * sigma + 1.

That hmGauss value appears to be 1 per https://github.com/anewell/pose-hg-train/blob/2fef6915fbd836a5d218a5d2f0c87c463532f1a6/src/opts.lua#L64.

This appears to lead to a size of 7, which per the Torch image.gaussian constructor should give a sigma of 1.75.

Am I misunderstanding something here?
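For reference, this relative-sigma behavior is easy to check directly against the Torch image package (a sketch; the 1/7 value just illustrates how to request a true 1 px sigma for a size-7 kernel):

require 'image'

-- image.gaussian's sigma argument is relative to the kernel size and
-- defaults to 0.25, so a size-7 kernel gets an effective standard
-- deviation of 0.25 * 7 = 1.75 px:
local g_default = image.gaussian(7)
-- Passing sigma = 1/7 yields a true 1 px standard deviation instead:
local g_1px = image.gaussian(7, 1/7)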

Clarification of the scores variable in the postprocess function

Hi,

I am replicating your work and have one short question about the variable "scores" in the postprocess function in pose.lua.

What is the meaning of this variable? I cannot fully tell from the code what is being stored here. I have seen that it is a 20x1 tensor with values between 0 and 1; could you please explain a bit what is being stored?

thanks in advance

How to train this network on new dataset?

Hi @anewell

Thank you so much for sharing this training code. I am wondering if there is a tutorial for training on a new dataset, for example the CUB-200-2011 birds dataset. It provides keypoint locations and bounding box annotations. Are these enough to make the network run? Thanks!

A question about eval

Hey, I read your paper and implemented it using TensorFlow, and after training I have a question. During testing, if I don't know the scale and center for an image (the person may or may not be in the center), what if I just resize the input image to 256x256 directly, without scale and center information, and then run the prediction? Would the result be bad? Have you tried this? And if so, is there any way to get a good result without scale and center?
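For what it's worth, a common workaround when only a person bounding box is available is to derive center and scale from it, following the MPII convention that scale is person height relative to 200 px. A hedged sketch (boxToCenterScale is a hypothetical helper, not a repo function):

-- Hypothetical helper: approximate MPII-style center/scale from a
-- person bounding box (x1, y1, x2, y2). MPII defines scale as the
-- person height relative to 200 px.
local function boxToCenterScale(x1, y1, x2, y2)
    local center = torch.Tensor{(x1 + x2) / 2, (y1 + y2) / 2}
    local scale = (y2 - y1) / 200
    return center, scale
end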

Using nn.gModule

I have another question; I'm new to machine learning, and especially to Torch. Why did you wrap your model with nn.gModule rather than nn.Module? What is the difference?
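For context, nn.gModule comes from the nngraph package and wraps an arbitrary directed graph of modules, which is what lets a network express branching and merging paths (like the hourglass skip connections); a plain container such as nn.Sequential can only chain layers. A tiny standalone example:

require 'nngraph'

-- A graph with one input that branches into two layers and merges
-- back; nn.gModule handles the routing automatically.
local input = nn.Identity()()
local branch1 = nn.Linear(10, 10)(input)
local branch2 = nn.Linear(10, 10)(input)
local merged = nn.CAddTable()({branch1, branch2})
local model = nn.gModule({input}, {merged})
local out = model:forward(torch.randn(10))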

Evaluation on MPII Human Pose Dataset

Hi, @anewell, I want to use the evaluation script evaluatePCKh.m downloaded from the MPII Human Pose Dataset site to measure my model's performance, but I could not find the annolist_dataset_v12.mat file loaded in this line of the script:

load([p.gtDir '/annolist_dataset_v12'],'annolist');

Do you know how to get this file? Thanks.

idxs, preds, hms, inp = loadPreds('mpii/test-run/preds', true, false) doesn't work.

The following error pops up:

/home/yxchng/torch/install/share/lua/5.1/hdf5/group.lua:312: HDF5Group:read() - no such child 'heatmaps' for [HDF5Group 33554432 /]
stack traceback:
[C]: in function 'error'
/home/yxchng/torch/install/share/lua/5.1/hdf5/group.lua:312: in function 'read'
/home/yxchng/pose-hg-train/src/util/eval.lua:10: in function 'loadPreds'
[string "idxs, preds, hms, inp = loadPreds('mpii/test-..."]:1: in main chunk
[C]: in function 'xpcall'
/home/yxchng/torch/install/share/lua/5.1/itorch/main.lua:210: in function </home/yxchng/torch/install/share/lua/5.1/itorch/main.lua:174>
/home/yxchng/torch/install/share/lua/5.1/lzmq/poller.lua:75: in function 'poll'
/home/yxchng/torch/install/share/lua/5.1/lzmq/impl/loop.lua:307: in function 'poll'
/home/yxchng/torch/install/share/lua/5.1/lzmq/impl/loop.lua:325: in function 'sleep_ex'
/home/yxchng/torch/install/share/lua/5.1/lzmq/impl/loop.lua:370: in function 'start'
/home/yxchng/torch/install/share/lua/5.1/itorch/main.lua:389: in main chunk
[C]: in function 'require'
(command line):1: in main chunk
[C]: at 0x00405d50

-task 'pose-int' vs 'pose'

Hi anewell,

Thanks again for the update!

When -task is switched from 'pose-int' to 'pose' (i.e., not using intermediate supervision), the training goes wrong.
The situation is similar when -nStack is set to 1.

I guess that somewhere the program anticipates a sequence of length 1 but gets a bare tensor instead.

Is there any quick fix to this?

Thanks!
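One hedged guess at a workaround (not a confirmed fix): if some part of the pipeline expects a table of outputs but a single-stack model returns a bare tensor, wrapping the output before the loss and accuracy code may help:

-- Hedged sketch, not a confirmed fix: normalize the model output to
-- always be a table so downstream code that iterates over stacks works.
local output = model:forward(input)
if torch.type(output) ~= 'table' then
    output = {output}
end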

about HDF5 errors

When I run th main.lua -expID test-run, I get the following errors:
HDF5-DIAG: Error detected in HDF5 (1.8.17) thread 140244906858368:
#000: H5F.c line 604 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
#1: H5Fint.c line 992 in H5F_open(): unable to open file: time = Mon Apr 23 15:39:05 2018
, name = '/home/master-grade2-1/linli/pose/pose-hg-train-master/data/mpii/train.h5', tent_flags = 0
major: File accessibilty
minor: Unable to open file
#2: H5FD.c line 993 in H5FD_open(): open failed
major: Virtual File Layer
minor: Unable to initialize object
#3: H5FDsec2.c line 339 in H5FD_sec2_open(): unable to open file: name = '/home/master-grade2-1/linli/pose/pose-hg-train-master/data/mpii/train.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0
major: File accessibilty
minor: Unable to open file
/home/master-grade2-1/linli/torch/torch/install/bin/luajit: ...-1/linli/torch/torch/install/share/lua/5.1/hdf5/file.lua:12: HDF5File: fileID -1 is not valid
stack traceback:
[C]: in function 'error'
...-1/linli/torch/torch/install/share/lua/5.1/hdf5/file.lua:12: in function '__init'
...1/linli/torch/torch/install/share/lua/5.1/torch/init.lua:91: in function <...1/linli/torch/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'open'
...inli/pose/pose-hg-train-master/src/util/dataset/mpii.lua:19: in function '__init'
...1/linli/torch/torch/install/share/lua/5.1/torch/init.lua:91: in function <...1/linli/torch/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'Dataset'
...ter-grade2-1/linli/pose/pose-hg-train-master/src/ref.lua:29: in main chunk
[C]: in function 'dofile'
main.lua:2: in main chunk
[C]: in function 'dofile'
...orch/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406670

How can I solve this problem? Thanks :)

Memory consumption of the model

Hi, I am trying to implement your code in Theano and I am bumping into memory problems. Could you tell me how much memory your model takes on the GPU? I am using a Titan X as well. Thanks.

Usage of visualize_results

Hi,

I have a short question regarding the usage of the Lua file for visualizing results, misc/visualize_results.lua.

This file expects you to store your predictions in exp/mpii/best.

is this "best" file and "best_preds" automatically generated when validating your net? When is the group "pred_heatmap" generated in the preds.h5? I have not been able to find this on the code

Thank you very much in advance

Error running with "-continue"

Hi anewell, thanks for sharing this project!
I ran the command th main.lua -expID test-run successfully, and I had to stop after epoch 1. Then I wanted to resume it with th main.lua -expID test-run -continue, but I hit the error below.

Saving everything to: /home/ubuntu/pose-hg-train/exp/mpii/test-run  
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/pose-hg-train/src/opts.lua:111: attempt to perform arithmetic on field 'lastEpoch' (a nil value)
stack traceback:
    /home/ubuntu/pose-hg-train/src/opts.lua:111: in main chunk
    [C]: in function 'dofile'
    /home/ubuntu/pose-hg-train/src/ref.lua:17: in main chunk
    [C]: in function 'dofile'
    main.lua:2: in main chunk
    [C]: in function 'dofile'
    ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

Is it that lastEpoch was not saved correctly? Have you seen similar errors before, and do you have any idea how to solve this?
Thank you in advance.
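A hedged guess at a local workaround (opt.lastEpoch here is inferred from the error message, and this is not a confirmed fix): defaulting the field before the arithmetic in src/opts.lua would at least avoid the nil error:

-- Hedged sketch around src/opts.lua:111, not the actual repo code:
-- if the previous run never recorded an epoch, treat it as epoch 0.
opt.lastEpoch = opt.lastEpoch or 0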

A PyTorch version

Thanks for sharing your code!

I wrote a PyTorch version of the hourglass network. Hopefully this is helpful for those who are not familiar with Torch. Much of the data-processing code is brought over from your code (src/pypose). Thanks again to the author!

However, the code cannot reproduce the results perfectly (an 83.58 PCKh@0.5 score for the simplified 4-stack hourglass). Some details might be missing, especially in the post-processing (e.g., coordinate mapping). Anyone interested in this project is welcome to contribute!

Unable to run final prediction

Hi,

I've tried to run the final prediction with the model file downloaded from your website, but I get an error. Could you help? Thanks!

th main.lua -expID finalp -finalPredictions -nEpochs 0 -loadModel umich-stacked-hourglass.t7

Input is a tensor with dimensions: 3 x 256 x 256
Output is a table
         Entry 1 is a tensor with dimensions: 16 x 64 x 64
         Entry 2 is a tensor with dimensions: 16 x 64 x 64
         Entry 3 is a tensor with dimensions: 16 x 64 x 64
         Entry 4 is a tensor with dimensions: 16 x 64 x 64
         Entry 5 is a tensor with dimensions: 16 x 64 x 64
         Entry 6 is a tensor with dimensions: 16 x 64 x 64
         Entry 7 is a tensor with dimensions: 16 x 64 x 64
         Entry 8 is a tensor with dimensions: 16 x 64 x 64
==> Loading model from: ../../pose-hg-demo/umich-stacked-hourglass.t7
==> Converting model to CUDA
==> Generating predictions...
/home/ming/torch/install/bin/luajit: /home/ming/torch/install/share/lua/5.1/nn/MSECriterion.lua:13: attempt to index local 'input' (a nil value)
stack traceback:
        /home/ming/torch/install/share/lua/5.1/nn/MSECriterion.lua:13: in function 'updateOutput'
        ...ing/torch/install/share/lua/5.1/nn/ParallelCriterion.lua:23: in function 'forward'
        /home/ming/pose-hg-train/src/train.lua:47: in function 'step'
        /home/ming/pose-hg-train/src/train.lua:106: in function 'predict'
        main.lua:38: in main chunk
        [C]: in function 'dofile'
        ...ming/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405d50

Inaccuracy caused by transform

Hi Newell,

I noticed that the transform function causes some inaccuracy:

function transform(pt, center, scale, rot, res, invert)
    local pt_ = torch.ones(3)
    pt_[1],pt_[2] = pt[1]-1,pt[2]-1

    local t = getTransform(center, scale, rot, res)
    if invert then
        t = torch.inverse(t)
    end
    local new_point = (t*pt_):sub(1,2)

    return new_point:int():add(1)
end

The return value is cast to int, which introduces the inaccuracy.

You can see that the norm of the variable diff is very large in the following test code.
That's a big problem.

t_pt = transform(pt, center, scale, rot, res, false)
recovered_pt = transform(t_pt, center, scale, rot, res, true)
diff = pt - recovered_pt

In train.lua, you use the following code to work around this problem when validating or testing.
Am I right?

 -- Validation: Get flipped output
output = applyFn(function (x) return x:clone() end, output)
local flippedOut = model:forward(flip(input))
flippedOut = applyFn(function (x) return flip(shuffleLR(x)) end, flippedOut)
output = applyFn(function (x,y) return x:add(y):div(2) end, output, flippedOut)

Multi GPU training

Hi anewell,

I'm new to Torch. Is there a way to train the model using multiple GPUs, like Caffe does? Thank you!
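For reference, Torch does support multi-GPU data parallelism via nn.DataParallelTable in the cunn package; a minimal sketch (not something this repo wires up for you):

require 'cunn'

-- Minimal sketch: replicate the model on GPUs 1 and 2 and split each
-- minibatch along dimension 1 (the batch dimension).
local dpt = nn.DataParallelTable(1)
dpt:add(model, {1, 2})
model = dpt:cuda()
-- use this model wherever the single-GPU model was used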

Training on MSCOCO keypoint dataset

Hi @anewell ,

First of all thank you for sharing your code.

I'm preparing COCO keypoint annotations and a dataset-specific interface file to train on COCO. I've mostly finished, except for one issue: the MPII dataset provides a head bbox for each person, which COCO doesn't have. To overcome this I found a workaround:

  • If the shoulders are visible, use the distance between the shoulders to define the head bbox; otherwise, use the person bbox and an ideal human body ratio to define the head bbox.

But this approach is error-prone when the image doesn't contain the whole body.

Is there any advice to properly define the head size?

The code snippet below belongs to the src/misc/convert_annot.py file:

import math
import numpy as np

# Find shoulder coordinates
left_shoulder = (ann['keypoints'][0::3][6], ann['keypoints'][1::3][6])
right_shoulder = (ann['keypoints'][0::3][5], ann['keypoints'][1::3][5])

# If shoulders are not visible, approximate the head bbox with person bbox values
if left_shoulder == (0, 0) or right_shoulder == (0, 0):
    diff = np.array([ann['bbox'][3] / 7.5, ann['bbox'][2] / 7.5], float)
    normalization = np.linalg.norm(diff) * .6

# If shoulders are visible, define the head bbox from the distance between the shoulders
else:
    dist = math.sqrt((right_shoulder[0] - left_shoulder[0])**2 + (right_shoulder[1] - left_shoulder[1])**2)
    diff = np.array([dist / 2, dist / 1.5], float)
    normalization = np.linalg.norm(diff) * .6

annot['normalize'] += [normalization]

Missing labels in data/mpii/annot.h5

There are lots of missing entries in the annot.h5 files.
One direct result is that main.lua -eval returns nan for accuracy.
I also printed out several part labels directly from the dataset.

After creating the dataset object, run the following script:

for i = 1, 100 do   -- print the part coordinates for the first 100 samples
    local pts = dataset:getPartInfo(i)
    print('the joint centers of test ', i, '\n coordinates are', pts)
end

Part of the results:

the joint centers of test 62
coordinates are 1033 538
1044 463
1051 344
1099 352
1116 458
1117 543
1075 348
1077 255
1078 252
1116 195
1125 253
1105 293
1048 259
1105 250
1108 307
1110 357
[torch.DoubleTensor of size 16x2]

the joint centers of test 63
coordinates are 0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
[torch.DoubleTensor of size 16x2]

The output for tests 64, 65, and 66 is the same all-zero 16x2 tensor.

Actually, from test 63 to test 97, the labels are totally empty.

Could you provide corrected annotation files? Thanks.

Validation loss/accuracy inconsistency problem

Hi @anewell ,

I found that during training, I get the following log:

==> Starting epoch: 40/100
 [======================================== 8000/8000 ==================================>]  Tot: 1h10m | Step: 526ms     
      train : Loss: 0.0056964 Acc: 0.8067
 [======================================== 1000/1000 ==================================>]  Tot: 1m54s | Step: 116ms     
      valid : Loss: 0.0060296 Acc: 0.8033

Then I use the following command to run testing on the validation set:

th main.lua -dataDir ../data/ -expDir ../exp/ -trainIters 0 -loadModel ../exp/mpii/default/model_40.t7

and get the following result. The validation accuracy and loss differ from those reported during training.

Saving everything to: /home/wyang/code/pose/iccv17/pose-hg-train/exp/mpii/default	
Input is a tensor with dimensions: 3 x 256 x 256	
Output is a table	
	 Entry 1 is a tensor with dimensions: 16 x 64 x 64	
	 Entry 2 is a tensor with dimensions: 16 x 64 x 64	
	 Entry 3 is a tensor with dimensions: 16 x 64 x 64	
	 Entry 4 is a tensor with dimensions: 16 x 64 x 64	
	 Entry 5 is a tensor with dimensions: 16 x 64 x 64	
	 Entry 6 is a tensor with dimensions: 16 x 64 x 64	
	 Entry 7 is a tensor with dimensions: 16 x 64 x 64	
	 Entry 8 is a tensor with dimensions: 16 x 64 x 64	
==> Loading model from: ../exp/mpii/default/model_40.t7	
==> Converting model to CUDA	
==> Starting epoch: 1/100	
 [======================================== 1000/1000 ==================================>]  Tot: 2m42s | Step: 115ms     
      valid : Loss: 0.0050312 Acc: 0.8510	

I have checked your code many times but still have no idea. Can you reproduce my problem? How can it be solved? Thank you.
