vitoralbiero / img2pose

The official PyTorch implementation of img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation - CVPR 2021

License: Other

Python 82.11% Shell 0.02% CMake 0.20% C++ 14.34% Cython 3.34%

face-pose-estimators face-detection face-alignment bounding-box-labels 3d-face

img2pose's Issues

test_own_images

When I run the test_own_images notebook, I hit a problem in the last cell.

What should I do? Please help me, thank you!

What is the difference between the constrained and unconstrained head pose model?

Hello. Please help me figure out what units the returned translation vector is in. I am trying to determine how far a person is from the camera using the img2pose model. It seems that X and Y are not measured in the same units as Z, and I can't work out what these units are.

Questions related to the prediction values and rendered results

Dear Vítor Albiero,

Thanks for your helpful comments in the previous git issues.
It was great help in understanding the paper, img2pose.

I have additional questions.

What is the definition of the proposed method's result (i.e. the img2pose prediction value)?
According to equation (2) in your paper, the 6D vector h_i consists of Euler angles and a 3D face translation vector.
You also told me that pose_pred contains rotation vectors, not Euler angles. (The two representations can easily be converted into each other.)
I understand your comment, and I also could not find any code that converts the rotation to Euler angles [2].
Since I want to fully understand your paper, I politely ask again: is it correct that the whole proposed method (i.e. network and post-processing) produces the global pose h_{i}^{img}, consisting of a 3D face translation vector and a 3D rotation vector (not Euler angles), for the given entire image rather than for the image crop B defined in Appendix A?
Questions related to the rendered results.
Please refer to the images in [3] below. Note that the image is 27_Spa_Spa_27_32.jpg from the WIDERFACE training dataset. In other words, I think the model may have already used this image in the training phase.
I got the values for the boxes, labels and dofs from the lmdb dataset that I built following your guide. I think the values were obtained correctly because both functions, random_crop and random_clip, are turned off.
However, the result in [3]-b is a little odd, which confuses me.
If I extracted the GT values correctly, it is impressive that the model generalizes well thanks to the many other GT annotations and still produces good results even though, as you can see, the GT used for training may be inaccurate.
For reference, for other images processed in the same way, the renderings from both the predictions and the GT values looked fine, unlike [3], although I did not attach pictures.
- Is it correct that the GT dof values in the lmdb dataset correspond to h_{i}^{img*} for the given entire image (i.e. whole image) in Fig. 4?
- Which factor do you think determines the size of the face? It is difficult to understand why the rotation vector determines the size of the face. Could you explain this?

[1] #27 (comment)
[2] In inference mode (i.e. test_own_images.ipynb), the proposed network performs transform module, transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std), at the end. However the transform does not include any rotation - Euler conversion method.
[3] Rendered result for the image, 27_Spa_Spa_27_32.jpg.

a) The rendered result of the img2pose.

b) The rendered result using the GT values that are obtained in the train.py using below additional codes:

data_dict = targets[0]
boxes = data_dict['boxes'].numpy().tolist()
labels = data_dict['labels'].numpy().tolist()
dofs = data_dict['dofs'].numpy().tolist()

Question about BIWI evaluation

Hello,

In evaluation/jupyter_notebooks/biwi_evaluation.ipynb, the annotations file is opened with dataset_path = "./BIWI_annotations.txt"; however, I couldn't find that file in the BIWI dataset. I downloaded the dataset from the link you provided in the readme.

Would you please tell me where I can find this file? I would highly appreciate it.

Cheers,
Harry.

What is the result of the training script in README? Why is the batch_size = 2 in your training scripts?

@vitoralbiero
Hello, it is nice to try out your great work.
However, I have 2 questions:
What is the result of the training script in README?
Why is the batch_size = 2 in your training scripts?

Why not 4, 16, or 64?
Does the result differ much for different batch_size values?

CUDA_VISIBLE_DEVICES=0 python3 train.py \
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy \
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy \
--workspace ./workspace/ \
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb \
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb \
--prefix trial_1 \
--batch_size 2 \
--lr_plateau \
--early_stop \
--random_flip \
--random_crop \
--max_size 1400

Thanks for your kind-hearted reply!

Not able to evaluate images using the file test_own_images.ipynb

First of all, thank you for your work. I am trying to run the "test_own_images.ipynb" evaluation, but I am getting the error "the size of tensor a must match the size of tensor b at non-singleton dimension 0" at the line "res = img2pose_model.predict([transform(img)])[0]". I am trying to use images from the CASIA webface and MultiPIE datasets. Please let me know how I should solve this.

Testing with WIDER FACE dataset

Thank you for the open-source code. In the WIDER FACE dataset evaluation section of the readme, you use the validation data instead of the test data. Could you explain why? Am I missing something?
Could you also give some more details on how to evaluate the pretrained models to get the WIDER FACE results mentioned in the paper (Testing section of the readme)?

Thanks a lot for your help :)

Some questions about 68 facial landmarks

Hello. I have some questions:

How do I get the (tx, ty, tz) coordinates of the 68 facial landmarks?
What is the physical meaning of the tz coordinate? Is it the depth of the feature point relative to the camera?
How do I run the code on a camera stream to get the pose prediction [rx, ry, rz, tx, ty, tz]?

Looking forward to your reply! Thank you

Regarding yaw, pitch and roll in test_own_images.ipynb

Hi, can you please help me get the yaw, pitch and roll angles in test_own_images.ipynb?
Thank you so much

mAP and plot curves

Hey,
So I downloaded the eval_tools.zip file. However, I don't really understand what to do with it to get the mAP and plot curves. I unzipped it and it contains a lot of MATLAB files. Could you kindly be a bit clearer about which files to run?
Regards,
Zaigham

how to get the Yaw Pitch Roll X Y Z when testing?

Can this model be applied to face recognition

Hello, I want to ask a question: can this model be applied to face recognition? Can you give me some ideas?

Can 6DoF prediction be done with a one-stage detector?

Detecting faces directly through 6DoF is great work. Can this task be done directly with a one-stage method?
Thanks~

How do I extract the head position feature

My last question has been solved; thank you very much for your answer. If I want to extract head pose features for downstream tasks, which layer's output in the network should I extract? The features should reflect only the direction of the head, not the position of the face. Would it be module.roi_heads.class_head.fc7?

focal length and image resolution

Can you explain why you set the focal length equals w+h (for the global camera)? I checked some articles, and their conclusion is also that higher resolution leads to higher focal length, I know it is a way to approximate, but I still cannot fully understand the reason behind this. Thank you very much~
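
As a way to reason about this question, the following sketch builds a pinhole intrinsics matrix with f = w + h and computes the field of view it implies. Only the f = w + h rule comes from the question; placing the principal point at the image centre is an assumption made for illustration, and both function names are hypothetical.

```python
import math


def global_intrinsics(width, height):
    """Pinhole intrinsics with focal length f = width + height.

    Assumption: the principal point sits at the image centre.
    """
    f = float(width + height)
    return [
        [f, 0.0, width / 2.0],
        [0.0, f, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


def horizontal_fov_degrees(width, focal_length):
    """Horizontal field of view implied by a pinhole camera."""
    return math.degrees(2.0 * math.atan(width / (2.0 * focal_length)))


for w, h in [(640, 480), (1280, 960)]:
    K = global_intrinsics(w, h)
    print(w, h, K[0][0], round(horizontal_fov_degrees(w, K[0][0]), 2))
```

Under this rule the focal length in pixels scales with the resolution, so the implied field of view depends only on the aspect ratio: the same scene resized to twice the resolution keeps the same field of view.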

How does Sim3DR render color?

I want to use Sim3DR to render a face with color, but I get a wrong image.
I use this function:
Sim3DR_Cython.rasterize(bg, vertices, triangles, colors, buffer, triangles.shape[0], height, width, channel,
                        reverse=reverse)

and the input colors have shape (N, 3).

Evaluation on AFLW2000 seems flawed (not comparable to others)

Hi,
first of all congrats for achieving great results on head pose estimation. I find it quite impressive what's possible with such an implicit approach (going from automatized WIDER pose annotations, where surely there is some error on the ground-truth data itself). On BIWI your results seem to set new SOTA accuracy.

Yet, for AFLW2000 it seems you are comparing your predicted pose rotation with the ground-truth (GT) that you generated yourself via fitting your own n-pt model to the landmarks via:
_, pose_target = get_pose(threed_points, target_points, image_intrinsics),
which basically uses solvepnp:
retval, rvecs, tvecs = cv2.solvePnP( threed_landmarks, twod_landmarks, camera_intrinsics, None, flags=cv2.SOLVEPNP_EPNP)

Now the problem is that AFLW2000 provides ground-truth headpose information in euler angles that all the other authors have used, to which you are now comparing your results based on your own pose GT in your paper (Table1):

Please correct me if I am wrong with those assumptions. Assuming I understood your evaluation correctly, that means everyone else compares to the data that is provided with AFLW2000, while you are comparing and basing your result metrics on your own data. Hence, it is not comparable unless 1) you also use the provided ground-truth or 2) rerun the experiments of the other authors while providing your data, which at the minimum would raise the question as to why your ground-truth is more correct?

PS: I reran your model (single scale) using approach 1) but only achieved an MAE accuracy above 7 degrees...

Best regards,
Felix

skipping 'lib/rasterize.cpp'

Hey , whenever I run this command :

I am getting this :

What does this mean?
Please help me with this.

How to interpret the translation vector t?

I am wondering how I can interpret the t vector of the 6DoF pose.

In which coordinate frame/relative to what is the translation specified? Is the reference frame consistent across multiple images?

Do inferential images need to run test_own_images?

How to train on my own dataset?

Hello!
Thanks for your perfect project!
I would like to know how to train on my own dataset.
I noticed that you use the five landmarks to generate 6DoF pose labels by standard means. Can you share the code for these methods?
Thanks for your beautiful work!
I am waiting for your reply!

Is Prediction Rotation Vector or Euler Angles (yaw, pitch roll)?

Hi,

Thanks for your code. I was checking your notebook biwi_evaluation.ipynb and found that the ground truth and predictions are both in rotation vector representation (3d vector representing rotation axis and norm representing rotation angle). However, when you print out the errors, you call it error on Yaw, Pitch and Roll. But it should be errors on rotation vector?

Evidence of ground truth being rotation vector:

    img_path, pitch, yaw, roll = sample
    pitch = float(pitch)
    yaw = float(yaw)
    roll = float(roll)
        
    annotations = open(img_path.replace("_rgb.png", "_pose.txt"))
    lines = annotations.readlines()
    
    pose_target = []
    for i in range(3):
        lines[i] = str(lines[i].rstrip("\n")) 
        pose_target.append(lines[i].split(" ")[:3])
    
    pose_target = np.asarray(pose_target)       
    pose_target = Rotation.from_matrix(pose_target).as_rotvec()        
    pose_targets.append(pose_target)

Here it's clear that the pose_target is first a rotation matrix and then converted to rotation vector by as_rotvec().
To convert to Pitch, Yaw, Roll it should be

    pose_target = Rotation.from_matrix(pose_target).as_euler('xyz', degrees=False) # in order pitch, yaw, roll

Thanks!
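
To make the difference between the two representations concrete, here is a minimal sketch that converts a rotation vector to Euler angles without SciPy. It is an illustration, not code from the repository: the function names are hypothetical, the angle order is intended to follow the extrinsic 'xyz' convention used in the snippet above, and calling the three angles pitch, yaw and roll is the same assumption that snippet makes. Gimbal lock (second angle at ±90 degrees) is not handled.

```python
import math


def rotvec_to_matrix(rotvec):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    x, y, z = rotvec
    theta = math.sqrt(x * x + y * y + z * z)
    if theta < 1e-12:
        # No rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = x / theta, y / theta, z / theta  # unit rotation axis
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def matrix_to_euler_xyz(R):
    """Angles (a, b, c) in radians such that R = Rz(c) * Ry(b) * Rx(a)."""
    b = -math.asin(max(-1.0, min(1.0, R[2][0])))
    a = math.atan2(R[2][1], R[2][2])
    c = math.atan2(R[1][0], R[0][0])
    return a, b, c


# Example input: an arbitrary rotation vector (radians).
rotvec = [0.3, -0.2, 0.1]
angles = matrix_to_euler_xyz(rotvec_to_matrix(rotvec))
print([round(math.degrees(v), 3) for v in angles])
```

For a rotation about a single axis the two representations coincide; they drift apart as soon as the rotation has components about more than one axis, which is why errors computed on rotation vectors are not the same as errors on Euler angles.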

Drawing axis based on yaw, pitch, roll

Hi,

I am trying to render (x,y,z) axis based on network output instead of using provided renderer.

Currently using this code:

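import cv2
import numpy as np


def draw_axis(img, euler_angle, center, size=80, thickness=3,
              angle_const=np.pi / 180, copy=False):
    if copy:
        img = img.copy()

    # Convert to radians without modifying the caller's array in place.
    euler_angle = np.asarray(euler_angle, dtype=float) * angle_const
    center = np.asarray(center, dtype=float)
    sin_pitch, sin_yaw, sin_roll = np.sin(euler_angle)
    cos_pitch, cos_yaw, cos_roll = np.cos(euler_angle)

    # 2D image-plane projections of the rotated x, y and z unit axes.
    axis = np.array([
        [cos_yaw * cos_roll,
         cos_pitch * sin_roll + cos_roll * sin_pitch * sin_yaw],
        [-cos_yaw * sin_roll,
         cos_pitch * cos_roll - sin_pitch * sin_yaw * sin_roll],
        [sin_yaw,
         -cos_yaw * sin_pitch]
    ])
    axis *= size
    axis += center

    # np.int was removed in NumPy 1.24; the builtin int replaces it.
    axis = axis.astype(int)
    print('axis', axis)

    # cv2.line expects points as tuples of plain Python ints.
    tp_center = tuple(int(v) for v in center)

    cv2.line(img, tp_center, tuple(int(v) for v in axis[0]), (0, 0, 255), thickness)
    cv2.line(img, tp_center, tuple(int(v) for v in axis[1]), (0, 255, 0), thickness)
    cv2.line(img, tp_center, tuple(int(v) for v in axis[2]), (255, 0, 0), thickness)

    return img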

According to readme poses variable contains 6 values per detected face

      [
        0.013399183198461687,
        0.0015700862562677677,
        -0.0008193041494016704,
        0.1667461395263672,
        -7.139801979064941,
        53.44799041748047
      ]

Is it true that the first value is pitch, then yaw, then roll, and the other 3 are horizontal translation, vertical translation and scale?

Based on the provided rendering code, I get a fairly static axis that always points in the same direction, but the face mask shows the orientation.

How to get mAP and plot curves in python scripts, instead of `eval tools`?

Hello, I'm not quite familiar with eval tools.
How can I get the mAP and plot curves with Python scripts instead of eval tools, once I have the results in results/WIDER_FACE/Val?

Why does model size differ between trained model and your released model?

When running the training script from the readme, the generated model size is 324M,
while your released model in the model zoo is only ~150M.

CUDA_VISIBLE_DEVICES=0 python3 train.py \
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy \
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy \
--workspace ./workspace/ \
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb \
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb \
--prefix trial_2 \
--batch_size 2 \
--lr_plateau \
--early_stop \
--random_flip \
--random_crop \
--max_size 1400

Spikes in training curve

Hi Vitor,

I'm using the default code and the default WiderFace data. However, the training curve has spikes, as in the following image.

I'm wondering if this is abnormal or not?

smooth_l1_loss not part of models.detection

Getting error:
File "C:\Code\Python\img2pose-main\rpn.py", line 467, in compute_loss
det_utils.smooth_l1_loss(
AttributeError: module 'torchvision.models.detection._utils' has no attribute 'smooth_l1_loss'

Questions related to image and target transformation

Dear Vítor Albiero,
I have one question about the image and target transformation in the training phase:

The model applies a GeneralizedRCNNTransform before the image and target are passed to the network. This transformation resizes the image, bbox and keypoints, but not the pose, as mentioned in #28 [2]. I wonder whether the global pose should be recalculated according to the resized image, bbox and keypoints.

Best regards!

Problems with resuming

I resumed training from a checkpoint, but the validation loss jumped back up very high. I'm wondering whether the following usage is correct?

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py \
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy \
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy \
--workspace ./workspace/ \
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb \
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb \
--prefix trial_6_fast_train_resume \
--batch_size 16 \
--lr_plateau \
--early_stop \
--random_flip \
--random_crop \
--max_size 1400 \
--distributed \
--resume_path workspace/trial_6_fast_train/models/model_val_loss_6.1507_step_2814.pth

Training log: validation loss jumps from 5.9978 back to 15.6339

| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 0): env://
| distributed init (rank 1): env://
Namespace(batch_size=2, contrast_augmentation=False, depth=18, dist_backend='nccl', dist_url='env://', distributed=True, early_stop=True, epochs=100, gpu=0, lr=0.001, lr_plateau=True, max_size=1400, min_size=[640, 672, 704, 736, 768, 800], noise_augmentation=False, optimizer='SGD', pose_mean='./datasets/lmdb/WIDER_train_annotations_pose_mean.npy', pose_stddev='./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy', prefix='trial_1_resume', pretrained_path=None, random_crop=True, random_flip=True, rank=0, resume_path='workspace/trial_1/models/model_val_loss_5.9978_step_3417.pth', threed_5_points='./pose_references/reference_3d_5_points_trans.npy', threed_68_points='./pose_references/reference_3d_68_points_trans.npy', train_source='./datasets/lmdb/WIDER_train_annotations.lmdb', val_source='./datasets/lmdb/WIDER_val_annotations.lmdb', workers=4, workspace='./workspace/', world_size=4)
[Errno 17] File exists: './workspace/trial_1_resume/models'
Training with 12874 images.
Model will use distributed mode!
Resuming training from workspace/trial_1/models/model_val_loss_5.9978_step_3417.pth
{'prefix': 'trial_1_resume', 'work_path': './workspace/trial_1_resume', 'model_path': './workspace/trial_1_resume/models', 'log_path': './workspace/trial_1_resume/log', 'frequency_log': 20, 'train_source': './datasets/lmdb/WIDER_train_annotations.lmdb', 'val_source': './datasets/lmdb/WIDER_val_annotations.lmdb', 'pose_loss': MSELoss(), 'pose_mean': array([-0.02376969,  0.02746343, -0.01443735,  0.06643507,  0.237996  ,
        3.48129357]), 'pose_stddev': array([0.23534443, 0.5395153 , 0.17667074, 0.13199842, 0.13579309,
       0.36629391]), 'depth': 18, 'lr': 0.001, 'lr_plateau': True, 'early_stop': True, 'batch_size': 2, 'workers': 4, 'epochs': 100, 'min_size': [640, 672, 704, 736, 768, 800], 'max_size': 1400, 'device': device(type='cuda'), 'weight_decay': 0.0005, 'momentum': 0.9, 'pin_memory': True, 'pretrained_path': None, 'resume_path': 'workspace/trial_1/models/model_val_loss_5.9978_step_3417.pth', 'noise_augmentation': False, 'contrast_augmentation': False, 'random_flip': True, 'random_crop': True, 'threed_5_points': './pose_references/reference_3d_5_points_trans.npy', 'threed_68_points': './pose_references/reference_3d_68_points_trans.npy', 'distributed': True, 'gpu': 0, 'num_gpus': 4}
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.001
    momentum: 0.9
    nesterov: False
    weight_decay: 0.0005
)
Epoch: [1-100] Batch: [80-6436] Speed: 36.69 samples/sec Loss: 23.00932
Epoch: [1-100] Batch: [160-6436] Speed: 32.39 samples/sec Loss: 23.09449
Epoch: [1-100] Batch: [240-6436] Speed: 32.32 samples/sec Loss: 17.81749
Epoch: [1-100] Batch: [320-6436] Speed: 30.10 samples/sec Loss: 16.34565
Epoch: [1-100] Batch: [400-6436] Speed: 27.72 samples/sec Loss: 22.51138
Epoch: [1-100] Batch: [480-6436] Speed: 30.71 samples/sec Loss: 25.69460
...
Epoch: [1-100] Batch: [6240-6436] Speed: 31.18 samples/sec Loss: 24.62859
Epoch: [1-100] Batch: [6320-6436] Speed: 23.26 samples/sec Loss: 25.31805
Epoch: [1-100] Batch: [6400-6436] Speed: 16.66 samples/sec Loss: 14.03424
Evaluating model...
Current validation loss: 15.633916 at step 1609 - Best validation loss: 15.633916 at step 1609

Are the angle predictions in Radians or degrees?

Is there a preferred size for the images?

Good morning,
Does img2pose automatically resize the facial images to a preferred size? Sorry, I am just being a bit lazy with reading the code.
Regards,
Zaigham
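
For context, the GeneralizedRCNNTransform mentioned in other issues on this page takes a min_size and a max_size. The sketch below mirrors the resize rule that torchvision's detection transform is generally understood to apply: scale the shorter side to min_size unless that would push the longer side past max_size. The function name is hypothetical, the defaults are taken from the training arguments shown in the logs on this page (800 is one of the listed min_size values), and the sizes used by the notebooks at inference time may differ.

```python
def detection_resize_scale(height, width, min_size=800, max_size=1400):
    """Scale factor applied to an image of the given size.

    Sketch of the shorter-side / longer-side rule; exact rounding of the
    resulting pixel dimensions is left out.
    """
    scale = min_size / min(height, width)
    if max(height, width) * scale > max_size:
        # The longer side would overflow max_size, so cap it instead.
        scale = max_size / max(height, width)
    return scale


for h, w in [(480, 640), (500, 2000)]:
    s = detection_resize_scale(h, w)
    print((h, w), "->", (round(h * s), round(w * s)))
```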

Where is the 'Sim3DR_Cython'

I want to test my own images with a pre-trained model, but I get "No module named 'Sim3DR_Cython'" in Sim3DR. How do I get it?

Alignment issue: Assertion error

First of all, excellent work. I love the work that you have done.

Now coming to the matter at hand: I am getting the assertion error below when I try to run the face alignment thing in aflw_2000_3d_evaluation.ipynb.

The total_imgs variable is reduced to 1.
Tried with the visualise variable both on and off.
Called the variable both within and outside the loop.

Outside:

Within:

And my lmk.shape is 68,2

Sim3DR misses the lib dir

It seems that the codebase of Sim3DR lacks the lib dir.

Questions related to the bbox and landmark values

Dear Vítor Albiero,

After the publication of the RetinaFace paper about three years ago, I think this paper is very interesting and meaningful for detecting face and estimating head pose.
In particular, I believe that your work collecting and refining data for the start of new research (e.g 6DoF) will have a good impact on the academic and industrial world.

I have three questions while analyzing your code and reading the paper.

1. Why do you use RetinaFace's predicted values for the face bbox and landmarks rather than RetinaFace's annotated values [1]? Do you have any special reason? For reference, the RetinaFace authors provide additional annotated landmark coordinates; in other words, the bbox values are unchanged from the original WIDERFACE GT bbox values.
2. Looking at the json files provided as annotations [2], there are 5 or 68 landmarks depending on the type of data (i.e. the face in the WIDERFACE dataset). According to sub-section 4.1 of your img2pose paper, the 5-point landmarks of RetinaFace were used to estimate the 6DoF values. This confuses me. How do you get the 68 landmark values, and why does the number of landmarks differ (5 or 68)?
3. This question is related to question 1. From my analysis of the code, the bbox values used to get the 6DoF are not the original WIDERFACE bbox values, as mentioned above. Although img2pose does not use the bbox coordinates for training (i.e. they are not used in the loss), I think using non-GT values for the WIDERFACE dataset affects the performance of face bbox prediction on WIDERFACE. What is your opinion on this? An example comparing the GT bbox values of the original WIDERFACE and of img2pose is below. The order of the coordinates may be different; however, even after converting the coordinates, the values are clearly different.
- image: 0--Parade/0_Parade_marchingband_1_849.jpg
- original GT bbox from WIDERFACE: 449 330 122 149
- img2pose GT bbox from code [3]: 442 310 578 473

Best regards,
vujadeyoon

[1] https://github.com/deepinsight/insightface/tree/master/detection/RetinaFace
[2] https://github.com/vitoralbiero/img2pose/wiki/Annotations
[3] imgs, targets = data in the train.py; data_dict = targets[0]; boxes = data_dict['boxes'].numpy().tolist()
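
The comparison in question 3 can be checked numerically. The sketch below assumes the WIDERFACE annotation is in (x, y, width, height) order and the img2pose box is in (x1, y1, x2, y2) order; the second assumption is not confirmed on this page. The helper names are illustrative.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


wider_box = xywh_to_xyxy((449, 330, 122, 149))
img2pose_box = (442, 310, 578, 473)
print(wider_box, round(iou(wider_box, img2pose_box), 3))
# prints (449, 330, 571, 479) 0.762
```

Under these assumptions the two boxes overlap substantially but are not identical, which is consistent with the observation that the values still differ after conversion.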

Converting scale to translation z

Hello! In the article you use t_z and in code you use scale. What is the difference between these values and how to convert one to another? Thank you.

How to extract camera extrinsics?

I want to use img2pose to extract the camera pose (for the purpose of using it as the input for NeRF). On page 3 of the paper it is stated that this can be obtained from the 6DoF pose h by "standard means," but I'm struggling to figure out how this is done. I'm especially struggling with how to determine the t of the [R¦t] matrix; I did manage to extract the R.

In short: How can one obtain a camera pose from the output 6DoF pose of this model?

Edit: My specific use case is an input video of a single talking head; I would like to get a camera pose determined by the head pose for each frame; i.e. interpret head movement as camera movement instead.
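
One standard way to read a head pose as a camera pose is to invert the rigid transform. The sketch below assumes the predicted pose (R, t) maps head-model coordinates into camera coordinates, x_cam = R x_head + t; that assumption is not confirmed on this page, and the function names are hypothetical. Axis conventions expected by NeRF implementations are not addressed here.

```python
def apply_pose(R, t, p):
    """Apply x' = R p + t to a 3D point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def invert_pose(R, t):
    """Invert x_cam = R x_head + t, giving x_head = R^T x_cam - R^T t."""
    R_inv = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t_inv = [-sum(R_inv[i][j] * t[j] for j in range(3)) for i in range(3)]
    return R_inv, t_inv


# Example: head rotated 90 degrees about z, translated by (1, 2, 3).
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]
R_cam, t_cam = invert_pose(R, t)
print("camera orientation in head frame:", R_cam)
print("camera centre in head frame:", t_cam)
```

With this reading, t_cam is the camera centre expressed in the head's coordinate frame, so a moving head with a fixed camera becomes a fixed head with a moving camera.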

How to get the result of yaw, pitch, roll in paper?

Thanks for your great job!
Could you share the model and scripts to get the result of yaw, pitch, roll in paper?

Some questions about the code

Hello. I have some questions:

Tz is the front/back distance between the face and the camera, I'd like to know how accurate it is in mm?
Can I get the PLY format for the face reconstruction model?

Looking forward to your reply! Thank you!

Redundant output of the `box_predictor` and `class_predictor` of the second phase network

Thanks for your contribution at first, I learned a lot from your work!

However, I found that you continue to config the predictor by setting num_classes=2 as this line shows.

As I think, since the whole network is used for human face only, there is no need to set in this way, which will lead to redundant parameters usage.

Is there any reason to do so? Or this could be refined for further parameters saving?

Thanks for your attention.

How do I enable the multiple scale option?

Hey,
I am trying to reproduce some results that you showed on the BIWI and AFLW2000-3D datasets. The code in the evaluation notebooks runs the single-scale option. How do I enable the multi-scale option?

6 Dofs

Could you please specify where the origin of the 6 Dofs is?

About Assumption of Focal length

As the paper said in appendix A, "Here, we assume f equals the image crop height, h_bb,
plus width, w_bb".

Namely, the focal length f for image crop perspective is assumed as w_bb + h_bb.

Here, could we assume the focal length f to be some other value, such as w_bb + h_bb + 1, or something else?
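
A small worked example may help frame this question. Under a plain pinhole model, the projected size of an object is f * S / t_z, so for a fixed observed size any change in the assumed focal length is absorbed by a proportional change in depth. The sketch below only illustrates that relationship; the function names and the numbers are hypothetical, and it says nothing about which value the trained models expect.

```python
def projected_size(object_size, focal_length, depth):
    """Size in pixels of an object of given size at given depth (pinhole model)."""
    return focal_length * object_size / depth


def depth_for_size(object_size, focal_length, size_px):
    """Depth at which the object would appear with the given pixel size."""
    return focal_length * object_size / size_px


w_bb, h_bb = 100, 120          # hypothetical crop size in pixels
face_width = 0.16              # hypothetical object size, arbitrary units
observed_px = 100.0            # hypothetical observed width in pixels

depth_a = depth_for_size(face_width, w_bb + h_bb, observed_px)
depth_b = depth_for_size(face_width, w_bb + h_bb + 1, observed_px)
print(depth_a, depth_b, depth_b / depth_a)
```

The ratio of the two depths equals the ratio of the two focal lengths, which suggests the choice of f mainly fixes the scale in which t_z is expressed, as long as the same f is used consistently.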

Lack of three pretrained models

FileNotFoundError: [Errno 2] No such file or directory: '../../models/WIDER_train_pose_mean_v1.npy'
FileNotFoundError: [Errno 2] No such file or directory: '../../models/WIDER_train_pose_stddev_v1.npy'
FileNotFoundError: [Errno 2] No such file or directory: '../../models/img2pose_v1.pth'

How to determine the final pose for my own image?

Hi, I have a question regarding the final pose I should use for my own image. Since there are multiple pose_pred that have high scores while they have similar values. In AFLW evaluation, you determine the best index using max_iou. In test_own_images, it seems like you use all high score predicted poses to render (That is what I understand the overlap variable in renderer.render). In my case, I just want to obtain the pose only. Would it be the average of all high score predictions or simply using the highest score result? Or any other solution? Thank you.
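
One option, sketched here under stated assumptions, is to keep only the highest-scoring prediction within each group of overlapping boxes, so that every face ends up with exactly one pose. This is a generic greedy suppression, not the repository's method; the thresholds are arbitrary, and the inputs are plain lists standing in for the boxes, scores and dofs a prediction would contain.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_pose_per_face(boxes, scores, dofs, score_threshold=0.9, iou_threshold=0.5):
    """Keep the highest-scoring (box, score, pose) of each overlapping group."""
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a detection only if it does not overlap one already kept.
        if all(box_iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [(boxes[i], scores[i], dofs[i]) for i in kept]


# Made-up detections: two overlapping boxes on one face, plus a second face.
boxes = [(10, 10, 110, 110), (12, 8, 112, 108), (300, 40, 380, 130)]
scores = [0.97, 0.99, 0.95]
dofs = [[0.10, 0.0, 0.0, 0.2, -7.1, 53.4],
        [0.12, 0.0, 0.0, 0.2, -7.0, 53.0],
        [-0.30, 0.4, 0.0, 1.5, -2.0, 60.0]]
for box, score, pose in best_pose_per_face(boxes, scores, dofs):
    print(box, score, pose)
```

Averaging the poses in a group is another possibility, but averaging rotation vectors directly is only a reasonable approximation when the predictions are close to each other.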

how to optimize and speed up the alignment code?

Hello, thank you for your work.
I read the alignment code and found that there is a PyTorch version of the alignment function. Can it be used for inference? Do you think it is possible to align faces in batches, or to speed up the alignment process in some other way, such as running it on the GPU?
Thank you!

focal length setting

Great work! I have a question on focal length.

In your paper, Pose conversion methods were explained in the end.

But I have no idea on this statement

because the larger focal length means a smaller field of view,

I just wonder why this is a "zoom-out" operation instead of "zoom-in" operation

torch value Error in training

I am running the command:

python3 -m torch.distributed.launch --nproc_per_node=4 --use_env train.py \
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy \
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy \
--workspace ./workspace/ \
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb \
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb \
--prefix trial_1 \
--batch_size 2 \
--lr_plateau \
--early_stop \
--random_flip \
--random_crop \
--max_size 1400 \
--distributed

I am having an issue:
Traceback (most recent call last):
File "train.py", line 403, in
train.run()
File "train.py", line 151, in run
losses = self.img2pose_model.forward(imgs, targets)
File "/data/x/img2pose/img2pose.py", line 127, in forward
losses = self.run_model(imgs, targets)
File "/data/x/img2pose/img2pose.py", line 122, in run_model
outputs = self.fpn_model(imgs, targets)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/x/img2pose/generalized_rcnn.py", line 79, in forward
images, targets = self.transform(images, targets)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 105, in forward
image, target_index = self.resize(image, target_index)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 140, in resize
size = float(self.torch_choice(self.min_size))
ValueError: could not convert string to float: '640, 672, 704, 736, 768, 800'
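
The last frame shows float() being applied to the whole string '640, 672, 704, 736, 768, 800', which suggests that min_size reached the transform as a single string rather than as a sequence of numbers. A minimal sketch of a conversion step follows; parse_sizes is a hypothetical helper, not part of the repository, and where it would be applied in train.py is not visible in the traceback.

```python
def parse_sizes(value):
    """Turn '640, 672, 704' (or an int, or a list) into a tuple of ints."""
    if isinstance(value, str):
        return tuple(int(part) for part in value.split(",") if part.strip())
    if isinstance(value, int):
        return (value,)
    return tuple(int(v) for v in value)


min_size = parse_sizes("640, 672, 704, 736, 768, 800")
print(min_size)
```

Each element of the resulting tuple can be converted with float(), which is what the failing line attempts on its randomly chosen entry.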

Question about the `pool` features

Thanks again for your work.

As I see from the code, the backbone network in your original implementation outputs feature maps in 5 scales, and they are stored by an OrderedDict with the keys = '0', '1', '2', '3', 'pool'.
However, in the second phase, MultiScaleRoIAlign only uses featmap_names=["0", "1", "2", "3"].

So, why drop the pool features, since they are also supervised by the loss function?