vitoralbiero / img2pose
The official PyTorch implementation of img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation - CVPR 2021
License: Other
Hello. Could you help me figure out in what units the translation vector is returned? I am trying to determine how far a person is from the camera using the img2pose model. It seems that X and Y are not measured in the same units as Z, and I can't figure out what these units are. Please help.
Dear Vítor Albiero,
Thanks for your helpful comments in the previous git issues.
It was great help in understanding the paper, img2pose.
I have additional questions.
What is the definition of the proposed method's output (i.e. the img2pose prediction value)?
According to equation (2) in your paper, the 6D vector h_i consists of Euler angles and a 3D face translation vector.
You also let me know that pose_pred contains rotation vectors, not Euler angles. (The two representations can easily be converted into each other.)
I understand your comment, and I also cannot find any code that converts the rotation to Euler angles [2].
Since I hope to fully understand your paper, I would like to ask again: is it correct that the whole proposed method (i.e. the network and post-processing) produces the global pose h_{i}^{img}, which consists of a 3D face translation vector and a 3D rotation vector (not Euler angles), for the given entire image rather than for the image crop B defined in Appendix A?
Questions related to the rendered results.
Please refer to the images in [3]. Note that the image is 27_Spa_Spa_27_32.jpg from the WIDERFACE training dataset. In other words, I think the model may already have used this image in the training phase.
I got the values for boxes, labels and dofs from the lmdb dataset built by following your guide. I think the values were obtained correctly because both functions, random_crop and random_clip, are turned off.
However, the result in [3]-b is a little odd, which confuses me.
If I got the GT values correctly, it is impressive that the model generalizes well from the many other GTs and still produces nice results, even though the GT used for training may be inaccurate, as you can see.
For reference, the rendering results using both the prediction and the GT values for other images, obtained in the same way as above, were fine, unlike [3], although I did not attach pictures.
[1] #27 (comment)
[2] In inference mode (i.e. test_own_images.ipynb), the proposed network applies the transform module, transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std), at the end. However, the transform does not include any rotation-to-Euler conversion.
[3] Rendered result for the image, 27_Spa_Spa_27_32.jpg.
a) The rendered result of the img2pose.
b) The rendered result using the GT values that are obtained in the train.py using below additional codes:
Hello,
In evaluation/jupyter_notebooks/biwi_evaluation.ipynb, the annotations file is opened with dataset_path = "./BIWI_annotations.txt". However, I didn't find that file in the BIWI dataset. I downloaded the dataset from the link you provided in the readme.
Would you please tell me where I can find this file? I would highly appreciate it.
Cheers,
Harry.
@vitoralbiero
Hello, it was nice to try your great work.
However, I have 2 questions:
What is the result of the training script in the README?
Why is batch_size = 2 in your training scripts?
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy \
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy \
--workspace ./workspace/ \
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb \
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb \
--prefix trial_1 \
--batch_size 2 \
--lr_plateau \
--early_stop \
--random_flip \
--random_crop \
--max_size 1400
Thanks for your kind-hearted reply!
First of all, thank you for your work. I am trying to run the "test_own_images.ipynb" evaluation, but I am getting the error "the size of tensor a must match the size of tensor b at non-singleton dimension 0" at the line "res = img2pose_model.predict([transform(img)])[0]". I am trying to use images from the CASIA WebFace and MultiPIE datasets. Please let me know how I should solve this.
Thank you for the open-source code. In section WIDER FACE dataset evaluation in readme, you use validation data instead of test data. Could you also explain this? Am I missing something?
Could you give some more details on how to evaluate your network on pretrained models to get the results mentioned in the paper for WIDER FACE dataset in Testing section of readme?
Thanks a lot for your help :)
hello. I have some questions:
Looking for your reply! Thank you
Hi, can you please help me get the yaw, pitch and roll angles in test_own_images.ipynb?
Thank you so much
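A minimal sketch of one way to approach this, assuming the first three values of each predicted pose form a rotation vector (as another issue on this page reports) and that SciPy is installed. The 'xyz' order read as pitch, yaw, roll is borrowed from a snippet in a later issue here and is an assumption, not something the repository confirms; `rotvec_to_euler_degrees` is a hypothetical helper name.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def rotvec_to_euler_degrees(pose):
    """Convert the rotation part of a 6DoF pose to Euler angles in degrees.

    `pose` is assumed to be [rx, ry, rz, tx, ty, tz], where (rx, ry, rz)
    is a rotation vector: its direction is the rotation axis and its norm
    is the rotation angle in radians.
    """
    rotvec = np.asarray(pose[:3], dtype=float)
    # Assumed convention: extrinsic 'xyz' order read as pitch, yaw, roll.
    pitch, yaw, roll = Rotation.from_rotvec(rotvec).as_euler("xyz", degrees=True)
    return pitch, yaw, roll


# Hypothetical pose: a rotation of 0.1 rad about the x axis only.
pose = [0.1, 0.0, 0.0, 0.0, 0.0, 10.0]
print(rotvec_to_euler_degrees(pose))
```

If the angles look swapped or sign-flipped on real images, the axis order or sign convention assumed above is the first thing to revisit.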
Hey,
So I downloaded the eval_tools.zip file. However, I don't really understand what to do with it to get the mAP and plot the curves. I unzipped it and it contains a lot of MATLAB files. Could you kindly be a bit clearer as to which files to run?
Regards,
Zaigham
Hello, I want to ask a question: can this model be applied to face recognition? Can you give me some ideas?
Direct face detection through 6DoF is a great approach. Can this task be completed directly with a one-stage method?
Thanks~
The last question has been solved. Thank you very much for your answer. If I want to extract head pose features for downstream tasks, which layer's output in the network should I extract? The output should not take into account the position of the face, only the direction of the head. Would module.roi_heads.class_head.fc7 work?
Can you explain why you set the focal length equal to w+h (for the global camera)? I checked some articles, and their conclusion is also that a higher resolution leads to a higher focal length. I know it is an approximation, but I still cannot fully understand the reasoning behind it. Thank you very much~
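To make the approximation concrete, here is a small sketch of the pinhole intrinsics it implies. The focal length f = w + h comes from the question above; placing the principal point at the image center is an assumption, and `global_intrinsics` is a hypothetical helper name rather than a function from the repository.

```python
import numpy as np


def global_intrinsics(width, height):
    """Build a 3x3 pinhole intrinsics matrix for a full image.

    The focal length follows the w + h approximation; the principal
    point at the image center is an assumption.
    """
    focal = float(width + height)
    return np.array(
        [
            [focal, 0.0, width / 2.0],
            [0.0, focal, height / 2.0],
            [0.0, 0.0, 1.0],
        ]
    )


K = global_intrinsics(640, 480)
print(K)
```

Because f grows with the image size, the implied field of view depends only on the aspect ratio, not on the resolution: the horizontal field of view is 2 * atan(w / (2 * (w + h))), which is about 32 degrees for any 4:3 image.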
Hi,
first of all congrats for achieving great results on head pose estimation. I find it quite impressive what's possible with such an implicit approach (going from automatized WIDER pose annotations, where surely there is some error on the ground-truth data itself). On BIWI your results seem to set new SOTA accuracy.
Yet, for AFLW2000 it seems you are comparing your predicted pose rotation with ground truth (GT) that you generated yourself by fitting your own n-point model to the landmarks via:
_, pose_target = get_pose(threed_points, target_points, image_intrinsics)
which basically uses solvePnP:
retval, rvecs, tvecs = cv2.solvePnP(threed_landmarks, twod_landmarks, camera_intrinsics, None, flags=cv2.SOLVEPNP_EPNP)
Now the problem is that AFLW2000 provides ground-truth head pose information in Euler angles, which all the other authors have used. In your paper (Table 1), you compare against them using results based on your own pose GT:
Please correct me if I am wrong with those assumptions. Assuming I understood your evaluation correctly, everyone else compares to the data provided with AFLW2000, while you base your result metrics on your own data. Hence, the results are not comparable unless 1) you also use the provided ground truth or 2) you rerun the experiments of the other authors with your data, which at the minimum would raise the question of why your ground truth is more correct.
PS: I reran your model (single scale) using approach 1) but only achieved an MAE accuracy above 7 degrees...
Best regards,
Felix
I am wondering how I can interpret the t vector of the 6DoF pose.
In which coordinate frame/relative to what is the translation specified? Is the reference frame consistent across multiple images?
Hello!
Thanks for your great project!
I would like to know how to train on my own dataset.
I noticed that you use the five landmarks to generate 6DoF pose labels by standard means. Can you share the code for these methods?
Thanks for your beautiful work!
I am waiting for your reply!
Hi,
Thanks for your code. I was checking your notebook biwi_evaluation.ipynb and found that the ground truth and predictions are both in rotation vector representation (3d vector representing rotation axis and norm representing rotation angle). However, when you print out the errors, you call it error on Yaw, Pitch and Roll. But it should be errors on rotation vector?
Evidence of ground truth being rotation vector:
img_path, pitch, yaw, roll = sample
pitch = float(pitch)
yaw = float(yaw)
roll = float(roll)
annotations = open(img_path.replace("_rgb.png", "_pose.txt"))
lines = annotations.readlines()
pose_target = []
for i in range(3):
    lines[i] = str(lines[i].rstrip("\n"))
    pose_target.append(lines[i].split(" ")[:3])
pose_target = np.asarray(pose_target)
pose_target = Rotation.from_matrix(pose_target).as_rotvec()
pose_targets.append(pose_target)
Here it's clear that the pose_target is first a rotation matrix and then converted to rotation vector by as_rotvec().
To convert to Pitch, Yaw, Roll it should be
pose_target = Rotation.from_matrix(pose_target).as_euler('xyz', degrees=False) # in order pitch, yaw, roll
Thanks!
Hi,
I am trying to render (x,y,z) axis based on network output instead of using provided renderer.
Currently using this code:
import cv2
import numpy as np

def draw_axis(img, euler_angle, center, size=80, thickness=3,
              angle_const=np.pi / 180, copy=False):
    if copy:
        img = img.copy()
    # Convert degrees to radians without modifying the caller's array.
    euler_angle = np.asarray(euler_angle, dtype=float) * angle_const
    sin_pitch, sin_yaw, sin_roll = np.sin(euler_angle)
    cos_pitch, cos_yaw, cos_roll = np.cos(euler_angle)
    # 2D image-plane directions of the rotated x, y and z axes.
    axis = np.array([
        [cos_yaw * cos_roll,
         cos_pitch * sin_roll + cos_roll * sin_pitch * sin_yaw],
        [-cos_yaw * sin_roll,
         cos_pitch * cos_roll - sin_pitch * sin_yaw * sin_roll],
        [sin_yaw,
         -cos_yaw * sin_pitch]
    ])
    axis *= size
    axis += center
    # np.int was removed in recent NumPy releases; use the built-in int.
    axis = axis.astype(int)
    print('axis', axis)
    tp_center = tuple(np.asarray(center).astype(int).tolist())
    cv2.line(img, tp_center, tuple(axis[0].tolist()), (0, 0, 255), thickness)
    cv2.line(img, tp_center, tuple(axis[1].tolist()), (0, 255, 0), thickness)
    cv2.line(img, tp_center, tuple(axis[2].tolist()), (255, 0, 0), thickness)
    return img
According to the readme, the poses variable contains 6 values per detected face:
[
0.013399183198461687,
0.0015700862562677677,
-0.0008193041494016704,
0.1667461395263672,
-7.139801979064941,
53.44799041748047
]
Is it true that the first one is pitch, then yaw, then roll, and that the other 3 are horizontal translation, vertical translation and scale?
Based on the provided rendering code, I get a pretty static axis that always points in the same direction, but the face mask shows the orientation.
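One possible reason for a static axis is that the first three values are being read as Euler angles in degrees, while another issue on this page reports that they form a rotation vector in radians. The sketch below projects the pose axes under that reading. It assumes pose = [rx, ry, rz, tx, ty, tz], a focal length of width + height (as discussed in another issue here) and a principal point at the image center; `rotvec_to_matrix` and `project_axes` are hypothetical helpers, not the repository's renderer.

```python
import numpy as np


def rotvec_to_matrix(rotvec):
    """Rodrigues' formula: rotation vector -> 3x3 rotation matrix."""
    rotvec = np.asarray(rotvec, dtype=float)
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)
    kx, ky, kz = rotvec / angle
    K = np.array([[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def project_axes(pose, width, height, axis_length=1.0):
    """Return 4 pixel points: the pose origin and the tips of its x, y, z axes.

    Assumes pose = [rx, ry, rz, tx, ty, tz], focal length = width + height,
    and a principal point at the image center. `axis_length` is expressed
    in the same (unspecified) units as the translation.
    """
    rotation = rotvec_to_matrix(pose[:3])
    translation = np.asarray(pose[3:6], dtype=float)
    focal = float(width + height)
    center = np.array([width / 2.0, height / 2.0])

    # Origin plus one endpoint per axis, in the face's local frame.
    points = np.vstack([np.zeros(3), axis_length * np.eye(3)])
    # Move the points into the camera frame: X_cam = R @ X_local + t.
    camera_points = points @ rotation.T + translation
    # Pinhole projection to pixel coordinates.
    return focal * camera_points[:, :2] / camera_points[:, 2:3] + center


pixels = project_axes([0.0, 0.0, 0.0, 0.0, 0.0, 10.0], width=100, height=100)
print(pixels)
```

The four returned points can be rounded to integers and passed to cv2.line in place of the hand-built axis array.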
Hello, I'm not quite familiar with eval tools.
How can I get the mAP and plot the curves with Python scripts, instead of eval tools, after generating results/WIDER_FACE/Val?
When running the training script as described in the readme, the generated model size is 324M, while your released model in the model zoo is only ~150M.
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy \
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy \
--workspace ./workspace/ \
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb \
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb \
--prefix trial_2 \
--batch_size 2 \
--lr_plateau \
--early_stop \
--random_flip \
--random_crop \
--max_size 1400
Getting error:
File "C:\Code\Python\img2pose-main\rpn.py", line 467, in compute_loss
det_utils.smooth_l1_loss(
AttributeError: module 'torchvision.models.detection._utils' has no attribute 'smooth_l1_loss'
Dear Vítor Albiero,
I have one question about the image and target transformation in the training phase:
The model applies a GeneralizedRCNNTransform before the image and target are passed to the network. This transformation resizes the image, bboxes and keypoints, but not the pose, as mentioned in #28 [2]. I wonder if the global pose should be recalculated according to the resized image, bboxes and keypoints.
best regards!
I resumed training from a checkpoint; however, the validation loss jumped back up very high. I'm wondering if the following usage is proper?
python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py \
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy \
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy \
--workspace ./workspace/ \
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb \
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb \
--prefix trial_6_fast_train_resume \
--batch_size 16 \
--lr_plateau \
--early_stop \
--random_flip \
--random_crop \
--max_size 1400 \
--distributed \
--resume_path workspace/trial_6_fast_train/models/model_val_loss_6.1507_step_2814.pth
Training log: the validation loss jumped from 5.9978 back to 15.6339
| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 0): env://
| distributed init (rank 1): env://
Namespace(batch_size=2, contrast_augmentation=False, depth=18, dist_backend='nccl', dist_url='env://', distributed=True, early_stop=True, epochs=100, gpu=0, lr=0.001, lr_plateau=True, max_size=1400, min_size=[640, 672, 704, 736, 768, 800], noise_augmentation=False, optimizer='SGD', pose_mean='./datasets/lmdb/WIDER_train_annotations_pose_mean.npy', pose_stddev='./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy', prefix='trial_1_resume', pretrained_path=None, random_crop=True, random_flip=True, rank=0, resume_path='workspace/trial_1/models/model_val_loss_5.9978_step_3417.pth', threed_5_points='./pose_references/reference_3d_5_points_trans.npy', threed_68_points='./pose_references/reference_3d_68_points_trans.npy', train_source='./datasets/lmdb/WIDER_train_annotations.lmdb', val_source='./datasets/lmdb/WIDER_val_annotations.lmdb', workers=4, workspace='./workspace/', world_size=4)
[Errno 17] File exists: './workspace/trial_1_resume/models'
Training with 12874 images.
Model will use distributed mode!
Resuming training from workspace/trial_1/models/model_val_loss_5.9978_step_3417.pth
{'prefix': 'trial_1_resume', 'work_path': './workspace/trial_1_resume', 'model_path': './workspace/trial_1_resume/models', 'log_path': './workspace/trial_1_resume/log', 'frequency_log': 20, 'train_source': './datasets/lmdb/WIDER_train_annotations.lmdb', 'val_source': './datasets/lmdb/WIDER_val_annotations.lmdb', 'pose_loss': MSELoss(), 'pose_mean': array([-0.02376969, 0.02746343, -0.01443735, 0.06643507, 0.237996 ,
3.48129357]), 'pose_stddev': array([0.23534443, 0.5395153 , 0.17667074, 0.13199842, 0.13579309,
0.36629391]), 'depth': 18, 'lr': 0.001, 'lr_plateau': True, 'early_stop': True, 'batch_size': 2, 'workers': 4, 'epochs': 100, 'min_size': [640, 672, 704, 736, 768, 800], 'max_size': 1400, 'device': device(type='cuda'), 'weight_decay': 0.0005, 'momentum': 0.9, 'pin_memory': True, 'pretrained_path': None, 'resume_path': 'workspace/trial_1/models/model_val_loss_5.9978_step_3417.pth', 'noise_augmentation': False, 'contrast_augmentation': False, 'random_flip': True, 'random_crop': True, 'threed_5_points': './pose_references/reference_3d_5_points_trans.npy', 'threed_68_points': './pose_references/reference_3d_68_points_trans.npy', 'distributed': True, 'gpu': 0, 'num_gpus': 4}
SGD (
Parameter Group 0
dampening: 0
lr: 0.001
momentum: 0.9
nesterov: False
weight_decay: 0.0005
)
Epoch: [1-100] Batch: [80-6436] Speed: 36.69 samples/sec Loss: 23.00932
Epoch: [1-100] Batch: [160-6436] Speed: 32.39 samples/sec Loss: 23.09449
Epoch: [1-100] Batch: [240-6436] Speed: 32.32 samples/sec Loss: 17.81749
Epoch: [1-100] Batch: [320-6436] Speed: 30.10 samples/sec Loss: 16.34565
Epoch: [1-100] Batch: [400-6436] Speed: 27.72 samples/sec Loss: 22.51138
Epoch: [1-100] Batch: [480-6436] Speed: 30.71 samples/sec Loss: 25.69460
...
Epoch: [1-100] Batch: [6240-6436] Speed: 31.18 samples/sec Loss: 24.62859
Epoch: [1-100] Batch: [6320-6436] Speed: 23.26 samples/sec Loss: 25.31805
Epoch: [1-100] Batch: [6400-6436] Speed: 16.66 samples/sec Loss: 14.03424
Evaluating model...
Current validation loss: 15.633916 at step 1609 - Best validation loss: 15.633916 at step 1609
Good morning,
Does img2pose automatically resize the facial images to a preferred size? Sorry, I am just being a bit lazy with reading the code.
Regards,
Zaigham
I want to test my own images with a pre-trained model, but I get "No module named 'Sim3DR_Cython'" in Sim3DR. How can I get it?
First of all, excellent work. I love the work that you have done.
Now coming to the matter at hand: I am getting the assertion error below when I try to run the face alignment thing in aflw_2000_3d_evaluation.ipynb.
Outside:
Within:
And my lmk.shape is 68,2
It seems that the codebase of Sim3DR lacks the lib dir.
Dear Vítor Albiero,
Since the publication of the RetinaFace paper about three years ago, I think this is a very interesting and meaningful paper for detecting faces and estimating head pose.
In particular, I believe that your work on collecting and refining data to start new research (e.g. 6DoF) will have a good impact on the academic and industrial world.
I have three questions from analyzing your code and reading the paper.
1. Why do you use RetinaFace's predicted values for the face bbox and landmarks, not RetinaFace's annotated values [1]? Do you have any special reasons? For reference, the RetinaFace authors provide additional annotated landmark coordinate values. In other words, their bbox values are unchanged from the original WIDERFACE GT bbox values.
2. When looking at the json files provided as annotations [2], there are 5 or 68 landmarks depending on the type of data (i.e. the face in the WIDERFACE dataset). According to sub-section 4.1 of your paper, the 5-point landmarks of RetinaFace were used to estimate the 6DoF values. This confuses me. How do you get the 68 landmark values, and why does the number of landmarks differ (e.g. 5 or 68)?
3. This question is related to question 1 above. From my analysis of the code, the bbox values used to get the 6DoF are not the original WIDERFACE bbox values, as I mentioned above. Although img2pose does not use the bbox coordinates for training (i.e. they are not used in the loss), I think the use of non-GT values for the WIDERFACE dataset affects the performance of face bbox prediction on WIDERFACE. What is your opinion on this? An example comparing the GT bbox values of the original WIDERFACE and of img2pose is below. The order of the coordinates may differ. However, even after converting the coordinates, the values are clearly different.
Best regards,
vujadeyoon
[1] https://github.com/deepinsight/insightface/tree/master/detection/RetinaFace
[2] https://github.com/vitoralbiero/img2pose/wiki/Annotations
[3] imgs, targets = data in the train.py; data_dict = targets[0]; boxes = data_dict['boxes'].numpy().tolist()
Hello! In the article you use t_z and in the code you use scale. What is the difference between these values, and how can one be converted to the other? Thank you.
I want to use img2pose to extract the camera pose (for the purpose of using it as the input for NeRF). On page 3 of the paper it is stated that this can be obtained from the 6DoF pose h by "standard means," but I'm struggling to figure out how this is done. I'm especially struggling with how to determine the t of the [R¦t] matrix; I did manage to extract the R.
In short: How can one obtain a camera pose from the output 6DoF pose of this model?
Edit: My specific use case is an input video of a single talking head; I would like to get a camera pose determined by the head pose for each frame; i.e. interpret head movement as camera movement instead.
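A hedged sketch of one standard way to do the inversion, assuming the predicted pose maps head-model coordinates into camera coordinates as X_cam = R @ X_head + t, with the first three values a rotation vector and the last three the translation. Under that assumption the camera orientation in the head frame is R^T and the camera center is -R^T @ t. The helper name `camera_pose_from_head_pose` is hypothetical, and the paper's "standard means" may differ from this.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def camera_pose_from_head_pose(pose):
    """Invert a head-to-camera transform to get the camera in the head frame.

    Assumes pose = [rx, ry, rz, tx, ty, tz] describes X_cam = R @ X_head + t.
    Returns (R_cam, c_cam), where c_cam is the camera center expressed in
    head coordinates.
    """
    rotation = Rotation.from_rotvec(np.asarray(pose[:3], dtype=float)).as_matrix()
    translation = np.asarray(pose[3:6], dtype=float)
    rotation_inv = rotation.T  # the inverse of a rotation matrix is its transpose
    camera_center = -rotation_inv @ translation
    return rotation_inv, camera_center


# Hypothetical pose: 90 degree rotation about z, translation (1, 2, 3).
R_cam, c_cam = camera_pose_from_head_pose([0.0, 0.0, np.pi / 2, 1.0, 2.0, 3.0])
print(R_cam)
print(c_cam)
```

NeRF codebases differ in their camera axis conventions, so an extra fixed axis flip may be needed before the result can be used as a camera-to-world matrix.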
Hello. I have some questions:
Looking for your reply! Thank you!
First of all, thanks for your contribution; I learned a lot from your work!
However, I found that you still configure the predictor by setting num_classes=2, as this line shows.
I think that, since the whole network is used for human faces only, there is no need to set it this way, and doing so leads to redundant parameters.
Is there any reason for this? Or could it be refined to save parameters?
Thanks for your attention.
Could you please specify where the origin of the 6DoF is?
As the paper says in Appendix A, "Here, we assume f equals the image crop height, h_bb, plus width, w_bb".
Namely, the focal length f for the image crop perspective is assumed to be w_bb + h_bb.
Could we instead set the focal length f to another value, such as w_bb + h_bb + 1, or something else?
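Under a plain pinhole model, the focal length and the depth trade off against each other: scaling f and the depth by the same factor leaves the projected pixel unchanged. The sketch below illustrates only that property. It is not a statement about how the authors chose f, and `project_x` and the factor `k` are hypothetical.

```python
def project_x(focal, x, z, center_x=0.0):
    """Pinhole projection of a point's x coordinate at depth z."""
    return focal * x / z + center_x


f, x, z = 200.0, 0.5, 10.0
k = 1.5  # hypothetical factor applied to the focal length

# Scaling the focal length and the depth together keeps the pixel fixed.
print(project_x(f, x, z), project_x(k * f, x, k * z))
```

Read this way, a different choice of f would mainly rescale the depth component of the pose rather than change the 2D projection, provided the same f is used consistently when the pose labels are generated and when they are projected.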
FileNotFoundError: [Errno 2] No such file or directory: '../../models/WIDER_train_pose_mean_v1.npy'
FileNotFoundError: [Errno 2] No such file or directory: '../../models/WIDER_train_pose_stddev_v1.npy'
FileNotFoundError: [Errno 2] No such file or directory: '../../models/img2pose_v1.pth'
Hi, I have a question regarding the final pose I should use for my own image, since there are multiple pose_pred entries that have high scores and similar values. In the AFLW evaluation, you determine the best index using max_iou. In test_own_images, it seems like you use all high-score predicted poses to render (that is how I understand the overlap variable in renderer.render). In my case, I just want to obtain the pose only. Should I average all high-score predictions or simply use the highest-score result? Or is there any other solution? Thank you.
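A minimal sketch of the "highest score" option for an image with a single face, assuming the scores and poses have already been pulled out of the prediction as arrays of shape (N,) and (N, 6). The function name `best_pose` and the threshold value are hypothetical, not taken from the repository.

```python
import numpy as np


def best_pose(scores, dofs, threshold=0.9):
    """Return the pose with the highest score, or None if none passes.

    `scores` has shape (N,) and `dofs` has shape (N, 6); `threshold` is a
    hypothetical cut-off, not a value taken from the repository.
    """
    scores = np.asarray(scores, dtype=float)
    dofs = np.asarray(dofs, dtype=float)
    if scores.size == 0 or scores.max() < threshold:
        return None
    return dofs[int(np.argmax(scores))]


scores = [0.95, 0.99, 0.40]
dofs = [
    [0.10, 0.0, 0.0, 0.0, 0.0, 10.0],
    [0.12, 0.0, 0.0, 0.0, 0.0, 10.0],
    [0.90, 0.0, 0.0, 0.0, 0.0, 30.0],
]
print(best_pose(scores, dofs))
```

Taking the top-scoring pose avoids averaging rotation vectors component-wise, which is only an approximation of a rotation average and degrades as the rotations move apart. Images with several faces would first need the predictions grouped per face, for example by box overlap.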
Hello, thank you for your work.
I read the alignment code and found that there is a PyTorch version of the alignment function. Can it be used for inference? Do you think it is possible to align faces in batches, or to speed up the alignment process in some other way, for example by running it on the GPU?
Thank you!
I am running the command:
python3 -m torch.distributed.launch --nproc_per_node=4 --use_env train.py \
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy \
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy \
--workspace ./workspace/ \
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb \
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb \
--prefix trial_1 \
--batch_size 2 \
--lr_plateau \
--early_stop \
--random_flip \
--random_crop \
--max_size 1400 \
--distributed
I am having an issue:
Traceback (most recent call last):
File "train.py", line 403, in
train.run()
File "train.py", line 151, in run
losses = self.img2pose_model.forward(imgs, targets)
File "/data/x/img2pose/img2pose.py", line 127, in forward
losses = self.run_model(imgs, targets)
File "/data/x/img2pose/img2pose.py", line 122, in run_model
outputs = self.fpn_model(imgs, targets)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/x/img2pose/generalized_rcnn.py", line 79, in forward
images, targets = self.transform(images, targets)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 105, in forward
image, target_index = self.resize(image, target_index)
File "/home/x/.conda/envs/py38/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 140, in resize
size = float(self.torch_choice(self.min_size))
ValueError: could not convert string to float: '640, 672, 704, 736, 768, 800'
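The traceback suggests that min_size reached the torchvision transform as a single comma-separated string rather than as a sequence of numbers. Below is a small sketch of normalizing such a value before it is handed to the model. `parse_min_size` is a hypothetical helper, and where it would need to be called in train.py is not shown in this issue.

```python
def parse_min_size(value):
    """Turn '640, 672, 704' (or an int, or a sequence) into a tuple of ints."""
    if isinstance(value, str):
        return tuple(int(part) for part in value.split(",") if part.strip())
    if isinstance(value, int):
        return (value,)
    return tuple(int(v) for v in value)


print(parse_min_size("640, 672, 704, 736, 768, 800"))
```

With a tuple of ints, a call like float(self.torch_choice(self.min_size)) receives a single number instead of the whole string.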
Thanks again for your work.
As I see from the code, the backbone network in your original implementation outputs feature maps at 5 scales, and they are stored in an OrderedDict with the keys '0', '1', '2', '3', 'pool'.
However, in the second phase, the MultiScaleRoIAlign only uses featmap_names=["0", "1", "2", "3"].
So, why drop the pool features, since they are also supervised by the loss function?