junkybyte / easy_vitpose

Easy and fast 2D human and animal multi-pose estimation using SOTA ViTPose [Y. Xu et al., 2022]. Real-time performance and multiple skeletons supported.

License: Apache License 2.0

Jupyter Notebook 83.04% Python 8.44% Cython 0.10% Cuda 8.41% C++ 0.01%
human-pose pose-estimation computer-vision python transformers vitpose openpose-skeleton onnx tensorrt torch

easy_vitpose's Introduction

easy_ViTPose

Accurate 2d human and animal pose estimation

Open In Colab

Easy to use SOTA ViTPose [Y. Xu et al., 2022] models for fast inference.

We provide all the original ViTPose models, converted for inference, with a single dataset output format.

In addition, we provide a COCO-25 model trained on the original COCO dataset plus the feet keypoints from https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/. Finetuning is not currently supported; you can check de43d54cad87404cf0ad4a7b5da6bacf4240248b and previous commits for a working state of train.py.

Warning

Ultralytics YOLOv8 has an issue with wrong bounding boxes when using MPS; upgrade to the latest version (it works correctly on 8.2.48).
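
If you installed the detector through the ultralytics package from PyPI, upgrading is a single command (a minimal sketch, assuming a pip-based install):

pip install -U ultralytics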

Results

resimg

people_out.mp4
zebra_out.mp4

(Credits dance: https://www.youtube.com/watch?v=p-rSdt0aFuw )
(Credits zebras: https://www.youtube.com/watch?v=y-vELRYS8Yk )

Features

  • Image / Video / Webcam support
  • Video support using the SORT algorithm to track bboxes between frames
  • Torch / ONNX / TensorRT inference
  • Runs the original ViTPose checkpoints from ViTAE-Transformer/ViTPose
  • 4 ViTPose architectures with different sizes and performances (s: small, b: base, l: large, h: huge)
  • Multi skeleton and dataset support (AIC / MPII / COCO / COCO + FEET / COCO WHOLEBODY / APT36k / AP10k)
  • Human / Animal pose estimation
  • CPU / GPU / Metal support
  • Show and save images / videos, and output to json

We run YOLOv8 for detection, but it does not provide complete animal detection. You can finetune a custom YOLO model to detect the animal you are interested in; if you do, please open an issue, as we might want to integrate other models for detection.
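
As a rough illustration of what such a finetune could look like with the ultralytics Python API (the dataset YAML and training settings below are placeholders, not part of this repo):

from ultralytics import YOLO

# Hypothetical finetune of a YOLOv8 detector on a custom animal dataset.
# 'my_animals.yaml' is a placeholder dataset config you would create yourself.
model = YOLO('yolov8s.pt')  # start from a pretrained checkpoint
model.train(data='my_animals.yaml', epochs=100, imgsz=640)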

Benchmark:

You can expect real-time performance (>30 fps) on modern NVIDIA GPUs and on Apple Silicon (using Metal).

Skeleton reference

There are multiple skeletons for the different datasets. Check the definitions in visualization.py.

Installation and Usage

Important

Install torch>2.0 with CUDA / MPS support yourself; also check requirements_gpu.txt.

git clone git@github.com:JunkyByte/easy_ViTPose.git
cd easy_ViTPose/
pip install -e .
pip install -r requirements.txt

Download models

  • Download the models from Hugging Face. We provide torch models for every dataset and architecture.
    If you want to run ONNX / TensorRT inference, download the appropriate torch ckpt and use export.py to convert it.
    You can use the ultralytics yolo export command to export YOLO to ONNX and TensorRT as well (see the sketch below).
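
For example, the detector export can also be done from Python (a minimal sketch, assuming the standard ultralytics package; the checkpoint name is illustrative):

from ultralytics import YOLO

# Export the YOLO detector; format='engine' (TensorRT) requires a local TensorRT installation.
YOLO('yolov8s.pt').export(format='onnx')
# YOLO('yolov8s.pt').export(format='engine')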

Export to onnx and tensorrt

$ python export.py --help
usage: export.py [-h] --model-ckpt MODEL_CKPT --model-name {s,b,l,h} [--output OUTPUT] [--dataset DATASET]

optional arguments:
  -h, --help            show this help message and exit
  --model-ckpt MODEL_CKPT
                        The torch model that shall be used for conversion
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --output OUTPUT       File (without extension) or dir path for checkpoint output
  --dataset DATASET     Name of the dataset. If None it's extracted from the file name. ["coco", "coco_25",
                        "wholebody", "mpii", "ap10k", "apt36k", "aic"]

Run inference

To run inference from command line you can use the inference.py script as follows:

$ python inference.py --help
usage: inference.py [-h] [--input INPUT] [--output-path OUTPUT_PATH] --model MODEL [--yolo YOLO] [--dataset DATASET]
                    [--det-class DET_CLASS] [--model-name {s,b,l,h}] [--yolo-size YOLO_SIZE]
                    [--conf-threshold CONF_THRESHOLD] [--rotate {0,90,180,270}] [--yolo-step YOLO_STEP]
                    [--single-pose] [--show] [--show-yolo] [--show-raw-yolo] [--save-img] [--save-json]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         path to image / video or webcam ID (=cv2)
  --output-path OUTPUT_PATH
                        output path, if the path provided is a directory output files are "input_name
                        +_result{extension}".
  --model MODEL         checkpoint path of the model
  --yolo YOLO           checkpoint path of the yolo model
  --dataset DATASET     Name of the dataset. If None it's extracted from the file name. ["coco", "coco_25",
                        "wholebody", "mpii", "ap10k", "apt36k", "aic"]
  --det-class DET_CLASS
                        ["human", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
                        "animals"]
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --yolo-size YOLO_SIZE
                        YOLOv8 image size during inference
  --conf-threshold CONF_THRESHOLD
                        Minimum confidence for keypoints to be drawn. [0, 1] range
  --rotate {0,90,180,270}
                        Rotate the image by [90, 180, 270] degrees counterclockwise
  --yolo-step YOLO_STEP
                        The tracker can be used to predict the bboxes instead of yolo for performance; this flag
                        specifies how often yolo is applied (e.g. 1 applies yolo every frame). This does not have any
                        effect when is_video is False
  --single-pose         Do not use SORT tracker because single pose is expected in the video
  --show                preview result during inference
  --show-yolo           draw yolo results
  --show-raw-yolo       draw yolo results before SORT is applied for tracking (only valid during video inference)
  --save-img            save image results
  --save-json           save json results
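
For instance, running on a video and saving both the rendered frames and the json output could look like this (paths and checkpoint names are placeholders):

python inference.py --input ./examples/video.mp4 --model ./ckpts/vitpose-s-coco_25.pth --yolo ./yolov8s.pt --model-name s --save-img --save-json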

You can run inference from code as follows:

import cv2
from easy_ViTPose import VitInference

# Image to run inference on, in RGB format
img = cv2.imread('./examples/img1.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Set is_video=True to enable tracking in video inference
# Be sure to use the VitInference.reset() function to reset the tracker after each video
# There are a few flags that allow you to customize VitInference; be sure to check the class definition
model_path = './ckpts/vitpose-s-coco_25.pth'
yolo_path = './yolov8s.pth'

# If you want to use MPS (on new MacBooks) use the torch checkpoints for both ViTPose and Yolo
# If device is None it will try cuda -> mps -> cpu (otherwise specify 'cpu', 'mps' or 'cuda')
# The dataset and det_class parameters can be inferred from the ckpt name, but you can specify them.
model = VitInference(model_path, yolo_path, model_name='s', yolo_size=320, is_video=False, device=None)

# Infer keypoints, output is a dict where keys are person ids and values are keypoints (np.ndarray (25, 3): (y, x, score))
# If is_video=True the IDs will be consistent among the ordered video frames.
keypoints = model.inference(img)

# call model.reset() after each video

img = model.draw(show_yolo=True)  # Returns RGB image with drawings
cv2.imshow('image', cv2.cvtColor(img, cv2.COLOR_RGB2BGR)); cv2.waitKey(0)
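
For video, the same API is applied frame by frame; below is a minimal sketch based on the notes above (is_video=True enables tracking, model.reset() clears the tracker afterwards; the file paths are placeholders):

import cv2
from easy_ViTPose import VitInference

model = VitInference('./ckpts/vitpose-s-coco_25.pth', './yolov8s.pt',
                     model_name='s', yolo_size=320, is_video=True, device=None)

cap = cv2.VideoCapture('./examples/video.mp4')
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # the model expects RGB input
    keypoints = model.inference(frame_rgb)              # {track_id: np.ndarray of (y, x, score)}
    out = model.draw(show_yolo=True)                    # RGB frame with the skeletons drawn
    cv2.imshow('pose', cv2.cvtColor(out, cv2.COLOR_RGB2BGR))
    if cv2.waitKey(1) == ord('q'):
        break

model.reset()  # reset the SORT tracker before processing another video
cap.release()
cv2.destroyAllWindows()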

Note

If the input file is a video, SORT is used to track people IDs and output consistent identifications.

OUTPUT json format

The output format of the json files:

{
    "keypoints":
    [  # The list of frames, len(json['keypoints']) == len(video)
        {  # For each frame a dict
            "0": [  #  keys are id to track people and value the keypoints
                [121.19, 458.15, 0.99], # Each keypoint is (y, x, score)
                [110.02, 469.43, 0.98],
                [110.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ],
        },
        {
            "0": [
                [122.19, 458.15, 0.91],
                [105.02, 469.43, 0.95],
                [122.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ]
        }
    ],
    "skeleton":
    {  # Skeleton reference: key is the keypoint index, value is the keypoint name
        "0": "nose",
        "1": "left_eye",
        "2": "right_eye",
        "3": "left_ear",
        "4": "right_ear",
        "5": "neck",
        ...
    }
}
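
A small sketch of reading such a file back, following the structure above (the file name is a placeholder for a --save-json output):

import json

with open('video_result.json') as f:
    data = json.load(f)

skeleton = data['skeleton']                       # {"0": "nose", "1": "left_eye", ...}
for frame in data['keypoints']:                   # one dict per video frame
    for person_id, kpts in frame.items():         # person track id -> list of keypoints
        for idx, (y, x, score) in enumerate(kpts):
            name = skeleton[str(idx)]
            # e.g. filter low-confidence keypoints here using `score`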

Finetuning

Finetuning is possible but not officially supported right now. If you would like to finetune and need help, open an issue.
You can check train.py, datasets/COCO.py and config.yaml for details.


TODO:

  • refactor finetuning
  • benchmark and check bottlenecks of inference pipeline
  • parallel batched inference
  • other minor fixes
  • yolo version for animal pose, check #18
  • solve cuda exceptions on script exit when using tensorrt (no idea how)
  • add info about inferred settings during inference, better output of inference status (device etc.)
  • check if it is possible to make colab work without a runtime restart

Feel free to open issues, pull requests and contribute on these TODOs.

Reference

Thanks to the ViTPose authors and their official implementation ViTAE-Transformer/ViTPose.
The SORT code is taken from abewley/sort.

easy_vitpose's People

Contributors

dependabot[bot], jaehyunnn, junkybyte, kovlo


easy_vitpose's Issues

batch inference

Great work, it helps me a lot!
Can you share some sample code for batch inference using the TRT model?

failed fp16 inference.

Hi, I have modified your export function for fp16 inference and ran into the following issue.
Here is the modified export code:

    pmodel.load_state_dict(ckpt)
    pmodel.eval()
    pmodel.cuda()
    pmodel.half()

    C, H, W = (3, 256, 192)

    # model_wrapper = PoseModelWrapper(backbone=pose_model.backbone, head=pose_model.keypoint_head)

    trt_ts_module = torch_tensorrt.compile(pmodel,
                                           # If the inputs to the module are plain Tensors, specify them via the `inputs` argument:
                                           inputs=[
                                               torch_tensorrt.Input(  # Specify input object with shape and dtype
                                                   shape=[1, C, H, W],
                                                   dtype=torch.half
                                               )
                                           ],
                                           # TODO: ADD Datatype for inference. Allowed options torch.(float|half|int8|int32|bool)
                                           enabled_precisions= {torch.half},  # half Run with FP16
                                           workspace_size=1 << 32
                                           )
    torch.jit.save(trt_ts_module, engine_file_path)  # save the TRT embedded Torchscript

I am sure the inference input to the model is in CUDA half format:

            img_crop = torch.from_numpy(img_crop).cuda().half()

And here is the detailed error message.

Inference 2D pose:   0%|          | 1/1196 [00:00<02:02,  9.74it/s]Traceback (most recent call last):
  File "/home/khanh/mvai/code/sdc/mocap-mdc-tracking/mocap_mdc_tracking/detect/pose_trt.py", line 123, in <module>
    test_inference(vit_jit_model, smt_frms, vid_path, out_debug_dir=out_debug_dir)
  File "/home/khanh/mvai/code/sdc/mocap-mdc-tracking/mocap_mdc_tracking/detect/pose_trt.py", line 87, in test_inference
    heatmaps = vit_jit_model(img_crop).detach().cpu().numpy()
  File "/home/khanh/.cache/pypoetry/virtualenvs/mocap-demo-vNvmCQwv-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/mocap_mdc_tracking/detect/vit_models/model.py", line 11, in forward
    _0 = ops.tensorrt.execute_engine([input_1], __torch___mocap_mdc_tracking_detect_vit_models_model_ViTPose_trt_engine_0x596979e5cbd0)
    _1, = _0
    input = torch.conv_transpose2d(_1, CONSTANTS.c0, None, [2, 2], [1, 1])
            ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    input0 = torch.batch_norm(input, CONSTANTS.c1, CONSTANTS.c2, CONSTANTS.c3, CONSTANTS.c4, False, 0.10000000000000001, 1.0000000000000001e-05, True)
    input1 = torch.conv_transpose2d(torch.relu(input0), CONSTANTS.c5, None, [2, 2], [1, 1])

Traceback of TorchScript, original code (most recent call last):
  File "/home/khanh/.cache/pypoetry/virtualenvs/mocap-demo-vNvmCQwv-py3.10/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 956, in forward
            num_spatial_dims, self.dilation)  # type: ignore[arg-type]
    
        return F.conv_transpose2d(
               ~~~~~~~~~~~~~~~~~~ <--- HERE
            input, self.weight, self.bias, self.stride, self.padding,
            output_padding, self.groups, self.dilation)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

Do you have any idea about this issue? Just wondering if you have tested the export with fp16.

Error with tensorrt model

Hi,
Thank you for the work you've done.
I have run the inference script with the torch model (vit-pose-coco-s) on my webcam using a gpu and it worked well.
After that I tried to use the provided TensorRT model and converted the yolov8.pt file to an engine file. When running the script, I get this error:
[TRT] [E] 1: [runtime.cpp::nvinfer1::Runtime::parsePlan::314] Error Code 1: Serialization (Serialization assertion plan->header.magicTag == rt::kPLAN_MAGIC_TAG failed.)
trt_engine is empty (None)
Traceback (most recent call last):
File "C:\Users\tech\Desktop\Caroline\Internship_Project\Internship_Project\VitPose\easy_ViTPose\testinf.py", line 45, in
model = VitInference(model_path, yolo_path, model_name='s', yolo_size=640, is_video=True, device="cuda")
File "C:\Users\tech\Desktop\Caroline\Internship_Project\Internship_Project\VitPose\easy_ViTPose\easy_ViTPose\inference.py", line 169, in init
engine_utils.allocate_buffers(trt_engine)
File "C:\Users\tech\Desktop\Caroline\Internship_Project\Internship_Project\VitPose\easy_ViTPose\easy_ViTPose\utils_engine.py", line 98, in allocate_buffers
for binding in engine:
TypeError: 'NoneType' object is not iterable

The loading function returns an empty engine.

Thank you.

Running inference on video with both humans and horses ?

Hi! First of all, thank you for the implementation. I have a few questions:

  1. How can I run the example Google Colab script for videos?
  2. If the video contains both humans and animals (e.g. horses), will I get keypoints for both?
  3. How can I modify the keypoints to visualize them better?

Regards T

ONNX, Tensorrt Core Dumped

Hello, thank you very much for the great repo! Very helpful!
I work with a GPU with 8 GB of VRAM. When I load the torch models everything works well; however, when I try to load the ONNX and TensorRT models (for ViTPose and for yolov5s), I get OOM and then a core dump. Do you know how much more VRAM ONNX and TensorRT use?

No module named 'configs'

Hi, I have some problems when cloning easy_ViTPose and running pip install -e .
Please help or fix that.
Thanks and regards

First frame takes long, others are faster

Hello,

I have two questions:

  1. I was playing around with your project and noticed that when I run the model on a video stream, the first frame takes quite some time to finish, while the next frames are faster. Why is that the case?

I am working in a jupyter notebook, a minimal version of it would look like this:

import time
import cv2
from easy_ViTPose import VitInference

model_path = '.\\models\\vitpose-l-coco_25.pth'
yolo_path = '.\\models\\yolov8l.pt'
model = VitInference(model_path, yolo_path, model_name='l', yolo_size=544, is_video=True, device=None, det_class="human")
path = "D:\\Documents\\xxx\\"
source = "file_name"
source_type= ".mp4"
cap = cv2.VideoCapture(path+source+source_type)
fourcc = cv2.VideoWriter_fourcc(*'MP4V')
out = cv2.VideoWriter(path+source+'_Test.mp4', fourcc, 25.0, (720,1280))
while cap.isOpened():
    # Read a frame from the video

    success, frame = cap.read()

    if success:
        start = time.time()

        # Run inference on the frame
        keypoints = model.inference(frame)
        
        # draw skeleton on the frame
        img = model.draw(show_yolo=True)
  
        # save frame
        out.write(img)
  
        # display the frame
        cv2.imshow('Output', img)

        end = time.time()
        print('time passed (in sec): ')
        print(end - start)
        
        # Break the loop if 'q' is pressed
        # TODO: be able to close the window with the X button
        key = cv2.waitKey(1)
        if key == ord("q"):
            break
    else:
        # Break the loop if the end of the video is reached
        break

# Release the video capture object and close the display window
cap.release()
out.release()
cv2.destroyAllWindows()

When executing the last cell, it takes quite some time (~10-20 sec) before anything happens. Do you know why that is the case?

I sometimes also get the following warning when it finally starts detecting in the first frame:
WARNING NMS time limit 0.550s exceeded

  2. Is it possible to only detect the pose for people in a specific area of the image, defined by a rectangle in pixel coordinates? For example, in the following image I only want the pose for the person in the red rectangle. Would inference get faster if ViT only has to consider one person instead of e.g. 3?
    example

Wholebody inference

Hi, thank you for sharing a nice codebase!

I am looking for a way to get COCO wholebody keypoints.
How are the feet keypoints added? Can I get some tips?

Evaluation on coco dataset

The results of using this implementation on the COCO val dataset seem to be quite a bit lower than those reported in the paper.

ModuleNotFoundError: No module named 'easy_ViTPose.vit_models.backbone'

Description

After updating my local copy of the codebase to the latest version (previously working on a version that was 2 months old), I encountered a ModuleNotFoundError during runtime. Specifically, the error points to a missing module within the easy_ViTPose package.

Error Message

ModuleNotFoundError: No module named 'easy_ViTPose.vit_models.backbone'

The backbone folder is present within the vit_models directory when I clone the repository directly from GitHub.
However, after installing the package using pip install, the backbone folder does not seem to be included in the installed library.

This issue might be related to how the setup.py is configured, potentially not including the backbone directory during the installation process?

Shoulder joint

Hello,

I'm having a hard time figuring out which keypoint is the shoulder for my model and was hoping someone knows.

import os
import math
import numpy as np
import pandas as pd
from easy_ViTPose import VitInference
import cv2
from tqdm import tqdm
from math import atan2, degrees

print("Initializing model...")
model_path = './vitpose-25-s.pth'
yolo_path = './yolov5s.onnx'
model = VitInference(model_path, yolo_path, model_name='s', yolo_size=320, is_video=False, device=None)

easy-ViTPose Import error

Hi, thank you for providing this simple implementation of ViTPose!
Two weeks ago I was running the code in Google Colab, but when I tried to run it again today, I encountered an error while trying to install the easy_ViTPose package.
The error message I received was as follows:

Obtaining file:///content/easy_ViTPose
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

How can I solve this problem?

finetuning

There is a problem I have encountered: how do I fine-tune the ViTPose model?

Slow loss convergence

Hello,
I'm attempting to perform fine-tuning with your implementation (I'm using commit e8e2ad1 from April 24, as I don't need feet keypoints).
Unfortunately I think the loss might not be converging properly. I tried to run the training without fine-tuning (from scratch): in the first 5 epochs the loss decreased from 0.0168 to 0.0063, but it remained stuck at 0.0063 for the next 25 epochs.

Do you have any suggestions for how to solve it?
I've used the same hyperparameters as in your code, but changed the layer decay rate from 0.75 to 1-1e-4.

Thank you for your time and assistance!

Normalize image before inference

Thanks for the helpful repo!

When using the pretrained models, should I normalize the image tensor before feeding it to the model?

For training, COCODataset applies the usual Normalize transform (code) when loading images. The config files also include NormalizeTensor steps in train_pipeline and val_pipeline, though that part of the config doesn't appear to be used in this repo.

On the inference side, however, the inference.py script only scales the image input to [0, 1], without normalizing (code). Is that a bug, or am I missing something?
