
yolo_deepstream's Introduction

Yolo DeepStream

Description

This repo has four parts:

1) yolov7_qat

In yolov7_qat, we use TensorRT's pytorch-quantization tool to fine-tune (QAT) YOLOv7 from the pre-trained weights. The resulting model matches the performance of PTQ in TensorRT on Jetson AGX Orin, and its accuracy (mAP) drops only slightly.
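For orientation, a minimal sketch of the typical pytorch-quantization flow behind this (not the exact yolov7_qat scripts; load_pretrained_yolov7, calib_loader, train_loader, and compute_yolov7_loss are placeholders):

import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Monkey-patch torch.nn so Conv2d/Linear/... are created as quantized layers
# (QuantConv2d etc.) with Q/DQ nodes attached; do this before building the model.
quant_modules.initialize()

model = load_pretrained_yolov7("yolov7.pt")                   # placeholder loader
model.cuda()

# 1) Calibration (PTQ-style): collect activation ranges on a handful of batches.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()
with torch.no_grad():
    for i, (imgs, _) in enumerate(calib_loader):              # placeholder loader
        model(imgs.cuda())
        if i >= 32:
            break
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()   # histogram calibrators take a method argument
        module.disable_calib()
        module.enable_quant()

# 2) QAT: fine-tune a few epochs with fake quantization active, small learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
model.train()
for imgs, targets in train_loader:                            # placeholder loader
    loss = compute_yolov7_loss(model(imgs.cuda()), targets)   # placeholder loss fn
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()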

2) tensorrt_yolov7

In tensorrt_yolov7, we provide a standalone C++ YOLOv7 app sample. You can use trtexec to convert FP32 ONNX models, or QAT-INT8 models exported from the yolov7_qat part, into TensorRT engines, and then pass the engine to the app as its input. The app can run detection on images and videos, or evaluate mAP on the COCO dataset.

3) deepstream_yolo

The deepstream_yolo sample shows how to integrate YOLO models, with customized output-layer parsing of detected objects, into the DeepStream SDK.

4) tensorrt_yolov4

The tensorrt_yolov4 sample is a standalone TensorRT sample for YOLOv4.

Performance

For the YOLOv7 sample:

The table below shows the end-to-end performance of processing 1080p video with this sample application.

  • Testing Device :

    1. Jetson AGX Orin 64GB(PowerMode:MAXN + GPU-freq:1.3GHz + CPU:12-core-2.2GHz)

    2. Tesla T4

| Device | Precision | Number of streams | Batch size | trtexec FPS | deepstream-app FPS (CUDA post-process) | deepstream-app FPS (CPU post-process) |
|---|---|---|---|---|---|---|
| Orin-X | FP16 | 1 | 1 | 126 | 124 | 120 |
| Orin-X | FP16 | 16 | 16 | 162 | 145 | 135 |
| Orin-X | INT8 (PTQ/QAT) | 1 | 1 | 180 | 175 | 128 |
| Orin-X | INT8 (PTQ/QAT) | 16 | 16 | 264 | 264 | 135 |
| T4 | FP16 | 1 | 1 | 132 | 125 | 123 |
| T4 | FP16 | 16 | 16 | 169 | 169 | 123 |
| T4 | INT8 (PTQ/QAT) | 1 | 1 | 208 | 170 | 127 |
| T4 | INT8 (PTQ/QAT) | 16 | 16 | 305 | 300 | 132 |
  • Note: cudaGraph is not enabled for trtexec because DeepStream does not support cudaGraph.

Code structure

├── deepstream_yolo
│   ├── config_infer_primary_yoloV4.txt # config file for yolov4 model
│   ├── config_infer_primary_yoloV7.txt # config file for yolov7 model
│   ├── deepstream_app_config_yolo.txt # DeepStream reference app configuration file for using YOLO models as the primary detector
│   ├── labels.txt # labels for coco detection
│   ├── nvdsinfer_custom_impl_Yolo
│   │   ├── Makefile
│   │   └── nvdsparsebbox_Yolo.cpp # output layer parsing function for detected objects for the Yolo model
│   └── README.md 
├── README.md
├── tensorrt_yolov4
│   ├── data 
│   │   ├── demo.jpg # the demo image
│   │   └── demo_out.jpg # image detection output of the demo image
│   ├── Makefile
│   ├── Makefile.config
│   ├── README.md
│   └── source
│       ├── generate_coco_image_list.py # python script to get list of image names from MS COCO annotation or information file
│       ├── main.cpp # program main entry point where parameters are configured
│       ├── Makefile
│       ├── onnx_add_nms_plugin.py # python script to add BatchedNMSPlugin node into ONNX model
│       ├── SampleYolo.cpp # yolov4 inference class functions definition file
│       └── SampleYolo.hpp # yolov4 inference class definition file
├── tensorrt_yolov7
│   ├── CMakeLists.txt
│   ├── imgs # the demo images
│   │   ├── horses.jpg 
│   │   └── zidane.jpg
│   ├── README.md
│   ├── samples 
│   │   ├── detect.cpp # detection app for images detection
│   │   ├── validate_coco.cpp # validate coco dataset app
│   │   └── video_detect.cpp # detection app for video detection
│   ├── src
│   │   ├── argsParser.cpp # argsParser helper class for commandline parsing
│   │   ├── argsParser.h # argsParser helper class for commandline parsing
│   │   ├── tools.h # helper function for yolov7 class
│   │   ├── Yolov7.cpp # Class Yolov7
│   │   └── Yolov7.h # Class Yolov7
│   └── test_coco_map.py # tool for testing COCO mAP with a json file
└── yolov7_qat
    ├── doc
    │   ├── Guidance_of_QAT_performance_optimization.md # guidance for Q&DQ insert and placement for pytorch-quantization tool
    ├── quantization
    │   ├── quantize.py # helper class for quantizing the yolov7 model
    │   └── rules.py # rules and restrictions for Q&DQ node insertion
    ├── README.md 
    └── scripts
        ├── detect-trt.py # detect an image with a tensorrt engine
        ├── draw-engine.py # draw the tensorrt engine as a graph
        ├── eval-trt.py # the script for evaluating tensorrt mAP
        ├── eval-trt.sh # the command-line script for evaluating tensorrt mAP
        ├── qat.py # main function for QAT and PTQ
        └── trt-int8.py # tensorrt built-in calibration

yolo_deepstream's People

Contributors

hopef, mchi-zg, wanghr323


yolo_deepstream's Issues

QAT with multiple GPUs?

I tried to do QAT with multiple GPUs with torch.nn.DataParallel, but I got an error:

Traceback (most recent call last):
  File "scripts/qat.py", line 347, in <module>
    args.eval_origin, args.eval_ptq
  File "scripts/qat.py", line 245, in cmd_quantize
    preprocess=preprocess, supervision_policy=supervision_policy())
  File "/GSOL_lossless_AI/yolov7/quantization/quantize.py", line 347, in finetune
    model(imgs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/GSOL_lossless_AI/yolov7/models/yolo.py", line 599, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/GSOL_lossless_AI/yolov7/models/yolo.py", line 625, in forward_once
    x = m(x)  # run
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/GSOL_lossless_AI/yolov7/models/common.py", line 111, in fuseforward
    return self.act(self.conv(x))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_quantization/nn/modules/quant_conv.py", line 120, in forward
    quant_input, quant_weight = self._quant(input)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_quantization/nn/modules/quant_conv.py", line 85, in _quant
    quant_input = self._input_quantizer(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_quantization/nn/modules/tensor_quantizer.py", line 346, in forward
    outputs = self._quant_forward(inputs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_quantization/nn/modules/tensor_quantizer.py", line 310, in _quant_forward
    outputs = fake_tensor_quant(inputs, amax, self._num_bits, self._unsigned, self._narrow_range)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_quantization/tensor_quant.py", line 306, in forward
    outputs, scale = _tensor_quant(inputs, amax, num_bits, unsigned, narrow_range)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_quantization/tensor_quant.py", line 354, in _tensor_quant
    outputs = torch.clamp((inputs * scale).round_(), min_bound, max_bound)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
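For reference, a minimal sketch of the wrapping described above (assuming a model whose layers were already swapped in by pytorch-quantization; the loader name is a placeholder), which is the call path that raises the cross-device error in replica 1:

import torch
from pytorch_quantization import quant_modules

quant_modules.initialize()
model = load_quantizable_yolov7("yolov7.pt")          # placeholder loader
model = torch.nn.DataParallel(model.cuda(), device_ids=[0, 1])

imgs = torch.randn(8, 3, 640, 640).cuda()
out = model(imgs)   # the fake-quant step inside replica 1 raises the device mismatch above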

Different throughput running deepstream vs trtexec mode

Hi all, I am running YOLOv3 with DeepStream 5.1 and ran the saved optimized engine with the trtexec command to double-check that I get consistent results; instead, I got different throughput. Any ideas about what is going on?

Throughput FPS (avg) | INT8 | BS=1
Running TensorRT engine with DeepStream 5.1: 292
Running TensorRT engine in standalone mode (trtexec): 201

Running with DeepStream
$ deepstream-app -c deepstream_app_config_yoloV3.txt
Performance:

IOU Tracker Init with threshold 0.100000
****PERF:  292.03 (291.97)**
** INFO: <bus_callback:204>: Received EOS. Exiting ...

Running with trtexec command
$ /usr/src/tensorrt/bin/trtexec --plugins=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so --loadEngine=model_b1_gpu0_int8.engine --int8
Performance

[03/04/2021-02:07:59] [I] min: 5.7467 ms (end to end 9.50974 ms)
[03/04/2021-02:07:59] [I] max: 9.63377 ms (end to end 18.3497 ms)
[03/04/2021-02:07:59] [I] mean: 5.92314 ms (end to end 9.8676 ms)
[03/04/2021-02:07:59] [I] median: 5.89404 ms (end to end 9.80298 ms)
[03/04/2021-02:07:59] [I] percentile: 6.07446 ms at 99% (end to end 10.1627 ms at 99%)
**[03/04/2021-02:07:59] [I] throughput: 201.199 qps**
[03/04/2021-02:07:59] [I] walltime: 3.01691 s
[03/04/2021-02:07:59] [I] Enqueue Time
[03/04/2021-02:07:59] [I] min: 0.516113 ms
[03/04/2021-02:07:59] [I] max: 0.815536 ms
[03/04/2021-02:07:59] [I] median: 0.520508 ms
[03/04/2021-02:07:59] [I] GPU Compute
[03/04/2021-02:07:59] [I] min: 4.7764 ms
[03/04/2021-02:07:59] [I] max: 8.66505 ms
[03/04/2021-02:07:59] [I] mean: 4.95182 ms
[03/04/2021-02:07:59] [I] median: 4.92303 ms
[03/04/2021-02:07:59] [I] percentile: 5.10498 ms at 99%

deepstream-app -c deepstream_app_config_yoloV4.txt error

NvMMLiteBlockCreate : Block : BlockType = 4
deepstream-app: nvdsparsebbox_Yolo.cpp:141: bool NvDsInferParseCustomYoloV4(const std::vector&, const NvDsInferNetworkInfo&, const NvDsInferParseDetectionParams&, std::vector&): Assertion `scores.inferDims.numDims == 2' failed.
Aborted (core dumped)

Deepstream-app with YOLOv4 ONNX model + BatchedNMSPlugin

Hi,
I successfully created an engine file from yolov4 + BatchedNMSPlugin according to the instructions in this repository. The engine file works fine (I can successfully run the ../bin/yolov4 --demo command).

Now I want to deploy this engine file to deepstream-app. To do that, I need a parse function for the config_infer_primary.txt file. The default parse function for yolov4 is here in this repository, including the NMS. But I do not need NMS at post-processing time, because the BatchedNMSPlugin is part of the engine file, so NMS is already done and the output should contain only the final bboxes.

I tried to rewrite the parse function for yolov4 for my case, but with no success.

Is there any example of a parse function for yolov4 + BatchedNMSPlugin? If not, where should I start? Is there any information on how to write my own parse function for engine files?

What are topk and keeptopk parameters?

Can someone please describe to me what the "topk" and "keeptopk" parameters are? They are found in the conversion to onnx, in the arguments given for running the python script.
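For context: in TensorRT's BatchedNMS_TRT plugin, topK is the number of highest-scoring boxes fed into the NMS step and keepTopK is the maximum number of detections kept per image after NMS (keepTopK <= topK). A hedged sketch of how a script like onnx_add_nms_plugin.py typically sets them when attaching the plugin node (the onnx-graphsurgeon calls and tensor positions are illustrative, not the repo's exact code):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("yolov4.onnx"))
boxes, scores = graph.outputs[0], graph.outputs[1]        # raw box / score tensors (illustrative)

nms_attrs = {
    "shareLocation": True,
    "backgroundLabelId": -1,
    "numClasses": 80,
    "topK": 2000,        # what the script's -t flag appears to control
    "keepTopK": 100,     # what the script's -k flag appears to control
    "scoreThreshold": 0.4,
    "iouThreshold": 0.6,
    "isNormalized": True,
    "clipBoxes": True,
}

outputs = [gs.Variable(name=n) for n in
           ("num_detections", "nmsed_boxes", "nmsed_scores", "nmsed_classes")]
graph.nodes.append(gs.Node(op="BatchedNMS_TRT", attrs=nms_attrs,
                           inputs=[boxes, scores], outputs=outputs))
graph.outputs = outputs
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "yolov4.nms.onnx")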

About converting a YOLOv7 QAT model to a TensorRT engine (fails with a dynamic-batch setting)

Following yolo_deepstream/tree/main/tensorrt_yolov7 and using "yolov7QAT" to perform a batch detection task, the following error occurs:
./build/detect --engine=yolov7QAT.engine --img=./imgs/horses.jpg,./imgs/zidane.jpg

Error Message

input 2 images, paths: ./imgs/horses.jpg, ./imgs/zidane.jpg, 
--------------------------------------------------------
Yolov7 initialized from: /opt/nvidia/deepstream/deepstream/samples/models/tao_pretrained_models/yolov7/yolov7QAT.engine
input : images , shape : [ 1,3,640,640,]
output : outputs , shape : [ 1,25200,85,]
--------------------------------------------------------
preprocess start
error cv_img.size() in preProcess
 error: mImgPushed = 1 numImg = 1 mMaxBatchSize= 1, mImgPushed + numImg > mMaxBatchSize 
inference start
postprocessing start
detectec image written to: ./imgs/horses.jpgdetect0.jpg

Note

  • It works fine when running a single detection task with "yolov7QAT.engine".
  • "yolov7QAT.engine" comes from "yolov7qat.onnx" conversion.
  • Whether "yolov7qat.onnx" is downloaded from here or self-trained (viewed with netron, both show the same structure), running ./build/detect gives the same error message.
  • It runs fine with the non-QAT "yolov7db4fp32.engine" or "yolov7db4fp16.engine".

Environment

    CUDA: 11.4.315
    cuDNN: 8.6.0.166
    TensorRT: 5.1
    Python: 3.8.10
    PyTorch: 1.12.0a0+2c916ef.nv22.3
Hardware
    Model: Jetson-AGX
    Module: NVIDIA Jetson AGX Xavier (32 GB ram)
    L4T: 35.2.1
    Jetpack: 5.1

Run sample error using DeepStream 5.1 and YOLOv4

Complete information of setup.

• Hardware Platform (Jetson / GPU) : GPU
• DeepStream Version : 5.1
• CUDA Version : 11.1
• JetPack Version (valid for Jetson only): None
• TensorRT Version : 7.2.3.4
• NVIDIA GPU Driver Version (valid for GPU only) : 460.84
• Issue Type( questions, new requirements, bugs) : questions
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

I have set up DeepStream 5.1 with YOLOv4, and while running the sample I get the error below.
Can anybody please help me out?

ERROR:

(deepstreamer_env) xxxx@xxxx:/opt/nvidia/deepstream/deepstream-5.1/sources/deepstream_yolov4$ deepstream-app -c deepstream_app_config_yoloV4.txt
Unknown or legacy key specified ‘is-classifier’ for group [property]
** ERROR: main:655: Failed to set pipeline to PAUSED
Quitting
ERROR from sink_sub_bin_sink1: Could not open file “yolov4.mp4” for writing.
Debug info: gstfilesink.c(431): gst_file_sink_open_file (): /GstPipeline:pipeline/GstBin:processing_bin_0/GstBin:sink_bin/GstBin:sink_sub_bin1/GstFileSink:sink_sub_bin_sink1:
system error: Permission denied
ERROR from sink_sub_bin_sink1: GStreamer error: state change failed and some element failed to post a proper error message with the reason for the failure.
Debug info: gstbasesink.c(5265): gst_base_sink_change_state (): /GstPipeline:pipeline/GstBin:processing_bin_0/GstBin:sink_bin/GstBin:sink_sub_bin1/GstFileSink:sink_sub_bin_sink1:
Failed to start
App run failed

Yolov7 instance segmentation with deepstream?

How can I deploy an instance segmentation yolov7 model with deepstream? Once I export the model to .trt, I need to parse the output layers.

Is it planned to add this feature here?

Thanks!

[Solution] TensorRT 8.0.1 engine for YOLOv4 in standalone mode

To compile the yolov4 models with batched NMS, I had to change this:

  1. Comment out builder->allowGPUFallback(true); in SampleYolo.cpp. According to the docs, allowGPUFallback was removed in TensorRT 8.0.1. If I understand correctly, I do not have to care about it in TensorRT 8.0.1. Am I right?
  2. Comment out all MYELIN_LIB and ENABLE_MYELIN entries in Makefile.config to prevent the error "/usr/bin/ld: cannot find -lmyelin"; it seems to me that TensorRT 8.0.1 does not use it, but I found nothing about it. Do you have any idea?

If you do these 2 steps, then you can compile it and it works.
Good luck!

The Python DeepStream API with YOLOv4: multichannel RTSP video stream inference cannot run

My computer:
Operating system: Ubuntu 20.04
CPU: Intel(R) Core(TM) i9-12900KF
GPU: NVIDIA 3090, 24 GB video memory
DeepStream 6.1 was installed successfully.
The Python bindings for DeepStream were installed successfully.

git clone https://github.com/NVIDIA-AI-IOT/yolov4_deepstream.git
Successfully compiled and ready to run.
Imported into the Python interface and used deepstream-test3: running one RTSP stream works, but running two RTSP streams fails.
In the deepstream-test3 folder, the deepstream-test3.py file was modified:
pgie.set_property('config-file-path', "config_infer_primary_yoloV4.txt")

python3 deepstream-test3.py -i rtsp://admin:[email protected]/h264/ch1/main/av_stream -s no problem

but
python3 deepstream-test3.py -i rtsp://admin:[email protected]/h264/ch1/main/av_stream rtsp://admin:[email protected]/h264/ch1/main/av_stream -s
problem:
0:00:09.288897521 97120 0x29c0900 WARN nvinfer gstnvinfer.cpp:643:gst_nvinfer_logger: NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::checkBackendParams() <nvdsinfer_context_impl.cpp:1832> [UID = 1]: Backend has maxBatchSize 1 whereas 2 has been requested
0:00:09.289061931 97120 0x29c0900 WARN nvinfer gstnvinfer.cpp:643:gst_nvinfer_logger: NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2009> [UID = 1]: deserialized backend context :/home/ai-box/deepstream/nvidia/deepstream/deepstream-6.1/sources/deepstream_yolov4/yolov4.engine failed to match config params, trying rebuild
0:00:09.297623267 97120 0x29c0900 INFO nvinfer gstnvinfer.cpp:646:gst_nvinfer_logger: NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1914> [UID = 1]: Trying to create engine from model files
ERROR: ../nvdsinfer/nvdsinfer_model_builder.cpp:860 failed to build network since there is no model file matched.
ERROR: ../nvdsinfer/nvdsinfer_model_builder.cpp:799 failed to build network.
0:00:09.801857495 97120 0x29c0900 ERROR nvinfer gstnvinfer.cpp:640:gst_nvinfer_logger: NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1934> [UID = 1]: build engine file failed
0:00:09.851086491 97120 0x29c0900 ERROR nvinfer gstnvinfer.cpp:640:gst_nvinfer_logger: NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2020> [UID = 1]: build backend context failed
0:00:09.851121902 97120 0x29c0900 ERROR nvinfer gstnvinfer.cpp:640:gst_nvinfer_logger: NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::initialize() <nvdsinfer_context_impl.cpp:1257> [UID = 1]: generate backend failed, check config file settings
0:00:09.851353192 97120 0x29c0900 WARN nvinfer gstnvinfer.cpp:846:gst_nvinfer_start: error: Failed to create NvDsInferContext instance
0:00:09.851358835 97120 0x29c0900 WARN nvinfer gstnvinfer.cpp:846:gst_nvinfer_start: error: Config file path: config_infer_primary_yoloV4.txt, NvDsInfer Error: NVDSINFER_CONFIG_FAILED

**PERF: {'stream0': 0.0, 'stream1': 0.0}

Error: gst-resource-error-quark: Failed to create NvDsInferContext instance (1): gstnvinfer.cpp(846): gst_nvinfer_start (): /GstPipeline:pipeline0/GstNvInfer:primary-inference:
Config file path: config_infer_primary_yoloV4.txt, NvDsInfer Error: NVDSINFER_CONFIG_FAILED
Exiting app

There are some errors after adding "BatchedNMS_TRT" layer

Description:
I got the YOLOv4 ONNX model (yolov4_1_3_608_608_static.onnx) from https://github.com/Tianxiaomo/pytorch-YOLOv4, then used the command:
"python .\onnx_add_nms_plugin.py -f .\yolov4_1_3_608_608_static.onnx -t 2000 -k 100"
to add the "BatchedNMS_TRT" layer, and got a new model (yolov4_1_3_608_608_static.nms.onnx).
But when I used the command:
"trtexec --onnx=yolov4_1_3_608_608_static.nms.onnx --explicitBatch --saveEngine=tensorRT-eng --workspace=4096"
to convert the model, there were some errors; here is the log:
[11/25/2020-12:10:31] [I] === Model Options ===
[11/25/2020-12:10:31] [I] Format: ONNX
[11/25/2020-12:10:31] [I] Model: yolov4_1_3_608_608_static.nms.onnx
[11/25/2020-12:10:31] [I] Output:
[11/25/2020-12:10:31] [I] === Build Options ===
[11/25/2020-12:10:31] [I] Max batch: explicit
[11/25/2020-12:10:31] [I] Workspace: 4096 MB
[11/25/2020-12:10:31] [I] minTiming: 1
[11/25/2020-12:10:31] [I] avgTiming: 8
[11/25/2020-12:10:31] [I] Precision: FP32
[11/25/2020-12:10:31] [I] Calibration:
[11/25/2020-12:10:31] [I] Safe mode: Disabled
[11/25/2020-12:10:31] [I] Save engine: tensorRT-eng
[11/25/2020-12:10:31] [I] Load engine:
[11/25/2020-12:10:31] [I] Builder Cache: Enabled
[11/25/2020-12:10:31] [I] NVTX verbosity: 0
[11/25/2020-12:10:31] [I] Inputs format: fp32:CHW
[11/25/2020-12:10:31] [I] Outputs format: fp32:CHW
[11/25/2020-12:10:31] [I] Input build shapes: model
[11/25/2020-12:10:31] [I] Input calibration shapes: model
[11/25/2020-12:10:31] [I] === System Options ===
[11/25/2020-12:10:31] [I] Device: 0
[11/25/2020-12:10:31] [I] DLACore:
[11/25/2020-12:10:31] [I] Plugins:
[11/25/2020-12:10:31] [I] === Inference Options ===
[11/25/2020-12:10:31] [I] Batch: Explicit
[11/25/2020-12:10:31] [I] Input inference shapes: model
[11/25/2020-12:10:31] [I] Iterations: 10
[11/25/2020-12:10:31] [I] Duration: 3s (+ 200ms warm up)
[11/25/2020-12:10:31] [I] Sleep time: 0ms
[11/25/2020-12:10:31] [I] Streams: 1
[11/25/2020-12:10:31] [I] ExposeDMA: Disabled
[11/25/2020-12:10:31] [I] Spin-wait: Disabled
[11/25/2020-12:10:31] [I] Multithreading: Disabled
[11/25/2020-12:10:31] [I] CUDA Graph: Disabled
[11/25/2020-12:10:31] [I] Skip inference: Disabled
[11/25/2020-12:10:31] [I] Inputs:
[11/25/2020-12:10:31] [I] === Reporting Options ===
[11/25/2020-12:10:31] [I] Verbose: Disabled
[11/25/2020-12:10:31] [I] Averages: 10 inferences
[11/25/2020-12:10:31] [I] Percentile: 99
[11/25/2020-12:10:31] [I] Dump output: Disabled
[11/25/2020-12:10:31] [I] Profile: Disabled
[11/25/2020-12:10:31] [I] Export timing to JSON file:
[11/25/2020-12:10:31] [I] Export output to JSON file:
[11/25/2020-12:10:31] [I] Export profile to JSON file:
[11/25/2020-12:10:31] [I]

Input filename: yolov4_1_3_608_608_static.nms.onnx
ONNX IR version: 0.0.7
Opset version: 11
Producer name:
Producer version:
Domain:
Model version: 0
Doc string:

[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/25/2020-12:10:32] [I] [TRT] ModelImporter.cpp:135: No importer registered for op: BatchedNMS_TRT. Attempting to import as plugin.
[11/25/2020-12:10:32] [I] [TRT] builtin_op_importers.cpp:3659: Searching for plugin: BatchedNMS_TRT, plugin_version: 1, plugin_namespace:
[11/25/2020-12:10:32] [I] [TRT] builtin_op_importers.cpp:3676: Successfully created plugin: BatchedNMS_TRT
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [W] [TRT] Output type must be INT32 for shape outputs
[11/25/2020-12:10:32] [W] [TRT] Output type must be INT32 for shape outputs
[11/25/2020-12:10:32] [W] [TRT] Output type must be INT32 for shape outputs
[11/25/2020-12:10:32] [W] [TRT] Output type must be INT32 for shape outputs
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] (Unnamed Layer* 1330) [PluginV2Ext]: PluginV2Layer must be V2DynamicExt when there are runtime input dimensions.
[11/25/2020-12:10:32] [E] [TRT] Layer (Unnamed Layer* 1330) [PluginV2Ext] failed validation
[11/25/2020-12:10:32] [E] [TRT] Network validation failed.
[11/25/2020-12:10:33] [E] Engine creation failed
[11/25/2020-12:10:33] [E] Engine set up failed

Does anyone know why this happens?

Environment
TensorRT Version:TensorRT-7.1.3.4
GPU Type: 1080ti
CUDA Version: cuda_11.0.3_451.82_win10
Operating System :Windows 10
Python Version (if applicable): 3.7
PyTorch Version (if applicable): 1.8.0.dev20201118

YOLOv7 EfficientNMS - Num Classes

I am currently working on integrating YOLOv7 with DeepStream and Triton Server. I have been using the NvDsInferParseCustomEfficientNMS function from /opt/nvidia/deepstream/deepstream-6.1/sources/libs/nvdsinfer_customparser/nvdsinfer_custombboxparser.cpp in my setup.

Deepstream / Triton Server - YOLOv7

Now, I'm looking to transition to the implementation provided by NVIDIA in the repository https://github.com/NVIDIA-AI-IOT/yolo_deepstream/tree/main/deepstream_yolo/nvdsinfer_custom_impl_Yolo. However, I noticed that the code in this repository has a hardcoded value NUM_CLASSES_YOLO, which is not present in NvDsInferParseCustomEfficientNMS function.

static const int NUM_CLASSES_YOLO = 80;
#define OBJECTLISTSIZE 25200
#define BLOCKSIZE  1024
thrust::device_vector<NvDsInferParseObjectInfo> objects_v(OBJECTLISTSIZE);

extern "C" bool NvDsInferParseCustomYoloV7_cuda( 

As I have multiple YOLOv7 models with primary and secondary inference, I am concerned about having to compile a separate NvDsInferParseCustomYoloV7_cuda for each model.

Could you kindly advise if there is a way to avoid compiling individual NvDsInferParseCustomYoloV7_cuda for each model and instead make it more dynamic or configurable to support multiple models?

Thank you for your assistance and guidance. Any help you can provide will be greatly appreciated.

_pickle.UnpicklingError: invalid load key, '<'.

Within the docker container nvcr.io/nvidia/deepstream:6.0-triton:

mkdir /src
cd /src
git clone https://github.com/Tianxiaomo/pytorch-YOLOv4.git
cd pytorch-YOLOv4
apt -y install python3-venv
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip3 install \
numpy==1.18.2 \
torch==1.4.0 \
tensorboardX==2.0 \
scikit_image==0.16.2 \
matplotlib==2.2.3 \
tqdm==4.43.0 \
easydict==1.9 \
Pillow==7.1.2 \
opencv_python \
onnx \
onnxruntime

wget --no-check-certificate "https://docs.google.com/uc?export=download&id=1wv_LiFeCRYwtpkqREPeI13-gPELBDwuJ" -r -A 'uc*' -e robots=off -nd -O yolov4.pth
python3 demo_pytorch2onnx.py yolov4.pth data/dog.jpg 8 80 416 416

Error:

(venv) root@ip-172-31-9-127:/src/pytorch-YOLOv4# python3 demo_pytorch2onnx.py ./yolov4.pth data/dog.jpg 8 80 416 416
Converting to onnx and running demo ...
Traceback (most recent call last):
  File "demo_pytorch2onnx.py", line 96, in <module>
    main(weight_file, image_path, batch_size, n_classes, IN_IMAGE_H, IN_IMAGE_W)
  File "demo_pytorch2onnx.py", line 72, in main
    transform_to_onnx(weight_file, batch_size, n_classes, IN_IMAGE_H, IN_IMAGE_W)
  File "demo_pytorch2onnx.py", line 19, in transform_to_onnx
    pretrained_dict = torch.load(weight_file, map_location=torch.device('cuda'))
  File "/src/pytorch-YOLOv4/venv/lib/python3.8/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/src/pytorch-YOLOv4/venv/lib/python3.8/site-packages/torch/serialization.py", line 692, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
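The '<' load key typically means the file saved as yolov4.pth is not the checkpoint at all but an HTML page (for example, a Google Drive confirmation page returned to wget). A quick, hedged sanity check before converting (file name as in the commands above):

# Inspect the first bytes of the downloaded file: a real PyTorch checkpoint is a
# zip/pickle file, while a failed Google Drive download is HTML starting with '<'.
with open("yolov4.pth", "rb") as f:
    head = f.read(16)
print(head)
# b'<!DOCTYPE html>' or b'<html>'  -> re-download the weights (e.g. with gdown)
# b'PK\x03\x04' or pickle bytes    -> the file itself is intact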

yolov7 Error Code 3: API Usage Error (parameter check failed at: runtime/rt/runtime.cpp)

I followed the yolov7 steps and was able to generate the engine file.

running: ./build/detect --engine==../../data/models/yolov7fp32.engine --img=./imgs/hoses.jpg,./imgs/zidane.jpg

ERROR: 3: [runtime.cpp::deserializeCudaEngine::36] Error Code 3: API Usage Error (Parameter check failed at: runtime/rt/runtime.cpp::deserializeCudaEngine::36, condition: (blob) != nullptr
)
Segmentation fault (core dumped)

Lower Performance with Yolov7-Tiny Quantization

Hi,
After doing quantization on the yolov7-tiny model with the recommended settings, I am getting a lower throughput with the result (qat-tiny.pt) on the benchmark when compared to doing the same benchmark on the model before quantization (yolov7-tiny.pt). I double checked that there were no tasks running in the background so I am pretty sure it is caused by the weights. I was wondering what went wrong during the quantization.
Thanks!

No module named 'trex'

In draw-engine.py, it imports trex, but this cannot be installed through pip. Is "trex" some custom code?

Performance report for tensorrt_yolov4

As an extension to the preliminary benchmark for tensorrt_yolov4, batch inference performance is provided as follows:

| repo | batch=1 | batch=2 | batch=4 | batch=8 |
|---|---|---|---|---|
| tkDNN | N/A (N/A) 207.81 | N/A | N/A (N/A) 443.32 | N/A |
| isarsoft | 7.96 (N/A) 125.4 | N/A | 21.0 (N/A) 189.6 | 38.3 (N/A) 208.0 |
| this | 7.023 (2.61747) 120.831 | 4.393 (1.76344) 186.44 | 3.688 (1.26853) 223.68 | 3.42267 (0.888971) 239.063 |

where the metrics are formatted as: wall-time in ms (standard deviation of wall-time) frames-per-second. Wall-time only considers pre-processing + inference + post-processing time, while FPS is calculated over the end-to-end process, from image acquisition to image overlays, without display.

AlexeyAB's repository does not publish FP16 numbers, hence its exclusion. While all repositories use a 320x320 input size and FP16 precision, the accompanying repositories are not directly comparable, since each uses its own metrics. Moreover, both of them use an NVIDIA GeForce RTX 2080 Ti, whereas for this repository I am using an NVIDIA GeForce RTX 2070.

Compile nvdsparsebbox_Yolo.cpp Error

While compiling nvdsparsebbox_Yolo.cpp with make, I get the error below:

g++ -o libnvdsinfer_custom_impl_Yolo.so nvdsparsebbox_Yolo.o -shared -Wl,--start-group -lnvinfer_plugin -lnvinfer -lnvparsers -L/usr/local/cuda-11.1/lib64 -lcudart -lcublas -lstdc++fs -Wl,--end-group
/usr/bin/ld: cannot find -lnvinfer_plugin
/usr/bin/ld: cannot find -lnvinfer
/usr/bin/ld: cannot find -lnvparsers
collect2: error: ld returned 1 exit status
Makefile:47: recipe for target 'libnvdsinfer_custom_impl_Yolo.so' failed
make: *** [libnvdsinfer_custom_impl_Yolo.so] Error 1

Can anybody help me

How do I use qat-yolov5.py?

How do I use qat-yolov5.py? Can someone provide a detailed tutorial, e.g. which sub-version of yolov5 to use? Do I need to modify the yolov5 model?

Yolov7-QAT: Different Graph exported in PTQ int8 compare with the guide

I downloaded the yolov7 onnx file according to https://github.com/NVIDIA-AI-IOT/yolo_deepstream and then converted it into a TensorRT INT8 engine file in PTQ mode. The platform is a DRIVE AGX Orin iGPU; however, the resulting graph is different from the guidance shown in https://github.com/NVIDIA-AI-IOT/yolo_deepstream/blob/main/yolov7_qat/doc/Guidance_of_QAT_performance_optimization.md

  1. platform: drive agx orin
  2. tensorrt: 8.4.11

Send detection results

Hi all,

I want to send the detection results to an external system.
In order to do that I need the relative result center point coordinates in plain yolo format.
Any hint on how to achieve this? The output of decodeYoloV4Tensor does not seem to match or is somehow already scaled?

Regards,
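If the parser returns absolute pixel boxes (left, top, width, height), a minimal sketch of converting them into plain YOLO format (normalized center x/y plus width/height) before sending them out; the function name and the assumed box layout are illustrative:

def to_yolo_format(left, top, width, height, img_w, img_h):
    """Convert an absolute pixel box to normalized YOLO (cx, cy, w, h)."""
    cx = (left + width / 2.0) / img_w
    cy = (top + height / 2.0) / img_h
    return cx, cy, width / img_w, height / img_h

# example on a 1920x1080 frame: a centered box covering half the frame
print(to_yolo_format(480, 270, 960, 540, 1920, 1080))   # (0.5, 0.5, 0.5, 0.5)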

Trtexec multi-source (streams) and multi-batch performance test failed

Description
I want to test the performance of the model with multiple streams and multiple batch sizes (https://github.com/NVIDIA-AI-IOT/yolo_deepstream#performance) with the trtexec command, and I test it with the following command:

/usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_b16_int8_qat_640.engine --shapes=images:4x3x640x640 --streams=4

PS:
The .engine model was converted with the following command (dynamic batch):

/usr/src/tensorrt/bin/trtexec --verbose --onnx=yolov7_qat_640.onnx --workspace=4096 --minShapes=images:1x3x640x640 --optShapes=images:12x3x640x640 --maxShapes=images:16x3x640x640 --saveEngine=yolov7_b16_int8_qat_640.engine --fp16 --int8

but the following error occurs.

[06/02/2023-09:24:37] [I] === Model Options ===
[06/02/2023-09:24:37] [I] Format: *
[06/02/2023-09:24:37] [I] Model: 
[06/02/2023-09:24:37] [I] Output:
[06/02/2023-09:24:37] [I] === Build Options ===
[06/02/2023-09:24:37] [I] Max batch: explicit batch
[06/02/2023-09:24:37] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/02/2023-09:24:37] [I] minTiming: 1
[06/02/2023-09:24:37] [I] avgTiming: 8
[06/02/2023-09:24:37] [I] Precision: FP32
[06/02/2023-09:24:37] [I] LayerPrecisions: 
[06/02/2023-09:24:37] [I] Calibration: 
[06/02/2023-09:24:37] [I] Refit: Disabled
[06/02/2023-09:24:37] [I] Sparsity: Disabled
[06/02/2023-09:24:37] [I] Safe mode: Disabled
[06/02/2023-09:24:37] [I] DirectIO mode: Disabled
[06/02/2023-09:24:37] [I] Restricted mode: Disabled
[06/02/2023-09:24:37] [I] Build only: Disabled
[06/02/2023-09:24:37] [I] Save engine: 
[06/02/2023-09:24:37] [I] Load engine: yolov7_b16_int8_qat_640.engine
[06/02/2023-09:24:37] [I] Profiling verbosity: 0
[06/02/2023-09:24:37] [I] Tactic sources: Using default tactic sources
[06/02/2023-09:24:37] [I] timingCacheMode: local
[06/02/2023-09:24:37] [I] timingCacheFile: 
[06/02/2023-09:24:37] [I] Heuristic: Disabled
[06/02/2023-09:24:37] [I] Preview Features: Use default preview flags.
[06/02/2023-09:24:37] [I] Input(s)s format: fp32:CHW
[06/02/2023-09:24:37] [I] Output(s)s format: fp32:CHW
[06/02/2023-09:24:37] [I] Input build shape: images=4x3x640x640+4x3x640x640+4x3x640x640
[06/02/2023-09:24:37] [I] Input calibration shapes: model
[06/02/2023-09:24:37] [I] === System Options ===
[06/02/2023-09:24:37] [I] Device: 0
[06/02/2023-09:24:37] [I] DLACore: 
[06/02/2023-09:24:37] [I] Plugins:
[06/02/2023-09:24:37] [I] === Inference Options ===
[06/02/2023-09:24:37] [I] Batch: Explicit
[06/02/2023-09:24:37] [I] Input inference shape: images=4x3x640x640
[06/02/2023-09:24:37] [I] Iterations: 10
[06/02/2023-09:24:37] [I] Duration: 3s (+ 200ms warm up)
[06/02/2023-09:24:37] [I] Sleep time: 0ms
[06/02/2023-09:24:37] [I] Idle time: 0ms
[06/02/2023-09:24:37] [I] Streams: 4
[06/02/2023-09:24:37] [I] ExposeDMA: Disabled
[06/02/2023-09:24:37] [I] Data transfers: Enabled
[06/02/2023-09:24:37] [I] Spin-wait: Disabled
[06/02/2023-09:24:37] [I] Multithreading: Disabled
[06/02/2023-09:24:37] [I] CUDA Graph: Disabled
[06/02/2023-09:24:37] [I] Separate profiling: Disabled
[06/02/2023-09:24:37] [I] Time Deserialize: Disabled
[06/02/2023-09:24:37] [I] Time Refit: Disabled
[06/02/2023-09:24:37] [I] NVTX verbosity: 0
[06/02/2023-09:24:37] [I] Persistent Cache Ratio: 0
[06/02/2023-09:24:37] [I] Inputs:
[06/02/2023-09:24:37] [I] === Reporting Options ===
[06/02/2023-09:24:37] [I] Verbose: Disabled
[06/02/2023-09:24:37] [I] Averages: 10 inferences
[06/02/2023-09:24:37] [I] Percentiles: 90,95,99
[06/02/2023-09:24:37] [I] Dump refittable layers:Disabled
[06/02/2023-09:24:37] [I] Dump output: Disabled
[06/02/2023-09:24:37] [I] Profile: Disabled
[06/02/2023-09:24:37] [I] Export timing to JSON file: 
[06/02/2023-09:24:37] [I] Export output to JSON file: 
[06/02/2023-09:24:37] [I] Export profile to JSON file: 
[06/02/2023-09:24:37] [I] 
[06/02/2023-09:24:37] [I] === Device Information ===
[06/02/2023-09:24:37] [I] Selected Device: Xavier
[06/02/2023-09:24:37] [I] Compute Capability: 7.2
[06/02/2023-09:24:37] [I] SMs: 8
[06/02/2023-09:24:37] [I] Compute Clock Rate: 1.377 GHz
[06/02/2023-09:24:37] [I] Device Global Memory: 31002 MiB
[06/02/2023-09:24:37] [I] Shared Memory per SM: 96 KiB
[06/02/2023-09:24:37] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/02/2023-09:24:37] [I] Memory Clock Rate: 1.377 GHz
[06/02/2023-09:24:37] [I] 
[06/02/2023-09:24:37] [I] TensorRT version: 8.5.2
[06/02/2023-09:24:38] [I] Engine loaded in 0.0275892 sec.
[06/02/2023-09:24:38] [I] [TRT] Loaded engine size: 39 MiB
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +41, now: CPU 0, GPU 41 (MiB)
[06/02/2023-09:24:39] [I] Engine deserialized in 1.04122 sec.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +364, now: CPU 0, GPU 405 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +363, now: CPU 0, GPU 768 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +363, now: CPU 0, GPU 1131 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +364, now: CPU 1, GPU 1495 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Starting inference
[06/02/2023-09:24:39] [E] Error[2]: [executionContext.cpp::enqueueV3::2386] Error Code 2: Internal Error (Assertion mOptimizationProfile >= 0 failed. )
[06/02/2023-09:24:39] [E] Error occurred during inference

Environment

TensorRT Version : 8.5.2
GPU Type : Jetson AGX Xavier
Nvidia Driver Version :
CUDA Version : 11.4.315
CUDNN Version : 8.6.0.166
Operating System + Version : 35.2.1 ( Jetpack: 5.1)
Python Version (if applicable) : Python 3.8.10
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) : 1.12.0a0+2c916ef.nv22.3

How to auto-insert Q/DQ in the shortcut branch of a network with a residual structure?

I see you define some rules to align scale settings and to batch-rename modules (from torch.nn.* to quant_nn.*); very good design!

Now, without modifying the forward code, I am trying to use torch.fx to replace [operator.add, torch.add, "add"] (as call_function) with a call_module (a custom nn.Module which includes a residual quantizer:
self.residual_quantizer = quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input)
).

How can I auto-insert Q/DQ in the shortcut branch of a network with a residual structure?
Can you give me some advice, or suggest another method? Thanks!
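A minimal, hedged sketch of the call_module replacement described above: a small wrapper that runs both operands of the residual add through a shared TensorQuantizer, so the shortcut branch also gets a Q/DQ node (names are illustrative, not the repo's exact rules.py code):

import torch
from pytorch_quantization import nn as quant_nn

class QuantAdd(torch.nn.Module):
    """Residual add with a Q/DQ node on both inputs.

    Sharing one TensorQuantizer keeps the two operands' scales aligned, which is
    what allows TensorRT to fuse the add into the preceding convolution.
    """
    def __init__(self):
        super().__init__()
        self.residual_quantizer = quant_nn.TensorQuantizer(
            quant_nn.QuantConv2d.default_quant_desc_input)

    def forward(self, x, y):
        return self.residual_quantizer(x) + self.residual_quantizer(y)

With torch.fx, occurrences of operator.add / torch.add in the traced graph can then be rewritten as call_module nodes pointing at instances of such a wrapper, which avoids touching the model's forward code.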

Why is batch-size=1 recommended in the config files?

Hi all, I need to deploy the model with dynamic batching on DS-Triton, but the YOLOv4 example in DeepStream says the following properties are always recommended: # batch-size (default=1)

I ran a test with yolov3 comparing BS=8 vs BS=1, and it showed poor performance (0.24x) running the TensorRT engine with DeepStream 5.1:

**Throughput FPS (avg) | INT8 **

BS =1 → **PERF: 246.29 (245.98)
BS =8 → **PERF: 60.31 (60.63)

What do you recommend to work with BS>1?

How to convert YOLOv4 Pytorch-ONNX --> TensorRT engine INT8 mode with INT8 calibration and Dynamic Shape

Hi all, is there a sample that shows how to optimize a YOLOv4 PyTorch-ONNX model to a TensorRT INT8 engine with full INT8 calibration and dynamic input shapes? I have generated the INT8 engine at runtime with DeepStream, but it does not generate the engine with dynamic input shapes. Using trtexec, it builds the engine with static input shapes (default BS=1) and does not provide calibration capability.
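One hedged route is the TensorRT Python API, which lets a single build carry both an optimization profile (for dynamic shapes) and an INT8 calibrator; a rough sketch assuming an ONNX model exported with a dynamic batch axis (file names, the input tensor name "input", and the toy calibrator data are illustrative):

import numpy as np
import pycuda.autoinit   # noqa: F401  creates a CUDA context for the calibrator
import pycuda.driver as cuda
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.INFO)

class ToyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds random batches; replace with real preprocessed calibration images."""
    def __init__(self, batch_size=8, n_batches=16, shape=(3, 416, 416)):
        super().__init__()
        self.batch_size, self.n_batches, self.shape, self.count = batch_size, n_batches, shape, 0
        self.device_mem = cuda.mem_alloc(int(batch_size * np.prod(shape) * 4))  # float32 bytes

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.count >= self.n_batches:
            return None
        batch = np.random.rand(self.batch_size, *self.shape).astype(np.float32)
        cuda.memcpy_htod(self.device_mem, np.ascontiguousarray(batch))
        self.count += 1
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        pass

builder = trt.Builder(LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, LOGGER)
with open("yolov4_dynamic.onnx", "rb") as f:          # ONNX exported with a dynamic batch dim
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = ToyCalibrator()

profile = builder.create_optimization_profile()       # dynamic batch 1..16, opt at 8
profile.set_shape("input", (1, 3, 416, 416), (8, 3, 416, 416), (16, 3, 416, 416))
config.add_optimization_profile(profile)
config.set_calibration_profile(profile)               # calibrate at the profile's opt shape

with open("yolov4_int8_dynamic.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))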

Hang issue in Tesla T4 GPU

I am able to build the .so file for deepstream_yolo, but on running deepstream-app I am getting a hang issue.
$ deepstream-app -c deepstream_app_config_yolo.txt

I am getting this log:

`WARNING: [TRT]: CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
WARNING: ../nvdsinfer/nvdsinfer_model_builder.cpp:1487 Deserialize engine failed because file path: /mnt/home/nawin.ks/Model_inference/deepstream_yolo_configTry/yolov4_-1_3_416_416_nms_dynamic.onnx_b16_gpu0_fp16.engine open error
0:00:03.335773476  6275 0x55a15ac71300 WARN                 nvinfer gstnvinfer.cpp:677:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1897> [UID = 1]: deserialize engine from file :/mnt/home/nawin.ks/Model_inference/deepstream_yolo_configTry/yolov4_-1_3_416_416_nms_dynamic.onnx_b16_gpu0_fp16.engine failed
0:00:03.384774338  6275 0x55a15ac71300 WARN                 nvinfer gstnvinfer.cpp:677:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2002> [UID = 1]: deserialize backend context from engine from file :/mnt/home/nawin.ks/Model_inference/deepstream_yolo_configTry/yolov4_-1_3_416_416_nms_dynamic.onnx_b16_gpu0_fp16.engine failed, try rebuild
0:00:03.384804720  6275 0x55a15ac71300 INFO                 nvinfer gstnvinfer.cpp:680:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1923> [UID = 1]: Trying to create engine from model files
WARNING: [TRT]: CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
WARNING: [TRT]: onnx2trt_utils.cpp:377: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
WARNING: [TRT]: onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: builtin_op_importers.cpp:5245: Attribute scoreBits not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build.
WARNING: [TRT]: builtin_op_importers.cpp:5245: Attribute caffeSemantics not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build.
WARNING: [TRT]: Using PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805 can help improve performance and resolve potential functional issues.
WARNING: [TRT]: Using PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805 can help improve performance and resolve potential functional issues.
WARNING: [TRT]: TensorRT was linked against cuDNN 8.6.0 but loaded cuDNN 8.2.2`

After the last line of the log, it does nothing and does not terminate.

I do not get the hang issue when replacing the YOLOv4 ONNX model with a lighter ResNet10 ONNX model and disabling the use of libnvdsinfer_custom_impl_Yolo.so and the NvDsInferParseCustomYoloV4 function.

Deepstream version: 6.2
Tensorrt version: 2.5.2
Platform: AWS EC2 g4dn.2xlarge

Can you please help me on this?

Using NHWC format instead of NCHW for deepstream

Hi, I'm using the deepstream-test1 and yolo_deepstream apps with my own INT8 TensorRT engine. The model takes input in NHWC format, but the app expects input in NCHW format.
I had followed this link to get the INT8 TensorRT engine.
If DeepStream doesn't work with NHWC, then please guide me on converting the engine/onnx/saved_model (TensorFlow pb file) to NCHW; see also the sketch after the environment details below.

Currently I'm getting this:
INFO: [Implicit Engine Info]: layers num: 5
0 INPUT kFLOAT image_arrays:0 640x640x3
1 OUTPUT kINT32 num_detections 0
2 OUTPUT kFLOAT detection_boxes 1024x4
3 OUTPUT kFLOAT detection_scores 1024
4 OUTPUT kFLOAT detection_classes 1024

0:00:08.703674449 4915 0x558fe32520 ERROR nvinfer gstnvinfer.cpp:613:gst_nvinfer_logger: NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::preparePreprocess() <nvdsinfer_context_impl.cpp:874> [UID = 1]: RGB/BGR input format specified but network input channels is not 3

• Hardware Platform Jetson
• DeepStream Version 5.0
• JetPack Version 4.4
• TensorRT Version 7.2
• Issue Type questions
• Jetson NX
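If rebuilding the model in NCHW is not possible, one hedged workaround is to give the ONNX graph a new NCHW input and insert a Transpose that feeds the original NHWC consumers, then build the engine from the modified ONNX. A rough onnx-graphsurgeon sketch (file names are illustrative, and a single 4-D NHWC input is assumed):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model_nhwc.onnx"))
old_input = graph.inputs[0]                     # e.g. "image_arrays:0", shape [N, 640, 640, 3]
n, h, w, c = old_input.shape

# New NCHW graph input plus a Transpose back to NHWC for the existing layers.
new_input = gs.Variable(name="input_nchw", dtype=old_input.dtype, shape=[n, c, h, w])
to_nhwc = gs.Variable(name="input_as_nhwc", dtype=old_input.dtype)
transpose = gs.Node(op="Transpose", attrs={"perm": [0, 2, 3, 1]},
                    inputs=[new_input], outputs=[to_nhwc])

# Rewire every consumer of the old NHWC input to read the transposed tensor.
for node in graph.nodes:
    node.inputs = [to_nhwc if t is old_input else t for t in node.inputs]

graph.nodes.append(transpose)
graph.inputs = [new_input]
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_nchw_input.onnx")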

yolov5 QAT: the Add graph still uses a useless data conversion node

Thank you for your repo; it's really good for understanding the QAT tool.

Now I am working on QAT for yolov5.
I added Q/DQ nodes following your advice (see the attached ONNX graph screenshot).

But there is still a useless FP16-to-INT8 conversion; see the attached trex graph screenshot.

How can I remove the useless conversion node?
Thanks,
Maro JEON

Error when building the code

../Makefile.config:26: CUDA_INSTALL_DIR variable is not specified, using /usr/local/cuda by default, use CUDA_INSTALL_DIR=<cuda_directory> to change.
../Makefile.config:31: CUDNN_INSTALL_DIR variable is not specified, using /usr/local/cuda by default, use CUDNN_INSTALL_DIR=<cudnn_directory> to change.
../Makefile.config:44: TRT_LIB_DIR is not specified, searching ../../lib, ../../lib, ../lib by default, use TRT_LIB_DIR=<trt_lib_directory> to change.
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: SampleYolo.cpp
In file included from SampleYolo.cpp:23:0:
SampleYolo.hpp:41:10: fatal error: opencv2/opencv.hpp: No such file or directory
#include <opencv2/opencv.hpp>
^~~~~~~~~~~~~~~~~~~~
compilation terminated.
../Makefile.config:338: recipe for target '../bin/dchobj/SampleYolo.o' failed
make: *** [../bin/dchobj/SampleYolo.o] Error 1

yolov7_qat

Will yolov7_qat support custom data training instead of COCO in the near future?

export onnx error

In tensor_quantizer.py, line 293:

outputs = torch.fake_quantize_per_channel_affine(
inputs, scale.data, torch.zeros_like(scale, dtype=torch.int32).data, quant_dim,
-bound - 1 if not self._unsigned else 0, bound)

RuntimeError: Zero-point must be Long, found Int
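The error message says the zero-point tensor must be int64 (Long) while the quoted line builds it as int32, so a hedged local workaround is to change the dtype in that call (shown as a fragment in the same style as the snippet above; this patches the installed pytorch_quantization file, not this repo):

# fragment of pytorch_quantization/nn/modules/tensor_quantizer.py around line 293,
# with the zero-point tensor created as Long instead of Int
outputs = torch.fake_quantize_per_channel_affine(
    inputs, scale.data,
    torch.zeros_like(scale, dtype=torch.long).data,   # was dtype=torch.int32
    quant_dim,
    -bound - 1 if not self._unsigned else 0, bound)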

Plugin is not working on TX2 NX for Yolo V4

Hi,

I am using JetPack 4.6. I tried both the prebuilt libnvinfer_plugin.so.8.0.1 from this repo (https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps/tree/master/TRT-OSS/Jetson/TRT8.0) and building the plugin myself. Both show me the following error.

[04/25/2022-16:52:55] [E] [TRT] 2: [pluginV2Runner.cpp::execute::267] Error Code 2: Internal Error (Assertion status == kSTATUS_SUCCESS failed.)
&&&& FAILED TensorRT.sample_yolo [TensorRT v8001] # ../bin/yolov4 --fp16

Jetpack 4.6 comes with:
TensorRT 8.0.1
CuDNN 8.2.1
CUDA 10.2

Need help.
Thanks,

[E] [TRT] ../rtSafe/cuda/cudaConvolutionRunner.cpp (457) - Cudnn Error in execute: 3 (CUDNN_STATUS_BAD_PARAM)

my env:
Ubuntu 18.04
cuda 10.2
cudnn 7.6.5
tensorrt 7.1.3
tensorrt oss libnvinfer_plugin.so.7.1.3

cudnn 7.6.5 installed from tar archive, CUDNN_INSTALL_DIR=/user/local/cuda
onnx created with https://github.com/Tianxiaomo/pytorch-YOLOv4
In SampleYolo.cpp, in bool SampleYolo::build(), I added this code:

       profile->setDimensions("input", OptProfileSelector::kMIN, Dims4{1, 3, 320, 320});
       profile->setDimensions("input", OptProfileSelector::kOPT, Dims4{1, 3, 320, 320});
       profile->setDimensions("input", OptProfileSelector::kMAX, Dims4{1, 3, 320, 320});
       config->addOptimizationProfile(profile);

In onnx_add_nms_plugin.py, I changed the nms_node to "BatchedNMSDynamic_TRT".

$ make
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: SampleYolo.cpp
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: main.cpp
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: ../common/sampleInference.cpp
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: ../common/sampleOptions.cpp
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: ../common/logger.cpp
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: ../common/getOptions.cpp
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: ../common/sampleReporting.cpp
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/dchobj/../common; fi; :
Compiling: ../common/sampleEngines.cpp
Linking: ../bin/yolov4_debug
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/chobj/../common; fi; :
Compiling: SampleYolo.cpp
if [ ! -d ../bin/chobj/../common ]; then mkdir -p ../bin/chobj/../common; fi; :
Compiling: main.cpp
Linking: ../bin/yolov4
# Copy every EXTRA_FILE of this sample to bin dir
$ ../bin/yolov4 -demo
&&&& RUNNING TensorRT.sample_yolo # ../bin/yolov4 -demo
There are 0 coco images to process
[07/18/2021-13:20:19] [I] Building and running a GPU inference engine for Yolo
[07/18/2021-13:20:19] [I] Parsing ONNX file: ../data/yolov4.onnx
[07/18/2021-13:20:20] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/18/2021-13:20:20] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[07/18/2021-13:20:20] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[07/18/2021-13:20:20] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[07/18/2021-13:20:20] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[07/18/2021-13:20:20] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[07/18/2021-13:20:20] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[07/18/2021-13:20:20] [I] [TRT] ModelImporter.cpp:135: No importer registered for op: BatchedNMSDynamic_TRT. Attempting to import as plugin.
[07/18/2021-13:20:20] [I] [TRT] builtin_op_importers.cpp:3659: Searching for plugin: BatchedNMSDynamic_TRT, plugin_version: 1, plugin_namespace: 
[07/18/2021-13:20:20] [I] [TRT] builtin_op_importers.cpp:3676: Successfully created plugin: BatchedNMSDynamic_TRT
[07/18/2021-13:20:20] [W] [TRT] Output type must be INT32 for shape outputs
[07/18/2021-13:20:20] [W] [TRT] Output type must be INT32 for shape outputs
[07/18/2021-13:20:20] [I] Building TensorRT engine../data/yolov4.engine
[07/18/2021-13:21:00] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[07/18/2021-13:22:05] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[07/18/2021-13:22:07] [I] TRT Engine file saved to: ../data/yolov4.engine
4
[07/18/2021-13:22:07] [I] Loading or building yolo model done
[07/18/2021-13:22:07] [E] [TRT] ../rtSafe/cuda/cudaConvolutionRunner.cpp (457) - Cudnn Error in execute: 3 (CUDNN_STATUS_BAD_PARAM)
[07/18/2021-13:22:07] [E] [TRT] FAILED_EXECUTION: std::exception
Time consumed in preProcess: 0
Time consumed in model: 0
Time consumed in postProcess: 0
[07/18/2021-13:22:07] [I] Inference of yolo model done

Shared memory on Jetson devices

In pushImg, we have to copy memory from CPU to GPU or device to device. For Jetson devices such as AGX Orin, can we save this step by leveraging shared memory?
