yolov9-qat's Introduction

YOLOv9 QAT for TensorRT

This repository contains an implementation of YOLOv9 with Quantization-Aware Training (QAT), specifically designed for deployment on platforms utilizing TensorRT for hardware-accelerated inference.
This implementation aims to provide an efficient, low-latency version of YOLOv9 for real-time detection applications.
If you do not intend to deploy your model using TensorRT, it is recommended not to proceed with this implementation.

  • The files in this repository represent a patch that adds QAT functionality to the original YOLOv9 repository.
  • This patch is intended to be applied to the main YOLOv9 repository to incorporate the ability to train with QAT.
  • The implementation is optimized to work efficiently with TensorRT, an inference library that leverages hardware acceleration to enhance inference performance.
  • Users interested in implementing object detection using YOLOv9 with QAT on TensorRT platforms can benefit from this repository as it provides a ready-to-use solution.

We use TensorRT's PyTorch quantization tool to fine-tune YOLOv9 with QAT from the pre-trained weights, then export the model to ONNX and deploy it with TensorRT. Accuracy and performance are reported in the tables below.

For those who are not familiar with QAT, I highly recommend watching this video:
Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

Currently, quantization is only available for object detection models. However, since quantization has the greatest impact on the YOLOv9 model's backbone and the backbone remains unchanged, quantization is essentially ready for all YOLOv9 model types.
Quantization support for segmentation models has not yet been released, as it requires developing an evaluation procedure and validating the quantization of the model's last layers.
🌟 We still have plenty of nodes to improve Q/DQ, and we rely on the community's contribution to enhance this project, benefiting us all. Let's collaborate and make it even better! 🚀

Release Highlights

  • This release includes an upgrade from TensorRT 8 to TensorRT 10, ensuring compatibility with the CUDA version supported by the latest NVIDIA Ada Lovelace GPUs.
  • Inference has been upgraded to use enqueueV3 instead of enqueueV2.
  • To maintain legacy support for TensorRT 8, a dedicated branch has been created (outdated).
  • We've added a new option --generate-graph which enables Graph Rendering functionality. This feature facilitates the creation of graphical representations of the engine plan in SVG image format.

Performance / Accuracy

Full Report

Accuracy Report


Evaluation Results

Activation SiLU

Eval Model         AP      AP50    Precision   Recall
Origin (Pytorch)   0.529   0.699   0.743       0.634
INT8 (Pytorch)     0.529   0.702   0.742       0.63
INT8 (TensorRT)    0.529   0.696   0.739       0.635

Activation ReLU

Eval Model         AP      AP50    Precision   Recall
Origin (Pytorch)   0.519   0.69    0.719       0.629
INT8 (Pytorch)     0.518   0.69    0.726       0.625
INT8 (TensorRT)    0.517   0.685   0.723       0.626

Evaluation Comparison

Activation SiLU

Eval Model                            AP      AP50     Precision   Recall
INT8 (TensorRT) vs Origin (Pytorch)   0.000   -0.003   -0.004      +0.001

Activation ReLU

Eval Model                            AP       AP50     Precision   Recall
INT8 (TensorRT) vs Origin (Pytorch)   -0.002   -0.005   +0.004      -0.003
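
The comparison rows above are the per-metric difference between the INT8 (TensorRT) and Origin (Pytorch) rows of the evaluation tables. As a minimal sketch (the function and variable names are illustrative and not part of the repository), the deltas can be recomputed like this:

```python
# Metric values copied from the SiLU and ReLU evaluation tables above.
METRICS = ("AP", "AP50", "Precision", "Recall")

RESULTS = {
    "SiLU": {
        "Origin (Pytorch)": (0.529, 0.699, 0.743, 0.634),
        "INT8 (TensorRT)": (0.529, 0.696, 0.739, 0.635),
    },
    "ReLU": {
        "Origin (Pytorch)": (0.519, 0.69, 0.719, 0.629),
        "INT8 (TensorRT)": (0.517, 0.685, 0.723, 0.626),
    },
}


def metric_deltas(baseline, candidate, digits=3):
    """Return candidate - baseline for each metric, rounded to `digits` decimals."""
    return tuple(round(c - b, digits) for b, c in zip(baseline, candidate))


def format_delta(value):
    """Format a delta with an explicit sign; an exact zero is shown unsigned."""
    return f"{value:+.3f}" if value else "0.000"


if __name__ == "__main__":
    for activation, rows in RESULTS.items():
        deltas = metric_deltas(rows["Origin (Pytorch)"], rows["INT8 (TensorRT)"])
        cells = ", ".join(f"{m}={format_delta(d)}" for m, d in zip(METRICS, deltas))
        print(f"{activation}: {cells}")
```

Rounding to three decimals matches the precision of the reported metrics and hides floating-point noise in the subtraction.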

Latency/Throughput Report - TensorRT



Device NVIDIA GeForce RTX 4090
Compute Capability 8.9
SMs 128
Device Global Memory 24207 MiB
Application Compute Clock Rate 2.58 GHz
Application Memory Clock Rate 10.501 GHz


Model Name    Batch Size   Latency (99%)   Throughput (qps)   Total Inferences (IPS)
FP16 (SiLU)   1            1.25 ms         803                803
FP16 (SiLU)   4            3.37 ms         300                1200
FP16 (SiLU)   8            6.6 ms          153                1224
FP16 (SiLU)   12           10 ms           99                 1188
INT8 (SiLU)   1            0.97 ms         1030               1030
INT8 (SiLU)   4            2.06 ms         486                1944
INT8 (SiLU)   8            3.69 ms         271                2168
INT8 (SiLU)   12           5.36 ms         189                2268
INT8 (ReLU)   1            0.87 ms         1150               1150
INT8 (ReLU)   4            1.78 ms         562                2248
INT8 (ReLU)   8            3.06 ms         327                2616
INT8 (ReLU)   12           4.63 ms         217                2604
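
The Total Inferences (IPS) column appears to be the throughput multiplied by the batch size, since each query processes one full batch. The sketch below checks that reading against the table; the helper name is illustrative and the rows are copied from the table above.

```python
# (model, batch size, throughput in qps, reported IPS) copied from the table above.
ROWS = [
    ("FP16 (SiLU)", 1, 803, 803),
    ("FP16 (SiLU)", 4, 300, 1200),
    ("FP16 (SiLU)", 8, 153, 1224),
    ("FP16 (SiLU)", 12, 99, 1188),
    ("INT8 (SiLU)", 1, 1030, 1030),
    ("INT8 (SiLU)", 4, 486, 1944),
    ("INT8 (SiLU)", 8, 271, 2168),
    ("INT8 (SiLU)", 12, 189, 2268),
    ("INT8 (ReLU)", 1, 1150, 1150),
    ("INT8 (ReLU)", 4, 562, 2248),
    ("INT8 (ReLU)", 8, 327, 2616),
    ("INT8 (ReLU)", 12, 217, 2604),
]


def total_inferences_per_second(throughput_qps, batch_size):
    """Each query carries `batch_size` images, so images/s = qps * batch size."""
    return throughput_qps * batch_size


if __name__ == "__main__":
    for model, batch, qps, reported in ROWS:
        computed = total_inferences_per_second(qps, batch)
        status = "ok" if computed == reported else "mismatch"
        print(f"{model:12s} batch={batch:<2d} qps={qps:<4d} IPS={computed:<4d} {status}")
```

Read this way, the batch size with the highest IPS, rather than the highest qps, gives the best total image throughput.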

Latency/Throughput Comparison (INT8 vs FP16)

Model Name            Batch Size   Latency (99%) Change   Throughput (qps) Change   Total Inferences (IPS) Change
INT8 (SiLU) vs FP16   1            -20.8%                 +28.4%                    +28.4%
INT8 (SiLU) vs FP16   4            -37.1%                 +62.0%                    +62.0%
INT8 (SiLU) vs FP16   8            -41.1%                 +77.0%                    +77.0%
INT8 (SiLU) vs FP16   12           -46.9%                 +90.9%                    +90.9%

QAT Training (Finetune)

In this section, we'll outline the steps to perform Quantization-Aware Training (QAT) using fine-tuning.
Please note that the supported quantization mode is fine-tuning only.
The model should be trained using the original implementation, and after training and reparameterization of the model, the user should proceed with quantization.


  1. Train the Model Using Training Session:

    • Utilize the original implementation to train your YOLOv9 model with your dataset and desired configurations.
    • Follow the training instructions provided in the original YOLOv9 repository to ensure proper training.
  2. Reparameterize the Model

    • After completing the training, reparameterize the trained model to prepare it for quantization. This step is crucial for ensuring that the model's weights are in a suitable format for quantization.
  3. Proceed with Quantization:

    • Once the model is reparameterized, proceed with the quantization process. This involves applying the Quantization-Aware Training technique to fine-tune the model's weights, taking into account the quantization effects.
  4. Eval Pytorch / Eval TensorRT:

    • After quantization, it's crucial to validate the performance of the quantized model to ensure that it meets your requirements in terms of accuracy and efficiency.
    • Test the quantized model thoroughly at both stages: during the quantization phase using PyTorch and after training using TensorRT.
    • Please note that different versions of TensorRT may yield varying results and performance.
  5. Export to ONNX:

    • Export ONNX
    • Once you are satisfied with the quantized model's performance, you can proceed to export it to ONNX format.
  6. Deploy with TensorRT:

    • Deployment with TensorRT
    • After exporting to ONNX, you can deploy the model using TensorRT for hardware-accelerated inference on platforms supporting TensorRT.

By following these steps, you can successfully perform Quantization-Aware Training (QAT) using fine-tuning with your YOLOv9 model.

How to Install and Train

We suggest using a Docker environment based on the NVIDIA PyTorch image.

Release 23.02 is based on CUDA 12.0.1, which requires NVIDIA Driver release 525 or later.


docker pull

## clone original yolov9
git clone

docker run --gpus all  \
 -it \
 --net host  \
 --ipc=host \
 -v $(pwd)/yolov9:/yolov9 \
 -v $(pwd)/coco/:/yolov9/coco \
 -v $(pwd)/runs:/yolov9/runs \
  1. Clone and apply patch (Inside Docker)
cd /
git clone
cd /yolov9-qat
./ /yolov9
  2. Install dependencies
  • This release upgrades TensorRT from 8.5 to 10.0.
  • ./ --defaults [--trex]
  • --defaults: Install/upgrade required packages.
  • --trex: Install TensorRT Explorer (trex) in a virtual env. Required only if you want to generate the Graph SVG for visualizing the profiling of a TensorRT engine.
cd /yolov9-qat
./ --defaults
cd /yolov9
  3. Download dataset and pretrained model
$ cd /yolov9
$ bash scripts/
$ wget


Quantize Model

To quantize a YOLOv9 model, run:

python3 quantize --weights  --name yolov9_qat --exist-ok

python quantize --weights <weights_path> --data <data_path> --hyp <hyp_path> ...

Quantize Command Arguments


This command is used to perform PTQ/QAT finetuning.


  • --weights: Path to the model weights (.pt). Default: ROOT/runs/models_original/
  • --data: Path to the dataset configuration file (data.yaml). Default: ROOT/data/coco.yaml.
  • --hyp: Path to the hyperparameters file (hyp.yaml). Default: ROOT/data/hyps/hyp.scratch-high.yaml.
  • --device: Device to use for training/evaluation (e.g., "cuda:0"). Default: "cuda:0".
  • --batch-size: Total batch size for training/evaluation. Default: 10.
  • --imgsz, --img, --img-size: Train/val image size (pixels). Default: 640.
  • --project: Directory to save the training/evaluation outputs. Default: ROOT/runs/qat.
  • --name: Name of the training/evaluation experiment. Default: 'exp'.
  • --exist-ok: Flag to indicate if existing project/name should be overwritten.
  • --iters: Iterations per epoch. Default: 200.
  • --seed: Global training seed. Default: 57.
  • --supervision-stride: Supervision stride. Default: 1.
  • --no-eval-origin: Disable eval for origin model.
  • --no-eval-ptq: Disable eval for ptq model.

Sensitive Layer Analysis

python sensitive --weights --data data/coco.yaml --hyp hyp.scratch-high.yaml ...

Sensitive Command Arguments


This command is used for sensitive layer analysis.


  • --weights: Path to the model weights (.pt). Default: ROOT/runs/models_original/
  • --device: Device to use for training/evaluation (e.g., "cuda:0"). Default: "cuda:0".
  • --data: Path to the dataset configuration file (data.yaml). Default: data/coco.yaml.
  • --batch-size: Total batch size for training/evaluation. Default: 10.
  • --imgsz, --img, --img-size: Train/val image size (pixels). Default: 640.
  • --hyp: Path to the hyperparameters file (hyp.yaml). Default: data/hyps/hyp.scratch-high.yaml.
  • --project: Directory to save the training/evaluation outputs. Default: ROOT/runs/qat_sentive.
  • --name: Name of the training/evaluation experiment. Default: 'exp'.
  • --exist-ok: Flag to indicate if existing project/name should be overwritten.
  • --num-image: Number of images to evaluate. Default: None.

Evaluate QAT Model

Evaluate using Pytorch

python3 eval --weights runs/qat/yolov9_qat/weights/  --name eval_qat_yolov9

Evaluation Command Arguments


This command is used to perform evaluation on QAT Models.


  • --weights: Path to the model weights (.pt). Default: ROOT/runs/models_original/
  • --data: Path to the dataset configuration file (data.yaml). Default: data/coco.yaml.
  • --batch-size: Total batch size for evaluation. Default: 10.
  • --imgsz, --img, --img-size: Validation image size (pixels). Default: 640.
  • --device: Device to use for evaluation (e.g., "cuda:0"). Default: "cuda:0".
  • --conf-thres: Confidence threshold for evaluation. Default: 0.001.
  • --iou-thres: NMS threshold for evaluation. Default: 0.7.
  • --project: Directory to save the evaluation outputs. Default: ROOT/runs/qat_eval.
  • --name: Name of the evaluation experiment. Default: 'exp'.
  • --exist-ok: Flag to indicate if existing project/name should be overwritten.

Evaluate using TensorRT

./scripts/ <weights> <data yaml>  <image_size>

./scripts/ runs/qat/yolov9_qat/weights/ data/coco.yaml 640

Generate TensorRT Profiling and SVG image

TensorRT Explorer can be installed by executing ./ --trex.
This installation is necessary to enable the generation of the Graph SVG, which visualizes the profiling data for a TensorRT engine.

./scripts/ runs/qat/yolov9_qat/weights/ data/coco.yaml 640 --generate-graph

Export ONNX

The goal of exporting to ONNX is deployment with TensorRT, not ONNX Runtime, so we only export the fake-quantized model in a form TensorRT accepts. Each fake-quantization node is broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. TensorRT takes the generated ONNX graph and executes it in INT8 in the most optimized way its capabilities allow.
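
As an illustration of what a QuantizeLinear/DequantizeLinear pair computes, the following plain-Python sketch applies symmetric INT8 fake quantization to a single value. It is a simplified scalar model and not code from this repository; the function names are illustrative, and deriving the scale as `amax / 127` is an assumption based on the usual symmetric per-tensor calibration scheme.

```python
def quantize_linear(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Map a float to an INT8 code: round(x / scale) + zero_point, saturated to [qmin, qmax]."""
    q = round(x / scale) + zero_point  # Python's round() rounds halves to even
    return max(qmin, min(qmax, q))


def dequantize_linear(q, scale, zero_point=0):
    """Map an INT8 code back to a float: (q - zero_point) * scale."""
    return (q - zero_point) * scale


def fake_quantize(x, amax):
    """Quantize then dequantize, so the value stays float but carries INT8 rounding error."""
    scale = amax / 127.0  # assumption: symmetric scale from a calibrated absolute maximum
    return dequantize_linear(quantize_linear(x, scale), scale)


if __name__ == "__main__":
    amax = 1.0
    for value in (0.3, -0.75, 2.5):
        print(f"{value:+.4f} -> {fake_quantize(value, amax):+.4f}")
```

Values inside `[-amax, amax]` come back with a small rounding error, while values outside that range saturate. QAT fine-tuning lets the weights adapt to exactly this effect.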

Export ONNX Model without End2End

python3 --weights runs/qat/yolov9_qat/weights/ --include onnx --dynamic --simplify --inplace

Export ONNX Model End2End

python3  --weights runs/qat/yolov9_qat/weights/ --include onnx_end2end

Deployment with TensorRT

 /usr/src/tensorrt/bin/trtexec \
  --onnx=runs/qat/yolov9_qat/weights/qat_best_yolov9-c-converted.onnx \
  --int8 --fp16  \
  --useCudaGraph \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:4x3x640x640 \
  --maxShapes=images:8x3x640x640


Note: To test FP16 models (such as Origin), remove the --int8 flag.

# Set variables batch_size and filepath_no_ext
export batch_size=4
export filepath_no_ext=runs/qat/yolov9_qat/weights/qat_best_yolov9-c-converted
trtexec \
	--onnx=${filepath_no_ext}.onnx \
	--fp16 \
	--int8 \
	--saveEngine=${filepath_no_ext}.engine \
	--timingCacheFile=${filepath_no_ext}.engine.timing.cache \
	--warmUp=500 \
	--duration=10  \
	--useCudaGraph \
	--useSpinWait \
	--noDataTransfers \
	--minShapes=images:1x3x640x640 \
	--optShapes=images:${batch_size}x3x640x640


=== Device Information ===
Available Devices:
  Device 0: "NVIDIA GeForce RTX 4090" 
Selected Device: NVIDIA GeForce RTX 4090
Selected Device ID: 0
Compute Capability: 8.9
SMs: 128
Device Global Memory: 24207 MiB
Shared Memory per SM: 100 KiB
Memory Bus Width: 384 bits (ECC disabled)
Application Compute Clock Rate: 2.58 GHz
Application Memory Clock Rate: 10.501 GHz

Output Details

  • Latency: the [min, max, mean, median, 99th percentile] of the engine latency measurements, taken when timing the engine without profiling layers.
  • Throughput: measured in queries (inferences) per second (QPS).
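
When comparing several runs, it can help to extract these two values from the trtexec performance summary automatically. The sketch below parses the `Throughput:` and `Latency:` lines with regular expressions. The sample text is taken from the batch size 1 output in this document; `parse_summary` and the pattern names are illustrative helpers, not part of trtexec or this repository.

```python
import re

# Two lines copied from a trtexec performance summary shown in this document.
SAMPLE = """\
Throughput: 1026.71 qps
Latency: min = 0.969727 ms, max = 0.975098 ms, mean = 0.972263 ms, median = 0.972656 ms, percentile(90%) = 0.973145 ms, percentile(95%) = 0.973633 ms, percentile(99%) = 0.974121 ms
"""

# Anchoring at the line start skips the "H2D Latency" and "D2H Latency" lines.
THROUGHPUT_RE = re.compile(r"^\s*Throughput:\s*([\d.]+)\s*qps", re.MULTILINE)
LATENCY_RE = re.compile(r"^\s*Latency:\s*(.+)$", re.MULTILINE)
FIELD_RE = re.compile(r"([A-Za-z]+(?:\(\d+%\))?)\s*=\s*([\d.]+)\s*ms")


def parse_summary(text):
    """Return throughput (qps) and the latency statistics (ms) from a summary."""
    throughput = THROUGHPUT_RE.search(text)
    latency = LATENCY_RE.search(text)
    if throughput is None or latency is None:
        raise ValueError("no Throughput/Latency lines found")
    fields = {name: float(value) for name, value in FIELD_RE.findall(latency.group(1))}
    return {"throughput_qps": float(throughput.group(1)), "latency_ms": fields}


if __name__ == "__main__":
    summary = parse_summary(SAMPLE)
    print(summary["throughput_qps"], summary["latency_ms"]["percentile(99%)"])
```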


INT8 (SiLU)

Batch Size 1

Throughput: 1026.71 qps
Latency: min = 0.969727 ms, max = 0.975098 ms, mean = 0.972263 ms, median = 0.972656 ms, percentile(90%) = 0.973145 ms, percentile(95%) = 0.973633 ms, percentile(99%) = 0.974121 ms
Enqueue Time: min = 0.00195312 ms, max = 0.0195312 ms, mean = 0.00228119 ms, median = 0.00219727 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00390625 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 0.969727 ms, max = 0.975098 ms, mean = 0.972263 ms, median = 0.972656 ms, percentile(90%) = 0.973145 ms, percentile(95%) = 0.973633 ms, percentile(99%) = 0.974121 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0019 s
Total GPU Compute Time: 9.98417 s

Batch Size 4

=== Performance summary ===
Throughput: 485.73 qps
Latency: min = 2.05176 ms, max = 2.06152 ms, mean = 2.05712 ms, median = 2.05713 ms, percentile(90%) = 2.05908 ms, percentile(95%) = 2.05957 ms, percentile(99%) = 2.06055 ms
Enqueue Time: min = 0.00195312 ms, max = 0.00708008 ms, mean = 0.00230195 ms, median = 0.00219727 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00415039 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 2.05176 ms, max = 2.06152 ms, mean = 2.05712 ms, median = 2.05713 ms, percentile(90%) = 2.05908 ms, percentile(95%) = 2.05957 ms, percentile(99%) = 2.06055 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0035 s
Total GPU Compute Time: 9.99553 s

Batch Size 8

=== Performance summary ===
Throughput: 271.107 qps
Latency: min = 3.6792 ms, max = 3.69775 ms, mean = 3.68694 ms, median = 3.68652 ms, percentile(90%) = 3.69043 ms, percentile(95%) = 3.69141 ms, percentile(99%) = 3.69336 ms
Enqueue Time: min = 0.00195312 ms, max = 0.0090332 ms, mean = 0.0023588 ms, median = 0.00231934 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00476074 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 3.6792 ms, max = 3.69775 ms, mean = 3.68694 ms, median = 3.68652 ms, percentile(90%) = 3.69043 ms, percentile(95%) = 3.69141 ms, percentile(99%) = 3.69336 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0071 s
Total GPU Compute Time: 10.0027 s

Batch Size 12

=== Performance summary ===
Throughput: 188.812 qps
Latency: min = 5.25 ms, max = 5.37097 ms, mean = 5.2946 ms, median = 5.28906 ms, percentile(90%) = 5.32129 ms, percentile(95%) = 5.32593 ms, percentile(99%) = 5.36475 ms
Enqueue Time: min = 0.00195312 ms, max = 0.0898438 ms, mean = 0.00248513 ms, median = 0.00244141 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00463867 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 5.25 ms, max = 5.37097 ms, mean = 5.2946 ms, median = 5.28906 ms, percentile(90%) = 5.32129 ms, percentile(95%) = 5.32593 ms, percentile(99%) = 5.36475 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.01 s
Total GPU Compute Time: 10.0068 s


INT8 (ReLU)

Batch Size 1

 === Performance summary ===
 Throughput: 1149.49 qps
 Latency: min = 0.866211 ms, max = 0.871094 ms, mean = 0.868257 ms, median = 0.868164 ms, percentile(90%) = 0.869385 ms, percentile(95%) = 0.869629 ms, percentile(99%) = 0.870117 ms
 Enqueue Time: min = 0.00195312 ms, max = 0.0180664 ms, mean = 0.00224214 ms, median = 0.00219727 ms, percentile(90%) = 0.00268555 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00390625 ms
 H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
 GPU Compute Time: min = 0.866211 ms, max = 0.871094 ms, mean = 0.868257 ms, median = 0.868164 ms, percentile(90%) = 0.869385 ms, percentile(95%) = 0.869629 ms, percentile(99%) = 0.870117 ms
 D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
 Total Host Walltime: 10.0018 s
 Total GPU Compute Time: 9.98235 s

Batch Size 4

=== Performance summary ===
Throughput: 561.857 qps
Latency: min = 1.77344 ms, max = 1.78418 ms, mean = 1.77814 ms, median = 1.77832 ms, percentile(90%) = 1.77979 ms, percentile(95%) = 1.78076 ms, percentile(99%) = 1.78174 ms
Enqueue Time: min = 0.00195312 ms, max = 0.0205078 ms, mean = 0.00233018 ms, median = 0.0022583 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00439453 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 1.77344 ms, max = 1.78418 ms, mean = 1.77814 ms, median = 1.77832 ms, percentile(90%) = 1.77979 ms, percentile(95%) = 1.78076 ms, percentile(99%) = 1.78174 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0043 s
Total GPU Compute Time: 9.99494 s

Batch Size 8

=== Performance summary ===
Throughput: 326.86 qps
Latency: min = 3.04126 ms, max = 3.06934 ms, mean = 3.05773 ms, median = 3.05859 ms, percentile(90%) = 3.06152 ms, percentile(95%) = 3.0625 ms, percentile(99%) = 3.06396 ms
Enqueue Time: min = 0.00195312 ms, max = 0.0209961 ms, mean = 0.00235826 ms, median = 0.00231934 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00463867 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 3.04126 ms, max = 3.06934 ms, mean = 3.05773 ms, median = 3.05859 ms, percentile(90%) = 3.06152 ms, percentile(95%) = 3.0625 ms, percentile(99%) = 3.06396 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0043 s
Total GPU Compute Time: 9.99877 s

Batch Size 12

=== Performance summary ===
Throughput: 216.441 qps
Latency: min = 4.60742 ms, max = 4.63184 ms, mean = 4.61852 ms, median = 4.61816 ms, percentile(90%) = 4.62305 ms, percentile(95%) = 4.62439 ms, percentile(99%) = 4.62744 ms
Enqueue Time: min = 0.00195312 ms, max = 0.0131836 ms, mean = 0.00250633 ms, median = 0.00244141 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00341797 ms, percentile(99%) = 0.00531006 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 4.60742 ms, max = 4.63184 ms, mean = 4.61852 ms, median = 4.61816 ms, percentile(90%) = 4.62305 ms, percentile(95%) = 4.62439 ms, percentile(99%) = 4.62744 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0074 s
Total GPU Compute Time: 10.0037 s


FP16 (SiLU)

Batch Size 1

=== Performance summary ===
Throughput: 802.984 qps
Latency: min = 1.23901 ms, max = 1.25439 ms, mean = 1.24376 ms, median = 1.24316 ms, percentile(90%) = 1.24805 ms, percentile(95%) = 1.24902 ms, percentile(99%) = 1.24951 ms
Enqueue Time: min = 0.00195312 ms, max = 0.00756836 ms, mean = 0.00240711 ms, median = 0.00244141 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00390625 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 1.23901 ms, max = 1.25439 ms, mean = 1.24376 ms, median = 1.24316 ms, percentile(90%) = 1.24805 ms, percentile(95%) = 1.24902 ms, percentile(99%) = 1.24951 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0027 s
Total GPU Compute Time: 9.98985 s

Batch Size 4

=== Performance summary ===
Throughput: 300.281 qps
Latency: min = 3.30341 ms, max = 3.38025 ms, mean = 3.32861 ms, median = 3.3291 ms, percentile(90%) = 3.33594 ms, percentile(95%) = 3.34229 ms, percentile(99%) = 3.37 ms
Enqueue Time: min = 0.00195312 ms, max = 0.00830078 ms, mean = 0.00244718 ms, median = 0.00244141 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00390625 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 3.30341 ms, max = 3.38025 ms, mean = 3.32861 ms, median = 3.3291 ms, percentile(90%) = 3.33594 ms, percentile(95%) = 3.34229 ms, percentile(99%) = 3.37 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0073 s
Total GPU Compute Time: 10.0025 s

Batch Size 8

=== Performance summary ===
Throughput: 153.031 qps
Latency: min = 6.47882 ms, max = 6.64679 ms, mean = 6.53299 ms, median = 6.5332 ms, percentile(90%) = 6.55029 ms, percentile(95%) = 6.55762 ms, percentile(99%) = 6.59766 ms
Enqueue Time: min = 0.00195312 ms, max = 0.0117188 ms, mean = 0.00248772 ms, median = 0.00244141 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00390625 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 6.47882 ms, max = 6.64679 ms, mean = 6.53299 ms, median = 6.5332 ms, percentile(90%) = 6.55029 ms, percentile(95%) = 6.55762 ms, percentile(99%) = 6.59766 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.011 s
Total GPU Compute Time: 10.0085 s

Batch Size 12

=== Performance summary ===
Throughput: 99.3162 qps
Latency: min = 10.0372 ms, max = 10.0947 ms, mean = 10.0672 ms, median = 10.0674 ms, percentile(90%) = 10.0781 ms, percentile(95%) = 10.0811 ms, percentile(99%) = 10.0859 ms
Enqueue Time: min = 0.00195312 ms, max = 0.0078125 ms, mean = 0.00248219 ms, median = 0.00244141 ms, percentile(90%) = 0.00292969 ms, percentile(95%) = 0.00292969 ms, percentile(99%) = 0.00390625 ms
H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
GPU Compute Time: min = 10.0372 ms, max = 10.0947 ms, mean = 10.0672 ms, median = 10.0674 ms, percentile(90%) = 10.0781 ms, percentile(95%) = 10.0811 ms, percentile(99%) = 10.0859 ms
D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
Total Host Walltime: 10.0286 s
Total GPU Compute Time: 10.0269 s

yolov9-qat's Issues

No reduction in inference time for qat model

I have deployed the yolov9-qat model using C++ TensorRT on an RTX 3090, but I find the inference time is the same as with fp16.
The only modification I made to the fp16 code was to add this:


i.e. I set both fp16 and int8:


I also tested infer-yolov9-c-qat-end2end.onnx, and there is still no reduction in inference time for this model.

Cannot evaluate qat model and export end2end onnx.

Hi, thanks for your great work.
I have used your code to quantize these yolov9 models, but I can't evaluate the qat model or export end2end onnx.
The error logs and information of my device are below.
However, I can use to export normal onnx and to export end2end onnx. It seems kind of weird.

root@Ryan-Windows-PC:/yolov9# python3 eval --weights runs/qat/yolov9-c-converted-qat/weights/ --name eval_qat_yolov9
Namespace(batch_size=10, cmd='eval', conf_thres=0.001, data='data/coco.yaml', device='cuda:0', exist_ok=False, imgsz=640, iou_thres=0.7, name='eval_qat_yolov9', project=PosixPath('runs/qat_eval'), save_dir='runs/qat_eval/eval_qat_yolov9', weights='runs/qat/yolov9-c-converted-qat/weights/')
Traceback (most recent call last):
File "", line 542, in
run_eval(opt.weights, opt.device,,
TypeError: run_eval() missing 1 required positional argument: 'eval_pycocotools'

root@Ryan-Windows-PC:/yolov9# python3 --weights runs/qat/yolov9-c-converted-qat/weights/ --include onnx_end2end
export_qat: data=data/coco.yaml, weights=['runs/qat/yolov9-c-converted-qat/weights/'], imgsz=[640, 640], batch_size=1, device=cpu, half=False, inplace=True, keras=False, optimize=False, int8=False, dynamic=True, simplify=True, opset=12, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx_end2end']
YOLO 🚀 v0.1-85-gbad4f4b Python-3.8.10 torch-1.14.0a0+44dac51 CPU

Fusing layers...
gelan-c summary: 676 layers, 25288768 parameters, 25288768 gradients, 0.4 GFLOPs

PyTorch: starting from runs/qat/yolov9-c-converted-qat/weights/ with output shape (1, 84, 8400) (99.8 MB)

ONNX END2END: starting export with onnx 1.16.0...
ONNX END2END: export failure ❌ 0.0s: 'DetectionModel' object is not subscriptable

=== Device Information ===
Available Devices:
Device 0: "NVIDIA GeForce RTX 4080 SUPER" UUID: GPU-91544520-3ee2-efb6-e8ca-c6e6536f93f7
Device 1: "NVIDIA GeForce GTX 1080 Ti" UUID: GPU-c20a7a00-79b8-e596-d431-59989dc5c026
Selected Device: NVIDIA GeForce RTX 4080 SUPER
Selected Device ID: 0
Selected Device UUID: GPU-91544520-3ee2-efb6-e8ca-c6e6536f93f7
Compute Capability: 8.9
SMs: 80
Device Global Memory: 16375 MiB
Shared Memory per SM: 100 KiB
Memory Bus Width: 256 bits (ECC disabled)
Application Compute Clock Rate: 2.55 GHz
Application Memory Clock Rate: 11.501 GHz
TensorRT version: 10.0.0

Quantization in SPPELAN

We're currently working on improving our YOLOv9 model by implementing quantization in the SPPELAN class (class SPPELAN(nn.Module)). We've observed that the current implementation is generating reformat operations.

Any contributions, suggestions, or shared experiences will be valued.

Implementing Quantization in ADown Downsampling Class

We're currently working on improving our YOLOv9 model by implementing quantization in the ADown downsampling class. We've observed that the current implementation is generating reformat operations and increasing the latency of the model.

To address this issue, we plan to implement quantization in two steps:

  1. Creating a Quantized Version of ADown:
    We will create a new class called QuantADown. This class will contain a method named adown_quant_forward(self, x) to handle the quantized forward pass.

  2. Integration into the Model:
    We will integrate the QuantADown class into our model by modifying the replace_custom_module_forward function in our quantization script. This function is responsible for replacing custom modules with their quantized counterparts during the quantization process.

We believe that implementing quantization in the ADown class will help optimize the model's performance and reduce latency. We welcome any feedback or suggestions from the community regarding this approach.

Thank you for your support and collaboration!

