Comments (24)
Hi @nevoGoldamn, could you give some hints about which part of the code you think should be updated?
Hi @OpenVINO-dev-contest, I know the tensor shape is what it is, but does it make sense for inference with yolov7-tiny (trained on 640x640) to take 400 ms? When I run detect.py (from the yolov7 repository), inference takes 8 ms.
Did you run detect.py on a dGPU? This repository only works with Intel CPUs and iGPUs.
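For reference, OpenVINO selects the execution device with a plain device string at compile time, so switching between the CPU and the integrated GPU is a one-line change. A minimal sketch (the model path is a placeholder; "CPU" and "GPU" are the standard OpenVINO device names):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // "CPU" targets the Intel CPU plugin; "GPU" targets the Intel iGPU.
    // NVIDIA dGPUs are not a standard OpenVINO device, which is why this
    // repository only supports Intel CPUs and iGPUs.
    ov::CompiledModel compiled = core.compile_model("yolov7-tiny.onnx", "GPU");
    ov::InferRequest request = compiled.create_infer_request();
    (void)request;  // inference itself omitted in this sketch
}
```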
Both on iGPU.
I don't see that detect.py can support iGPU:
https://github.com/WongKinYiu/yolov7/blob/main/detect.py#L173
May I know the full command you used to run detect.py?
OK, I retested it on CPU.
This is the output from detect.py,
and this is the output from the C++ code (in seconds).
Same model, same CPU...
May I know the full command you used to run detect.py?
python detect.py --weights best_costum.pt --conf 0.2 --img-size 640 --source test7.avi --no-trace
I also changed the code to device = select_device('cpu') # opt.device
How do you measure the time cost in the C++ example? Does it include the total detection process or just the inference time?
```cpp
double start, end, res;
start = cv::getTickCount();
//cv::cvtColor(boxed, img, cv::COLOR_BGR2RGB);
// -------- Step 5. Prepare input --------
//ov::Tensor input_tensor1 = infer_request.get_input_tensor(0);
// NHWC => NCHW
boxed.convertTo(boxed, CV_32FC3);
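// Note: convertTo only changes the element type (u8 -> f32); it does not
// reorder the HWC memory layout, so the NHWC->NCHW permute implied by the
// comment above is presumably handled elsewhere in the preprocessing.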
//ov::Tensor input_tensor(input_port.get_element_type(), input_port.get_shape(), data1);
ov::Tensor input_tensor(input_port.get_element_type(), input_port.get_shape(), (float*)boxed.data);
infer_request.set_input_tensor(input_tensor);
// -------- Step 6. Start inference --------
infer_request.infer();
// -------- Step 7. Process output --------
//auto output_tensor_p8 = infer_request.get_output_tensor(0);
//const float* result_p8 = output_tensor_p8.data<const float>();
//auto output_tensor_p16 = infer_request.get_output_tensor(1);
//const float* result_p16 = output_tensor_p16.data<const float>();
auto output_tensor_p32 = infer_request.get_output_tensor(2);
const float* result_p32 = output_tensor_p32.data<const float>();
std::vector<Object> proposals;
std::vector<Object> objects8;
std::vector<Object> objects16;
std::vector<Object> objects32;
std::vector<Object> objects;
//generate_proposals(8, result_p8, prob_threshold, objects8);
//proposals.insert(proposals.end(), objects8.begin(), objects8.end());
//generate_proposals(16, result_p16, prob_threshold, objects16);
//proposals.insert(proposals.end(), objects16.begin(), objects16.end());
generate_proposals(32, result_p32, prob_threshold, objects32);
proposals.insert(proposals.end(), objects32.begin(), objects32.end());
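// Only the stride-32 head is decoded here; the stride-8 and stride-16 heads
// above are commented out, so detections from those scales never reach NMS.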
std::vector<int> classIds;
std::vector<float> confidences;
std::vector<cv::Rect> boxes;
for (size_t i = 0; i < proposals.size(); i++)
{
classIds.push_back(proposals[i].label);
confidences.push_back(proposals[i].prob);
boxes.push_back(proposals[i].rect);
}
std::vector<int> picked;
// apply non-maximum suppression to the bounding boxes
cv::dnn::NMSBoxes(boxes, confidences, prob_threshold, nms_threshold, picked);
float raw_h = src_img.rows;
float raw_w = src_img.cols;
float ratio_x = (float)raw_w / img_w;
float ratio_y = (float)raw_h / img_h;
end = cv::getTickCount();
for (size_t i = 0; i < picked.size(); i++)
{
int idx = picked[i];
cv::Rect box = boxes[idx];
cv::Rect scaled_box = scale_box(box, padd);
drawPred(classIds[idx], confidences[idx], scaled_box, padd[2], raw_h, raw_w, src_img, class_names);
}
res = (end - start) / cv::getTickFrequency();
cout << "time of output --> " << res;
```
All the preprocessing is not included in the time calculation.
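Note that the measured span above also covers tensor preparation, output decoding, and NMS, since end is only taken after cv::dnn::NMSBoxes. To get a pure inference number for comparison, one option is to time only infer_request.infer() after a warm-up call. A minimal sketch under those assumptions (the model path is a placeholder, and a single f32 input is assumed):

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    ov::CompiledModel compiled = core.compile_model("yolov7-tiny.onnx", "CPU");
    ov::InferRequest infer_request = compiled.create_infer_request();

    // Fill the input once, outside the timed region.
    ov::Tensor input = infer_request.get_input_tensor();
    std::fill_n(input.data<float>(), input.get_size(), 0.5f);

    // Warm-up: the first infer() call carries one-time initialization cost.
    infer_request.infer();

    auto t0 = std::chrono::steady_clock::now();
    infer_request.infer();
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "pure inference: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";
}
```

If the number from this isolated call is close to what benchmark_app reports, any remaining gap in the full pipeline points at pre/post-processing rather than the model itself.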
Thanks, I see.
Did you measure the time cost in detect.py for the same part of the code?
It seems that detect.py does the same thing I do in the C++ code... I don't get why there's such a huge time difference.
Hi, I implemented your method to measure the time cost:
time of output --> 0.0252228
Can you also try the command benchmark_app -m ./yolov7-tiny.onnx to test the model's pure performance?
On what CPU?
"Can you also try command benchmark_app -m ./yolov7-tiny.onnx to test the model's pure performance ?"
Where do I run that command?
Can you show how you got the result "time of output --> 0.0252228"?
I'm on an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz, and I used the same method as you did to measure the time.
Regarding where to run the command: you can install the tool with pip install openvino-dev, then run benchmark_app in a terminal.
```
Microsoft Windows [Version 10.0.19044.2486]
(c) Microsoft Corporation. All rights reserved.
C:\Users\Nevo>cd Desktop
C:\Users\Nevo\Desktop>benchmark_app -m best_costum.onnx
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2022.3.0-9052-9752fafe8eb-releases/2022/3
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2022.3.0-9052-9752fafe8eb-releases/2022/3
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 968.16 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] images (node: images) : f32 / [...] / [1,3,640,640]
[ INFO ] Model outputs:
[ INFO ] output (node: output) : f32 / [...] / [1,3,80,80,11]
[ INFO ] 1387 (node: 1387) : f32 / [...] / [1,3,40,40,11]
[ INFO ] 1401 (node: 1401) : f32 / [...] / [1,3,20,20,11]
[ INFO ] 1415 (node: 1415) : f32 / [...] / [1,3,10,10,11]
[ INFO ] 1350 (node: 1350) : f32 / [...] / [1,320,80,80]
[ INFO ] 1353 (node: 1353) : f32 / [...] / [1,640,40,40]
[ INFO ] 1356 (node: 1356) : f32 / [...] / [1,960,20,20]
[ INFO ] 1359 (node: 1359) : f32 / [...] / [1,1280,10,10]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] images (node: images) : u8 / [N,C,H,W] / [1,3,640,640]
[ INFO ] Model outputs:
[ INFO ] output (node: output) : f32 / [...] / [1,3,80,80,11]
[ INFO ] 1387 (node: 1387) : f32 / [...] / [1,3,40,40,11]
[ INFO ] 1401 (node: 1401) : f32 / [...] / [1,3,20,20,11]
[ INFO ] 1415 (node: 1415) : f32 / [...] / [1,3,10,10,11]
[ INFO ] 1350 (node: 1350) : f32 / [...] / [1,320,80,80]
[ INFO ] 1353 (node: 1353) : f32 / [...] / [1,640,40,40]
[ INFO ] 1356 (node: 1356) : f32 / [...] / [1,960,20,20]
[ INFO ] 1359 (node: 1359) : f32 / [...] / [1,1280,10,10]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 1407.88 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: torch-jit-export
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ] NUM_STREAMS: 4
[ INFO ] AFFINITY: Affinity.NONE
[ INFO ] INFERENCE_NUM_THREADS: 8
[ INFO ] PERF_COUNT: False
[ INFO ] INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ] PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'images'!. This input will be filled with random values!
[ INFO ] Fill input 'images' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 1067.40 ms
[Step 11/11] Dumping statistics report
[ INFO ] Count: 96 iterations
[ INFO ] Duration: 63219.74 ms
[ INFO ] Latency:
[ INFO ] Median: 2594.64 ms
[ INFO ] Average: 2625.73 ms
[ INFO ] Min: 2238.78 ms
[ INFO ] Max: 3206.64 ms
[ INFO ] Throughput: 1.52 FPS
```
What can you infer from that?
P.S. Thank you for the help.
Maybe it's something with the conversion to ONNX?
You can get the baseline performance of your model from benchmark_app on your device, which here is about 1 s for each inference with OpenVINO. Is your model yolov7 or yolov7-tiny in this case?
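A quick sanity check on the log above: 96 iterations in 63.2 s is about 1.52 FPS; with 4 infer requests in flight, the expected per-request latency is roughly 4 / 1.52 ≈ 2.6 s, which matches the reported median and works out to about 0.66 s of compute per frame.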
I think it's yolov7-tiny, but I'll re-check. Also, I have the same CPU as yours; what could be the reason for this gap in inference time?
Maybe your CPU is busy with other applications? It seems the model you fed into benchmark_app is full YOLOv7; 1 s for the tiny model would be too slow.
I run an SSD model at 40 ms; can you explain that? Same CPU, same situation...
Sorry, I have no idea. Maybe you could test the SSD model with benchmark_app as well and see whether the performance gap makes sense to you.