Hi, I set max_batch_size = 3 and I want to speed up the model with 3 input images as pa

slow with batch size > 1 about tensorrt_demos HOT 5 CLOSED

jkjung-avt commented on May 14, 2024

slow with batch size > 1

from tensorrt_demos.

Comments (5)

jkjung-avt commented on May 14, 2024

Does your processing time include image preprocessing? (Do you get a different result with trt_ssd_async.py?)

PythonImageDeveloper commented on May 14, 2024

The time is only for trt_ssd.detect(img, 0.3). This call includes the preprocessing step, but for simplicity I concatenate the same image 3 times before feeding it to np.copyto(self.host_inputs[0], img_resized.ravel()), like this:
img_pred = np.concatenate([img_pred,img_pred,img_pred])
I haven't tested with trt_ssd_async.py.
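For readers following along, here is a minimal sketch of how a batch of 3 could be packed into one flat host buffer. It is not the code from ssd.py: the `preprocess` function and the `host_input` array are hypothetical stand-ins, and the (3, 300, 300) per-image shape is taken from the input size mentioned in this thread. It also checks that np.concatenate on CHW images yields the same flattened data as stacking along a new batch axis.

```python
import numpy as np

BATCH, C, H, W = 3, 3, 300, 300

def preprocess(img_hwc):
    """Hypothetical preprocessing: HWC uint8 image -> CHW float32 in [-1, 1]."""
    chw = img_hwc.transpose((2, 0, 1)).astype(np.float32)
    return (2.0 / 255.0) * chw - 1.0

# Random image used only as placeholder input data.
img = np.random.randint(0, 256, size=(H, W, C), dtype=np.uint8)
img_pred = preprocess(img)                       # shape (3, 300, 300)

# np.stack adds a leading batch axis: shape (3, 3, 300, 300).
batch = np.stack([img_pred, img_pred, img_pred])

# np.concatenate joins along the channel axis instead: shape (9, 300, 300).
# The flattened memory layout is the same as for the stacked batch.
concatenated = np.concatenate([img_pred, img_pred, img_pred])

# Stand-in for the page-locked host buffer (assumed to be a flat float32 array).
host_input = np.empty(BATCH * C * H * W, dtype=np.float32)
np.copyto(host_input, batch.ravel())
```

Either form fills the buffer with the same bytes, so the choice mostly affects how readable the shapes are while debugging.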

I did the following steps:
1- change input size from (1,3,300,300) to (3,3,300,300)
2- change to builder.max_batch_size = 3
4- change self.context.execute_async(
batch_size=3,
bindings=self.bindings,
stream_handle=self.stream.handle)

Note that when I comment out self.stream.synchronize() in ssd.py, the first few results take 0.002 sec and then the time grows until it reaches 0.06 sec. When I leave self.stream.synchronize() uncommented, I get 0.06 sec for every result. Why?
In my opinion, self.stream.synchronize() should be asynchronous rather than blocking, if possible.

jkjung-avt commented on May 14, 2024

Instead of timing the whole trt_ssd.detect() function, I think it makes more sense for you to only time the "cuda.memcpy_xxx"s, "context.execute_async" and "cuda.stream.synchronize" in that function.

By the way, the "self.stream.synchronize" call cannot be commented out. Otherwise, you cannot be sure the GPU has finished processing the image.
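To illustrate why the asynchronous calls look nearly free while the synchronize call carries the cost, here is a toy simulation in plain Python. It does not use CUDA or TensorRT: `FakeStream` is a hypothetical stand-in in which a worker thread plays the role of the GPU, and the 0.05 s of work is an arbitrary value. It only models the general pattern of "enqueue returns immediately, synchronize waits".

```python
import queue
import threading
import time

class FakeStream:
    """Toy stand-in for a stream: work is queued and executed by a worker thread."""

    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            seconds = self._q.get()
            time.sleep(seconds)        # pretend the device is busy
            self._q.task_done()

    def enqueue(self, seconds):
        self._q.put(seconds)           # returns immediately, like an async launch

    def synchronize(self):
        self._q.join()                 # blocks until all queued work has finished

stream = FakeStream()

t0 = time.perf_counter()
stream.enqueue(0.05)                   # "launch" 0.05 s of work
t_enqueue = time.perf_counter() - t0

t1 = time.perf_counter()
stream.synchronize()                   # the waiting happens here
t_sync = time.perf_counter() - t1

print(f"enqueue: {t_enqueue:.4f} s, synchronize: {t_sync:.4f} s")
```

In this model, a timer that stops before synchronize() measures only the cost of queuing the work, not the work itself, which is why the per-stage timings should always include the synchronize call.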

PythonImageDeveloper commented on May 14, 2024

This is my custom TensorRT OCR model. With batch_size = 1 I get 0.02 sec, and with batch_size = 10 I get 0.2 sec, which suggests that the images in the batch are processed serially, not in parallel. Why?

Batch_size = 1

TensorRT All Time: 0.02888178825378418
cuda.memcpy_htod_async: cuda_inputs: 0.00016927719116210938
self.context.execute_async: 0.0031588077545166016
cuda.memcpy_dtoh_async : host_outputs: 9.1552734375e-05
stream.synchronize(): 0.018606901168823242

Batch_size = 10

TensorRT All Time: 0.22867369651794434
cuda.memcpy_htod_async: cuda_inputs: 0.00018334388732910156
self.context.execute_async: 0.0013976097106933594
cuda.memcpy_dtoh_async : host_outputs: 9.894371032714844e-05
stream.synchronize(): 0.20677971839904785
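As a rough way to compare the two runs, the totals above can be converted into time per image and into the share of time spent waiting in stream.synchronize(). The sketch below uses only the numbers posted in this comment; the dictionary layout and variable names are just for illustration.

```python
# Timings copied from the two logs above (seconds).
timings = {
    1: {"total": 0.02888178825378418, "synchronize": 0.018606901168823242},
    10: {"total": 0.22867369651794434, "synchronize": 0.20677971839904785},
}

per_image = {}    # seconds per image for each batch size
sync_share = {}   # fraction of total time spent in stream.synchronize()

for batch_size, t in timings.items():
    per_image[batch_size] = t["total"] / batch_size
    sync_share[batch_size] = t["synchronize"] / t["total"]
    print(
        f"batch_size={batch_size}: "
        f"{per_image[batch_size]:.4f} s/image, "
        f"{sync_share[batch_size]:.0%} of the time in synchronize()"
    )
```

By this measure the per-image time goes from about 0.029 s at batch size 1 to about 0.023 s at batch size 10, so batching helps only slightly here, and nearly all of the batch-10 time is spent waiting for the GPU to finish.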

jkjung-avt commented on May 14, 2024

Duplicated issue: #106
