Comments (6)
CC @rmccorm4
Hi @Joenhle, thanks for opening the issue with such detail!
A few follow-up questions to help eliminate some factors:
- Do you see the expected scaling in performance when querying your DALI model directly from perf_analyzer (perf_analyzer -m my_dali_model ...)?
- Can you share your BLS model and the logic being used for gathering/scattering?
- Sharing the full perf_analyzer outputs with latency breakdowns (network, copy, compute, etc.) would also help identify any bottlenecks.
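For the first bullet, a typical invocation to sweep concurrency against the DALI model directly might look like the following (model name and range are illustrative placeholders):

```shell
# Sweep client concurrency from 1 to 8 and report p95 latency,
# including the per-stage breakdown (queue, compute, network, etc.).
perf_analyzer -m my_dali_model --concurrency-range 1:8 --percentile=95
```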
@rmccorm4 thanks for your reply; I have resolved the question. In the earlier, slow version, my BLS gathering logic constructed 32 independent async requests, each with input data of shape [1, C, H, W], then used asyncio.gather to wait for the 32 responses. In the latest version, I use torch.cat to concatenate the 32 tensors manually into shape [32, C, H, W] and send them in one request, which performs much better. I don't know why manually gathering the data in BLS makes such a big difference compared to handing it to the dynamic batcher.
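The two gathering strategies described above can be sketched as follows, using NumPy arrays and a stub infer() coroutine as stand-ins for the BLS InferenceRequest round trip (names, shapes, and the stub are illustrative, not the Triton API):

```python
import asyncio
import numpy as np

C, H, W = 3, 16, 16  # illustrative sizes

async def infer(batch):
    """Stand-in for a BLS inference round trip (hypothetical stub)."""
    await asyncio.sleep(0)  # placeholder for serialization/queueing cost
    return batch * 2        # placeholder for model output

async def gather_32_requests(chunks):
    # Slow pattern: one request per [1, C, H, W] chunk -> 32 round trips,
    # each paying per-request overhead.
    outputs = await asyncio.gather(*(infer(c) for c in chunks))
    return np.concatenate(outputs, axis=0)

async def single_batched_request(chunks):
    # Fast pattern: concatenate first, then one [32, C, H, W] round trip.
    return await infer(np.concatenate(chunks, axis=0))

chunks = [np.ones((1, C, H, W), dtype=np.float32) for _ in range(32)]
out_a = asyncio.run(gather_32_requests(chunks))
out_b = asyncio.run(single_batched_request(chunks))
assert out_a.shape == out_b.shape == (32, C, H, W)
```

Both patterns produce the same output shape; the difference is that the first pays the per-request overhead (tensor transfer, scheduling) 32 times instead of once.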
@Joenhle this is an interesting observation. I was curious whether the BLS tensors you are transferring are on GPU or CPU?
For GPU tensors, if you run out of the CUDA memory pool, falling back to cudaIpc calls can hinder performance.
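If the CUDA memory pool does overflow, one mitigation is to enlarge the per-device pool at server startup via the tritonserver CLI (the size below is an illustrative value, not a recommendation):

```shell
# Raise the CUDA memory pool for GPU 0 to 512 MiB so BLS tensor
# transfers stay in-pool rather than falling back to cudaIpc.
tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:536870912
```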
Also, can you share the Triton version that you're using?
@Tabrizian 1. The tensors are on GPU. 2. We didn't run out of CUDA memory. 3. Triton version 2.21.0.
Could you retry your experiments on the latest version of Triton? @krishung5 had some changes in this area that should improve the performance.
Related Issues (20)
- model_repository_manager.cc:1186] failed to load 'bert' version 1: Internal: failed to load model 'bert': PytorchStreamReader failed reading zip archive: failed finding central directory
- Discrepancy between nv_cpu_utilization and container_cpu_usage_seconds_total
- Wrong output of sentence_transformer with tensorrt_accelerator on onnxruntime
- Incorrect data received on python backend from client
- Ensemble retrieves the input from the incorrect memory address
- Queue timeouts not working as expected
- Docker build of Triton Server r24.07 on Ubuntu 22.04/Arm fails
- SSLEOFError when result from async_infer is not available in http client
- High GPU memory use
- Stateful decoupled BLS model: malloc_consolidate(): unaligned fastbin chunk detected
- Triton needs API docs like vLLM's FastAPI docs
- How to use StopStream when using AsyncStreamInfer?
- ValidateBytesInputs() check failed on Big Endian machines
- How to send byte or string data in an array in perf_analyzer
- vllm backend - UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'numpy'
- Support passing variables in config.pbtxt
- Failed to stat file model.onxx while using conda-pack in configs
- Support request cancellation on timeout for sync grpc client
- Discrepancy in inference timing between trtexec and Triton Server (TensorRT backend) with gRPC communication for YOLOv8
- Inconsistent prediction results using onnx backend with tensorrt enabled