Comments (6)
CC @rmccorm4
Hi @Joenhle, thanks for opening the issue with such detail!
A few follow-up questions to help eliminate some factors:
- Do you see the expected scaling in performance when querying your DALI model directly from perf_analyzer (perf_analyzer -m my_dali_model ...)?
- Can you share your BLS model and the logic being used for gathering/scattering?
- Sharing the full perf_analyzer outputs with latency breakdowns (network, copy, compute, etc.) would also help identify any bottlenecks.
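For the first bullet, a typical invocation to sweep concurrency against the DALI model directly might look like the following (model name and range are illustrative placeholders):

```shell
# Sweep client concurrency from 1 to 8 and report p95 latency,
# including the per-stage breakdown (queue, compute, network, etc.).
perf_analyzer -m my_dali_model --concurrency-range 1:8 --percentile=95
```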
@rmccorm4 thanks for your reply; I have resolved the question. In the earlier, slow version, my BLS gathering logic constructed 32 independent async requests, each with input data of shape [1, C, H, W], then used asyncio.gather to wait for the 32 responses. In the latest version, I use torch.cat to concatenate the 32 tensors manually into shape [32, C, H, W] and send them in one request, which performs much better. I don't know why manually gathering the data in BLS makes such a big difference compared to handing it to the dynamic batcher.
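The two gathering strategies described above can be sketched as follows, using NumPy arrays and a stub infer() coroutine as stand-ins for the BLS InferenceRequest round trip (names, shapes, and the stub are illustrative, not the Triton API):

```python
import asyncio
import numpy as np

C, H, W = 3, 16, 16  # illustrative sizes

async def infer(batch):
    """Stand-in for a BLS inference round trip (hypothetical stub)."""
    await asyncio.sleep(0)  # placeholder for serialization/queueing cost
    return batch * 2        # placeholder for model output

async def gather_32_requests(chunks):
    # Slow pattern: one request per [1, C, H, W] chunk -> 32 round trips,
    # each paying per-request overhead.
    outputs = await asyncio.gather(*(infer(c) for c in chunks))
    return np.concatenate(outputs, axis=0)

async def single_batched_request(chunks):
    # Fast pattern: concatenate first, then one [32, C, H, W] round trip.
    return await infer(np.concatenate(chunks, axis=0))

chunks = [np.ones((1, C, H, W), dtype=np.float32) for _ in range(32)]
out_a = asyncio.run(gather_32_requests(chunks))
out_b = asyncio.run(single_batched_request(chunks))
assert out_a.shape == out_b.shape == (32, C, H, W)
```

Both patterns produce the same output shape; the difference is that the first pays the per-request overhead (tensor transfer, scheduling) 32 times instead of once.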
@Joenhle this is an interesting observation. I was curious whether the BLS tensors you are transferring are on GPU or CPU?
For GPU tensors, if you run out of the CUDA memory pool, falling back to cudaIpc calls can hinder performance.
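If the CUDA memory pool does overflow, one mitigation is to enlarge the per-device pool at server startup via the tritonserver CLI (the size below is an illustrative value, not a recommendation):

```shell
# Raise the CUDA memory pool for GPU 0 to 512 MiB so BLS tensor
# transfers stay in-pool rather than falling back to cudaIpc.
tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:536870912
```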
Also, can you share the Triton version that you're using?
@Tabrizian 1. The tensors are on GPU. 2. We didn't run out of CUDA memory. 3. Triton version 2.21.0.
Could you retry your experiments on the latest version of Triton? @krishung5 had some changes in this area that should improve the performance.
Related Issues (20)
- model_repository_manager.cc:1186] failed to load 'bert' version 1: Internal: failed to load model 'bert': PytorchStreamReader failed reading zip archive: failed finding central directory
- Discrepancy between nv_cpu_utilization and container_cpu_usage_seconds_total
- Wrong output of sentence_transformer with tensorrt_accelerator on onnxruntime
- Incorrect data received on python backend from client
- Ensemble retrieves the input from the incorrect memory address
- Queue timeouts not working as expected
- Docker build of Triton Server r24.07 on Ubuntu 22.04/Arm fails
- SSLEOFError when result from async_infer is not available in http client
- High GPU memory use
- Stateful decoupled BLS model: malloc_consolidate(): unaligned fastbin chunk detected
- Triton needs API docs like vLLM's FastAPI docs
- How to use StopStream when using AsyncStreamInfer?
- ValidateBytesInputs() check failed on Big Endian machines
- How to send byte or string data in an array in perf_analyzer
- vllm backend - UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'numpy'
- Support passing variables in config.pbtxt
- Failed to stat file model.onxx while using conda-pack in configs
- Support request cancellation on timeout for sync grpc client
- Discrepancy in inference timing between trtexec and Triton Server (TensorRT backend) with gRPC communication for YOLOv8
- Inconsistent prediction results using onnx backend with tensorrt enabled