Comments (6)
Hi @asaff1, thanks for the detailed description!
Warmup criteria and behavior can vary a bit with each framework. One suggestion I'd be interested in seeing the results of: can you try doing server-side warmup? That way, by the time PA (Perf Analyzer) or a client starts hitting the server, it is ideally already warmed up, or at least closer to it.
There are docs here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#model-warmup
And you can see an example warmup configuration here:
server/qa/L0_warmup/failing_infer/config.pbtxt (lines 44 to 56 in a168d51)
You can choose random data, or use an input data file that is more representative of data you'd expect to see at runtime in your use case: https://github.com/triton-inference-server/common/blob/00b3a71519e32e3bc954e9f0d067e155ef8f1a6c/protobuf/model_config.proto#L1721
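For concreteness, a warmup stanza in config.pbtxt looks roughly like the sketch below. This is a minimal example, assuming an image model with a single FP32 input; the input name, dims, and count are placeholders rather than values from your actual model config.

```
# Hypothetical warmup block; adjust input name, dims, and count to your model.
model_warmup [
  {
    name: "random_sample"
    batch_size: 1
    count: 100               # number of warmup iterations for this sample
    inputs {
      key: "INPUT"           # placeholder input name
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]    # per-request shape, excluding the batch dimension
        random_data: true        # could instead be zero_data or input_data_file
      }
    }
  }
]
```

The warmup samples are run when the model loads, so that cost shifts to load time rather than to the first client requests.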
Hi @rmccorm4,
I did some experiments with warmup. I tried setting 100 warmup requests with batch size = 1, yet it still takes a few minutes until response times stabilize.
Another interesting thing to note: after the server is "warmed up" (i.e., after sending requests at a throughput of 800 images/sec for 6 minutes), even if I stop the client and the server (and the GPU) sits idle for a few hours, the next time the client starts the server is still "warmed up" and answers fast. Only if I stop and restart the Triton server process do I need to do the warmup all over again.
So I assume the cause is somewhere in the software (TensorFlow, CUDA, Triton, etc.): something might be doing optimization at runtime, or has some lazily initialized parts. I'm looking for information about that.
Hi @asaff1, does batch_size=1 capture the types of requests you're expecting to see at runtime too? Or are you sending requests with greater batch sizes at runtime after the model has loaded? Warmup data shapes should try to capture runtime expectations as much as possible, as different shapes can follow different inference paths, CUDA kernels, etc., which may individually have some warmup based on per-framework details.
Another way to ask the question: after sending all of your 100 warmup requests for batch size 1, do you at least see stable response times for batch size 1? If not, is there a threshold of warmup requests (500, 1000, etc.) where you do see quicker stable response times? Does using random_data vs zero_data have a noticeable effect?
These are generally framework/library-specific concepts, as you point out, at the TensorFlow/CUDA level for the majority of the "cold start penalty". CC @tanmayv25 if you have any more details/thoughts.
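To illustrate covering multiple runtime shapes, a config can declare more than one warmup sample, e.g. one per batch size you expect to serve, and you can switch between zero_data and random_data per sample to compare their effect. A sketch only; the input name, dims, batch sizes, and counts below are placeholders:

```
model_warmup [
  {
    name: "bs1_zero"
    batch_size: 1
    count: 50
    inputs {
      key: "INPUT"
      value: { data_type: TYPE_FP32  dims: [ 3, 224, 224 ]  zero_data: true }
    }
  },
  {
    name: "bs8_random"
    batch_size: 8            # match a batch size you actually expect at runtime
    count: 50
    inputs {
      key: "INPUT"
      value: { data_type: TYPE_FP32  dims: [ 3, 224, 224 ]  random_data: true }
    }
  }
]
```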
@rmccorm4 thanks for the detailed answer. Yes, I do see improvements depending on the warmup batch size. It would be great to have a more in-depth explanation of this. @tanmayv25
@asaff1 From the model configuration settings you have provided, it seems that you are using dynamic batching with max_batch_size set to 128. This means that, depending on the pending request count, the Triton core can send request batches of any size in [1, 128] to the TensorFlow session for execution. Each TensorFlow model consumes some memory for holding the model weights and dynamically allocates extra memory into its memory pool for tensors, depending on their shape (including batch size).
I am assuming that it is taking you longer to reach a stable value because of the varying batch sizes of the requests being forwarded to the TF model.
My recommendation would be to set the warmup batch_size to 128 and send realistic data (some models have data-dependent output shapes) as the warmup sample. This would ensure that the resource pool is completely populated to handle requests with such a large batch size. You can also try sending 5 warmup requests.
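A sketch of that recommendation in config.pbtxt terms (the input name, dims, and file name are placeholders; per the warmup docs, an input_data_file is read from a "warmup" subdirectory under the model directory, if I remember the layout correctly):

```
model_warmup [
  {
    name: "max_batch_realistic"
    batch_size: 128          # match max_batch_size so the memory pool is grown up front
    count: 5                 # repeat the sample a few times
    inputs {
      key: "INPUT"           # placeholder input name
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]             # placeholder per-request shape
        input_data_file: "sample_input"   # raw tensor bytes under <model_dir>/warmup/
      }
    }
  }
]
```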