
Comments (6)

indrajit96 commented on September 25, 2024

CC @kthui @rmccorm4


rmccorm4 commented on September 25, 2024

Hi @asaff1, thanks for the detailed description!

Warmup criteria and behavior can vary a bit with each framework. One suggestion I'd be interested in seeing the results of: can you try doing server-side warmup? That way, when PA or a client starts hitting the server, ideally it is already warmed up, or at least closer to it.

There are docs here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#model-warmup

And you can see an example warmup configuration here:

model_warmup [
  {
    name : "zero sample"
    batch_size: 1
    inputs {
      key: "INPUT"
      value: {
        data_type: TYPE_FP32
        dims: 4
        zero_data: true
      }
    }
  }
]

You can choose random data, or use an input data file that is more representative of the data you'd expect to see at runtime in your use case: https://github.com/triton-inference-server/common/blob/00b3a71519e32e3bc954e9f0d067e155ef8f1a6c/protobuf/model_config.proto#L1721
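For instance, a warmup sample that reads a representative input from a file (rather than zeros) might look like the sketch below. The tensor name, data type, and shape are placeholders copied from the example above, and the file name is hypothetical; the file would live under the model's warmup directory.

model_warmup [
  {
    name : "representative sample"
    batch_size: 1
    inputs {
      key: "INPUT"                # placeholder: use your model's input tensor name
      value: {
        data_type: TYPE_FP32      # placeholder data type
        dims: 4                   # placeholder shape
        input_data_file: "raw_input_sample"   # hypothetical file under the model's warmup/ directory
      }
    }
  }
]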


asaff1 commented on September 25, 2024

Hi @rmccorm4,
I've done some experiments with warmup. I tried setting 100 warmup requests with batch size = 1, yet it still takes a few minutes until the response time gets stable.

Another interesting thing to note: after the server is "warmed up" (i.e. after sending requests at a throughput of 800 images/sec for 6 minutes), even if I stop the client, so the server (and the GPU) is idle for a few hours, the next time the client starts the server will still be "warmed up" and will answer fast. Only if I stop and restart the Triton server process do I need to do the warmup all over again.

So I can assume the cause is somewhere in the software (TensorFlow, CUDA, Triton, etc.): something might be doing optimization at runtime, or has some lazy-initialization parts. I'm looking for information about that.
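(For reference, a warmup block that replays 100 zero-data, batch-size-1 samples, as described above, could be sketched as follows, assuming the warmup sample's count field is available in your Triton version; the tensor name and shape are placeholders.)

model_warmup [
  {
    name : "batch-1 warmup"
    batch_size: 1
    count: 100                    # repeat this warmup sample 100 times
    inputs {
      key: "INPUT"                # placeholder input tensor name
      value: {
        data_type: TYPE_FP32      # placeholder data type
        dims: 4                   # placeholder shape
        zero_data: true
      }
    }
  }
]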


rmccorm4 commented on September 25, 2024

Hi @asaff1, does batch_size=1 capture the types of requests you're expecting to see at runtime too? Or are you sending requests with greater batch sizes at runtime after the model has loaded? Warmup data shapes should try to capture runtime expectations as much as possible, since different shapes can follow different inference paths, CUDA kernels, etc., which may individually have their own warmup behavior depending on per-framework details.

Another way to ask the question: after sending all of your 100 warmup requests for batch size 1, do you at least see stable response times for batch size 1? If not, is there a threshold of warmup requests (500, 1000, etc.) where you do see quicker stable response times? Does using random_data vs. zero_data have a noticeable effect?
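To cover more than one runtime shape, one option (a sketch, not a verified recommendation) is to declare several warmup samples, e.g. one per batch size you expect at runtime, and switch zero_data for random_data; the names, shapes, and data types below are placeholders.

model_warmup [
  {
    name : "batch 1 random"
    batch_size: 1
    inputs {
      key: "INPUT"                # placeholder input tensor name
      value: { data_type: TYPE_FP32 dims: 4 random_data: true }
    }
  },
  {
    name : "batch 8 random"
    batch_size: 8
    inputs {
      key: "INPUT"
      value: { data_type: TYPE_FP32 dims: 4 random_data: true }
    }
  }
]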

As you point out, these are generally framework/library-specific concepts, with the majority of the "cold start penalty" coming at the TensorFlow/CUDA level. CC @tanmayv25 if you have any more details/thoughts.


asaff1 commented on September 25, 2024

@rmccorm4 thanks for the detailed answer. Yes, I do see improvements depending on the warmup batch size. It would be great to have a more in-depth explanation of this. @tanmayv25


tanmayv25 commented on September 25, 2024

@asaff1 From the model configuration settings you have provided, it seems you are using dynamic batching with max_batch_size set to 128. This means that, depending on the pending request count, the Triton core can send request batches of any size in [1, 128] to the TensorFlow session for execution. Each TensorFlow model consumes some memory for holding the model weights and dynamically allocates extra memory into its memory pool for the tensors, depending on their shapes (including batch size).

I am assuming it is taking you longer to reach a stable value because of the varying batch sizes of the requests being forwarded to the TF model.

My recommendation would be to set batch_size to 128 and send realistic data (some models have data-dependent output shapes) as the warmup sample. This ensures that the resource pool is fully populated to handle requests with such a large batch size. You can also try sending 5 warmup requests.
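A sketch of that recommendation in config form is below; the assumptions are flagged in comments (the input name, data type, shape, and file name are hypothetical, and the count field is assumed to be available for repeating the sample 5 times).

model_warmup [
  {
    name : "max batch warmup"
    batch_size: 128               # matches max_batch_size
    count: 5                      # send 5 warmup requests
    inputs {
      key: "INPUT"                             # hypothetical input tensor name
      value: {
        data_type: TYPE_FP32                   # hypothetical data type
        dims: [ 3, 224, 224 ]                  # hypothetical image shape
        input_data_file: "realistic_sample"    # hypothetical file under the model's warmup/ directory
      }
    }
  }
]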

