Comments (6)
Hi @asaff1, thanks for the detailed description!
Warmup criteria and behavior can vary a bit with each framework. One suggestion I'd be interested in seeing the results of: can you try doing server-side warmup? That way, by the time PA (Perf Analyzer) or a client starts hitting the server, it is ideally already warmed up, or at least closer to it.
There are docs here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#model-warmup
And you can see an example warmup configuration here:
server/qa/L0_warmup/failing_infer/config.pbtxt (lines 44 to 56 in a168d51)
You can choose random data, or use an input data file that is more representative of data you'd expect to see at runtime in your use case: https://github.com/triton-inference-server/common/blob/00b3a71519e32e3bc954e9f0d067e155ef8f1a6c/protobuf/model_config.proto#L1721
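For concreteness, a warmup stanza in config.pbtxt looks roughly like the sketch below. This is a minimal example, assuming an image model with a single FP32 input; the input name, dims, and count are placeholders rather than values from your actual model config.

```
# Hypothetical warmup block; adjust input name, dims, and count to your model.
model_warmup [
  {
    name: "random_sample"
    batch_size: 1
    count: 100               # number of warmup iterations for this sample
    inputs {
      key: "INPUT"           # placeholder input name
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]    # per-request shape, excluding the batch dimension
        random_data: true        # could instead be zero_data or input_data_file
      }
    }
  }
]
```

The warmup samples are run when the model loads, so that cost shifts to load time rather than to the first client requests.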
Hi @rmccorm4,
I did some experiments with warmup. I tried setting 100 warmup requests with batch size = 1, yet it still takes a few minutes until response times stabilize.
Another interesting thing to note: after the server is "warmed up" (i.e., after sending requests at a throughput of 800 images/sec for 6 minutes), even if I stop the client and the server (and the GPU) sits idle for a few hours, the next time the client starts the server is still "warmed up" and answers fast. Only if I stop and restart the Triton server process do I need to do the warmup all over again.
So I assume the cause is somewhere in the software (TensorFlow, CUDA, Triton, etc.): something might be doing optimization at runtime, or has some lazily initialized parts. I'm looking for information about that.
Hi @asaff1, does batch_size=1 capture the types of requests you're expecting to see at runtime too? Or are you sending requests with greater batch sizes at runtime after the model has loaded? Warmup data shapes should try to capture runtime expectations as much as possible, as different shapes can follow different inference paths, CUDA kernels, etc., which may individually have some warmup based on per-framework details.
Another way to ask the question: after sending all of your 100 warmup requests for batch size 1, do you at least see stable response times for batch size 1? If not, is there a threshold of warmup requests (500, 1000, etc.) where you do see quicker stable response times? Does using random_data vs zero_data have a noticeable effect?
These are generally framework/library-specific concepts, as you point out, at the TensorFlow/CUDA level for the majority of the "cold start penalty". CC @tanmayv25 if you have any more details/thoughts.
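To illustrate covering multiple runtime shapes, a config can declare more than one warmup sample, e.g. one per batch size you expect to serve, and you can switch between zero_data and random_data per sample to compare their effect. A sketch only; the input name, dims, batch sizes, and counts below are placeholders:

```
model_warmup [
  {
    name: "bs1_zero"
    batch_size: 1
    count: 50
    inputs {
      key: "INPUT"
      value: { data_type: TYPE_FP32  dims: [ 3, 224, 224 ]  zero_data: true }
    }
  },
  {
    name: "bs8_random"
    batch_size: 8            # match a batch size you actually expect at runtime
    count: 50
    inputs {
      key: "INPUT"
      value: { data_type: TYPE_FP32  dims: [ 3, 224, 224 ]  random_data: true }
    }
  }
]
```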
@rmccorm4 thanks for the detailed answer. Yes, I do see improvements depending on the warmup batch size. It would be great to have a more in-depth explanation of this. @tanmayv25
@asaff1 From the model configuration settings you have provided, it seems that you are using dynamic batching with max_batch_size set to 128. This means that, depending on the pending request count, the Triton core can send request batches of any size in [1, 128] to the TensorFlow session for execution. Each TensorFlow model consumes some memory for holding the model weights and dynamically allocates extra memory into its memory pool for tensors, depending on their shape (including batch size).
I am assuming that it is taking you longer to reach a stable value because of the varying batch sizes of the requests being forwarded to the TF model.
My recommendation would be to set the warmup batch_size to 128 and send realistic data (some models have data-dependent output shapes) as the warmup sample. This would ensure that the resource pool is completely populated to handle requests with such a large batch size. You can also try sending 5 warmup requests.
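A sketch of that recommendation in config.pbtxt terms (the input name, dims, and file name are placeholders; per the warmup docs, an input_data_file is read from a "warmup" subdirectory under the model directory, if I remember the layout correctly):

```
model_warmup [
  {
    name: "max_batch_realistic"
    batch_size: 128          # match max_batch_size so the memory pool is grown up front
    count: 5                 # repeat the sample a few times
    inputs {
      key: "INPUT"           # placeholder input name
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]             # placeholder per-request shape
        input_data_file: "sample_input"   # raw tensor bytes under <model_dir>/warmup/
      }
    }
  }
]
```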