Comments (4)

oandreeva-nv commented on August 15, 2024

Hi @CoolFish88 ,
Thank you for your questions and your interest in Triton + vLLM.

  1. Connection between model.json and model.py.
    model.json specifies vLLM engine arguments, e.g. model_name, tensor_parallel_size, gpu_memory_utilization, etc. For the full list of parameters, please refer to https://github.com/vllm-project/vllm/blob/b422d4961a3052c5b4bcfc3747a1ad55acfe7eb8/vllm/engine/arg_utils.py#L23
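
    For illustration, a minimal model.json might look like the sketch below. The model path and values are placeholders, and the exact key names should be verified against vLLM's arg_utils.py linked above:

    ```json
    {
      "model": "facebook/opt-125m",
      "tensor_parallel_size": 1,
      "gpu_memory_utilization": 0.85
    }
    ```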

model.py is a Triton-required file. You can learn more about its components here.

Now, the following may sound confusing, so feel free to ask questions. The vLLM backend depends on the python_backend to load and serve models. To learn more about Python-based backends, please refer here.
Alternatively to using the vLLM backend, you can always deploy any vLLM model with the python backend directly. However, in the case of vLLM it is sufficient to implement the TritonPythonModel interface only once and re-use it across multiple models, each specified by its own model.json. For this use case we've introduced the Python-based backend feature, where Triton Inference Server treats a common model.py script as a backend: it loads libtriton_python.so first, which ensures that Triton knows how to send requests to the backend for execution and that the backend knows how to communicate with Triton. Triton then uses the common model.py from the backend's repository instead of looking for it in the model repository; a rough layout is sketched below.
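
To make the moving parts concrete, here is a rough sketch of how the pieces are laid out when using the vLLM Python-based backend, assuming the backend directory is named vllm and the standard container paths are used (my_vllm_model is a placeholder name; exact locations may vary between releases):

```
/opt/tritonserver/backends/vllm/
    model.py                        # common TritonPythonModel implementation (the "backend")
/opt/tritonserver/backends/python/
    libtriton_python.so             # implements the backend API so Triton can drive model.py
    triton_python_backend_stub
    triton_python_backend_utils.py

model_repository/
    my_vllm_model/
        config.pbtxt                # backend: "vllm"
        1/
            model.json              # vLLM engine arguments for this model
```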

  2. Will this overwrite the contents of a user-supplied config.pbtxt in the model registry?

No. If you happen to notice any unexpected behaviors, please file a bug.

  3. If negative, will vLLM-specific settings in auto_complete_config still be executed?

Yes.

  4. What would be the best approach in the vLLM backend context to operate in with user-supplied config files?

Could you please clarify this question?

  5. I already have a model.py which I would like to operate with the vLLM backend.
    Wonderful! If you would like to use only your custom model.py as the new vLLM backend, simply replace/put it in the backends/vllm_backend directory inside the Docker container. Alternatively, you can specify backend: python in the config.pbtxt file of your model and put model.py under model_repository/<your_vllm_model>/<model_version>/ (a rough layout for this option is sketched below). However, if you would like to re-use one model.py across models, then the first option works better.
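
    As a rough sketch of the second option (backend: python with model.py inside the model repository; all names below are placeholders):

    ```
    model_repository/
        my_vllm_model/
            config.pbtxt
            1/
                model.py          # your TritonPythonModel implementation
                model.json        # optional, if your model.py reads engine arguments from it
    ```

    with a config.pbtxt along these lines:

    ```
    name: "my_vllm_model"
    backend: "python"
    ```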

dyastremsky commented on August 15, 2024

The model.py is what's used by Triton as the vLLM backend (technically an implementation of the Python backend). It'll exist in a vLLM directory under the backends directory. If you want to understand that directory structure better, you can read about how Triton backends and custom backends work in the backend repo.

If you're trying to modify model.py to work with the vLLM engine differently, that's the place to do it. You're essentially creating your own custom backend though, so that may require going through documentation a bit, including the links above. Otherwise, you'd want to leave this file as is and that tutorial should set you in the right direction.

As for autocomplete, it should only run if the config.pbtxt is not provided or certain fields are missing.
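
For reference, auto-complete is implemented via a static auto_complete_config method on TritonPythonModel, which Triton calls before loading the model. A minimal sketch, with illustrative tensor names and shapes, might look like this:

```python
class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        # Takes effect only when config.pbtxt is absent or leaves these fields unspecified.
        # Tensor names, data types, and dims below are illustrative placeholders.
        auto_complete_model_config.set_max_batch_size(0)
        auto_complete_model_config.add_input(
            {"name": "text_input", "data_type": "TYPE_STRING", "dims": [1]})
        auto_complete_model_config.add_output(
            {"name": "text_output", "data_type": "TYPE_STRING", "dims": [-1]})
        return auto_complete_model_config
```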

oandreeva-nv commented on August 15, 2024

> When using the Python backend, Triton Inference Server would look into my model repository and fetch my custom model.py implementation as a backend to use

Not really. When you explicitly use the python backend, it looks for libtriton_python.so (in your model's repo and under backends/python/) as the backend, and your model.py is considered a model: it contains the implementation of the execute function, which defines how to run your model.

libtriton_python.so is the library that implements the backend APIs, so that Triton Server knows how to run model.py. And model.py is basically your custom component.
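
As a very rough sketch of what such a model.py implementing TritonPythonModel looks like (tensor names and the echo logic are placeholders, not the actual vLLM implementation):

```python
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] holds the model configuration as a JSON string.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # Triton batches requests; return one response per request.
        responses = []
        for request in requests:
            text = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            # Echo the input back as a stand-in for real inference logic.
            out = pb_utils.Tensor("text_output", text)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        pass
```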

> When using a Python-based backend (e.g. vLLM), Triton Inference Server would ignore the model.py file located in the model repository and use the model.py file residing (together with the three file artefacts) in its own repository under /opt/tritonserver/backends/

It will try to find a .so library first; if nothing is found, it will look for libtriton_python.so and use model.py as the common model definition for all models that use this Python-based backend.

> as my existing model.py implementation of the TritonPythonModel interface is devoid of any vLLM sugar and contains custom code in the execute and initialize methods, it seems that I have to merge the vLLM model.py

If you don't need anything from the vLLM model.py we provide, there is no need to use it. You can put your model.py under backends/vllm_custom and then specify backend: vllm_custom in your models' config.pbtxt files; make sure backends/python is also present and contains the 3 files. If you need something we provide in the vLLM Python-based backend, then yes, feel free to merge your two files.
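
For the first option, the model's config.pbtxt would then name the custom backend instead of python, along these lines (my_model and vllm_custom are placeholder names):

```
name: "my_model"
backend: "vllm_custom"
```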

CoolFish88 commented on August 15, 2024

Hello @oandreeva-nv and @dyastremsky,

Thank you for addressing my questions with plenty of detail.

I had previously consulted the documentation of the python backend when drafting the custom model.py file according to the TritonPythonModel interface. This file would reside under the model repository path, as showcased here, and the config.pbtxt file would reference the python backend. And the world was a happy place to be in.

Then I decided to leverage the advantages of the vLLM framework and things started to get really foggy really fast. At that point in time, I came across the Python-based backends that you referenced in the answers, and this is when things started to get unclear, as my model repository wouldn't contain the artefacts: libtriton_python.so, triton_python_backend_stub, and triton_python_backend_utils.py.

Correct me if my understanding is wrong:

  • When using the Python backend, Triton Inference Server would look into my model repository and fetch my custom model.py implementation as a backend to use (if it doesn't exist, will it fetch the backend implementation by looking into /opt/tritonserver/backends/?), and use the three file artefacts (libtriton_python.so, triton_python_backend_stub, and triton_python_backend_utils.py) from the backend repository (since they are not present in the model repository).

  • When using a Python-based backend (e.g. vLLM), Triton Inference Server would ignore the model.py file located in the model repository and use the model.py file residing (together with the three file artefacts) in its own repository under /opt/tritonserver/backends/

Now, as my existing model.py implementation of the TritonPythonModel interface is devoid of any vLLM sugar and contains custom code in the execute and initialize methods, it seems that I have to merge the vLLM model.py with my own and resort to one of the two strategies that you @oandreeva-nv mentioned under bullet point 5. If I go for the first strategy, I may use model.json to pass parameters to the refactored version of the vLLM backend containing the bits of custom code I need.

Could you please validate my understanding?
