Comments (4)
TurboMind is indeed developed based on FasterTransformer, but if you diff the two with Beyond Compare you will see they have diverged so much that they are effectively no longer the same repo:
- FT does not support LLaMA, so you cannot run inference on it directly
- FT has no KV Cache Manager and no reliable quantization; the int8_mode == 2 path actually does not work
- FT has no fused multi-head attention (fmha) and lacks many smaller optimizations
Hope @lzhangzz can give a more detailed description.
@happened In addition to @tpoisonooo's response:
- FT's context decoder implementation requires k_len == q_len, so the context decoder is only used in the first round of a conversation. Our implementation supports context decoding of new input tokens throughout the conversation.
- With our caching mechanism, only the new input tokens are decoded (not the entire history) unless the sequence has been evicted from the cache.
- Our KV Cache Manager implements an LRU policy: the least recently used sequence is evicted down to its token indices (the most compact form of KV cache) and recomputed when requested again, so you don't have to worry about OOM (see the sketch after this list).
- We support persistent batch (you may know it as "continuous batching") both in the Python API and when serving with tritonserver (see the usage sketch after this list).
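A minimal sketch of the LRU eviction idea described above, not TurboMind's actual implementation: the names (KVCacheManager, Sequence, kv_blocks) are hypothetical, and real eviction is driven by GPU memory pressure rather than a fixed sequence count.

```python
from collections import OrderedDict

# Conceptual sketch only: class and field names are hypothetical and do not
# mirror TurboMind's C++ code; eviction here is by sequence count for brevity.
class Sequence:
    def __init__(self, token_ids=None):
        self.token_ids = list(token_ids or [])  # most compact form: just the token indices
        self.kv_blocks = None                   # device-side KV, present only while cached

class KVCacheManager:
    def __init__(self, max_cached_seqs):
        self.max_cached_seqs = max_cached_seqs
        self.active = OrderedDict()  # seq_id -> Sequence, ordered by recency (LRU first)
        self.evicted = {}            # seq_id -> token indices only; KV must be recomputed

    def acquire(self, seq_id, new_token_ids):
        """Return (sequence, needs_recompute) before running a decoding step."""
        if seq_id in self.active:
            seq = self.active[seq_id]
            self.active.move_to_end(seq_id)   # mark as most recently used
            needs_recompute = False           # cache hit: decode only the new tokens
        else:
            # Cache miss: a brand-new sequence or one evicted earlier; its
            # history KV has to be recomputed from the stored token indices.
            seq = Sequence(self.evicted.pop(seq_id, []))
            self._evict_if_full()
            self.active[seq_id] = seq
            needs_recompute = True
        seq.token_ids.extend(new_token_ids)
        return seq, needs_recompute

    def _evict_if_full(self):
        while len(self.active) >= self.max_cached_seqs:
            victim_id, victim = self.active.popitem(last=False)  # least recently used
            self.evicted[victim_id] = victim.token_ids           # keep tokens, drop KV
```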
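For the persistent-batch Python API, a minimal usage sketch, assuming a recent lmdeploy release where the pipeline entry point is available (the exact API has changed across versions); the model name is only illustrative. Passing a list of prompts lets the persistent (continuous) batching engine schedule the requests together.

```python
from lmdeploy import pipeline

# Model path is illustrative; any model supported by the TurboMind engine works.
pipe = pipeline('internlm/internlm2-chat-7b')

# A list of prompts is handled by the persistent (continuous) batching engine:
# requests join and leave the running batch independently.
responses = pipe(['Hi, please introduce yourself.',
                  'What is continuous batching?'])
for r in responses:
    print(r.text)
```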
@happened Please read this PR and give your comments: #101
Related Issues (20)
- Is there a C++ demo for deploying the QWEN-VL model? [Feature] HOT 1
- [Bug] Compilation Error HOT 1
- [Feature] Qwen1.5 support for turbomind HOT 15
- [Bug] serve loading the internlm-chat-7b model raises a qwen-related error HOT 4
- [Bug]weird bugs HOT 5
- [Bug] cannot use 'end' to clear history context under pytorch backend with tp 2
- [Bug] On a machine with A100 80G cards, when deploying qwen1.5 14B with two processes on one card, one of the processes gets killed; is there a parameter to make the two processes share GPU memory evenly? HOT 2
- [Bug] v0.3.0 fails when running llava34b on multiple 2080ti cards HOT 15
- [Bug] argparse.ArgumentError: argument command: conflicting subparser: chat HOT 4
- [Feature] Add a parameter to the lmdeploy pytorch engine that controls the precision used when loading weights. HOT 2
- [Bug] asyncio: This event loop is already running HOT 8
- [Feature] deepseek-coder-base model HOT 17
- [Feature] Support DocOwl1.5 HOT 4
- [Feature] Turbomind engine prefix caching HOT 18
- [Feature] Is kv-cache quantization supported for qwen-vl-int4? HOT 3
- Error when trying to load quantized llava-v1.6-34b HOT 7
- [Bug] Model architecture error when trying to deploy InternLM-xcomposer-7b HOT 9
- [Bug] lmdeploy 启动报错,rank[0] failed with error: model.layers.0.mlp.down_proj.qweight doesn't have any device set. HOT 5
- [Feature] support InternVL-Chat-Chinese-V1-2-Plus HOT 2
- [Feature] prefix-cache HOT 24