Comments (4)
TurboMind is indeed developed based on FasterTransformer, but if you diff the two with Beyond Compare you will see they have diverged so much that they are effectively no longer the same repo:
- FT does not support LLaMA, so you cannot run inference on it directly
- FT has no KV Cache Manager and no reliable quantization; the int8_mode == 2 path actually does not work
- FT has no fused multi-head attention (fmha) and lacks many smaller optimizations
Hope @lzhangzz can give a more detailed description.
@happened In addition to @tpoisonooo's response:
- FT's context decoder implementation requires k_len == q_len, so the context decoder is only used in the first round of a conversation. Our implementation supports context decoding of new input tokens throughout the conversation.
- With our caching mechanism, only the new input tokens are decoded (not the entire history) unless the sequence has been evicted from the cache.
- Our KV Cache Manager implements an LRU policy: the least recently used sequence is evicted down to its token indices (the most compact form of KV cache) and recomputed when requested again, so you don't have to worry about OOM (see the sketch after this list).
- We support persistent batch (you may know it as "continuous batching") both in the Python API and when serving with tritonserver (see the usage sketch after this list).
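A minimal sketch of the LRU eviction idea described above, not TurboMind's actual implementation: the names (KVCacheManager, Sequence, kv_blocks) are hypothetical, and real eviction is driven by GPU memory pressure rather than a fixed sequence count.

```python
from collections import OrderedDict

# Conceptual sketch only: class and field names are hypothetical and do not
# mirror TurboMind's C++ code; eviction here is by sequence count for brevity.
class Sequence:
    def __init__(self, token_ids=None):
        self.token_ids = list(token_ids or [])  # most compact form: just the token indices
        self.kv_blocks = None                   # device-side KV, present only while cached

class KVCacheManager:
    def __init__(self, max_cached_seqs):
        self.max_cached_seqs = max_cached_seqs
        self.active = OrderedDict()  # seq_id -> Sequence, ordered by recency (LRU first)
        self.evicted = {}            # seq_id -> token indices only; KV must be recomputed

    def acquire(self, seq_id, new_token_ids):
        """Return (sequence, needs_recompute) before running a decoding step."""
        if seq_id in self.active:
            seq = self.active[seq_id]
            self.active.move_to_end(seq_id)   # mark as most recently used
            needs_recompute = False           # cache hit: decode only the new tokens
        else:
            # Cache miss: a brand-new sequence or one evicted earlier; its
            # history KV has to be recomputed from the stored token indices.
            seq = Sequence(self.evicted.pop(seq_id, []))
            self._evict_if_full()
            self.active[seq_id] = seq
            needs_recompute = True
        seq.token_ids.extend(new_token_ids)
        return seq, needs_recompute

    def _evict_if_full(self):
        while len(self.active) >= self.max_cached_seqs:
            victim_id, victim = self.active.popitem(last=False)  # least recently used
            self.evicted[victim_id] = victim.token_ids           # keep tokens, drop KV
```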
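For the persistent-batch Python API, a minimal usage sketch, assuming a recent lmdeploy release where the pipeline entry point is available (the exact API has changed across versions); the model name is only illustrative. Passing a list of prompts lets the persistent (continuous) batching engine schedule the requests together.

```python
from lmdeploy import pipeline

# Model path is illustrative; any model supported by the TurboMind engine works.
pipe = pipeline('internlm/internlm2-chat-7b')

# A list of prompts is handled by the persistent (continuous) batching engine:
# requests join and leave the running batch independently.
responses = pipe(['Hi, please introduce yourself.',
                  'What is continuous batching?'])
for r in responses:
    print(r.text)
```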
@happened Please read this PR and give your comments: #101
Related Issues (20)
- Is there a C++ demo for deploying the QWEN-VL model? [Feature] HOT 1
- [Bug] Compilation Error HOT 1
- [Feature] Qwen1.5 support for turbomind HOT 15
- [Bug] serve loading the internlm-chat-7b model raises a qwen-related error HOT 4
- [Bug]weird bugs HOT 5
- [Bug] cannot use 'end' to clear history context under pytorch backend with tp 2
- [Bug] On a machine with A100 80G cards, when deploying qwen1.5 14B with two processes on one card, one of the processes gets killed; is there a parameter to make the two processes share GPU memory evenly? HOT 2
- [Bug] v0.3.0 fails when running llava34b on multiple 2080ti cards HOT 15
- [Bug] argparse.ArgumentError: argument command: conflicting subparser: chat HOT 4
- [Feature] Add a parameter to the lmdeploy pytorch engine that controls the precision used when loading weights. HOT 2
- [Bug] asyncio: This event loop is already running HOT 8
- [Feature] deepseek-coder-base model HOT 17
- [Feature] Support DocOwl1.5 HOT 4
- [Feature] Turbomind engine prefix caching HOT 18
- [Feature] Is kv-cache quantization supported for qwen-vl-int4? HOT 3
- Error when trying to load quantized llava-v1.6-34b HOT 7
- [Bug] Model architecture error when trying to deploy InternLM-xcomposer-7b HOT 9
- [Bug] lmdeploy 启动报错,rank[0] failed with error: model.layers.0.mlp.down_proj.qweight doesn't have any device set. HOT 5
- [Feature] support InternVL-Chat-Chinese-V1-2-Plus HOT 2
- [Feature] prefix-cache HOT 24