internlm / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Home Page: https://lmdeploy.readthedocs.io/en/latest/

License: Apache License 2.0

Languages: Python 41.29%, Shell 0.34%, CMake 1.57%, Cuda 19.34%, C++ 37.37%, C 0.05%, Dockerfile 0.03%, PowerShell 0.01%
Topics: cuda-kernels, deepspeed, fastertransformer, llm-inference, turbomind, internlm, llama, llm, codellama, llama2

lmdeploy's Introduction

InternLM

👋 join us on Discord and WeChat

Introduction

The InternLM2 series is released with the following features:

  • 200K context window: Nearly perfect needle-in-a-haystack retrieval with a 200K-long context, and leading performance on long-context tasks such as LongBench and L-Eval. Try it with LMDeploy for 200K-context inference.

  • Outstanding comprehensive performance: Significantly better than the previous generation in all dimensions, especially reasoning, math, code, chat experience, instruction following, and creative writing, with leading performance among open-source models of similar size. In some evaluations, InternLM2-Chat-20B may match or even surpass ChatGPT (GPT-3.5).

  • Code interpreter & data analysis: With a code interpreter, InternLM2-Chat-20B achieves performance comparable to GPT-4 on GSM8K and MATH. InternLM2-Chat also provides data analysis capability.

  • Stronger tool use: With improved instruction following, tool selection, and reflection, InternLM2 supports more kinds of agents and multi-step tool calling for complex tasks. See examples.

News

[2024.03.26] We release InternLM2 technical report. See arXiv for details.

[2024.01.31] We release InternLM2-1.8B, along with the associated chat model. They provide a cheaper deployment option while maintaining leading performance.

[2024.01.23] We release InternLM2-Math-7B and InternLM2-Math-20B with pretraining and SFT checkpoints. They surpass ChatGPT despite their small sizes. See InternLM-Math for details and download.

[2024.01.17] We release InternLM2-7B and InternLM2-20B and their corresponding chat models with stronger capabilities in all dimensions. See model zoo below for download or model cards for more details.

[2023.12.13] InternLM-7B-Chat and InternLM-20B-Chat checkpoints are updated. With an improved finetuning strategy, the new chat models can generate higher quality responses with greater stylistic diversity.

[2023.09.20] InternLM-20B is released with base and chat versions.

Model Zoo

Model Transformers(HF) ModelScope(HF) OpenXLab(HF) OpenXLab(Origin) Release Date
InternLM2-1.8B 🤗internlm2-1.8b internlm2-1.8b Open in OpenXLab Open in OpenXLab 2024-01-31
InternLM2-Chat-1.8B-SFT 🤗internlm2-chat-1.8b-sft internlm2-chat-1.8b-sft Open in OpenXLab Open in OpenXLab 2024-01-31
InternLM2-Chat-1.8B 🤗internlm2-chat-1.8b internlm2-chat-1.8b Open in OpenXLab Open in OpenXLab 2024-02-19
InternLM2-Base-7B 🤗internlm2-base-7b internlm2-base-7b Open in OpenXLab Open in OpenXLab 2024-01-17
InternLM2-7B 🤗internlm2-7b internlm2-7b Open in OpenXLab Open in OpenXLab 2024-01-17
InternLM2-Chat-7B-SFT 🤗internlm2-chat-7b-sft internlm2-chat-7b-sft Open in OpenXLab Open in OpenXLab 2024-01-17
InternLM2-Chat-7B 🤗internlm2-chat-7b internlm2-chat-7b Open in OpenXLab Open in OpenXLab 2024-01-17
InternLM2-Base-20B 🤗internlm2-base-20b internlm2-base-20b Open in OpenXLab Open in OpenXLab 2024-01-17
InternLM2-20B 🤗internlm2-20b internlm2-20b Open in OpenXLab Open in OpenXLab 2024-01-17
InternLM2-Chat-20B-SFT 🤗internlm2-chat-20b-sft internlm2-chat-20b-sft Open in OpenXLab Open in OpenXLab 2024-01-17
InternLM2-Chat-20B 🤗internlm2-chat-20b internlm2-chat-20b Open in OpenXLab Open in OpenXLab 2024-01-17

Notes:

The InternLM2 release contains two main model sizes: 7B and 20B. The 7B models are efficient for research and application, while the 20B models are more powerful and can support more complex scenarios. The relationship between these models is as follows.

  1. InternLM2-Base: Foundation models with high quality and high adaptation flexibility, which serve as a good starting point for downstream deep adaptations.
  2. InternLM2: Further pretrained on general-domain data and domain-enhanced corpora, achieving state-of-the-art evaluation results and strong language capability. InternLM2 models are recommended for most applications.
  3. InternLM2-Chat-SFT: An intermediate version of InternLM2-Chat that undergoes only supervised fine-tuning (SFT) on top of the InternLM2-Base model. We release it to support research on alignment.
  4. InternLM2-Chat: Further aligned on top of InternLM2-Chat-SFT through online RLHF. InternLM2-Chat exhibits better instruction following, chat experience, and function calling, and is recommended for downstream applications.

Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.

Supplements: HF refers to the format used by Hugging Face transformers, whereas Origin denotes the format adopted by the InternLM team in InternEvo.

Performance

Objective Evaluation

Dataset Baichuan2-7B-Chat Mistral-7B-Instruct-v0.2 Qwen-7B-Chat InternLM2-Chat-7B ChatGLM3-6B Baichuan2-13B-Chat Mixtral-8x7B-Instruct-v0.1 Qwen-14B-Chat InternLM2-Chat-20B
MMLU 50.1 59.2 57.1 63.7 58.0 56.6 70.3 66.7 66.5
CMMLU 53.4 42.0 57.9 63.0 57.8 54.8 50.6 68.1 65.1
AGIEval 35.3 34.5 39.7 47.2 44.2 40.0 41.7 46.5 50.3
C-Eval 53.9 42.4 59.8 60.8 59.1 56.3 54.0 71.5 63.0
TriviaQA 37.6 35.0 46.1 50.8 38.1 40.3 57.7 54.5 53.9
NaturalQuestions 12.8 8.1 18.6 24.1 14.0 12.7 22.5 22.9 25.9
C3 78.5 66.9 84.4 91.5 79.3 84.4 82.1 91.5 93.5
CMRC 8.1 5.6 14.6 63.8 43.2 27.8 5.3 13.0 50.4
WinoGrande 49.9 50.8 54.2 65.8 61.7 50.9 60.9 55.7 74.8
BBH 35.9 46.5 45.5 61.2 56.0 42.5 57.3 55.8 68.3
GSM8K 32.4 48.3 44.1 70.7 53.8 56.0 71.7 57.7 79.6
MATH 5.7 8.6 12.0 23.0 20.4 4.3 22.5 27.6 31.9
HumanEval 17.7 35.4 36.0 59.8 52.4 19.5 37.8 40.9 67.1
MBPP 37.7 25.7 33.9 51.4 55.6 40.9 40.9 30.0 65.8
  • MBPP performance is reported on MBPP (Sanitized)

Alignment Evaluation

  • We have evaluated our model on AlpacaEval 2.0, where InternLM2-Chat-20B surpasses Claude 2, GPT-4 (0613), and Gemini Pro.
Model Name Win Rate Length
GPT-4 Turbo 50.00% 2049
GPT-4 23.58% 1365
GPT-4 0314 22.07% 1371
Mistral Medium 21.86% 1500
XwinLM 70b V0.1 21.81% 1775
InternLM2 Chat 20B 21.75% 2373
Mixtral 8x7B v0.1 18.26% 1465
Claude 2 17.19% 1069
Gemini Pro 16.85% 1315
GPT-4 0613 15.76% 1140
Claude 2.1 15.73% 1096
  • According to the performance results released on 2024-01-17.

Requirements

  • Python >= 3.8
  • PyTorch >= 1.12.0 (2.0.0 and above are recommended)
  • Transformers >= 4.34

Usage

We briefly show usage with Transformers, ModelScope, and web demos. The chat models adopt the chatml format to support both chat and agent applications. For the best experience, make sure the installed transformers library meets the following requirement before running inference with Transformers or ModelScope:

transformers >= 4.34
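
As a rough illustration of that chatml-style layout (a sketch for intuition only; in practice the model's chat template and model.chat() handle prompt construction, and the exact system-prompt conventions may differ):

# Hypothetical helper sketching the chatml-style layout used by InternLM2 chat models.
def build_chatml_prompt(messages):
    """messages: list of (role, content) tuples, e.g. [("user", "hello")]."""
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages]
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)

print(build_chatml_prompt([("user", "hello")]))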

Import from Transformers

To load the InternLM2-7B-Chat model using Transformers, use the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit via bitsandbytes to further save GPU memory.
#   InternLM 7B in 4-bit costs nearly 8GB of GPU memory.
#   pip install -U bitsandbytes
#   8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
#   4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Output: Hello? How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
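
For streaming output, the model's remote code also provides a stream_chat interface; a minimal sketch, assuming the signature used by the InternLM remote modeling code (it yields the cumulative response at each step):

length = 0
for response, history in model.stream_chat(tokenizer, "hello", history=[]):
    # `response` is the full text generated so far; print only the new part
    print(response[length:], end="", flush=True)
    length = len(response)
print()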

Import from ModelScope

To load the InternLM2-7B-Chat model using ModelScope, use the following code:

import torch
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit via bitsandbytes to further save GPU memory.
#   InternLM 7B in 4-bit costs nearly 8GB of GPU memory.
#   pip install -U bitsandbytes
#   8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
#   4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)

Dialogue

You can interact with the InternLM Chat 7B model through a frontend interface by running the following commands:

pip install streamlit
pip install "transformers>=4.34"
streamlit run ./chat/web_demo.py

Deployment

We use LMDeploy for fast deployment of InternLM.

With only four lines of code, you can perform internlm2-chat-7b inference after pip install "lmdeploy>=0.2.1".

from lmdeploy import pipeline
pipe = pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

Please refer to the guidance for more details on model deployment. For additional deployment tutorials, feel free to explore here.

200K-long-context Inference

By enabling the Dynamic NTK feature of LMDeploy, you can unlock long-context inference.

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0, session_len=200000)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
response = pipe(prompt)
print(response)
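
Sampling parameters can be passed through the imported GenerationConfig; a minimal sketch, assuming the gen_config keyword of the pipeline call and the usual top_p/top_k/temperature/max_new_tokens fields:

gen_config = GenerationConfig(top_p=0.8, top_k=40, temperature=0.8, max_new_tokens=1024)
response = pipe(prompt, gen_config=gen_config)
print(response)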

Agent

InternLM2-Chat models have excellent tool-utilization capabilities and can work with function calls in a zero-shot manner. See more examples in the agent section.

Fine-tuning

Please refer to finetune docs for fine-tuning with InternLM.

Note: We have migrated all training functionality of this project to InternEvo for a simpler user experience. InternEvo provides efficient pre-training and fine-tuning infrastructure for training InternLM.

Evaluation

We use OpenCompass for model evaluation. For InternLM2, we primarily focus on standard objective evaluation, long-context evaluation (needle in a haystack), data contamination assessment, agent evaluation, and subjective evaluation.

Objective Evaluation

To evaluate the InternLM model, please follow the guidelines in the OpenCompass tutorial. Typically, we use ppl for multiple-choice questions on the Base model and gen for all questions on the Chat model.
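
As a rough sketch of what ppl-style scoring means for a multiple-choice question (an illustration only, not OpenCompass's actual implementation): each candidate answer is appended to the question, the model's negative log-likelihood is computed for every continuation, and the lowest-loss option is chosen.

import torch

def ppl_choice(model, tokenizer, question, options):
    # Score each "question + option" string by its mean negative log-likelihood
    # and return the option the model considers most probable.
    losses = []
    for option in options:
        ids = tokenizer(question + option, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())
    return options[losses.index(min(losses))]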

Long-Context Evaluation (Needle in a Haystack)

For the Needle in a Haystack evaluation, refer to the tutorial provided in the documentation. Feel free to try it out.

Data Contamination Assessment

To learn more about data contamination assessment, please check the contamination eval.

Agent Evaluation

  • To evaluate tool utilization, please refer to T-Eval.
  • For code interpreter evaluation, use the Math Agent Evaluation provided in the repository.

Subjective Evaluation

  • Please follow the tutorial for subjective evaluation.

Contribution

We appreciate all the contributors for their efforts to improve and enhance InternLM. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute to the project.

License

The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English)/申请表(中文). For other questions or collaborations, please contact [email protected].

Citation

@misc{cai2024internlm2,
      title={InternLM2 Technical Report},
      author={Zheng Cai and Maosong Cao and Haojiong Chen and Kai Chen and Keyu Chen and Xin Chen and Xun Chen and Zehui Chen and Zhi Chen and Pei Chu and Xiaoyi Dong and Haodong Duan and Qi Fan and Zhaoye Fei and Yang Gao and Jiaye Ge and Chenya Gu and Yuzhe Gu and Tao Gui and Aijia Guo and Qipeng Guo and Conghui He and Yingfan Hu and Ting Huang and Tao Jiang and Penglong Jiao and Zhenjiang Jin and Zhikai Lei and Jiaxing Li and Jingwen Li and Linyang Li and Shuaibin Li and Wei Li and Yining Li and Hongwei Liu and Jiangning Liu and Jiawei Hong and Kaiwen Liu and Kuikun Liu and Xiaoran Liu and Chengqi Lv and Haijun Lv and Kai Lv and Li Ma and Runyuan Ma and Zerun Ma and Wenchang Ning and Linke Ouyang and Jiantao Qiu and Yuan Qu and Fukai Shang and Yunfan Shao and Demin Song and Zifan Song and Zhihao Sui and Peng Sun and Yu Sun and Huanze Tang and Bin Wang and Guoteng Wang and Jiaqi Wang and Jiayu Wang and Rui Wang and Yudong Wang and Ziyi Wang and Xingjian Wei and Qizhen Weng and Fan Wu and Yingtong Xiong and Chao Xu and Ruiliang Xu and Hang Yan and Yirong Yan and Xiaogui Yang and Haochen Ye and Huaiyuan Ying and Jia Yu and Jing Yu and Yuhang Zang and Chuyu Zhang and Li Zhang and Pan Zhang and Peng Zhang and Ruijie Zhang and Shuo Zhang and Songyang Zhang and Wenjian Zhang and Wenwei Zhang and Xingcheng Zhang and Xinyue Zhang and Hui Zhao and Qian Zhao and Xiaomeng Zhao and Fengzhe Zhou and Zaida Zhou and Jingming Zhuo and Yicheng Zou and Xipeng Qiu and Yu Qiao and Dahua Lin},
      year={2024},
      eprint={2403.17297},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


lmdeploy's Issues

HTTP client question

Is there a plain HTTP request client that does not require installing the full lmdeploy package or making gRPC calls, and that does not need streaming, i.e. it returns the complete answer in a single response?
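
Something like the following is what I am after; a minimal sketch assuming the server exposes an OpenAI-compatible route such as the one served by lmdeploy's api_server (the URL, model name, and payload fields here are assumptions):

import requests

resp = requests.post(
    "http://localhost:23333/v1/chat/completions",  # assumed host/port and route
    json={
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "hello"}],
        "stream": False,  # return the complete answer in a single response
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])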

'InternLMTokenizer' object has no attribute 'backend_tokenizer'

When running Inference by TurboMind with the command
python3 -m lmdeploy.turbomind.chat internlm ./workspace/, I get an error:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/chat.py", line 109, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/chat.py", line 43, in main
tokenizer = Tokenizer(tokenizer_model_path)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/tokenizer.py", line 152, in init
self.model = HuggingFaceTokenizer(model_folder)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/tokenizer.py", line 84, in init
self.model.backend_tokenizer.save(backend_tokenizer_file)
AttributeError: 'InternLMTokenizer' object has no attribute 'backend_tokenizer'

I looked at the source code, and it seems this backend_tokenizer is barely used. Is it actually needed?

Question about internlm-chat-7b-8k

I converted the Internlm-chat-7b-8k model to TurboMind format with tp=2; the generated weight/config.ini is shown below. Doesn't 8k mean it supports more than 8,000 tokens? Where is this configured? Right now any call exceeding 2048 tokens raises an error.

[llama]
model_name = internlm-chat-7b
head_num = 32
size_per_head = 128
vocab_size = 103168
num_layer = 32
rotary_embedding = 128
inter_size = 11008
norm_eps = 1e-06
attn_bias = 1
start_id = 1
end_id = 2
weight_type = fp16
max_batch_size = 32
max_context_token_num = 4
session_len = 2056
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
tensor_para_size = 2

[QA] How to convert saved ckpt contents into pytorch_model files

How can I convert the ckpt files saved during training into pytorch_model files? Thanks.

For example, after pre-training for 100 steps with zero=4/tensor=2 on my own data, the saved ckpt folder contains:
context.pt gpus-8_pp-0_tp-0_zo-3.pt gpus-8_pp-0_tp-0_zo-7.pt optimizer_tp0_pp0_zo1.pt optimizer_tp0_pp0_zo5.pt schedulder.pt
gpus-8_pp-0_tp-0_zo-0.pt gpus-8_pp-0_tp-0_zo-4.pt model_config.pt optimizer_tp0_pp0_zo2.pt optimizer_tp0_pp0_zo6.pt topo_tp0_pp0.json
gpus-8_pp-0_tp-0_zo-1.pt gpus-8_pp-0_tp-0_zo-5.pt model_tp0_pp0.pt optimizer_tp0_pp0_zo3.pt optimizer_tp0_pp0_zo7.pt
gpus-8_pp-0_tp-0_zo-2.pt gpus-8_pp-0_tp-0_zo-6.pt optimizer_tp0_pp0_zo0.pt optimizer_tp0_pp0_zo4.pt sampler.pt

Goal: convert them into model files that can be loaded directly by lmdeploy.serve.turbomind.deploy:
config.json modeling_internlm.py pytorch_model.bin.index.json tokenization_internlm.py
configuration_internlm.py pytorch_model-00001-of-00002.bin README.md tokenizer_config.json
generation_config.json pytorch_model-00002-of-00002.bin special_tokens_map.json

How to perform batch inference?

With batch size 2, I ran python3 -m lmdeploy.turbomind.chat llama /workspace [0,1] and filled input_ids with a batch of inputs, but got RuntimeError: output with shape [1] doesn't match the broadcast shape [2].

Test / PB / PersistentBatchInference

(These issues contain only video attachments: PB10.mp4, PB11.mp4, PersistentBatchInference.mp4.)

[P0] Support vicuna 7B

lmdeploy does not support vicuna 7B well, because its preprocessor cannot tokenize <s> and </s> into the bos and eos tokens respectively.

I think we'd better replace the tokenizer (lmdeploy/fastertransformer/triton_models/preprocessing/1/model.py) with Hugging Face's AutoTokenizer.

FYI, here is an introduction to downloading and serving the vicuna-7B v1.1 model.
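
For example, Hugging Face's AutoTokenizer already maps these special tokens; a quick sanity-check sketch (the local path is a placeholder, and the expected ids follow the usual LLaMA convention):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/vicuna-7b-v1.1")  # placeholder path
print(tok.bos_token, tok.bos_token_id)  # expected: <s> 1
print(tok.eos_token, tok.eos_token_id)  # expected: </s> 2
print(tok.encode("<s>hello</s>", add_special_tokens=False))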

ModuleNotFoundError: No module named '_turbomind'

I installed with pip install -e . and tried to run python3 -m lmdeploy.turbomind.chat llama ... but got:

  File "/mnt//lmdeploy/lmdeploy/turbomind/__init__.py", line 3, in <module>
    from .turbomind import TurboMind
  File "/mnt//work/lmdeploy/lmdeploy/turbomind/turbomind.py", line 17, in <module>
    import _turbomind as _tm  # noqa: E402
ModuleNotFoundError: No module named '_turbomind'

Inference gets stuck, possibly before the invocation of the 'forward' method.

Hello, I have run the service successfully. However, when I use the app 'lmdeploy.app.py' and send a message to the server, I notice that the inference gets stuck.

These are the logs of Tritonserver.

[TM][INFO] [forward][rank=0] INPUT: step [1]
[TM][INFO] [forward][rank=0] INPUT: repetition_penalty [1]
[TM][INFO] [forward][rank=0] INPUT: temperature [1]
[TM][INFO] [forward][rank=0] INPUT: STOP [1]
[TM][INFO] [forward][rank=0] INPUT: START [1]
[TM][INFO] [forward][rank=0] INPUT: random_seed [1]
[TM][INFO] [forward][rank=0] INPUT: input_ids [1, 15]
[TM][INFO] [forward][rank=0] INPUT: stop_words_list [1, 2, 2]
[TM][INFO] [forward][rank=0] INPUT: runtime_top_p [1]
[TM][INFO] [forward][rank=0] INPUT: END [1]
[TM][INFO] [forward][rank=0] INPUT: input_lengths [1]
[TM][INFO] [forward][rank=0] INPUT: CORRID [1]
[TM][INFO] [forward][rank=0] INPUT: request_output_len [1, 1]
[TM][INFO] [forward][rank=0] INPUT: session_len [1]
[TM][INFO] [forward][rank=0] OUTPUT: sequence_length [1, 1]
[TM][INFO] [forward][rank=0] OUTPUT: output_ids [1, 1, 2056]
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [synchronize] batch_size = 0
[TM][INFO] [LlamaCacheManager][create] 140002462050048
[TM][INFO] [LlamaCacheManager][allocate]
[TM][INFO] [LlamaCacheManager][allocate] free = 0
[TM][INFO] [init] infer_request_count = 1
[TM][INFO] [init] batch_size = 1
[TM][INFO] [init] session_len = 2056
[TM][INFO] [init] max_input_length = 15
[TM][INFO] [init] max_context_len = 15
[TM][INFO] [init] slot  sequence_id  history_len  input_len  context_len  tmp_input_len  token_ids.size  cache_len
[TM][INFO] [init]    0   3708069632            0         15           15             15               0          0
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 14, max_input_len = 14, max_context_len = 14
[TM][INFO] context decoding start

Based on the source code, I believe it might be stuck before the 'forward' method.

TM_LOG_INFO("context decoding start");

Could you give me some advice? Thanks.

Hugging Face safetensors support

I'm trying to deploy the LLaMA2 70B chat model locally and found that LMDeploy doesn't seem to support Hugging Face safetensors checkpoints. It just raises a confusing exception:

Traceback (most recent call last):
  File "/opt/conda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 592, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 562, in main
    res = deploy_hf(model_name, model_path, tokenizer_path,
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 482, in deploy_hf
    assert num_layer == i, f'miss matched layers: {num_layer} vs {i}'
AssertionError: miss matched layers: 80 vs 0

because it only reads *.bin files:

_files = [file for file in os.listdir(model_path) if file.endswith('.bin')]
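
A sketch of the kind of change being requested (illustrative only; full safetensors support would also need to load the tensors with safetensors.torch.load_file instead of torch.load):

_files = [file for file in os.listdir(model_path)
          if file.endswith('.bin') or file.endswith('.safetensors')]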

[WIP] Support InternLM on 3rd-party inference toolboxes

This issue tracks progress on third-party toolboxes related to InternLM.

VLLM

https://github.com/wangruohui/vllm/tree/internlm

  • Inference with single GPU
    • There seems to be a bug; I'm not sure whether it comes from my implementation or from upstream
  • Tensor parallel

DeepSpeed

InternLM-7B is supported in Deepspeed inference and merged to main branch: microsoft/DeepSpeed#4137

  • Single GPU with kernel injection policy
  • Tensor parallel

Meta tensor for faster model loading: watching microsoft/DeepSpeed#3608

[documentation] Run fails when following the README docs

Get InternLM model

# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-7b /path/to/internlm-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf

Users need to install the requirements for internlm first; only then can python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf run successfully. Could these tips be added to the documentation?

RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

(lmdeploy) ➜  lmdeploy git:(main) python -m lmdeploy.pytorch.chat /mnt/internlm-7b \ 
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0
Traceback (most recent call last):
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 190, in <module>
    fire.Fire(main)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 120, in main
    tokenizer, model = init_model(
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 62, in init_model
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 693, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
    return cls._from_pretrained(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/internlm-7b/tokenization_internlm.py", line 81, in __init__
    self.sp_model.Load(vocab_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 

pip list

(lmdeploy) ➜  lmdeploy git:(main) pip list
Package            Version  Editable project location
------------------ -------- -------------------------
addict             2.4.0
brotlipy           0.7.0
certifi            2023.5.7
cffi               1.15.1
charset-normalizer 2.0.4
contourpy          1.1.0
cryptography       39.0.1
cycler             0.11.0
filelock           3.9.0
fire               0.5.0
fonttools          4.41.0
fsspec             2023.6.0
gmpy2              2.1.2
grpcio             1.56.0
huggingface-hub    0.16.4
idna               3.4
importlib-metadata 6.8.0
Jinja2             3.1.2
kiwisolver         1.4.4
lmdeploy           0.0.1    /mnt/lmdeploy
markdown-it-py     3.0.0
MarkupSafe         2.1.1
matplotlib         3.7.2
mdurl              0.1.2
mkl-fft            1.3.6
mkl-random         1.2.2
mkl-service        2.4.0
mmengine           0.8.2
mpmath             1.2.1
networkx           2.8.4
numpy              1.25.0
opencv-python      4.8.0.74
packaging          23.1
Pillow             9.4.0
pip                23.1.2
platformdirs       3.9.1
protobuf           4.23.4
pybind11           2.11.0
pycparser          2.21
Pygments           2.15.1
pyOpenSSL          23.0.0
pyparsing          3.0.9
PySocks            1.7.1
python-dateutil    2.8.2
python-rapidjson   1.10
PyYAML             6.0
regex              2023.6.3
requests           2.29.0
rich               13.4.2
safetensors        0.3.1
sentencepiece      0.1.99
setuptools         67.8.0
six                1.16.0
sympy              1.11.1
termcolor          2.3.0
tokenizers         0.13.3
tomli              2.0.1
torch              2.0.0
torchaudio         2.0.0
torchvision        0.15.0
tqdm               4.65.0
transformers       4.29.2
triton             2.0.0
tritonclient       2.33.0
typing_extensions  4.6.3
urllib3            1.26.16
wheel              0.38.4
yapf               0.40.1
zipp               3.16.2

Ziya fails to start

(lmdeploy) ➜ lmdeploy sudo bash workspace/service_docker_up.sh

=============================
== Triton Inference Server ==

NVIDIA Release 22.12 (build 50109463)
Triton Server Version 2.29.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I0721 03:34:28.851237 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6d00000000' with size 268435456
I0721 03:34:28.851696 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0721 03:34:28.857984 1 model_lifecycle.cc:459] loading: postprocessing:1
I0721 03:34:28.858019 1 model_lifecycle.cc:459] loading: preprocessing:1
I0721 03:34:28.858036 1 model_lifecycle.cc:459] loading: turbomind:1
I0721 03:34:29.000309 1 libfastertransformer.cc:1746] TRITONBACKEND_Initialize: turbomind
I0721 03:34:29.000337 1 libfastertransformer.cc:1753] Triton TRITONBACKEND API version: 1.10
I0721 03:34:29.000340 1 libfastertransformer.cc:1757] 'turbomind' TRITONBACKEND API version: 1.10
I0721 03:34:29.002218 1 libfastertransformer.cc:1784] TRITONBACKEND_ModelInitialize: turbomind (version 1)
I0721 03:34:29.002902 1 libfastertransformer.cc:307] Instance group type: KIND_CPU count: 48
num_nodes=1
tp_pp_size=1
gpu_size=1
world_size=1
model_instance_size=1
I0721 03:34:29.002929 1 libfastertransformer.cc:346] Sequence Batching: disabled
I0721 03:34:29.002934 1 libfastertransformer.cc:357] Dynamic Batching: disabled
[ERROR] Does not find the section llama with name model_name.

Error on startup

After converting llama-7b, starting it in docker fails.
Conversion command:
python3 -m lmdeploy.serve.turbomind.deploy llama-7b /home/nlp/lwp/pre_models/llama-7b-hf hf
Launch command:
docker run --gpus "device=8" --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest python3 -m lmdeploy.turbomind.chat llama /workspace
Error message:
[WARNING] gemm_config.in is not found; using default GEMM algo
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/chat.py", line 96, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/chat.py", line 35, in main
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path,
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 690, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

Running the script under workspace also fails, reporting: error: creating server: Internal - failed to load all models

What is going wrong here?

Error in ModelLifeCycle::CreateModel() when creating the model

========== step1 ============
On one machine I ran the lmdeploy.serve.turbomind.deploy command with tp=2, because I want to run the model on two GPUs.
This step succeeded.

========== step2 ============
On another machine I ran service_docker_up.sh. Tritonserver started the turbomind backend, but the model failed to load and an error was reported.

Here is the error output:

I0712 08:36:20.719380 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f0c44000000' with size 268435456
I0712 08:36:20.720671 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0712 08:36:20.720696 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
W0712 08:36:20.859316 1 server.cc:218] failed to enable peer access for some device pairs
I0712 08:36:21.439338 1 model_lifecycle.cc:459] loading: turbomind:1
I0712 08:36:21.441099 1 model_lifecycle.cc:459] loading: postprocessing:1
I0712 08:36:21.442823 1 model_lifecycle.cc:459] loading: preprocessing:1
I0712 08:36:21.582434 1 libfastertransformer.cc:1746] TRITONBACKEND_Initialize: turbomind
I0712 08:36:21.582484 1 libfastertransformer.cc:1753] Triton TRITONBACKEND API version: 1.10
I0712 08:36:21.582501 1 libfastertransformer.cc:1757] 'turbomind' TRITONBACKEND API version: 1.10
I0712 08:36:21.585413 1 libfastertransformer.cc:1784] TRITONBACKEND_ModelInitialize: turbomind (version 1)
I0712 08:36:21.586543 1 libfastertransformer.cc:307] Instance group type: KIND_CPU count: 48
E0712 08:36:21.586654 1 libfastertransformer.cc:226] Invalid configuration argument 'tensor_para_size': stoi
[3b379e147757:1    :0:86] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:     86) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x0000000000018313 triton::backend::turbomind_backend::ModelState::ModelState()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:323
 2 0x0000000000024554 triton::backend::turbomind_backend::ModelState::Create()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:182
 3 0x0000000000024b81 TRITONBACKEND_ModelInitialize()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:1791
 4 0x000000000010689b triton::core::TritonModel::Create()  :0
 5 0x00000000001c4f5d triton::core::ModelLifeCycle::CreateModel()  :0
 6 0x00000000001caccd std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#1}>::_M_invoke()  model_lifecycle.cc:0
 7 0x00000000003083a0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()  thread_pool.cc:0
 8 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
 9 0x0000000000008609 start_thread()  ???:0
10 0x000000000011f133 clone()  ???:0
=================================
[3b379e147757:00001] *** Process received signal ***
[3b379e147757:00001] Signal: Floating point exception (8)
[3b379e147757:00001] Signal code:  (-6)
[3b379e147757:00001] Failing at address: 0x1
[3b379e147757:00001] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f0c8d1a9420]
[3b379e147757:00001] [ 1] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(+0x18313)[0x7f0c80652313]
[3b379e147757:00001] [ 2] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(+0x24554)[0x7f0c8065e554]
[3b379e147757:00001] [ 3] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(TRITONBACKEND_ModelInitialize+0x341)[0x7f0c8065eb81]
[3b379e147757:00001] [ 4] /opt/tritonserver/lib/libtritonserver.so(+0x10689b)[0x7f0c8c2de89b]
[3b379e147757:00001] [ 5] /opt/tritonserver/lib/libtritonserver.so(+0x1c4f5d)[0x7f0c8c39cf5d]
[3b379e147757:00001] [ 6] /opt/tritonserver/lib/libtritonserver.so(+0x1caccd)[0x7f0c8c3a2ccd]
[3b379e147757:00001] [ 7] /opt/tritonserver/lib/libtritonserver.so(+0x3083a0)[0x7f0c8c4e03a0]
[3b379e147757:00001] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f0c8be25de4]
[3b379e147757:00001] [ 9] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f0c8d19d609]
[3b379e147757:00001] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f0c8bb10133]
[3b379e147757:00001] *** End of error message ***

Any suggestions for debugging this?

an error about llama-65b

65B
python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh

Is this correct? I found this in the docs.

a bug

I tried to deploy on my server with 2x3090 GPUs and CUDA 11.7.
It deploys normally with the command: "docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest
python3 -m lmdeploy.turbomind.chat internlm /workspace"
However, it cannot be deployed with "bash workspace/service_docker_up.sh" because of a segmentation fault.
(screenshot)

Additionally, when I try "python3 lmdeploy.app {server_ip_addresss}:33337 internlm" to start a client, it reports that torch has no cuda module. This is because lmdeploy/lmdeploy/torch is added to sys.path, and that torch has no cuda module. I fixed this by adding sys.path.remove("lmdeploy/lmdeploy/torch").

RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

(lmdeploy) ➜  lmdeploy git:(main) python -m lmdeploy.serve.turbomind.deploy internlm-7b /mnt/internlm-7b hf
create workspace in directory ./workspace
copy triton model templates from "/mnt/lmdeploy/lmdeploy/serve/turbomind/triton_models" to "./workspace/triton_models" successfully
['pytorch_model-00001-of-00002.bin', 'pytorch_model-00002-of-00002.bin']

### copying layers.31.attention.wo.bias, shape=torch.Size([4096])
layers.31.attention.wo.0.bias torch.Size([4096])
*** splitting layers.31.feed_forward.w1.weight, shape=torch.Size([4096, 11008]), split_dim=-1
layers.31.feed_forward.w1.0.weight torch.Size([4096, 11008])
*** splitting layers.31.feed_forward.w2.weight, shape=torch.Size([11008, 4096]), split_dim=0
layers.31.feed_forward.w2.0.weight torch.Size([11008, 4096])
*** splitting layers.31.feed_forward.w3.weight, shape=torch.Size([4096, 11008]), split_dim=-1
layers.31.feed_forward.w3.0.weight torch.Size([4096, 11008])
layers.31.attention_norm.weight torch.Size([4096])
layers.31.ffn_norm.weight torch.Size([4096])
tok_embeddings.weight torch.Size([103168, 4096])
norm.weight torch.Size([4096])
output.weight torch.Size([103168, 4096])
Traceback (most recent call last):
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 549, in <module>
    fire.Fire(main)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 522, in main
    res = deploy_hf(model_name, model_path, tokenizer_path,
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 462, in deploy_hf
    return export(model_name, num_layer, norm_eps, model_params,
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 167, in export
    vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 87, in tokenizer_info
    sp_model = SentencePieceProcessor(model_file=model_path)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 

(screenshot)

All the LFS files are needed.

Question about benchmark

Hi, I tested LMDeploy with the following steps,

    1. Get models from https://huggingface.co/internlm/internlm-chat-7b/
    2. Convert to triton models: python -m lmdeploy.serve.turbomind.deploy interlm-7b interlm-7b hf
    3. Run python3 profile_generation.py --model_path /workspace/ --model_name internlm --concurrency 8 --input_seqlen 1 --output_seqlen 2048 --test_round 8 in the provided docker image openmmlab/lmdeploy:latest on an A100 80G

The result I get is throughput: 70.98455828512093 token/s, while the documentation shows it should reach almost 640 token/s with batch=8.
(screenshot)

Are there any configurations I need to modify? Thanks.

Question about persistent Batch Inference

Hi, thank you for the open-source LMDeploy project!

There is an image in the documentation that gives a good description of the dynamic batching inference process, but I couldn't find more details about how LMDeploy implements this feature.

Is there any document or could you tell me where this part is implemented in the code?

Comparison with vLLM

vLLM claims a speedup of up to 24x compared with vanilla LLaMA inference. Does lmdeploy have any speed tests comparing against it?

[Bug] TurboMind execute failure: 1

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

When I communicate with the inference server for more than one round, there is an error, and I must reset the session.

TurboMind execute failure:  1
07/31 12:28:43 - service.ft - ERROR - /usr/local/lib/python3.8/dist-packages/lmdeploy/serve/turbomind/chatbot.py - stream_consumer - 553 - got error from turbomind, code StatusCode.TRITON_SERVER_ERR, TurboMind execute failure:  1, token 677

Reproduction

Communicate with the inference server for more than one round.

Error traceback

No response

Support for tritonserver

I looked at the code, and the Triton backend seems to be supported, but there is no int8_mode option. Is int8 not supported?

Using DeepSpeed TP to load InternLM, but memory is not reduced

My hardware is 4x 16GB V100 GPUs. Following the README.md, I ran PyTorch-based inference:

python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL\
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0

After loading the model, single-GPU inference occupies 15 GB of GPU memory.
(screenshot)

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0

With two-GPU TP inference, each card also occupies 15 GB after the model is loaded.
(screenshot)

The expected reduction in GPU memory is not achieved. I don't quite understand where the problem lies.

Trouble with 'Quantization' in '/README.md'

Here is the log:

(lmdeploy_test) [xxx@xxxxxxxxxxxx internlm_test]$ python -m lmdeploy.lite.apis.kv_qparams --model internlm-chat-7b --output_dir internlm-chat-7b-deploy --symmetry True --offload  False --num_tp 1
Traceback (most recent call last):
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nvme/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/lmdeploy/lite/apis/kv_qparams.py", line 199, in <module>
    fire.Fire(main)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/lmdeploy/lite/apis/kv_qparams.py", line 112, in main
    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 688, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class InternLMTokenizer does not exist or is not currently imported.

It looks like it cannot load 'InternLMTokenizer' properly.

Can llama 65b KV cache quantization and context FMHA not be enabled at the same time?

During deployment testing, llama 7b runs fine with use_context_fmha = 1 and quant_policy = 4, but llama 65b does not and requires use_context_fmha = 0. Is this a problem on my side, or is it currently not possible to enable both at the same time?

The error is:
what(): [TM][ERROR] CUDA runtime error: an illegal memory access was encountered lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:843

Deployment following README.md fails with [FT][ERROR] CUDA runtime error: invalid argument /opt/trito

Step 1: Download the internlm-chat-7b model.

Step 2: Run the docker image: docker run -itd --net=host --name internlm_server --gpus all -v ./workspace/:/workspace -v /data/models/:/models -it openmmlab/lmdeploy:latest bash

Step 3: Convert the model to TurboMind format (I have two T4 GPUs):
root@:/opt/tritonserver/lmdeploy# python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /models/internlm-chat-7b hf --tp 2

Step 4: Switch to the directory and run:
root@/opt/tritonserver/lmdeploy# python3 -m lmdeploy.turbomind.chat internlm ./workspace/

The error is:

[WARNING] gemm_config.in is not found; using default GEMM algo
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: invalid argument /opt/tritonserver/lmdeploy/src/turbomind/utils/allocator.h:252

Aborted (core dumped)
