
yi-1.5's People

Contributors

anonymitaet · c78c · eltociear · nlmlml · yimi81


yi-1.5's Issues

Problem downloading the model from ModelScope

Why does downloading the model files with the following script raise an error? I have downloaded other models with the same command without any problems.

import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
from modelscope import GenerationConfig
model_dir = snapshot_download('01-ai/Yi-1.5-34B-Chat', cache_dir='/public/home/team4/zerooneai', revision='master')

Quick start code

For the models that are not chat models, could you provide a more appropriate demo than the current Quick Start code?
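
A minimal sketch of what a base-model (non-chat) quick start could look like, assuming the standard transformers text-generation API; the checkpoint name, prompt, and sampling settings are only illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base (non-chat) checkpoint; plain text completion, no chat template.
model_path = "01-ai/Yi-1.5-9B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto")

prompt = "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling parameters here are assumptions, not official defaults.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))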

Chinese generation does not stop

Hi, while doing a simple test following https://github.com/01-ai/Yi-1.5?tab=readme-ov-file#quick-start, I found that generation in Chinese never stops. For example, when asked "你是谁?" (Who are you?), the answer is:

"Hello! I am Yi, a large-scale language model independently developed by 01.AI (零一万物). I can answer questions, provide information, discuss topics, write articles, and more; whatever the domain, I will do my best to help you. If you have any questions or need help, feel free to ask at any time! Is there anything I can do for you? Back again, here is a new answer:

I am 01.AI's artificial intelligence assistant, designed to help users answer questions and to provide information and support. You can ask me about science, technology, history, culture, and all kinds of other topics. If you have any questions, please ask at any time.

May I ask, what do you think artificial intelligence will, in the future...."

Additional notes:

  1. Compared with the original code, the only changes are the messages and adding max_new_tokens = 128 to the generate call (see the sketch after this list):
    messages = [ {"role": "user", "content": "你是谁?"} ]
  2. The model files' md5 checksums have been verified.
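
A minimal reconstruction of the modified quick start, assuming the transformers chat-template API; the checkpoint name and the explicit eos_token_id are assumptions added for illustration:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "01-ai/Yi-1.5-34B-Chat"  # assumed chat checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "你是谁?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# If generation never emits the id of <|im_end|>, it only stops once max_new_tokens is reached.
output_ids = model.generate(
    input_ids,
    max_new_tokens=128,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))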

Yi-1.5-9B benchmark results cannot be reproduced

I used opencompass to evaluate Yi-1.5-9B on the MATH (4-shot), HumanEval / HumanEval Plus (0-shot), and MBPP (3-shot) test sets. The results differ noticeably from the officially reported numbers. Could you provide the official evaluation script or detailed parameters so that the metrics can be reproduced?

My evaluation script and results are below.

  • Script:
cd opencompass
python run.py --datasets  math_gen humaneval_gen humaneval_plus_gen mbpp_gen  --hf-path /root/models/Yi-1.5-9B --model-kwargs device_map='auto' --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False --max-out-len 512 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1
  • Results:
dataset           version    metric                 mode      opencompass.models.huggingface.HuggingFace_models_Yi-1.5-9B

---
math              5f997e     accuracy               gen                                                             28.3
openai_humaneval  8e312c     humaneval_pass@1       gen                                                             25.61
humaneval_plus    8e312c     humaneval_plus_pass@1  gen                                                             21.34
mbpp              3ede66     score                  gen                                                             58.6
mbpp              3ede66     pass                   gen                                                            293
mbpp              3ede66     timeout                gen                                                              4
mbpp              3ede66     failed                 gen                                                             24
mbpp              3ede66     wrong_answer           gen                                                            179

A small suggestion about the development direction

Yi-1.5's "self"-introduction reads: "Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension."

In the vast majority of scenarios, coding and math capabilities are not needed, and models like GPT already do fairly well there.
From my perspective as an application developer, what I would really like is a large model with very strong instruction-following ability that is also energy-efficient.

tokenizer bug

Hi, while using Yi-1.5 I found a decoding problem: many extra spaces appear in the decoded text. For example, the input

[screenshot of the input text]

is decoded as

[screenshot of the output with extra spaces]

What could be the cause?
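
A minimal sketch for checking whether an encode/decode round trip inserts extra spaces; the sample text below is only illustrative, since the original screenshots are not available:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

text = "你好,世界!Hello, world."  # illustrative sample, not the reporter's original input
ids = tokenizer.encode(text, add_special_tokens=False)
roundtrip = tokenizer.decode(ids)

# Comparing the two reprs shows whether decode adds whitespace that was not in the input.
print(repr(text))
print(repr(roundtrip))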

Official WeChat group: Yi User Group

Hi everyone, we are the 01.AI (零一万物) Developer Relations team.
To keep the group discussion high quality and to prevent advertising bots from degrading the experience for members, our WeChat group Yi User Group is invitation-only.
Topics range from model training and downstream applications to deployment and the latest industry developments.
Please add us on WeChat first; once we confirm that you are a Yi model developer, we will invite you into the group.

My WeChat:
[QR code image]

Richard Lin 林旅强
Head of Open Source, 01.AI (零一万物)

Question about the tokenizer encoding of <|im_start|>

I tested with the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
print("token of <|im_start|>: " + str(tokenizer.encode("<|im_start|>")))
print("token of <|im_end|>: " + str(tokenizer.encode("<|im_end|>")))

The result is odd:

token of <|im_start|>: [1581, 59705, 622, 59593, 5858, 46826]
token of <|im_end|>: [7]

In principle, the token of <|im_start|> should be 6.

I am not sure whether this is a tokenizer problem, so I opened PRs on the official repo:
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/12
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/13

Please take a look at whether something is wrong here. Thanks.
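
One way to check whether <|im_start|> is registered as a single special token is to inspect the tokenizer's added tokens directly; a sketch only, not a confirmed diagnosis:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

# If <|im_start|> is a registered special token, this returns a single id (expected to be 6);
# if it is not registered, encode() instead splits the string into several pieces.
print(tokenizer.convert_tokens_to_ids("<|im_start|>"))
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))

# The tokens added on top of the base vocabulary, keyed by id.
print(tokenizer.added_tokens_decoder)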

Quality degrades when switching inference from transformers to vLLM

Model: 01ai/Yi-1.5-9B-Chat
Code: the officially provided code in both cases
Generation parameters: temperature=0.3, top_p=0.7 for both transformers and vLLM
Question: 鸡柳是鸡身上哪个部位? (Which part of the chicken does the chicken tender come from?)

transformers output:
[screenshot]

vLLM output:
[screenshot]

I tried many different generation parameters with vLLM and generated multiple times, but not a single result was correct...
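
For comparison, a minimal vLLM sketch with the reported sampling settings; the chat-template handling and the max_tokens value are assumptions and may differ from the official example code:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "01-ai/Yi-1.5-9B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path)

# Build the prompt with the same chat template that transformers would apply.
messages = [{"role": "user", "content": "鸡柳是鸡身上哪个部位?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.3, top_p=0.7, max_tokens=256)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)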

Fast Tokenizer add unexpected space token

Hi Yi developers, the Yi-1.5-9B tokenizer generates an unexpected space token when tokenizing "<|im_end|>\n" if the fast tokenizer is used with older transformers versions, while it behaves normally with transformers 4.42.4 or without the fast tokenizer.

What is the correct way to tokenize "<|im_end|>\n"?
How is it tokenized in SFT stage?

Old version transformers w/ tokenizer_fast

  • transformers v4.36.5 / v4.41.2
  • use_fast=True
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>")
{'input_ids': [7], 'attention_mask': [1]}
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 59568, 144], 'attention_mask': [1, 1, 1]}

In this case there is an unexpected token 59568, which corresponds to a space.

New transformers w/ tokenizer_fast

  • transformers 4.42.4
  • use_fast=True
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}

Old transformers w/o tokenizer_fast

  • transformers 4.41.2
  • use_fast=False
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}

Does the Yi-1.5-Chat model use the standard ChatML template?

@richardllin @panyx0718 @Imccccc Hi all, could you please advise on this issue?
Does the Yi-1.5-Chat model use the standard ChatML template? Is the bos_token <|im_start|> or <|startoftext|>? Is the eos_token <|im_end|> or <|endoftext|>?
Yi-1.5-34B-Chat-16K/config.json is not consistent with Yi-1.5-34B-Chat-16K/tokenizer_config.json.
During generation or training, will the bos_token be added at the front of the prompt?

As shown in Yi-1.5-34B-Chat-16K/config.json:

"bos_token_id": 1,
"eos_token_id": 2,

As shown in Yi-1.5-34B-Chat-16K/tokenizer_config.json:

"bos_token": "<|startoftext|>",
"eos_token": "<|im_end|>",

"1": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
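
One way to see how these settings actually play out at inference time is to render the chat template and check whether a BOS token is prepended; this is only a sketch and does not answer what was used during SFT:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat-16K")

print(tokenizer.bos_token, tokenizer.eos_token)        # tokens as declared in tokenizer_config.json
print(tokenizer.bos_token_id, tokenizer.eos_token_id)  # ids, to compare against config.json

messages = [{"role": "user", "content": "Hello"}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(repr(rendered))  # shows whether <|startoftext|> / <|im_start|> appears at the front

ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(ids[:5])  # the first ids reveal whether bos_token_id is actually prepended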

Questions about the tokenizer

  1. We know that the vocabulary of Yi-34B, including 1.5, is 64000, so why does the tokenizer contain 3 extra tokens, for an actual total of 64003?
  2. Yi-1.5 uses the new ChatML format as its chat template, which includes the assistant role, but the vocabulary has no token for it (there is one for user), so it gets split into two tokens (ass + istant).

Other problems, such as use_fast producing different outputs and add-bos being enabled by default in the tokenizer config, have also been reported in other issues.
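
Both points above can be checked directly from the tokenizer; a sketch, with the reported values noted in comments rather than verified here:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

# Base vocabulary vs. total size including added special tokens.
print(tokenizer.vocab_size)   # base vocabulary (reported as 64000)
print(len(tokenizer))         # total including added tokens (reported as 64003)

# How the role names used in the ChatML template are split.
print(tokenizer.tokenize("assistant"))  # reportedly splits into two pieces
print(tokenizer.tokenize("user"))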

test

test github issue feeding
