
yi-1.5's People

Contributors

anonymitaet · c78c · eltociear · nlmlml · yimi81


yi-1.5's Issues

Problem downloading the model from ModelScope

Why does downloading the model files with the following script raise an error? I have downloaded other models with the same command without any problems.

import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
from modelscope import GenerationConfig
model_dir = snapshot_download('01-ai/Yi-1.5-34B-Chat', cache_dir='/public/home/team4/zerooneai', revision='master')

Quick start code

For the models that are not chat models, could you provide a more appropriate demo than the current Quick Start code?
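
A minimal sketch of what a base-model (non-chat) quick start could look like, assuming the standard transformers text-generation API; the checkpoint name, prompt, and sampling settings are only illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base (non-chat) checkpoint; plain text completion, no chat template.
model_path = "01-ai/Yi-1.5-9B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto")

prompt = "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling parameters here are assumptions, not official defaults.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))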

Chinese generation does not stop

Hi, while doing a simple test following https://github.com/01-ai/Yi-1.5?tab=readme-ov-file#quick-start, I found that generation in Chinese never stops. For example, when asked "你是谁?" (Who are you?), the answer is:

"Hello! I am Yi, a large-scale language model independently developed by 01.AI (零一万物). I can answer questions, provide information, discuss topics, write articles, and more; whatever the domain, I will do my best to help you. If you have any questions or need help, feel free to ask at any time! Is there anything I can do for you? Back again, here is a new answer:

I am 01.AI's artificial intelligence assistant, designed to help users answer questions and to provide information and support. You can ask me about science, technology, history, culture, and all kinds of other topics. If you have any questions, please ask at any time.

May I ask, what do you think artificial intelligence will, in the future...."

Additional notes:

  1. Compared with the original code, the only changes are the messages and adding max_new_tokens = 128 to the generate call (see the sketch after this list):
    messages = [ {"role": "user", "content": "你是谁?"} ]
  2. The model files' md5 checksums have been verified.
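
A minimal reconstruction of the modified quick start, assuming the transformers chat-template API; the checkpoint name and the explicit eos_token_id are assumptions added for illustration:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "01-ai/Yi-1.5-34B-Chat"  # assumed chat checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "你是谁?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# If generation never emits the id of <|im_end|>, it only stops once max_new_tokens is reached.
output_ids = model.generate(
    input_ids,
    max_new_tokens=128,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))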

Yi-1.5-9B benchmark results cannot be reproduced

I used opencompass to evaluate Yi-1.5-9B on the MATH (4-shot), HumanEval / HumanEval Plus (0-shot), and MBPP (3-shot) test sets. The results differ noticeably from the officially reported numbers. Could you provide the official evaluation script or detailed parameters so that the metrics can be reproduced?

My evaluation script and results are below.

  • Script:
cd opencompass
python run.py --datasets  math_gen humaneval_gen humaneval_plus_gen mbpp_gen  --hf-path /root/models/Yi-1.5-9B --model-kwargs device_map='auto' --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False --max-out-len 512 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1
  • Results:
dataset           version    metric                 mode      opencompass.models.huggingface.HuggingFace_models_Yi-1.5-9B

---
math              5f997e     accuracy               gen                                                             28.3
openai_humaneval  8e312c     humaneval_pass@1       gen                                                             25.61
humaneval_plus    8e312c     humaneval_plus_pass@1  gen                                                             21.34
mbpp              3ede66     score                  gen                                                             58.6
mbpp              3ede66     pass                   gen                                                            293
mbpp              3ede66     timeout                gen                                                              4
mbpp              3ede66     failed                 gen                                                             24
mbpp              3ede66     wrong_answer           gen                                                            179

A small suggestion about the development direction

Yi-1.5's "self"-introduction reads: "Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension."

In the vast majority of scenarios, coding and math capabilities are not needed, and models like GPT already do fairly well there.
From my perspective as an application developer, what I would really like is a large model with very strong instruction-following ability that is also energy-efficient.

tokenizer bug

Hi, while using Yi-1.5 I found a decoding problem: many extra spaces appear in the decoded text. For example, the input

[screenshot of the input text]

is decoded as

[screenshot of the output with extra spaces]

What could be the cause?
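
A minimal sketch for checking whether an encode/decode round trip inserts extra spaces; the sample text below is only illustrative, since the original screenshots are not available:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

text = "你好,世界!Hello, world."  # illustrative sample, not the reporter's original input
ids = tokenizer.encode(text, add_special_tokens=False)
roundtrip = tokenizer.decode(ids)

# Comparing the two reprs shows whether decode adds whitespace that was not in the input.
print(repr(text))
print(repr(roundtrip))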

Official WeChat group: Yi User Group

Hi everyone, we are the 01.AI (零一万物) Developer Relations team.
To keep the group discussion high quality and to prevent advertising bots from degrading the experience for members, our WeChat group Yi User Group is invitation-only.
Topics range from model training and downstream applications to deployment and the latest industry developments.
Please add us on WeChat first; once we confirm that you are a Yi model developer, we will invite you into the group.

My WeChat:
[QR code image]

Richard Lin 林旅强
Head of Open Source, 01.AI (零一万物)

Question about the tokenizer encoding of <|im_start|>

I tested with the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
print("token of <|im_start|>: " + str(tokenizer.encode("<|im_start|>")))
print("token of <|im_end|>: " + str(tokenizer.encode("<|im_end|>")))

The result is odd:

token of <|im_start|>: [1581, 59705, 622, 59593, 5858, 46826]
token of <|im_end|>: [7]

In principle, the token of <|im_start|> should be 6.

I am not sure whether this is a tokenizer problem, so I opened PRs on the official repo:
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/12
https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/13

Please take a look at whether something is wrong here. Thanks.
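
One way to check whether <|im_start|> is registered as a single special token is to inspect the tokenizer's added tokens directly; a sketch only, not a confirmed diagnosis:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

# If <|im_start|> is a registered special token, this returns a single id (expected to be 6);
# if it is not registered, encode() instead splits the string into several pieces.
print(tokenizer.convert_tokens_to_ids("<|im_start|>"))
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))

# The tokens added on top of the base vocabulary, keyed by id.
print(tokenizer.added_tokens_decoder)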

Quality degrades when switching inference from transformers to vLLM

Model: 01ai/Yi-1.5-9B-Chat
Code: the officially provided code in both cases
Generation parameters: temperature=0.3, top_p=0.7 for both transformers and vLLM
Question: 鸡柳是鸡身上哪个部位? (Which part of the chicken does the chicken tender come from?)

transformers output:
[screenshot]

vLLM output:
[screenshot]

I tried many different generation parameters with vLLM and generated multiple times, but not a single result was correct...
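
For comparison, a minimal vLLM sketch with the reported sampling settings; the chat-template handling and the max_tokens value are assumptions and may differ from the official example code:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "01-ai/Yi-1.5-9B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path)

# Build the prompt with the same chat template that transformers would apply.
messages = [{"role": "user", "content": "鸡柳是鸡身上哪个部位?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.3, top_p=0.7, max_tokens=256)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)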

Fast Tokenizer add unexpected space token

Hi Yi developers, the Yi-1.5-9B tokenizer generates an unexpected space token when tokenizing "<|im_end|>\n" if the fast tokenizer is used with older transformers versions, while it behaves normally with transformers 4.42.4 or without the fast tokenizer.

What is the correct way to tokenize "<|im_end|>\n"?
How is it tokenized in SFT stage?

Old version transformers w/ tokenizer_fast

  • transformers v4.36.5 / v4.41.2
  • use_fast=True
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>")
{'input_ids': [7], 'attention_mask': [1]}
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 59568, 144], 'attention_mask': [1, 1, 1]}

In this case there is an unexpected token 59568, which corresponds to a space.

New transformers w/ tokenizer_fast

  • transformers 4.42.4
  • use_fast=True
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}

Old transformers w/o tokenizer_fast

  • transformers 4.41.2
  • use_fast=False
>>> tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)
>>> tokenizer("<|im_end|>\n")
{'input_ids': [7, 144], 'attention_mask': [1, 1]}

Does the Yi-1.5-Chat model use the standard ChatML template?

@richardllin @panyx0718 @Imccccc Hi all, could you please advise on this issue?
Does the Yi-1.5-Chat model use the standard ChatML template? Is the bos_token <|im_start|> or <|startoftext|>? Is the eos_token <|im_end|> or <|endoftext|>?
Yi-1.5-34B-Chat-16K/config.json is not consistent with Yi-1.5-34B-Chat-16K/tokenizer_config.json.
During generation or training, will the bos_token be added at the front of the prompt?

As shown in Yi-1.5-34B-Chat-16K/config.json:

"bos_token_id": 1,
"eos_token_id": 2,

As shown in Yi-1.5-34B-Chat-16K/tokenizer_config.json:

"bos_token": "<|startoftext|>",
"eos_token": "<|im_end|>",

"1": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
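
One way to see how these settings actually play out at inference time is to render the chat template and check whether a BOS token is prepended; this is only a sketch and does not answer what was used during SFT:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-34B-Chat-16K")

print(tokenizer.bos_token, tokenizer.eos_token)        # tokens as declared in tokenizer_config.json
print(tokenizer.bos_token_id, tokenizer.eos_token_id)  # ids, to compare against config.json

messages = [{"role": "user", "content": "Hello"}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(repr(rendered))  # shows whether <|startoftext|> / <|im_start|> appears at the front

ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(ids[:5])  # the first ids reveal whether bos_token_id is actually prepended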

Questions about the tokenizer

  1. We know that the vocabulary of Yi-34B, including 1.5, is 64000, so why does the tokenizer contain 3 extra tokens, for an actual total of 64003?
  2. Yi-1.5 uses the new ChatML format as its chat template, which includes the assistant role, but the vocabulary has no token for it (there is one for user), so it gets split into two tokens (ass + istant).

Other problems, such as use_fast producing different outputs and add-bos being enabled by default in the tokenizer config, have also been reported in other issues.
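
Both points above can be checked directly from the tokenizer; a sketch, with the reported values noted in comments rather than verified here:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

# Base vocabulary vs. total size including added special tokens.
print(tokenizer.vocab_size)   # base vocabulary (reported as 64000)
print(len(tokenizer))         # total including added tokens (reported as 64003)

# How the role names used in the ChatML template are split.
print(tokenizer.tokenize("assistant"))  # reportedly splits into two pieces
print(tokenizer.tokenize("user"))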

test

test github issue feeding
