hikariming / chat-dataset-baseline
A human-curated Chinese dialogue dataset, plus ChatGLM fine-tuning code
How can this situation be resolved?
https://github.com/zhenbench/z-bench
I think it's decent.
It would definitely be best if the two could be merged (or collaborate).
Question ❓
Thanks for your work! My question: does the "obvious change" mentioned here mean that the fine-tuning results were good? How was that comparison done — fine-tuning on your own dataset and then inspecting the output? Did all the other code perform poorly?
Hello, I'd like to claim part of this dataset for translation — how is the work divided? Or has the translation already been completed?
Hi, I'm very interested in this direction. For example, when deploying in a company, the model needs to understand a whole system's concepts, and langchain + vector retrieval can't capture that larger context. Looking forward to your paper — do you have any recommended papers or resources in this area? Many thanks.
I recently subscribed to Colab Pro hoping to study and reproduce alpaca-lora, but for fear of running out of compute units I'm still just reading the code of the various projects.
Could you say roughly how many compute units are needed, and how large the dataset is?
{
"instruction": "从给定列表中选择一种颜色,并描述它如何用于创造一个舒适的房间氛围。",
"input": "黄色",
"output": "黄色是一种温暖和愉快的颜色,可以用来创造一个舒适的房间氛围。通过使用浅黄色的墙壁和装饰品,可以给人一种舒适和快乐的感觉。柔和的灯光会让房间感到温馨,黄色的暖色调则会增添明亮、阳光般的气氛。"
},
For data like this that has an input field, would it be better to merge the input with the instruction to form the question?
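Many Alpaca-derived training scripts do fold the two fields into one prompt via a template rather than treating them separately; a minimal sketch of that formatting (the template wording follows the common Alpaca style, not necessarily this repo's):

```python
def build_prompt(example: dict) -> str:
    """Fold instruction and optional input into a single prompt string,
    following the Alpaca-style template."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    )

sample = {"instruction": "从给定列表中选择一种颜色,并描述它如何用于创造一个舒适的房间氛围。",
          "input": "黄色"}
print(build_prompt(sample))
```

With this scheme the model only ever sees one concatenated question, so the input/instruction distinction is purely a formatting convention in the data file.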
What parameter settings work well here?
Hi, how can the fine-tuned model be deployed with fastapi, just like the official setup? How should the api.py script be modified?
The official instructions are as follows:
First install the extra dependencies with pip install fastapi uvicorn, then run api.py from the repository:
python api.py
By default the service runs on local port 8000 and is called via POST:
curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'
There is a comment about "修改train_util.py的超参数" (adjusting the hyperparameters in train_util.py) in 微调使用自己数据集成功方案.ipynb.
Should it instead read "修改data_utils.py的超参数" (data_utils.py), to match the chatglm_finetuning repository?
Thanks to the contributors for their generous work! I browsed the downloaded dataset today and found two small issues. Maybe a dedicated thread where people can report errors from the English-to-Chinese conversion would help improve quality. :p
alpaca_data-6382-9407-(jlh)中文-已完成.json
{
"instruction": "将给定的字符串更改为大写。",
"input": "你好世界。",
"output": "你好世界。"
},
{
"instruction": "找出给定单词的押韵词。",
"input": "有",
"output": "垫子、老鼠、蝙蝠、床、村庄、聊天。"
},
First issue: case conversion only applies to English — Chinese has no such transformation, right?
Second issue: the rhyming words presumably rhymed in the English original but no longer rhyme after translation into Chinese. For the input 有, the output should be words that rhyme with "ou".
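Issues like these (case conversion, rhyming) come from tasks that are bound to English and don't survive translation. One way to surface candidates for manual review is a simple keyword filter over the dataset — the keyword list below is purely an illustrative guess, not an exhaustive rule:

```python
# Keywords hinting at English-specific tasks; purely illustrative.
SUSPECT_KEYWORDS = ["大写", "小写", "押韵", "字母", "拼写"]

def flag_language_bound(examples):
    """Return examples whose instruction mentions a task that may not
    make sense after English -> Chinese translation."""
    return [ex for ex in examples
            if any(k in ex.get("instruction", "") for k in SUSPECT_KEYWORDS)]

data = [
    {"instruction": "将给定的字符串更改为大写。", "input": "你好世界。"},
    {"instruction": "写一首关于春天的诗。", "input": ""},
]
print(flag_language_bound(data))
```

Flagged items could then be dropped or rewritten into genuinely Chinese tasks (e.g. replacing English rhymes with "ou"-rhyme words) rather than translated literally.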
The loss never goes down — how should the hyperparameters be tuned?
How are datasets for dialogue models usually collected? Is there any data annotation software you'd recommend?
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpg1hbjeku
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpg1hbjeku/_remote_module_non_scriptable.py
INFO:lightning_fabric.utilities.seed:Global seed set to 42
Traceback (most recent call last):
File "/home/cike/zzp/alpaca/chatglm_finetuning/data_utils.py", line 272, in <module>
tokenizer, config, _, _ = dataHelper.load_tokenizer_and_config(tokenizer_class_name=ChatGLMTokenizer,config_class_name=ChatGLMConfig)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_helper.py", line 257, in load_tokenizer_and_config
tokenizer = load_tokenizer(tokenizer_name=tokenizer_name or model_args.tokenizer_name,
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/deep_training/data_helper/data_module.py", line 29, in load_tokenizer
tokenizer = class_name.from_pretrained(tokenizer_name, **tokenizer_kwargs)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
return cls._from_pretrained(
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 211, in __init__
self.sp_tokenizer = SPTokenizer(vocab_file)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 32, in __init__
self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 65, in _build_text_tokenizer
self._configure_tokenizer(
File "/home/cike/zzp/alpaca/chatglm_finetuning/tokenization_chatglm.py", line 61, in _configure_tokenizer
text_tokenizer.refresh()
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/icetk/text_tokenizer.py", line 31, in refresh
self.sp.Load(model_proto=self.proto.SerializeToString())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 904, in Load
return self.LoadFromSerializedProto(model_proto)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/site-packages/sentencepiece/__init__.py", line 250, in LoadFromSerializedProto
return _sentencepiece.SentencePieceProcessor_LoadFromSerializedProto(self, serialized)
RuntimeError: Internal: [MASK] is already defined.
Hello, I fine-tuned ChatGLM on 8,900 single-turn chat samples and the forgetting is severe. With many epochs, every answer drifts toward the fine-tuning domain; with few epochs, the model doesn't learn the fine-tuning data. How can this be resolved?
As the title says — looking for advice.
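One common mitigation for this trade-off (a general recipe, not something this repository prescribes) is to replay general-purpose instruction data alongside the domain data, so the model keeps seeing the original distribution while it learns the new one. A sketch of the mixing step:

```python
import random

def mix_datasets(domain_data, general_data, general_ratio=0.5, seed=42):
    """Blend domain samples with replayed general samples.
    general_ratio is the fraction of general data relative to domain size."""
    rng = random.Random(seed)
    n_general = int(len(domain_data) * general_ratio)
    replay = rng.sample(general_data, min(n_general, len(general_data)))
    mixed = list(domain_data) + replay
    rng.shuffle(mixed)
    return mixed

# Illustrative sizes matching the question: 8,900 domain samples blended
# with half as many replayed general samples.
domain = [{"q": f"domain-{i}"} for i in range(8900)]
general = [{"q": f"general-{i}"} for i in range(50000)]
mixed = mix_datasets(domain, general, general_ratio=0.5)
print(len(mixed))
```

The ratio is a tunable knob: more replay preserves general ability at the cost of slower domain adaptation; lowering the learning rate or using LoRA-style adapters are complementary options.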
Having read the README: if the official dataset gets machine translation plus human review, its quality will be very high, and combined with data from chatglm, 文心一言, and chatgpt it would really take off. Thanks for the hard work open-sourcing this — really looking forward to it. Keep it up!
Other than asking "你是谁?" or questions specific to the dataset, is there any quantitative way to evaluate whether the trained model has improved or regressed?
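In the absence of a proper benchmark, one lightweight option is to hold out a slice of the dataset and score generated answers against the references automatically. The character-overlap F1 below is a crude stand-in for BLEU/ROUGE, chosen only for illustration:

```python
from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1 between a model answer and the reference."""
    common = Counter(prediction) & Counter(reference)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Score a (hypothetical) model over a held-out slice; the prediction here
# is a placeholder for the fine-tuned model's actual output.
heldout = [{"prompt": "你好", "reference": "你好,有什么可以帮你?"}]
predictions = ["你好,请问有什么可以帮助你的?"]
scores = [char_f1(p, ex["reference"]) for p, ex in zip(predictions, heldout)]
print(sum(scores) / len(scores))
```

Tracking this average before and after fine-tuning (on both the domain slice and a general slice) gives at least a directional signal for improvement versus regression.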
I've also finished training and would like to know how to gauge the results.
Could you share the GPU memory usage?
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/cike/anaconda/envs/alpaca/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 112 column 1 (char 11779)
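The error message points at line 112, column 1 of whichever JSON file was being loaded; a quick way to narrow such failures down is to parse the file directly and print the lines around the reported position (the helper is generic, not tied to the repo's loader):

```python
import json

def check_json(path: str):
    """Try to parse a JSON file; on failure, show the lines around the error."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        json.loads(text)
        print("OK")
    except json.JSONDecodeError as e:
        lines = text.splitlines()
        start = max(e.lineno - 3, 0)
        for i, line in enumerate(lines[start:e.lineno + 2], start + 1):
            print(f"{i:4d}: {line}")
        raise
```

Typical culprits at "Expecting value" are a trailing comma before a closing bracket or a truncated download; the printed context usually makes the fix obvious.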
Hello, while fine-tuning with your code I found that at the final model-loading step, when LoraArguments reads /best_ckpt/config.json, I still get "ValueError: Can't find config.json at './best_ckpt/'" even though config.json exists in that directory:
lora_args = LoraArguments.from_pretrained('./best_ckpt/')
ValueError: Can't find config.json at './best_ckpt/'
I don't know what causes this. The contents of config.json follow — have you run into this problem, or do you know what might cause it? Looking forward to your reply.
{
"architectures": [
"ChatGLMModel"
],
"auto_map": {
"AutoConfig": "configuration_chatglm.ChatGLMConfig",
"AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
"AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
},
"bos_token_id": 150004,
"eos_token_id": 150005,
"hidden_size": 4096,
"initializer_range": 0.02,
"initializer_weight": false,
"inner_hidden_size": 16384,
"layernorm_epsilon": 1e-05,
"max_sequence_length": 2048,
"model_type": "chatglm",
"num_attention_heads": 32,
"num_layers": 28,
"pad_token_id": 20003,
"position_encoding_2d": true,
"pre_seq_len": null,
"precision": 16,
"prefix_projection": false,
"quantization_bit": 0,
"return_dict": false,
"task_specific_params": {
"learning_rate": 2e-05,
"learning_rate_for_task": 2e-05
},
"torch_dtype": "float16",
"transformers_version": "4.27.4",
"use_cache": true,
"vocab_size": 150528
}
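For what it's worth, the config.json shown above parses fine, so one plausible cause (an assumption, not a diagnosis) is path resolution: './best_ckpt/' is interpreted relative to the current working directory, not the script's location. A quick pre-flight check, with a hypothetical helper name:

```python
import json
import os

def assert_config_readable(ckpt_dir: str) -> dict:
    """Verify that ckpt_dir/config.json exists relative to the CWD and parses."""
    path = os.path.join(os.path.abspath(ckpt_dir), "config.json")
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"no config.json under {os.path.abspath(ckpt_dir)} "
            f"(cwd={os.getcwd()})")
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

If the check fails, running the script from the project root (or passing an absolute path to LoraArguments.from_pretrained) would be the thing to try first.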
After fine-tuning on 100,000 multi-turn domain dialogues, the model can recall those 100,000 samples but answers other questions poorly — how can this be fixed?
Hi, I'd like to ask: how do you design instructions for unsupervised data? Doesn't feeding in unsupervised data directly break the prompt-following behavior?
Can't it just be translated with ChatGPT?