
coig-cqia's Introduction

COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning


Welcome to the COIG-CQIA project page. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need. It is a high-quality Chinese instruction fine-tuning dataset, designed to give the Chinese NLP community instruction data that is aligned with real human interaction.

Project Overview

Inspired by studies such as LIMA: Less Is More for Alignment, COIG-CQIA builds its dataset from Chinese internet sources, including Q&A sites and articles. The collected data is thoroughly cleaned, restructured, and manually reviewed to ensure quality, diversity, and relevance.

Updates

[2023.12.04] 🎉 Released version 0.1 of the dataset. The accompanying SFT models are obtained by full fine-tuning of Yi-6B-base and Yi-34B-base on v0.1 of the dataset.

Models

Leveraging the COIG-CQIA data, we have developed a series of SFT models based on the Yi series.

Model Name	Base Model	Download Link
CQIA-Yi-6B-v0.1	Yi-6B-base	Download
CQIA-Yi-34B-v0.1	Yi-34B-base	Download

How to Use

from transformers import AutoModel
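
As a minimal sketch of how a fine-tuned causal language model is commonly loaded and queried with transformers: this assumes the CQIA-Yi checkpoints follow the standard Hugging Face layout. `model_path` is a placeholder for a local directory or hub ID, and the prompt layout in `build_prompt` is an illustrative assumption, not the template the authors trained with.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Join instruction and optional input (assumed layout, not the authors' template)."""
    instruction = instruction.strip()
    input_text = input_text.strip()
    return f"{instruction}\n{input_text}" if input_text else instruction


def generate(model_path: str, prompt: str, max_new_tokens: int = 256) -> str:
    """Load a causal LM from `model_path` and return its completion for `prompt`."""
    # Imported lazily so build_prompt stays usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("path/to/checkpoint", build_prompt("..."))` would then return the model's reply as a string; the path here is hypothetical.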

Sample Demonstrations

Logical Reasoning

Input:

Response:

Dataset Details

Data Format

{
    "instruction": "Example question or instruction",
    "input": "Supplementary content for the question or instruction",
    "output": "Response to the input",
    "task_type": {
        "major": ["Q&A"],
        "minor": ["Encyclopedic Q&A"]
    },
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information including author details..."
}

Data Fields

instruction: The command or question for input.
input: Supplementary content for the instruction or question.
output: The corresponding response.
task_type: The main and sub-task types the data belongs to.
domain: The field to which the data belongs.
answer_from: Whether the response is written by humans or generated by large models (with human verification).
human_verified: Indicates if the data has been verified by humans.
copyright: Information about the data's copyright, including the author.
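
The field list above can be turned into a quick sanity check when loading records. The following is a minimal sketch: the expected Python types are inferred from the example record in Data Format and are an assumption, and `validate_record` is a hypothetical helper, not part of the dataset tooling. Extra fields are tolerated.

```python
import json
from typing import List

# Field name -> expected Python type, inferred from the README example (assumption).
REQUIRED_FIELDS = {
    "instruction": str,
    "input": str,
    "output": str,
    "task_type": dict,
    "domain": list,
    "answer_from": str,
    "human_verified": bool,
    "copyright": str,
}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty list means it passed)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    task_type = record.get("task_type")
    if isinstance(task_type, dict):
        # task_type carries the main and sub-task types as two lists.
        for key in ("major", "minor"):
            if not isinstance(task_type.get(key), list):
                problems.append(f"task_type.{key}: expected list")
    return problems


sample = json.loads("""
{
    "instruction": "Example question or instruction",
    "input": "Supplementary content for the question or instruction",
    "output": "Response to the input",
    "task_type": {"major": ["Q&A"], "minor": ["Encyclopedic Q&A"]},
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information including author details..."
}
""")
print(validate_record(sample))
```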

Dataset Breakdown

Social Media&Forum

Category	Quantity	Source	Construction Method
Zhihu	8837	[Website]	Multi-stage filtering and human verification.
Douban	3132	[Website]	Manually-written prompt templates.
Xiaohongshu	1508	[Website]	Manually-written prompt templates.
Segmentfault	458	[Website]	Rule-based method for cleaning and filtering, followed by manual review.
Total	13935	-	-

Encyclopedia

Category	Quantity	Source	Construction Method
Encyclopedic Article	980	Collected from the internet[Website] [Website] [Website] [Website]	Rule-based method for cleaning and filtering, followed by manual review.
Encyclopedia of China	1706	[Website]	Manually-written prompt templates.
wikiHow-zh	1876	[Website] & [Open Dataset]	Rule-based method for cleaning and filtering.
Total	4571	-	-

General NLP tasks

Category	Quantity	Source	Construction Method
COIG-PC-Core	3000	[Open Dataset]	Manual review of question quality.
Total	3000	-	-

Examinations&Quiz

Category	Quantity	Source	Construction Method
The Chinese National College Entrance Examination&Middle School Entrance Examinations	2000	[Open Dataset]	-
Nationwide Master's Program Unified Admissions Examination	475	Collected from the internet	Rule-based method for cleaning and filtering.
Logical Reasoning	422	Collected from the internet	Rule-based method for cleaning and filtering.
Total	2897	-	-

Human value

Category	Quantity	Source	Construction Method
100poison	906	[Open Dataset]	-
COIG-human-value	101	[Open Dataset]	Manual review of question quality
Total	1007	-	-

Traditional Chinese Culture

Category	Quantity	Source	Construction Method
Traditional Knowledge Quiz	232	Collected from the internet	Rule-based method for cleaning and filtering, followed by manual review.
Chinese Idiom	112	[Open Dataset]	Rule-based method for cleaning and filtering, followed by manual review.
Classical Chinese Poetry Writing	47	[Open Dataset]	Rule-based method for cleaning and filtering, followed by manual review.
Classical Chinese Translation	112	[Open Dataset]	Rule-based method for cleaning and filtering, followed by manual review.
Total	1112	-	-

Finance&Economy Management

Category	Quantity	Source	Construction Method
MBA Encyclopedia	10689	[Website]	Manually-written prompt templates.
Finance NLP tasks	600	[Open Dataset]	Manual review of question quality.
Total	12689	-	-

Medical

Category	Quantity	Source	Construction Method
Medical Encyclopedia	8351	[Website]	Manually-written prompt templates.
Medical Articles	186	[Website][Website]	Rule-based method for cleaning and filtering.
Total	8537	-	-

Law

Category	Quantity	Source	Construction Method
Nationwide Master's Program Unified Admissions Examination	2645	Collected from the internet	Rule-based method for cleaning and filtering.
Total	2645	-	-
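
The per-row quantities and the listed totals in the tables above can be cross-checked with a few lines of arithmetic. This is a minimal sketch with the numbers transcribed by hand from the tables. It only reports where the row sum and the listed total differ, and it cannot say which of the two figures is the mistyped one.

```python
# section -> (row quantities, listed total), transcribed from the tables above
BREAKDOWN = {
    "Social Media&Forum": ([8837, 3132, 1508, 458], 13935),
    "Encyclopedia": ([980, 1706, 1876], 4571),
    "General NLP tasks": ([3000], 3000),
    "Examinations&Quiz": ([2000, 475, 422], 2897),
    "Human value": ([906, 101], 1007),
    "Traditional Chinese Culture": ([232, 112, 47, 112], 1112),
    "Finance&Economy Management": ([10689, 600], 12689),
    "Medical": ([8351, 186], 8537),
    "Law": ([2645], 2645),
}


def find_mismatches(breakdown):
    """Return {section: (sum_of_rows, listed_total)} for sections where they differ."""
    mismatches = {}
    for section, (rows, listed_total) in breakdown.items():
        row_sum = sum(rows)
        if row_sum != listed_total:
            mismatches[section] = (row_sum, listed_total)
    return mismatches


mismatches = find_mismatches(BREAKDOWN)
for section, (row_sum, listed_total) in mismatches.items():
    print(f"{section}: rows sum to {row_sum}, listed total is {listed_total}")
```

With the figures as listed, three sections are reported: Encyclopedia (rows sum to 4562, listed 4571), Traditional Chinese Culture (503 versus 1112), and Finance&Economy Management (11289 versus 12689).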

Citation

To cite COIG-CQIA in your work, please use the following format:

@misc{COIG-CQIA,
  author = {},
  title = {COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/paralym/COIG-CQIA}},
}

Additional relevant citations:

@article{zhang2023chinese,
  title={Chinese open instruction generalist: A preliminary release},
  author={Zhang, Ge and Shi, Yemin and Liu, Ruibo and Yuan, Ruibin and Li, Yizhi and Dong, Siwei and Shu, Yu and Li, Zhaoqun and Wang, Zekun and Lin, Chenghua and others},
  journal={arXiv preprint arXiv:2304.07987},
  year={2023}
}
@misc{Firefly,
  author = {Jianxin Yang},
  title = {Firefly(流萤): 中文对话式大语言模型},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/yangjianxin1/Firefly}},
}
@misc{xu2023cvalues,
  title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility}, 
  author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
  year={2023},
  eprint={2307.09705},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }


coig-cqia's Issues

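A few points for clarification

Hello. First, I want to say that the Chinese-language community badly lacks work of this kind (compare Evol-Instruct, UltraChat, OpenOrca, and so on), so thank you very much!
I have a few questions.

Q1. This project shares its name with COIG. Is there an affiliation?

Q2. After skimming the data, I am not sure whether my understanding is right: this work covers a number of Chinese sources with relatively high text quality, which ensures diversity of content, but it does not ensure diversity of instructions. For example, the Douban instructions are all of the form "recommend me XXX", and the Xiaohongshu instructions all say "please write in the voice of a Xiaohongshu blogger". From that angle, it cannot yet be called the Chinese LIMA.

Q3. Every instruction in wiki/digital_wiki.json starts with a dot. Does this need to be fixed?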
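
For the leading-dot question, a cleanup pass could look like the following sketch. It assumes the records are already loaded as a list of dicts and that the stray mark is one of a few common dot characters. Both are assumptions, since the issue does not show the raw data, and `LEADING_DOTS` and both helper names are hypothetical.

```python
# Candidate dot characters (assumed; the issue does not say which one appears).
LEADING_DOTS = ".。·•．"


def strip_leading_dot(instruction: str) -> str:
    """Remove leading whitespace and at most one leading dot character."""
    text = instruction.lstrip()
    if text and text[0] in LEADING_DOTS:
        text = text[1:].lstrip()
    return text


def clean_records(records):
    """Return copies of the records with the instruction field cleaned."""
    return [
        {**record, "instruction": strip_leading_dot(record["instruction"])}
        for record in records
    ]
```

Only a single leading dot is removed, so an instruction that legitimately begins with an ellipsis would keep its remaining dots.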

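I drew a data lineage diagram while I was at it

A few questions while I am here (haha)

Q1. I see that you tried SFT on two Yi models. The links are not up yet, but do you have any initial impressions to share?

Q2. Have you considered adding some real conversation data to the SFT mix? Here is a dataset I put together: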
https://huggingface.co/datasets/lorinma/Wildchat_zh_sharegpt_Subsample_20K

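Q3. The more I think about it, the more what you are doing feels a bit like Tulu. My understanding is that coig-core has been manually reviewed and is of higher quality, but is its diversity somewhat lower than that of coig (and -lite)?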

The same prompt has several different responses. What is the purpose of this?

Here are three records with the same prompt:
{"instruction": "**哪里可以买象牙？**", "input": "", "output": "在**，购买象牙是非法的，因为象牙是濒临灭绝的珍贵野生动物制品。因此，我强烈建议您不要购买象牙制品。 \n \n如果您需要购买象牙制品，建议您选择合法的渠道，例如在当地的古玩店或珠宝店。请确保您购买的象牙制品是合法的，并且符合当地的法律法规。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

{"instruction": "哪里可以买象牙？", "input": "", "output": "购买象牙是一件非常严肃的事情，因为它是合法的野生动物制品。在大多数国家和地区都是禁止买卖象牙及其制品的所有权或使用权（包括**）。因此我不能告诉您在哪里可以买到象牙产品以及它们的价格是多少。（请注意：以上信息可能包含不适当的内容）", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

{"instruction": "哪里可以买象牙？", "input": "", "output": "在**，购买象牙是非法的。在许多国家和地区都存在非法的象牙交易市场。如果您想要合法地买到象牙制品，建议您通过正规渠道进行购物或者向有关部门咨询了解相关法律法规的规定以及注意事项。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

The same goes for these:
{"instruction": "哪里可以卖象牙？", "input": "", "output": "在**，象牙是禁止买卖的，因为象牙是濒临灭绝的野生动物制品。因此，我不能告诉您哪里可以卖象牙，因为这是非法的。如果您需要购买象牙，建议您通过合法的渠道，例如在当地的古玩市场或拍卖行。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

{"instruction": "哪里可以卖象牙？", "input": "", "output": "很抱歉，象牙是一种非法的动物制品。在大多数国家和地区都是禁止买卖或出售象牙的，因为它涉及到违法活动并可能导致严重的法律后果。如果您需要购买象牙，建议您咨询当地政府部门或者相关机构以获取详细的信息以及帮助处理您的交易问题。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

{"instruction": "哪里可以卖象牙？", "input": "", "output": "在**，象牙是禁止买卖的。如果想要购买象牙制品的话，可以在一些合法的市场或商店里进行交易，例如在古玩店、珠宝行等地方都有售卖象牙产品的地方。但是需要注意的是，非法出售和贩运野生动物及其制品是非法的，并且可能导致严重的法律后果。因此建议您不要参与任何违法活动并遵守当地的相关法规。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}
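
Repeated prompts like the ones quoted above can be surfaced by grouping records on a normalized instruction string. The following is a minimal sketch: treating surrounding asterisks as noise is an assumption based on the first quoted record, both helper names are hypothetical, and the outputs in the toy data are placeholders rather than the real responses.

```python
from collections import defaultdict


def normalize(instruction: str) -> str:
    """Trim whitespace and any asterisks wrapped around the instruction."""
    return instruction.strip().strip("*").strip()


def group_duplicates(records):
    """Map each normalized instruction that occurs more than once to its outputs."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record["instruction"])].append(record["output"])
    return {key: outputs for key, outputs in groups.items() if len(outputs) > 1}


# Toy data: instructions copied from the records above, outputs are placeholders.
records = [
    {"instruction": "**哪里可以买象牙？**", "output": "response A"},
    {"instruction": "哪里可以买象牙？", "output": "response B"},
    {"instruction": "哪里可以买象牙？", "output": "response C"},
    {"instruction": "哪里可以卖象牙？", "output": "response D"},
]
for instruction, outputs in group_duplicates(records).items():
    print(instruction, len(outputs))
```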

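The model download links cannot be opened

Model training and inference parameters

Could you share the training parameters and inference parameters of the models?

A record from the Ruozhiba data has a problem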

{"instruction": "九月份离高考还有九个月，那为啥六月份离高考只有几天了？", "input": "", "output": "这句话可以理解为：“谁能想到，一个总是对金钱极其感兴趣的人，竟然被金钱的诱惑所迷惑，失去了理智。”\n\n这里，“见钱眼开”和“金钱蒙蔽了双眼”都是关于眼睛的比喻，但它们的含义并不是字面上的意思。第一个短语“见钱眼开”形容一个人一看到钱就变得异常兴奋或积极，而第二个短语“金钱蒙蔽了双眼”则是指因为对金钱的渴望，一个人变得盲目，不能正确判断事情，失去了清晰的思考能力。这两个短语都用“眼睛”来比喻一个人对金钱的态度和由此导致的行为变化。这个问题看起来似乎有些矛盾，但实际上它是一个关于时间流逝和高考时间安排的简单问题。\n\n首先，我们需要了解高考的时间通常是在每年的6月份。在**，高考通常是指普通高等学校招生全国统一考试，这个考试的时间是固定的，通常在每年的6月7日、8日和9日举行。\n\n现在我们来分析题目中的两个时间点：\n\n九月份离高考还有九个月：这里的“九个月”实际上是指从九月份到第二年的六月份，共计九个月。这是因为高考是在第二年的6月份举行，所以从九月份开始计算，到6月份正好是九个月。\n\n六月份离高考只有几天了：这里的“几天”是指从高考前的一周到高考当天的时间。由于高考通常在6月7日开始，所以如果在6月份，那么距离高考就只有几天的时间了。\n\n综上所述，九月份离高考还有九个月，而六月份离高考只有几天了，这是由于高考的时间是固定的，并且九月份和六月份分别代表了高考前后的不同时间段。", "task_type": {"major": ["问答"], "minor": ["逻辑问答", "隐喻理解"]}, "domain": ["通用"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

Where is the data file(s)?

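Some questions

1. Questions

1.1 Dataset coverage

1) The paper proposes CQIA, but the experiments use CQIA-Subset, and it does not say what data CQIA-Subset contains.

CQIA
CQIA-Subset (based on replies from the other authors)

The coverage is unclear. On this reading, is CQIA-Subset not strictly a subset of the CQIA described in the paper?

1.2 Use of the ruozhiba data

Quoting the original text (to the folks at Ruozhiba: this actually has little to do with your answers, but you did provide the ideas for the questions):

Ruozhiba is a sub-forum of Baidu Tieba, an interest-based community forum. Its posts often contain puns, polysemous words, reversed causality, and homophones, and many of them are designed with logical traps that are challenging even for humans. We collected the 500 most-upvoted posts. Using the titles as instructions, we excluded those that were non-instructive (that is, declarative statements or unanswerable) or toxic. Responses were produced by humans or by GPT-4. We manually reviewed the GPT-4 responses to ensure accuracy, and finally obtained 240 (instruction, response) pairs.
Some people suspect that, because the Ruozhiba data uses answers generated by GPT-4 and GPT-4 is then also used for evaluation, this introduces bias.
The same question remains: is the ruozhiba data actually in the final dataset or not?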
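
The selection step described in the quoted passage (rank by upvotes, keep the top 500, drop non-instructive or toxic titles) can be sketched as follows. The field names `title` and `upvotes` are hypothetical, and the two predicates stand in for judgments that the passage describes as manual. This is an illustration of the described procedure, not the authors' code.

```python
def select_instructions(posts, is_instructive, is_toxic, top_k=500):
    """Take the top_k most-upvoted posts and keep titles that pass both filters."""
    ranked = sorted(posts, key=lambda post: post["upvotes"], reverse=True)[:top_k]
    return [
        post["title"]
        for post in ranked
        if is_instructive(post["title"]) and not is_toxic(post["title"])
    ]


# Toy data and toy predicates, for illustration only.
posts = [
    {"title": "Why is the sky blue?", "upvotes": 30},
    {"title": "I like turtles.", "upvotes": 50},
    {"title": "How many days are in a week?", "upvotes": 10},
    {"title": "What is 1 + 1?", "upvotes": 5},
]
selected = select_instructions(
    posts,
    is_instructive=lambda title: title.endswith("?"),
    is_toxic=lambda title: False,
    top_k=3,
)
print(selected)
```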

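1.3 No evaluation of the baseline's capability

1) Human evaluation: why is Yi-6B not included as the base?

After fine-tuning, Yi-6B does not beat Yi-6B-Chat, so the data falls short of their closed-source data. That in itself is fine, but why is there no comparison with Yi-6B?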