coig-cqia's Introduction

COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning

Welcome to the COIG-CQIA project page. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need. It is a high-quality Chinese instruction fine-tuning dataset, designed to give the Chinese NLP community instruction data that is aligned with real human interaction.

Project Overview

Inspired by studies such as LIMA: Less Is More for Alignment, COIG-CQIA builds its dataset from Chinese internet sources, including Q&A sites and articles. The collected data is thoroughly cleaned, restructured, and manually reviewed to ensure quality, diversity, and relevance.

Updates

  • [2023.12.04] 🎉 Released version 0.1 of the dataset. The accompanying SFT models are full fine-tunes of Yi-6B-base and Yi-34B-base on v0.1 of the dataset.

Models

Leveraging the COIG-CQIA data, we have developed a series of SFT models based on the Yi series.

| Model Name | Base Model | Download Link |
| --- | --- | --- |
| CQIA-Yi-6B-v0.1 | Yi-6B-base | Download |
| CQIA-Yi-34B-v0.1 | Yi-34B-base | Download |

How to Use

from transformers import AutoModel

Sample Demonstrations

Logical Reasoning

Input:

Response:

Dataset Details

Data Format

{
    "instruction": "Example question or instruction",
    "input": "Supplementary content for the question or instruction",
    "output": "Response to the input",
    "task_type": {
        "major": ["Q&A"],
        "minor": ["Encyclopedic Q&A"]
    },
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information including author details..."
}

Data Fields

  • instruction: The instruction or question posed to the model.
  • input: Supplementary content for the instruction or question.
  • output: The corresponding response.
  • task_type: The major and minor task types the record belongs to.
  • domain: The subject domains the record belongs to.
  • answer_from: Whether the response is written by humans or generated by large models (with human verification).
  • human_verified: Indicates whether the record has been verified by humans.
  • copyright: Information about the data's copyright, including the author.
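
The fields above can be checked mechanically before training. The following is a minimal sketch, assuming each record has been parsed into a Python dict. The helper name `validate_record` and the expected types are assumptions inferred from the example record, not part of the dataset's own tooling.

```python
import json

# Expected top-level fields and their Python types, inferred from the
# example record above (an assumption, not an official schema).
EXPECTED_FIELDS = {
    "instruction": str,
    "input": str,
    "output": str,
    "task_type": dict,
    "domain": list,
    "answer_from": str,
    "human_verified": bool,
    "copyright": str,
}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    # task_type should carry both a "major" and a "minor" list.
    task_type = record.get("task_type")
    if isinstance(task_type, dict):
        for key in ("major", "minor"):
            if not isinstance(task_type.get(key), list):
                problems.append(f"task_type.{key} should be a list")
    return problems


example = json.loads("""
{
    "instruction": "Example question or instruction",
    "input": "Supplementary content for the question or instruction",
    "output": "Response to the input",
    "task_type": {"major": ["Q&A"], "minor": ["Encyclopedic Q&A"]},
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information including author details..."
}
""")

print(validate_record(example))  # []
print(validate_record({"instruction": "only this"}))  # seven "missing field" entries
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole file in a single pass.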

Dataset Breakdown

Social Media & Forum

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Zhihu | 8837 | [Website] | Multi-stage filtering and human verification. |
| Douban | 3132 | [Website] | Manually-written prompt templates. |
| Xiaohongshu | 1508 | [Website] | Manually-written prompt templates. |
| Segmentfault | 458 | [Website] | Rule-based method for cleaning and filtering, followed by manual review. |
| Total | 13935 | - | - |

Encyclopedia

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Encyclopedic Article | 980 | Collected from the internet [Website] [Website] [Website] [Website] | Rule-based method for cleaning and filtering, followed by manual review. |
| Encyclopedia of China | 1706 | [Website] | Manually-written prompt templates. |
| wikiHow-zh | 1876 | [Website] & [Open Dataset] | Rule-based method for cleaning and filtering. |
| Total | 4571 | - | - |

General NLP tasks

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| COIG-PC-Core | 3000 | [Open Dataset] | Manual review of question quality. |
| Total | 3000 | - | - |

Examinations & Quiz

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| The Chinese National College Entrance Examination & Middle School Entrance Examinations | 2000 | [Open Dataset] | - |
| Nationwide Master's Program Unified Admissions Examination | 475 | Collected from the internet | Rule-based method for cleaning and filtering. |
| Logical Reasoning | 422 | Collected from the internet | Rule-based method for cleaning and filtering. |
| Total | 2897 | - | - |

Human value

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| 100poison | 906 | [Open Dataset] | - |
| COIG-human-value | 101 | [Open Dataset] | Manual review of question quality. |
| Total | 1007 | - | - |

Traditional Chinese Culture

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Traditional Knowledge Quiz | 232 | Collected from the internet | Rule-based method for cleaning and filtering, followed by manual review. |
| Chinese Idiom | 112 | [Open Dataset] | Rule-based method for cleaning and filtering, followed by manual review. |
| Classical Chinese Poetry Writing | 47 | [Open Dataset] | Rule-based method for cleaning and filtering, followed by manual review. |
| Classical Chinese Translation | 112 | [Open Dataset] | Rule-based method for cleaning and filtering, followed by manual review. |
| Total | 1112 | - | - |

Finance & Economy Management

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| MBA Encyclopedia | 10689 | [Website] | Manually-written prompt templates. |
| Finance NLP tasks | 600 | [Open Dataset] | Manual review of question quality. |
| Total | 12689 | - | - |

Medical

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Medical Encyclopedia | 8351 | [Website] | Manually-written prompt templates. |
| Medical Articles | 186 | [Website] [Website] | Rule-based method for cleaning and filtering. |
| Total | 8537 | - | - |

Law

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Nationwide Master's Program Unified Admissions Examination | 2645 | Collected from the internet | Rule-based method for cleaning and filtering. |
| Total | 2645 | - | - |
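
As a quick consistency check, the per-row quantities can be summed and compared with each stated total. The sketch below copies the numbers from the breakdown above; the function name `check_totals` is illustrative. Run against these figures, it flags Encyclopedia, Traditional Chinese Culture, and Finance & Economy Management, where the listed rows do not add up to the stated total. Whether the rows or the totals are at fault cannot be determined from this page.

```python
# Per-category row quantities and stated totals, copied from the
# dataset breakdown above.
BREAKDOWN = {
    "Social Media & Forum": ([8837, 3132, 1508, 458], 13935),
    "Encyclopedia": ([980, 1706, 1876], 4571),
    "General NLP tasks": ([3000], 3000),
    "Examinations & Quiz": ([2000, 475, 422], 2897),
    "Human value": ([906, 101], 1007),
    "Traditional Chinese Culture": ([232, 112, 47, 112], 1112),
    "Finance & Economy Management": ([10689, 600], 12689),
    "Medical": ([8351, 186], 8537),
    "Law": ([2645], 2645),
}


def check_totals(breakdown):
    """Return {category: (row_sum, stated_total)} where the two differ."""
    return {
        name: (sum(rows), stated)
        for name, (rows, stated) in breakdown.items()
        if sum(rows) != stated
    }


mismatches = check_totals(BREAKDOWN)
for name, (row_sum, stated) in mismatches.items():
    print(f"{name}: rows sum to {row_sum}, table states {stated}")
```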

Citation

To cite COIG-CQIA in your work, please use the following format:

@misc{COIG-CQIA,
  author = {},
  title = {COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/paralym/COIG-CQIA}},
}

Additional relevant citations:

@article{zhang2023chinese,
  title={Chinese open instruction generalist: A preliminary release},
  author={Zhang, Ge and Shi, Yemin and Liu, Ruibo and Yuan, Ruibin and Li, Yizhi and Dong, Siwei and Shu, Yu and Li, Zhaoqun and Wang, Zekun and Lin, Chenghua and others},
  journal={arXiv preprint arXiv:2304.07987},
  year={2023}
}
@misc{Firefly,
  author = {Jianxin Yang},
  title = {Firefly(流萤): 中文对话式大语言模型},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/yangjianxin1/Firefly}},
}
@misc{xu2023cvalues,
  title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility}, 
  author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
  year={2023},
  eprint={2307.09705},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

coig-cqia's People

Contributors

paralym, zheng0428

coig-cqia's Issues

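A few points needing clarification

Hello. First, I want to say that the Chinese-language community badly lacks work like this (Evol-Instruct, UltraChat, OpenOrca, etc.), so thank you very much!
I have a few questions.

Q1. This project shares its name with COIG. Is there an affiliation?

Q2. After skimming the data, I am not sure whether I understand this correctly: the work covers a number of sources of relatively high-quality Chinese text, which ensures diversity of content, but it does not ensure diversity of instructions. For example, the Douban entries are all "recommend me XXX", and the Xiaohongshu entries are all "please write in the voice of a Xiaohongshu blogger". From that angle, it cannot yet be called the LIMA of the Chinese-language world.

Q3. Every instruction in wiki/digital_wiki.json starts with a dot. Does this need to be fixed?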
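
If the leading dot mentioned in Q3 is indeed unintended, it could be stripped with a small cleanup pass. This is a sketch only: the issue does not say which dot character is involved, so the characters in `LEADING_DOTS` are a guess, and `strip_leading_dot` and `clean_records` are hypothetical helpers, not part of the repository. The sample record is made up for illustration.

```python
# Candidate "dot" characters. Which one actually appears in the file is
# not stated in the issue, so this set is an assumption.
LEADING_DOTS = ".。·•"


def strip_leading_dot(instruction):
    """Remove one leading dot (and surrounding whitespace) if present."""
    text = instruction.lstrip()
    if text and text[0] in LEADING_DOTS:
        return text[1:].lstrip()
    return instruction


def clean_records(records):
    """Return copies of the records with the instruction field cleaned."""
    return [
        {**record, "instruction": strip_leading_dot(record["instruction"])}
        for record in records
    ]


# Made-up sample record, not taken from the dataset.
sample = [{"instruction": ".What is a solid-state drive?", "output": "..."}]
print(clean_records(sample)[0]["instruction"])  # What is a solid-state drive?
```

Instructions without a leading dot are returned unchanged, so the pass is safe to run more than once.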

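I drew a data lineage diagram while I was at it

[image]

A quick question while I'm here, haha

Q1. I see you tried SFT on two Yi models. The links are not up yet, but do you have any initial impressions to share?

Q2. During SFT, have you considered adding some real conversation data? (Here is one I prepared earlier.)
https://huggingface.co/datasets/lorinma/Wildchat_zh_sharegpt_Subsample_20K

Q3. The more I think about it, the more what you are doing feels a bit like Tulu. My understanding is that coig-core is manually reviewed and of relatively high quality, but isn't its diversity somewhat lower than that of COIG (and -lite)?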

The same prompt has multiple different answers. What is the purpose of this?

Here are three records with the same prompt:
{"instruction": "**哪里可以买象牙?**", "input": "", "output": "在**,购买象牙是非法的,因为象牙是濒临灭绝的珍贵野生动物制品。因此,我强烈建议您不要购买象牙制品。 \n \n如果您需要购买象牙制品,建议您选择合法的渠道,例如在当地的古玩店或珠宝店。请确保您购买的象牙制品是合法的,并且符合当地的法律法规。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

{"instruction": "哪里可以买象牙?", "input": "", "output": "购买象牙是一件非常严肃的事情,因为它是合法的野生动物制品。在大多数国家和地区都是禁止买卖象牙及其制品的所有权或使用权(包括**)。因此我不能告诉您在哪里可以买到象牙产品以及它们的价格是多少。(请注意:以上信息可能包含不适当的内容)", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

{"instruction": "哪里可以买象牙?", "input": "", "output": "在**,购买象牙是非法的。在许多国家和地区都存在非法的象牙交易市场。如果您想要合法地买到象牙制品,建议您通过正规渠道进行购物或者向有关部门咨询了解相关法律法规的规定以及注意事项。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

So do the following:
{"instruction": "哪里可以卖象牙?", "input": "", "output": "在**,象牙是禁止买卖的,因为象牙是濒临灭绝的野生动物制品。因此,我不能告诉您哪里可以卖象牙,因为这是非法的。如果您需要购买象牙,建议您通过合法的渠道,例如在当地的古玩市场或拍卖行。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

{"instruction": "哪里可以卖象牙?", "input": "", "output": "很抱歉,象牙是一种非法的动物制品。在大多数国家和地区都是禁止买卖或出售象牙的,因为它涉及到违法活动并可能导致严重的法律后果。如果您需要购买象牙,建议您咨询当地政府部门或者相关机构以获取详细的信息以及帮助处理您的交易问题。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

{"instruction": "哪里可以卖象牙?", "input": "", "output": "在**,象牙是禁止买卖的。如果想要购买象牙制品的话,可以在一些合法的市场或商店里进行交易,例如在古玩店、珠宝行等地方都有售卖象牙产品的地方。但是需要注意的是,非法出售和贩运野生动物及其制品是非法的,并且可能导致严重的法律后果。因此建议您不要参与任何违法活动并遵守当地的相关法规。", "task_type": {"major": ["文本生成"], "minor": ["问答"]}, "domain": ["人类价值观", "环境科学", "environmental_science"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}
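
Prompts that occur more than once can be found by grouping records on the instruction field. The sketch below assumes the records are already loaded as a list of dicts. `find_repeated_prompts` is an illustrative name, and stripping `*` emphasis markers before comparison is an assumption made so that variants of one prompt group together. The outputs are shortened stand-ins for the records quoted above.

```python
from collections import defaultdict


def normalize(instruction):
    """Strip whitespace and surrounding '*' emphasis markers."""
    return instruction.strip().strip("*").strip()


def find_repeated_prompts(records):
    """Map each instruction that appears more than once to its outputs."""
    outputs_by_prompt = defaultdict(list)
    for record in records:
        key = normalize(record["instruction"])
        outputs_by_prompt[key].append(record["output"])
    return {
        prompt: outputs
        for prompt, outputs in outputs_by_prompt.items()
        if len(outputs) > 1
    }


# Shortened stand-ins for the quoted records; outputs are placeholders.
records = [
    {"instruction": "**哪里可以买象牙?**", "output": "answer A"},
    {"instruction": "哪里可以买象牙?", "output": "answer B"},
    {"instruction": "哪里可以买象牙?", "output": "answer C"},
    {"instruction": "哪里可以卖象牙?", "output": "answer D"},
]

repeated = find_repeated_prompts(records)
for prompt, outputs in repeated.items():
    print(prompt, len(outputs))  # 哪里可以买象牙? 3
```

Keeping the outputs alongside each prompt makes it possible to review the competing answers side by side before deciding which to keep.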

One Ruozhiba (弱智吧) record has a problem

{"instruction": "九月份离高考还有九个月,那为啥六月份离高考只有几天了?", "input": "", "output": "这句话可以理解为:“谁能想到,一个总是对金钱极其感兴趣的人,竟然被金钱的诱惑所迷惑,失去了理智。”\n\n这里,“见 钱眼开”和“金钱蒙蔽了双眼”都是关于眼睛的比喻,但它们的含义并不是字面上的意思。第一个短语“见钱眼开”形容一个人一看到钱就变 得异常兴奋或积极,而第二个短语“金钱蒙蔽了双眼”则是指因为对金钱的渴望,一个人变得盲目,不能正确判断事情,失去了清晰的思考能力。这两个短语都用“眼睛”来比喻一个人对金钱的态度和由此导致的行为变化。这个问题看起来似乎有些矛盾,但实际上它是一个关于时间流逝和高考时间安排的简单问题。\n\n首先,我们需要了解高考的时间通常是在每年的6月份。在**,高考通常是指普通高等学校 招生全国统一考试,这个考试的时间是固定的,通常在每年的6月7日、8日和9日举行。\n\n现在我们来分析题目中的两个时间点:\n\n九月份离高考还有九个月:这里的“九个月”实际上是指从九月份到第二年的六月份,共计九个月。这是因为高考是在第二年的6月份举行, 所以从九月份开始计算,到6月份正好是九个月。\n\n六月份离高考只有几天了:这里的“几天”是指从高考前的一周到高考当天的时间。 由于高考通常在6月7日开始,所以如果在6月份,那么距离高考就只有几天的时间了。\n\n综上所述,九月份离高考还有九个月,而六月 份离高考只有几天了,这是由于高考的时间是固定的,并且九月份和六月份分别代表了高考前后的不同时间段。", "task_type": {"major": ["问答"], "minor": ["逻辑问答", "隐喻理解"]}, "domain": ["通用"], "metadata": "暂无元数据信息", "answer_from": "llm", "human_verified": true, "copyright": "暂无版权及作者信息"}

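Some questions

1. Questions

1.1 Dataset coverage

1) The paper introduces CQIA, but the experiments use CQIA-Subset, and it never states what data CQIA-Subset contains.

  • CQIA
    [image]

  • CQIA-Subset: see the reply from one of the other authors

[screenshot]
  • The coverage is unclear. Judging by this, CQIA-Subset is not strictly a subset of the CQIA described in the paper.

1.2 Use of the ruozhiba data

  • Quoting the original text (to the Ruozhiba folks: this actually has little to do with your answers, but you did supply the ideas for the questions):

    Ruozhiba is a sub-forum of Baidu Tieba, which is an interest-based community forum. Its posts often contain puns, polysemous words, causal reversals, and homophones, many of them designed with logical traps that are challenging even for humans. We collected the 500 most-liked posts. Using the titles as instructions, we excluded those that were non-instructive (i.e., declarative statements or unanswerable) or toxic. The responses were generated by humans or by GPT-4. We manually reviewed the GPT-4 responses to ensure accuracy, ultimately obtaining 240 (instruction, response) pairs.

  • Some suspect that, because the Ruozhiba data uses answers generated by GPT-4 and GPT-4 is then also used for evaluation, this introduces bias.

  • The same question again: is the ruozhiba data in the final dataset or not?

1.3 No evaluation of baseline capability

1) Human evaluation: why is Yi-6B not included as a baseline?

[screenshot]
  • After fine-tuning, Yi-6B did not beat Yi-6B-Chat. Falling short of their closed-source data is no big deal, but why not compare against Yi-6B?

2) GPT4-eval: no comparison with the base model

[image]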
