A collection of open-source instruction tuning datasets to train chat-based LLMs (ChatGPT, LLaMA, Alpaca)
Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. This repo provides a comprehensive list of the datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to find and use these resources.
Other relevant awesome-list: nichtdax/awesome-totally-open-chatgpt
Size: The number of instruction tuning pairs
Lingual-Tags:
- EN: Instruction datasets in English
- CN: Instruction datasets in Chinese
- ML: [Multi-lingual] Instruction datasets in multiple languages
Task-Tags:
- MT: [Multi-task] Datasets containing multiple tasks
- TS: [Task-specific] Datasets tailored for specific tasks
Generation-method:
- HG: [Human Generated Dataset] Datasets created by humans
- SI: [Self-Instruct] Datasets generated using self-instruct methods
- MIX: [Mixed Dataset] Dataset contains both human and machine generated data
- COL: [Collection of Dataset] Dataset made from a collection of other datasets
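The tag scheme above is regular enough to be parsed by a script. The following is a minimal sketch, assuming every entry follows the `(owner/name)|Size|Tags...` shape used in this list. The names `DatasetEntry` and `parse_entry` are hypothetical and are not part of this repo.

```python
import re
from dataclasses import dataclass, field

# Tag vocabularies copied from the legend above.
LINGUAL_TAGS = {"EN", "CN", "ML"}
TASK_TAGS = {"MT", "TS"}
METHOD_TAGS = {"HG", "SI", "MIX", "COL"}

# Assumed entry shape: "(owner/name)|size|tag|tag|..."
ENTRY_RE = re.compile(r"^\((?P<owner>[^/()]+)/(?P<name>[^()]+)\)\|(?P<rest>.+)$")


@dataclass
class DatasetEntry:
    owner: str
    name: str
    size: str  # kept as text because sizes appear as "52K", "437k", or "N/A"
    lingual: list = field(default_factory=list)
    task: list = field(default_factory=list)
    method: list = field(default_factory=list)


def parse_entry(line: str) -> DatasetEntry:
    """Parse one list entry such as '(tatsu-lab/Alpaca)|52K|EN|MT|SI'."""
    # Drop an optional leading list bullet before matching.
    match = ENTRY_RE.match(line.strip().lstrip("-").strip())
    if match is None:
        raise ValueError(f"not a dataset entry: {line!r}")
    size, *tags = match["rest"].split("|")
    entry = DatasetEntry(match["owner"], match["name"], size)
    for tag in tags:
        if tag in LINGUAL_TAGS:
            entry.lingual.append(tag)
        elif tag in TASK_TAGS:
            entry.task.append(tag)
        elif tag in METHOD_TAGS:
            entry.method.append(tag)
        else:
            raise ValueError(f"unknown tag {tag!r} in {line!r}")
    return entry
```

For example, `parse_entry("- (XueFuzhao/InstructionWild)|52K|EN|CN|MT|SI")` yields an entry with two lingual tags (`EN`, `CN`), one task tag (`MT`), and one generation-method tag (`SI`).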
- The template
- The Instruction tuning Dataset
- (tatsu-lab/Alpaca)|52K|EN|MT|SI
- (gururise/Cleaned Alpaca)|52K|EN|MT|SI
- (XueFuzhao/InstructionWild)|52K|EN|CN|MT|SI
- (JosephusCheung/GuanacoDataset)|534K|ML|MT|SI
- (Hello-SimpleAI/HC3)|24K|EN|MT|MIX
- (Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX
- (allenai/prosocial-dialog)|58K|EN|MT|MIX
- (allenai/natural-instructions)|1.6K|ML|MT|HG
- (bigscience/xP3)|N/A|ML|MT|MIX
- (nomic-ai/gpt4all)|437k|EN|MT|COL
- (PhoebusSi/Alpaca-CoT)|500k|ML|MT|COL
- (google-research/FLAN)|N/A|EN|MT|MIX
- (thunlp/UltraChat)|280k|EN|TS|MIX
- (cascip/ChatAlpaca)|10k|EN|MT|MIX
- (YeungNLP/firefly-train-1.1M)|1100k|CN|MT|COL
- (orhonovich/unnatural-instructions)|240K|EN|MT|MIX
- (Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI
- Reinforcement Learning from Human Feedback (RLHF) Datasets
- At Your Own Risk Dataset
- Awesome Codebase
Append new projects at the end of the file, using the following template:
## [({owner}/{project-name})|Tags]({https://github.com/link/to/project})
- summary:
- Data generation model:
- paper:
- Cost:
- Related: (if applicable)
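To keep new entries consistent, the template can be filled in programmatically. This is only a sketch: `render_entry` is a hypothetical helper, and it assumes the heading is meant to be a markdown link of the form `[(owner/name)|Tags](url)`.

```python
def render_entry(owner, name, tags, url, summary,
                 model="N/A", paper="N/A", cost="N/A", related=None):
    """Return a markdown block for one project, in the template's field order."""
    tag_text = "|".join(tags)
    lines = [
        # Assumption: the heading is a markdown link whose text carries the tags.
        f"## [({owner}/{name})|{tag_text}]({url})",
        "",
        f"- Summary: {summary}",
        f"- Data generation model: {model}",
        f"- paper: {paper}",
        f"- Cost: {cost}",
    ]
    # "Related" is optional in the template, so emit it only when provided.
    if related:
        lines.append(f"- Related: {related}")
    return "\n".join(lines)


print(render_entry(
    "owner", "project-name", ["52K", "EN", "MT", "SI"],
    "https://github.com/link/to/project",
    "One-sentence description of the dataset.",
))
```

Fields that are unknown default to `N/A`, matching how existing entries in this list mark missing papers and costs.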
- Summary: 52K data generated from a modified self-instruct pipeline with 175 human-written seed tasks.
- Data generation model: text-davinci-003
- paper: alpaca-blog
- Cost: $600
- Summary: A project that manually cleaned the Alpaca 52K Dataset
- Data generation model: text-davinci-003
- paper: N/A
- Cost: N/A
- Summary: 52K data generated from a modified self-instruct pipeline with 429 human-written seed tasks.
- Data generation model: text-davinci-003
- paper: N/A
- Cost: $880
- Summary: 52K instruction data generated from a modified self-instruct pipeline with 429 human-written seed tasks.
- Data generation model: text-davinci-003
- Cost: $6000
- Summary: The first human-ChatGPT comparison corpus (English version), named the HC3 dataset.
- Data generation model: gpt-3.5, human generated
- paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
- Cost: N/A
- Summary: The first human-ChatGPT comparison corpus (Chinese version), named the HC3 dataset.
- Data generation model: gpt-3.5, human generated
- paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
- Cost: N/A
- Summary: ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms.
- Data generation model: gpt-3.5, human generated
- paper: ProsocialDialog: A Prosocial Backbone for Conversational Agents
- Cost: N/A
- Summary: A community effort to create a large collection of 1,616 diverse NLP tasks and their natural language definitions/instructions.
- Data generation model: Human generated
- paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
- Cost: N/A
- Summary: [Prompt-resource] xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets across 46 languages & 16 NLP tasks.
- Data generation model: N/A
- paper: Crosslingual Generalization through Multitask Finetuning
- Cost: N/A
- Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: Their repository will continuously collect various instruction tuning datasets. Github Repo
- paper: N/A
- Cost: N/A
- Summary: gpt4all leverages three publicly available datasets: (1) laion/OIG, (2) pacovaldez/stackoverflow-questions, and (3) a subset of bigscience/bloomz-p3.
- Data generation model: N/A
- paper: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
- Cost: $500
- Summary: A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
- Data generation model: GPT-4
- paper: N/A
- Cost: N/A
- Summary: The Flan Collection compiles datasets from Flan 2021, P3, Super-Natural Instructions, along with dozens more datasets into one place, formats them into a mix of zero-shot, few-shot and chain-of-thought templates
- Data generation model: N/A
- paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
- Cost: N/A
- Summary: UltraChat aims to construct open-source, large-scale, multi-round dialogue data. The first part of UltraChat (i.e., the Questions about the World sector) has been released and contains 280k diverse and informative dialogues. More dialogues about writing and creation, and about assistance on existing materials, are to come.
- Data generation model: GPT-3.5-turbo
- paper: N/A
- Cost: N/A
- Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and the Chinese translated version are to come.
- Data generation model: GPT-3.5-turbo
- paper: N/A
- Cost: N/A
- Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
- Summary: Chinese datasets of 23 tasks combined with human-written instruction templates.
- Data generation model: N/A
- paper: N/A
- Cost: N/A
- Summary: 64K examples generated by prompting a language model with three seed examples of instructions and eliciting a fourth. The set is then expanded to 240K by prompting the model to rephrase each instruction.
- Data generation model: text-davinci-002
- paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
- Cost: N/A
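The three-seeds-then-elicit-a-fourth idea from the summary above can be sketched as a prompt builder. The prompt layout, the `build_prompt` name, and the seed texts are assumptions made for illustration; this is not the template used in the paper, and the call to the language model is left out.

```python
import random


def build_prompt(seed_instructions, rng, k=3):
    """Show k sampled seed instructions and leave slot k+1 open for the model."""
    if len(seed_instructions) < k:
        raise ValueError(f"need at least {k} seed instructions")
    chosen = rng.sample(seed_instructions, k)
    parts = [
        f"Example {i}\nInstruction: {text}"
        for i, text in enumerate(chosen, start=1)
    ]
    # The open slot is what the language model is expected to complete.
    parts.append(f"Example {k + 1}\nInstruction:")
    return "\n\n".join(parts)


# Made-up seed instructions, for illustration only.
seeds = [
    "Translate the sentence into French.",
    "Summarize the paragraph in one sentence.",
    "List three synonyms for the given word.",
    "Classify the review as positive or negative.",
]
print(build_prompt(seeds, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampled seeds reproducible across runs.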
- Summary: 52K instruction-following data generated by GPT-4 with the original Alpaca prompts & Alpaca prompts translated into Chinese by ChatGPT + 9K instruction-following data generated by GPT-4 with prompts in Unnatural Instruction.
- Data generation model: GPT-4
- paper: Instruction Tuning with GPT-4
- Cost: N/A
- Related:
  - (tatsu-lab/Alpaca)|52K|EN|MT|SI
  - (orhonovich/unnatural-instructions)|240K|EN|MT|MIX
- Summary: This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data.
- Data generation model: Anthropic RL-CAI 52B
- paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Cost: N/A
- Related:
  - (Hello-SimpleAI/HC3)|24K|EN|MT|MIX
  - (Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX
- Summary: This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.
- Data generation model: N/A
- paper: A General Language Assistant as a Laboratory for Alignment
- Cost: N/A
- Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).
- Data generation model: N/A
- paper: N/A
- Cost: N/A
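A record of this kind (one post, two comments, one preference) maps naturally onto the prompt/chosen/rejected shape that preference-model training code often expects. The sketch below uses hypothetical field names and does not reflect the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class RedditPreference:
    # Hypothetical field names; check the dataset card for the real schema.
    post: str
    comment_a: str
    comment_b: str
    a_preferred: bool  # True when Reddit users collectively preferred comment_a


def to_chosen_rejected(example: RedditPreference) -> dict:
    """Reorder the two comments so the preferred one is always 'chosen'."""
    if example.a_preferred:
        chosen, rejected = example.comment_a, example.comment_b
    else:
        chosen, rejected = example.comment_b, example.comment_a
    return {"prompt": example.post, "chosen": chosen, "rejected": rejected}


# Made-up record, for illustration only.
record = RedditPreference(
    post="How do I keep basil alive indoors?",
    comment_a="Water it when the top of the soil is dry.",
    comment_b="Just buy a new one.",
    a_preferred=True,
)
print(to_chosen_rejected(record))
```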
- Summary: Ranked responses to Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML), obtained by asking GPT-4 to rate their quality (note: the data is evaluated by the GPT-4 model, NOT by humans). The author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses".
- Data generation model: GPT-4
- paper: Instruction Tuning with GPT-4
- Cost: N/A
- Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
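Model-assigned ratings like these are often converted into pairwise comparisons before training a reward model. Here is a minimal sketch under stated assumptions: the numeric score scale, the example scores, and the `ratings_to_pairs` helper are all hypothetical and are not taken from the dataset.

```python
from itertools import combinations


def ratings_to_pairs(prompt, rated_responses):
    """Turn per-response scores into (prompt, better, worse) triples.

    rated_responses maps a model name to a (response_text, score) tuple.
    Tied responses carry no preference signal and are skipped.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(rated_responses.values(), 2):
        if score_a == score_b:
            continue
        better, worse = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
        pairs.append((prompt, better, worse))
    return pairs


# Made-up responses and scores, for illustration only.
rated = {
    "gpt-4": ("Response A", 9),
    "gpt-3.5": ("Response B", 7),
    "opt-iml": ("Response C", 7),
}
for pair in ratings_to_pairs("Explain overfitting.", rated):
    print(pair)
```

Skipping ties is a design choice: with three rated responses this yields at most three pairs per prompt, and fewer when scores coincide.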
- Summary: A compilation of tatsu-lab/alpaca, Dahoas/instruct-human-assistant-prompt, and allenai/prosocial-dialog.
- Data generation model: N/A
- paper: N/A
- Cost: N/A
- Summary: Alternatives are projects featuring different instruct finetuned language models for chat.