
awesome-instruction-tuning(ChatGPT|LLaMA)-dataset

A collection of open-source instruction tuning datasets to train chat-based LLMs (ChatGPT, LLaMA, Alpaca).

Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) datasets are key components of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources.

Other relevant awesome-list: nichtdax/awesome-totally-open-chatgpt

Size: The number of instruction tuning pairs

Lingual-Tags:

  • EN: Instruction datasets in English
  • CN: Instruction datasets in Chinese
  • ML: [Multi-lingual] Instruction datasets in multiple languages

Task-Tags:

  • MT: [Multi-task] Datasets containing multiple tasks
  • TS: [Task-specific] Datasets tailored for specific tasks

Generation-method:

  • HG: [Human Generated Dataset] Datasets created by humans
  • SI: [Self-Instruct] Datasets generated using self-instruct methods
  • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
  • COL: [Collection of Dataset] Dataset made from a collection of other datasets
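Taken together, an entry's tags form a compact string such as `(tatsu-lab/Alpaca)|52K|EN|MT|SI`. Below is a minimal Python sketch (not part of this repo; the function and field names are illustrative assumptions) of how such a string decomposes into the fields above:

```python
# Illustrative parser for entry tag strings like
# "(tatsu-lab/Alpaca)|52K|EN|MT|SI". Not part of the original repo.

LINGUAL = {"EN", "CN", "ML"}          # language tags
TASK = {"MT", "TS"}                   # task tags
METHOD = {"HG", "SI", "MIX", "COL"}   # generation-method tags

def parse_entry(tag: str) -> dict:
    """Split a '(owner/project)|Size|Tags...' string into its parts."""
    parts = tag.split("|")
    project = parts[0].strip("()")    # e.g. "tatsu-lab/Alpaca"
    size = parts[1]                   # e.g. "52K" instruction tuning pairs
    tags = parts[2:]
    return {
        "project": project,
        "size": size,
        "lingual": [t for t in tags if t in LINGUAL],
        "task": [t for t in tags if t in TASK],
        "method": [t for t in tags if t in METHOD],
    }

print(parse_entry("(tatsu-lab/Alpaca)|52K|EN|MT|SI"))
# {'project': 'tatsu-lab/Alpaca', 'size': '52K', 'lingual': ['EN'],
#  'task': ['MT'], 'method': ['SI']}
```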

Table of Contents

  1. The template
  2. The Instruction-following Datasets
  3. Reinforcement Learning from Human Feedback (RLHF) Datasets
  4. Datasets without license information
  5. Open-source Codebase For Instruction-following LLMs

The template

Append the new project at the end of the file, following this template:

## [({owner}/{project-name})|{Tags}](https://github.com/link/to/project)

- Summary:
- Data generation model:
- paper:
- Cost:
- Related: (if applicable)
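For example, the Alpaca entry from the list below, written out under this template (the URL is the well-known tatsu-lab/stanford_alpaca repository; the size and tags are taken from the Related lines later in this document):

## [(tatsu-lab/Alpaca)|52K|EN|MT|SI](https://github.com/tatsu-lab/stanford_alpaca)

- Summary: 52K data generated from a modified self-instruct pipeline with 175 human-written seed tasks.
- Data generation model: text-davinci-003
- paper: alpaca-blog
- Cost: $600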

The Instruction-following Datasets

  • Summary: 52K data generated from a modified self-instruct pipeline with 175 human-written seed tasks (a sketch of this generation loop appears after this list).
  • Data generation model: text-davinci-003
  • paper: alpaca-blog
  • Cost: $600

  • Summary: A project that manually cleaned the Alpaca 52K dataset.
  • Data generation model: text-davinci-003
  • paper: N/A
  • Cost: N/A

  • Summary: 52K data generated from a modified self-instruct pipeline with 429 human-written seed tasks.
  • Data generation model: text-davinci-003
  • paper: N/A
  • Cost: $880

  • Summary: 52K instruction data generated from a modified self-instruct pipeline with 429 human-written seed tasks.
  • Data generation model: text-davinci-003
  • Cost: $6000

  • Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: their repository continuously collects various instruction tuning datasets. Github Repo
  • paper: N/A
  • Cost: N/A

  • Summary: A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
  • Data generation model: GPT-4
  • paper: N/A
  • Cost: N/A

  • Summary: UltraChat aims to construct an open-source, large-scale, multi-round dialogue dataset. The first part of UltraChat (the "Questions about the World" sector) has been released, containing 280k diverse and informative dialogues. More dialogues about writing and creation, and assistance on existing materials, are to come.
  • Data generation model: GPT-3.5-turbo
  • paper: N/A
  • Cost: N/A

  • Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and a Chinese-translated version are to come.
  • Data generation model: GPT-3.5-turbo
  • paper: N/A
  • Cost: N/A
  • Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI

  • Summary: Chinese datasets covering 23 tasks, combined with human-written instruction templates.
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A
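Several entries above (Alpaca and its 429-seed variants) are produced with the self-instruct method: a small pool of human-written seed tasks is iteratively expanded by prompting a completion model with sampled examples. Here is a minimal sketch of that loop, assuming a hypothetical `complete()` stand-in for a text-davinci-003-style API call; this is illustrative, not the authors' actual pipeline:

```python
import random

def complete(prompt: str) -> str:
    # Hypothetical stand-in for a completion API call (e.g. text-davinci-003).
    # Replace with a real API call; here it returns a canned string so the
    # sketch runs end-to-end.
    return "Write a short poem about autumn."

seed_tasks = [
    "Give three tips for staying healthy.",
    "Explain why the sky is blue.",
    # ... the real pipelines start from 175 or 429 human-written seeds
]

def generate_instructions(pool: list[str], rounds: int = 3, k: int = 2) -> list[str]:
    """Grow the task pool by prompting the model with sampled examples."""
    for _ in range(rounds):
        examples = random.sample(pool, min(k, len(pool)))
        prompt = (
            "Come up with a new, diverse instruction for a language model.\n"
            "Examples:\n- " + "\n- ".join(examples) + "\nNew instruction:"
        )
        candidate = complete(prompt).strip()
        if candidate and candidate not in pool:  # crude de-duplication
            pool.append(candidate)
    return pool

print(generate_instructions(seed_tasks))
```

Each accepted instruction is then paired with a model-written response; the resulting (instruction, output) pairs are what the Size column counts.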

Reinforcement Learning from Human Feedback (RLHF) Datasets

  • Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A

  • Summary: Ranked responses to Alpaca prompts from three models (GPT-4, GPT-3.5, and OPT-IML), rated for quality by GPT-4 (note: the data is evaluated by the GPT-4 model, not by humans). The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses."
  • Data generation model: GPT-4
  • paper: Instruction Tuning with GPT-4
  • Cost: N/A
  • Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
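Both RLHF datasets above reduce to the same shape: a prompt plus a preferred and a rejected response. A minimal sketch, with hypothetical field names and placeholder strings, of one such record and the standard pairwise (Bradley-Terry style) objective commonly trained on this kind of data:

```python
import math

# One hypothetical preference record (placeholder strings, illustrative
# field names): the basic unit of both datasets above. The Reddit dataset
# derives the pair ranking from user votes; the GPT-4 dataset from
# GPT-4's own quality ratings.
record = {
    "prompt": "How do I stay focused while studying?",
    "chosen": "Break your work into 25-minute blocks with short breaks...",
    "rejected": "Just try harder.",
}

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """-log sigmoid(score_chosen - score_rejected).

    Training a reward model to minimize this pushes the preferred
    response's score above the rejected one's.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

print(pairwise_loss(1.5, 0.2))  # small loss when chosen already scores higher
```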

Datasets without license information

  • Summary: A compilation of tatsu-lab/alpaca, Dahoas/instruct-human-assistant-prompt, and allenai/prosocial-dialog.
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A

Open-source Codebase For Instruction-following LLMs

  • Summary: Alternative projects featuring different instruction-finetuned language models for chat.
