Learning to Summarize from Human Feedback using trlx

This example shows how to use trlx to train a summarization model with human feedback, following the fine-tuning procedure described in Stiennon et al., "Learning to Summarize from Human Feedback".

Before running anything, we need a few extra packages that are not in the trlx dependency list: HuggingFace's evaluate package and Google's re-implementation of ROUGE, rouge-score. To install them, run the following from this example's root directory:

pip install -r requirements.txt
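
These two packages are what compute the ROUGE numbers reported in the Results section below. A minimal sketch of that usage, with toy strings in place of real model summaries (evaluate's "rouge" metric delegates to rouge-score under the hood):

    # Minimal sketch: ROUGE-1/2/L via HuggingFace's `evaluate`, which wraps
    # Google's `rouge-score` implementation.
    import evaluate

    rouge = evaluate.load("rouge")

    predictions = ["the cat sat on the mat"]       # model-generated summaries
    references = ["a cat was sitting on the mat"]  # human-written TL;DRs

    print(rouge.compute(predictions=predictions, references=references))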

Training Process

For an in-depth description of the example, please refer to our blog post. The following gives a quick overview of the fine-tuning process and which scripts to run.

  1. Train SFT:

    cd sft/ && deepspeed train_sft.py

    Checkpoint: SFT

  2. Train Reward Model:

    cd rm/ && deepspeed train_rm.py

    Download reward model checkpoint:

    mkdir rm/rm_checkpoint
    wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O rm/rm_checkpoint/pytorch_model.bin
  3. PPO training:

    accelerate launch --config_file configs/default_accelerate_config.yaml trlx_train.py

    Checkpoint: PPO

    🩹 Warning: This particular training configuration requires at least 55GB of VRAM and is set up to use two GPUs; decrease batch_size if you run out of memory. For the general shape of the training call inside trlx_train.py, see the sketch after this list.
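
For orientation, here is a minimal sketch of what the PPO step in trlx_train.py boils down to, assuming trlx's high-level trlx.train() API. The config filename, the toy prompts, and the dummy reward_fn are placeholders: the actual script builds prompts from the TL;DR dataset and scores generated summaries with the reward model checkpoint downloaded in step 2.

    # Sketch only, not the actual trlx_train.py.
    import trlx
    from trlx.data.configs import TRLConfig

    def reward_fn(samples, **kwargs):
        # Placeholder scores; the real example scores each generated summary
        # with the trained reward model from step 2.
        return [float(len(sample.split())) for sample in samples]

    # Hypothetical filename; use the PPO config shipped under configs/.
    config = TRLConfig.load_yaml("configs/ppo_config.yml")

    # Hypothetical toy prompts; the example formats Reddit posts ending in "TL;DR:".
    prompts = ["POST: a toy Reddit post\nTL;DR:"] * 8

    trlx.train(
        reward_fn=reward_fn,
        prompts=prompts,
        eval_prompts=prompts[:4],
        config=config,
    )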

Results

The following tables display ROUGE and reward scores on the test set of the TL;DR dataset for the SFT and PPO models.

  1. SFT vs PPO

    ROUGE scores

    | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Average |
    |-------|---------|---------|---------|---------|
    | SFT   | 0.334   | 0.125   | 0.261   | 0.240   |
    | PPO   | 0.323   | 0.109   | 0.238   | 0.223   |

    Reward scores

    | Model | Average Reward | Reward $\Delta$ |
    |-------|----------------|-----------------|
    | SFT   | 2.729          | -0.181          |
    | PPO   | 3.291          | +0.411          |
  2. Examples of generated summaries can be found here.

  3. Check our blog post for metric logs and other results.

References

  1. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. "Learning to Summarize from Human Feedback." Advances in Neural Information Processing Systems, 2020.
