Learning to Summarize from Human Feedback using trlx

This example shows how to use trlx to train a summarization model with human feedback, following the fine-tuning procedure described in Stiennon et al., "Learning to Summarize from Human Feedback".

Before running anything, we need a few extra packages that are not in the trlx dependency list: HuggingFace's evaluate package and Google's re-implementation of ROUGE, rouge-score. To install them, run the following from this example's root directory:

pip install -r requirements.txt
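
These two packages are what compute the ROUGE numbers reported in the Results section below. A minimal sketch of that usage, with toy strings in place of real model summaries (evaluate's "rouge" metric delegates to rouge-score under the hood):

    # Minimal sketch: ROUGE-1/2/L via HuggingFace's `evaluate`, which wraps
    # Google's `rouge-score` implementation.
    import evaluate

    rouge = evaluate.load("rouge")

    predictions = ["the cat sat on the mat"]       # model-generated summaries
    references = ["a cat was sitting on the mat"]  # human-written TL;DRs

    print(rouge.compute(predictions=predictions, references=references))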

Training Process

For an in-depth description of the example, please refer to our blog post. The following gives a quick overview of the fine-tuning process and which scripts to run.

  1. Train SFT:

    cd sft/ && deepspeed train_sft.py

    Checkpoint: SFT

  2. Train Reward Model:

    cd rm/ && deepspeed train_rm.py

    Download reward model checkpoint:

    mkdir rm/rm_checkpoint
    wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O rm/rm_checkpoint/pytorch_model.bin
  3. PPO training:

    accelerate launch --config_file configs/default_accelerate_config.yaml trlx_train.py

    Checkpoint: PPO

    🩹 Warning: This particular training configuration requires at least 55GB of VRAM and is set up to use two GPUs; decrease batch_size if you run out of memory. For the general shape of the training call inside trlx_train.py, see the sketch after this list.
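
For orientation, here is a minimal sketch of what the PPO step in trlx_train.py boils down to, assuming trlx's high-level trlx.train() API. The config filename, the toy prompts, and the dummy reward_fn are placeholders: the actual script builds prompts from the TL;DR dataset and scores generated summaries with the reward model checkpoint downloaded in step 2.

    # Sketch only, not the actual trlx_train.py.
    import trlx
    from trlx.data.configs import TRLConfig

    def reward_fn(samples, **kwargs):
        # Placeholder scores; the real example scores each generated summary
        # with the trained reward model from step 2.
        return [float(len(sample.split())) for sample in samples]

    # Hypothetical filename; use the PPO config shipped under configs/.
    config = TRLConfig.load_yaml("configs/ppo_config.yml")

    # Hypothetical toy prompts; the example formats Reddit posts ending in "TL;DR:".
    prompts = ["POST: a toy Reddit post\nTL;DR:"] * 8

    trlx.train(
        reward_fn=reward_fn,
        prompts=prompts,
        eval_prompts=prompts[:4],
        config=config,
    )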

Results

The following tables display ROUGE and reward scores on the test set of the TL;DR dataset for the SFT and PPO models.

  1. SFT vs PPO

    ROUGE scores

    | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Average |
    |-------|---------|---------|---------|---------|
    | SFT   | 0.334   | 0.125   | 0.261   | 0.240   |
    | PPO   | 0.323   | 0.109   | 0.238   | 0.223   |

    Reward scores

    | Model | Average Reward | Reward $\Delta$ |
    |-------|----------------|-----------------|
    | SFT   | 2.729          | -0.181          |
    | PPO   | 3.291          | +0.411          |
  2. Examples of generated summaries can be found here.

  3. Check our blog post for metric logs and other results.

References

  1. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. "Learning to Summarize from Human Feedback." Advances in Neural Information Processing Systems, 2020.
