openllmai / openrlhf Goto Github PK

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)

Home Page: https://openrlhf.readthedocs.io/

License: Apache License 2.0

Python 99.64% Shell 0.16% Dockerfile 0.20%

deepspeed transformers vllm large-language-models raylib reinforcement-learning-from-human-feedback reinforcement-learning

openrlhf's Introduction

Open-source / Comprehensive / Lightweight / Easy-to-use

[ English | 中文 ]

OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HF Transformers:

Simple and easy to use: OpenRLHF is one of the simplest high-performance RLHF libraries currently available, and compatible with Huggingface models and datasets.
High performance: RLHF training spends 80% of the time on the sample generation stage. Thanks to the ability to use a large inference batch size with Ray and Adam Offload (Pinned Memory) and vLLM generation acceleration, the performance of OpenRLHF 2x+ that of Optimized DeepSpeedChat with Hybrid Engine.
Distributed RLHF: OpenRLHF distribute the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 70B+ models with multiple A100 80G GPUs and vLLM and 7B models across multiple 24GB RTX 4090 GPUs.
PPO Implementation Optimization: We integrated the implementation tricks for PPO to improve the training stability, referencing Zhihu and the Notion blog.

More details are in Technical Report | Documents

Features

Distributed PPO based on Ray.
Support full RLHF fine-tuning of models with over 70 billion parameters.
Support vLLM generation acceleration in RLHF (--vllm_num_engines).
Support multiple reward models (--reward_pretrain model1,model2...).
Support DPO (direct-preference-optimization)/IPO/cDPO.
Support Kahneman-Tversky optimization (KTO).
Support Rejection Sampling.
Support Iterative DPO (https://github.com/RLHFlow/Online-RLHF).
Support Conditional SFT (https://arxiv.org/abs/2308.12050).
Support Knowledge Distillation (https://github.com/microsoft/LMOps/tree/main/minillm).
Support SFT samples packing (--packing_samples).
Support MoE (--aux_loss_coef)
Support FlashAttention2 (--flash_attn).
Support QLoRA (--load_in_4bit), LoRA (--lora_rank, --target_modules).
Support HuggingFace tokenizer.apply_chat_template in datasets (--apply_chat_template and --input_key).
Support Wandb log (--wandb).
Multi-nodes training scripts for Slurm.

PPO Support Matrix

Feature	OpenRLHF	DSChat	CAIChat	TRL
70B+ Full Tuning with 16 A100-80GB	✅	❌	❌	❌
7B Full Tuning with 4 RTX4090	✅	❌	❌	❌
34B DPO Full Tuning with 8 A100-80GB	✅	❌	❌	❌
Inference Engine in PPO	✅	✅	❌	❌
PPO Implementation Tricks	✅	❌	❌	✅
Support QLoRA	✅	❌	❌	✅
Support Mixtral 8*7b	✅	❌	❌	❌
Support Unmerged Actor-Critic	✅	✅	✅	❌
Support Multiple Reward Models	✅	❌	❌	❌
Support Huggingface Models	✅	✅	✅	✅
Easy-to-use	✅	❌ (HybridEngine bugs)	✅	✅

Quick Start

Installation

To use OpenRLHF, first launch the docker container (Recommended) and pip install openrlhf inside the docker container:

# Launch the docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.02-py3 bash

# pip install
pip install openrlhf

# If you want to use vLLM acceleration (To install vLLM 0.4.2)
pip install openrlhf[vllm]
# latest vLLM is also supported (using Gloo)
pip install openrlhf[vllm_latest]

# pip install the latest version
pip install git+https://github.com/OpenRLHF/OpenRLHF.git

# Or git clone
git clone https://github.com/OpenRLHF/OpenRLHF.git
cd OpenRLHF
pip install -e .

Note

We recommend using vLLM 0.4.2, as versions 0.4.3+ currently only support weight synchronization (DeepSpeed to vLLM) via Gloo (--vllm_sync_backend gloo). We also provided the Dockerfiles for vLLM and One-Click Installation Script of Nvidia-Docker.

Prepare Datasets

OpenRLHF provides multiple data processing methods in our dataset classes. Such as in the Prompt Dataset:

def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str:
    if apply_chat_template:
        prompt = apply_chat_template(data[input_key], tokenize=False, add_generation_prompt=True)
    else:
        prompt = data[input_key]
        if input_template:
            prompt = input_template.format(prompt)
    return prompt

We can use --input_key to specify the JSON key name of the input datasets --prompt_data {name or path} (PPO) or --dataset {name or path}, and use --apply_chat_template to utilize the chat_template from the Huggingface Tokenizer.
If you don't want to use --apply_chat_template, you can use --input_template instead, or preprocess the datasets offline in advance.
OpenRLHF also support mixing multiple datasets using --prompt_data_probs 0.1,0.4,0.5 (PPO) or --dataset_probs 0.1,0.4,0.5.

How Chat Templating Works:

dataset = [{"input_key": [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]}]

tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False)

"<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

Note

By default, we use train and test as splits to distinguish training and testing datasets from Huggingface. The JSON key options depends on the specific datasets. See Reward Dataset and SFT Dataset

Supervised Fine-tuning

OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using --pretrain {name or path}, --reward_pretrain {name or path} and --critic_pretrain {name or path}. We have provided some pre-trained checkpoints and datasets on HuggingFace OpenRLHF.

Then you can use the startup scripts we provide in the examples/scripts directory, or start the training using the following commands.

deepspeed --module openrlhf.cli.train_sft \
   --max_len 4096 \
   --dataset Open-Orca/OpenOrca \
   --input_key question \
   --output_key response \
   --input_template 'User: {}\nAssistant: ' \
   --train_batch_size 256 \
   --micro_train_batch_size 2 \
   --max_samples 500000 \
   --pretrain meta-llama/Meta-Llama-3-8B \
   --save_path ./checkpoint/llama3-8b-sft \
   --save_steps -1 \
   --logging_steps 1 \
   --eval_steps -1 \
   --zero_stage 2 \
   --max_epochs 1 \
   --bf16 \
   --flash_attn \
   --learning_rate 5e-6 \
   --gradient_checkpointing \
   --use_wandb {wandb_token}

# HF tokenizer.apply_chat_template is supported.
# --apply_chat_template 
# --input_key {JSON Key}
# --tokenizer_chat_template {HF Chat Template}

# SFT samples packing
# --packing_samples

# Can also be used for continued pre-training
# --pretrain_mode

Note

OpenRLHF SFT/DPO/RewardModel trainers support --packing_samples based on --flash_attn

Reward Model Training

deepspeed --module openrlhf.cli.train_rm \
   --save_path ./checkpoint/llama3-8b-rm \
   --save_steps -1 \
   --logging_steps 1 \
   --eval_steps -1 \
   --train_batch_size 256 \
   --micro_train_batch_size 1 \
   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
   --bf16 \
   --max_epochs 1 \
   --max_len 8192 \
   --zero_stage 3 \
   --learning_rate 9e-6 \
   --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
   --apply_chat_template \
   --chosen_key chosen \
   --rejected_key rejected \
   --flash_attn \
   --gradient_checkpointing \
   --use_wandb {wandb_token}

# RM samples packing
# --packing_samples

PPO without Ray

deepspeed --module openrlhf.cli.train_ppo \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
  --save_path ./checkpoint/llama-3-8b-rlhf \
  --save_steps -1 \
  --logging_steps 1 \
  --eval_steps -1 \
  --micro_train_batch_size 2 \
  --train_batch_size 128 \
  --micro_rollout_batch_size 4 \
  --rollout_batch_size 1024 \
  --max_epochs 1 \
  --prompt_max_len 1024 \
  --generate_max_len 1024 \
  --zero_stage 2 \
  --bf16 \
  --actor_learning_rate 5e-7 \
  --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 \
  --prompt_data OpenRLHF/prompt-collection-v0.1 \
  --input_key context_messages \
  --apply_chat_template \
  --max_samples 100000 \
  --normalize_reward \
  --adam_offload \
  --flash_attn \
  --gradient_checkpointing \
  --use_wandb {wandb_token}

PPO with Ray and vLLM

To improve RLHF training speed or support 70B models, we can use the PPO with Ray and vLLM acceleration

# launch the master node of ray in container
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

# if you want to launch ray on more nodes, use
ray start --address {MASTER-NODE-ADDRESS}:6379  --num-gpus 8

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "/openrlhf"}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 \
  --ref_num_gpus_per_node 2 \
  --reward_num_nodes 1 \
  --reward_num_gpus_per_node 2 \
  --critic_num_nodes 1 \
  --critic_num_gpus_per_node 2 \
  --actor_num_nodes 1 \
  --actor_num_gpus_per_node 2 \
  --vllm_num_engines 2 \
  --vllm_tensor_parallel_size 2 \
  --colocate_critic_reward \
  --colocate_actor_ref \
  --ref_reward_offload \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
  --save_path /openrlhf/examples/checkpoint/llama3-8b-rlhf \
  --micro_train_batch_size 8 \
  --train_batch_size 128 \
  --micro_rollout_batch_size 16 \
  --rollout_batch_size 1024 \
  --max_samples 100000 \
  --max_epochs 1 \
  --prompt_max_len 1024 \
  --generate_max_len 1024 \
  --zero_stage 3 \
  --bf16 \
  --actor_learning_rate 5e-7 \
  --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 \
  --prompt_data OpenRLHF/prompt-collection-v0.1 \
  --input_key context_messages \
  --apply_chat_template \
  --normalize_reward \
  --adam_offload \
  --flash_attn \
  --gradient_checkpointing \
  --use_wandb {wandb_token}

Note

Do not set --vllm_num_engines means not using the vLLM engine. You can also use setup_commands to let Ray automatically deploy the environment, such as --runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'.

The launch scripts and documents for supported algorithms are in example/scripts and Documents - Usage

Performance

We optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along with reward model (RM) and reference model (Ref) offload to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. The average time (seconds) it took to train 1024 prompts with 1 PPO epoch using the Optimized DSChat and OpenRLHF:

Size	NVIDIA A800-80GB GPUs	Optimized DSChat (with Hybrid Engine)	OpenRLHF	Speedup
7B	16	855.09	471.11	1.82x
13B	32	1528.93	608.93	2.5x
34B	32	3634.98	1526.4	2.4x
70B	32	10407.0	4488.53	2.3x

Performance Tuning Guide

To achieve optimal performance, we recommend allocating more nodes to the vLLM Engine. For example, for a 70B model with 32 A100 GPUs, it is advised to allocate more than 16 A100 GPUs to the vLLM Engine, 8 GPUs to the Actor model, and the remaining 8 GPUs to the Critic model. Additionally, enable the --colocate_critic_reward, --colocate_actor_ref, and --ref_reward_offload options to merge nodes. Finally, you should increase the rollout_micro_batch_size (and minimize the TP size of vLLM engine) as much as possible, and avoid Reward/Reference models forward OOM (Out Of Memory) issues. During the training phase, a larger --micro_train_batch_size is better. Enable enable_prefix_caching in vLLM generation when n_samples_per_prompt > 1.

Join Us

How to Join?

Email us at [email protected] or join GitHub Organization. Please include the following details:
- Your name
- Your GitHub username
- Your areas of interest
- Your skills and experience related to NLP and/or AI
You can also join us through the official GitHub OpenRLHF ↗ project page. Just create an issue about your interest to contribute and we will get back to you.

What can you do?

Join the team and participate in the development of the OpenRLHF project.
Contribute to the project by submitting pull requests.
Help improve documentation, fix bugs, or create new features.
Share the project and help us grow the community.

Sponsor Us

Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on Open Collective ↗.

Starchart

Contributors

A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue.

References & Acknowledgements

We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:

Our project would also like to thank ColossalChat and DeepSpeedChat. In the early stages of the project, we referred to their code design.

(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.

Citation

@article{hu2024openrlhf,
  title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
  author={Jian Hu and Xibin Wu and Weixun Wang and Xianyu and Dehao Zhang and Yu Cao},
  journal={arXiv preprint arXiv:2405.11143},
  year={2024}
}

openrlhf's People

Contributors

Stargazers

Watchers

Forkers

ranchizhao pikaqqqqqq jovany-wang wul8 dabney777 yiyi-philosophy leeeizhang wawpaopao hkksimple john-ge hex-plex vihoix3 suc16 missflash hhhh12345678 zhangxy-2019 theartificialoutsider wuxibin89 stat-eklee vincezengqiang ftgreat robertalanm damonyangyang dupenf plushpluto mylonasc yonggucheng wangsiwei2010 niuboboo boaz968 seekpoint tmsagarofficial vivekguruduttk28 vyksi vanesh37 shlee007 taoleitian xffxff wangxidong06 eltociear yzs-lab lihuibng callanwu chizhu tsaoyu li-plus yhna940 huihuitong tianbingsz edisonchenn yinchuanwang weedge liang-zx ziyiliubird fridayl zhanghaoie panandy xin-li-67 sundogs8603 yuanmeng1120 wuxiaobo murugesanraju rbao2018 donnyyou dylancer1998 kajyuuen atqarana haicaihi forex24 karthik-nexusflow thecats-jfm maic999 wmuog syzong zuonet1988 jamestiotio leejodie zyb5086zyb mgerstgrasser luobintianya wangguojim jackeylove1 raogj jacobthebanana dshnightmare kfertakis jenningsje stwaynexg victorshawfan ridiculouz stephen-nju hyunwoongko cdm114514 michaelcola zhangpengbo brunoscaglione trenorenos mickelliu ifromeast evdcush

openrlhf's Issues

HfDeepSpeedConfig must be kept during AutoModel.from_pretrained if using ZeRO-3

According to Non-Trainer Deepspeed Integration:

The HfDeepSpeedConfig is used to integrate Deepspeed into the 🤗 Transformers core functionality, when Trainer is not used. The only thing that it does is handling Deepspeed ZeRO-3 param gathering and automatically splitting the model onto multiple gpus during from_pretrained call.

from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel
import deepspeed

ds_config = {...}  # deepspeed config object or path to the file
# must run before instantiating the model to detect zero 3
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
model = AutoModel.from_pretrained("gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)

But we seem to missing HfDeepSpeedConfig when init Actor, Critic, Reward.

Support pretrain and post-pretrain

Error occurred when loading datasets from disk

I download Open-Orca/OpenOrca dataset to my disk and then set the --dataset as the saved address. However, an error occurred:

(lzy-rlhf) root@di-20231110113227-9fqgm:/alg_vepfs/public/LZY/mycodes/OpenRLHF/examples/pyscripts# bash train_sft_llama.sh 
[2023-11-13 15:00:47,432] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-13 15:00:49,850] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-13 15:00:49,850] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2023-11-13 15:00:50,267] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.18.56.49, master_port=29500
[2023-11-13 15:00:50,267] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00,  5.52s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using pad_token, but it is not set yet.
add pad_token
Actor(
  (model): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(32001, 4096)
      (layers): ModuleList(
        (0-31): 32 x LlamaDecoderLayer(
          (self_attn): LlamaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (rotary_emb): LlamaRotaryEmbedding()
          )
          (mlp): LlamaMLP(
            (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
            (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
            (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
            (act_fn): SiLUActivation()
          )
          (input_layernorm): LlamaRMSNorm()
          (post_attention_layernorm): LlamaRMSNorm()
        )
      )
      (norm): LlamaRMSNorm()
    )
    (lm_head): Linear(in_features=4096, out_features=32001, bias=False)
  )
)
Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 3.212965488433838 seconds
dataset: /alg_vepfs/public/LZY/dataset/OpenOrca
load local data file: /alg_vepfs/public/LZY/dataset/OpenOrca
script: []
files: ['/alg_vepfs/public/LZY/dataset/OpenOrca/dataset_dict.json', '/alg_vepfs/public/LZY/dataset/OpenOrca/train/state.json', '/alg_vepfs/public/LZY/dataset/OpenOrca/train/dataset_info.json']
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5599.87it/s]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 534.51it/s]
Generating train split: 1 examples [00:00, 251.35 examples/s]
Traceback (most recent call last):
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
    writer.write_table(table)
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/arrow_writer.py", line 572, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/table.py", line 2328, in table_cast
    return cast_table_to_schema(table, schema)
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
  child 0, item: struct<filename: string>
      child 0, filename: string
_fingerprint: string
_format_columns: null
_format_kwargs: struct<>
_format_type: null
_output_all_columns: bool
_split: string
to
{'splits': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/alg_vepfs/public/LZY/mycodes/OpenRLHF/examples/pyscripts/../train_sft.py", line 146, in <module>
    train(args)
  File "/alg_vepfs/public/LZY/mycodes/OpenRLHF/examples/pyscripts/../train_sft.py", line 42, in train
    train_data, eval_data = blending_datasets(args.dataset, args.dataset_probs, strategy, args.seed)
  File "/root/.local/lib/python3.9/site-packages/openrlhf/utils/utils.py", line 119, in blending_datasets
    data = load_dataset(data_type, data_files=files)
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/alg_vepfs/public/miniconda_dirs/envs/lzy-rlhf/lib/python3.9/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Support Lora & QLora

AssertionError: backward pass is invalid for module in evaluation mode

``Hi, thank you for making this repo!

An error occurred while training the PPO model. Polyglot 15.8b was used as the SFT model and polyglot5.8B was used as the reward model. I changed the model part by performing quantization and adding lora, but an error occurred regarding the backward pass.

Does anyone know how to solve this?

The contents below are the part where the error occurred, the part where the SFT and RM model were changed, and the part changed in the learning code.

thank you

Train epoch [1/1]:   0%|                                                                                                                                                                              | 0/1 [00:18<?, ?it/s]
Episode [1/1]:   0%|                                                                                                                                                                               | 0/2367 [12:56<?, ?it/s]
Traceback (most recent call last):
  File "ppo_test.py", line 244, in <module>
    trainer.fit(
  File "/raid2/baekig/OpenLLaMA2/openllama2/trainer/ppo_trainer.py", line 184, in fit
    status = self.ppo_train()
  File "/raid2/baekig/OpenLLaMA2/openllama2/trainer/ppo_trainer.py", line 223, in ppo_train
    status = self.training_step(experience)
  File "/raid2/baekig/OpenLLaMA2/openllama2/trainer/ppo_trainer.py", line 304, in training_step
    self.strategy.backward(critic_loss, self.critic, self.critic_optim)
  File "/raid2/baekig/OpenLLaMA2/openllama2/utils/deepspeed.py", line 97, in backward
    model.backward(loss)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1890, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2029, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 169, in backward
    ctx.pre_backward_function(ctx.module)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 436, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/raid2/baekig/anaconda3/envs/nlp/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 512, in pre_sub_module_backward_function
    assert sub_module.training, "backward pass is invalid for module in evaluation mode"

strategy = get_strategy(args)
...
sft_id ='EleutherAI/polyglot-ko-12.8b' 
rm_id = 'EleutherAI/polyglot-ko-5.8b'
actor = Actor(sft_id, bnbconfig)  # SFT
critic = Critic(rm_id, bnbconfig)  # RM
reward_model = RewardModel(rm_id, bnbconfig)  # RM

actor.gradient_checkpointing_enable()
critic.gradient_checkpointing_enable()
actor.model.config.use_cache = False
reward_model.model.config.use_cache = False

initial_model = deepcopy(actor)  # SFT
critic.model = deepcopy(reward_model.model)
critic.value_head = deepcopy(reward_model.value_head)
critic.mean = deepcopy(reward_model.mean)
critic.std = deepcopy(reward_model.std)
initial_model.gradient_checkpointing_enable()
reward_model.gradient_checkpointing_enable()
critic.model.config.use_cache = False
initial_model.model.config.use_cache = False

actor.train()
critic.train()
initial_model.train()
reward_model.train()
...
tokenizer = get_tokenizer(args.pretrain, actor.model, "left", strategy)
get_tokenizer(args.critic_pretrain, critic.model, "left", strategy)
get_tokenizer(args.critic_pretrain, reward_model.model, "left", strategy)

dataset = PromptDataset(data, strategy) 
prompts_dataloader = strategy.setup_dataloader(dataset, 
                args.micro_rollout_batch_size, True, True)

actor_optim = strategy.create_optimizer(
    actor, lr=args.actor_learning_rate, betas=(0.9, 0.95), weight_decay=args.l2
...
)

class Actor(nn.Module):
    """
    Actor model base class.

    Args:
        model (nn.Module): Actor Model.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(
        self,
        pretrain_or_model,
        bnbconfig,
        from_config=False,
        lora_rank: int = 0,
        lora_train_bias: str = "none",
    ) -> None:
        super().__init__()

        self.model = AutoModelForCausalLM.from_pretrained(
            pretrain_or_model, torch_dtype=torch.bfloat16, 
            quantization_config = bnbconfig,
            trust_remote_code=True,
            device_map = {"":0}
        )
        self.model = PeftModel.from_pretrained(self.model, 'ingeol/sft_adapter', is_trainable=True)

class Critic(nn.Module):
# ...This part is the same as the SFT code, except for the adapter.

class RewardModel(nn.Module):
# ... This part is the same as the SFT code, except for the adapter.

Thank you for creating a great repo.

问一下，huggingface提供的checkpoint为pt文件，如何转成大模型常见的.bin

Support Rejection Sampling

ckpt下载

请问能不能给一个其它ckpt下载链接，国内huggingface下载太不方便了

Support running on Ray as distributed RLHF framework.

Inquiry regarding the feasibility of fine-tuning LLaMA2-7B with a single A100

Hi team,
Great work, but I have a question to consult.
I used --adam_offload option in https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_sft_llama.sh which mentioned in your blog on Zhihu(https://zhuanlan.zhihu.com/p/650758507) to make it possible to fine-tuning 7B model with a single A100(80G).
However, upon implementing this option, I encountered difficulties as the script seemed to get stuck.
Could you provide more details about it and recommended practices for fine-tuning LLaMA2-7B with a single A100(80G):)

Support more prompt template in datasets

Implement Re-max

有几个问题

为什么使用.pt文件保存模型，而不是直接save_pretrained保存成hf格式？如果我需要将其他框架训练得到的hf格式的sft模型用于rm和ppo的话，我只能自己修改代码支持hf格式ckpt的读取。
prompt template可以多定义几个，比如常见的alpaca、vicuna和llama2格式。
rm和ppo的日志输出有点看不明白，比如rm的acc在哪，ppo的mean reward是哪个量，例如：

Train epoch [1/1]: 100%|█| 146/146 [09:37<00:00,  3.95s/it, pg=0.511, cri=0.0207, vals=0
{'pg': -0.007849020959988032, 'cri': 0.009956638121416103, 'vals': 0.28283763604481027, 'kl': -0.0015669609786402971, 'rm': 0.380661175522494, 'ret': 0.3954394154046496, 'glen': 942.7006952991225, 'tlen': 1071.2466288527398, 'k_coef': 0.01}

这里面的vals，rm和ret分别代表什么含义？

更大的模型

你好，请问，支持更大的llama模型吗，例如13B，30B等等。

Local dataset: Please perform appropriate preprocessing on your local data set.

We use huggingface's load_dataset to support common local data formats, but due to the diversity of various data sets themselves, it is impossible to cover the preprocessing of all data sets.
So,
We recommend that you perform appropriate preprocessing on local data or provide appropriate preprocessing scripts.

Data set format issues:
#134

Vocabulary overflow Issue with [PAD] for SFT

When expanding the tokenizer's vocabulary with [PAD] during SFT (https://github.com/OpenLLMAI/OpenLLaMA2/blob/main/examples/utils.py#L22), the tokenizer's vocabulary size becomes 32001 (e.g., print(size(tokenzier.vocab))).

However, the model's embedding can only handle up to 32000 entries.

This issue arises apparent when the startup parameter args.micro_train_batch_size=2 is utilized.

The default args.micro_train_batch_size=1, the dataloader does not invoke dataset.collect_fn, which consequently bypasses the padding process (and the potential overflow of the [PAD] token).

I would appreciate attention to this matter as it affects cases where larger batch sizes are used during SFT.

Add GPT-4 evaluation scripts

Support top chinese language models

pydantic.error_wrappers.ValidationError: 3 validation errors for DeepSpeedZeroConfig zero_hpz_partition_size extra fields not permitted (type=value_error.extra)

In the first SFT

deepspeed version == v0.9.5: Patch release
ValidationError: 3 validation errors for DeepSpeedZeroConfig
zero_hpz_partition_size
extra fields not permitted (type=value_error.extra)
zero_quantized_gradients
extra fields not permitted (type=value_error.extra)

Add pipeline module to support more scientific comparative experiments and research

Performance Test: OpenLLaMA2 vs others

开启ppo-ptx会出现梯度重复计算的报错

运行脚本：

../train_ppo.py \
    --pretrain /data/chuxiong/chinese-llama-2-7b-eot \
    --critic_pretrain /data/chuxiong/chinese-llama-2-7b-eot \
    --reward_model_path ./ckpt/chinese-llama-2-7b-openchat-rm/rm_model.pt \
    --sft_model_path /data/chuxiong/openchat/outputs/chinese-llama-2-7b-openchat/ep_4 \
    --save_path ./ckpt/chinese-llama-2-7b-openchat-ppo \
    --micro_train_batch_size 1 \
    --train_batch_size 126 \
    --micro_rollout_batch_size 1 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --inference_tp_size 1 \
    --init_kl_coef 0.01 \
    --prompt_data /data/chuxiong/hh_rlhf_cn_prompt \
    --prompt_data_probs 1. \
    --pretrain_data dlwh/wikitext_103_detokenized \
    --pretrain_data_probs 1. \
    --ptx_coef 1. \
    --normalize_reward \
    --adam_offload \
    --gradient_checkpointing

报错：

File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/openllama2/trainer/ppo_trainer.py", line 182, in ppo_train
    status = self.training_step(experience)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/openllama2/trainer/ppo_trainer.py", line 237, in training_step
    self.strategy.backward(actor_loss, self.actor, self.actor_optim)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/openllama2/utils/deepspeed.py", line 94, in backward
    model.backward(loss)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1895, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1902, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)   
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 814, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1262, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    assert self.params_already_reduced[param_id] == False, \
AssertionError: The parameter 286 has already been reduced.             Gradient computed twice for this partition.             Multiple gradient reduction is currently not supported

尝试更新到最新版deepspeed，又报下面的错：

  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/openllama2/trainer/ppo_trainer.py", line 
237, in training_step
    self.strategy.backward(actor_loss, self.actor, self.actor_optim)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/openllama2/utils/deepspeed.py", line 94, 
in backward
    model.backward(loss)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wra
pped_fn
    ret_val = func(*args, **kwargs)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1890, 
in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py",
 line 1953, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", l
ine 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in
 apply
    return user_fn(self, *args)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in 
backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in
 backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py",
 line 871, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1332, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 899, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_ipg_grads()
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1319, in reduce_ipg_grads
    self.copy_grads_in_partition(param)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1239, in copy_grads_in_partition
    self.async_accumulate_grad_in_cpu_via_gpu(param)
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1143, in async_accumulate_grad_in_cpu_via_gpu
    accumulate_gradients()
  File "/data/conda3/usr/chuxiong/envs/scx_llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1122, in accumulate_gradients
    param.grad_accum.data.view(-1).add_(dest_buffer)
AttributeError: 'NoneType' object has no attribute 'data'

Version0.0.1: Release the first development version

Features list: TODO

Docs list: TODO

Support Multiple Reward Models

Feature: Fine-grained Decoding Control

ref: https://platform.openai.com/docs/api-reference/completions/create

AttributeError: 'LlamaModel' object has no attribute 'backward'

Hi, thank you for making this repo!
I'm making reward model and i encounter this error,,

model_id = 'meta-llama/Llama-2-7b-hf'
model = RewardModel(model_id)
model.lora_enable(args.lora_rank)

tokenizer = get_tokenizer(model_id, model.model, "left", strategy)
train_dataset = RewardDataset(data_2, tokenizer, 512, strategy)

...


train_dataloader = strategy.setup_dataloader(
    train_dataset, batch_size=3, pin_memory=False, shuffle=False, collate_fn=train_dataset.collate_fn,
)
num_update_steps_per_epoch = len(
    train_dataloader) * args.max_epochs // strategy.accumulated_gradient
max_steps = math.ceil(args.max_epochs * num_update_steps_per_epoch)
optim = strategy.create_optimizer(
    model, lr=args.learning_rate, betas=(0.9, 0.95), weight_decay=args.l2)
scheduler = get_scheduler(
    "cosine",
    optim,
    num_warmup_steps=math.ceil(max_steps * 0.03),
    num_training_steps=max_steps,
)
...

        chosen_reward = model(chosen_ids, attention_mask=c_mask)
        reject_reward = model(reject_ids, attention_mask=r_mask)

        loss = loss_fn(chosen_reward, reject_reward)

        acc_mean = acc_mean * 0.9 + 0.1 * \
            (chosen_reward > reject_reward).float().mean().item()
        loss_mean = loss_mean * 0.9 + 0.1 * loss.item()
        reward_diff_mean = reward_diff_mean * 0.9 + 0.1 * \
            (chosen_reward - reject_reward).mean().item()
        
        print(loss) 
        strategy.backward(loss, model, optim)
        strategy.optimizer_step(
            optim, model, scheduler)

deepspeed rm_test.py \
     --save_path ./ckpt/7b_llama \
     --train_batch_size 1 \
     --micro_train_batch_size 1 \
     --pretrain meta-llama/Llama-2-7b-hf \
     --max_epochs 1 \
     --max_len 1024 \
     --zero_stage 2 \
     --learning_rate 9e-6 \

Please help me understand what this error is and how to fix it.

and i have question about how model.backward(loss) work.
file path is 'OpenLLaMA2/openllama2/utils/deepspeed.py/DeepspeedStrategy/backward'

thank you.. This is my first time posting a question on Git, so it may be different from the question format you usually see, so please let me know if there are any problems.

Support Multi-nodes training on Slurm

fixing Typo

need to fix the words(manualy--manually , sepecified--specified) in the below picture , please assign it to me

Add better docs and usage examples

请问模型如何

我看实现方法里有DPO和PPO，有现成的结果嘛，PPO比SFT提升了多少，以及DPO比PPO提升了多少之类的，加上这个结果对想从事这个方向的研究人员帮助会非常大~

feature: add api support for hosting a reward model

I want to use a 70b parameter model as my reward model. It is inefficient to load such model from pretrained and ideally should be queried through an api. However, the existing class does not support such usage.

Could this feature be implemented?

Or if it has, could someone point me to its usage as I cannot find it.

Kind thanks

PPO OOM

8*A100-80G:
Traceback (most recent call last):[02:06<01:20, 13.44s/it, pg=-.0119, cri=0.0702, vals=-.0352, kl=0, rm=0.0909, ret=0.0909, glen=1
File "../train_ppo.py", line 239, in
train(args)
File "../train_ppo.py", line 164, in train
trainer.fit(prompts_dataloader,
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/openllama2/trainer/ppo_trainer.py", line 143, in fit
status = self.ppo_train()
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/openllama2/trainer/ppo_trainer.py", line 166, in ppo_train
status = self.training_step(experience)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/openllama2/trainer/ppo_trainer.py", line 209, in training_step
self.strategy.backward(actor_loss, self.actor, self.actor_optim)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/openllama2/utils/deepspeed.py", line 81, in backward
model.backward(loss)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1895, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1902, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/opt/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 2; 79.35 GiB total capacity; 66.47 GiB already allocated; 3.87 GiB free; 72.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-08-31 02:44:16,573] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10230
[2023-08-31 02:44:22,676] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10231
[2023-08-31 02:44:29,025] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10232
[2023-08-31 02:44:29,025] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10233
[2023-08-31 02:44:37,723] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10235
[2023-08-31 02:44:45,725] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10237
[2023-08-31 02:44:54,060] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10239
[2023-08-31 02:45:01,505] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10241
[2023-08-31 02:45:08,532] [ERROR] [launch.py:321:sigkill_handler]
['/opt/conda/envs/llama2/bin/python3', '-u', '../train_ppo.py', '--local_rank=7', '--pretrain', './models/Llama-2-7b-hf', '--critic_pretrain', './models/Llama-2-7b-hf', '--reward_model_path', './ckpt/7b_llama/rm_model.pt', '--sft_model_path', './ckpt/7b_llama/sft_model.pt', '--save_path', './ckpt/7b_llama', '--micro_train_batch_size', '1', '--train_batch_size', '128', '--micro_rollout_batch_size', '1', '--rollout_batch_size', '1024', '--max_epochs', '1', '--prompt_max_len', '1024', '--generate_max_len', '1024', '--zero_stage', '2', '--bf16', '--actor_learning_rate', '5e-7', '--critic_learning_rate', '9e-6', '--inference_tp_size', '1', '--init_kl_coef', '0.01', '--prompt_data', 'yahma/alpaca-cleaned,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward', '--prompt_data_probs', '0.3,0.6,0.1', '--normalize_reward', '--adam_offload', '--gradient_checkpointing'] exits with return code = 1

Support Decision Transformer

Support Adam Optmizer offload and reload to GPU

Do you have a plan for applying Reinforced Self-Training (ReST)?

Thansks for sharing a great project!

Do you have any plans to apply the recently published Reinforced Self-Training (ReST)?

Reinforced Self-Training (ReST) for Language Modeling
https://arxiv.org/abs/2308.08998

[Severity] High similarity with Colossal-AI

Dear OpenLLMAI Team,

This is the Colossal-AI team.
Thank you for your contributions to the open source community.
But it looks like your open source content is highly similar to Colossal-AI and not properly referenced.

For example, The overall structure of your repos within the organization is very similar to ColossalAI/applications

There are also many highly similar details in the code, just give a few simple examples.

Replay_ Buffer Design&Implementation
OpenLLMAI with Colossal-AI
Experience Maker Design&Implementation
OpenLLMAI with Colossal-AI
Trainer Design&Implementation
OpenLLMAI with Colossal-AI

There are still many similarities that will not be listed one by one.
We hope that you follow the corresponding open source, academic, commercial, etc., norms immediately make the corresponding corrective measures.
This includes but is not limited to prominently referencing the Colossal-AI project on the homepage and LICENSE file of each project, and prominently indicating in each relevant code file which code of the Colossal AI project was referenced for implementation.

Thank you very much.
Colossal-AI team

DeepSpeed Training and Inference

It seems that in your scripts, the locak_rank always equals -1, so actually you didn't use deepspeed's parallel ability? e.g., data parallel or model parallel

Bug: AttributeError: 'DeepspeedStrategy' object has no attribute 'save_hf_format'

when use --save_hf_model:
I got a training collapse when saving models. so, 30 hours wasted.

strategy.save_hf_format(model, tokenizer, args.save_path + '/sft_hf')
AttributeError: 'DeepspeedStrategy' object has no attribute 'save_hf_format'

At last, only a 67k sft_model.pt was saved.

Introduce LINT tools

Support Evaluation Tools

including GPT-4, human evaluation, MMLU ,etc.

Support llama2 flash attention

Discussion on our 1st release.

Hi team,
as many functions are able to use, let's discuss our 1st alpha release. Please propose the things that you think they're need to be closed. Thanks.

Scale rlhf to 100B models

Support DPO

Support hybird-model in Ray PPO

such as: Baichuan2 for SFT/Actor model & Llama2 for RM/Critic model.

Loading RM ckpt bug: AttributeError: 'NoneType' object has no attribute 'load'

lauch scripts:

set -x 

read -r -d '' training_commands <<EOF
../train_rm.py \
     --save_path ./ckpt_test/TinyLlama \
     --train_batch_size 128 \
     --micro_train_batch_size 1 \
     --pretrain TinyLlama/TinyLlama-1.1B-Chat-v0.1 \
     --bf16 \
     --max_epochs 1 \
     --max_len 2048 \
     --zero_stage 3 \
     --learning_rate 5e-7 \
     --dataset tasksource/oasst1_pairwise_rlhf_reward \
     --dataset_probs 1.0 \
     --gradient_checkpointing \
     --adam_offload \
     --use_wandb xxxxxxxxx \
     --eval_steps 2 
		 --save_steps 2

EOF
     # --wandb [WANDB_TOKENS]

if [[ ${1} != "slurm" ]]; then
    export PATH=$HOME/.local/bin/:$PATH
    deepspeed $training_commands
    # deepspeed --include localhost:1 $training_commands
fi

root@xxx:../OpenLLaMA2/examples/scripts/ckpt/checkpoints_rm_llama# du -ah
5.8G    ./global_step22/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
5.8G    ./global_step22/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
104K    ./global_step22/zero_pp_rank_0_mp_rank_00_model_states.pt
104K    ./global_step22/zero_pp_rank_1_mp_rank_00_model_states.pt
12G     ./global_step22
5.8G    ./global_step24/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
5.8G    ./global_step24/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
104K    ./global_step24/zero_pp_rank_0_mp_rank_00_model_states.pt
104K    ./global_step24/zero_pp_rank_1_mp_rank_00_model_states.pt
12G     ./global_step24
5.8G    ./global_step26/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
5.8G    ./global_step26/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
104K    ./global_step26/zero_pp_rank_0_mp_rank_00_model_states.pt
104K    ./global_step26/zero_pp_rank_1_mp_rank_00_model_states.pt
12G     ./global_step26
4.0K    ./latest
24K     ./zero_to_fp32.py
35G     .

model file: zero_pp_rank_*mp_rank**_model_states.pt is only 104k. it seems have something error when saving.

the code for loading ckpt：

import argparse
import os
from datetime import timedelta

import jsonlines
import torch
from torch import distributed as dist
from tqdm import tqdm

from openrlhf.datasets import PromptDataset, SFTDataset
from openrlhf.models import Actor, RewardModel
from openrlhf.utils import blending_datasets, get_processor, get_strategy, get_tokenizer
import pdb

parser = argparse.ArgumentParser()
parser.add_argument("--eval_task", type=str, default="rm", help="set to generate or rm")
parser.add_argument("--pretrain", type=str, default="TinyLlama/TinyLlama-1.1B-Chat-v0.1")
parser.add_argument("--load_model", type=str, default=None)
parser.add_argument("--max_len", type=int, default=2048)
parser.add_argument("--zero_stage", type=int, default=3)
parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for deepspeed")
parser.add_argument("--bf16", action="store_true", default=True)
parser.add_argument("--flash_attn", action="store_true", default=False)
parser.add_argument("--micro_batch_size", type=int, default=1)
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument("--dataset_probs", type=str, default="1.0")
parser.add_argument("--output_path", type=str, default="./")
parser.add_argument("--max_samples", type=int, default=500000)
parser.add_argument("--seed", type=int, default=1234)

# for generation
parser.add_argument("--inference_tp_size", type=int, default=1)
parser.add_argument("--ta_prompt", type=str, default="")
parser.add_argument("--prompt_max_len", type=int, default=1024)
parser.add_argument("--greedy_sampling", action="store_true", default=False)
parser.add_argument("--top_p", type=float, default=0.9)
parser.add_argument("--temperature", type=float, default=1.0)
parser.add_argument("--repetition_penalty", type=float, default=1.2)
parser.add_argument("--best_of_n", type=int, default=1)
parser.add_argument(
    "--post_processor",
    type=str,
    default=None,
    help="set to rs (Rejection Sampling), dt (Decision Transformer) or None",
)

# for Iterative generation and Rejection Sampling
parser.add_argument("--iter", type=int, default=None)
parser.add_argument("--rollout_batch_size", type=int, default=2048)

# for Decision Transformer (DT) generation
parser.add_argument("--normalize_reward", action="store_true", default=False)
parser.add_argument("--reward_template", type=str, default=None)
# for DT evaluation
parser.add_argument("--enable_dt", action="store_true", default=False)
parser.add_argument("--dt_prompt", type=str, default="<rm_score>: 5.00", help="decision transformer prompt")

args = parser.parse_args()


# configure strategy
strategy = get_strategy(args)
strategy.setup_distributed(timeout=timedelta(seconds=9999999))

# configure model
# load huggingface model/config
from_config = bool(args.load_model)
model = RewardModel(args.pretrain, from_config, use_flash_attention_2=args.flash_attn)
# prepare models

model = strategy.prepare(model)
model.eval()

load_dir = "./ckpt/checkpoints_rm_llama"
tag = "global_step2"
model = model.load_checkpoint(load_dir=load_dir, tag=tag)

# model = strategy.load_ckpt(model=model, load_dir=load_dir,
#     tag=tag,
#     load_module_strict=True,
#     load_optimizer_states=True,
#     load_lr_scheduler_states=True,
#     load_module_only=False)

bug log:

root@xxx:../OpenLLaMA2/examples/scripts# bash inference_rm.sh 
+ read -r -d '' training_commands
+ [[ '' != \s\l\u\r\m ]]
+ export PATH=/root/.local/bin/:/usr/local/nvm/versions/node/v16.20.0/bin:/usr/local/nvm/versions/node/v16.20.0/bin:/etc/dsw/code-server/lib/vscode/bin/remote-cli:/usr/local/nvm/versions/node/v16.20.0/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin:/usr/local/nvm/versions/node/v16.20.0/bin:/etc/dsw/node/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/root/.local/bin/:/usr/local/nvm/versions/node/v16.20.0/bin:/usr/local/nvm/versions/node/v16.20.0/bin:/etc/dsw/code-server/lib/vscode/bin/remote-cli:/usr/local/nvm/versions/node/v16.20.0/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin:/usr/local/nvm/versions/node/v16.20.0/bin:/etc/dsw/node/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ deepspeed ../inference_rm.py --pretrain TinyLlama/TinyLlama-1.1B-Chat-v0.1
[2023-12-14 03:23:20,923] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-14 03:23:26,717] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-14 03:23:26,718] [INFO] [runner.py:570:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ../inference_rm.py --pretrain TinyLlama/TinyLlama-1.1B-Chat-v0.1 --bf16
[2023-12-14 03:23:28,724] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-14 03:23:34,520] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.18.3
[2023-12-14 03:23:34,520] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-12-14 03:23:34,520] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-12-14 03:23:34,520] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-12-14 03:23:34,520] [INFO] [launch.py:163:main] dist_world_size=2
[2023-12-14 03:23:34,520] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-12-14 03:23:41,742] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-14 03:23:41,889] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
123
[2023-12-14 03:23:42,858] [INFO] [comm.py:637:init_distributed] cdb=None
123
[2023-12-14 03:23:42,904] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-14 03:23:42,904] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-12-14 03:23:53,919] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.11.1, git-hash=unknown, git-branch=unknown
[2023-12-14 03:23:53,919] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[2023-12-14 03:23:58,729] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-12-14 03:23:58,730] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2023-12-14 03:23:58,853] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-12-14 03:23:58,853] [INFO] [utils.py:803:see_memory_usage] MA 1.94 GB         Max_MA 1.94 GB         CA 2.03 GB         Max_CA 2 GB 
[2023-12-14 03:23:58,854] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 14.45 GB, percent = 4.3%
Parameter Offload: Total persistent parameters: 94209 in 47 params
[2023-12-14 03:23:59,037] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /xxxx/OpenLLaMA2/examples/scripts/ckpt/checkpoints_rm_llama/global_step22/zero_pp_rank_1_mp_rank_00_model_states.pt...
[2023-12-14 03:23:59,042] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /xxxx/OpenLLaMA2/examples/scripts/ckpt/checkpoints_rm_llama/global_step22/zero_pp_rank_1_mp_rank_00_model_states.pt.
[2023-12-14 03:23:59,042] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /xxxx/OpenLLaMA2/examples/scripts/ckpt/checkpoints_rm_llama/global_step22/zero_pp_rank_1_mp_rank_00_model_states.pt...
[2023-12-14 03:23:59,047] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /xxxx/OpenLLaMA2/examples/scripts/ckpt/checkpoints_rm_llama/global_step22/zero_pp_rank_1_mp_rank_00_model_states.pt.
Traceback (most recent call last):
  File "/xxxx/OpenLLaMA2/examples/scripts/../inference_rm.py", line 86, in <module>
    model = model.load_checkpoint(load_dir=load_dir, tag=tag)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2708, in load_checkpoint
    success = self._load_zero_checkpoint(load_dir, tag, load_optimizer_states=load_optimizer_states)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2875, in _load_zero_checkpoint
    zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2950, in _get_all_zero_checkpoints
    return self._get_all_zero_checkpoint_state_dicts(zero_ckpt_names)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2929, in _get_all_zero_checkpoint_state_dicts
    _state = self.checkpoint_engine.load(
AttributeError: 'NoneType' object has no attribute 'load'

@hijkzzz @catqaq

Support wandb logs

[QUESTION] huggingface login in readme

Running LLaMA2 Example

huggingface login

/.local/bin/huggingface-cli login

it doesn't work in nvidia container.

Feature: Support detailed running process management: save_steps, log_steps, eval_steps

I got a training collapse when saving models. so, 30 hours wasted.
see:#101

So, We need more detailed experimental process management.
save_steps is ralated to #65
log_steps, eval_steps are also necessary.

openllmai / openrlhf Goto Github PK

openrlhf's Introduction

Features

PPO Support Matrix

Quick Start

Installation

Prepare Datasets

Supervised Fine-tuning

Reward Model Training

PPO without Ray

PPO with Ray and vLLM

Performance

Performance Tuning Guide

Join Us

Sponsor Us

Starchart

Contributors

References & Acknowledgements

Citation

openrlhf's People

Contributors

Stargazers

Watchers

Forkers

openrlhf's Issues

huggingface login

Recommend Projects

Recommend Topics

Recommend Org