Comments (7)
@wuxibin89 may know how to fix DSChat.
from openrlhf.
Thanks for your reply. I found the forked version of DeepSpeed-Chat in @wuxibin89's repository, along with a new bash file for Ray and some version-update suggestions. I tried a Llama-7B experiment on 2 nodes × 8 A800s with a batch size of 1024; although I have only run a few rounds of data so far, it is relatively stable.
However, the total end-to-end time shows a big gap from the arXiv paper, and the generation time gap in particular is very large.
|  | Optimized DSChat in OpenRLHF paper | DSChat in my experiments | Gap |
|---|---|---|---|
| E2E time (s) | 855.09 | About 538 | About 297 |
| Generation time (s) | 590.157 | About 328 | About 262 |
| Training time (s) | 125.69 | About 148 | About 23 |
Could you give me some pointers? Here is the configuration I am trying to run; the launch script is below for your reference:
```bash
# Note: Actor_Lr, Critic_Lr, and OUTPUT are referenced below but defined
# elsewhere in the full script; their values are omitted here.
ACTOR_MODEL_PATH="OpenLLMAI/Llama-2-7b-sft-model-ocra-500k"
CRITIC_MODEL_PATH="OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/models--OpenLLMAI--Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/snapshots/a982afeed00fac9767d53aecde5b88947b1be194"
ACTOR_ZERO_STAGE=3
CRITIC_ZERO_STAGE=3

deepspeed --num_nodes=2 \
    -H hostfile \
    main.py \
    --data_path Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path $ACTOR_MODEL_PATH \
    --critic_model_name_or_path $CRITIC_MODEL_PATH \
    --num_padding_at_beginning 1 \
    --per_device_generation_batch_size 8 \
    --per_device_training_batch_size 8 \
    --generation_batches 8 \
    --ppo_epochs 1 \
    --max_answer_seq_len 1024 \
    --max_prompt_seq_len 1024 \
    --actor_learning_rate ${Actor_Lr} \
    --critic_learning_rate ${Critic_Lr} \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --offload \
    --offload_reference_model \
    --release_inference_cache \
    --gradient_accumulation_steps 8 \
    --actor_gradient_checkpointing \
    --critic_gradient_checkpointing \
    --actor_dropout 0.0 \
    --num_warmup_steps 0 \
    --deepspeed --seed 1234 \
    --actor_zero_stage $ACTOR_ZERO_STAGE \
    --critic_zero_stage $CRITIC_ZERO_STAGE \
    --enable_hybrid_engine \
    --output_dir $OUTPUT \
    --inference_tp_size 1
```
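For reference, the `-H hostfile` argument points to a standard DeepSpeed hostfile listing each node and its GPU slots; a minimal sketch for my 2-node × 8-GPU setup would be (hostnames here are placeholders, not my real machines):

```bash
# DeepSpeed hostfile: one line per node, "slots" = GPUs on that node.
# node1/node2 are placeholder hostnames.
node1 slots=8
node2 slots=8
```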
And here is the timing output from the log (the ~328 s generation figure in the table above corresponds to the 8 generation batches × the ~41 s per-batch latency reported here):
```
head: Epoch: 0 | Step: 7 | PPO Epoch: 1 | Actor Loss: 10.360088467597961 | Critic Loss: 54.418667793273926 | Unsupervised Loss: 0.0
head: End-to-End => Latency: 544.72s, TFLOPs: 30.65, Samples/sec: 1.88, Time/seq 0.53s, Batch Size: 1024, Total Seq. Length: 2048
head: Generation => Latency: 41.00s, Per-token Latency 40.03 ms, TFLOPs: 5.66, BW: -1.00 GB/sec, Answer Seq. Length: 1024
head: Preparation => Latency: 8.12s
head: Training => Latency: 151.79s, TFLOPs: 97.79
head: End-to-End => Latency: 544.72s, Real-End-to-End: 544.72s
head: Actor Model Parameters => 6.738 B, Critic Model Parameters => 6.607 B
head: Average reward score: 0.0657958984375 | EMA reward score: 0.0657958984375
head: Epoch: 0 | Step: 15 | PPO Epoch: 1 | Actor Loss: 10.356070637702942 | Critic Loss: 54.373966217041016 | Unsupervised Loss: 0.0
head: End-to-End => Latency: 531.29s, TFLOPs: 31.43, Samples/sec: 1.93, Time/seq 0.52s, Batch Size: 1024, Total Seq. Length: 2048
head: Generation => Latency: 39.94s, Per-token Latency 39.00 ms, TFLOPs: 5.81, BW: -1.00 GB/sec, Answer Seq. Length: 1024
head: Preparation => Latency: 8.29s
head: Training => Latency: 145.45s, TFLOPs: 102.05
head: End-to-End => Latency: 531.29s, Real-End-to-End: 531.29s
head: Actor Model Parameters => 6.738 B, Critic Model Parameters => 6.607 B
head: Average reward score: 0.1011962890625 | EMA reward score: 0.0693359375
```
Looking forward to your reply. Thank you again for the sharing and suggestions.
from openrlhf.
I didn't conduct this experiment myself, so I am not familiar with the details. However, don't worry: the OpenRLHF performance reported in the paper is not optimal either, because we didn't enable the --colocate_critic_reward and --colocate_actor_ref options.
I suspect your test results differ from ours because of the datasets: different datasets lead to different input and output lengths, and as training progresses, longer outputs slow the process down.
In addition, we updated the two checkpoints you used last week, which will also lead to different results; see https://huggingface.co/OpenLLMAI/Llama-2-7b-sft-model-ocra-500k/tree/main
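If you want results to stay comparable across checkpoint updates, one option is to pin the checkpoint to a fixed revision; a sketch, assuming a recent huggingface_hub CLI (the commit sha is a placeholder, not an actual snapshot id):

```bash
# Download a fixed revision of the SFT checkpoint so future updates to the
# Hub repo don't change benchmark inputs ("<commit-sha>" is a placeholder).
huggingface-cli download OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
    --revision <commit-sha> --local-dir ./llama2-7b-sft-pinned
```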
==============
If you are interested in performance testing, I recommend ensuring that OpenRLHF and Optimized DSChat have the same input and output lengths/checkpoints. Enable --colocate_critic_reward and --colocate_actor_ref for OpenRLHF, increase the number of vLLM engines, and maximize the micro-batch size as much as possible.
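For what it's worth, here is a minimal sketch of how those options might be combined on the Ray launcher. Flag names follow the OpenRLHF README; the module path, model paths, resource counts, engine count, and batch sizes are illustrative assumptions, not our benchmark configuration:

```bash
# Hedged sketch: colocation requires matching node/GPU counts for the
# colocated pairs (actor/ref and critic/reward); tune the vLLM engine
# count and micro-batch sizes to your cluster.
ray job submit --address="http://127.0.0.1:8265" \
    -- python3 -m openrlhf.cli.train_ppo_ray \
    --pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
    --reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
    --colocate_actor_ref \
    --colocate_critic_reward \
    --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
    --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
    --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
    --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
    --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
    --micro_train_batch_size 8 \
    --micro_rollout_batch_size 16
```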
from openrlhf.
@youngyoung321 From the discussion in microsoft/DeepSpeed#4469, I use this fork https://github.com/garrett4wade/DeepSpeed-for-dschat for the DeepSpeed-Chat benchmark.
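Installing the fork in place of stock DeepSpeed is typically just a pip install from the Git URL; a sketch, assuming the fork's default branch builds in your environment:

```bash
# Replace the stock deepspeed package with the fork
# (assumes the default branch; pin a commit if you need reproducibility).
pip uninstall -y deepspeed
pip install git+https://github.com/garrett4wade/DeepSpeed-for-dschat.git
```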
from openrlhf.
Thanks again for your reply; I will try your suggestions.
from openrlhf.
Thanks for your reply and for sharing; I will try this fork version.
from openrlhf.
Performance Tuning Guide: https://github.com/OpenLLMAI/OpenRLHF?tab=readme-ov-file#performance-tuning-guide
from openrlhf.
Related Issues (20)
- A worker died or was killed while executing a task by an unexpected system error. HOT 3
- SFT loss calculation issue HOT 2
- DPO Finetuning constantly gives preference loss as 0.6931 HOT 8
- Difference between `DeepSpeedEngine.save_checkpoint()` and `DeepSpeedStrategy.save_model()` HOT 2
- After DPO, the model's inference outputs are all garbled symbols HOT 1
- Support training from breakpoint HOT 3
- llama3 70B DPO example script
- where is gradient_accumulation HOT 1
- Support RLOO HOT 1
- During Train_PPO_llama_ray, is the Actor Model currently sharded across different GPUs? HOT 4
- ConnectionRefusedError: [Errno 111] Connection refused HOT 5
- Question about packing HOT 2
- "right" padding hardcoded HOT 3
- Error while saving the model under 4bit lora HOT 2
- multinode ppo training extremely slow HOT 15
- Request Entity Too Large when using Ray HOT 3
- GPU memory OOM during DPO training HOT 1
- Online DPO support HOT 4
- Feature: add DPO-P
- Zero stage 3 error HOT 1