Comments (7)
@wuxibin89 may know how to fix DSChat.
from openrlhf.
Thanks for your reply. I found the forked version of DeepSpeed-Chat in @wuxibin89's repository, along with a new bash file for Ray and some version-update suggestions. I tried a Llama-7B experiment on 2 nodes × 8 A800s with a batch size of 1024; although I have only run a few rounds of data so far, it is relatively stable.
However, the total end-to-end time shows a big gap from the arXiv paper, and the generation time gap in particular is very large.
|  | Optimized DSChat in OpenRLHF paper | DSChat in my experiments | Gap |
|---|---|---|---|
| E2E time (s) | 855.09 | About 538 | About 297 |
| Generation time (s) | 590.157 | About 328 | About 262 |
| Training time (s) | 125.69 | About 148 | About 23 |
Could you give me some pointers? Here is the configuration I am trying to run; the launch script is below for your reference:
```bash
# Note: Actor_Lr, Critic_Lr, and OUTPUT are referenced below but defined
# elsewhere in the full script; their values are omitted here.
ACTOR_MODEL_PATH="OpenLLMAI/Llama-2-7b-sft-model-ocra-500k"
CRITIC_MODEL_PATH="OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/models--OpenLLMAI--Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/snapshots/a982afeed00fac9767d53aecde5b88947b1be194"
ACTOR_ZERO_STAGE=3
CRITIC_ZERO_STAGE=3

deepspeed --num_nodes=2 \
    -H hostfile \
    main.py \
    --data_path Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path $ACTOR_MODEL_PATH \
    --critic_model_name_or_path $CRITIC_MODEL_PATH \
    --num_padding_at_beginning 1 \
    --per_device_generation_batch_size 8 \
    --per_device_training_batch_size 8 \
    --generation_batches 8 \
    --ppo_epochs 1 \
    --max_answer_seq_len 1024 \
    --max_prompt_seq_len 1024 \
    --actor_learning_rate ${Actor_Lr} \
    --critic_learning_rate ${Critic_Lr} \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --offload \
    --offload_reference_model \
    --release_inference_cache \
    --gradient_accumulation_steps 8 \
    --actor_gradient_checkpointing \
    --critic_gradient_checkpointing \
    --actor_dropout 0.0 \
    --num_warmup_steps 0 \
    --deepspeed --seed 1234 \
    --actor_zero_stage $ACTOR_ZERO_STAGE \
    --critic_zero_stage $CRITIC_ZERO_STAGE \
    --enable_hybrid_engine \
    --output_dir $OUTPUT \
    --inference_tp_size 1
```
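For reference, the `-H hostfile` argument points to a standard DeepSpeed hostfile listing each node and its GPU slots; a minimal sketch for my 2-node × 8-GPU setup would be (hostnames here are placeholders, not my real machines):

```bash
# DeepSpeed hostfile: one line per node, "slots" = GPUs on that node.
# node1/node2 are placeholder hostnames.
node1 slots=8
node2 slots=8
```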
And here is the timing output from the log (the ~328 s generation figure in the table above corresponds to the 8 generation batches × the ~41 s per-batch latency reported here):
```
head: Epoch: 0 | Step: 7 | PPO Epoch: 1 | Actor Loss: 10.360088467597961 | Critic Loss: 54.418667793273926 | Unsupervised Loss: 0.0
head: End-to-End => Latency: 544.72s, TFLOPs: 30.65, Samples/sec: 1.88, Time/seq 0.53s, Batch Size: 1024, Total Seq. Length: 2048
head: Generation => Latency: 41.00s, Per-token Latency 40.03 ms, TFLOPs: 5.66, BW: -1.00 GB/sec, Answer Seq. Length: 1024
head: Preparation => Latency: 8.12s
head: Training => Latency: 151.79s, TFLOPs: 97.79
head: End-to-End => Latency: 544.72s, Real-End-to-End: 544.72s
head: Actor Model Parameters => 6.738 B, Critic Model Parameters => 6.607 B
head: Average reward score: 0.0657958984375 | EMA reward score: 0.0657958984375
head: Epoch: 0 | Step: 15 | PPO Epoch: 1 | Actor Loss: 10.356070637702942 | Critic Loss: 54.373966217041016 | Unsupervised Loss: 0.0
head: End-to-End => Latency: 531.29s, TFLOPs: 31.43, Samples/sec: 1.93, Time/seq 0.52s, Batch Size: 1024, Total Seq. Length: 2048
head: Generation => Latency: 39.94s, Per-token Latency 39.00 ms, TFLOPs: 5.81, BW: -1.00 GB/sec, Answer Seq. Length: 1024
head: Preparation => Latency: 8.29s
head: Training => Latency: 145.45s, TFLOPs: 102.05
head: End-to-End => Latency: 531.29s, Real-End-to-End: 531.29s
head: Actor Model Parameters => 6.738 B, Critic Model Parameters => 6.607 B
head: Average reward score: 0.1011962890625 | EMA reward score: 0.0693359375
```
Looking forward to your reply. Thank you again for the sharing and suggestions.
from openrlhf.
I didn't conduct this experiment myself, so I am not familiar with the details. However, don't worry: the OpenRLHF performance reported in the paper is not optimal either, because we didn't enable the --colocate_critic_reward and --colocate_actor_ref options.
I suspect your test results differ from ours because of the datasets: different datasets lead to different input and output lengths, and as training progresses, longer outputs slow the process down.
In addition, we updated the two checkpoints you used last week, which will also lead to different results; see https://huggingface.co/OpenLLMAI/Llama-2-7b-sft-model-ocra-500k/tree/main
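If you want results to stay comparable across checkpoint updates, one option is to pin the checkpoint to a fixed revision; a sketch, assuming a recent huggingface_hub CLI (the commit sha is a placeholder, not an actual snapshot id):

```bash
# Download a fixed revision of the SFT checkpoint so future updates to the
# Hub repo don't change benchmark inputs ("<commit-sha>" is a placeholder).
huggingface-cli download OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
    --revision <commit-sha> --local-dir ./llama2-7b-sft-pinned
```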
==============
If you are interested in performance testing, I recommend ensuring that OpenRLHF and Optimized DSChat have the same input and output lengths/checkpoints. Enable --colocate_critic_reward and --colocate_actor_ref for OpenRLHF, increase the number of vLLM engines, and maximize the micro-batch size as much as possible.
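For what it's worth, here is a minimal sketch of how those options might be combined on the Ray launcher. Flag names follow the OpenRLHF README; the module path, model paths, resource counts, engine count, and batch sizes are illustrative assumptions, not our benchmark configuration:

```bash
# Hedged sketch: colocation requires matching node/GPU counts for the
# colocated pairs (actor/ref and critic/reward); tune the vLLM engine
# count and micro-batch sizes to your cluster.
ray job submit --address="http://127.0.0.1:8265" \
    -- python3 -m openrlhf.cli.train_ppo_ray \
    --pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
    --reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
    --colocate_actor_ref \
    --colocate_critic_reward \
    --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
    --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
    --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
    --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
    --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
    --micro_train_batch_size 8 \
    --micro_rollout_batch_size 16
```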
from openrlhf.
@youngyoung321 From the discussion in microsoft/DeepSpeed#4469, I use this fork https://github.com/garrett4wade/DeepSpeed-for-dschat for the DeepSpeed-Chat benchmark.
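Installing the fork in place of stock DeepSpeed is typically just a pip install from the Git URL; a sketch, assuming the fork's default branch builds in your environment:

```bash
# Replace the stock deepspeed package with the fork
# (assumes the default branch; pin a commit if you need reproducibility).
pip uninstall -y deepspeed
pip install git+https://github.com/garrett4wade/DeepSpeed-for-dschat.git
```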
from openrlhf.
Thanks again for your reply; I will try your suggestions.
from openrlhf.
Thanks for your reply and for sharing; I will try this fork version.
from openrlhf.
Performance Tuning Guide: https://github.com/OpenLLMAI/OpenRLHF?tab=readme-ov-file#performance-tuning-guide
from openrlhf.
Related Issues (20)
- A worker died or was killed while executing a task by an unexpected system error. HOT 3
- SFT loss calculation issue HOT 2
- DPO Finetuning constantly gives preference loss as 0.6931 HOT 8
- Difference between `DeepSpeedEngine.save_checkpoint()` and `DeepSpeedStrategy.save_model()` HOT 2
- After DPO, the model's inference outputs are all garbled symbols HOT 1
- Support training from breakpoint HOT 3
- llama3 70B DPO example script
- where is gradient_accumulation HOT 1
- Support RLOO HOT 1
- During Train_PPO_llama_ray, is the Actor Model currently sharded across different GPUs? HOT 4
- ConnectionRefusedError: [Errno 111] Connection refused HOT 5
- Question about packing HOT 2
- "right" padding hardcoded HOT 3
- Error while saving the model under 4bit lora HOT 2
- multinode ppo training extremely slow HOT 15
- Request Entity Too Large when using Ray HOT 3
- GPU memory OOM during DPO training HOT 1
- Online DPO support HOT 4
- Feature: add DPO-P
- Zero stage 3 error HOT 1