Comments (7)

hijkzzz commented on July 30, 2024

@wuxibin89 may know how to fix DSChat.

youngyoung321 commented on July 30, 2024

Thanks for your reply. I found the forked version of DeepSpeed-Chat in @wuxibin89's repository, along with a new bash file for Ray and some version-update suggestions. I tried a llama-7b experiment on 2 nodes × 8 A800s with a batch size of 1024; although I have only run a few rounds of data so far, it is relatively stable.
However, the total end-to-end time shows a big gap from the arXiv paper, and the generation time in particular differs greatly from what the paper reports.

| Metric | Optimized DSChat in OpenRLHF paper | DSChat in my experiments | Gap |
| --- | --- | --- | --- |
| E2E time (s) | 855.09 | about 538 | about 297 |
| Generation time (s) | 590.157 | about 328 | about 262 |
| Training time (s) | 125.69 | about 148 | about 23 |

Can you give me some further guidance? Here is the configuration I am trying to run; the launch script is below for your reference:

# Note: Actor_Lr, Critic_Lr, and OUTPUT are set elsewhere in my environment.
ACTOR_MODEL_PATH="OpenLLMAI/Llama-2-7b-sft-model-ocra-500k"
CRITIC_MODEL_PATH="OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/models--OpenLLMAI--Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/snapshots/a982afeed00fac9767d53aecde5b88947b1be194"
ACTOR_ZERO_STAGE=3
CRITIC_ZERO_STAGE=3
# The hostfile lists the two nodes (a format example follows this script).
deepspeed --num_nodes=2 \
   -H hostfile \
   main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 8 \
   --per_device_training_batch_size 8 \
   --generation_batches 8 \
   --ppo_epochs 1 \
   --max_answer_seq_len 1024 \
   --max_prompt_seq_len 1024 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --offload \
   --offload_reference_model \
   --release_inference_cache \
   --gradient_accumulation_steps 8 \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --actor_dropout 0.0 \
   --num_warmup_steps 0 \
   --deepspeed \
   --seed 1234 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --enable_hybrid_engine \
   --output_dir $OUTPUT \
   --inference_tp_size 1

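For reference, the hostfile passed via -H uses the standard DeepSpeed format; a minimal example for this 2-node × 8-GPU setup (hostnames are placeholders) would be:

# DeepSpeed hostfile: one line per node, slots = number of GPUs on that node
node1 slots=8
node2 slots=8

With 16 GPUs total, the flags above work out to 16 × per_device_generation_batch_size 8 × generation_batches 8 = 1024 prompts per PPO step, which matches the batch size of 1024 in the logs below.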
Here is what the output log shows for the timings (note: the ~41 s generation latency below appears to be per generation batch; across 8 generation batches that is roughly the ~328 s total in the table above):

head: Epoch: 0 | Step: 7 | PPO Epoch: 1 | Actor Loss: 10.360088467597961 | Critic Loss: 54.418667793273926 | Unsupervised Loss: 0.0
head: End-to-End => Latency: 544.72s, TFLOPs: 30.65, Samples/sec: 1.88, Time/seq 0.53s, Batch Size: 1024, Total Seq. Length: 2048
head: Generation => Latency: 41.00s, Per-token Latency 40.03 ms, TFLOPs: 5.66, BW: -1.00 GB/sec, Answer Seq. Length: 1024
head: Preparation => Latency: 8.12s
head: Training   => Latency: 151.79s, TFLOPs: 97.79
head: End-to-End => Latency: 544.72s, Real-End-to-End: 544.72s
head: Actor Model Parameters => 6.738 B, Critic Model Parameters => 6.607 B
head: Average reward score: 0.0657958984375 | EMA reward score: 0.0657958984375
head: Epoch: 0 | Step: 15 | PPO Epoch: 1 | Actor Loss: 10.356070637702942 | Critic Loss: 54.373966217041016 | Unsupervised Loss: 0.0
head: End-to-End => Latency: 531.29s, TFLOPs: 31.43, Samples/sec: 1.93, Time/seq 0.52s, Batch Size: 1024, Total Seq. Length: 2048
head: Generation => Latency: 39.94s, Per-token Latency 39.00 ms, TFLOPs: 5.81, BW: -1.00 GB/sec, Answer Seq. Length: 1024
head: Preparation => Latency: 8.29s
head: Training   => Latency: 145.45s, TFLOPs: 102.05
head: End-to-End => Latency: 531.29s, Real-End-to-End: 531.29s
head: Actor Model Parameters => 6.738 B, Critic Model Parameters => 6.607 B
head: Average reward score: 0.1011962890625 | EMA reward score: 0.0693359375

Looking forward to your reply. Thank you again for sharing and for the suggestions.

hijkzzz commented on July 30, 2024

> (quoting @youngyoung321's benchmark report above)

I didn't conduct this experiment myself, so I am not familiar with the details. However, don't worry; the OpenRLHF performance reported in the paper is not optimal either, because we didn't enable the --colocate_critic_reward and --colocate_actor_ref options.

I suspect your test results differ from ours because of the datasets. Different datasets produce different input and output lengths, and as training progresses, longer outputs slow the process down.

In addition, we updated the two checkpoints you used last week, which will also lead to different results; see https://huggingface.co/OpenLLMAI/Llama-2-7b-sft-model-ocra-500k/tree/main.

==============

If you are interested in performance testing, I recommend ensuring that OpenRLHF and Optimized DSChat use the same input/output lengths and checkpoints. For OpenRLHF, enable --colocate_critic_reward and --colocate_actor_ref, increase the number of vLLM engines, and maximize the micro-batch size as much as possible; a sketch follows below.
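For illustration, a minimal sketch of such a launch, assuming the openrlhf.cli.train_ppo_ray entry point and the flag names from the OpenRLHF README (GPU counts, batch sizes, and model paths are placeholders to adapt; verify the flags against your checkout):

ray job submit --address="http://127.0.0.1:8265" \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
   --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
   --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
   --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
   --colocate_actor_ref \
   --colocate_critic_reward \
   --vllm_num_engines 4 \
   --vllm_tensor_parallel_size 1 \
   --pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
   --reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
   --micro_rollout_batch_size 16 \
   --rollout_batch_size 1024 \
   --micro_train_batch_size 8 \
   --train_batch_size 1024 \
   --prompt_max_len 1024 \
   --generate_max_len 1024

The colocate flags pack the reference model onto the actor's GPUs and the reward model onto the critic's, which frees GPUs for additional vLLM engines; larger micro-batch sizes then improve utilization, memory permitting.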

wuxibin89 commented on July 30, 2024

@youngyoung321 From the discussion in microsoft/DeepSpeed#4469, I use this fork for the deepspeed-chat benchmark: https://github.com/garrett4wade/DeepSpeed-for-dschat
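For anyone reproducing this, one way to swap the fork in for upstream DeepSpeed (assuming it builds and installs like stock DeepSpeed) is:

pip uninstall -y deepspeed
pip install git+https://github.com/garrett4wade/DeepSpeed-for-dschat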

youngyoung321 commented on July 30, 2024

> (quoting @hijkzzz's reply above)

Thanks again for your reply; I will try your suggestions.

youngyoung321 commented on July 30, 2024

> (quoting @wuxibin89's fork suggestion above)

Thanks for your reply and for sharing; I will try this forked version.

hijkzzz commented on July 30, 2024

Performance Tuning Guide: https://github.com/OpenLLMAI/OpenRLHF?tab=readme-ov-file#performance-tuning-guide
