
Comments (6)

yananchen1989 commented on August 25, 2024

I have also tried examples/accelerate_configs/deepspeed_zero3.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

When max_seq_length is set to 4096, training runs in a model-parallel manner across the 8 GPUs (A40, 48 GB each).

But when max_seq_length is increased further, e.g. to 10000, training crashes with an OOM error.
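
For context on why the jump hits a wall: ZeRO-3 with CPU offload shards the parameters and optimizer state, but activation memory still lives on each GPU and grows with sequence length, and with eager (non-flash) attention the per-layer score matrix grows quadratically. A rough back-of-envelope sketch, assuming Mistral-7B-like values (32 heads, per-device batch 4, bf16 scores); none of these numbers are confirmed by this issue:

for seq in 4096 5120 10000; do
  # [batch, heads, seq, seq] attention-score matrix, 2 bytes per bf16 element (assumed values)
  bytes=$(( 4 * 32 * seq * seq * 2 ))
  echo "max_seq_length=$seq -> ~$(( bytes / 1024**3 )) GiB per layer for attention scores"
done

Even if this is only one contributor, it hints at why roughly 4096 tokens fit on a 48 GB card while noticeably longer sequences do not, and why gradient checkpointing alone may not be enough.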


yananchen1989 commented on August 25, 2024

My launch script:

accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_tp.py \
    --model_name_or_path 'mistralai/Mistral-7B-Instruct-v0.2' \
    --report_to="wandb" \
    --learning_rate=4.41e-5 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --output_dir="tp_deepspeed" \
    --logging_steps=1 \
    --num_train_epochs=100 \
    --max_steps=-1 \
    --gradient_checkpointing \
    --bf16 True \
    --do_eval True \
    --evaluation_strategy 'epoch' \
    --max_seq_length 4096


yananchen1989 commented on August 25, 2024

Just changing max_seq_length from 4096 to 5000 or 5120, with no other change, also causes an OOM error (full traceback below; a possible mitigation is sketched after it):

/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Installed CUDA version 12.0 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/chenyanan/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/chenyanan/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.356724500656128 seconds
[equivalent cpu_adam build/load messages repeat for each of the 8 ranks; load times range from ~2.35 to ~2.40 seconds]
Parameter Offload: Total persistent parameters: 266240 in 65 params
wandb: Currently logged in as: yananchen1116. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /home/chenyanan/trl/wandb/run-20240406_153830-vb22d3qy
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run final-cube-98
wandb: ⭐️ View project at https://wandb.ai/yananchen1116/huggingface
wandb: 🚀 View run at https://wandb.ai/yananchen1116/huggingface/runs/vb22d3qy
0%| | 0/200 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/chenyanan/trl/examples/scripts/sft_tp.py", line 159, in
trainer.train()
File "/home/chenyanan/trl/trl/trainer/sft_trainer.py", line 360, in train
output = super().train(*args, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
return inner_training_loop(
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 3045, in training_step
self.accelerator.backward(loss)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/accelerator.py", line 2007, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
self.engine.backward(loss, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 5 has a total capacity of 47.54 GiB of which 33.77 GiB is free. Process 2567931 has 2.62 GiB memory in use. Including non-PyTorch memory, this process has 11.12 GiB memory in use. Of the allocated memory 8.73 GiB is allocated by PyTorch, and 1.90 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[The same traceback and torch.cuda.OutOfMemoryError ("Tried to allocate 50.31 GiB") are raised on the other failing ranks as well, for GPUs 2, 3, 4, 6 and 7; only the per-GPU free-memory figures differ.]
[2024-04-06 15:38:49,201] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3235888 closing signal SIGTERM
[2024-04-06 15:38:49,202] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3235889 closing signal SIGTERM
[2024-04-06 15:38:50,371] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 3235890) of binary: /home/chenyanan/anaconda3/envs/mp/bin/python
Traceback (most recent call last):
File "/home/chenyanan/anaconda3/envs/mp/bin/accelerate", line 8, in
sys.exit(main())
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1060, in launch_command
deepspeed_launcher(args)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 764, in deepspeed_launcher
distrib_run.run(args)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/scripts/sft_tp.py FAILED

Failures:
[1]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3235891)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 3235892)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 3235893)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 3235894)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 3235895)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3235890)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
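
For what it's worth, the OOM message itself suggests trying the expandable-segments allocator; that only mitigates fragmentation and does not change how activation memory scales with sequence length, so treat the following as a hedged sketch rather than a fix. The other obvious lever, also just a sketch, is trading per-device batch size for gradient accumulation so the effective batch stays the same (all flags below already appear in the launch command above; the remaining flags are omitted for brevity):

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # allocator hint from the OOM message; fragmentation only
accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_tp.py \
    --model_name_or_path 'mistralai/Mistral-7B-Instruct-v0.2' \
    --output_dir="tp_deepspeed" \
    --per_device_train_batch_size=1 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --bf16 True \
    --max_seq_length 5120

Whether this actually fits at 5120 on a 48 GB A40 is something only a rerun can confirm.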


arivero commented on August 25, 2024

Besides training, model parallelism in the trl chat would be welcome too.


iFe1er commented on August 25, 2024

@yananchen1989 any suggestions here?


iFe1er commented on August 25, 2024

any updates?

