
Comments (6)

yananchen1989 commented on August 25, 2024

I have also tried examples/accelerate_configs/deepspeed_zero3.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

When max_seq_length is set to 4096, training runs in a model-parallel manner across the 8 GPUs (A40, 48 GB each).

But when max_seq_length is increased further, e.g. to 10000, training crashes with an OOM error.
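
For context on why the jump hits a wall: ZeRO-3 with CPU offload shards the parameters and optimizer state, but activation memory still lives on each GPU and grows with sequence length, and with eager (non-flash) attention the per-layer score matrix grows quadratically. A rough back-of-envelope sketch, assuming Mistral-7B-like values (32 heads, per-device batch 4, bf16 scores); none of these numbers are confirmed by this issue:

for seq in 4096 5120 10000; do
  # [batch, heads, seq, seq] attention-score matrix, 2 bytes per bf16 element (assumed values)
  bytes=$(( 4 * 32 * seq * seq * 2 ))
  echo "max_seq_length=$seq -> ~$(( bytes / 1024**3 )) GiB per layer for attention scores"
done

Even if this is only one contributor, it hints at why roughly 4096 tokens fit on a 48 GB card while noticeably longer sequences do not, and why gradient checkpointing alone may not be enough.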


yananchen1989 commented on August 25, 2024

My launch script:

accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_tp.py \
    --model_name_or_path 'mistralai/Mistral-7B-Instruct-v0.2' \
    --report_to="wandb" \
    --learning_rate=4.41e-5 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --output_dir="tp_deepspeed" \
    --logging_steps=1 \
    --num_train_epochs=100 \
    --max_steps=-1 \
    --gradient_checkpointing \
    --bf16 True \
    --do_eval True \
    --evaluation_strategy 'epoch' \
    --max_seq_length 4096


yananchen1989 commented on August 25, 2024

Just changing max_seq_length from 4096 to 5000 or 5120, with no other change, also causes an OOM error (full traceback below; a possible mitigation is sketched after it):

/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Installed CUDA version 12.0 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /home/chenyanan/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/chenyanan/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.356724500656128 seconds
[equivalent cpu_adam build/load messages repeat for each of the 8 ranks; load times range from ~2.35 to ~2.40 seconds]
Parameter Offload: Total persistent parameters: 266240 in 65 params
wandb: Currently logged in as: yananchen1116. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /home/chenyanan/trl/wandb/run-20240406_153830-vb22d3qy
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run final-cube-98
wandb: ⭐️ View project at https://wandb.ai/yananchen1116/huggingface
wandb: 🚀 View run at https://wandb.ai/yananchen1116/huggingface/runs/vb22d3qy
0%| | 0/200 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/chenyanan/trl/examples/scripts/sft_tp.py", line 159, in
trainer.train()
File "/home/chenyanan/trl/trl/trainer/sft_trainer.py", line 360, in train
output = super().train(*args, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
return inner_training_loop(
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/transformers/trainer.py", line 3045, in training_step
self.accelerator.backward(loss)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/accelerator.py", line 2007, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
self.engine.backward(loss, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.31 GiB. GPU 5 has a total capacity of 47.54 GiB of which 33.77 GiB is free. Process 2567931 has 2.62 GiB memory in use. Including non-PyTorch memory, this process has 11.12 GiB memory in use. Of the allocated memory 8.73 GiB is allocated by PyTorch, and 1.90 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[The same traceback and torch.cuda.OutOfMemoryError ("Tried to allocate 50.31 GiB") are raised on the other failing ranks as well, for GPUs 2, 3, 4, 6 and 7; only the per-GPU free-memory figures differ.]
[2024-04-06 15:38:49,201] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3235888 closing signal SIGTERM
[2024-04-06 15:38:49,202] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3235889 closing signal SIGTERM
[2024-04-06 15:38:50,371] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 3235890) of binary: /home/chenyanan/anaconda3/envs/mp/bin/python
Traceback (most recent call last):
File "/home/chenyanan/anaconda3/envs/mp/bin/accelerate", line 8, in
sys.exit(main())
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1060, in launch_command
deepspeed_launcher(args)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 764, in deepspeed_launcher
distrib_run.run(args)
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/scripts/sft_tp.py FAILED

Failures:
[1]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3235891)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 3235892)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 3235893)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 3235894)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 3235895)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-04-06_15:38:49
host : A40-36-111-143-5
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3235890)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
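
For what it's worth, the OOM message itself suggests trying the expandable-segments allocator; that only mitigates fragmentation and does not change how activation memory scales with sequence length, so treat the following as a hedged sketch rather than a fix. The other obvious lever, also just a sketch, is trading per-device batch size for gradient accumulation so the effective batch stays the same (all flags below already appear in the launch command above; the remaining flags are omitted for brevity):

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # allocator hint from the OOM message; fragmentation only
accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_tp.py \
    --model_name_or_path 'mistralai/Mistral-7B-Instruct-v0.2' \
    --output_dir="tp_deepspeed" \
    --per_device_train_batch_size=1 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --bf16 True \
    --max_seq_length 5120

Whether this actually fits at 5120 on a 48 GB A40 is something only a rerun can confirm.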


arivero commented on August 25, 2024

Besides training, model parallelism in the trl chat would be welcome too.


iFe1er commented on August 25, 2024

@yananchen1989 any suggestions here?


iFe1er commented on August 25, 2024

any updates?

