Comments (4)
Thanks for your interest in our work!
This error is caused by DeepSpeed ZeRO-3, which does not support backward passes in evaluation mode. We need eval mode because MiniLLM training is similar to RL training, where dropout is usually not applied.
If you use ZeRO-3 because of insufficient GPU memory, you can instead try ZeRO-Offload by setting the DeepSpeed config to the corresponding path.
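For reference, a minimal sketch of a ZeRO-2 config with optimizer offload, written as a Python dict (field names follow the DeepSpeed documentation; the batch sizes and the output filename are illustrative placeholders, not the repo's actual values):

```python
import json

# Sketch of a ZeRO-2 config with optimizer state offloaded to CPU.
# Batch-size values here are placeholders, not the repo's settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# Save as JSON and point the training script's DeepSpeed config path at it.
with open("ds_config_zero2_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```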
If you insist on using ZeRO-3, you can set the dropout values to 0 in the model config files before running the script and add self.model.train() before every step (for example, at this line). In this way, you avoid dropout while still running backward in training mode.
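A minimal sketch of that workaround, assuming a hypothetical training loop (`model`, `batch`, and `optimizer` stand in for the repo's actual objects):

```python
def training_step(model, batch, optimizer):
    # ZeRO-3 only supports backward in training mode, so switch to
    # train() right before each step. With every dropout probability
    # set to 0 in the model config, train() and eval() produce the
    # same forward pass, so the RL-style no-dropout behavior is kept.
    model.train()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```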
I was using your DeepSpeed config file for ZeRO-2 offload, which I changed to ZeRO-3 with parameter offload added. If I understand you correctly, optimizer offload with ZeRO-2 works, but parameter offload with ZeRO-3 is not yet implemented?
Which model config file exactly are you referring to that contains the dropout value?
I managed to run the llama7B_13B MiniLLM setup with ZeRO-2 on a single H100 GPU by reducing the max sequence length to 256. In your opinion, will this degrade performance significantly?
Thank you for your help!
> I was using your DeepSpeed config file for ZeRO-2 offload, which I changed to ZeRO-3 with parameter offload added. If I understand you correctly, optimizer offload with ZeRO-2 works, but parameter offload with ZeRO-3 is not yet implemented?
Yes.
The "model config file" refers to the huggingface config files in the checkpoint directory you load from.
I think reducing the max sequence length to 256 will mainly affect performance on long instructions/responses; on short instructions, performance will not be affected much.
Alright, thank you for your help!