Comments (8)
Did this occur on the default dataset we provided?
from openrlhf.
No, I use a customized dataset. For SFT, it works fine.
Please refer to https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/reward_dataset.py to organize the datasets.
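As a rough illustration of the preference-pair layout that dataset loaders of this kind consume, here is a minimal sketch. The key names (`prompt`, `chosen`, `rejected`) and the JSONL format are assumptions for illustration; check `reward_dataset.py` for the exact keys and templates your version expects.

```python
import json

# Hypothetical preference records: each pair carries a prompt, the
# preferred response ("chosen"), and the dispreferred one ("rejected").
# Key names here are illustrative; verify them against reward_dataset.py.
pairs = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France is a country in Europe.",
    },
]

# Write one JSON object per line (JSONL), a common input format.
with open("preference_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```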
Yes, I use a similar template for the customized dataset, and it works fine with the SFT method.
It works well on my side (acc_mean > 0.6):
Train step of epoch 0: 2%|_ | 229/12500 [03:19<2:52:35, 1.19it/s, preference_loss=0.763, chosen_reward=-1.03, reject_reward=-1.12, acc_mean=0.658, loss_mean=0.635]
Please use a low learning rate: --learning_rate 5e-7
This may not be relevant, but 0.6931 is ln 2, which means r(win) = r(lose) almost everywhere.
One probable reason is that the customized dataset is too hard, for example, y_w and y_l are too close and differ by only a few tokens. Another is that there are bugs in the dataset preparation, for example, y_w = y_l.
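The ln 2 observation follows directly from the standard DPO objective, -log(sigmoid(beta * margin)), where the margin is the difference between the chosen and rejected implicit rewards. A small sketch (beta = 0.1 is an illustrative value, not a setting taken from this thread):

```python
import math

def dpo_loss(margin: float, beta: float = 0.1) -> float:
    # DPO preference loss: -log(sigmoid(beta * margin)), where margin is
    # the chosen-minus-rejected difference of policy-vs-reference
    # log-ratios (the implicit rewards).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the model cannot separate chosen from rejected, the margin is 0
# and the loss sits at -log(0.5) = ln 2 ≈ 0.6931, exactly the plateau
# reported above.
print(round(dpo_loss(0.0), 4))  # 0.6931
```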
Hi, after I set lr=1e-7, I can see some loss change. However, the DPO loss drops sharply and approaches near 0. In this case, could it be that the positive and negative samples in the dataset are excessively dissimilar to each other?
There are a lot of possible reasons, such as bugs in processing. For example, if there is a space token at the beginning of every y_l but not at the beginning of any y_w, the model may quickly converge on that artifact without learning any meaningful signal.
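Degenerate pairs like these are easy to screen for before training. A minimal sanity check, assuming each pair is a dict with `chosen` and `rejected` response strings (these key names are an assumption for illustration):

```python
def check_pairs(pairs):
    """Flag degenerate preference pairs that can make the DPO loss
    collapse without the model learning anything meaningful."""
    issues = []
    for i, p in enumerate(pairs):
        w, l = p["chosen"], p["rejected"]
        if w == l:
            # y_w == y_l: the pair carries no preference signal at all.
            issues.append((i, "chosen == rejected"))
        if w.startswith((" ", "\t")) != l.startswith((" ", "\t")):
            # Systematic leading-whitespace mismatch: a spurious feature
            # the model can latch onto.
            issues.append((i, "leading-whitespace mismatch"))
        if w != l and w.strip() == l.strip():
            # The pair differs only by surrounding whitespace.
            issues.append((i, "differs only by whitespace"))
    return issues
```

Running this over the full dataset before a DPO run, and inspecting a few flagged examples by hand, catches the y_w = y_l and leading-space cases discussed above.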
Related Issues (20)
- Difference between `DeepSpeedEngine.save_checkpoint()` and `DeepSpeedStrategy.save_model()` HOT 2
- Model outputs after DPO are all meaningless symbols HOT 1
- Support training from breakpoint HOT 3
- llama3 70B DPO example script
- where is gradient_accumulation HOT 1
- Support RLOO HOT 1
- Does Train_PPO_llama_ray currently shard the Actor Model across different GPUs? HOT 4
- ConnectionRefusedError: [Errno 111] Connection refused HOT 5
- Question about packing HOT 2
- "right" padding hardcoded HOT 3
- Error while saving the model under 4bit lora HOT 2
- multinode ppo training extremely slow HOT 15
- Request Entity Too Large when using ray HOT 3
- GPU memory OOM during DPO training HOT 1
- Online DPO support HOT 4
- Feature: add DPO-P
- Zero stage 3 error HOT 1
- Performance of Iterative DPO? HOT 1
- Why multiplying rstd instead of dividing by rstd? HOT 1