
dpo_toxic's Introduction

Mechanistically Understanding DPO: Toxicity

This repository provides the models, data, and experiments used in the paper "A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity".

Models, Data

You can download the models and datasets used in our paper here.

Save the checkpoints under ./checkpoints and unzip the data files under ./data.
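One way to unpack the downloaded data, as a minimal sketch: it assumes the data files are ordinary `.zip` archives collected in a single download directory. The helper name `unzip_all` and the directory names are illustrative and not part of this repository.

```python
import zipfile
from pathlib import Path


def unzip_all(download_dir, data_dir="./data"):
    """Extract every .zip archive found in download_dir into data_dir.

    Returns the list of archive member names that were extracted.
    """
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
            extracted.extend(zf.namelist())
    return extracted


# Example call; "./downloads" is a placeholder for wherever the archives were saved:
# unzip_all("./downloads", "./data")
```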

Experiments

All of our experiments can be found under ./toxicity. To run interventions, see ./toxicity/eval_interventions/run_evaluations.py.

To re-create any of our figures, see ./toxicity/eval_interventions/figures.

Training DPO

To train your own DPO model:

cd toxicity/train_dpo
python train.py exp_name="[name of your experiment]"
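For orientation, the per-pair objective that DPO optimizes can be sketched as follows. This is the standard DPO loss written in plain Python, not the code in `train.py`; the function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions. In the toxicity setting, the "chosen" continuation would presumably be the non-toxic one and the "rejected" continuation the toxic one.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full continuation
    under the policy being trained or under the frozen reference model.
    """
    # Implicit rewards: how much more the policy favors each continuation
    # than the reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and reference agree, and it shrinks as the policy raises the chosen continuation's likelihood relative to the rejected one.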

How to Cite

If you find our work relevant, please cite it as follows:

@article{lee2024mechanistic,
  title={A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity},
  author={Lee, Andrew and Bai, Xiaoyan and Pres, Itamar and Wattenberg, Martin and Kummerfeld, Jonathan K and Mihalcea, Rada},
  journal={arXiv preprint arXiv:2401.01967},
  year={2024}
}

dpo_toxic's People

Contributors

ajyl


dpo_toxic's Issues

Unaligning DPO

I couldn't find the code in this repo for the DPO unaligning experiment described in Section 5.3 of the paper. I would highly appreciate it if you could share it.

Also, I may be misunderstanding this, but I wonder how 10x-ing the MLP_key vectors that promote toxic MLP_v neurons recovers toxic behaviour. Isn't dot(MLP_key vector, residual stream) already negative after DPO?
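The sign question can be made concrete with a toy calculation. The sketch below is not the paper's experiment: it assumes a single MLP neuron whose activation is GELU(dot(key, x)), uses the exact GELU rather than any particular model's approximation, and all names and numbers are illustrative. Scaling the key vector by 10 scales the pre-activation by 10, so the neuron fires more strongly only when the dot product is positive; when it is negative, the activation is pushed toward zero.

```python
import math


def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def neuron_activation(key, residual, scale=1.0):
    """Activation of one MLP neuron after scaling its key vector."""
    pre_activation = scale * sum(k * x for k, x in zip(key, residual))
    return gelu(pre_activation)


key = [1.0, 0.0]
aligned = [0.5, 0.2]    # dot(key, x) = +0.5
opposed = [-0.5, 0.2]   # dot(key, x) = -0.5

for name, x in [("positive dot", aligned), ("negative dot", opposed)]:
    # Print the activation before and after scaling the key by 10.
    print(name, neuron_activation(key, x), neuron_activation(key, x, scale=10.0))
```

Whether the dot product is in fact negative after DPO is an empirical question about the trained model, which this toy example does not settle.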

Missing configuration in DPO training

File "/root/hz/dpo_toxic/toxicity/train_dpo/trainers.py", line 463, in get_batch_metrics
    if loss_config.kl_gamma > 0:
omegaconf.errors.ConfigAttributeError: Key 'kl_gamma' is not in struct
    full_key: loss.kl_gamma
    object_type=dict

Hi, thanks for sharing your code. I tried to reproduce the DPO training experiment, but some configuration keys seem to be missing from dpo.yaml.

Probe.pt tensor and Mistral 7b Model Tensor Dimension Mismatch (1024 vs 4096)

I would like to express my deepest gratitude for your open source code. Your work is solid, and I deeply appreciate your contribution to the community.

I have been attempting to replicate your work using the Mistral model, but I've encountered a challenge related to tensor dimensions. Specifically, the probe.pt tensor you provided is of 1024 dimensions, whereas the Mistral model operates with tensors of 4096 dimensions. This discrepancy has led to an issue of incompatible tensor dimensions.

As the code for training probe.pt has not been made public, I was wondering if it would be possible for you to provide a version of probe.pt with 4096 dimensions? Such assistance would be very helpful to us and likely benefit others facing similar challenges.
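A linear probe has one weight per hidden dimension of the model it was trained on, so a 1024-dimensional probe cannot be applied to 4096-dimensional activations; a new probe has to be fit on the target model's hidden states. The following is a generic logistic-regression probe in plain Python, offered as a rough sketch only. It is not the authors' probe-training code, and the function names, hyperparameters, and toy data are all illustrative. In practice the feature vectors would be hidden states extracted from the target model, and the labels would mark toxic versus non-toxic text.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(weights, bias, features):
    """Probability assigned to the positive class by the linear probe."""
    if len(weights) != len(features):
        raise ValueError(
            f"probe has {len(weights)} dims, features have {len(features)}"
        )
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = probe_score(weights, bias, x) - y  # gradient of log loss
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy 2-dimensional "hidden states"; real ones would have the model's hidden size.
toy_x = [[1.0, 0.0], [2.0, 0.5], [-1.0, 0.0], [-2.0, -0.5]]
toy_y = [1, 1, 0, 0]
probe_w, probe_b = train_linear_probe(toy_x, toy_y)
```

The resulting weight vector always has the same length as the input features, which is why a probe fit on one model's activations does not transfer to a model with a different hidden size.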

DPO training

Thanks for this repo! I am wondering if there is code for running DPO on the pairwise toxic data.
