ajyl / dpo_toxic


A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.

License: MIT License

Python 31.97% Jupyter Notebook 68.03%

dpo_toxic's Introduction

Mechanistically Understanding DPO: Toxicity

This repository provides the models, data, and experiments used in A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.

Models, Data

You can download the models and datasets used in our paper here.

Save the checkpoints under ./checkpoints and unzip the data files under ./data.
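
Before running any experiments, it can help to confirm the layout described above. The following is a minimal sketch and not part of the repository: the helper name check_layout is hypothetical, and it only checks that the two directories exist and are non-empty, since the individual file names are not listed here.

```python
from pathlib import Path


def check_layout(repo_root="."):
    """Return a list of problems with the expected ./checkpoints and ./data layout."""
    problems = []
    for name in ("checkpoints", "data"):
        directory = Path(repo_root) / name
        if not directory.is_dir():
            problems.append(f"missing directory: {directory}")
        elif not any(directory.iterdir()):
            problems.append(f"empty directory: {directory}")
    return problems


if __name__ == "__main__":
    # Run from the repository root; prints nothing when both directories look fine.
    for problem in check_layout():
        print(problem)
```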

Experiments

All of our experiments can be found under ./toxicity. To run interventions, see ./toxicity/eval_interventions/run_evaluations.py.

To re-create any of our figures, see ./toxicity/eval_interventions/figures.

Training DPO

To train your own DPO model:

cd toxicity/train_dpo
python train.py exp_name="[name of your experiment]"

How to Cite

If you find our work relevant, please cite it as follows:

@article{lee2024mechanistic,
  title={A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity},
  author={Lee, Andrew and Bai, Xiaoyan and Pres, Itamar and Wattenberg, Martin and Kummerfeld, Jonathan K and Mihalcea, Rada},
  journal={arXiv preprint arXiv:2401.01967},
  year={2024}
}


dpo_toxic's Issues

Un-aligning DPO

I couldn't find the code in this repo for the DPO un-aligning experiment described in section 5.3 of the paper. I would highly appreciate it if you could share it.

Also, maybe I'm misunderstanding, but I wonder how scaling the MLP_key vectors that promote toxic MLP_v neurons by 10x recovers toxic behaviour. Isn't dot(MLP_key vector, residual stream) already negative after DPO?
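
The arithmetic behind this question can be sketched as follows. This is only an illustration, not the paper's code: it assumes the MLP applies a GELU-style activation to dot(MLP_key vector, residual stream), and the pre-activation values below are made up.

```python
import math


def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def activation_after_scaling(pre_activation, scale=10.0):
    """Scaling the key vector by `scale` scales dot(key, residual) by the same factor."""
    return gelu(scale * pre_activation)


# Hypothetical pre-activations: one slightly positive, one negative.
for z in (0.2, -0.5):
    print(f"z={z:+.1f}  gelu(z)={gelu(z):+.4f}  gelu(10z)={activation_after_scaling(z):+.4f}")
```

Under these assumptions, the scaled activation is positive only if the dot product is still positive: a small positive pre-activation is amplified, whereas the negative one in this example is pushed toward zero activation.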

missing configuration in DPO training

File "/root/hz/dpo_toxic/toxicity/train_dpo/trainers.py", line 463, in get_batch_metrics
    if loss_config.kl_gamma > 0:
omegaconf.errors.ConfigAttributeError: Key 'kl_gamma' is not in struct
    full_key: loss.kl_gamma
    object_type=dict

Hi, thanks for sharing your code. I tried to reproduce the DPO training experiment, but some configuration keys (such as loss.kl_gamma above) seem to be missing from dpo.yaml.

Probe.pt tensor and Mistral 7b Model Tensor Dimension Mismatch (1024 vs 4096)

I would like to express my deepest gratitude for your open source code. Your work is solid, and I deeply appreciate your contribution to the community.

I have been attempting to replicate your work using the Mistral model, but I've run into a problem with tensor dimensions. Specifically, the probe.pt tensor you provided has 1024 dimensions, whereas the Mistral model operates on 4096-dimensional tensors, so the two are incompatible.

As the code for training probe.pt has not been made public, I was wondering if it would be possible for you to provide a version of probe.pt with 4096 dimensions? Such assistance would be very helpful to us and likely benefit others facing similar challenges.

shuffling the dataset

Hi, thanks for your contribution. In toxicity/train_dpo/pplm_dataset.py, should the call random.shuffle(file_data) be random.shuffle(data) instead?
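
A minimal sketch of why the variable matters, assuming (hypothetically, since the surrounding code is not shown here) that file_data holds the rows read from one file and data is the accumulated list that is actually used. random.shuffle works in place on the list it is given, so shuffling file_data leaves the order of data untouched.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

data = []
for file_data in ([1, 2, 3], [4, 5, 6]):  # stand-ins for the rows read per file
    data.extend(file_data)  # copies the elements into the accumulated list

before = list(data)
random.shuffle(file_data)  # shuffles only the last per-file list, in place
assert data == before      # the accumulated list is unchanged

random.shuffle(data)           # shuffles the list that is actually used
assert sorted(data) == before  # same elements, possibly in a new order
```

If the real code aliases the two names rather than copying, the behaviour differs, so this only illustrates the case the question describes.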

DPO training

Thanks for this repo! I am wondering whether there is code for running DPO on the pairwise toxic data.

Code for DPO leading to a drop in activations for the toxic vectors

I couldn't find the DPO code studied in your paper in this repo. Could you please share a link to it? I would highly appreciate it.

train_dpo/train.py toxicity module

Running train.py fails because the toxicity module is not imported correctly.
