Git Product home page Git Product logo

sycophancysteering's Introduction

Sycophancy Activation Steering

Setup

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Then create a .env file with the following variables (see .env.example):

HF_TOKEN=huggingface_token_with_access_to_llama2
CLAUDE_API_KEY=api_key_for_claude (optional, only needed for LLM-enabled eval)

Available commands

# Format datasets for generating steering vector and testing effect
python make_datasets.py --generate_test_split 0.8 --anthropic_custom_split 0.6 --n_datapoints 1200
# Generate steering vectors and optionally save full activations
python generate_vectors.py --layers 15 20 25 --save_activations
# Optionally, plot projected activations
python plot_activations.py --activations_pos_file activations/activations_pos_15_Llama-2-7b-hf.pt --activations_neg_file activations/activations_neg_15_Llama-2-7b-hf.pt --fname activations_proj_15_Llama-2-7b-hf.png --title "Activations layer 15"
# Apply steering vectors to model and test effect (--type can by one of "in_distribution", "out_of_distribution", "truthful_qa"), (--few_shot can be one of "positive", "negative", "unbiased", "none")
python prompting_with_steering.py --type in_distribution --layers 15 20 25 --multipliers -1.5 -1 0 1 1.5 --few_shot positive
# Plot results of activation steering experiment (use same arguments as the experiment)
python plot_results.py --type in_distribution --layers 15 20 25 --multipliers -1.5 -1 0 1 1.5 --few_shot positive

Full replicable experiments

Scripts that can be run to replicate the experiments are in the scripts/ folder. Run ./scripts/generate_all_vectors.sh first to generate steering vectors.

Analysis / charts

analysis/ contains scripts for Claude-enabled eval of out-of-distribution steering

Running tests

I have added a few unit tests for some of the utility functions. To run them, simply run:

pytest

Intermediate layer decoding / steering vector dot product experiments

See /activation_steering_interp.ipynb

sycophancysteering's People

Contributors

nrimsky avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.