
automated-interpretability's Introduction

Automated interpretability

Code and tools

This repository contains code and tools associated with the "Language models can explain neurons in language models" paper, specifically:

  • Code for automatically generating, simulating, and scoring explanations of neuron behavior using the methodology described in the paper. See the neuron-explainer README for more information.

Note: If you run into errors of the form "Error: Could not find any credentials that grant access to storage account: 'openaipublic' and container: 'neuron-explainer'", you may be able to fix this by signing up for an Azure account and specifying the credentials as described in the error message.

  • A tool for viewing neuron activations and explanations, accessible here. See the neuron-viewer README for more information.

Public datasets

Together with this code, we're also releasing public datasets of GPT-2 XL neurons and explanations. Here's an overview of those datasets; a short loading sketch follows the list.

  • Neuron activations: az://openaipublic/neuron-explainer/data/collated-activations/{layer_index}/{neuron_index}.json
    • Tokenized text sequences and their activations for the neuron. We provide multiple sets of tokens and activations: top-activating sequences, random samples from several quantiles, and a completely random sample. We also provide some basic statistics for the activations.
    • Each file contains a JSON-formatted NeuronRecord dataclass.
  • Neuron explanations: az://openaipublic/neuron-explainer/data/explanations/{layer_index}/{neuron_index}.jsonl
    • Scored model-generated explanations of the behavior of the neuron, including simulation results.
    • Each file contains a JSON-formatted NeuronSimulationResults dataclass.
  • Related neurons: az://openaipublic/neuron-explainer/data/related-neurons/weight-based/{layer_index}/{neuron_index}.json
    • Lists of the upstream and downstream neurons with the most positive and negative connections (see below for definition).
    • Each file contains a JSON-formatted dataclass whose definition is not included in this repo.
  • Tokens with high average activations: az://openaipublic/neuron-explainer/data/related-tokens/activation-based/{layer_index}/{neuron_index}.json
    • Lists of tokens with the highest average activations for individual neurons, and their average activations.
    • Each file contains a JSON-formatted TokenLookupTableSummaryOfNeuron dataclass.
  • Tokens with large inbound and outbound weights: az://openaipublic/neuron-explainer/data/related-tokens/weight-based/{layer_index}/{neuron_index}.json
    • Lists of the most-positive and most-negative input and output tokens for individual neurons, as well as the associated weights (see below for definition).
    • Each file contains a JSON-formatted WeightBasedSummaryOfNeuron dataclass.
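
For illustration, here is a minimal sketch of fetching one of these files directly in Python. The layer and neuron indices are arbitrary examples, and the HTTPS URL is the public blob endpoint that the az:// paths above correspond to under blobfile's conventions.

import json
import urllib.request

# Hypothetical example coordinates; any valid layer/neuron pair works.
layer_index, neuron_index = 9, 876

# The az://openaipublic/neuron-explainer/... paths are served publicly over HTTPS,
# so plain urllib is enough for read-only access.
url = (
    "https://openaipublic.blob.core.windows.net/neuron-explainer/data/"
    f"collated-activations/{layer_index}/{neuron_index}.json"
)
with urllib.request.urlopen(url) as response:
    neuron_record = json.load(response)

# The JSON mirrors the NeuronRecord dataclass: token sequences with per-token
# activations, plus summary statistics.
print(list(neuron_record.keys()))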

Update (July 5, 2023): We also released a set of explanations for GPT-2 Small. The methodology is slightly different from the one used for GPT-2 XL, so the results aren't directly comparable.

  • Neuron activations: az://openaipublic/neuron-explainer/gpt2_small_data/collated-activations/{layer_index}/{neuron_index}.json
  • Neuron explanations: az://openaipublic/neuron-explainer/gpt2_small_data/explanations/{layer_index}/{neuron_index}.jsonl

Update (August 30, 2023): We recently discovered a bug in how we performed inference on the GPT-2 series models used for the paper and for these datasets. Specifically, we used an optimized GELU implementation rather than the original GELU implementation associated with GPT-2. While the model’s behavior is very similar across these two configurations, the post-MLP activation values we used to generate and simulate explanations differ from the correct values by the following amounts for GPT-2 small:

  • Median: 0.0090
  • 90th percentile: 0.0252
  • 99th percentile: 0.0839
  • 99.9th percentile: 0.1736
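
For reference, the two standard GELU formulations are sketched below. The update doesn't say which optimized variant was used, only that it differed from GPT-2's original tanh-based formulation, so treat this as an illustrative comparison rather than the exact code path behind the datasets.

import math

def gelu_gpt2(x: float) -> float:
    # The original GELU from the GPT-2 codebase: a tanh-based approximation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_erf(x: float) -> float:
    # The exact erf-based GELU, the other common formulation.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# The per-value difference is small, but it compounds through the network,
# which is what produces post-MLP discrepancies like those quantified above.
print(gelu_gpt2(2.0) - gelu_erf(2.0))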

Definition of connection weights

Refer to the GPT-2 model code for the weight conventions used below.

Neuron-neuron: For two neurons (l1, n1) and (l2, n2) with l1 < l2, the connection strength is defined as h{l1}.mlp.c_proj.w[:, n1, :] @ diag(h{l2}.ln_2.g) @ h{l2}.mlp.c_fc.w[:, :, n2].

Neuron-token: For token t and neuron (l, n), the input weight is computed as wte[t, :] @ diag(h{l}.ln_2.g) @ h{l}.mlp.c_fc.w[:, :, n] and the output weight is computed as h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :].
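
As an unofficial sketch of these definitions, the functions below use the Hugging Face transformers GPT-2 XL checkpoint. Mapping c_fc.w, c_proj.w, ln_2.g, ln_f.g, and wte onto the corresponding Hugging Face parameter names is our assumption, not something stated in this repository, so double-check the shapes before relying on the numbers.

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
h = model.transformer.h                        # transformer blocks
wte = model.transformer.wte.weight             # token embeddings, (vocab_size, d_model)
ln_f_g = model.transformer.ln_f.weight         # final layer-norm gain, (d_model,)

@torch.no_grad()
def neuron_neuron(l1: int, n1: int, l2: int, n2: int) -> float:
    # Connection strength between neurons (l1, n1) and (l2, n2), with l1 < l2.
    out_w = h[l1].mlp.c_proj.weight[n1, :]     # output weights of neuron n1, (d_model,)
    gain = h[l2].ln_2.weight                   # pre-MLP layer-norm gain in layer l2
    in_w = h[l2].mlp.c_fc.weight[:, n2]        # input weights of neuron n2, (d_model,)
    return float(out_w @ torch.diag(gain) @ in_w)

@torch.no_grad()
def token_input_weight(t: int, l: int, n: int) -> float:
    # Input weight from token t into neuron (l, n).
    return float(wte[t, :] @ torch.diag(h[l].ln_2.weight) @ h[l].mlp.c_fc.weight[:, n])

@torch.no_grad()
def token_output_weight(t: int, l: int, n: int) -> float:
    # Output weight from neuron (l, n) onto token t.
    return float(h[l].mlp.c_proj.weight[n, :] @ torch.diag(ln_f_g) @ wte[t, :])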

Misc Lists of Interesting Neurons

Lists of neurons we thought were interesting according to different criteria, with some preliminary descriptions.

automated-interpretability's People

Contributors

henktillman, hijohnnylin, jsoref, m-izadmehr, nicholasdow, stevenbills, williamrs-openai, wuthefwasthat


automated-interpretability's Issues

Text-davinci-003 deprecated

The simulator model text-davinci-003 is now deprecated and the other models (babbage-002 and ada-002) are super unreliable. Is there a workaround for this?

No Azure credentials were found

Error: Could not find any credentials that grant access to storage account: 'openaipublic' and container: 'neuron-explainer'
Access Failure: message=Could not access container, request=, status=404, error=ResourceNotFound, error_description=The specified resource does not exist.
RequestId:ef3dbd4b-701e-00d0-03cc-87e199000000
Time:2023-05-16T07:59:35.0263720Z, error_headers=Content-Length: 223, Content-Type: application/xml, Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0, x-ms-request-id: ef3dbd4b-701e-00d0-03cc-87e199000000, x-ms-version: 2019-02-02, x-ms-error-code: ResourceNotFound, Date: Tue, 16 May 2023 07:59:34 GMT

No Azure credentials were found. If the container is not marked as public, please do one of the following:

  • Log in with 'az login', blobfile will use your default credentials to lookup your storage account key
  • Set the environment variable 'AZURE_STORAGE_KEY' to your storage account key which you can find by following this guide: https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage
  • Create an account with 'az ad sp create-for-rbac --name ' and set the 'AZURE_APPLICATION_CREDENTIALS' environment variable to the path of the output from that command or individually set the 'AZURE_CLIENT_ID', 'AZURE_CLIENT_SECRET', and 'AZURE_TENANT_ID' environment variables

About the 'logprobs' in the response object

Hi,

I find that the simulator will postprocess the response of 'text-davinci-003' using the response field 'logprobs'.

However, as I read the OpenAI documentation, the 'logprobs' field is going to be deprecated as the completion response object is replaced by the chat completion object, and the model 'text-davinci-003' is also being deprecated.

I now have access to gpt-4 and gpt-3.5-turbo with the chat completion response object. Is there any way to run the neuron-explainer using these two models, i.e. without the 'logprobs' field?

Or is it necessary to call 'text-davinci-003' and other models that use the completion response object and return 'logprobs'?

Thanks a lot!

missing data

@diziet noticed explanations/31/1593.jsonl was missing at #4

I believe this is actually the only data missing (at least from explanations), but I will keep a list of missing data in this issue (someone will probably figure out all of it once they try to do things programmatically).

We are not likely to fix any of these issues.

Problem about activation calculation

I would like to know how neuron activations are calculated and how to map them to each input token. If you can point me to related work on calculating neuron activations, I would be very grateful.

Dataset for neuron activation

Hi.

I am wondering where I can get all the random samples you used to calculate the activations, instead of opening tens of thousands of JSON files to get them. I am trying to use the same random samples on other LLMs.

Thanks.

Not possible to read from public Azure blobs without authentication

It's such a cool project, great work 👍

While trying to run it, I noticed that in the neuron-viewer/python/server.py file, blobfile is used to get the JSON from Azure.

However, if you are not logged in to Azure, it will throw:

  File "/opt/homebrew/lib/python3.10/site-packages/blobfile/_azure.py", line 797, in _get_access_token
    raise Error(msg)
blobfile._common.Error: Could not find any credentials that grant access to storage account: 'openaipublic' and container: 'neuron-explainer'
    Access Failure: message=Could not access container, request=<Request method=GET url=https://openaipublic.blob.core.windows.net/neuron-explainer params={'restype': 'container', 'comp': 'list', 'maxresults': '1'}>, status=404, error=ResourceNotFound, error_description=The specified resource does not exist.
RequestId:ea50e029-201e-00bf-04d4-82eb6a000000

It seems to be related to blobfile/blobfile#118, and you have to log in to Azure to use this repository. I've created a small PR to solve this issue in #2.

Requires Python >=3.9 instead of the 3.7 specified in setup.py

Running demos/generate_and_score_explanation.ipynb with Python 3.8 gives a type error due to the type hints used:

TypeError Traceback (most recent call last)
Input In [2], in
1 import os
3 os.environ["OPENAI_API_KEY"] = "put-key-here"
----> 5 from neuron_explainer.activations.activation_records import calculate_max_activation
6 from neuron_explainer.activations.activations import ActivationRecordSliceParams, load_neuron
7 from neuron_explainer.explanations.calibrated_simulator import UncalibratedNeuronSimulator

File ~/interpretability/automated-interpretability/neuron-explainer/neuron_explainer/activations/activation_records.py:6, in
3 import math
4 from typing import Optional, Sequence
----> 6 from neuron_explainer.activations.activations import ActivationRecord
8 UNKNOWN_ACTIVATION_STRING = "unknown"
11 def relu(x: float) -> float:

File ~/interpretability/automated-interpretability/neuron-explainer/neuron_explainer/activations/activations.py:36, in
31 neuron_index: int
32 """The neuron's index within in its layer. Indices start from 0 in each layer."""
35 def _check_slices(
---> 36 slices_by_split: dict[str, slice],
37 expected_num_values: int,
38 ) -> None:
39 """Assert that the slices are disjoint and fully cover the intended range."""
40 indices = set()

TypeError: 'type' object is not subscriptable
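
One possible workaround, if upgrading to Python 3.9+ is not an option, is to defer annotation evaluation in the affected modules. This is an assumption based on the traceback above (built-in generics such as dict[str, slice] only work as runtime-evaluated annotations on 3.9+), not a fix confirmed by the maintainers, and it will not help if the package inspects annotations at runtime.

# Added as the first import at the top of
# neuron_explainer/activations/activations.py (and any other module that
# annotates with built-in generics). With deferred evaluation the annotations
# are stored as strings and never executed on import, so Python 3.8 can load
# the module.
from __future__ import annotations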

More unified dataset

Hi!

Spectacular work here folks. Is there any plan to release a more unified dataset, i.e. rather than having to request every neuron on every layer, a single monolithic file that could be, say, indexed in a database for searchability?

This would be very useful for guiding alignment efforts and generic research on how GPT's internal ontology works, e.g. loading the data into Neo4j and applying some good old-fashioned graph-theory number crunching to try to work out what's up with the nodes GPT-4 couldn't make heads or tails of (e.g. are they part of the deep structure of its linguistic thinking, are they secondary nodes to superpositions, etc.). My intuition tells me these are solvable.

explain_puzzles.ipynb - You didn't provide an API key

When I try to run explain_puzzles.ipynb, it tells me I didn't provide an API key. But the API key is already set in os.environ["OPENAI_API_KEY"], since I added it to both ~/.zshrc and ~/.bash_profile as suggested here: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
I have also tried removing the key from those files and manually setting it in the notebook as suggested, and the error persists.

Output:

puzzle_name='colors'
{'error': {'message': 'The model: `gpt-4` does not exist', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

and this error:

HTTPStatusError: Client error '404 Not Found' for url 'https://api.openai.com/v1/chat/completions'
For more information check: https://httpstatuses.com/404

That URL displays the following error:

"You didn't provide an API key. You need to provide your API key in an Authorization header using Bearer auth (i.e. Authorization: Bearer YOUR_KEY), or as the password field (with blank username) if you're accessing the API from your browser and are prompted for a username and password. You can obtain an API key from https://platform.openai.com/account/api-keys."

I am getting this error whether I run the notebook with JupyterLab or Jupyter Notebook.

Code for revising explanations

Hi there,

Thanks for such great work on interpretability!

I need help finding the code for revising explanations. Is it included in the code base?

About Direction Finding

Dear authors, do you plan to open source the “Finding explainable directions” part of the code in the future? Thanks.

Installing neuron-explainer doesn't seem to work.

If I run pip install "git+https://github.com/openai/automated-interpretability.git#subdirectory=neuron-explainer", I understand that this should install the repo as part of my environment, but it only installs a select few of the required files. Is there something wrong with the setup.py?

# installed-files.txt
../neuron_explainer/__init__.py
../neuron_explainer/__pycache__/__init__.cpython-39.pyc
../neuron_explainer/__pycache__/api_client.cpython-39.pyc
../neuron_explainer/api_client.py
PKG-INFO
SOURCES.txt
dependency_links.txt
requires.txt
top_level.txt
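
A possible workaround, assuming the published setup.py simply fails to pick up all subpackages, is to clone the repository and install the neuron-explainer directory in editable mode (pip install -e neuron-explainer) so the full source tree is used directly; this has not been confirmed by the maintainers.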
