
automated-interpretability's Introduction

Automated interpretability

Code and tools

This repository contains code and tools associated with the "Language models can explain neurons in language models" paper, specifically:

  • Code for automatically generating, simulating, and scoring explanations of neuron behavior using the methodology described in the paper. See the neuron-explainer README for more information.

Note: If you run into errors of the form "Error: Could not find any credentials that grant access to storage account: 'openaipublic' and container: 'neuron-explainer'", you may be able to fix this by signing up for an Azure account and specifying the credentials as described in the error message.

  • A tool for viewing neuron activations and explanations, accessible here. See the neuron-viewer README for more information.

Public datasets

Together with this code, we're also releasing public datasets of GPT-2 XL neurons and explanations. Here's an overview of those datasets; a short loading sketch follows the list.

  • Neuron activations: az://openaipublic/neuron-explainer/data/collated-activations/{layer_index}/{neuron_index}.json
    • Tokenized text sequences and their activations for the neuron. We provide multiple sets of tokens and activations: top-activating sequences, random samples from several quantiles, and a completely random sample. We also provide some basic statistics for the activations.
    • Each file contains a JSON-formatted NeuronRecord dataclass.
  • Neuron explanations: az://openaipublic/neuron-explainer/data/explanations/{layer_index}/{neuron_index}.jsonl
    • Scored model-generated explanations of the behavior of the neuron, including simulation results.
    • Each file contains a JSON-formatted NeuronSimulationResults dataclass.
  • Related neurons: az://openaipublic/neuron-explainer/data/related-neurons/weight-based/{layer_index}/{neuron_index}.json
    • Lists of the upstream and downstream neurons with the most positive and negative connections (see below for definition).
    • Each file contains a JSON-formatted dataclass whose definition is not included in this repo.
  • Tokens with high average activations: az://openaipublic/neuron-explainer/data/related-tokens/activation-based/{layer_index}/{neuron_index}.json
    • Lists of tokens with the highest average activations for individual neurons, and their average activations.
    • Each file contains a JSON-formatted TokenLookupTableSummaryOfNeuron dataclass.
  • Tokens with large inbound and outbound weights: az://openaipublic/neuron-explainer/data/related-tokens/weight-based/{layer_index}/{neuron_index}.json
    • Lists of the most-positive and most-negative input and output tokens for individual neurons, as well as the associated weights (see below for definition).
    • Each file contains a JSON-formatted WeightBasedSummaryOfNeuron dataclass.
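
For illustration, here is a minimal sketch of fetching one of these files directly in Python. The layer and neuron indices are arbitrary examples, and the HTTPS URL is the public blob endpoint that the az:// paths above correspond to under blobfile's conventions.

import json
import urllib.request

# Hypothetical example coordinates; any valid layer/neuron pair works.
layer_index, neuron_index = 9, 876

# The az://openaipublic/neuron-explainer/... paths are served publicly over HTTPS,
# so plain urllib is enough for read-only access.
url = (
    "https://openaipublic.blob.core.windows.net/neuron-explainer/data/"
    f"collated-activations/{layer_index}/{neuron_index}.json"
)
with urllib.request.urlopen(url) as response:
    neuron_record = json.load(response)

# The JSON mirrors the NeuronRecord dataclass: token sequences with per-token
# activations, plus summary statistics.
print(list(neuron_record.keys()))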

Update (July 5, 2023): We also released a set of explanations for GPT-2 Small. The methodology is slightly different from the one used for GPT-2 XL, so the results aren't directly comparable.

  • Neuron activations: az://openaipublic/neuron-explainer/gpt2_small_data/collated-activations/{layer_index}/{neuron_index}.json
  • Neuron explanations: az://openaipublic/neuron-explainer/gpt2_small_data/explanations/{layer_index}/{neuron_index}.jsonl

Update (August 30, 2023): We recently discovered a bug in how we performed inference on the GPT-2 series models used for the paper and for these datasets. Specifically, we used an optimized GELU implementation rather than the original GELU implementation associated with GPT-2. While the model’s behavior is very similar across these two configurations, the post-MLP activation values we used to generate and simulate explanations differ from the correct values by the following amounts for GPT-2 small:

  • Median: 0.0090
  • 90th percentile: 0.0252
  • 99th percentile: 0.0839
  • 99.9th percentile: 0.1736
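
For reference, the two standard GELU formulations are sketched below. The update doesn't say which optimized variant was used, only that it differed from GPT-2's original tanh-based formulation, so treat this as an illustrative comparison rather than the exact code path behind the datasets.

import math

def gelu_gpt2(x: float) -> float:
    # The original GELU from the GPT-2 codebase: a tanh-based approximation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_erf(x: float) -> float:
    # The exact erf-based GELU, the other common formulation.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# The per-value difference is small, but it compounds through the network,
# which is what produces post-MLP discrepancies like those quantified above.
print(gelu_gpt2(2.0) - gelu_erf(2.0))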

Definition of connection weights

Refer to the GPT-2 model code for the weight conventions used below.

Neuron-neuron: For two neurons (l1, n1) and (l2, n2) with l1 < l2, the connection strength is defined as h{l1}.mlp.c_proj.w[:, n1, :] @ diag(h{l2}.ln_2.g) @ h{l2}.mlp.c_fc.w[:, :, n2].

Neuron-token: For token t and neuron (l, n), the input weight is computed as wte[t, :] @ diag(h{l}.ln_2.g) @ h{l}.mlp.c_fc.w[:, :, n] and the output weight is computed as h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :].
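
As an unofficial sketch of these definitions, the functions below use the Hugging Face transformers GPT-2 XL checkpoint. Mapping c_fc.w, c_proj.w, ln_2.g, ln_f.g, and wte onto the corresponding Hugging Face parameter names is our assumption, not something stated in this repository, so double-check the shapes before relying on the numbers.

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
h = model.transformer.h                        # transformer blocks
wte = model.transformer.wte.weight             # token embeddings, (vocab_size, d_model)
ln_f_g = model.transformer.ln_f.weight         # final layer-norm gain, (d_model,)

@torch.no_grad()
def neuron_neuron(l1: int, n1: int, l2: int, n2: int) -> float:
    # Connection strength between neurons (l1, n1) and (l2, n2), with l1 < l2.
    out_w = h[l1].mlp.c_proj.weight[n1, :]     # output weights of neuron n1, (d_model,)
    gain = h[l2].ln_2.weight                   # pre-MLP layer-norm gain in layer l2
    in_w = h[l2].mlp.c_fc.weight[:, n2]        # input weights of neuron n2, (d_model,)
    return float(out_w @ torch.diag(gain) @ in_w)

@torch.no_grad()
def token_input_weight(t: int, l: int, n: int) -> float:
    # Input weight from token t into neuron (l, n).
    return float(wte[t, :] @ torch.diag(h[l].ln_2.weight) @ h[l].mlp.c_fc.weight[:, n])

@torch.no_grad()
def token_output_weight(t: int, l: int, n: int) -> float:
    # Output weight from neuron (l, n) onto token t.
    return float(h[l].mlp.c_proj.weight[n, :] @ torch.diag(ln_f_g) @ wte[t, :])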

Misc Lists of Interesting Neurons

Lists of neurons we thought were interesting according to different criteria, with some preliminary descriptions.

automated-interpretability's People

Contributors

henktillman, hijohnnylin, jsoref, m-izadmehr, nicholasdow, stevenbills, williamrs-openai, wuthefwasthat


automated-interpretability's Issues

Text-davinci-003 deprecated

The simulator model text-davinci-003 is now deprecated and the other models (babbage-002 and ada-002) are super unreliable. Is there a workaround for this?

No Azure credentials were found

Error: Could not find any credentials that grant access to storage account: 'openaipublic' and container: 'neuron-explainer'
Access Failure: message=Could not access container, request=, status=404, error=ResourceNotFound, error_description=The specified resource does not exist.
RequestId:ef3dbd4b-701e-00d0-03cc-87e199000000
Time:2023-05-16T07:59:35.0263720Z, error_headers=Content-Length: 223, Content-Type: application/xml, Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0, x-ms-request-id: ef3dbd4b-701e-00d0-03cc-87e199000000, x-ms-version: 2019-02-02, x-ms-error-code: ResourceNotFound, Date: Tue, 16 May 2023 07:59:34 GMT

No Azure credentials were found. If the container is not marked as public, please do one of the following:

  • Log in with 'az login', blobfile will use your default credentials to lookup your storage account key
  • Set the environment variable 'AZURE_STORAGE_KEY' to your storage account key which you can find by following this guide: https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage
  • Create an account with 'az ad sp create-for-rbac --name ' and set the 'AZURE_APPLICATION_CREDENTIALS' environment variable to the path of the output from that command or individually set the 'AZURE_CLIENT_ID', 'AZURE_CLIENT_SECRET', and 'AZURE_TENANT_ID' environment variables

About the 'logprobs' in the response object

Hi,

I find that the simulator will postprocess the response of 'text-davinci-003' using the response field 'logprobs'.

However, as I read the OpenAI documentation, the 'logprobs' field is going to be deprecated as the completion response object is replaced by the chat completion object, and the model 'text-davinci-003' is also being deprecated.

I now have access to gpt-4 and gpt-3.5-turbo with the chat completion response object. Is there any way to run the neuron-explainer using these two models, i.e. without the 'logprobs' field?

Or is it necessary to call 'text-davinci-003' and other models that use the completion response object and return 'logprobs'?

Thanks a lot!

missing data

@diziet noticed explanations/31/1593.jsonl was missing at #4

I believe this is actually the only data missing (at least from explanations), but I will keep a list of missing data in this issue (someone will probably figure out all of it once they try to do things programmatically).

We are not likely to fix any of these issues.

Problem about activation calculation

I would like to know how neuron activations are calculated and how to map them to each input token. If you can point me to related work on calculating neuron activations, I would be very grateful.

Dataset for neuron activation

Hi.

I am wondering where I can get all the random samples you used to calculate the activations, instead of opening tens of thousands of JSON files to get them. I am trying to use the same random samples on other LLMs.

Thanks.

Not possible to read from public Azure blobs without authentication

It's such a cool project, great work 👍

While trying to run it, I noticed that in the neuron-viewer/python/server.py file, blobfile is used to get the JSON from Azure.

However, if you are not logged in to Azure, it will throw:

  File "/opt/homebrew/lib/python3.10/site-packages/blobfile/_azure.py", line 797, in _get_access_token
    raise Error(msg)
blobfile._common.Error: Could not find any credentials that grant access to storage account: 'openaipublic' and container: 'neuron-explainer'
    Access Failure: message=Could not access container, request=<Request method=GET url=https://openaipublic.blob.core.windows.net/neuron-explainer params={'restype': 'container', 'comp': 'list', 'maxresults': '1'}>, status=404, error=ResourceNotFound, error_description=The specified resource does not exist.
RequestId:ea50e029-201e-00bf-04d4-82eb6a000000

It seems to be related to blobfile/blobfile#118, and you have to log in to Azure to use this repository. I've created a small PR to solve this issue in #2.

Requires Python >=3.9 instead of the 3.7 specified in setup.py

Running demos/generate_and_score_explanation.ipynb with Python 3.8 gives a type error due to the type hints used:

TypeError Traceback (most recent call last)
Input In [2], in
1 import os
3 os.environ["OPENAI_API_KEY"] = "put-key-here"
----> 5 from neuron_explainer.activations.activation_records import calculate_max_activation
6 from neuron_explainer.activations.activations import ActivationRecordSliceParams, load_neuron
7 from neuron_explainer.explanations.calibrated_simulator import UncalibratedNeuronSimulator

File ~/interpretability/automated-interpretability/neuron-explainer/neuron_explainer/activations/activation_records.py:6, in
3 import math
4 from typing import Optional, Sequence
----> 6 from neuron_explainer.activations.activations import ActivationRecord
8 UNKNOWN_ACTIVATION_STRING = "unknown"
11 def relu(x: float) -> float:

File ~/interpretability/automated-interpretability/neuron-explainer/neuron_explainer/activations/activations.py:36, in
31 neuron_index: int
32 """The neuron's index within in its layer. Indices start from 0 in each layer."""
35 def _check_slices(
---> 36 slices_by_split: dict[str, slice],
37 expected_num_values: int,
38 ) -> None:
39 """Assert that the slices are disjoint and fully cover the intended range."""
40 indices = set()

TypeError: 'type' object is not subscriptable
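
One possible workaround, if upgrading to Python 3.9+ is not an option, is to defer annotation evaluation in the affected modules. This is an assumption based on the traceback above (built-in generics such as dict[str, slice] only work as runtime-evaluated annotations on 3.9+), not a fix confirmed by the maintainers, and it will not help if the package inspects annotations at runtime.

# Added as the first import at the top of
# neuron_explainer/activations/activations.py (and any other module that
# annotates with built-in generics). With deferred evaluation the annotations
# are stored as strings and never executed on import, so Python 3.8 can load
# the module.
from __future__ import annotations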

More unified dataset

Hi!

Spectacular work here folks. Is there any plan to release a more unified dataset, i.e. rather than having to request every neuron on every layer, a single monolithic file that could be, say, indexed in a database for searchability?

This would be very useful for guiding alignment efforts and generic research on how GPT's internal ontology works, e.g. loading the data into Neo4j and applying some good old-fashioned graph-theory number crunching to try to work out what's up with the nodes GPT-4 couldn't make heads or tails of (e.g. are they part of the deep structure of its linguistic thinking, are they secondary nodes to superpositions, etc.). My intuition tells me these are solvable.

explain_puzzles.ipynb - You didn't provide an API key

When I try to run explain_puzzles.ipynb, it tells me I didn't provide an API key. But the API key is already set in os.environ["OPENAI_API_KEY"], since I added it to both ~/.zshrc and ~/.bash_profile as suggested here: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
I have also tried removing the key from those files and manually setting it in the notebook as suggested, and the error persists.

Output:

puzzle_name='colors'
{'error': {'message': 'The model: `gpt-4` does not exist', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

and this error:

HTTPStatusError: Client error '404 Not Found' for url 'https://api.openai.com/v1/chat/completions'
For more information check: https://httpstatuses.com/404

That URL displays the following error:

"You didn't provide an API key. You need to provide your API key in an Authorization header using Bearer auth (i.e. Authorization: Bearer YOUR_KEY), or as the password field (with blank username) if you're accessing the API from your browser and are prompted for a username and password. You can obtain an API key from https://platform.openai.com/account/api-keys."

I am getting this error whether I run the notebook with JupyterLab or Jupyter Notebook.

Code for revising explanations

Hi there,

Thanks for such great work on interpretability!

I need help finding the code for revising explanations. Is it included in the code base?

About Direction Finding

Dear authors, do you plan to open source the “Finding explainable directions” part of the code in the future? Thanks.

Installing neuron-explainer doesn't seem to work.

If I run pip install "git+https://github.com/openai/automated-interpretability.git#subdirectory=neuron-explainer", I understand that this should install the repo as part of my environment, but it only installs a select few of the required files. Is there something wrong with the setup.py?

# installed-files.txt
../neuron_explainer/__init__.py
../neuron_explainer/__pycache__/__init__.cpython-39.pyc
../neuron_explainer/__pycache__/api_client.cpython-39.pyc
../neuron_explainer/api_client.py
PKG-INFO
SOURCES.txt
dependency_links.txt
requires.txt
top_level.txt
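
A possible workaround, assuming the published setup.py simply fails to pick up all subpackages, is to clone the repository and install the neuron-explainer directory in editable mode (pip install -e neuron-explainer) so the full source tree is used directly; this has not been confirmed by the maintainers.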
