stanfordnlp / pyvene
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Home Page: http://pyvene.ai
License: Apache License 2.0
Description:
Currently, the library only works for transformer-based models. It does not work well for non-sequence models such as MLPs, or for other sequence-based models such as RNNs.
The first step toward supporting other model types could be to showcase how this library works for MLP models. The MLP model can be hand-crafted so that we know its counterfactual behaviors. We expect there will be hacks here and there to get things to work, but it will enable more model types.
A bug happened! My understanding of intervene_on_prompt=False is that it involves adding an intervention vector even when generating new tokens. However, when I tried setting intervene_on_prompt=False, I encountered a RuntimeError:
CUDA error: device-side assert triggered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging, consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Does this toolkit currently support intervene_on_prompt=False? Please correct me if I am wrong.
No response
Come to my office on Gates 3rd floor anytime!
IntervenableModel.save() doesn't save trained model parameters. This is an issue when you are also training, e.g., a classification head on top of the model and want to upload it to HF or save it locally. save_intervention() does this properly, though.
Description:
Adding tests for vanilla interventions on the GPT-2 model at different streams. This PR focuses on position-based interventions: it should cover a single position as well as multiple positions, and it should also cover different layers.
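A minimal sketch of one such position-based test, reusing the config API shown in the snippet later on this page; the layer index is illustrative, and the default intervention type is assumed to be vanilla (interchange):
import pyvene
from pyvene import (
    IntervenableRepresentationConfig, IntervenableConfig, IntervenableModel
)

_, tokenizer, gpt2 = pyvene.create_gpt2()

# vanilla intervention on layer 2's block output, at most one position
intervenable = IntervenableModel(
    intervenable_config=IntervenableConfig(
        intervenable_representations=[
            IntervenableRepresentationConfig(2, "block_output", "pos", 1),
        ],
    ),
    model=gpt2,
)

base = tokenizer("The capital of Spain is", return_tensors="pt")
sources = [tokenizer("The capital of Italy is", return_tensors="pt")]
# swap the representation at position 4 from source into base
_, counterfactual = intervenable(
    base, sources, {"sources->base": ([[[4]]], [[[4]]])}
)
assert counterfactual.last_hidden_state.shape[-1] == gpt2.config.n_embd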
Description:
Currently, we assume one source example is associated with one intervention. Often, we want to reuse the same source example for multiple interventions, where these interventions are grouped for a purpose.
We want to support the concept of "grouping" during intervention configuration, as well as for source inputs. With this, you can also flexibly skip some groups if you don't want them.
Changes need to be made in the common classes regarding the config, as well as in how the alignable class consumes the source inputs.
In the new version, self.embed_dim on line 262 of interventions.py is initialized as None and never assigned a value. This causes Boundless_DAS.ipynb to fail:
Traceback (most recent call last):
  File "/work/frink/sun.jiu/function_vectors/src/compute_rotational_subspace.py", line 300, in <module>
    intervenable = IntervenableModel(intervenable_config, model)
  File "/work/frink/sun.jiu/miniconda3/envs/fv/lib/python3.10/site-packages/pyvene/models/intervenable_base.py", line 111, in __init__
    intervention = intervention_function(
  File "/work/frink/sun.jiu/miniconda3/envs/fv/lib/python3.10/site-packages/pyvene/models/interventions.py", line 265, in __init__
    torch.arange(0, self.embed_dim), requires_grad=False
TypeError: arange() received an invalid combination of arguments - got (int, NoneType), but expected one of:
Simply changing the line to:
self.embed_dim = embed_dim
would solve the issue.
Description:
Currently, we have,
class AlignableConfig(PretrainedConfig):
    def __init__(
        self,
        alignable_model_type="gpt2",
        alignable_representations=[
            # we do distributed search over elements in the sublist.
            AlignableRepresentationConfig()
        ],
        alignable_interventions_type=VanillaIntervention,
        alignable_low_rank_dimension=None,
        mode="parallel",
        **kwargs
    ):
Currently, the intervention type must be specified as a class, not as an instance, and this causes trouble. It would be better to accept interventions as a list of actual instances, e.g., alignable_interventions = [VanillaIntervention()]. This would allow more specification for customizable interventions.
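Under the proposal, the config call would look something like this (a sketch, assuming the rest of the signature stays unchanged):
config = AlignableConfig(
    alignable_model_type="gpt2",
    alignable_representations=[AlignableRepresentationConfig()],
    # instances instead of a single class type, so each intervention
    # can carry its own settings
    alignable_interventions=[VanillaIntervention()],
)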
Hi, I am working with the newest version of the repo and got tutorial.ipynb to work. However, when I run run_alignment.py with the training script at the end of your README.md, I run into the following error when the model tries to save checkpoints of the rotation layer:
Traceback (most recent call last):
  File "/net/scratch/zhouy1/github/align-transformers-forked/run_alignment.py", line 183, in <module>
    aligner.train(
  File "/net/scratch/zhouy1/github/align-transformers-forked/trainer.py", line 222, in train
    self.save_model(output_dir, 'pytorch-rotate-best.bin')
  File "/net/scratch/zhouy1/github/align-transformers-forked/trainer.py", line 62, in save_model
    'rotate_layer': self.model.module.model.rotate_layer.state_dict(),
  File "/home/zhouy1/miniconda3/envs/BoundlessDAS/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AlignableLlamaForCausalLM' object has no attribute 'module'
Thanks!
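The .module attribute exists only when the model is wrapped in torch.nn.DataParallel or DistributedDataParallel. A hedged sketch of a guard inside save_model, with variable names following the traceback above:
# unwrap only when the model is actually wrapped (e.g., torch.nn.DataParallel)
inner = self.model.module if hasattr(self.model, "module") else self.model
torch.save(
    {"rotate_layer": inner.model.rotate_layer.state_dict()},
    os.path.join(output_dir, "pytorch-rotate-best.bin"),
)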
Description:
After training, the intervention's artifacts are stored in memory without a good way to save them to disk with other metadata or share them on the HuggingFace marketplace. This change will provide a smooth way of saving and sharing interventions trained by users.
The key thing will be serializing metadata into a shareable format (i.e., both serialization and deserialization need to be tested). It will still require the sharing parties to know the counterfactual dataset generation process, but that is less a problem of this library and more about sharing the dataset itself; dataset sharing could be a separate process not included in this library.
This change should also consider sharing interventions that contain a vector store (e.g., some truthful direction for steering).
Testing Done:
.Removing testing dir ./test_output_dir_prefix-d9080f
Removing testing dir ./test_output_dir_prefix-dff621
Removing testing dir ./test_output_dir_prefix-9227e2
Removing testing dir ./test_output_dir_prefix-6cb8c4
Removing testing dir ./test_output_dir_prefix-67cd73
----------------------------------------------------------------------
Ran 25 tests in 4.280s
OK
tutorials/basic_tutorials/Load_Save_and_Share_Interventions.ipynb
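A hedged sketch of the intended round trip, with method names assumed from the tutorial above rather than confirmed:
# save the trained interventions plus metadata to disk
intervenable.save(save_directory="./my_boundless_das")

# later, or on another machine: re-attach them to a fresh copy of the model
loaded = IntervenableModel.load("./my_boundless_das", model=gpt2)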
Descriptions:
Currently, all models need to have a config, and the config must inherit from the transformers library's config class. The model config is only used in,
intervention = intervention_function(
    embed_dim=get_dimension(
        get_internal_model_type(model), model.config, representation
    ), **other_metadata
)
to get the component's dimension.
We should, however, allow config-less models where the dimension is directly read off a config dict, or dynamically figured out using some helper functions, e.g.,
mlp_type_to_dimension_mapping = {
    "block_input": (32,),
    "block_output": (32,),
    "mlp_activation": (32,),
}
(This should also stay compatible with torch.compile.) Meanwhile, since the intervening component can accept an arbitrary model component string, e.g., model.h[2].attn.c_proj.output, we can dynamically figure the component out.
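A small sketch of such a helper, using a hypothetical infer_component_dim that resolves a dotted component path with torch's get_submodule and reads the dimension off the module's parameters:
import torch.nn as nn

def infer_component_dim(model: nn.Module, component_path: str) -> int:
    """Hypothetical helper: resolve e.g. "h.2.attn.c_proj" on a config-less
    model and read the output dimension from the module itself."""
    module = model.get_submodule(component_path)
    if isinstance(module, nn.Linear):
        return module.out_features
    # fall back to the trailing dimension of the module's first parameter
    return next(module.parameters()).shape[-1]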
Descriptions:
Currently, we only support basic interventions during model generation, just like in the model forward call. This is not ideal. In model generation, we want to support more free-form interventions (e.g., intervening based on decoding steps or other decoding parameters, not just unit locations as in an intervened forward pass).
The current infra (and this applies to other existing intervention libraries as well) cannot support this. For instance, it does not support intervening at a specific decoding step without invasive code changes. To support complex cases, we plan to introduce a new notion of an Intervention Scheduler.
At a high level, the scheduler is responsible for scheduling interventions dynamically at inference time, and it is customizable. For instance, we could intervene on (1) all decoded punctuation tokens, (2) all verbs that get decoded, or (3) the last entity token decoded from a specific entity set.
This enables a wide spectrum of ways to steer model behavior with interventions. This ticket may require multiple changes.
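A purely hypothetical sketch of what a scheduler hook could look like; nothing here is existing API:
class InterventionScheduler:
    """Hypothetical: decides at each decoding step whether to intervene."""

    def __init__(self, tokenizer, trigger_tokens=(".", ",", "!", "?")):
        self.tokenizer = tokenizer
        self.trigger_tokens = set(trigger_tokens)

    def should_intervene(self, step: int, token_id: int) -> bool:
        # example policy (1): intervene on all decoded punctuation tokens
        return self.tokenizer.decode([token_id]).strip() in self.trigger_tokens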
Descriptions:
A recent PR accidentally removed support for a couple of models. This PR adds them back. See the issue raised here: bf09440#diff-e5c418b4f46cb08393fe2f1c9698b2d20d5e51e746b6d3f092f05d4f9744dc98
Commonly, we want to exhaustively train DAS on every layer and position (or, e.g., every attention head in a layer) to find which ones are causally relevant for the model's computations. When dealing with a fixed dataset, we could speed this process up by caching and reusing activations. It is unclear what the best way to implement this is; it should already be possible to build a minimal example with CollectIntervention and the activations_sources= argument at inference time.
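A library-agnostic sketch of the caching idea using plain torch hooks; inside pyvene, CollectIntervention and activations_sources= would play these two roles:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
source_input, base_input = torch.randn(1, 8), torch.randn(1, 8)
cache = {}

# pass 1: run the source once and cache the activation (CollectIntervention)
handle = model[0].register_forward_hook(
    lambda mod, inp, out: cache.update(src=out.detach())
)
model(source_input)
handle.remove()

# pass 2..N of the sweep: patch the cached activation into the base run
# (activations_sources=) without re-running the source example
handle = model[0].register_forward_hook(lambda mod, inp, out: cache["src"])
counterfactual = model(base_input)
handle.remove()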
Receiving an error message when visiting the pyvene.ai website. The message says the site can't be reached at the moment; it's taking too long to respond.
Visit https://pyvene.ai/ in a browser.
Chrome: Version 123.0.6312.58 (Official Build) (64-bit)
OS: Win10 Pro for Workstations
Descriptions:
The library supports interventions on torch models, and interventions can be attached to any subcomponent of a torch model. However, for a model quantized with another wrapper, the intervention location can be more dynamic. See the quantized function here.
We want to support interventions on quantized models and intervene on the correct components specified in the configuration.
Descriptions:
Currently, the matrix exponential fails with the MPS framework on M1 chips for rotation-based interventions. We need to figure out another way to handle this so that M1 chips can run pyvene with trainable interventions like DAS as well.
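One candidate workaround, as an untested sketch: detour through the CPU just for the exponential and move the result back, which keeps gradients intact:
import torch

def matrix_exp_mps_safe(mat: torch.Tensor) -> torch.Tensor:
    # torch.matrix_exp is unsupported on MPS; fall back to CPU there
    if mat.device.type == "mps":
        return torch.matrix_exp(mat.to("cpu")).to(mat.device)
    return torch.matrix_exp(mat)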
Hi,
I am exploring the repo and using Boundless DAS. Is there a way to save and load the interventions? Thanks!
Description:
When using the hook, we can now support kwargs-based inputs by reading the input as a dictionary. However, we always assume the dictionary contains only a single input (e.g., hidden representations). This assumption can easily go wrong. What we should do instead is specify, in the model config, which part of the inputs we intervene on.
Note that this will still result in code coupled with the Transformers library. Multiple PRs are required to move in this direction.
Description:
Currently, we don't have a systematic way of unit testing the library. A good approach is to create some hand-crafted models (i.e., with fixed weights), do interventions, and check the counterfactual behaviors.
Probably not hand-crafted transformers, but simple MLPs with 3 hidden states, for instance.
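For instance, a hand-crafted MLP with identity weights makes the counterfactual prediction exact; a sketch of the kind of unit test meant here:
import torch
import torch.nn as nn

# identity-weight MLP: the hidden state equals the input, so swapping the
# hidden state of `base` with that of `source` must reproduce source's output
mlp = nn.Sequential(nn.Linear(3, 3, bias=False), nn.Linear(3, 3, bias=False))
with torch.no_grad():
    for layer in mlp:
        layer.weight.copy_(torch.eye(3))

base = torch.tensor([1.0, 0.0, 0.0])
source = torch.tensor([0.0, 1.0, 0.0])
hidden_source = mlp[0](source)          # collect source's hidden state
counterfactual = mlp[1](hidden_source)  # interchange intervention on base
assert torch.allclose(counterfactual, mlp(source))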
Description:
Currently, we assume model weights are frozen when training interventions for alignment. We can also add support so that models can be tuned along with the intervention.
This can help reproduce the interchange intervention training experiments in this paper, or reproduce experiments with the causal proxy model (i.e., using another explainer model to explain a black-box model).
Description:
Right now everything is in the util file, including the common model configs, imports, and various hook helpers. It would be better to separate them into smaller files to increase readability and extensibility.
Could you maybe add "LlamaForCausalLM" here? (It's not captured by "LLaMAForCausalLM".)
My alpaca weights, recovered as guided in https://github.com/tatsu-lab/stanford_alpaca, have the following config for some reason:
LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.29.2",
  "use_cache": true,
  "vocab_size": 32001
}
BTW, thanks for this awesome project!
Descriptions:
We will move away from the concept of "align" soon, in light of future releases of this library. We thus change "alignable" to "intervenable" in all occurrences.
Testing Done:
.Removing testing dir ./test_output_dir_prefix-185ec0
Removing testing dir ./test_output_dir_prefix-f2e887
Removing testing dir ./test_output_dir_prefix-c7b84c
Removing testing dir ./test_output_dir_prefix-9e7acd
Removing testing dir ./test_output_dir_prefix-762318
----------------------------------------------------------------------
Ran 25 tests in 8.546s
OK
zhengxuanwu@DNa8211a3 align-transformers % grep -Ri "alignable" .
Binary file ./.git/objects/pack/pack-96bfa18d843e36c6082333ae210e2c8c92ce2bbd.pack matches
Descriptions:
Intervening on activations at inference time to steer model behavior is a good application of this library and fits its ultimate goal well. Ideally, people should be able to easily share their steering mounting points along with the injected vectors.
Original GitHub:
https://github.com/likenneth/honest_llama
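A hedged sketch of the kind of steering intervention meant here, in the spirit of honest_llama's truthful directions; this is illustrative and not the library's intervention interface:
import torch
import torch.nn as nn

class SteeringIntervention(nn.Module):
    """Illustrative: add a fixed direction to activations at inference."""

    def __init__(self, direction: torch.Tensor, alpha: float = 5.0):
        super().__init__()
        self.register_buffer("direction", direction / direction.norm())
        self.alpha = alpha

    def forward(self, base: torch.Tensor) -> torch.Tensor:
        # shift the hidden representation along the steering direction
        return base + self.alpha * self.direction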
Descriptions:
Ideally, all the models listed here can be supported by this library without exposing model details to its users.
This requires setting up model folders for all model types and writing config metadata for each of them, annotating where to do interventions. This requires a lot of effort; this PR tracks the progress towards supporting as many as we can.
Each model should take less than an hour to (1) configure and (2) write simple unit tests for.
Here is the list of models that are in the pipeline to support (in order):
Need to make sure the HF Trainer can easily hook onto pv.IntervenableModel. This means getting the right parameters for the optimizer, making sure model state is set correctly, etc.
Descriptions:
MLPConfig(num_labels=3).num_labels prints 2; the number of labels has not changed. This is probably because num_labels is a default field in the huggingface pretrained model config and thus overwrites the user's input. We need to rename this field to avoid that.
Descriptions:
The Mistral model shares its architecture with other models, e.g., GPT-2 and also Mixtral. Supporting Mistral along with unit tests will help us support other models in this family.
Descriptions:
Currently, only basic integration tests around modules are included. It would be best if individual functions were tested in a unit test file separate from the integration tests.
Descriptions:
Specifically, focus on
Descriptions:
The library is built to support flexible intervention schemes, which comes at the cost of onboarding ease. To make the library easier for simpler use cases, we want to support more fields in the intervenable config for "more static" intervention schemes.
For instance, suppose that for all examples we want to intervene on the 4th token of the input at the 6th layer's transformer block output. With this, we wouldn't need to provide intervention locations at runtime; we could simply provide a pair of base and source examples.
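A hypothetical sketch of such a static field; the keyword argument below does not exist yet and is exactly what this ticket proposes:
# hypothetical future config: the location is fixed once, at config time
config = IntervenableConfig(
    intervenable_representations=[
        IntervenableRepresentationConfig(6, "block_output", "pos", 1),
    ],
    # proposed field: always swap the 4th token, for every example
    intervenable_unit_locations={"sources->base": ([[[4]]], [[[4]]])},
)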
Descriptions:
Currently, we support only a limited intervention use case on stateful models such as GRUs. For instance, after an intervention, although the causal effect on the intervened site ripples through time, we assume the inputs stay the same as before the intervention. This is fine if the task setup doesn't care about the inputs, is input-agnostic when generating, or allows teacher forcing (forcing to discard) when generating.
Here are some illustrations. Right now, we can support cross-time interventions as,
example 1:
(hiddens) h1, h2, h3, h4, h5, h6
(inputs)  x1, x2, x3, x4, x5, x6
              ^
              |   h3' from example 2 is swapped into h2 of example 1
              v
example 2:
(hiddens) h1', h2', h3', h4', h5', h6'
(inputs)  x1', x2', x3', x4', x5', x6'
where we take h3' from the second example to intervene on h2 from the first example through time. We then also update h3 to h6 after the intervention. However, we assume x3 to x6 still use the inputs from example 1. This is acceptable if, during training, x3 to x6 are agnostic in terms of the model's generation (e.g., x2 is some trigger token, so the model is in generation mode).
However, this is not ideal. If we are dealing with autoregressive LMs, we want x3 to be the intervened model's output at the previous step. This requires the model to pass gradients through time. One simple solution is to update the model to use Gumbel-Softmax to softly select the token and pass it to the next time step as the input embedding.
The change may be only on the modeling side. We need to change the model to do soft token selection, which allows gradients. This is compatible with the library, since this input-based unrolling makes sense only in intervention mode.
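A small sketch of the Gumbel-Softmax step; this is standard torch functionality, not pyvene-specific, and the function name is illustrative:
import torch
import torch.nn.functional as F

def soft_next_input(logits: torch.Tensor, embedding: torch.nn.Embedding,
                    tau: float = 1.0) -> torch.Tensor:
    # differentiable "token selection": a soft one-hot over the vocabulary,
    # mixed into an input embedding for the next time step
    soft_onehot = F.gumbel_softmax(logits, tau=tau, hard=False)  # [vocab]
    return soft_onehot @ embedding.weight                        # [hidden]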
Description:
Currently, we are enabling interventions to take a new argument, subspaces, so that interventions can intervene only on selected subspaces.
Right now, the subspace is partitioned into different chunks during configuration. This makes interventions that work at the discrete neuron-level subspace hard.
We need to change the intervention to accept subspaces as a list of dimensions, e.g., [[1,3,5]], meaning we are intervening on the 1st, 3rd, and 5th neurons.
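In plain torch, the intended semantics are just fancy indexing on the last dimension:
import torch

base, source = torch.zeros(2, 8), torch.ones(2, 8)
subspace = [1, 3, 5]                         # selected neuron indices
base[..., subspace] = source[..., subspace]  # interchange only those neurons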
A lot of the old tutorials use the old API for representation configs and interventions; some of them might even be broken because of this. We should upgrade all of them to the new format to showcase how the library is designed to be used. Thanks to @smejak for pointing this out.
Edit: Specifically, the intro DAS tutorial isn't working in Colab; the intervened forward pass is failing. Will debug this later today.
Descriptions:
Specifically, focus only on these 3 functions:
gather_neurons
scatter_neurons
output_to_subcomponent
Hello,
I was wondering what exactly the counterfactual sampling procedure in lower_bound_alignment_example_sampler does. Do the base and counterfactual labels have to be different, or can they be the same? For example, for a counterfactual label like "No", do we only want to sample base and source amounts such that the base label "Yes" is changed to "No" after intervention, or can the base label also be "No"? The code seems to suggest the latter scenario.
In that case, I put in-line comments on what seems like a potential bug. When base_source_regions is [2, 3], the base left and right boundary values are (Yes, Yes), and those of the source are (Yes, No). The base label is "Yes", but after intervening on the left boundary, it is still "Yes". Any clarification is much appreciated!
Description:
The training of interventions is done through backprop from the model loss. There are other use cases where the intervention itself could be supervised by other objectives, such as probes attached to the intervention site.
We can add a basic classification head on top of the intervention object so that additional gradients can be backpropagated through the probe to the intervention. This would allow the library to support basic probing experiments with desired class labels.
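A minimal sketch of the probe objective; dimensions and names are illustrative:
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, num_classes = 768, 2
probe = nn.Linear(hidden_dim, num_classes)

def probe_loss(site_activations: torch.Tensor, labels: torch.Tensor):
    # gradients flow through the probe back into trainable interventions
    return F.cross_entropy(probe(site_activations), labels)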
Description:
This PR includes only the minimum changes needed to make the library extensible to all types of RNN models. The key difference is that the hook function has to be stateful, i.e., it needs to be aware of its "step" when it is called. This ties back to the fact that RNN-based models, or transformer models when generating sequences, act like stateful models.
This change will ideally include a simple tutorial on how to intervene on a very simple RNN at any step. The input fields should stay the same, but the hook needs to keep extra fields in memory to do stateful interventions.
Pyvene is a library featuring interchange interventions. It frequently needs to process datasets that contain two sets of input_ids and (maybe) two sets of labels. When training with batched datasets, a collator issue arises: there is no existing collator that supports padding both sets of input_ids of different lengths at the same time.
Hugging Face transformers only pads the "input_ids" entries in the dataset. In addition, DataCollatorForSeq2Seq only pads "labels". So dataset entries like "source_input_ids" are not padded, which is problematic.
Adding a utility supporting this may help pyvene development in general.
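A minimal sketch of such a dual-field collator, assuming dataset entries carry "input_ids" and "source_input_ids" as in the issue above; the class name is illustrative:
from dataclasses import dataclass
from transformers import PreTrainedTokenizerBase

@dataclass
class InterchangeDataCollator:
    tokenizer: PreTrainedTokenizerBase

    def __call__(self, features):
        # pad base and source sequences independently, then merge
        base = self.tokenizer.pad(
            [{"input_ids": f["input_ids"]} for f in features],
            return_tensors="pt",
        )
        source = self.tokenizer.pad(
            [{"input_ids": f["source_input_ids"]} for f in features],
            return_tensors="pt",
        )
        return {
            "input_ids": base["input_ids"],
            "attention_mask": base["attention_mask"],
            "source_input_ids": source["input_ids"],
            "source_attention_mask": source["attention_mask"],
        }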
Description:
Currently, the library aims for flexibility in the inputs, as well as a small training batch size in case the intervention is trainable. For instance, we assume each example in a batch can have different intervention locations as well as different intervention subspaces, allowing more flexible configurations.
This is not desired when we have a large batch size and the intervention location does not change within the batch. Suppose we want to localize (a+b) in a simple NN that solves (a+b)*c, using DAS with a fixed dimensionality of 16; the intervention location stays the same. However, the current code actually does the intervention at the example level, not at the batch level. See,
for batch_i, locations in enumerate(unit_locations):
    tensor_input[
        batch_i, locations, start_index:end_index
    ] = replacing_tensor_input[batch_i]
this can be,
tensor_input[
    :, locations, start_index:end_index
] = replacing_tensor_input[:]
Similarly, the subspace intervention,
if subspaces is not None:
    for example_i in range(len(subspaces)):
        # render subspace as column indices
        sel_subspace_indices = []
        for subspace in subspaces[example_i]:
            sel_subspace_indices.extend(
                [
                    i for i in range(
                        subspace_partition[subspace][0],
                        subspace_partition[subspace][1]
                    )
                ]
            )
        if mode == "interchange":
            base[example_i, ..., sel_subspace_indices] = \
                source[example_i, ..., sel_subspace_indices]
        elif mode == "add":
            base[example_i, ..., sel_subspace_indices] += \
                source[example_i, ..., sel_subspace_indices]
        elif mode == "subtract":
            base[example_i, ..., sel_subspace_indices] -= \
                source[example_i, ..., sel_subspace_indices]
can be,
if subspaces is not None:
    if subspace_partition is None:
        sel_subspace_indices = subspaces[0]
    else:
        sel_subspace_indices = []
        for subspace in subspaces[0]:
            sel_subspace_indices.extend(
                [
                    i for i in range(
                        subspace_partition[subspace][0],
                        subspace_partition[subspace][1]
                    )
                ]
            )
    if mode == "interchange":
        base[..., sel_subspace_indices] = \
            source[..., sel_subspace_indices]
    elif mode == "add":
        base[..., sel_subspace_indices] += \
            source[..., sel_subspace_indices]
    elif mode == "subtract":
        base[..., sel_subspace_indices] -= \
            source[..., sel_subspace_indices]
else:
    base[..., :interchange_dim] = source[..., :interchange_dim]
We should add a use_fast flag to the alignable config, and do a validation check that fails fast during the forward call.
This PR tracks the use_fast effort for position-based as well as subspace-based interventions. It does not yet cover head-based or head+position-based interventions; those will be covered in a separate PR.
Testing Done:
In case multiple location tags are passed only the first one will be considered
testing stream: value_output with a single position
WARNING:root:Detected use_fast=True means the intervention location will be static within a batch.
In case multiple location tags are passed only the first one will be considered
.
----------------------------------------------------------------------
Ran 18 tests in 30.117s
OK
Description:
There is a clear use case for intervening on certain positions of selected heads (e.g., intervening on the x-th token in head y). For instance, we can see how different heads at each position handle information, or where there is an induction head on top of certain positions.
We probably need multiple changes to achieve this goal without bugs. This ticket marks the first step: we will support basic nested intervention locations. Specifically, we want the following capability,
_, counterfactual_outputs = alignable(
    base,
    sources,
    {"sources->base": ([
        [[[target_head]], [[pos_i]]]  # intervene w/ target_head's pos_i
    ], [
        [[[target_head]], [[pos_i]]]  # intervene on target_head's pos_i
    ])}
)
where target_head is a list of specific heads, and we want to intervene on a list of pos_i for each head in target_head. With this, we can intervene on the 3rd token representation of the 4th head.
Descriptions:
The first PR to enable use_fast (https://github.com/frankaging/align-transformers/issues/33) does not cover head-related interventions. We want to enable this for head index + position index as well, when used compositionally.
The following code needs to be changed:
https://github.com/frankaging/align-transformers/blob/main/models/modeling_utils.py#L464
if "head" in alignable_representation_type:
start_index = 0 if start_index is None else start_index
end_index = 0 if end_index is None else end_index
# head-based scattering
if alignable_unit in {"h.pos"}:
# we assume unit_locations is a tuple
for head_batch_i, head_locations in enumerate(unit_locations[0]):
for head_loc_i, head_loc in enumerate(head_locations):
for pos_loc_i, pos_loc in enumerate(unit_locations[1][head_batch_i]):
h_start_index = start_index+head_loc*attn_head_size
h_end_index = start_index+(head_loc+1)*attn_head_size
tensor_input[
head_batch_i, pos_loc, h_start_index:h_end_index
] = replacing_tensor_input[head_batch_i, head_loc_i, pos_loc_i] # [dh]
else:
for batch_i, locations in enumerate(unit_locations):
for loc_i, loc in enumerate(locations):
h_start_index = start_index+loc*attn_head_size
h_end_index = start_index+(loc+1)*attn_head_size
tensor_input[
batch_i, :, h_start_index:h_end_index
] = replacing_tensor_input[batch_i, loc_i] # [s, dh]
Descriptions:
The library is not tested for multi-GPU use cases. We assume the intervened model can be loaded onto a single GPU. This is not ideal for interventions on 70B models, for instance. We want to be able to load the model onto multiple GPUs using sharding.
Static interventions need to be attached to the right component on the right machine in case of model sharding. Trainable interventions need to be mapped onto the machine where the corresponding model component lives as well.
This could be a large task, but the first step is clear: try out static interventions (e.g., vanilla interventions) when models are loaded onto multiple GPUs at inference time.
Description:
Currently, in generate mode, the code only works if you intervene on the first prompt token and on every decoding token, given the activation caching mechanism of the huggingface library.
We want to provide more generic support: (1) intervene on different tokens in the prompt; (2) intervene at every decoding step.
Descriptions:
The IntervenableModel is a torch.nn.Module, so it can be used inside another torch model, or even a pipeline object (e.g., a Huggingface pipeline). Here is a quick code snippet,
import pyvene
from pyvene import IntervenableRepresentationConfig, IntervenableConfig, IntervenableModel

# provided wrapper for huggingface gpt2 model
_, tokenizer, gpt2 = pyvene.create_gpt2()

# turn gpt2 into intervenable_gpt2
intervenable_gpt2 = IntervenableModel(
    intervenable_config=IntervenableConfig(
        intervenable_representations=[
            IntervenableRepresentationConfig(
                0,             # intervening layer 0
                "mlp_output",  # intervening mlp output
                "pos",         # intervening based on positional indices of tokens
                1,             # maximally intervening one token
            ),
        ],
    ),
    model=gpt2,
)
import torch
import torch.nn as nn
from typing import List, Optional, Tuple, Union, Dict

class ModelWithIntervenables(nn.Module):
    def __init__(self):
        super(ModelWithIntervenables, self).__init__()
        self.intervenable_gpt2 = intervenable_gpt2
        self.relu = nn.ReLU()
        self.fc = nn.Linear(768, 1)
        # Your other downstream components go here

    def forward(
        self,
        base,
        sources: Optional[List] = None,
        unit_locations: Optional[Dict] = None,
        activations_sources: Optional[Dict] = None,
        subspaces: Optional[List] = None,
    ):
        _, counterfactual_x = self.intervenable_gpt2(
            base,
            sources,
            unit_locations,
            activations_sources,
            subspaces
        )
        counterfactual_x = counterfactual_x.last_hidden_state
        counterfactual_x = self.relu(counterfactual_x)
        counterfactual_x = self.fc(counterfactual_x)
        return counterfactual_x
and then you can run forward as usual,
model = ModelWithIntervenables()
base = tokenizer("The capital of Spain is", return_tensors="pt")
sources = [
    tokenizer("The capital of Italy is", return_tensors="pt"),
]
model(
    base, sources, {"sources->base": ([[[4]]], [[[4]]])}
)
which returns,
tensor([[[2.7027],
         [6.3036],
         [6.1785],
         [6.4302],
         [8.0921]]], grad_fn=<ViewBackward0>)
We should add support for training sparse autoencoders (Bricken et al., 2023; Cunningham et al., 2023). Could be cool as a way of obtaining a feature basis for interventions.
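A bare-bones sketch of the kind of sparse autoencoder meant here: an L1-penalized reconstruction of residual-stream activations, with illustrative hyperparameters:
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 4 * 768,
                 l1_coef: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coef = l1_coef

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(features)           # reconstruction
        loss = ((x_hat - x) ** 2).mean() + self.l1_coef * features.abs().mean()
        return x_hat, features, loss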
Description:
One limitation of the current codebase is that, for DAS or anything similar, we only support aligning one causal variable at a time. This undercuts the point of using DAS to find a single new basis in which orthogonal causal variables can be interpreted along different axes! Learning a separate basis for each causal variable breaks this core assumption.
We need a new type of intervention that supports multi-variable alignment. This is a special need for DAS and other basis-respecting interventions.
Impacted files may include the intervention and the alignable config, as well as the input fields needed to indicate the causal variable index.
Descriptions:
Currently, if only a base is provided, we just run the regular model forward without any interventions. We want to also support collecting activations when only a base is provided.
The activation collection is done using a no-op intervention. See: https://github.com/frankaging/align-transformers/issues/26