jalammar / ecco

Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTa, T5, and T0).

Home Page: https://ecco.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 9.80% HTML 2.28% CSS 0.19% Jupyter Notebook 87.74%

nlp visualization explorables natural-language-processing pytorch language-models

ecco's Introduction

Ecco is a Python library for exploring and explaining Natural Language Processing models using interactive visualizations.

Ecco provides multiple interfaces to aid the explanation and intuition of Transformer-based language models. Read: Interfaces for Explaining Transformer Language Models.

Ecco runs inside Jupyter notebooks. It is built on top of PyTorch and transformers.

Ecco is not concerned with training or fine-tuning models, only with exploring and understanding existing pre-trained models. The library is currently an alpha release of a research project. You're welcome to contribute to make it better!

Documentation: ecco.readthedocs.io

Features

Support for a wide variety of language models (GPT2, BERT, RoBERTa, T5, T0, and others) [notebook & instructions for adding more models].
Ability to add your own local models (if they're based on Hugging Face PyTorch models).
Feature attribution (IntegratedGradients, Saliency, InputXGradient, DeepLift, DeepLiftShap, GuidedBackprop, GuidedGradCam, Deconvolution, and LRP via Captum)
Capture neuron activations in the FFNN layer in the Transformer block
Identify and visualize neuron activation patterns (via Non-negative Matrix Factorization)
Examine neuron activations via comparisons of activation spaces using SVCCA, PWCCA, and CKA (See this video on inspecting neural networks with CCA)
Visualizations for:
- Evolution of processing a token through the layers of the model (Logit lens)
- Candidate output tokens and their probabilities (at each layer in the model)

Installation

You can install ecco either with pip or with conda.

with pip

pip install ecco

with conda

conda install -c conda-forge ecco

Examples:

You can run all these examples from this [notebook] | [colab].

What is the sentiment of this film review?

Use a large language model (T5 in this case) to detect text sentiment. In addition to the sentiment, see the tokens the model broke the text into (which can help debug some edge cases).

Which words in this review lead the model to classify its sentiment as "negative"?

Feature attribution using Integrated Gradients helps you explore model decisions. In this case, switching "weakness" to "inclination" allows the model to correctly switch the prediction to positive.

Explore the world knowledge of GPT models by posing fill-in-the-blank questions.

Does GPT2 know where Heathrow Airport is? Yes. It does.

What other cities/words did the model consider in addition to London?

Visualize the candidate output tokens and their probability scores.
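As a rough illustration of what such a view computes, here is a minimal sketch that turns raw next-token scores (logits) into ranked probabilities with a softmax. The function name `top_k_candidates`, the token strings, and the logit values are all made up for illustration; this is not Ecco's API or real GPT2 output.

```python
import math

def top_k_candidates(logits, k=3):
    """Return the k most probable tokens with their softmax probabilities.

    `logits` maps a token string to a raw score (hypothetical values below).
    """
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - max_logit) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Sort by probability, highest first, and keep the top k
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Illustrative scores only, not produced by any model
toy_logits = {" London": 5.0, " England": 3.0, " Paris": 1.0, " the": 0.5}
candidates = top_k_candidates(toy_logits, k=2)
for token, prob in candidates:
    print(f"{token!r}: {prob:.3f}")
```

The same computation can be repeated on the scores obtained at each layer to see how the candidate list changes through the model.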

Which input words lead it to think of London?

At which layers did the model gather confidence that London is the right answer?

The model chose London by making it the highest-probability token (ranking it #1) after the last layer in the model. How much did each layer contribute to increasing the ranking of London? This is a logit lens visualization that helps explore the activity of different model layers.
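The ranking idea can be sketched in a few lines, assuming each layer's hidden state has already been projected to a score per candidate token. The helper `rank_of_token` and the per-layer scores below are hypothetical and only show how a token's rank can be tracked across layers.

```python
def rank_of_token(scores, token):
    """1-based rank of `token` when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(token) + 1

# Hypothetical per-layer scores for three candidate tokens (not real model output)
layer_scores = [
    {" London": 0.1, " Paris": 0.7, " the": 0.9},  # early layer
    {" London": 0.6, " Paris": 0.5, " the": 0.8},  # middle layer
    {" London": 0.9, " Paris": 0.4, " the": 0.3},  # final layer
]

ranks = [rank_of_token(scores, " London") for scores in layer_scores]
print(ranks)
```

A falling rank number from the first to the last layer corresponds to the model "gathering confidence" in the token.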

What are the patterns in BERT neuron activation when it processes a piece of text?

A group of neurons in BERT tend to fire in response to commas and other punctuation. Other groups of neurons tend to fire in response to pronouns. Use this visualization to factorize neuron activity in individual FFNN layers or in the entire model.

Read the paper:

Ecco: An Open Source Library for the Explainability of Transformer Language Models Association for Computational Linguistics (ACL) System Demonstrations, 2021

Tutorials

Video: Take A Look Inside Language Models With Ecco. [Colab Notebook]

How-to Guides

API Reference

The API reference and the architecture page explain Ecco's components and how they work together.

Gallery & Examples

Predicted Tokens: View the model's prediction for the next token (with probability scores). See how the predictions evolved through the model's layers. [Notebook] [Colab]

Rankings across layers: After the model picks an output token, look back at how each layer ranked that token. [Notebook] [Colab]

Layer Predictions: Compare the rankings of multiple tokens as candidates for a certain position in the sequence. [Notebook] [Colab]

Primary Attributions: How much did each input token contribute to producing the output token? [Notebook] [Colab]

Detailed Primary Attributions: See more precise input attribution values using the detailed view. [Notebook] [Colab]

Neuron Activation Analysis: Examine underlying patterns in neuron activations using non-negative matrix factorization. [Notebook] [Colab]

Getting Help

Having trouble?

The Discussion board might have some relevant information. If not, you can post your questions there.
Report bugs at Ecco's issue tracker

Bibtex for citations:

@inproceedings{alammar-2021-ecco,
    title = "Ecco: An Open Source Library for the Explainability of Transformer Language Models",
    author = "Alammar, J",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    year = "2021",
    publisher = "Association for Computational Linguistics",
}


ecco's Issues

Running Eccomap for Pre Trained BertForMaskedLM

Hi, I was trying to run my pretrained model, for which I used the BertForMaskedLM model class from Hugging Face, but it is giving me this error. Please help me resolve it. Thanks in advance.

Saliency Map for other transformer based models

Hi @jalammar

Thanks for creating this library. I follow your posts and explanations of transformer models, and you explain them well and in a simple way. I am working on BERT/BART models for various downstream NLP tasks (sentiment, intent, and other classification tasks) that use pre-trained large language models. Can this library produce saliency map visualizations for custom transformer models? If so, can you explain how?

Looking forward to your reply

Thanks

Ajay sahu

import ecco ERRO

Hey,

I tried pip install ecco on google colab then import ecco

but I have the error below:

RuntimeError Traceback (most recent call last)
__init__.pxd in numpy.import_array()

RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf . Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem .

During handling of the above exception, another exception occurred:

ImportError Traceback (most recent call last)
in <cell line: 1>()
----> 1 import ecco
2 lm = ecco.from_pretrained('distilgpt2')

5 frames
/usr/local/lib/python3.10/dist-packages/ecco/__init__.py in <module>
14
15 __version__ = '0.1.2'
---> 16 from ecco.lm import LM
17 from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel, AutoModelForSeq2SeqLM
18 from typing import Any, Dict, Optional, List

/usr/local/lib/python3.10/dist-packages/ecco/lm.py in <module>
13 from torch.nn import functional as F
14 from ecco.attribution import compute_primary_attributions_scores
---> 15 from ecco.output import OutputSeq
16 from typing import Optional, Any, List, Tuple, Dict, Union
17 from operator import attrgetter

/usr/local/lib/python3.10/dist-packages/ecco/output.py in <module>
9 import torch
10 from torch.nn import functional as F
---> 11 from sklearn import decomposition
12 from typing import Dict, Optional, List, Tuple, Union
13 from ecco.util import strip_tokenizer_prefix, is_partial_token

/usr/local/lib/python3.10/dist-packages/sklearn/__init__.py in <module>
80 from . import _distributor_init # noqa: F401
81 from . import __check_build # noqa: F401
---> 82 from .base import clone
83 from .utils._show_versions import show_versions
84

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in <module>
15 from . import __version__
16 from ._config import get_config
---> 17 from .utils import _IS_32BIT
18 from .utils._tags import (
19 _DEFAULT_TAGS,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/__init__.py in <module>
20 from scipy.sparse import issparse
21
---> 22 from .murmurhash import murmurhash3_32
23 from .class_weight import compute_class_weight, compute_sample_weight
24 from . import _joblib

sklearn/utils/murmurhash.pyx in init sklearn.utils.murmurhash()

__init__.pxd in numpy.import_array()

ImportError: numpy.core.multiarray failed to import

NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
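The RuntimeError above says a compiled extension was built against a newer NumPy C API (0x10) than the installed numpy provides (0xf). A reasonable first diagnostic step is to list the installed versions of the packages involved. The sketch below uses only the standard library; the helper name `installed_versions` and the default package list are assumptions chosen for this traceback.

```python
from importlib import metadata

def installed_versions(packages=("numpy", "scikit-learn", "ecco")):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions

print(installed_versions())
```

Comparing these versions against what each package requires can show whether numpy was downgraded after scikit-learn was installed, which is one way this kind of mismatch can arise.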

minGPT support

minGPT is a minimal PyTorch re-implementation of OpenAI GPT by @karpathy: https://github.com/karpathy/minGPT

It's extremely readable and great for teaching purposes. I was wondering if you would consider extending ecco to also support it?

If I were to work on it, is it something you'd prefer to see in this repository or in a separate "ecco-mingpt" repository?

How to increase fontsize of saliency output?

Currently, the saliency output looks like the above. When I put this image in my paper, it looks very small. How can I increase the font size of the output?

bug -object has no attribute 'lm_head'-

I am testing the BioGPT model with your visualizer. After completing the configuration, I have the following bug: AttributeError: 'BioGptForCausalLM' object has no attribute 'lm_head'

Adding AraBERT

Hello,

I was wondering if you could add AraBERT as one of the models supported by Ecco. Is there any procedure where I can add it myself?

Support batched inference

Currently, generate() only works with a single input. Ecco should support batched inference.

I am currently working on this in https://github.com/jalammar/ecco/tree/batch-input-activations

This is tied to #18 and __call__ should support batching on its launch. This has an effect on the output.activations tensor, and will bump its dimensions from
(layer, neuron, position)
to become:
(batch, layer, neuron, position)
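To make the layout change concrete, here is a small sketch using plain nested lists in place of tensors. The helpers `zeros` and `shape` and the tiny dimensions are illustrative only; they are not part of Ecco.

```python
def zeros(*dims):
    """Build a rectangular nested list of zeros with the given dimensions."""
    if len(dims) == 1:
        return [0.0] * dims[0]
    return [zeros(*dims[1:]) for _ in range(dims[0])]

def shape(nested):
    """Shape of a non-empty rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# Current layout for one input: (layer, neuron, position)
single = zeros(2, 4, 3)

# Batched layout for two inputs: (batch, layer, neuron, position)
batched = [zeros(2, 4, 3), zeros(2, 4, 3)]

print(shape(single))      # layout before batching
print(shape(batched))     # layout after batching
print(shape(batched[0]))  # selecting one batch element recovers the old layout
```

Code written against the three-dimensional layout would need to pick a batch element first, as in the last line.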

Support for Distributing Model on Multiple GPUs

I'm working with a pretty big model (GPTNeo1.3B) and the gradient attribution features cause CUDA out-of-memory errors on the GPUs I have access to.

I know that the Hugging Face from_pretrained method takes an optional parameter, device_map, that, if set to "auto", splits the model parameters across all available GPUs. The syntax is:

from transformers import AutoModelForSeq2SeqLM
t0pp = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", device_map="auto")

I was wondering if there was a similar feature in ecco? If not, could I submit a PR sometime in the near future?

Request for feature addition: Visualization of Neuron Activations and Clustered Neuron Firings

Is it possible to include a method to visualize neuron activations and clustering neurons by activation values from https://jalammar.github.io/explaining-transformers/?

Some method to visualize this kind of figures:
https://jalammar.github.io/images/explaining/activations-4.PNG
https://jalammar.github.io/images/explaining/activations_1.PNG
https://jalammar.github.io/images/explaining/activations_2.PNG

It would be very interesting to incorporate those visualizations to improve the interpretability of the Transformer architecture.
Kind regards.

How to use ECCO to explain fine-tuned T5 for Conditional Generation?

I am trying to use ECCO to explain the output of a fine-tuned T5 for Conditional Generation model, which is fine-tuned on a table-to-text dataset. I am not able to load my model with the conditional generation head. Can you give me example code showing how to load local transformer models?

Support for T5-like Seq2SeqLM

Hello, I was wondering if there are any plans for explicit encoder-decoder models like T5. Although T5 was not pre-trained with an auto-regressive LM objective, it is a pretty good candidate for ecco's generate method.
I tried running T5 as it was listed in model-config.yaml but soon ran into issues because the current implementation is very much suited to GPT-like models.

I made some changes on a fork to get attribution working, but not sure if I did it correctly
https://colab.research.google.com/drive/1zahIWgOCySoQXQkAaEAORZ5DID11qpkH?usp=sharing
https://github.com/chiragjn/ecco/tree/t5_exp

I would love to contribute to add support with some help, especially on the overall implementation design

llama support

Hi,
I am trying to use llama (https://huggingface.co/docs/transformers/main/model_doc/llama) with ecco. It reports that the model is not supported yet. Could you please let me know how I can make it work?

Best Regards,
Srikanth

Merge partial tokens in token visualizations [Generation & Attribution]

When showing generation and attribution tokens, merge partial tokens so they appear as one token. Similarly, merge their attribution scores. This is especially important to support languages where characters tend to span multiple tokens, which end up looking broken.
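One way the merge could work is sketched below. It assumes partial tokens are marked with a prefix (here the WordPiece-style "##", chosen only for illustration) and that merged scores are summed; the issue does not specify how scores should be combined, so summation is an assumption. `merge_partial_tokens` is a hypothetical helper, not an Ecco function.

```python
def merge_partial_tokens(tokens, scores, partial_prefix="##"):
    """Merge partial tokens into the preceding token and sum their scores."""
    merged_tokens, merged_scores = [], []
    for token, score in zip(tokens, scores):
        if token.startswith(partial_prefix) and merged_tokens:
            # Continuation piece: glue it to the previous token, add its score
            merged_tokens[-1] += token[len(partial_prefix):]
            merged_scores[-1] += score
        else:
            merged_tokens.append(token)
            merged_scores.append(score)
    return merged_tokens, merged_scores

# Illustrative tokens and attribution scores (not real model output)
tokens = ["token", "##ization", "is", "fun"]
scores = [0.25, 0.25, 0.125, 0.375]
words, word_scores = merge_partial_tokens(tokens, scores)
print(words)
print(word_scores)
```

Tokenizers that mark partial tokens with a suffix instead of a prefix would need the mirrored logic.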

T5 Generation Error

v0.1.0 gives the following error when attempting to generate using T5:
TypeError: ones(): argument 'size' must be tuple of ints, but found element of type Tensor at pos 1

Problem with saliency()

I'm unable to get the saliency() commands in the notebook "Language_Models_and_Ecco_PyData.ipynb" to function. Given the command "output_3.saliency()", the error message is: "'OutputSeq' object has no attribute 'saliency'". Would greatly appreciate suggestions. Poul

Adding GENRE model to ecco

Hi @jalammar , thanks a lot for this amazing package and the amazing blogs you have created.
I am using the facebook GENRE model for entity linking. I wanted to use ecco with it. How to add it to ecco? I tried changing model_config, but not sure how to do it exactly, I also followed the notebook - Identifying model configuration.ipynb, but ran into errors. Can you provide some guidance on how to add new models? Would be of great help.

Remove `tokenizer_config` usage from the library

This config parameter was made to easily package config to send to the Javascript components. Ecco now handles all tokenization on the Python side to separate the concerns between the python and JS components. Subsequently, this needs to be removed.

AttributeError: 'OutputSeq' object has no attribute 'saliency'

captum 0.5.0
torch 1.13.0+cu117

Language_Models_and_Ecco_PyData_Khobar.ipynb

text= "The countries of the European Union are:\n1. Austria\n2. Belgium\n3. Bulgaria\n4."
output_3 = lm.generate(text, generate=20, do_sample=True)
output_3.saliency()

AttributeError Traceback (most recent call last)
Cell In [13], line 1
----> 1 output_3.saliency()

AttributeError: 'OutputSeq' object has no attribute 'saliency'

[Bug/improvement] Issue with saliency maps in loop

Hey,

In Colab, I am trying to generate saliency maps in a loop to study variations of similar inputs, but when the maps generated in a loop are printed, they are stuck next to each other, which makes the visualisation quite confusing.
See screenshot.

Add support for more attribution methods

Hi,
Currently, the project seems to rely on grad-norm and grad-x-input to obtain the attributions. However, there are other, arguably better (as discussed in recent work) methods to obtain saliency maps. Integrating them in this project would also provide a good way to compare them on the same input examples.

Some of these methods, off the top of my head, are integrated gradients, gradient Shapley, and LIME. Perhaps support for visualizing the attention map from the model being interpreted could also be added. Methods based on feature ablation are also possible, but they might need more work to integrate.

There is support for these aforementioned methods on Captum, but it takes effort to get them working for NLP tasks, especially those based on language modeling. Thus, I feel this would be a useful addition here.

Bug: Broken links in docs

In the sections "Input Saliency" and "Detailed Saliency," the links to Notebook and Colab are broken.

GPT Generation error on GPU

Generating text with GPT2 on GPU throws the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

BART Support

@jalammar et al,

Can't thank you enough for your thoughtful approach with "Illustrated Transformers" and for further uncovering Transformers (visually) with the help of "Ecco". These articles are not just helping folks fundamentally understand things better but are also acting as a stimulus for responsible future research. So thank you for the great work!! [ I apologize for the distraction, couldn't resist expressing my gratitude while I had this opportunity to write to you! ]

I am primarily working on BART, T5, and PEGASUS base and fine-tuned variants for my research on summarization and extreme summarization objectives for my company. I understand that ecco supports generative models, but perhaps support for the ones mentioned is not there yet.

Not sure if these can be easily accommodated, or whether your team is already in the process of adding them?
If not, I am sure you must be overwhelmed with similar asks; I (and, I am sure, others) would love to participate. I have not gone through the code yet, but if there is some jump-start documentation for the code, specifically on how to refactor it for a different type of model, perhaps we can help contribute to enriching this beautiful library.
While the above is looked into, would you have a suggestion of any other library which can support similar outputs for BART and Pegasus for now?

Best,
Anshoo

Huggingface v4 support

Adding Huggingface transformers v4 support is work in progress in https://github.com/jalammar/ecco/tree/hf4-support. You can install it via:

!pip install git+https://github.com/jalammar/ecco.git@hf4-support

Token generation and NMF on activations should work.
Input saliency could be having issues with PyTorch v1.7.1; backward() breaks for some reason.
rankings, rankings_watch, and layer_prediction need a small fix since v4 changed the dimensions of the hidden state vector. They just need to account for the additional dimension. I still haven't decided whether to ignore the batch dimension, or support it and have the visualizations only show the first element in the batch (or have the user select the element number). Open to ideas.

Not support transformers 4.2.2?

I tried to import ecco and transformers at the same time but I got the following error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-585e05415cf8> in <module>
      1 import os
      2 import numpy as np
----> 3 import ecco
      4 import torch
      5 from transformers import GPT2LMHeadModel

~/anaconda3/lib/python3.7/site-packages/ecco/__init__.py in <module>
      1 __version__ = '0.0.12'
----> 2 from ecco.lm import LM, MockGPT, MockGPTTokenizer
      3 from transformers import AutoTokenizer, AutoModelForCausalLM
      4 
      5 def from_pretrained(hf_model_id,

~/anaconda3/lib/python3.7/site-packages/ecco/lm.py in <module>
     11 from ecco.attribution import *
     12 from typing import Optional, Any
---> 13 from transformers.modeling_gpt2 import GPT2Model
     14 
     15 

ModuleNotFoundError: No module named 'transformers.modeling_gpt2'

But it works if I comment out import ecco. Will ecco support transformers 4.2.2 in the near future?

Best,
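The traceback shows ecco 0.0.12 importing `transformers.modeling_gpt2`, a module path that transformers v4 no longer provides; in v4 the GPT-2 modeling code lives under `transformers.models.gpt2.modeling_gpt2`. A version-tolerant import can be sketched with a small fallback helper. `import_first_available` is a hypothetical name, and the demo uses standard-library modules so that it runs without transformers installed.

```python
import importlib

def import_first_available(module_paths):
    """Import and return the first module in `module_paths` that exists."""
    for path in module_paths:
        try:
            return importlib.import_module(path)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    raise ImportError(f"None of these modules could be imported: {module_paths}")

# For the case above, the candidates would be (new path first, old path second):
#   ["transformers.models.gpt2.modeling_gpt2", "transformers.modeling_gpt2"]
# Standard-library stand-ins are used here so the sketch is runnable anywhere.
mod = import_first_available(["not_a_real_module_xyz", "json.decoder"])
print(mod.__name__)
```

Importing the class from the package top level (`from transformers import GPT2Model`) avoids depending on the internal module layout at all.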

"Predicted tokens" - add visual indication for partial tokens

Mark partial tokens with an ellipsis character

Add models to config

Wishlist of models to add to model-config:

DeBERTa
GPT-J
T0

Which French model is supported in ecco library

Hello,
I would like to use the ecco library for my French dataset. Is there any French model that works in ecco?

Thank you

Support for Constrained Beam Search?

Hola! Great project, thank you for making it available!

I am curious now that beam search is supported, is it possible to pass constraints to ecco using for example PhrasalConstraints, to constrain T5 model output for a sequence-to-sequence task prior to ecco evaluation?

Thanks in advance!

Error message of run_nmf when using a model with activations=False

If I load a model like this

lm = ecco.from_pretrained('gpt2')

without specifically setting activations=True

The error I get when trying to run NMF is not very clear. Specifically, it shows

/usr/local/lib/python3.6/dist-packages/ecco/output.py in __init__(self, activations, n_input_tokens, token_ids, _path, n_components, tokens, **kwargs)
    498         from_layer = kwargs['from_layer'] if 'from_layer' in kwargs else None
    499         to_layer = kwargs['to_layer'] if 'to_layer' in kwargs else None
--> 500         if len(activations.shape) != 3:
    501             raise ValueError(f"The 'activations' parameter should have three dimensions: (layers, neurons, positions). "
    502                              f"Supplied dimensions: {activations.shape}", 'activations')

AttributeError: 'list' object has no attribute 'shape'

This seems like an easy mistake for a beginner to make, and the message could easily be improved by checking earlier that the output object has the properties required for running NMF.
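A minimal sketch of such an early check follows. It mirrors the failure in the traceback (the value has no `shape` attribute) and raises a message that points at the `activations=True` flag. `require_activations` is a hypothetical helper, not existing Ecco code.

```python
def require_activations(activations):
    """Fail early, with an actionable message, if activations were not collected."""
    if not hasattr(activations, "shape"):
        raise ValueError(
            "No activations were collected for this output. Load the model with "
            "activations=True (e.g. ecco.from_pretrained('gpt2', activations=True)) "
            "before running NMF."
        )
    return activations

# An empty list stands in for what the output holds when activations were not captured
try:
    require_activations([])
except ValueError as err:
    message = str(err)
    print(message)
```

Calling a check like this at the start of the NMF entry point would replace the AttributeError with a hint the user can act on.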

Add Primary Attribution support for BERT classification

Primary attributions are currently supported for causal (GPT) models and enc-dec (T5/T0) models. It would be great to add support for MLM models. A good first implementation would target either BERT or RoBERTa. My sense is that sequence classification takes precedence, then token classification.

BertForSequenceClassification
RobertaForSequenceClassification

Captum's Interpreting BERT Models guide is a good place to look.

As far as how to fit it into Ecco, these are initial ideas:

We'll need a way to indicate the classification head in __init__.py, likely a new head config parameter.
We may have to add a predict() or classify() method to lm.py ("predict" to match the familiar sklearn API, "classify" to be more specific, as generation is also prediction).

token prefix in roberta model?

I am trying to use a custom-trained RoBERTa model by loading the config file, but I get an error that the token prefix is not present in the config. Any idea how to fix it?

Add ability to run models via __call__(), and not just using generate()

Ecco currently only runs models using the generate() method. This makes it more geared towards GPT-like models. Adding support for running models via __call__(), and collecting the activations/saliency of that run, would make it much more useful. It opens the door to supporting MLMs (#6).

I am currently working on this in https://github.com/jalammar/ecco/tree/batch-input-activations

lm.generate and HuggingFace's generate give different results with do_sample=False

Hi, thanks for the great work.

I'm generating text with the lm function without sampling:

output = lm.generate(text, generate=200,max_length=1024,
        eos_token_id=1, pad_token_id=0, attribution=['grad_x_input','ig'])

Then using the original HuggingFace library using the same code as in ecco (by literally copying the function from lm.py):

outputs=model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams=1,
            # FIXME: +1 in max_length to account for first start token in decoder, find a better way to do this
            max_length=1024,
            do_sample=False,
            top_p=None,
            top_k=None,
            temperature=1,
            eos_token_id=1, pad_token_id=0,
            return_dict_in_generate=True,
            output_scores=True)

In both cases, we use the same seed (MKDIDTLISNNAL). But the first method produces this: WSKMLVEEDPGFFERLSQAQKPRALFITCSDSRLVPEQ. And the second produces this: WSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERL.

The two functions should give the same sequence of tokens since we are not sampling. There must be a bug in how lm.generate produces the tokens iteratively (we know the second sequence is the right one).
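Since decoding without sampling is deterministic, the first position at which the two outputs differ is a useful place to start debugging. The sketch below locates it for the two sequences quoted above. `first_divergence` is a hypothetical helper, and it assumes one character per generated token, which may not hold for every tokenizer.

```python
def first_divergence(a, b):
    """Index of the first differing element, or None if the sequences are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the shared prefix: they differ only if lengths differ
    return None if len(a) == len(b) else min(len(a), len(b))

ecco_out = "WSKMLVEEDPGFFERLSQAQKPRALFITCSDSRLVPEQ"  # from lm.generate
hf_out = "WSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERL"   # from model.generate

idx = first_divergence(ecco_out, hf_out)
print(idx, ecco_out[idx], hf_out[idx])
```

Comparing the model inputs and scores of both code paths at that step can narrow down where they start to disagree.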

IndexError when using analysis.pwcca

OS Info

Operating System: Ubuntu 18.04.6 LTS
Kernel: Linux 5.4.0-100-generic
Architecture: x86-64

Environment Info

python 3.6
ecco==0.1.2

Replicate

import ecco.analysis as analysis
a = np.random.rand(4, 200)
b = np.random.rand(8, 200)
analysis.pwcca(a, b)  # This works
analysis.pwcca(b, a)  # It fails here

Error:

~/Documents/TOMMAS/venv/lib/python3.6/site-packages/ecco/svcca_lib/pwcca.py in compute_pwcca(acts1, acts2, epsilon)
47 else:
48 dirns = np.dot(sresults["coef_y"],
---> 49 (acts1[sresults["y_idxs"]] -
50 sresults["neuron_means2"][sresults["y_idxs"]])) + sresults["neuron_means2"][sresults["y_idxs"]]
51 coefs = sresults["cca_coef2"]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 8 but corresponding boolean dimension is 4

I believe ecco/svcca_lib/pwcca.py line 49 is supposed to be acts2 instead of acts1.

Rankings_watch displaying wrong sequence

Hello,
I have a problem with the rankings_watch() function. I used a predefined GPT2 model and gave it the input "Today, the weather is". However, in the visualization, only the first token is shown although the model creates the output correctly:

Thank you for your help :D

KeyError: 'tokenizer_config'

I am working on integrating my custom model vinai/bertweet-base with Ecco; however, I ran into the following issue:

Traceback (most recent call last):
  File "experiment_ecco.py", line 44, in <module>
    nmf_1.explore()
  File "C:\Users\XXXX\anaconda3\envs\disaster_tweets\lib\site-packages\ecco\output.py", line 827, in explore
    }})"""
KeyError: 'tokenizer_config'

I created the lm in the following way:

# loading in tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True, use_fast=False)
model = torch.load("bertmodel.pth")
''' 
this model is obtained from training 
AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", 
output_hidden_states=True, output_attentions=True, num_labels=2)
'''

model_config = {
    'embedding': 'roberta.embeddings.word_embeddings',
    'type': 'mlm',
    'activations': [r'intermediate\.dense'],  # raw string avoids an invalid-escape warning
    'token_prefix': '',
    'partial_token_prefix': ''
}

lm = LM(model=model, tokenizer=tokenizer, model_name="vinai/bertweet-base",
        config=model_config, collect_activations_flag=True, verbose=True)

tweet = "So running down the stairs was a bad idea full on collided... With the floor ??"
inputs = lm.tokenizer([tweet], return_tensors="pt")
output = lm(inputs)

nmf_1 = output.run_nmf(n_components=8)
nmf_1.explore()

Upon further inspection, I believe the error comes from the following line:

js = f"""
         requirejs(['basic', 'ecco'], function(basic, ecco){{
            const viz_id = basic.init()
            
            ecco.interactiveTokensAndFactorSparklines(viz_id, {data},
            {{
            'hltrCFG': {{'tokenization_config': {json.dumps(self.config['tokenizer_config'])}
                }}
            }})
         }}, function (err) {{
            console.log(err);
        }})"""

I could not trace back the origin of tokenizer_config. Therefore I assume it also has to be passed in the model_config for a custom model? If so, this needs to be specified in the docs.

Or could this issue be related in a strange way to #65?
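One defensive option is to avoid indexing `tokenizer_config` directly when building the JavaScript payload. The sketch below is purely illustrative: `highlighter_config` is a hypothetical helper, and the fallback structure (reusing the prefix fields from the custom model_config) is a guess at what the JavaScript side expects, not something confirmed by the library.

```python
import json

def highlighter_config(config):
    """Build the 'hltrCFG' payload without assuming 'tokenizer_config' exists."""
    tokenizer_config = config.get("tokenizer_config")
    if tokenizer_config is None:
        # Assumed fallback: reuse the prefix fields a custom model_config defines
        tokenizer_config = {
            "token_prefix": config.get("token_prefix", ""),
            "partial_token_prefix": config.get("partial_token_prefix", ""),
        }
    return json.dumps({"hltrCFG": {"tokenization_config": tokenizer_config}})

# Same fields as the custom model_config in this issue
model_config = {
    "embedding": "roberta.embeddings.word_embeddings",
    "type": "mlm",
    "activations": [r"intermediate\.dense"],
    "token_prefix": "",
    "partial_token_prefix": "",
}
payload = highlighter_config(model_config)
print(payload)
```

With a lookup like this, a config without the key produces a payload instead of a KeyError.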

Save vectorized figure

Is there a way to save the output of primary_attributions to SVG (or PDF)?

Why do we need to construct `one-hot` vectors of the input_ids and then multiply by the embeddings, as opposed to applying the embedding directly?

Hey there,

Thanks for releasing this library! I was reviewing your lm.py file, and in particular, I was unclear why you were constructing one-hot vectors and multiplying by the embedding matrix, as opposed to simply applying the embedding directly.

See here:

https://github.com/jalammar/ecco/blob/main/src/ecco/lm.py#L118

my approach:

import torch
from transformers import pipeline

text = "We are very happy to show you the 🤗 Transformers library."

classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")
model = classifier.model
tokenizer = classifier.tokenizer

encoding = tokenizer.encode_plus(
	text,
	return_tensors="pt",
	add_special_tokens=True,
	return_attention_mask=True,
)

inputs_embeds = model.base_model.embeddings(encoding['input_ids'])

assert inputs_embeds.is_leaf is False
inputs_embeds.retain_grad()

logits = model(inputs_embeds=inputs_embeds, attention_mask=encoding['attention_mask']).logits.squeeze(dim=0)
score = logits[logits.argmax()]
score.backward(gradient=None, retain_graph=True)

inputs_embeds__grad = (inputs_embeds.grad * inputs_embeds)[:, 1:-1, :]  # remove CLS and SEP tokens

feature_importance = torch.norm(inputs_embeds__grad, dim=2)
feature_importance_normalized = (feature_importance / torch.sum(feature_importance)).squeeze(dim=0)

attributions = [{tokenizer.convert_ids_to_tokens(input_id.item()): feature_importance_normalized[index].item()} for index, input_id in enumerate(encoding['input_ids'].squeeze(dim=0)[1:-1])]

Note how the inputs_embeds are calculated directly.
Also note that the scores computed from the logits using your approach don't match mine (or the pipeline's).

# my approach
inputs_embeds = model.base_model.embeddings(encoding['input_ids'])

# your approach:
embedding_dim = model.base_model.embeddings.word_embeddings.embedding_dim
num_embeddings = model.base_model.embeddings.word_embeddings.num_embeddings

input_ids__one_hot = torch.nn.functional.one_hot(encoding["input_ids"], num_classes=num_embeddings).float()
input_ids__one_hot.requires_grad_(True)

assert input_ids__one_hot.requires_grad
assert input_ids__one_hot.is_leaf # leaf node!

embedding_matrix = model.base_model.embeddings.word_embeddings.weight
inputs_embeds__yours = torch.matmul(input_ids__one_hot, embedding_matrix)

assert torch.all(inputs_embeds__yours == inputs_embeds).item() == False

This is because the embedding module is a sequence of multiple functions:

# model.base_model.embeddings

Embeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

If I instead apply only word_embeddings, I recover your solution:

inputs_embeds__partial = model.base_model.embeddings.word_embeddings(encoding['input_ids'])

assert torch.all(inputs_embeds__partial == inputs_embeds__yours).item() == True
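
The same point can be shown with plain Python lists, as a rough sketch: multiplying a one-hot matrix by the embedding matrix is just a row lookup, while the full embeddings module adds more on top. All values and helper names here (`matmul`, `one_hot`, the toy matrices) are made up for illustration, and LayerNorm and dropout are left out for brevity.

```python
def matmul(a, b):
    """Naive matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def one_hot(ids, num_classes):
    """One row per id, with a single 1.0 at the id's index."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in ids]

# Toy embedding tables (hypothetical): vocabulary of 4, dimension 2, 3 positions
word_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
position_embeddings = [[0.01, 0.02], [0.03, 0.04], [0.05, 0.06]]
input_ids = [2, 0, 3]

# "your approach": one-hot vectors times the word-embedding matrix
via_one_hot = matmul(one_hot(input_ids, len(word_embeddings)), word_embeddings)

# word_embeddings only: a plain row lookup
via_lookup = [word_embeddings[i] for i in input_ids]

# closer to the full embeddings module: word + position embeddings
with_positions = [
    [w + p for w, p in zip(word_embeddings[i], position_embeddings[pos])]
    for pos, i in enumerate(input_ids)
]

print(via_one_hot == via_lookup)     # the two word-level results agree
print(with_positions == via_lookup)  # the position term changes the result
```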

Add Beam Search Support

I can't find any way to do beam search using ecco. Could you add this to the documentation?

Error when generate using T5

Hi,

I installed the ecco package from this repo and ran the following code:
lm = ecco.from_pretrained('t5-small', activations=True)
text= "translate English to Portuguese: I like to eat rice."
output = lm.generate(text, generate=1, do_sample=True)

However, I got the following error:

ModuleAttributeError: 'Embedding' object has no attribute 'shape'

After some digging, I found that the problem is that you are treating self.model_embeddings as a Tensor, but it is an nn.Embedding (so you need to use self.model_embeddings.weight). I also found other problems:

The model expects tensors of shape (batch, seq_len, hidden_dim), but the code passes only (seq_len, hidden_dim).
The T5 model should be loaded as AutoModelForSeq2SeqLM to have outputs, and decoding needs decoder_input_ids.

generate() doesn't print output tokens as model generates them

Tested on colab with GPU. Expected behaviour is that output tokens are displayed as soon as they're generated. That does not happen with GPUs on Colab.

Tokenizer has partial token suffix instead of prefix

Following your guide for identifying model configuration

MODEL_ID = "vinai/bertweet-base"

from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, normalization=True, use_fast=False)

ids= tokenizer('tokenization')
ids

returns:

{'input_ids': [0, 969, 6186, 6680, 2], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

Then

tokenizer.convert_ids_to_tokens(ids['input_ids'])

returns:

['<s>', 'to@@', 'ken@@', 'ization', '</s>']

Here I noticed that the tokenizer marks partial tokens with a suffix ("@@") instead of a prefix. Using a suffix instead of a prefix is not configurable in the config.
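
To make the difference concrete, here is a small sketch of how word reconstruction differs between the two conventions. The helper names are hypothetical, and the "##" example tokens are only illustrative of a WordPiece-style prefix marker; neither helper is part of ecco.

```python
def merge_suffix_tokens(tokens, marker="@@"):
    """Join tokens where a trailing marker means 'the word continues in the next token'."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(marker):
            # partial token: strip the marker and keep accumulating
            current += tok[: -len(marker)]
        else:
            # last piece of the word
            words.append(current + tok)
            current = ""
    if current:
        words.append(current)
    return words

def merge_prefix_tokens(tokens, marker="##"):
    """Join tokens where a leading marker means 'this token continues the previous word'."""
    words = []
    for tok in tokens:
        if tok.startswith(marker) and words:
            words[-1] += tok[len(marker):]
        else:
            words.append(tok)
    return words

suffix_words = merge_suffix_tokens(["to@@", "ken@@", "ization"])
prefix_words = merge_prefix_tokens(["token", "##ization"])
print(suffix_words, prefix_words)
```

With a suffix marker, whether a token is partial is known from the token itself; with a prefix marker, it is the following token that signals the continuation, so display code written for one convention cannot simply be reused for the other.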

output.saliency() displays nothing

I am trying to visualize saliency maps from a custom GPT model. Since I am concerned only about saliency maps, I just do the following:

out = OutputSeq(token_ids = input_ids, n_input_tokens = n_input_tokens, tokens = tokens, attribution = attr)
out.saliency()

I get no errors and nothing is displayed in the jupyter notebook, but when I open Chrome's Javascript console, I see the following thing.

(unknown) Ecco initialize.

l @ storage.googleapis.c…ust=1610606118793:1
(anonymous) @ storage.googleapis.c…ust=1610606118793:1
autoTextColor @ storage.googleapis.c…ust=1610606118793:1
(anonymous) @ storage.googleapis.c…ust=1610606118793:1
(anonymous) @ d3js.org/d3.v5.min.j…ust=1610606118793:2
each @ d3js.org/d3.v5.min.j…ust=1610606118793:2
style @ d3js.org/d3.v5.min.j…ust=1610606118793:2
enter @ storage.googleapis.c…ust=1610606118793:1
(anonymous) @ storage.googleapis.c…ust=1610606118793:1
join @ d3js.org/d3.v5.min.j…ust=1610606118793:2
setupTokenBoxes @ storage.googleapis.c…ust=1610606118793:1
init @ storage.googleapis.c…ust=1610606118793:1
eval
execCb @ require.js:1693
check @ require.js:881
enable @ require.js:1173
init @ require.js:786
(anonymous) @ require.js:1457

DevTools failed to load SourceMap: Could not load content for http://localhost:8888/static/notebook/js/main.min.js.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE
DevTools failed to load SourceMap: Could not load content for https://storage.googleapis.com/wandb-cdn/production/d4e2434e6/raven.min.js.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE

How do I resolve this issue? Btw, I am running this notebook by sshing into my institute's remote machine.

attention head

Hi @jalammar, I tested some examples with Ecco and wanted to know: is it possible to choose the attention head, so that I can view the activations for each head and for each layer?

Presence of character Ġ before each token in output

I was working on the "05- Neuron Factors.ipynb" notebook and noticed the character Ġ before each token in the output of "nmf_1.explore()". I am not sure why this happens. Please check the screenshot below.
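
One likely explanation, assuming the notebook uses a GPT-2-style byte-level BPE tokenizer: such tokenizers map every byte to a printable character, and the space byte (32) ends up as Ġ (U+0120), so a token that begins a new word shows up with a leading Ġ. A display layer can map it back to a space. The helper below is a hypothetical sketch with made-up tokens, not ecco's own rendering code.

```python
SPACE_MARKER = "\u0120"  # the character "Ġ"

def to_display(tokens, space_marker=SPACE_MARKER):
    """Replace the byte-level BPE space marker with a real space for display."""
    return [tok.replace(space_marker, " ") for tok in tokens]

# Hypothetical token strings as a byte-level BPE tokenizer might return them
tokens = ["The", "\u0120quick", "\u0120brown"]
display = to_display(tokens)
print("".join(display))
```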

Your help is appreciated.

BERT/MLM support

The first versions of the library focused on GPT models. Adding support to BERT-based models shouldn't be difficult. Off the top of my head, the things that need to be changed are:

Support for the tokenizer. The javascript library needs some changes (due to the differences in indicating a partial token between the GPT/BPE and BERT/WordPiece) in displaying the tokens (and when to tag the div as partial token).
To collect activations, we'll need the code that attaches the hook that collects activations to consult a new dict that we create for this purpose. 'model name' => 'layer name' -- where 'layer name' is the layer to collect the activations from. This tends to be the input tensor to the layer after the GELU/RELU activation in the MLP.
Masked Language Models like BERT will not be based on generation. So the API needs to accommodate a more fitting method name than generate(). There could be a case for a separate MLM class (different from ecco.LM). But that could be overengineering for now. A good first approach would be to adjust LM, then potentially split off if it turns out to be too different.
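
The 'model name' => 'layer name' dict from the second point could look roughly like the sketch below. Everything here is an assumption made for illustration: the registry name, the regex patterns, and the example module names are not taken from ecco's code, and the real layer names should be checked against each model before use.

```python
import re

# Hypothetical registry: model name -> regex matching the layer whose input
# (the tensor right after the GELU/RELU activation) should be captured.
ACTIVATION_LAYER_PATTERNS = {
    "gpt2": r"mlp\.c_proj$",
    "bert-base-uncased": r"layer\.\d+\.output\.dense$",
}

def layers_to_hook(model_name, module_names, registry=ACTIVATION_LAYER_PATTERNS):
    """Return the module names that the activation-collecting hook should attach to."""
    pattern = registry.get(model_name)
    if pattern is None:
        raise KeyError(f"No activation layer configured for {model_name!r}")
    return [name for name in module_names if re.search(pattern, name)]

# Example module names (illustrative, as they might come from model.named_modules())
bert_names = [
    "encoder.layer.0.attention.output.dense",
    "encoder.layer.0.intermediate.dense",
    "encoder.layer.0.output.dense",
]
gpt2_names = ["h.0.attn.c_proj", "h.0.mlp.c_fc", "h.0.mlp.c_proj"]

bert_hooks = layers_to_hook("bert-base-uncased", bert_names)
gpt2_hooks = layers_to_hook("gpt2", gpt2_names)
print(bert_hooks, gpt2_hooks)
```

Keeping the mapping in a plain dict means adding a new model is a one-line change, and an unknown model fails loudly instead of silently collecting nothing.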

Add a `conda` install option for `ecco`

A conda install option for ecco could be helpful for two reasons:

Easy installation with version management with conda.
If other libraries that depend on ecco are to be published on the conda-forge channel, ecco must be available on conda-forge as well.

💡 I have already started work on this. PR: conda-forge/staged-recipes#17388

Once the PR gets merged, you will be able to install ecco as:

conda install -c conda-forge ecco

I will send a PR to update your documentation once the conda-forge PR gets merged.

Help understanding position arg of layer_predictions

In the gpt2 model, I am measuring the distribution of calendar dates.

import ecco
lm = ecco.from_pretrained('gpt2')
output_0 = lm.generate("On January", generate=1, do_sample=False)

I assumed that to read predictions for the next token, I would need either position=0 or position=2 depending on whether it referred to the 0th token of the full string or the generated output. I was surprised to see these return the same tokens and probabilities:

output_0.layer_predictions(position=0, layer=11, topk=5)
output_0.layer_predictions(position=2, layer=11, topk=5)

If I query position=1 then I see 'the' and other tokens which might follow "On " in the original sentence.

output_0.layer_predictions(position=1, layer=11, topk=5)