jalammar / ecco

Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTA, T5, and T0).

Home Page: https://ecco.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 9.78% HTML 2.28% CSS 0.19% Jupyter Notebook 87.75%
explorables language-models natural-language-processing nlp pytorch visualization

ecco's People

Contributors

biechi, chiragjn, haziqa-coder, jalammar, joaolages, nostalgebraist, ssamdav, sugatoray, thomas-chong

ecco's Issues

Bug: Broken links in docs

In the sections "Input Saliency" and "Detailed Saliency," the links to Notebook and Colab are broken.

BART Support

@jalammar et al,

Can't thank you enough for your thoughtful approach with "Illustrated Transformers" & for further uncovering Transformers (visually) with the help of "Ecco". These articles are not just helping folks fundamentally understand things better but are also acting as a stimulus for responsible future research. So thank you for the great work!! [ I apologize for the distraction, I couldn't resist expressing my gratitude while I had this opportunity to write to you! ]

I am primarily working on BART, T5, and PEGASUS base & fine-tuned variants for my research on summarization & extreme summarization objectives for my company. I understand that Ecco supports generative models, but perhaps support for the models mentioned is not there yet.

  1. Can these be easily accommodated, and is your team already in the process of adding them?
  2. If not, I am sure you must be overwhelmed with similar asks; I (and, I am sure, others) would love to participate. I have not gone through the code yet, but if there is some jump-start documentation for the code, specifically on how to refactor it for a different type of model, perhaps we can help contribute to enriching this beautiful library.
  3. While the above is looked into, would you have a suggestion for any other library that can support similar outputs for BART and Pegasus for now?

Best,
Anshoo

Presence of character Ġ before each token in output

I was working on the "05- Neuron Factors.ipynb" notebook and noticed the presence of character Ġ before each token in the output. The output is for the code "nmf_1.explore()". I am not quite sure why it is doing that. Please check the screenshot below.

image

Your help is appreciated.
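
For context, Ġ (U+0120) is how GPT-2's byte-level BPE vocabulary represents a leading space, so it marks tokens that begin a new word. A minimal sketch of cleaning such tokens for display follows; display_token is a hypothetical helper written for illustration and is not part of Ecco's API.

```python
# Hypothetical display helper; not part of Ecco's API.
SPACE_MARKER = "\u0120"  # the character 'Ġ' used by GPT-2's byte-level BPE


def display_token(token: str) -> str:
    """Turn a leading space marker back into a real space."""
    if token.startswith(SPACE_MARKER):
        return " " + token[len(SPACE_MARKER):]
    return token


tokens = ["The", "Ġcountries", "Ġof", "Ġthe", "ĠEuropean", "ĠUnion"]
text = "".join(display_token(t) for t in tokens)
print(text)
```

Tokens without the marker are left untouched, since they either start the text or continue the previous word.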

Add Primary Attribution support for BERT classification

Primary attributions are currently supported for causal (GPT) models and enc-dec (T5/T0) models. It would be great to add support for MLM models. A good first implementation would target either BERT or RoBERTa. My sense is that sequence classification takes precedence, then token classification.

BertForSequenceClassification
RobertaForSequenceClassification

Captum's Interpreting BERT Models guide is a good place to look.

As far as how to fit it into Ecco, these are initial ideas:

  • We'll need a way to indicate the classification head in __init__.py, likely a new head config parameter.
  • We may have to add a predict() or classify() method to lm.py ("predict" to match the familiar sklearn API, "classify" to be more specific, as generation is also prediction).

BERT/MLM support

The first versions of the library focused on GPT models. Adding support to BERT-based models shouldn't be difficult. Off the top of my head, the things that need to be changed are:

  • Support for the tokenizer. The javascript library needs some changes (due to the differences in indicating a partial token between the GPT/BPE and BERT/WordPiece) in displaying the tokens (and when to tag the div as partial token).
  • To collect activations, we'll need the code that attaches the hook that collects activations to consult a new dict that we create for this purpose. 'model name' => 'layer name' -- where 'layer name' is the layer to collect the activations from. This tends to be the input tensor to the layer after the GELU/RELU activation in the MLP.
  • Masked Language Models like BERT will not be based on generation. So the API needs to accommodate a more fitting method name than generate(). There could be a case for a separate MLM class (different from ecco.LM). But that could be overengineering for now. A good first approach would be to adjust LM, then potentially split off if it turns out to be too different.
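
The 'model name' => 'layer name' lookup from the second bullet could be sketched roughly as below. The mapping entries, the regex patterns, and the modules_to_hook helper are all assumptions made for illustration; they are not Ecco's shipped configuration.

```python
import re

# Illustrative mapping from model name to a regex over module names.
# These entries are assumptions for this sketch, not Ecco's real config.
ACTIVATION_LAYERS = {
    "gpt2": r"mlp\.c_proj$",
    "bert-base-uncased": r"intermediate\.dense$",
}


def modules_to_hook(model_name, module_names):
    """Return the module names whose activations should be collected."""
    pattern = re.compile(ACTIVATION_LAYERS[model_name])
    return [name for name in module_names if pattern.search(name)]


# Names in the style returned by model.named_modules() in PyTorch.
gpt2_modules = [
    "transformer.h.0.attn.c_proj",
    "transformer.h.0.mlp.c_fc",
    "transformer.h.0.mlp.c_proj",
    "transformer.h.1.mlp.c_proj",
]
selected = modules_to_hook("gpt2", gpt2_modules)
print(selected)
```

In PyTorch, a forward hook registered on a selected module receives (module, input, output), so the post-activation tensor described in the bullet would be read from the hook's input rather than its output.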

How to use ECCO to explain FineTuned T5 for Conditional Generation ?

I am trying to use Ecco to explain the output of a fine-tuned T5 for Conditional Generation model, which was fine-tuned on a table-to-text dataset. I am not able to load my model for conditional generation. Can you give me example code showing how to load local transformer models?

Request for feature addition: Visualization of Neuron Activations and Clustered Neuron Firings

Is it possible to include a method to visualize neuron activations and clustering neurons by activation values from https://jalammar.github.io/explaining-transformers/?

A method that produces figures like these would be ideal:
https://jalammar.github.io/images/explaining/activations-4.PNG
https://jalammar.github.io/images/explaining/activations_1.PNG
https://jalammar.github.io/images/explaining/activations_2.PNG

It would be very interesting to incorporate those visualizations, as they improve the interpretability of the Transformer architecture.
Kind regards.

Support batched inference

Currently, generate() only works with a single input. Ecco should support batched inference.

I am currently working on this in https://github.com/jalammar/ecco/tree/batch-input-activations

This is tied to #18 and __call__ should support batching on its launch. This has an effect on the output.activations tensor, and will bump its dimensions from
(layer, neuron, position)
to become:
(batch, layer, neuron, position)

Support for T5-like Seq2SeqLM

Hello, I was wondering if there are any plans to support explicit encoder-decoder models like T5. Although T5 was not pre-trained with an auto-regressive LM objective, it is a pretty good candidate for ecco's generate method.
I tried running t5 as it was listed in model-config.yaml but soon ran into issues because the current implementation is very much suited to GPT-like models.

I made some changes on a fork to get attribution working, but not sure if I did it correctly
https://colab.research.google.com/drive/1zahIWgOCySoQXQkAaEAORZ5DID11qpkH?usp=sharing
https://github.com/chiragjn/ecco/tree/t5_exp

I would love to contribute to add support with some help, especially on the overall implementation design

lm.generate and HuggingFace's generate give different results with do_sample=False

Hi, thanks for the great work.

I'm generating text with the lm function without sampling:

output = lm.generate(text, generate=200,max_length=1024,
        eos_token_id=1, pad_token_id=0, attribution=['grad_x_input','ig'])

Then using the original HuggingFace library using the same code as in ecco (by literally copying the function from lm.py):

outputs=model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams=1,
            # FIXME: +1 in max_length to account for first start token in decoder, find a better way to do this
            max_length=1024,
            do_sample=False,
            top_p=None,
            top_k=None,
            temperature=1,
            eos_token_id=1, pad_token_id=0,
            return_dict_in_generate=True,
            output_scores=True)

In both cases, we use the same seed (MKDIDTLISNNAL). But the first method produces this: WSKMLVEEDPGFFERLSQAQKPRALFITCSDSRLVPEQ. And the second produces this: WSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERL.

The two functions should give the same sequence of tokens since we are not sampling. There must be a bug in how lm.generate produces the tokens iteratively (we know the second sequence is the right one).
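
To narrow down where the two decoding paths part ways, it can help to locate the first differing position. The sketch below compares the two decoded strings quoted above character by character; treating one character as one token is an assumption, and comparing the raw token id lists would be more precise. first_divergence is a hypothetical helper.

```python
def first_divergence(a, b):
    """Return the first index where two sequences differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a strict prefix of the other.
        return min(len(a), len(b))
    return None


ecco_out = "WSKMLVEEDPGFFERLSQAQKPRALFITCSDSRLVPEQ"
hf_out = "WSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERL"
step = first_divergence(ecco_out, hf_out)
print(step, ecco_out[step], hf_out[step])
```

Inspecting the logits of both implementations at that step is a reasonable next check, since every later token depends on the one chosen there.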

Why do we need to construct `one-hot` vectors of the input_ids and then multiply by the embeddings, as opposed to applying the embedding directly?

Hey there,

Thanks for releasing this library! I was reviewing your lm.py file, and in particular, I was unclear why you were constructing one-hot vectors and multiplying by the embedding matrix, as opposed to simply applying the embedding directly.

See here:

https://github.com/jalammar/ecco/blob/main/src/ecco/lm.py#L118

my approach:

import torch
from transformers import pipeline

text = "We are very happy to show you the 🤗 Transformers library."

classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")
model = classifier.model
tokenizer = classifier.tokenizer

encoding = tokenizer.encode_plus(
	text,
	return_tensors="pt",
	add_special_tokens=True,
	return_attention_mask=True,
)

inputs_embeds = model.base_model.embeddings(encoding['input_ids'])

assert inputs_embeds.is_leaf is False
inputs_embeds.retain_grad()

logits = model(inputs_embeds=inputs_embeds, attention_mask=encoding['attention_mask']).logits.squeeze(dim=0)
score = logits[logits.argmax()]
score.backward(gradient=None, retain_graph=True)

inputs_embeds__grad = (inputs_embeds.grad * inputs_embeds)[:, 1:-1, :]  # remove CLS and SEP tokens

feature_importance = torch.norm(inputs_embeds__grad, dim=2)
feature_importance_normalized = (feature_importance / torch.sum(feature_importance)).squeeze(dim=0)

attributions = [{tokenizer.convert_ids_to_tokens(input_id.item()): feature_importance_normalized[index].item()} for index, input_id in enumerate(encoding['input_ids'].squeeze(dim=0)[1:-1])]

  1. Note how the inputs_embeds are calculated directly.
  2. Also note that the scores computed from the logits using your approach don't match mine (or the pipeline).

# my approach
inputs_embeds = model.base_model.embeddings(encoding['input_ids'])

# your approach:
embedding_dim = model.base_model.embeddings.word_embeddings.embedding_dim
num_embeddings = model.base_model.embeddings.word_embeddings.num_embeddings

input_ids__one_hot = torch.nn.functional.one_hot(encoding["input_ids"], num_classes=num_embeddings).float()
input_ids__one_hot.requires_grad_(True)

assert input_ids__one_hot.requires_grad
assert input_ids__one_hot.is_leaf # leaf node!

embedding_matrix = model.base_model.embeddings.word_embeddings.weight
inputs_embeds__yours = torch.matmul(input_ids__one_hot, embedding_matrix)

assert torch.all(inputs_embeds__yours == inputs_embeds).item() == False

This is because the embeddings module is a sequence of several operations, not just the word-embedding lookup:

# model.base_model.embeddings

Embeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

If I instead apply only the word_embeddings layer directly, I recover your solution:

inputs_embeds__partial = model.base_model.embeddings.word_embeddings(encoding['input_ids'])

assert torch.all(inputs_embeds__partial == inputs_embeds__yours).item() == True
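
The equivalence in the last assertion can be reproduced without any framework: multiplying a one-hot vector by the embedding matrix selects exactly one row. The toy matrix and helper names below are made up for illustration. One plausible reason to prefer the one-hot route (an assumption about intent, not a confirmed design decision) is that the one-hot tensor is a floating-point leaf, as the earlier assertion shows, so gradients can be taken with respect to it, whereas integer token ids cannot carry gradients.

```python
# Toy 3-token vocabulary with 2-dimensional embeddings (made-up values).
embedding_matrix = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]


def one_hot(index, num_classes):
    """Build a one-hot row vector for a token id."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def vec_matmul(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vec, matrix)) for c in range(n_cols)]


input_ids = [2, 0, 1]
vocab_size = len(embedding_matrix)
via_one_hot = [vec_matmul(one_hot(i, vocab_size), embedding_matrix) for i in input_ids]
via_lookup = [embedding_matrix[i] for i in input_ids]
print(via_one_hot == via_lookup)
```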

Adding AraBERT

Hello,

I was wondering if you could add AraBERT as one of the models supported by Ecco. Is there any procedure where I can add it myself?

Problem with saliency()

I'm unable to get the saliency() commands in the notebook "Language_Models_and_Ecco_PyData.ipynb" to function. Given the command "output_3.saliency()", the error message is: "'OutputSeq' object has no attribute 'saliency'". Would greatly appreciate suggestions. Poul

import ecco ERROR

Hey,

I ran pip install ecco on Google Colab and then import ecco, but I get the error below:

RuntimeError Traceback (most recent call last)
init.pxd in numpy.import_array()

RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf . Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem .

During handling of the above exception, another exception occurred:

ImportError Traceback (most recent call last)
in <cell line: 1>()
----> 1 import ecco
2 lm = ecco.from_pretrained('distilgpt2')

5 frames
/usr/local/lib/python3.10/dist-packages/ecco/init.py in
14
15 version = '0.1.2'
---> 16 from ecco.lm import LM
17 from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel, AutoModelForSeq2SeqLM
18 from typing import Any, Dict, Optional, List

/usr/local/lib/python3.10/dist-packages/ecco/lm.py in
13 from torch.nn import functional as F
14 from ecco.attribution import compute_primary_attributions_scores
---> 15 from ecco.output import OutputSeq
16 from typing import Optional, Any, List, Tuple, Dict, Union
17 from operator import attrgetter

/usr/local/lib/python3.10/dist-packages/ecco/output.py in
9 import torch
10 from torch.nn import functional as F
---> 11 from sklearn import decomposition
12 from typing import Dict, Optional, List, Tuple, Union
13 from ecco.util import strip_tokenizer_prefix, is_partial_token

/usr/local/lib/python3.10/dist-packages/sklearn/init.py in
80 from . import _distributor_init # noqa: F401
81 from . import __check_build # noqa: F401
---> 82 from .base import clone
83 from .utils._show_versions import show_versions
84

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in
15 from . import version
16 from ._config import get_config
---> 17 from .utils import _IS_32BIT
18 from .utils._tags import (
19 _DEFAULT_TAGS,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/init.py in
20 from scipy.sparse import issparse
21
---> 22 from .murmurhash import murmurhash3_32
23 from .class_weight import compute_class_weight, compute_sample_weight
24 from . import _joblib

sklearn/utils/murmurhash.pyx in init sklearn.utils.murmurhash()

init.pxd in numpy.import_array()

ImportError: numpy.core.multiarray failed to import
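
The RuntimeError above says that a compiled extension (here inside scikit-learn) was built against a newer NumPy C-API than the NumPy that is installed. A common remedy is to upgrade NumPy or reinstall the packages so that their versions match, although the right combination depends on the environment. The sketch below only collects the installed versions for a bug report; installed_versions is a hypothetical helper and the package list is an assumption.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


report = installed_versions(["numpy", "scikit-learn", "ecco"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```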


Saliency Map for other transformer based models

Hi @jalammar

Thanks for creating this library. I follow your posts on transformer models, and you explain them well and in a simple way. I am working on BERT/BART models for various downstream NLP classification tasks (sentiment, intent, etc.) that use pre-trained large language models. Can this library produce saliency map visualizations for custom transformer models? If so, can you explain how?

Looking forward to your reply

Thanks

Ajay sahu

[Bug/improvement] Issue with saliency maps in loop

Hey,

In Colab, I am trying to generate saliency maps in a loop to study variations of similar inputs, but when the maps generated in the loop are printed, they are stuck next to each other, which makes the visualisation quite confusing.
See screenshot.

Screen Shot 2020-12-29 at 20 13 34

Error message of run_nmf when using a model with activations=False

If I load a model like this

lm = ecco.from_pretrained('gpt2')

without specifically setting activations=True

The error I get when trying to run NMF is not very clear. Specifically, it shows

/usr/local/lib/python3.6/dist-packages/ecco/output.py in __init__(self, activations, n_input_tokens, token_ids, _path, n_components, tokens, **kwargs)
    498         from_layer = kwargs['from_layer'] if 'from_layer' in kwargs else None
    499         to_layer = kwargs['to_layer'] if 'to_layer' in kwargs else None
--> 500         if len(activations.shape) != 3:
    501             raise ValueError(f"The 'activations' parameter should have three dimensions: (layers, neurons, positions). "
    502                              f"Supplied dimensions: {activations.shape}", 'activations')

AttributeError: 'list' object has no attribute 'shape'

This seems like an easy mistake for a beginner to make, and the error could easily be improved by checking earlier that the output object has the properties required for running NMF.
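
An earlier check along those lines might look like the following sketch. check_activations and its messages are hypothetical and are not taken from Ecco's code; FakeTensor only stands in for a real tensor in this example.

```python
def check_activations(activations):
    """Fail early with a readable message if activations are unusable for NMF."""
    shape = getattr(activations, "shape", None)
    if shape is None:
        # A plain list here suggests no activations were collected.
        raise ValueError(
            "No activations were collected. Load the model with "
            "activations=True before running NMF."
        )
    if len(shape) != 3:
        raise ValueError(
            "Expected activations with three dimensions "
            f"(layers, neurons, positions), got shape {tuple(shape)}."
        )
    return tuple(shape)


class FakeTensor:
    """Stand-in for a tensor; only the shape attribute matters here."""
    shape = (6, 3072, 10)


print(check_activations(FakeTensor()))
```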

Running Eccomap for Pre Trained BertForMaskedLM

Hi, I was trying to run my pretrained model, for which I used the BertForMaskedLM model class from Hugging Face, but it gives me this error. Please help me resolve it. Thanks in advance.
image

attention head

Hi @jalammar, I tested some examples with Ecco, and I wanted to know if it is possible to change the head to view the activations for each one and for each layer?

Rankings_watch displaying wrong sequence

Hello,
I have a problem with the rankings_watch() function. I used a predefined GPT2 model and gave it the input "Today, the weather is". However, in the visualization, only the first token is shown although the model creates the output correctly:
image

Thank you for your help :D

bug -object has no attribute 'lm_head'-

I am testing the BioGPT model with your visualizer. After completing the configuration, I have the following bug: AttributeError: 'BioGptForCausalLM' object has no attribute 'lm_head'

T5 Generation Error

v0.1.0 gives the following error when attempting to generate using T5:
TypeError: ones(): argument 'size' must be tuple of ints, but found element of type Tensor at pos 1

AttributeError: 'OutputSeq' object has no attribute 'saliency'

captum 0.5.0
torch 1.13.0+cu117

Language_Models_and_Ecco_PyData_Khobar.ipynb

text= "The countries of the European Union are:\n1. Austria\n2. Belgium\n3. Bulgaria\n4."
output_3 = lm.generate(text, generate=20, do_sample=True)
output_3.saliency()

AttributeError Traceback (most recent call last)
Cell In [13], line 1
----> 1 output_3.saliency()

AttributeError: 'OutputSeq' object has no attribute 'saliency'

Support for Distributing Model on Multiple GPUs

I'm working with a pretty big model (GPTNeo1.3B) and the gradient attribution features cause CUDA out-of-memory errors on the GPUs I have access to.

I know that the huggingface from_pretrained method takes an optional parameter 'device_map' that, if set to "auto", will split the model params across all available GPUs. The syntax is:

from transformers import AutoModelForSeq2SeqLM
t0pp = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", device_map="auto")

I was wondering if there was a similar feature in ecco? If not, could I submit a PR sometime in the near future?

minGPT support

minGPT is a minimal PyTorch re-implementation of OpenAI GPT by @karpathy: https://github.com/karpathy/minGPT

It's extremely readable and great for teaching purposes. I was wondering if you would consider extending ecco to also support it?

If I was to work on it, is it something you'd prefer to see in this repository or in a separate "ecco-mingpt" repository?

Huggingface v4 support

Adding Huggingface transformers v4 support is work in progress in https://github.com/jalammar/ecco/tree/hf4-support. You can install it via:

!pip install git+https://github.com/jalammar/ecco.git@hf4-support

  • Token generation, and NMF on activations should work.
  • Input saliency could be having issues with PyTorch v1.7.1; backward() breaks for some reason.
  • rankings, rankings_watch, and layer_prediction need a small fix since v4 changed the dimensions of the hidden state vector. They just need to account for the additional dimension. I still haven't decided whether to ignore the batch dimension, or support it and have the visualizations only show the first element in the batch (or have the user select the element number). Open to ideas.

KeyError: 'tokenizer_config'

I am working on integrating my custom model vinai/bertweet-base with Ecco; however, I ran into the following issue:

Traceback (most recent call last):
  File "experiment_ecco.py", line 44, in <module>
    nmf_1.explore()
  File "C:\Users\XXXX\anaconda3\envs\disaster_tweets\lib\site-packages\ecco\output.py", line 827, in explore
    }})"""
KeyError: 'tokenizer_config'

I created the lm in the following way:

# loading in tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True, use_fast=False)
model = torch.load("bertmodel.pth")
''' 
this model is obtained from training 
AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", 
output_hidden_states=True, output_attentions=True, num_labels=2)
'''

model_config = {
    'embedding': 'roberta.embeddings.word_embeddings',
    'type': 'mlm',
    'activations': [r'intermediate\.dense'],
    'token_prefix': '',
    'partial_token_prefix': ''
}

lm = LM(model=model, tokenizer=tokenizer, model_name="vinai/bertweet-base",
        config=model_config, collect_activations_flag=True, verbose=True)

tweet = "So running down the stairs was a bad idea full on collided... With the floor ??"
inputs = lm.tokenizer([tweet], return_tensors="pt")
output = lm(inputs)

nmf_1 = output.run_nmf(n_components=8)
nmf_1.explore()

upon further inspection I believe the error comes from the following line:

js = f"""
         requirejs(['basic', 'ecco'], function(basic, ecco){{
            const viz_id = basic.init()
            
            ecco.interactiveTokensAndFactorSparklines(viz_id, {data},
            {{
            'hltrCFG': {{'tokenization_config': {json.dumps(self.config['tokenizer_config'])}
                }}
            }})
         }}, function (err) {{
            console.log(err);
        }})"""

I could not traceback the origin of tokenizer_config. Therefore I assume it also has to be passed in the model_config for a custom model? If so, this needs to be specified in the docs.

Or could this issue be related in a strange way to #65
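
Until the library stops reading that key, a custom config can be checked up front so the failure shows up before any visualization code runs. The sketch below is hypothetical: missing_config_keys is not an Ecco function, and the list of expected keys is simply the keys from the model_config above plus the 'tokenizer_config' key that explore() reads.

```python
# Keys taken from the model_config in this issue, plus the key read by
# explore(). Treating them all as required is an assumption of this sketch.
EXPECTED_KEYS = (
    "embedding",
    "type",
    "activations",
    "token_prefix",
    "partial_token_prefix",
    "tokenizer_config",
)


def missing_config_keys(config, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from a model config dict."""
    return [key for key in expected if key not in config]


model_config = {
    "embedding": "roberta.embeddings.word_embeddings",
    "type": "mlm",
    "activations": [r"intermediate\.dense"],
    "token_prefix": "",
    "partial_token_prefix": "",
}
missing = missing_config_keys(model_config)
print(missing)
```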

Help understanding position arg of layer_predictions

In the gpt2 model, I am measuring the distribution of calendar dates.

import ecco
lm = ecco.from_pretrained('gpt2')
output_0 = lm.generate("On January", generate=1, do_sample=False)

I assumed that to read predictions for the next token, I would need either position=0 or position=2 depending on whether it referred to the 0th token of the full string or the generated output. I was surprised to see these return the same tokens and probabilities:

output_0.layer_predictions(position=0, layer=11, topk=5)
output_0.layer_predictions(position=2, layer=11, topk=5)

If I query position=1 then I see 'the' and other tokens which might follow "On " in the original sentence.

output_0.layer_predictions(position=1, layer=11, topk=5)

Error when generate using T5

Hi,

I installed the ecco package from this repo and ran the following code:
lm = ecco.from_pretrained('t5-small', activations=True)
text= "translate English to Portuguese: I like to eat rice."
output = lm.generate(text, generate=1, do_sample=True)

However, I got the following error:

ModuleAttributeError: 'Embedding' object has no attribute 'shape'

After doing some digging I found that the problem is that you are treating the self.model_embeddings as a Tensor, however, it is a nn.Embedding (so you need to do self.model_embeddings.weight). I also found other problems:

  • The model expects tensors of the type (batch, seq_len, hidden_dim) and the code is passing only (seq_len, hidden_dim)
  • The T5 model should be loaded as AutoModelForSeq2SeqLM to have outputs, and the decoding needs the decoder_input_ids

GPT Generation error on GPU

Generating text with GPT2 on GPU throws the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

IndexError when using analysis.pwcca

OS Info

Operating System: Ubuntu 18.04.6 LTS
Kernel: Linux 5.4.0-100-generic
Architecture: x86-64

Environment Info

python 3.6
ecco==0.1.2

Replicate

import numpy as np
import ecco.analysis as analysis
a = np.random.rand(4, 200)
b = np.random.rand(8, 200)
analysis.pwcca(a, b)  # This works
analysis.pwcca(b, a)  # It fails here

Error:

~/Documents/TOMMAS/venv/lib/python3.6/site-packages/ecco/svcca_lib/pwcca.py in compute_pwcca(acts1, acts2, epsilon)
47 else:
48 dirns = np.dot(sresults["coef_y"],
---> 49 (acts1[sresults["y_idxs"]] -
50 sresults["neuron_means2"][sresults["y_idxs"]])) + sresults["neuron_means2"][sresults["y_idxs"]]
51 coefs = sresults["cca_coef2"]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 8 but corresponding boolean dimension is 4

I believe ecco/svcca_lib/pwcca.py line 49 is supposed to be acts2 instead of acts1.

Remove `tokenizer_config` usage from the library

This config parameter was made to easily package config to send to the Javascript components. Ecco now handles all tokenization on the Python side to separate the concerns between the python and JS components. Subsequently, this needs to be removed.

Support for Constrained Beam Search?

Hola! Great project, thank you for making it available!

I am curious now that beam search is supported, is it possible to pass constraints to ecco using for example PhrasalConstraints, to constrain T5 model output for a sequence-to-sequence task prior to ecco evaluation?

Thanks in advance!

Add support for more attribution methods

Hi,
Currently, the project seems to be relying on grad-norm and grad-x-input to obtain the attributions. However, there are other arguably better (as discussed in recent work) methods to obtain saliency maps. Integrating them in this project would also provide a good way to compare them on the same input examples.

Some of these methods, off the top of my head, are integrated gradients, gradient Shapley, and LIME. Perhaps support for visualizing the attention map of the model being interpreted could also be added. Methods based on feature ablation are also possible, but they might need more work to integrate.

There is support for these aforementioned methods on Captum, but it takes effort to get them working for NLP tasks, especially those based on language modeling. Thus, I feel this would be a useful addition here.
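
To make the comparison concrete, here is a toy, framework-free sketch of two of the methods named above, gradient-times-input and integrated gradients, on a small function with a hand-written gradient. The function f, the baseline, and the step count are assumptions chosen for illustration; a real integration would obtain gradients from the model via autograd.

```python
def f(x):
    """Toy scalar 'model': f(x) = x0^2 + 3*x1."""
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    """Analytic gradient of f."""
    return [2.0 * x[0], 3.0]


def grad_x_input(x):
    """Gradient-times-input attribution."""
    return [xi * gi for xi, gi in zip(x, grad_f(x))]


def integrated_gradients(x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
gxi = grad_x_input(x)
ig = integrated_gradients(x, baseline)
print(gxi, ig, f(x) - f(baseline))
```

Integrated gradients satisfies a completeness property: the attributions sum to f(x) - f(baseline). Gradient-times-input offers no such guarantee, which is one reason the two can disagree on the same input.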

token prefix in roberta model?

I am trying to use a custom-trained RoBERTa model by loading the config file, but I get an error saying the token prefix is not present in the config. Any idea how to fix it?
Screenshot 2022-02-02 at 3 30 31 PM

Tokenizer has partial token suffix instead of prefix

Following your guide for identifying model configuration

MODEL_ID = "vinai/bertweet-base"

from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, normalization=True, use_fast=False)
ids= tokenizer('tokenization')
ids

returns:

{'input_ids': [0, 969, 6186, 6680, 2], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

Then

tokenizer.convert_ids_to_tokens(ids['input_ids'])

returns:

['<s>', 'to@@', 'ken@@', 'ization', '</s>']

Here I noticed that the tokenizer adds a partial token suffix instead of partial token prefix. Having a suffix instead of prefix is not configurable in the config.
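
For illustration, suffix-style partial tokens like these could be merged back into words as in the sketch below. merge_suffix_tokens is a hypothetical helper, and the default "@@" suffix and special-token names are taken from the output above; none of this reflects an existing Ecco option.

```python
def merge_suffix_tokens(tokens, suffix="@@", special_tokens=("<s>", "</s>")):
    """Join tokens that end with a continuation suffix onto the next token."""
    words = []
    current = ""
    for token in tokens:
        if token in special_tokens:
            continue
        if token.endswith(suffix):
            # The word continues in the next token; drop the marker.
            current += token[: -len(suffix)]
        else:
            words.append(current + token)
            current = ""
    if current:
        # Trailing partial token with nothing after it.
        words.append(current)
    return words


tokens = ["<s>", "to@@", "ken@@", "ization", "</s>"]
words = merge_suffix_tokens(tokens)
print(words)
```

With a prefix-style tokenizer the marker sits on the continuation token, while here it sits on the token that is being continued, so a single prefix setting cannot express both cases.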

Add a `conda` install option for `ecco`

A conda install option for ecco could be helpful for two reasons:

  1. Easy installation with version management with conda.
  2. Other libraries that depend on ecco can only be published on the conda-forge channel if ecco is also available there.

💡 I have already started work on this. PR: conda-forge/staged-recipes#17388

Once the PR gets merged, you will be able to install ecco as:

conda install -c conda-forge ecco

I will send a PR to update your documentation, once the PR gets merged.

Add Beam Search Support

I'm not finding any way to do beam search using ecco, can you add the same in documentation?

Not support transformers 4.2.2?

I tried to import ecco and transformers at the same time but I got an error like follows:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-585e05415cf8> in <module>
      1 import os
      2 import numpy as np
----> 3 import ecco
      4 import torch
      5 from transformers import GPT2LMHeadModel

~/anaconda3/lib/python3.7/site-packages/ecco/__init__.py in <module>
      1 __version__ = '0.0.12'
----> 2 from ecco.lm import LM, MockGPT, MockGPTTokenizer
      3 from transformers import AutoTokenizer, AutoModelForCausalLM
      4 
      5 def from_pretrained(hf_model_id,

~/anaconda3/lib/python3.7/site-packages/ecco/lm.py in <module>
     11 from ecco.attribution import *
     12 from typing import Optional, Any
---> 13 from transformers.modeling_gpt2 import GPT2Model
     14 
     15 

ModuleNotFoundError: No module named 'transformers.modeling_gpt2'

But everything works if I comment out import ecco. Will ecco support transformers 4.2.2 in the near future?
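
For background, transformers v4 moved the per-model modules under transformers.models, so modeling_gpt2 now lives at transformers.models.gpt2.modeling_gpt2. A version-tolerant import could be sketched as below; falling back to None when transformers is not installed is a choice made for this example, not something ecco does.

```python
try:
    # Location used by transformers v4 and later.
    from transformers.models.gpt2.modeling_gpt2 import GPT2Model
except ImportError:
    try:
        # Location used by transformers v3 and earlier.
        from transformers.modeling_gpt2 import GPT2Model
    except ImportError:
        # transformers (or its backend) is unavailable in this environment.
        GPT2Model = None

print("GPT2Model available:", GPT2Model is not None)
```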

Best,

output.saliency() displays nothing

I am trying to visualize saliency maps from a custom GPT model. Since I am concerned only about saliency maps, I just do the following:

out = OutputSeq(token_ids = input_ids, n_input_tokens = n_input_tokens, tokens = tokens, attribution = attr)
out.saliency()

I get no errors and nothing is displayed in the jupyter notebook, but when I open Chrome's Javascript console, I see the following thing.

(unknown) Ecco initialize.

l @ storage.googleapis.c…ust=1610606118793:1
(anonymous) @ storage.googleapis.c…ust=1610606118793:1
autoTextColor @ storage.googleapis.c…ust=1610606118793:1
(anonymous) @ storage.googleapis.c…ust=1610606118793:1
(anonymous) @ d3js.org/d3.v5.min.j…ust=1610606118793:2
each @ d3js.org/d3.v5.min.j…ust=1610606118793:2
style @ d3js.org/d3.v5.min.j…ust=1610606118793:2
enter @ storage.googleapis.c…ust=1610606118793:1
(anonymous) @ storage.googleapis.c…ust=1610606118793:1
join @ d3js.org/d3.v5.min.j…ust=1610606118793:2
setupTokenBoxes @ storage.googleapis.c…ust=1610606118793:1
init @ storage.googleapis.c…ust=1610606118793:1
eval
execCb @ require.js:1693
check @ require.js:881
enable @ require.js:1173
init @ require.js:786
(anonymous) @ require.js:1457

DevTools failed to load SourceMap: Could not load content for http://localhost:8888/static/notebook/js/main.min.js.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE
DevTools failed to load SourceMap: Could not load content for https://storage.googleapis.com/wandb-cdn/production/d4e2434e6/raven.min.js.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE

How do I resolve this issue? Btw, I am running this notebook by sshing into my institute's remote machine.
