marcotcr / checklist
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
License: MIT License
hello! 👋
I didn't see in the paper or the repo the sources for the lists of names, locations, etc. (e.g., names.json or lexicons/basic.json). Could you share how you put these together?
Thanks for publishing your work in the open! 👍
The while loop in the Perturb.strip_punctuation function becomes an endless loop for inputs like ':'.
checklist/checklist/perturb.py
Line 118 in e640d79
To reproduce it:
import spacy
from checklist.perturb import Perturb
model_path = '' # Spacy model path
nlp = spacy.load(model_path)
sent = nlp(':')
Perturb.strip_punctuation(sent) # Endless loop!
I checked the code and found that doc[-1].pos_ after stripping the last token ':' was always PUNCT... it seems like a spaCy bug.
To avoid this, I suggest checking the length of doc in the while condition and returning when the length of doc becomes 0:
while len(doc) and doc[-1].pos_ == 'PUNCT':
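In context, the guarded loop would look something like this (the loop body is my paraphrase of what strip_punctuation seems to do, not the exact source):
while len(doc) and doc[-1].pos_ == 'PUNCT':
    doc = doc[:-1]  # drop the trailing punctuation token; the length check stops the loop once doc is empty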
When trying the "Multilingual suggestions" example, an error occurs:
/usr/local/lib/python3.6/dist-packages/checklist/text_generation.py in unmask(self, text_with_mask, beam_size, candidates)
180 else:
181 if forbid:
--> 182 v, top_preds = torch.topk(outputs[i, masked[size], self.with_space], beam_size + 10)
183 top_preds = self.with_space[top_preds]
184 else:
RuntimeError: selected index k out of range
In INV and DIR tests, every example is a pair of sentences, one original and one changed. When I use Suite.summary() the output is fine, but when I use Suite.Summarizer() in Jupyter to show the results, the predicted label and confidence of the two sentences are in reversed order for every INV and DIR test.
For one example (original: "I'm the guy.", model prediction 1), I apply add_typos to this sentence and get (changed: "I'm eth guy.", model prediction 0).
If I use Suite.summary(), I get the correct result:
Example fails: 1 (0.8) I'm the guy. 0 (0.9) I'm eth guy.
But with Suite.Summarizer(), the result shown in Jupyter is:
I'm the guy. → I'm eth guy. Pred: 0 (0.9) → 1 (0.8)
I can't find where the bug happens, so please help me debug it. Thanks!
Writing custom expectation aggregate functions for test cases results in an attribute error.
The following example demonstrates this:
import checklist
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from checklist.pred_wrapper import PredictorWrapper
from checklist.expect import Expect
from checklist.test_types import INV
from checklist.perturb import Perturb
dataset = [
    'I am checking the checklist',
    'There is a bug in the code',
]

class Model(object):
    THRESHOLD = 0.9

    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
        self.model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")

    def _mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

    def get_encoding(self, sentences):
        encoded_input = self.tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
        with torch.no_grad():
            model_output = self.model(**encoded_input)
        return self._mean_pooling(model_output, encoded_input['attention_mask'])

    def get_similarities(self, sentence1, other_sentences):
        e1 = self.get_encoding(str(sentence1))
        e2 = self.get_encoding([str(x) for x in other_sentences])
        return np.squeeze(cosine_similarity(e1, e2))

def similarity_score(inputs):
    all_preds = list()
    for sentence1, other_sentences in inputs:
        scores = model.get_similarities(sentence1, other_sentences)
        all_preds.append(scores)
    return np.array(all_preds)

def all_similar(x, pred, conf, label=None, meta=None):
    """if any of the results is below the threshold, the testcase doesn't pass"""
    ret = np.sum(pred < Model.THRESHOLD) == 0
    print(f'pred = {pred}, ret = {ret}')
    return ret

def add_typos(sentence, n=5):
    typos = []
    for i in range(n):
        typos.append(Perturb.perturb([sentence], Perturb.add_typos, keep_original=False))
    return sentence, typos

wrapped_pp = PredictorWrapper.wrap_predict(similarity_score)
expect_all_similar = Expect.single(all_similar)
model = Model()
t = Perturb.perturb(dataset, add_typos, nsamples=200, keep_original=False)
test = INV(**t, name='add typos', capability='typo',
           description='', expect=expect_all_similar, agg_fn=expect_all_similar)
test.run(predict_and_confidence_fn=wrapped_pp, overwrite=True, verbose=True)
test.summary()
This results in the following exception:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-38-0b1b8f7c7467> in <module>
1 test = INV(**t, name='add typos', capability='typo',
2 description='', expect=expect_all_similar, agg_fn=expect_all_similar)
----> 3 test.run(predict_and_confidence_fn=wrapped_pp, overwrite=True, verbose=True)
4 test.summary()
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in run(self, predict_and_confidence_fn, overwrite, verbose, n, seed)
351 print('Predicting %d examples' % len(examples))
352 preds, confs = predict_and_confidence_fn(examples)
--> 353 self.run_from_preds_confs(preds, confs, overwrite=overwrite)
354
355 def fail_idxs(self):
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in run_from_preds_confs(self, preds, confs, overwrite)
291 self._check_create_results(overwrite)
292 self.update_results_from_preds(preds, confs)
--> 293 self.update_expect()
294
295 def run_from_file(self, path, file_format=None, format_fn=None, ignore_header=False, overwrite=False):
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in update_expect(self)
128 self._check_results()
129 self.results.expect_results = self.expect(self)
--> 130 self.results.passed = Expect.aggregate(self.results.expect_results, self.agg_fn)
131
132 def example_list_and_indices(self, n=None, seed=None):
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in aggregate(data, agg_fn)
145 # data is a list of lists or list of np.arrays
146 # import pdb; pdb.set_trace()
--> 147 return np.array([Expect.aggregate_testcase(x, agg_fn) for x in data])
148
149 @staticmethod
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in <listcomp>(.0)
145 # data is a list of lists or list of np.arrays
146 # import pdb; pdb.set_trace()
--> 147 return np.array([Expect.aggregate_testcase(x, agg_fn) for x in data])
148
149 @staticmethod
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in aggregate_testcase(expect_results, agg_fn)
160 return None
161 else:
--> 162 return agg_fn(np.array(r))
163
164 @staticmethod
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in expect(self)
75 """
76 def expect(self):
---> 77 zipped = iter_with_optional(self.data, self.results.preds, self.results.confs, self.labels, self.meta, self.run_idxs)
78 return [fn(x, pred, confs, labels, meta) for x, pred, confs, labels, meta in zipped]
79 return expect
AttributeError: 'numpy.ndarray' object has no attribute 'results'
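Judging by the traceback, agg_fn receives the array of per-example expectation results for one test case, not the test object, so passing the Expect.single-wrapped function there fails when it looks for .results. A workaround that seems to avoid the error (my assumption, not a confirmed fix) is to leave agg_fn at its default, or pass a plain function over that array, e.g.:
def agg_all_pass(expect_results):
    # expect_results: array of per-example outcomes for one test case
    return np.all(expect_results)

test = INV(**t, name='add typos', capability='typo',
           description='', expect=expect_all_similar, agg_fn=agg_all_pass)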
Is it possible to output CheckList results in some structured format, e.g. HTML or XML?
We'd like to integrate CheckList as a CI step and want to be able to persist the model performance.
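For instance, something along these lines is what we have in mind (a rough sketch; it assumes suite.tests maps test names to tests and that test.results.passed holds per-testcase outcomes, which may not match the actual internals):
import json

def suite_to_json(suite, path):
    # persist per-test pass/fail counts so a CI step can diff runs
    out = {}
    for name, test in suite.tests.items():
        passed = [bool(p) for p in test.results.passed if p is not None]
        out[name] = {'testcases': len(passed), 'fails': passed.count(False)}
    with open(path, 'w') as f:
        json.dump(out, f, indent=2)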
Hi there :) Just wondering if you have any suggestions, in the context of sentiment analysis, on how to create a new capability concerning irony and sarcasm.
Given the three types of tests, the most suitable starting point would be an MFT, but what about the labels? If "irony datasets" contain tweets only marked as ironic or not, the tweets could be either negative or positive with respect to sentiment, so what I thought of is using the expect function is_not_1, expecting the label NOT to be neutral, as in the sketch below. But I am not convinced by this solution.
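Concretely, something like this is what I have in mind (a sketch, assuming label 1 means neutral and ironic_tweets is a list of ironic examples):
from checklist.test_types import MFT
from checklist.expect import Expect

# expectation: the prediction is anything but neutral (label 1)
is_not_1 = Expect.single(lambda x, pred, conf, label=None, meta=None: pred != 1)
test = MFT(ironic_tweets, expect=is_not_1,
           name='ironic tweets are not neutral', capability='irony')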
Hi,
I tried to replicate the transformers results for sentiment analysis. To do that, I created results based on the tutorial at this link: https://github.com/marcotcr/checklist/blob/master/notebooks/tutorials/5.%20Testing%20transformer%20pipelines.ipynb. I compared them with the release_data/sentiment/predictions/bert file. I expected the same results but got different ones. Why is this happening?
Thanks for your work! perturb.py is for English only; are there any plans for the API to support other languages?
Hi,
What is the source of the SQuAD model used with this API? (taken from the SQuAD tutorial)
model = bert_squad_model.BertSquad()
invert = lambda a: model.predict_pairs([(x[1], x[0]) for x in a])
new_pp = PredictorWrapper.wrap_predict(invert)
Can I use a transformer model that is fine-tuned on a downstream task as the masked language model for the Editor class?
I have to say it's really a good job! I ran into a problem when running the demo (Multilingual suggestions by RoBERTa). I tried setting the language to English and to Chinese, but both fail with "RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)". I have no idea how to solve it, and it has bothered me for a long time; could you give me some suggestions? I would appreciate it. By the way, my environment is as follows:
Hi!
In the paper you describe some invariant tests for NER models, however, in the library I can't find any code to this. Could you share this as well? Thanks :)
checklist/checklist/text_generation.py
Line 89 in d26abda
This line seems to assume RoBERTa-style tokenization, where a special character (Ġ) marks a token that occurs at the beginning of a word, but it fails for BERT-style tokenization, which uses a special prefix (##) to mark tokens NOT at the beginning of a word. It would also fail if the tokenizer is uncased (John -> john). I can't really see a way to fix it without knowing something about the different model names, though.
Thank you for sharing such a great tool, it's really useful and amazing!
When I used Perturb.change_names to get an INV test case, it could not perturb the names in my Chinese dataset.
Does this module currently support multiple languages? Can you tell me how to use it?
I ran into a lot of trouble following the guide in README.md; none of the demo code could be replicated on my computer. I don't know whether there are hidden bugs or it's only an issue on my end. Could you give us a more detailed environment description, including the required libraries and their versions?
Hi,
What's the easiest way to load a fine-tuned BERT model into the CheckList process? It seems that models are wrapped in a ModelWrapper class, and that wrapper is not available.
Thanks.
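For what it's worth, the pattern I am currently trying looks like this (a sketch, assuming a Hugging Face pipeline with softmax outputs; the model path is a placeholder):
import numpy as np
from transformers import pipeline
from checklist.pred_wrapper import PredictorWrapper

pipe = pipeline('sentiment-analysis', model='path/to/finetuned-bert',
                return_all_scores=True)

def predict_proba(texts):
    # return an (n_examples, n_classes) array of class probabilities
    preds = pipe(list(texts))
    return np.array([[d['score'] for d in p] for p in preds])

wrapped_pp = PredictorWrapper.wrap_softmax(predict_proba)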
The code snippet pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, framework="pt", device=0) in cell #2 of checklist/notebooks/tutorials/4. The CheckList process.ipynb fails on macOS and other systems that do not have a GPU with the following error: AssertionError: Torch not compiled with CUDA enabled. GPU devices are 0-indexed, and changing the parameter device=0 to device=-1 resolves the problem. As there is no explicit requirement to have a GPU, it is perhaps better to change this parameter value so that the notebook runs on all systems, including those without a GPU.
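Alternatively, the device could be picked at runtime, e.g. (a small sketch):
import torch
device = 0 if torch.cuda.is_available() else -1  # GPU 0 if available, else CPU
pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer,
                framework="pt", device=device)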
Hi,
As in your tutorials, Perturb.perturb usually takes one function (for example, Perturb.change_names). But in my scenario I want to pass two functions to Perturb.perturb, so that it returns sentences that are a mixture of func1 and func2.
Example:
func1: change_names
func2: change_city
I would like it to return sentences in which the name is changed by func1 and/or the city is changed by func2 (something like Perturb.perturb(data, Perturb.change_names, Perturb.change_city)). Otherwise, I can still use Perturb.perturb(data, Perturb.change_names) and Perturb.perturb(data, Perturb.change_city) independently.
How can I do that?
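One pattern that might work (a sketch; it assumes data is already processed with spaCy, e.g. list(nlp.pipe(texts)), and uses the library's change_location in place of the hypothetical change_city):
def change_names_or_city(doc):
    # collect perturbations from both functions for a single spaCy doc
    ret = []
    for fn in [Perturb.change_names, Perturb.change_location]:
        new = fn(doc)
        if new:
            ret.extend(new)
    return ret

t = Perturb.perturb(data, change_names_or_city)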
The following lines from tutorial 4 use a hard-coded path to import a module that is not part of the repo:
sys.path.append('/home/marcotcr/work/ml-tests/')
from mltests import model_wrapper
model = model_wrapper.ModelWrapper()
import checklist
from checklist.editor import Editor
editor = Editor()
ret = editor.suggest('I am a {mask} {mask}.', nsamples=5)
print(ret)
It should generate 5 samples.
It generates 11000 samples.
Note that editor.template() correctly handles nsamples; only editor.suggest() is broken.
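As a stopgap (just a workaround guess, not the intended fix), the returned list can be truncated manually:
ret = editor.suggest('I am a {mask} {mask}.')[:5]  # manual cap until nsamples is honored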
Hi,
First of all, thank you for sharing your work.
How can I get labelled data (text, ground truth) to calculate uncertainty in these tasks? I couldn't see any function for it. In the tests_n500 file there are only sample texts without any labels.
Thanks in advance
Best
Hi!
First of all, thanks for your work, it is very inspiring.
I would like to run tests on another test set, semantically different from the airline-related tweets (in particular, I would like to use data from this competition https://amievalita2018.wordpress.com/, which collects misogynistic tweets, in order to explore the fairness of the models).
To do this, do I just have to replace the tests_n500 file and put into the "predictions/" folder a file containing the predictions in the usual format (i.e. the 0/1/2 label and the three probabilities)?
Excuse the beginner's question :) Thanks a lot!
jupyter nbextension install --py --user checklist.viewer
Results in:
Installing test/lib/python3.7/site-packages/checklist/viewer/static -> viewer
Making directory: user/Library/Jupyter/nbextensions/viewer/
Traceback (most recent call last):
File "test/bin/jupyter-nbextension", line 8, in <module>
sys.exit(main())
File "test/lib/python3.7/site-packages/jupyter_core/application.py", line 270, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "test/lib/python3.7/site-packages/traitlets/config/application.py", line 664, in launch_instance
app.start()
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 983, in start
super(NBExtensionApp, self).start()
File "test/lib/python3.7/site-packages/jupyter_core/application.py", line 259, in start
self.subapp.start()
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 711, in start
self.install_extensions()
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 690, in install_extensions
**kwargs
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 220, in install_nbextension_python
destination=dest, logger=logger
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 187, in install_nbextension
os.makedirs(dest_dir)
File "test/bin/../lib/python3.7/os.py", line 221, in makedirs
mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'user/Library/Jupyter/nbextensions/viewer/'
This is on macOS, in a virtualenv for Python 3.
The following code gives me an error:
from checklist.editor import Editor
editor = Editor()
exs = ['hello', 'good bye']
labels = [0, 1]
ret = editor.template('{ex}', ex=exs, labels=labels, save=True, meta=True)
Based on the documentation I expect labels to accept a list of ints.
pip install keeps failing with the following message:
ERROR: Command errored out with exit status 1:
command: 'd:\python\vir37v1\scripts\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'D:\Users\CAIZ1\AppData\Local\Temp\pip-install-by_ojkpx\checklist\setup.py'"'"'; file='"'"'D:\Users\CAIZ1\AppData\Local\Temp\pip-install-by_ojkpx\checklist\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base 'D:\Users\CAIZ~1\AppData\Local\Temp\pip-pip-egg-info-hi8iwmvg'
The following line from the tutorial causes an error:
suite.run_from_file(pred_path, overwrite=True, file_format='pred_only')
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6884: character maps to <undefined>
I fixed the issue by changing abstract_test.py line 15 to:
f = open(path, 'r', encoding='utf-8')
The editor and perturb modules are currently missing documentation - is this still a work in progress? It would be very nice to be able to browse through the various methods and types of these modules, just like the expect module currently allows, so I just wanted to point this out.
Thanks for this awesome library!
Should the old prediction be on the left side of the arrow?
predTag = <Tag style={{verticalAlign: "middle"}}>
Pred: <span className="example-token rewrite-remove">{newobj.pred}{confStr}</span>
{replaceArrow}
<span className="example-token rewrite-add">{oldobj.pred}{confStrOld}</span>
</Tag>
Where can I find the file loaded by spacy_map = pickle.load(open('/home/marcotcr/tmp/processed_qqp.pkl', 'rb'))?
env:
transformers-2.8.0
torch-1.7.0
@marcotcr
Hi, I have tried to install checklist on Windows 10 but received an error I cannot resolve. First of all, it fails at this line in the setup file: check_call([f"{sys.executable} -m pip install jupyter"], shell=True). After removing this line, I get the following error:
error: symbolic link privilege not held
I was wondering whether your package works on Windows. If it does, what am I missing here?
Thanks
Hi,
First of all, thanks for your great work and this very useful library.
I am looking to test NER models (Transformer- and LSTM-based), and I would like to know if you have any example/code of how to test such models.
I haven't found any, even in notebook 5, Testing transformer pipelines.
I guess the key is to be able to write an expectation function at the token level, as in the sketch below? Maybe you have already explored something?
Many thanks!
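To make the question concrete, something like this is what I mean (a rough sketch, assuming predictions are per-token tag sequences aligned between the original and the perturbed sentence):
from checklist.expect import Expect

def tags_unchanged(orig_pred, pred, orig_conf, conf, labels=None, meta=None):
    # token-level invariance: the predicted tag sequence should not change
    return list(orig_pred) == list(pred)

expect_fn = Expect.pairwise(tags_unchanged)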
Thanks for releasing this useful product; I recently adapted it to Chinese. If I want to use it to check/test other models that do not use a softmax in the last layer (like NER models), or that do not produce one probability vector per example, what changes should I make to adapt to these models?
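It looks like PredictorWrapper.wrap_predict might fit models that only return predictions (an assumption on my part; confidences then seem to be filled in with a constant):
from checklist.pred_wrapper import PredictorWrapper

def predict_fn(texts):
    # hypothetical stand-in for a model that returns one tag/label per input
    return [0 for _ in texts]

wrapped_pp = PredictorWrapper.wrap_predict(predict_fn)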
When trying to use the visual summary functionality on a TestSuite, I ran into an issue with loading examples: I get the error message ValueError: Can't clean for JSON: array([1.]). I get this both when using suite.visual_summary_table() and when using suite.visual_summary_by_test().
However, when I try suite.summary() it works fine and I get something like this:
NER test
Test cases: 100
Fails (rate): 4 (4.0%)
Example fails:
0.0 0.0 1.0 Ian Young cooked the burgers in some broth.
----
0.0 0.0 1.0 George Rogers cooked the meats in some broth.
----
0.0 0.0 1.0 Paul Brown cooked the chicken al dente.
----
where the three numbers before every sample are the probability scores (in the case of my model, these are always 1.0 or 0.0).
Is this expected behavior (am I doing something wrong?) or is it a bug?
See traceback from the visualization widget below -- note that the error is raised not when initially loading the widget but only once example fails are being loaded.
ValueError Traceback (most recent call last)
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/suite_summarizer.py in handle_events(self, _, content, buffers)
46 elif content.get('event', '') == 'switch_test':
47 testname = content.get("testname", "")
---> 48 self.on_select_test(testname)
49
50 def on_select_test(self, testname: str) -> None:
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/suite_summarizer.py in on_select_test(self, testname)
54 summary, testcases = self.select_test_fn(testname)
55 self.reset_summary(summary)
---> 56 self.reset_testcases(testcases)
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in reset_testcases(self, testcases)
46 self.filtered_testcases = testcases if testcases else []
47 self.tokenize_testcases()
---> 48 self.search(filter_tags=[], is_fail_case=True)
49
50 def handle_events(self, _, content, buffers):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in search(self, filter_tags, is_fail_case)
118 self.compute_stats_result(candidate_testcases_not_fail)
119 self.to_slice_idx = 0
--> 120 self.fetch_example()
121
122 def fetch_example(self):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in fetch_example(self)
126 new_examples = self.candidate_testcases[self.to_slice_idx : self.to_slice_idx+self.max_return]
127 self.to_slice_idx += len(new_examples)
--> 128 self.testcases = [e for e in new_examples]
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in __set__(self, obj, value)
583 raise TraitError('The "%s" trait is read-only.' % self.name)
584 else:
--> 585 self.set(obj, value)
586
587 def _validate(self, obj, value):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in set(self, obj, value)
572 # we explicitly compare silent to True just in case the equality
573 # comparison above returns something other than True/False
--> 574 obj._notify_trait(self.name, old_value, new_value)
575
576 def __set__(self, obj, value):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in _notify_trait(self, name, old_value, new_value)
1137 new=new_value,
1138 owner=self,
-> 1139 type='change',
1140 ))
1141
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in notify_change(self, change)
603 if name in self.keys and self._should_send_property(name, getattr(self, name)):
604 # Send new state to front-end
--> 605 self.send_state(key=name)
606 super(Widget, self).notify_change(change)
607
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in send_state(self, key)
487 state, buffer_paths, buffers = _remove_buffers(state)
488 msg = {'method': 'update', 'state': state, 'buffer_paths': buffer_paths}
--> 489 self._send(msg, buffers=buffers)
490
491
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in _send(self, msg, buffers)
735 """Sends a message to the model in the front-end."""
736 if self.comm is not None and self.comm.kernel is not None:
--> 737 self.comm.send(data=msg, buffers=buffers)
738
739 def _repr_keys(self):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/comm/comm.py in send(self, data, metadata, buffers)
121 """Send a message to the frontend-side version of this comm"""
122 self._publish_msg('comm_msg',
--> 123 data=data, metadata=metadata, buffers=buffers,
124 )
125
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/comm/comm.py in _publish_msg(self, msg_type, data, metadata, buffers, **keys)
63 data = {} if data is None else data
64 metadata = {} if metadata is None else metadata
---> 65 content = json_clean(dict(data=data, comm_id=self.comm_id, **keys))
66 self.kernel.session.send(self.kernel.iopub_socket, msg_type,
67 content,
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
175
176 if isinstance(obj, list):
--> 177 return [json_clean(x) for x in obj]
178
179 if isinstance(obj, dict):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in <listcomp>(.0)
175
176 if isinstance(obj, list):
--> 177 return [json_clean(x) for x in obj]
178
179 if isinstance(obj, dict):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
175
176 if isinstance(obj, list):
--> 177 return [json_clean(x) for x in obj]
178
179 if isinstance(obj, dict):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in <listcomp>(.0)
175
176 if isinstance(obj, list):
--> 177 return [json_clean(x) for x in obj]
178
179 if isinstance(obj, dict):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
195
196 # we don't understand it, it's probably an unserializable object
--> 197 raise ValueError("Can't clean for JSON: %r" % obj)
ValueError: Can't clean for JSON: array([1.])
When trying to install checklist, I run into the error OSError: mysql_config not found. This happens during pip install mysqlclient, which is a dependency of the pattern package.
The simple solution, of course, is to install MySQL. Unfortunately, this is not possible in many shared university computing environments due to security risks. As far as I can tell, MySQL is not used anywhere in this tool. Is there any workaround to install checklist in an environment without MySQL?
Thank you for your useful work!
If I run:
from checklist.perturb import Perturb
I get this output:
error: BadZipFile: File is not a zip file
Can you tell me how to solve this problem?
Hello, when I run
from checklist.perturb import Perturb
I get an error: BadZipFile: File is not a zip file
Can I use my own language models?
Is there any way I can run invariance tests from predictions saved in files? When I try to use test.run_from_file(), I get the following error:
AttributeError: 'INV' object has no attribute 'result_indexes'
When using the perturb functions to replace words with synonyms, this line causes a CUDA OOM error:
checklist/checklist/text_generation.py
Line 270 in 64a810a
The beam size is unbounded. Would it be possible to make it user-configurable when calling the antonyms/synonyms API, so that the memory cost is more controllable?
BTW, thank you for such great work!
Hi,
In the root directory of this repo there is an __init__.py file. I'm wondering whether this is intentional, because it causes Python to think that the repo itself is a Python package when it's not. For example, with a directory layout such as the following,
.
├── checklist (repo)
└── foo.py
and I import checklist.editor in foo.py, then Python looks for editor in the repo, which it cannot find, triggering a ModuleNotFoundError.
I encountered a couple of type errors on tensors in the unmask() function from text_generation.py while running the examples from the introduction page with Python/Torch on Windows 10. Fixing them required explicit type casting with .to(torch.int64).
The diff for the fix is as follows:
PS C:\src\checklist> git diff .\checklist\text_generation.py
diff --git a/checklist/text_generation.py b/checklist/text_generation.py
index b0ad20e..9a6a5ed 100644
--- a/checklist/text_generation.py
+++ b/checklist/text_generation.py
@@ -163,7 +163,7 @@ class TextGenerator(object):
# print('ae')
# print('\n'.join([tokenizer.decode(x) for x in to_pred]))
# print()
- to_pred = torch.tensor(to_pred, device=self.device)
+ to_pred = torch.tensor(to_pred, device=self.device).to(torch.int64) # fix for int32 / int64 type mismatch on win10
with torch.no_grad():
outputs = model(to_pred)[0]
for i, current in enumerate(current_beam):
@@ -179,7 +179,7 @@ class TextGenerator(object):
new = [(current[0] + [int(x[0])], float(x[1]) + current[1]) for x in zip(cands_to_use, scores)]
else:
if forbid:
- v, top_preds = torch.topk(outputs[i, masked[size], self.with_space], beam_size + 10)
+ v, top_preds = torch.topk(outputs[i, masked[size], self.with_space.to(torch.int64)], beam_size + 10) # fix for int32 / int64 type mismatch on win10
top_preds = self.with_space[top_preds]
else:
v, top_preds = torch.topk(outputs[i, masked[size]], beam_size + 10)
When I tried to use the perturb functions to replace words with synonyms, this error occurred. I checked the code and found that the variable orig_ret is not defined when no candidates are suggested.
checklist/checklist/text_generation.py
Line 278 in 64a810a
To solve this, I suggest defining the variable before assigning the value of new_ret to it.
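In other words, something along these lines (a sketch; new_ret here stands in for the value computed in the original code):
new_ret = None   # placeholder for the suggestions computed above
orig_ret = []    # suggested fix: define orig_ret up front so it always exists
if new_ret:
    orig_ret = new_ret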
Just a curiosity.
Is there a specific implementation reason why the fill-ins for the templates ({mask}) produced by CheckList are created from suggestions of RoBERTa and BERT? Given that these same two models are analysed and their shortcomings are highlighted in the paper, why rely (in some way) on """imperfect""" models?
Thanks a lot