marcotcr / checklist
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
License: MIT License
hello! 👋
I didn't see in the paper or the repo the sources for the lists of names, locations, etc. (e.g., names.json or lexicons/basic.json). Could you share how you put these together?
Thanks for publishing your work in the open! 👍
The while loop in the Perturb.strip_punctuation function becomes an endless loop for inputs like ':'.
checklist/checklist/perturb.py
Line 118 in e640d79
To reproduce it:
import spacy
from checklist.perturb import Perturb
model_path = '' # Spacy model path
nlp = spacy.load(model_path)
sent = nlp(':')
Perturb.strip_punctuation(sent) # Endless loop!
I checked the code and found that doc[-1].pos_ after stripping the last token ':' was always PUNCT... it seems like a spaCy bug.
To avoid this, I suggest checking the length of doc in the while condition and returning when the length of doc becomes 0:
while len(doc) and doc[-1].pos_ == 'PUNCT':
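In context, the guarded loop would look something like this (the loop body is my paraphrase of what strip_punctuation seems to do, not the exact source):
while len(doc) and doc[-1].pos_ == 'PUNCT':
    doc = doc[:-1]  # drop the trailing punctuation token; the length check stops the loop once doc is empty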
When trying the "Multilingual suggestions" example, an error occurs:
/usr/local/lib/python3.6/dist-packages/checklist/text_generation.py in unmask(self, text_with_mask, beam_size, candidates)
180 else:
181 if forbid:
--> 182 v, top_preds = torch.topk(outputs[i, masked[size], self.with_space], beam_size + 10)
183 top_preds = self.with_space[top_preds]
184 else:
RuntimeError: selected index k out of range
In INV and DIR tests, every example is a pair of sentences, one original and one changed. When I use Suite.summary() the output is fine, but when I use Suite.Summarizer() in Jupyter to show the results, the predicted label and confidence of the two sentences are in reversed order for every INV and DIR test.
For one example (original: "I'm the guy.", model prediction 1), I apply add_typos to this sentence and get (changed: "I'm eth guy.", model prediction 0).
If I use Suite.summary(), I get the correct result:
Example fails: 1 (0.8) I'm the guy. 0 (0.9) I'm eth guy.
But with Suite.Summarizer(), the result shown in Jupyter is:
I'm the guy. → I'm eth guy. Pred: 0 (0.9) → 1 (0.8)
I can't find where the bug happens, so please help me debug it. Thanks!
Writing custom expectation aggregate functions for test cases results in an attribute error.
The following example demonstrates this:
import checklist
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from checklist.pred_wrapper import PredictorWrapper
from checklist.expect import Expect
from checklist.test_types import INV
from checklist.perturb import Perturb
dataset = [
    'I am checking the checklist',
    'There is a bug in the code',
]

class Model(object):
    THRESHOLD = 0.9

    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
        self.model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")

    def _mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

    def get_encoding(self, sentences):
        encoded_input = self.tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
        with torch.no_grad():
            model_output = self.model(**encoded_input)
        return self._mean_pooling(model_output, encoded_input['attention_mask'])

    def get_similarities(self, sentence1, other_sentences):
        e1 = self.get_encoding(str(sentence1))
        e2 = self.get_encoding([str(x) for x in other_sentences])
        return np.squeeze(cosine_similarity(e1, e2))

def similarity_score(inputs):
    all_preds = list()
    for sentence1, other_sentences in inputs:
        scores = model.get_similarities(sentence1, other_sentences)
        all_preds.append(scores)
    return np.array(all_preds)

def all_similar(x, pred, conf, label=None, meta=None):
    """if any of the results is below the threshold, the testcase doesn't pass"""
    ret = np.sum(pred < Model.THRESHOLD) == 0
    print(f'pred = {pred}, ret = {ret}')
    return ret

def add_typos(sentence, n=5):
    typos = []
    for i in range(n):
        typos.append(Perturb.perturb([sentence], Perturb.add_typos, keep_original=False))
    return sentence, typos

wrapped_pp = PredictorWrapper.wrap_predict(similarity_score)
expect_all_similar = Expect.single(all_similar)
model = Model()
t = Perturb.perturb(dataset, add_typos, nsamples=200, keep_original=False)
test = INV(**t, name='add typos', capability='typo',
           description='', expect=expect_all_similar, agg_fn=expect_all_similar)
test.run(predict_and_confidence_fn=wrapped_pp, overwrite=True, verbose=True)
test.summary()
This results in the following exception:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-38-0b1b8f7c7467> in <module>
1 test = INV(**t, name='add typos', capability='typo',
2 description='', expect=expect_all_similar, agg_fn=expect_all_similar)
----> 3 test.run(predict_and_confidence_fn=wrapped_pp, overwrite=True, verbose=True)
4 test.summary()
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in run(self, predict_and_confidence_fn, overwrite, verbose, n, seed)
351 print('Predicting %d examples' % len(examples))
352 preds, confs = predict_and_confidence_fn(examples)
--> 353 self.run_from_preds_confs(preds, confs, overwrite=overwrite)
354
355 def fail_idxs(self):
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in run_from_preds_confs(self, preds, confs, overwrite)
291 self._check_create_results(overwrite)
292 self.update_results_from_preds(preds, confs)
--> 293 self.update_expect()
294
295 def run_from_file(self, path, file_format=None, format_fn=None, ignore_header=False, overwrite=False):
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in update_expect(self)
128 self._check_results()
129 self.results.expect_results = self.expect(self)
--> 130 self.results.passed = Expect.aggregate(self.results.expect_results, self.agg_fn)
131
132 def example_list_and_indices(self, n=None, seed=None):
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in aggregate(data, agg_fn)
145 # data is a list of lists or list of np.arrays
146 # import pdb; pdb.set_trace()
--> 147 return np.array([Expect.aggregate_testcase(x, agg_fn) for x in data])
148
149 @staticmethod
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in <listcomp>(.0)
145 # data is a list of lists or list of np.arrays
146 # import pdb; pdb.set_trace()
--> 147 return np.array([Expect.aggregate_testcase(x, agg_fn) for x in data])
148
149 @staticmethod
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in aggregate_testcase(expect_results, agg_fn)
160 return None
161 else:
--> 162 return agg_fn(np.array(r))
163
164 @staticmethod
~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in expect(self)
75 """
76 def expect(self):
---> 77 zipped = iter_with_optional(self.data, self.results.preds, self.results.confs, self.labels, self.meta, self.run_idxs)
78 return [fn(x, pred, confs, labels, meta) for x, pred, confs, labels, meta in zipped]
79 return expect
AttributeError: 'numpy.ndarray' object has no attribute 'results'
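Judging by the traceback, agg_fn receives the array of per-example expectation results for one test case, not the test object, so passing the Expect.single-wrapped function there fails when it looks for .results. A workaround that seems to avoid the error (my assumption, not a confirmed fix) is to leave agg_fn at its default, or pass a plain function over that array, e.g.:
def agg_all_pass(expect_results):
    # expect_results: array of per-example outcomes for one test case
    return np.all(expect_results)

test = INV(**t, name='add typos', capability='typo',
           description='', expect=expect_all_similar, agg_fn=agg_all_pass)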
Is it possible to output CheckList results in some structured format, e.g. HTML or XML?
We'd like to integrate CheckList as a CI step and want to be able to persist the model performance.
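For instance, something along these lines is what we have in mind (a rough sketch; it assumes suite.tests maps test names to tests and that test.results.passed holds per-testcase outcomes, which may not match the actual internals):
import json

def suite_to_json(suite, path):
    # persist per-test pass/fail counts so a CI step can diff runs
    out = {}
    for name, test in suite.tests.items():
        passed = [bool(p) for p in test.results.passed if p is not None]
        out[name] = {'testcases': len(passed), 'fails': passed.count(False)}
    with open(path, 'w') as f:
        json.dump(out, f, indent=2)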
Hi there :) Just wondering if you have any suggestions, in the context of sentiment analysis, on how to create a new capability concerning irony and sarcasm.
Given the three types of tests, the most suitable starting point would be an MFT, but what about the labels? If "irony datasets" contain tweets only marked as ironic or not, the tweets could be either negative or positive with respect to sentiment, so what I thought of is using the expect function is_not_1, expecting the label NOT to be neutral, as in the sketch below. But I am not convinced by this solution.
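Concretely, something like this is what I have in mind (a sketch, assuming label 1 means neutral and ironic_tweets is a list of ironic examples):
from checklist.test_types import MFT
from checklist.expect import Expect

# expectation: the prediction is anything but neutral (label 1)
is_not_1 = Expect.single(lambda x, pred, conf, label=None, meta=None: pred != 1)
test = MFT(ironic_tweets, expect=is_not_1,
           name='ironic tweets are not neutral', capability='irony')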
Hi,
I tried to replicate the transformers results for sentiment analysis. To do that, I created results based on the tutorial at this link: https://github.com/marcotcr/checklist/blob/master/notebooks/tutorials/5.%20Testing%20transformer%20pipelines.ipynb. I compared them with the release_data/sentiment/predictions/bert file. I expected the same results but got different ones. Why is this happening?
Thanks for your work! perturb.py is for English only; are there any plans for the API to support other languages?
Hi,
What is the source of the SQuAD model used with this API? (taken from the SQuAD tutorial)
model = bert_squad_model.BertSquad()
invert = lambda a: model.predict_pairs([(x[1], x[0]) for x in a])
new_pp = PredictorWrapper.wrap_predict(invert)
Can I use a transformer model that is fine-tuned on a downstream task as the masked language model for the Editor class?
I have to say it's really a good job! I ran into a problem when running the demo (Multilingual suggestions by RoBERTa). I tried setting the language to English and to Chinese, but both fail with "RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)". I have no idea how to solve it, and it has bothered me for a long time; could you give me some suggestions? I would appreciate it. By the way, my environment is as follows:
Hi!
In the paper you describe some invariant tests for NER models, however, in the library I can't find any code to this. Could you share this as well? Thanks :)
checklist/checklist/text_generation.py
Line 89 in d26abda
This line seems to assume RoBERTa-style tokenization, where a special character (Ġ) marks a token that occurs at the beginning of a word, but it fails for BERT-style tokenization, which uses a special prefix (##) to mark tokens NOT at the beginning of a word. It would also fail if the tokenizer is uncased (John -> john). I can't really see a way to fix it without knowing something about the different model names, though.
Thank you for sharing such a great tool, it's really useful and amazing!
When I used Perturb.change_names to get an INV test case, it could not perturb the names in my Chinese dataset.
Does this module currently support multiple languages? Can you tell me how to use it?
I ran into a lot of trouble following the guide in README.md; none of the demo code could be replicated on my computer. I don't know whether there are hidden bugs or it's only an issue on my end. Could you give us a more detailed environment description, including the required libraries and their versions?
Hi,
What's the easiest way to load a fine-tuned BERT model into the CheckList process? It seems that models are wrapped in a ModelWrapper class, and that wrapper is not available.
Thanks.
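For what it's worth, the pattern I am currently trying looks like this (a sketch, assuming a Hugging Face pipeline with softmax outputs; the model path is a placeholder):
import numpy as np
from transformers import pipeline
from checklist.pred_wrapper import PredictorWrapper

pipe = pipeline('sentiment-analysis', model='path/to/finetuned-bert',
                return_all_scores=True)

def predict_proba(texts):
    # return an (n_examples, n_classes) array of class probabilities
    preds = pipe(list(texts))
    return np.array([[d['score'] for d in p] for p in preds])

wrapped_pp = PredictorWrapper.wrap_softmax(predict_proba)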
The code snippet pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, framework="pt", device=0) in cell #2 of checklist/notebooks/tutorials/4. The CheckList process.ipynb fails on macOS and other systems that do not have a GPU with the following error: AssertionError: Torch not compiled with CUDA enabled. GPU devices are 0-indexed, and changing the parameter device=0 to device=-1 resolves the problem. As there is no explicit requirement to have a GPU, it is perhaps better to change this parameter value so that the notebook runs on all systems, including those without a GPU.
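Alternatively, the device could be picked at runtime, e.g. (a small sketch):
import torch
device = 0 if torch.cuda.is_available() else -1  # GPU 0 if available, else CPU
pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer,
                framework="pt", device=device)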
Hi,
As in your tutorials, Perturb.perturb usually takes one function (for example, Perturb.change_names). But in my scenario I want to pass two functions to Perturb.perturb, so that it returns sentences that are a mixture of func1 and func2.
Example:
func1: change_names
func2: change_city
I would like it to return sentences in which the name is changed by func1 and/or the city is changed by func2 (something like Perturb.perturb(data, Perturb.change_names, Perturb.change_city)). Otherwise, I can still use Perturb.perturb(data, Perturb.change_names) and Perturb.perturb(data, Perturb.change_city) independently.
How can I do that?
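One pattern that might work (a sketch; it assumes data is already processed with spaCy, e.g. list(nlp.pipe(texts)), and uses the library's change_location in place of the hypothetical change_city):
def change_names_or_city(doc):
    # collect perturbations from both functions for a single spaCy doc
    ret = []
    for fn in [Perturb.change_names, Perturb.change_location]:
        new = fn(doc)
        if new:
            ret.extend(new)
    return ret

t = Perturb.perturb(data, change_names_or_city)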
The following lines from tutorial 4 use a hard-coded path to import a module that is not part of the repo:
sys.path.append('/home/marcotcr/work/ml-tests/')
from mltests import model_wrapper
model = model_wrapper.ModelWrapper()
import checklist
from checklist.editor import Editor
editor = Editor()
ret = editor.suggest('I am a {mask} {mask}.', nsamples=5)
print(ret)
It should generate 5 samples.
It generates 11000 samples.
Note that editor.template() correctly handles nsamples; only editor.suggest() is broken.
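As a stopgap (just a workaround guess, not the intended fix), the returned list can be truncated manually:
ret = editor.suggest('I am a {mask} {mask}.')[:5]  # manual cap until nsamples is honored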
Hi,
First of all, thank you for sharing your work.
How can I get labelled data (text, ground truth) to calculate uncertainty in these tasks? I couldn't see any function for it. In the tests_n500 file there are only sample texts without any labels.
Thanks in advance
Best
Hi!
First of all, thanks for your work, it is very inspiring.
I would like to run tests on another test set, semantically different from the airline-related tweets (in particular, I would like to use data from this competition https://amievalita2018.wordpress.com/, which collects misogynistic tweets, in order to explore the fairness of the models).
To do this, do I just have to replace the tests_n500 file and put into the "predictions/" folder a file containing the predictions in the usual format (i.e. the 0/1/2 label and the three probabilities)?
Excuse the beginner's question :) Thanks a lot!
jupyter nbextension install --py --user checklist.viewer
Results in:
Installing test/lib/python3.7/site-packages/checklist/viewer/static -> viewer
Making directory: user/Library/Jupyter/nbextensions/viewer/
Traceback (most recent call last):
File "test/bin/jupyter-nbextension", line 8, in <module>
sys.exit(main())
File "test/lib/python3.7/site-packages/jupyter_core/application.py", line 270, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "test/lib/python3.7/site-packages/traitlets/config/application.py", line 664, in launch_instance
app.start()
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 983, in start
super(NBExtensionApp, self).start()
File "test/lib/python3.7/site-packages/jupyter_core/application.py", line 259, in start
self.subapp.start()
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 711, in start
self.install_extensions()
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 690, in install_extensions
**kwargs
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 220, in install_nbextension_python
destination=dest, logger=logger
File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 187, in install_nbextension
os.makedirs(dest_dir)
File "test/bin/../lib/python3.7/os.py", line 221, in makedirs
mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'user/Library/Jupyter/nbextensions/viewer/'
This is on macOS, in a virtualenv for Python 3.
The following code gives me an error:
from checklist.editor import Editor
editor = Editor()
exs = ['hello', 'good bye']
labels = [0, 1]
ret = editor.template('{ex}', ex=exs, labels=labels, save=True, meta=True)
Based on the documentation I expect labels to accept a list of ints.
pip install keeps failing with the following message:
ERROR: Command errored out with exit status 1:
command: 'd:\python\vir37v1\scripts\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'D:\Users\CAIZ1\AppData\Local\Temp\pip-install-by_ojkpx\checklist\setup.py'"'"'; file='"'"'D:\Users\CAIZ1\AppData\Local\Temp\pip-install-by_ojkpx\checklist\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base 'D:\Users\CAIZ~1\AppData\Local\Temp\pip-pip-egg-info-hi8iwmvg'
The following line from the tutorial causes an error:
suite.run_from_file(pred_path, overwrite=True, file_format='pred_only')
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6884: character maps to <undefined>
I fixed the issue by changing abstract_test.py line 15 to:
f = open(path, 'r', encoding='utf-8')
The editor and perturb modules are currently missing documentation - is this still a work in progress? It would be very nice to be able to browse through the various methods and types of these modules, just like the expect module currently allows, so I just wanted to point this out.
Thanks for this awesome library!
Should the old prediction be on the left side of the arrow?
predTag = <Tag style={{verticalAlign: "middle"}}>
Pred: <span className="example-token rewrite-remove">{newobj.pred}{confStr}</span>
{replaceArrow}
<span className="example-token rewrite-add">{oldobj.pred}{confStrOld}</span>
</Tag>
Where can I find the file loaded by spacy_map = pickle.load(open('/home/marcotcr/tmp/processed_qqp.pkl', 'rb'))?
env:
transformers-2.8.0
torch-1.7.0
@marcotcr
Hi, I have tried to install checklist on Windows 10 but received an error I cannot resolve. First of all, it fails at this line in the setup file: check_call([f"{sys.executable} -m pip install jupyter"], shell=True). After removing this line, I get the following error:
error: symbolic link privilege not held
I was wondering whether your package works on Windows. If it does, what am I missing here?
Thanks
Hi,
First of all, thanks for your great work and this very useful library.
I am looking to test NER models (Transformer- and LSTM-based), and I would like to know if you have any example/code of how to test such models.
I haven't found any, even in notebook 5, Testing transformer pipelines.
I guess the key is to be able to write an expectation function at the token level, as in the sketch below? Maybe you have already explored something?
Many thanks!
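To make the question concrete, something like this is what I mean (a rough sketch, assuming predictions are per-token tag sequences aligned between the original and the perturbed sentence):
from checklist.expect import Expect

def tags_unchanged(orig_pred, pred, orig_conf, conf, labels=None, meta=None):
    # token-level invariance: the predicted tag sequence should not change
    return list(orig_pred) == list(pred)

expect_fn = Expect.pairwise(tags_unchanged)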
Thanks for releasing this useful product; I recently adapted it to Chinese. If I want to use it to check/test other models that do not use a softmax in the last layer (like NER models), or that do not produce one probability vector per example, what changes should I make to adapt to these models?
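It looks like PredictorWrapper.wrap_predict might fit models that only return predictions (an assumption on my part; confidences then seem to be filled in with a constant):
from checklist.pred_wrapper import PredictorWrapper

def predict_fn(texts):
    # hypothetical stand-in for a model that returns one tag/label per input
    return [0 for _ in texts]

wrapped_pp = PredictorWrapper.wrap_predict(predict_fn)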
When trying to use the visual summary functionality on a TestSuite, I ran into an issue with loading examples: I get the error message ValueError: Can't clean for JSON: array([1.]). I get this both when using suite.visual_summary_table() and when using suite.visual_summary_by_test().
However, when I try suite.summary() it works fine and I get something like this:
NER test
Test cases: 100
Fails (rate): 4 (4.0%)
Example fails:
0.0 0.0 1.0 Ian Young cooked the burgers in some broth.
----
0.0 0.0 1.0 George Rogers cooked the meats in some broth.
----
0.0 0.0 1.0 Paul Brown cooked the chicken al dente.
----
where the three numbers before every sample are the probability scores (in the case of my model, these are always 1.0 or 0.0).
Is this expected behavior (am I doing something wrong?) or is it a bug?
See traceback from the visualization widget below -- note that the error is raised not when initially loading the widget but only once example fails are being loaded.
ValueError Traceback (most recent call last)
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/suite_summarizer.py in handle_events(self, _, content, buffers)
46 elif content.get('event', '') == 'switch_test':
47 testname = content.get("testname", "")
---> 48 self.on_select_test(testname)
49
50 def on_select_test(self, testname: str) -> None:
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/suite_summarizer.py in on_select_test(self, testname)
54 summary, testcases = self.select_test_fn(testname)
55 self.reset_summary(summary)
---> 56 self.reset_testcases(testcases)
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in reset_testcases(self, testcases)
46 self.filtered_testcases = testcases if testcases else []
47 self.tokenize_testcases()
---> 48 self.search(filter_tags=[], is_fail_case=True)
49
50 def handle_events(self, _, content, buffers):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in search(self, filter_tags, is_fail_case)
118 self.compute_stats_result(candidate_testcases_not_fail)
119 self.to_slice_idx = 0
--> 120 self.fetch_example()
121
122 def fetch_example(self):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in fetch_example(self)
126 new_examples = self.candidate_testcases[self.to_slice_idx : self.to_slice_idx+self.max_return]
127 self.to_slice_idx += len(new_examples)
--> 128 self.testcases = [e for e in new_examples]
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in __set__(self, obj, value)
583 raise TraitError('The "%s" trait is read-only.' % self.name)
584 else:
--> 585 self.set(obj, value)
586
587 def _validate(self, obj, value):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in set(self, obj, value)
572 # we explicitly compare silent to True just in case the equality
573 # comparison above returns something other than True/False
--> 574 obj._notify_trait(self.name, old_value, new_value)
575
576 def __set__(self, obj, value):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in _notify_trait(self, name, old_value, new_value)
1137 new=new_value,
1138 owner=self,
-> 1139 type='change',
1140 ))
1141
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in notify_change(self, change)
603 if name in self.keys and self._should_send_property(name, getattr(self, name)):
604 # Send new state to front-end
--> 605 self.send_state(key=name)
606 super(Widget, self).notify_change(change)
607
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in send_state(self, key)
487 state, buffer_paths, buffers = _remove_buffers(state)
488 msg = {'method': 'update', 'state': state, 'buffer_paths': buffer_paths}
--> 489 self._send(msg, buffers=buffers)
490
491
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in _send(self, msg, buffers)
735 """Sends a message to the model in the front-end."""
736 if self.comm is not None and self.comm.kernel is not None:
--> 737 self.comm.send(data=msg, buffers=buffers)
738
739 def _repr_keys(self):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/comm/comm.py in send(self, data, metadata, buffers)
121 """Send a message to the frontend-side version of this comm"""
122 self._publish_msg('comm_msg',
--> 123 data=data, metadata=metadata, buffers=buffers,
124 )
125
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/comm/comm.py in _publish_msg(self, msg_type, data, metadata, buffers, **keys)
63 data = {} if data is None else data
64 metadata = {} if metadata is None else metadata
---> 65 content = json_clean(dict(data=data, comm_id=self.comm_id, **keys))
66 self.kernel.session.send(self.kernel.iopub_socket, msg_type,
67 content,
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
175
176 if isinstance(obj, list):
--> 177 return [json_clean(x) for x in obj]
178
179 if isinstance(obj, dict):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in <listcomp>(.0)
175
176 if isinstance(obj, list):
--> 177 return [json_clean(x) for x in obj]
178
179 if isinstance(obj, dict):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
175
176 if isinstance(obj, list):
--> 177 return [json_clean(x) for x in obj]
178
179 if isinstance(obj, dict):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in <listcomp>(.0)
175
176 if isinstance(obj, list):
--> 177 return [json_clean(x) for x in obj]
178
179 if isinstance(obj, dict):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
189 out = {}
190 for k,v in iteritems(obj):
--> 191 out[unicode_type(k)] = json_clean(v)
192 return out
193 if isinstance(obj, datetime):
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
195
196 # we don't understand it, it's probably an unserializable object
--> 197 raise ValueError("Can't clean for JSON: %r" % obj)
ValueError: Can't clean for JSON: array([1.])
When trying to install checklist, I run into the error OSError: mysql_config not found. This happens during pip install mysqlclient, which is a dependency of the pattern package.
The simple solution, of course, is to install MySQL. Unfortunately, this is not possible in many shared university computing environments due to security risks. As far as I can tell, MySQL is not used anywhere in this tool. Is there any workaround to install checklist in an environment without MySQL?
Thank you for your useful work!
If I run:
from checklist.perturb import Perturb
I get this output:
error: BadZipFile: File is not a zip file
Can you tell me how to solve this problem?
Hello, when I run
from checklist.perturb import Perturb
I get an error: BadZipFile: File is not a zip file
Can I use my own language models?
Is there any way I can run invariance tests from predictions saved in files? When I try to use test.run_from_file(), I get the following error:
AttributeError: 'INV' object has no attribute 'result_indexes'
When using the perturb functions to replace words with synonyms, this line causes a CUDA OOM error:
checklist/checklist/text_generation.py
Line 270 in 64a810a
The beam size is unbounded. Would it be possible to make it user-configurable when calling the antonyms/synonyms API, so that the memory cost is more controllable?
BTW, thank you for such great work!
Hi,
In the root directory of this repo there is an __init__.py file. I'm wondering whether this is intentional, because it causes Python to think that the repo itself is a Python package when it's not. For example, with a directory layout such as the following,
.
├── checklist (repo)
└── foo.py
and I import checklist.editor in foo.py, then Python looks for editor in the repo, which it cannot find, triggering a ModuleNotFoundError.
I encountered a couple of type errors on tensors in the unmask() function from text_generation.py while running the examples from the introduction page with Python/Torch on Windows 10. Fixing them required explicit type casting with .to(torch.int64).
The diff for the fix is as follows:
PS C:\src\checklist> git diff .\checklist\text_generation.py
diff --git a/checklist/text_generation.py b/checklist/text_generation.py
index b0ad20e..9a6a5ed 100644
--- a/checklist/text_generation.py
+++ b/checklist/text_generation.py
@@ -163,7 +163,7 @@ class TextGenerator(object):
# print('ae')
# print('\n'.join([tokenizer.decode(x) for x in to_pred]))
# print()
- to_pred = torch.tensor(to_pred, device=self.device)
+ to_pred = torch.tensor(to_pred, device=self.device).to(torch.int64) # fix for int32 / int64 type mismatch on win10
with torch.no_grad():
outputs = model(to_pred)[0]
for i, current in enumerate(current_beam):
@@ -179,7 +179,7 @@ class TextGenerator(object):
new = [(current[0] + [int(x[0])], float(x[1]) + current[1]) for x in zip(cands_to_use, scores)]
else:
if forbid:
- v, top_preds = torch.topk(outputs[i, masked[size], self.with_space], beam_size + 10)
+ v, top_preds = torch.topk(outputs[i, masked[size], self.with_space.to(torch.int64)], beam_size + 10) # fix for int32 / int64 type mismatch on win10
top_preds = self.with_space[top_preds]
else:
v, top_preds = torch.topk(outputs[i, masked[size]], beam_size + 10)
When I tried to use the perturb functions to replace words with synonyms, this error occurred. I checked the code and found that the variable orig_ret is not defined when no candidates are suggested.
checklist/checklist/text_generation.py
Line 278 in 64a810a
To solve this, I suggest defining the variable before assigning the value of new_ret to it.
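In other words, something along these lines (a sketch; new_ret here stands in for the value computed in the original code):
new_ret = None   # placeholder for the suggestions computed above
orig_ret = []    # suggested fix: define orig_ret up front so it always exists
if new_ret:
    orig_ret = new_ret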
Just a curiosity.
Is there a specific implementation reason why the fill-ins for the templates ({mask}) produced by CheckList are created from suggestions of RoBERTa and BERT? Given that these same two models are analysed and their shortcomings are highlighted in the paper, why rely (in some way) on """imperfect""" models?
Thanks a lot