marcotcr / anchor
Code for "High-Precision Model-Agnostic Explanations" paper
License: BSD 2-Clause "Simplified" License
Hi,
I'm trying to run this simple snippet after installing anchor, spacy and all the requirements without any error or warning (including running 'python -m spacy download en_core_web_lg'):
import spacy
from anchor import anchor_text
nlp = spacy.load('en_core_web_lg')
explainer = anchor_text.AnchorText(nlp, ['negative', 'positive'], use_unk_distribution=False, use_bert=False)
But I obtain the following error:
Exception Traceback (most recent call last)
<ipython-input-5-7f4e7f3d6066> in <module>
----> 1 explainer = anchor_text.AnchorText(nlp, ['negative', 'positive'], use_unk_distribution=False, use_bert=False)
~/.local/lib/python3.7/site-packages/anchor/anchor_text.py in __init__(self, nlp, class_names, use_unk_distribution, use_bert, mask_string)
117 self.tg = None
118 self.use_bert = use_bert
--> 119 self.neighbors = utils.Neighbors(self.nlp)
120 self.mask_string = mask_string
121 if not self.use_unk_distribution and self.use_bert:
~/.local/lib/python3.7/site-packages/anchor/utils.py in __init__(self, nlp_obj)
319 self.to_check = [w for w in self.nlp.vocab if w.prob >= -15 and w.has_vector]
320 if not self.to_check:
--> 321 raise Exception('No vectors. Are you using en_core_web_sm? It should be en_core_web_lg')
322 self.n = {}
323
Exception: No vectors. Are you using en_core_web_sm? It should be en_core_web_lg
I'm using this setting:
Fedora 30 (but I can replicate it on Ubuntu 18.04)
python 3.7.4
spacy 2.3.2 (but I've also tried with 2.2.3)
Thank you,
Emanuele
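The exception is raised when the filter on line 319 of utils.py comes back empty. A minimal sketch of that filter, using hypothetical stand-in objects rather than real spaCy lexemes, shows that it can be empty even when vectors are present, as long as every word carries a log-probability below the cutoff. Whether that is the cause in this report is an assumption; the `FakeLexeme` class and the probability values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeLexeme:
    """Hypothetical stand-in for a vocabulary entry (not a spaCy class)."""
    text: str
    prob: float        # log-probability of the word
    has_vector: bool   # whether a word vector is attached


def words_to_check(vocab, min_prob=-15):
    # Same condition as the list comprehension shown in the traceback.
    return [w for w in vocab if w.prob >= min_prob and w.has_vector]


# Vectors exist, but every log-probability is below the -15 cutoff.
vocab_low_prob = [FakeLexeme("good", -20.0, True), FakeLexeme("bad", -20.0, True)]
# Vectors exist and the probabilities pass the cutoff.
vocab_ok = [FakeLexeme("good", -8.5, True), FakeLexeme("bad", -9.1, True)]

print(len(words_to_check(vocab_low_prob)))  # 0 -> would trigger the exception
print(len(words_to_check(vocab_ok)))        # 2
```

With a real model, printing a few `w.prob` and `w.has_vector` values from `nlp.vocab` would show which half of the condition is failing.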
Where can I download the datasets used in the notebooks
I was playing around with Anchors and was getting really weird coverage and precision values. Then I saw the TODO (line 301 of anchor_tabular.py) saying that the precision and coverage measures are incorrect.
Does this mean that the anchors that I computed were also wrong? Or does the issue only affect those metrics?
Thanks!
Hi!
I am trying to use AnchorTabularExplainer to explain the predictions of a random forest classifier, but when calling explain_instance function I get this error: "ValueError: X has different shape than during fitting. Expected 4, got 11." I have checked that the train dataset and test dataset have the same shape.
Any suggestions on what I can do?
Hi Marco,
Any pointers or code for a VQA example? Is there any way to modify the ImageNet example for VQA?
Hey, I'm working on a breast cancer dataset. When I call the explain_instance function, I get the above error message.
This is what the code looks like:
from anchor import anchor_tabular
explainer = anchor_tabular.AnchorTabularExplainer(np.asarray(["0","1"]),list(data.drop(columns=['id','diagnosis']).columns),x)
exp = explainer.explain_instance(X_test[0:1], classifier.predict, threshold=0.95)
I saw the same issue on the SHAP GitHub, where the advice was to use NumPy slices, but that still didn't work.
This is what the error log looks like
I use pip install anchor-exp, but get the following error.
File "", line 1, in
File "/usr/local/miniconda3/envs/dl/lib/python3.6/site-packages/anchor/anchor_image.py", line 40
print i
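The `print i` line in the traceback is Python 2 statement syntax, which a Python 3 interpreter rejects at parse time. A small sketch (the helper name `compiles_under_this_python` is hypothetical) that checks whether a snippet parses under the interpreter running it:

```python
def compiles_under_this_python(source):
    """Return True if `source` parses with the running interpreter."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


py2_style = "i = 1\nprint i\n"     # print as a statement (Python 2 only)
py3_style = "i = 1\nprint(i)\n"    # print as a function call

print(compiles_under_this_python(py2_style))
print(compiles_under_this_python(py3_style))
```

Run under Python 3, the first call reports a parse failure and the second succeeds, which matches the error seen when importing anchor_image.py.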
Hello Marco,
Could you provide more details on how to use SP-Anchor? Also, how would precision and coverage be defined in the case of SP-Anchor?
I have been trying to run anchor on numeric data. Getting this error-
in
----> 1 explainer.fit(X_train, y_train, X_test, y_test)
/opt/anaconda3/lib/python3.7/site-packages/anchor/anchor_tabular.py in fit(self, train_data, train_labels, validation_data, validation_labels, discretizer)
69 self.disc = lime.lime_tabular.QuartileDiscretizer(train_data,
70 self.categorical_features,
---> 71 self.feature_names)
72 elif discretizer == 'decile':
73 self.disc = lime.lime_tabular.DecileDiscretizer(train_data,
/opt/anaconda3/lib/python3.7/site-packages/lime/discretize.py in __init__(self, data, categorical_features, feature_names, labels, random_state)
190 BaseDiscretizer.__init__(self, data, categorical_features,
191 feature_names, labels=labels,
--> 192 random_state=random_state)
193
194 def bins(self, data, labels):
/opt/anaconda3/lib/python3.7/site-packages/lime/discretize.py in __init__(self, data, categorical_features, feature_names, labels, random_state, data_stats)
97 self.maxs[feature] = qts.tolist() + [boundaries[1]]
98 [self.get_undiscretize_value(feature, i)
---> 99 for i in range(n_bins + 1)]
100
101 @abstractmethod
/opt/anaconda3/lib/python3.7/site-packages/lime/discretize.py in <listcomp>(.0)
97 self.maxs[feature] = qts.tolist() + [boundaries[1]]
98 [self.get_undiscretize_value(feature, i)
---> 99 for i in range(n_bins + 1)]
100
101 @abstractmethod
/opt/anaconda3/lib/python3.7/site-packages/lime/discretize.py in get_undiscretize_value(self, feature, val)
141 minz, maxz, loc=means[val], scale=stds[val],
142 random_state=self.random_state,
--> 143 size=self.precompute_size))
144 idx = self.undiscretize_idxs[feature][val]
145 ret = self.undiscretize_precomputed[feature][val][idx]
/opt/anaconda3/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py in rvs(self, *args, **kwds)
960 cond = logical_and(self._argcheck(*args), (scale >= 0))
961 if not np.all(cond):
--> 962 raise ValueError("Domain error in arguments.")
963
964 if np.all(scale == 0):
ValueError: Domain error in arguments.
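The final frame shows scipy rejecting the location/scale arguments it was handed by the discretizer. One way this can happen, offered here as an assumption rather than a diagnosis, is a numeric column containing NaN or infinite values, which would propagate into the per-bin means and standard deviations. A minimal pre-check in plain Python (the helper name `find_bad_columns` and the sample data are hypothetical):

```python
import math


def find_bad_columns(rows):
    """Return indices of columns that contain NaN or infinite values."""
    bad = set()
    for row in rows:
        for j, value in enumerate(row):
            if math.isnan(value) or math.isinf(value):
                bad.add(j)
    return sorted(bad)


X_train = [
    [1.0, 2.0, 3.0],
    [4.0, float("nan"), 6.0],
    [7.0, 8.0, float("inf")],
]
print(find_bad_columns(X_train))  # [1, 2]
```

If the check comes back empty, the cause lies elsewhere and the per-bin statistics computed by the discretizer would be the next thing to inspect.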
The latest package seems to use BERT to perform text perturbation. Previous versions did not use transformers to perturb the utterances. However, installing anchor from https://pypi.org/project/anchor-exp/ now requires installing transformers. Could you please advise how versioning works for anchor? Ideally I would like to go back to the previous version if possible. Thank you.
Hi, @marcotcr
I am very interested in the submodular pick for anchor, but I can find very little about it in the paper and the code.
Where can I find more about it?
Hi,
First of all, I should say that this is a really great paper. Good luck with it.
My question is this: let's assume we have a tabular dataset. We just encode the categorical variables as pre-processing and build a model using a random forest. Then we save that model in .pickle format.
After that, can we use Anchor to explain that saved model?
For example, I don't need to use these lines of code when making the model.
explainer = anchor_tabular.AnchorTabularExplainer(dataset.class_names, dataset.feature_names, dataset.data, dataset.categorical_names)
explainer.fit(dataset.train, dataset.labels_train, dataset.validation, dataset.labels_validation)
c = sklearn.ensemble.RandomForestClassifier(n_estimators=50, n_jobs=5)
c.fit(explainer.encoder.transform(dataset.train), dataset.labels_train)
I just need to use,
c = sklearn.ensemble.RandomForestClassifier(n_estimators=50, n_jobs=5)
c.fit(dataset.train, dataset.labels_train)
these two lines when making the model.
Thank you.
Hello!
I would like to propose a feature to this wallet!
In brief: put a description beside the account name to make it personal and human readable, just something small to give a quick reminder of what I use this account for!
Thank you in advance!
I have run into problems installing Anchors with pip install anchors_exp onto Python 3.6.4 (Ubuntu) and 3.5.4 (Windows 10).
On Ubuntu the installation simply ends with "Killed" and no other information.
On Windows it ends with the following:
Command "c:\dev\prog\anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\Julian\\AppData\\Local\\Temp\\pip-install-alvzq035\\murmurhash\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\Julian\AppData\Local\Temp\pip-record-3vle_27_\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Julian\AppData\Local\Temp\pip-install-alvzq035\murmurhash\
I looked up this error and it led me here: https://github.com/explosion/spaCy/issues/1395 but there is no windows solution.
I am using HELOC dataset which can be downloaded from https://community.fico.com/s/explainable-machine-learning-challenge?tabset-3158a=2.
I am using an XGBoost model as classification function and trying to use Anchors as an explainability technique over and above XGBoost.
I am using the below code to implement Anchors, however the anchors that are being outputted contain all the features (for most instances in test data), which is obviously very hard to read (and therefore, not that interpretable). Moreover, the precision for the whole anchor when given a threshold of 0.8 is only 0.33.
explainer = anchor_tabular.AnchorTabularExplainer(class_names=['Bad', 'Good'],
feature_names=dfTrain.columns, train_data=np.array(dfTrain), categorical_names={})
idx = 100
np.random.seed(1)
predict_fn = lambda x: model.predict(xgb.DMatrix(pd.DataFrame(x, columns=list(dfTest.columns)), label = [yTest[idx]]))
print('Prediction: ', explainer.class_names[int(round(predict_fn(dfTest.iloc[[idx],:])[0]))])
exp = explainer.explain_instance(np.array(dfTest.iloc[[idx],:]), predict_fn, threshold=0.8)
print('Anchor: %s' % (' AND '.join(exp.names())))
print('Precision: %.2f' % exp.precision())
print('Coverage: %.2f' % exp.coverage())
Here is a screenshot of a sample anchor:
Is there something I can do from my end to improve this?
Thanks,
Lara
I get this error
'numpy.ndarray' object has no attribute 'feature_names'
When trying to execute
explainer = AnchorTabular(model.predict, feature_names=X_test.columns.values.tolist())
I tried converting to a numpy array, but that did not work. Should I convert to xgb.DMatrix(...)?
I am following the example for ‘Anchor for text’. The only difference is that I am using my own dataset. My data is labelled as ‘depressed’ and ‘not depressed’. One thing I don’t understand is that when I print the examples, they include only the first few words of the entire text given to the anchor explainer. For example, the following is the text I am giving to anchor:
When I print the examples where the anchor applies and the model predicts ‘depressed’ or ‘not depressed’, only the first few words of the whole text are printed.
Can you please explain why it is not taking the entire given text into account?
When using anchor on my dataset, I have found that it produces an anchor containing conditions in all (73) features. It seems to have identical output to LIME with discretized continuous variables, rather than intelligently choosing a few high coverage, high precision conditions.
You mention in your code that your calculations for precision and coverage are not currently working. So does the stopping condition for the anchor construction algorithm not work, if it uses these? Do you have plans to fix this?
exp = explainer.explain_instance(numpy_test, catBoostModel.predict, threshold=0.95)
/opt/anaconda3/lib/python3.7/site-packages/anchor/anchor_tabular.py in sample_from_train(self, conditions_eq, conditions_neq, conditions_geq, conditions_leq, num_samples, validation)
98 bla
99 """
--> 100 train = self.train if not validation else self.validation
101 d_train = self.d_train if not validation else self.d_validation
102 idx = np.random.choice(range(train.shape[0]), num_samples,
AttributeError: 'AnchorTabularExplainer' object has no attribute 'validation'
Hello Marco,
I find your paper genuinely interesting, it is great progress towards the model-agnostic interpretability.
I wanted to ask you whether this repository is missing the code connected with SP-Anchor?
Hi,
see second comment with detailed description,
Regards
Christian
I was analyzing the library and discovered that the parameter "num_samples" for anchor_tabular seems to be set from the parameter coverage_samples in the file anchor_base. It defaults to 10,000 samples.
Do you see any problem with using such a large number of samples on relatively small datasets? I am working with a dataset that has 1,000 instances. I will try to experiment with different values but would appreciate your input on this.
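A traceback elsewhere on this page shows the sampler drawing row indices with `np.random.choice(range(train.shape[0]), num_samples, ...`. Assuming that draw is made with replacement (NumPy's default for `np.random.choice`; the call is truncated in the traceback, so this is not confirmed), asking for more samples than there are rows simply means rows repeat. A sketch of that behaviour using only the standard library, with a hypothetical helper `sample_row_indices`:

```python
import random
from collections import Counter


def sample_row_indices(n_rows, num_samples, seed=0):
    """Draw `num_samples` row indices with replacement from `n_rows` rows."""
    rng = random.Random(seed)
    return rng.choices(range(n_rows), k=num_samples)


idx = sample_row_indices(n_rows=1000, num_samples=10000)
counts = Counter(idx)
print(len(idx))     # 10000 draws in total
print(len(counts))  # number of distinct rows drawn, never more than 1000
```

Under that assumption, a large sample count on a small dataset costs extra computation but adds no new rows beyond the 1,000 available.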
if word.prob < -15:
--> 337 queries += [w]
338 by_similarity = sorted(
339 queries, key=lambda w: word.similarity(w), reverse=True)
NameError: name 'w' is not defined
https://github.com/marcotcr/anchor/blob/master/anchor/utils.py#L337
Hello,
I am trying to apply anchor on a dataset with 290 numerical features.
Following your test example I wrote the code below. But I get an error.
Could you have a look?
class_names = [0, 1]
classifier_name = "svc"
classifier_var = {"probability": True}
train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(feature_vector, target_vector,
train_size=0.80)
classifier = src.classifiers.possible_classifiers[classifier_name](**classifier_var)
classifier.fit(train, labels_train)
print(sklearn.metrics.accuracy_score(labels_test, classifier.predict(test)))
>>> 0.825892857143
explainer = anchor_tabular.AnchorTabularExplainer(class_names, feature_names, train)
explainer.fit(train, labels_train, test, labels_test)
exp = explainer.explain_instance(test[67], classifier.predict, threshold=0.95)
>>>
C:\Users\adhikaria\PycharmProjects\SDQ\venv\lib\site-packages\anchor\anchor_base.py:58: RuntimeWarning: divide by zero encountered in log
temp = np.log(k * n_features * (t ** alpha) / delta)
C:\Users\adhikaria\PycharmProjects\SDQ\venv\lib\site-packages\anchor\anchor_base.py:59: RuntimeWarning: invalid value encountered in log
return temp + np.log(temp)
Traceback (most recent call last):
File "C:/Users/adhikaria/Documents/curium/src/lime_application.py", line 142, in <module>
exp = explainer.explain_instance(test[67], classifier.predict, threshold=0.95)
File "C:\Users\adhikaria\PycharmProjects\SDQ\venv\lib\site-packages\anchor\anchor_tabular.py", line 281, in explain_instance
**kwargs)
File "C:\Users\adhikaria\PycharmProjects\SDQ\venv\lib\site-packages\anchor\anchor_base.py", line 381, in anchor_beam
1, verbose=verbose)
File "C:\Users\adhikaria\PycharmProjects\SDQ\venv\lib\site-packages\anchor\anchor_base.py", line 92, in lucb
ut, lt = update_bounds(t)
File "C:\Users\adhikaria\PycharmProjects\SDQ\venv\lib\site-packages\anchor\anchor_base.py", line 89, in update_bounds
ut = not_J[np.argmax(ub[not_J])]
File "C:\Users\adhikaria\PycharmProjects\SDQ\venv\lib\site-packages\numpy\core\fromnumeric.py", line 1004, in argmax
return _wrapfunc(a, 'argmax', axis=axis, out=out)
File "C:\Users\adhikaria\PycharmProjects\SDQ\venv\lib\site-packages\numpy\core\fromnumeric.py", line 52, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: attempt to get argmax of an empty sequence
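Both warnings come from the expression `np.log(k * n_features * (t ** alpha) / delta)` shown in the log. A sketch of that expression in plain Python shows that its argument is zero whenever `n_features` is zero, which would also be consistent with the later "argmax of an empty sequence" error (nothing to choose from). That the candidate count is zero in this run is an assumption, and the values of `k` and `alpha` below are placeholders for illustration.

```python
import math


def log_argument(n_features, t, delta, k=405.5, alpha=1.1):
    """Value passed to the first log in the warning (k, alpha are placeholders)."""
    return k * n_features * (t ** alpha) / delta


def safe_beta(n_features, t, delta):
    """Return temp + log(temp), or None when either log is undefined."""
    arg = log_argument(n_features, t, delta)
    if arg <= 0:
        return None   # log(0): the "divide by zero" case
    temp = math.log(arg)
    if temp <= 0:
        return None   # log of a non-positive number: the "invalid value" case
    return temp + math.log(temp)


print(safe_beta(n_features=0, t=1, delta=0.1))                # None
print(safe_beta(n_features=290, t=1, delta=0.1) is not None)  # True
```

Printing the number of candidate features the explainer sees before calling explain_instance would confirm or rule out this reading.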
Hi there,
Is anchor suitable for VS Code? I tried to install / clone anchor / anchor_exp on VS Code but it kept showing error:
An error occurred while installing & ERROR: ERROR: Package installation failed...
I am wondering if I can only install anchor on Jupyter Notebook?
I was reading through the paper, and found interesting examples for MT and POS.
I am a little unclear on whether the current library can be used for it. If so, I would be happy to submit a pull request with an example.
If not, what changes would have to be made?
Hi Marco,
Following up on issue #2, is there any ongoing work on including code for image examples?
Thanks!
I noticed in anchor_base.py line 305, newly generated candidate anchors are filtered out if they don't have a larger coverage than the previous best candidate whose precision is large enough. But these new candidates are larger, and will always have coverage at most as large as the previous valid candidate. Would it be equivalent if we check if best_coverage == -1 as the first line in the while loop on line 302, and break if best_coverage is not -1 (meaning we select the best coverage anchor with high enough precision from the previous round, and not try making new candidates as their coverage can't possibly be larger)?
Let me know if I'm missing something when you get the chance. Thanks.
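The monotonicity claim in this question can be checked on a toy dataset. The sketch below treats an anchor as a set of equality conditions and coverage as the fraction of rows satisfying all of them; this is a simplification for illustration, not the library's estimator, and the data is made up.

```python
def coverage(rows, conditions):
    """Fraction of rows satisfying every (feature_index, value) condition."""
    hits = sum(all(row[f] == v for f, v in conditions) for row in rows)
    return hits / len(rows)


rows = [
    (0, 1, 1),
    (0, 1, 0),
    (0, 0, 1),
    (1, 1, 1),
]
small = [(0, 0)]            # one condition
large = [(0, 0), (1, 1)]    # the same condition plus one more

print(coverage(rows, small))  # 0.75
print(coverage(rows, large))  # 0.5
```

Every row matched by the larger anchor is also matched by the smaller one, so on a fixed set of rows the larger anchor's coverage cannot exceed the smaller one's.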
Hi, I found the paper on anchor extremely interesting. The dataset I have only has categorical features with values 0 and 1. I tested it with different models, but the code throws an error in the line classifier_fn(self.encoder.transform(x)). As the feature vectors in the dataset are already discretized, anchor discretizes them further irrespective of the input, which causes the predict function to throw an error. Could you please help me with this issue? Thanks.
Hi,
First I want to thank you for the wonderful paper and creating this library in the first place. I would also like to ask a question.
I want to use anchor on a dataset which is already processed, i.e. everything is binned and properly one-hot encoded, so I now have an |'n_cases' x 'm_one-hot-encoded_features'| sparse matrix as (X) and an |n_cases x 2| matrix as (y). Unfortunately, a lot of 'automated' dataset processing is done during instantiation of anchor_tabular.AnchorTabularExplainer(), which produces various errors when I want to use it on an arbitrary dataset.
Can you please advise what one needs to provide for 'class_names', 'feature_names' and 'categorical_names' in cases when the dataset is already one-hot encoded?
Thank you!
Best regards,
Kiril
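One possible workaround, sketched under the assumption that the explainer expects one integer-coded column per categorical feature plus a `categorical_names` mapping from column index to the list of value names, is to collapse each one-hot group back into a single column. The helper `collapse_one_hot` and the example feature names are hypothetical.

```python
def collapse_one_hot(rows, groups):
    """Collapse one-hot column groups into integer-coded columns.

    rows   : list of rows, each a list of 0/1 values
    groups : list of (feature_name, [(column_index, value_name), ...])
    Returns (coded_rows, feature_names, categorical_names).
    """
    feature_names = [name for name, _ in groups]
    categorical_names = {
        i: [value for _, value in cols] for i, (_, cols) in enumerate(groups)
    }
    coded_rows = []
    for row in rows:
        coded = []
        for _, cols in groups:
            active = [k for k, (col, _) in enumerate(cols) if row[col] == 1]
            if len(active) != 1:
                raise ValueError("each one-hot group needs exactly one active column")
            coded.append(active[0])
        coded_rows.append(coded)
    return coded_rows, feature_names, categorical_names


groups = [
    ("color", [(0, "red"), (1, "blue")]),
    ("size", [(2, "S"), (3, "M"), (4, "L")]),
]
X_one_hot = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
]
coded, feature_names, categorical_names = collapse_one_hot(X_one_hot, groups)
print(coded)              # [[0, 2], [1, 1]]
print(categorical_names)  # {0: ['red', 'blue'], 1: ['S', 'M', 'L']}
```

A model trained on the one-hot matrix would then need a wrapper around its predict function that expands the integer codes back to one-hot form, which is not shown here.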
In Anchor, is the number of instances in the neighborhood equal to the number of instances in the dataset for which the global model's prediction matches its decision for the instance being explained?
Thanks,
Hi Marco, could you include some image examples as well?
I installed the package with pip.
When running 'Anchor on tabular data.ipynb' I get an error
----> 1 import anchor_base
2 import anchor_explanation
3 import utils
4 import lime
5 import lime.lime_tabular
ModuleNotFoundError: No module named 'anchor_base'
Possibly a dot is missing in the import statement.
Hi @marcotcr ,
Firstly, the paper is great and I'm really looking forward to using the package.
I tried to use it on my own data where the AnchorTabularExplainer() object does not have any categorical_names (i.e. categorical features). I see that the code, when calling the explain_instance() method, goes to https://github.com/marcotcr/anchor/blob/master/anchor/anchor_tabular.py#L215 and, since there are no categorical features, the mapping dict remains empty and so the method does not work.
Am I missing something? Or, is there something I can do to overcome this?
Hi,
I am trying to apply anchor in my text classification models, but I am wondering if it is applicable to multi-labeled target case because your sample is handling a binary case. In the example, I should set up the "alternative" parameter for the other target.
Please advise.
Thanks,
I'm having a problem where the anchor appears to be computed (I look at the object and see e.g. features, examples, precision/coverage), but the names() attribute of the explanation is empty. I'm using the included discretizer on numerical variables for AnchorTabular. After some debug I found that add_names_to_exp() in anchor_tabular.py sometimes doesn't return a name (no condition is true, in my case a variable A > [value] has a leq mapped), probably because of errors in the mapping.
Going to the mapping: get_sample_fn() seems to be comparing values, yet uses
for v in range(len(self.categorical_names[f])):
    if data_row[f] <= v:
Comparing data directly to a range? Is this intended? Same thing for conditions in sample_fn(). I'd need to debug more to confirm anything, but this caught my attention.
Great paper btw
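One reading of that loop, offered as an assumption rather than a confirmed description of the library, is that `data_row` has already been discretized at that point, so `data_row[f]` is a bin index and comparing it to `v` compares bin indices, not raw values. A sketch with hypothetical bin edges and names:

```python
from bisect import bisect_left


def to_bin_index(value, edges):
    """Map a raw value to a bin index; bin i covers (edges[i-1], edges[i]]."""
    return bisect_left(edges, value)


edges = [2.0, 5.0, 8.0]   # hypothetical quartile boundaries -> 4 bins
bin_names = ["x <= 2.00", "2.00 < x <= 5.00", "5.00 < x <= 8.00", "x > 8.00"]

raw_value = 6.3
b = to_bin_index(raw_value, edges)
print(b, bin_names[b])  # 2 5.00 < x <= 8.00

# Conditions expressed on bin indices, mirroring `data_row[f] <= v`.
leq = [v for v in range(len(bin_names)) if b <= v]
print(leq)  # [2, 3]
```

If the row reaching that loop were still in raw units, the comparison would indeed mix scales, so checking whether the row was discretized beforehand is the first thing to verify.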
Is there a link for the data sources used in the notebooks?
Hello, I was experimenting with the provided jupyter notebook "Anchor for text.ipynb" and changed the example text "This is a good book ." to a much longer text (an actual book review)
For some reason though, the resulting anchor only shows one feature and nothing else seems to be considered. Is that the intended behavior, or are longer texts not suitable for anchor?
Hi Marco,
I'm trying to use the anchor_image module on python 3 for my master's thesis.
I have a problem with the "print" command. The parentheses are missing in all of the print calls.
Which package should I install, anchor or anchor_exp? Can they be installed together?
Hi Marco,
Although anchor is a great idea and outperforms LIME in experiments, why is LIME still so popular? What are anchor's drawbacks?
Does lime_anchor require the spacy model file to work, or can I put something in its place?
Hello,
I am trying to use Anchor to explain predictions on the Iris DataSet.
When I try to explain the index 50, it doesn't work.
But from index 57 onwards it starts working, then fails again. It's weird!
This is the code I run:
'''
from anchor import anchor_tabular
from sklearn import datasets
from xgboost import XGBClassifier
iris = datasets.load_iris()
X_clf = iris.data
y_clf = iris.target
XGB_clf = XGBClassifier()
XGB_clf.fit(X_clf, y_clf)
explainer = anchor_tabular.AnchorTabularExplainer(
['0','1','2'],
iris.feature_names,
X_clf,)
exp = explainer.explain_instance(X_clf[50], XGB_clf.predict, threshold=0.95)
'''
And I get the error:
IndexError Traceback (most recent call last)
in
---> 18 exp = explainer.explain_instance(X_clf[50], XGB_clf.predict, threshold=0.95)
c:\users\azizeac\venv\lib\site-packages\anchor\anchor_tabular.py in explain_instance(self, data_row, classifier_fn, threshold, delta, tau, batch_size, max_anchor_size, desired_label, beam_size, **kwargs)
280 desired_confidence=threshold, max_anchor_size=max_anchor_size,
281 **kwargs)
--> 282 self.add_names_to_exp(data_row, exp, mapping)
283 exp['instance'] = data_row
284 exp['prediction'] = classifier_fn(self.encoder_fn(data_row.reshape(1, -1)))[0]
c:\users\azizeac\venv\lib\site-packages\anchor\anchor_tabular.py in add_names_to_exp(self, data_row, hoeffding_exp, mapping)
310 if f in self.categorical_names:
311 v = int(v)
--> 312 if ('<' in self.categorical_names[f][v]
313 or '>' in self.categorical_names[f][v]):
314 fname = ''
IndexError: list index out of range
Hello Marco,
I had an IndexError with your code while I wanted to save the exp object. The exp object could be used to print for example exp.names, exp.precision etc.
This is the error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
I have trained the XGBoost model with a numpy.array and also used a numpy array as train and test set for the explainer. For the labels I used "1" for errors and "0" as OK instance.
I love the paper and am excited to try out the code.
When running the Anchor for Text notebook, I ran into an issue with data. You define:
def load_polarity(path='/home/marcotcr/phd/datasets/sentiment-sentences'):
and it's clear that the user has to change the path to the correct one on the machine the notebook is running on. However, I looked at the site where the data is from (# dataset from http://www.cs.cornell.edu/people/pabo/movie-review-data/), and it was hard to find the data you were using.
Once I figured it out, I updated the notebook a little to pull the data and extract it, so the notebook is more self-contained. Currently it is on Google Colab but I am happy to open a PR with the changed notebook. Or just a change in the Readme to add a link to this Colab notebook.
I'm trying to use anchor on my dataset, but I run into dependency issues. My environment currently uses sklearn 0.22, but when looking into the anchor code I see that an older version of sklearn is used.
The setup.py file shows that the package depends on numpy, scipy, spacy and scikit-learn but does not specify versions. Could you give the versions of the packages that you used?
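When comparing environments for a report like this, it can help to list the versions that are actually installed. A minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_versions` is hypothetical and the distribution names in the list are assumptions based on the packages mentioned on this page:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["numpy", "scipy", "spacy", "scikit-learn", "lime", "anchor-exp"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

The output from a working environment and a failing one can then be compared side by side.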