
sergioburdisso / pyss3


A Python package implementing a new interpretable machine learning model for text classification (with visualization tools for Explainable AI :octocat:)

Home Page: https://pyss3.readthedocs.io

License: MIT License

Languages: Python 81.36%, JavaScript 4.40%, HTML 14.06%, CSS 0.18%
Topics: xai, text-classification, machine-learning, machine-learning-algorithms, artificial-intelligence, explainable-artificial-intelligence, data-mining, early-classification, ss3-classifier, nlp

pyss3's People

Contributors

allcontributors[bot], hbaniecki, sergioburdisso


pyss3's Issues

Data loading issues while training

Hey,

[Note]: I have a pandas DataFrame containing two columns:

  1. Text
  2. Label

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data,
                                                    test_size=0.2,
                                                    shuffle=False)
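For reference, this is roughly what I am trying to do end to end (the column names Text/Label are mine; the .tolist() conversion is just my guess at what might be needed, since the README example passes plain lists to fit()):

import pandas as pd
from pyss3 import SS3

# toy stand-in for my real DataFrame
df = pd.DataFrame({"Text": ["doc one", "doc two", "doc three", "doc four"],
                   "Label": ["pos", "neg", "pos", "neg"]})

# convert the columns to plain Python lists before splitting / fitting
x_data = df["Text"].tolist()
y_data = df["Label"].tolist()

clf = SS3()
clf.fit(x_data, y_data)  # a list of documents and a list of labels, as in the README example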

The train() and fit() methods are not working.

Here is the reference code:

(screenshot of the reference code attached)

How can I fix it?

Thanks

[JOSS] comments on the paper

My comments on the software paper with respect to the JOSS submission:

  • In the abstract, you declare two useful tools and then say: "For instance, one of these tools provides (...)". Since there are only two tools, I'd suggest describing the other one as well. Additionally, instead of "for instance", I'd go with one sentence per tool. As it stands, I can only guess what the other functionality is.
  • The last sentence in the abstract is quite long, and I would consider breaking it into shorter pieces.
  • My understanding is that your contribution is the implementation of the SS3 algorithm. Therefore, I'd be happy to see a bit more detail about the algorithm to make the paper self-contained. Also, the title gives the impression that you are introducing a new model/algorithm, whereas I believe the contribution of this work is an implementation? This distinction is not very clear to me. Also, does it mean the SS3 algorithm was proposed without any implementation? That sounds a bit confusing.
  • Is the explanation tool model-specific, or does it work with any classification method? What is your exact contribution here (the explanation algorithm or the GUI)? If it's a model-agnostic explanation, perhaps it could be implemented in a different package?
  • github -> GitHub?
  • Since explanations of the models are not the primary contribution of this work (?), you could consider adding a reference to some work in this area.
  • Line 40: "On the other hand" doesn't contrast with anything, and there is definitely no "On the one hand". Maybe you could consider rephrasing this.
  • Footnote 2 has a spacing issue: there is no space between "ArXiv" and the bracket.

I ticked all the boxes for this part anyway. An exciting paper overall; I particularly like the examples. However, it's not clear what the exact contributions are: the model implementation and the explanation GUI (or algorithm)?

In reference to openjournals/joss-reviews#3934

Partial learn

Hi!
I have a dataset of 900k records with 800 categories, but I cannot train my model because 16 GB of RAM is not enough.
How can I train my model in parts?
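To make the question concrete, this is the kind of loop I have in mind (the incremental clf.learn(doc, label) call is an assumption on my side; I do not know whether pyss3 exposes something like it):

import pandas as pd
from pyss3 import SS3

clf = SS3()

# stream the 900k-record file in chunks so it never has to sit fully in RAM
for chunk in pd.read_csv("data.csv", usecols=["text", "label"], chunksize=50_000):
    for doc, label in zip(chunk["text"], chunk["label"]):
        clf.learn(doc, label)  # assumed incremental API; may not exist under this name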

[joss] software paper comments

In reference to openjournals/joss-reviews#3934. Hi, I hope these comments help improve the paper.

Comments

  1. The paper's title could use a change. It says "PySS3: A new interpretable and simple machine learning model for text classification", but the model is named "SS3" and does not seem to be new. The title of the repository seems more accurate, "A Python package implementing a new simple and interpretable model for text classification", but even then one could drop "new" and use the PyPI package's title, e.g. "PySS3: A Python package implementing the SS3 interpretable text classifier [with interactive/visualization tools for explainable AI]". Just an example to be considered.
  2. I would recommend that the authors highlight in the article the software's "interactive" aspects (explanation, analysis) and its (model/machine learning) "monitoring", as these seem both novel and increasingly discussed lately.
  3. In the end, it would be useful to release a stable version 1.0 of the package (on GitHub, PyPI) and mark that in the paper, e.g. in the Summary section.

Summary

  • L10. "implements novel machine learning model": it might not be seen as novel, given that the model was already published in 2019 and extended in 2020.
  • L11. Mentioning "two useful tools" without describing what the second one does seems off.

Statement of need
This part mainly discusses the need for open-source implementations of machine learning models. However, as I see it, the significant contributions of the software/paper, distinguishing it from previous work, are the Live_Test/Evaluation tools allowing for visual explanation and hyperparameter optimization. This could be further underlined.

State of the field
The paper lacks a brief discussion of packages in the field of interpretable and explainable machine learning. To that end, I suggest the authors reference/compare with the following software related to interactive explainability:

  1. Wexler et al. "The What-If Tool: Interactive Probing of Machine Learning Models" (IEEE TVCG, 2019) https://doi.org/10.1109/TVCG.2019.2934619
  2. Tenney et al. "The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models" (EMNLP, 2020) https://doi.org/10.18653/v1/2020.emnlp-demos.15
  3. Benjamin Hoover, Hendrik Strobelt, Sebastian Gehrmann. "exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models" (ACL, 2020) https://doi.org/10.18653/v1/2020.acl-demos.22
  4. [Ours] "dalex: Responsible Machine Learning with Interactive Explainability and Fairness in Python" (JMLR, 2021) https://www.jmlr.org/papers/v22/20-1473.html

Other possibly missing/useful references:

  1. Pedregosa et al. "Scikit-learn: Machine Learning in Python" (JMLR, 2011) https://www.jmlr.org/papers/v12/pedregosa11a.html
  2. Christoph Molnar "Interpretable Machine Learning - A Guide for Making Black Box Models Explainable" (book, 2018) https://christophm.github.io/interpretable-ml-book
  3. Cynthia Rudin "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead" (Nature Machine Intelligence, 2019) https://doi.org/10.1038/s42256-019-0048-x
  4. [Ours] "modelStudio: Interactive Studio with Explanations for ML Predictive Models" (JOSS, 2019) https://doi.org/10.21105/joss.01798

Implementation

  • L48 github -> GitHub
  • L54 "such as the one introduced later by the same authors" -> "by us" would be easier to read
  • L57 missing the citation of scikit-learn

Illustrative examples

  1. In the beginning, it lacks a brief description of the predictive task used for the example (dataset name, positive/negative text classification, etc.).
  2. Also, it could now be updated with the Dataset.load_from_url() function.

Conclusions
Again, I have doubts about calling the machine learning model "novel", as it has been previously published. It might be misunderstood as "introducing a novel machine learning model".

Multilabel Live Test

Hey @sergioburdisso,

I've noticed that you recently fixed the multilabel fit issue #6.
However, Live_Test.run(clf, X_test, y_test) still does not accept y_test as a List[List[str]].
It would be really great to have that.
If you don't have time, maybe I could submit a PR?
Olivier

Change of category name

Description

The category names are changed during the learning process, which results in a mismatch between predicted category names and true category names.

Example

from pyss3 import SS3

text = ["Document 1", "Document 2"]
groundtruth = ["Label 1", "Label 2"]

clf = SS3()
clf.fit(text, groundtruth)

y_pred = clf.predict(text)
print(y_pred)  # ["label 1", "label 2"] -- the labels come back lowercased

Explanation

During training, the category names are lowercased via .lower() here.
When calling .predict(), these modified labels are returned here.

Why is this a problem

When calling .predict() with the parameter labels=True (the default), the predicted category names have to be post-processed before a direct comparison with the true category names.
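For now, the workaround I use looks roughly like this (my own sketch, not part of pyss3):

# map the lowercased predictions back to the original label spelling before comparing
original_labels = ["Label 1", "Label 2"]
to_original = {label.lower(): label for label in original_labels}

y_pred = ["label 1", "label 2"]            # what clf.predict() returns today
y_pred = [to_original[p] for p in y_pred]  # ["Label 1", "Label 2"]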

Fix

Remove .lower() :)
However, I'm not entirely sure about the consequences for the rest of the project.

Initialization of sanction function

Hey @sergioburdisso,

As far as I understand the SS3 framework, there is an inconsistency between the initialization of SS3 and its documentation.
The documentation describes the parameter sn_m as the "method used to compute the sanction (sn) function [...]".

However, in the actual initialization, the only function that changes based on the sn_m parameter is the significance function (see here).

It would be great if you could have a look at it and tell me whether I'm wrong 😄

Best,
Florian

Set custom Confidence Vectors

Hi 👋
First of all, thanks for this amazing work!

I really like the transparency that the confidence vectors bring to the approach, and since they are bounded to [0, 1], they are also quite interpretable. With that in mind, I was wondering whether you plan to add a method to "tweak" these values manually?

For instance, even when a word is not significant in the available training data, an expert could, by analysing the results in a Live Test session, identify it as an important new word.

Since one can already extract cv values with clf.cv(ngram, cat), do you plan to add a method, for instance set_custom_cv(ngram, cat)?
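To illustrate, usage could look roughly like this (set_custom_cv and its exact signature are hypothetical, just to show the idea):

from pyss3 import SS3

clf = SS3()
clf.fit(["I love apples and bread", "stocks went up today"], ["food", "business"])

print(clf.cv("apples", "food"))  # existing getter for a learned confidence value

# hypothetical setter (does not exist yet), e.g. after an expert review
# in a Live Test session; the signature here is just a guess
clf.set_custom_cv("apples", "food", 0.9)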

Thanks.

Substantial print pollution when optimising hyperparameters

Hello,
while trying to optimise hyperparameters, I found that there was little I could do to avoid a large printout to stdout.

I tried to mute this by doing the following:

import pyss3
pyss3.Print.set_quiet(True)

but tqdm also produces progress bars, so I propose propagating
Print.__quiet__ to tqdm's disable parameter, for example:

tqdm(..., disable=Print.__quiet__)
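Concretely, the pattern I have in mind is something like this (a sketch of the proposal, not current pyss3 code):

from tqdm import tqdm
from pyss3 import Print

Print.set_quiet(True)  # silences pyss3's own printing

# proposal: every internal progress bar honours the same flag
for _ in tqdm(range(1000), disable=Print.__quiet__):
    pass  # ...the actual work...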

Thanks :)
Keep up the good work!

Multilabel Classification Evaluation

Hey @sergioburdisso,

Thank you for this awesome project!
Currently, the Evaluation class only supports single-label classification, even though SS3 inherently supports multilabel classification.
These are the steps (as I see them) needed to support multilabel classification evaluation; a rough sketch of the binarization/metric part follows the list:

  • Take the output of classify_multilabel
  • Convert the result to a binarized vector (same length as the confidence vector)
  • Implement multilabel classification metrics (e.g. Hamming loss)
  • Adapt the grid search
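Using scikit-learn for the tooling (my assumption, not something Evaluation does today), the binarization and metric steps could look roughly like this:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import hamming_loss

# y_true / y_pred as lists of label lists (e.g. what classify_multilabel gives us)
y_true = [["sports", "politics"], ["tech"], ["tech", "sports"]]
y_pred = [["sports"], ["tech", "politics"], ["tech", "sports"]]

mlb = MultiLabelBinarizer()
mlb.fit(y_true + y_pred)             # fix a common label space

Y_true = mlb.transform(y_true)       # binarized indicator vectors
Y_pred = mlb.transform(y_pred)

print(hamming_loss(Y_true, Y_pred))  # example multilabel metric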

Error in Live_test

I'm getting a "list index out of range" error and I'm not sure what happened here. I'm using the latest build as of posting (I installed it just prior to using it here), and my Python version is 3.6 if I remember correctly.
(screenshot of the traceback attached)

EDIT: I don't know why, but restarting the kernel fixed it.

Division by 0

I am eager to use the SS3 classifier for a text classification task in my master's thesis.
Unfortunately, when I run it I get a division-by-zero error message (see the screenshots below). My text seems fairly clean to me (although not yet cleaned exactly the right way), so I am not sure what is causing this.

Is there anything you suspect might be going wrong that I could try? Or is there anywhere the data requirements are listed (I've looked, but maybe I've overlooked it)?

I included the data structure (a pandas Series), a sample of what my data looks like, and the error.

Many thanks!
(screenshots: data structure, sample data, and the error traceback)

Multilabel Classification Dataset Loading

Hey @sergioburdisso,

For multilabel classification, the file structure described in the topic categorization tutorial is not efficient, since text associated with multiple labels has to be stored in multiple files.
My current approach is to write the texts to one file, line by line, and the respective labels to another file, also line by line.

# Writing the data
dataset = {"Text 1": ["label1", "label2"],
           "Text 2": ["label2", "label3"],
           "Text 3": ["label1"]}

for text, labels in dataset.items():
    # one document per line
    with open('text.txt', 'a+') as text_file:
        text_file.write(text + '\n')

    # matching labels on the same line number, ";"-separated
    with open('labels.txt', 'a+') as label_file:
        label_file.write(';'.join(labels) + '\n')

The result is the following:

# cat text.txt
Text 1
Text 2
Text 3

# cat labels.txt
label1;label2
label2;label3
label1

It would be great if util.Dataset.load_from_files could be adjusted to also support this!
But I'm also open to other suggestions on how to tackle that problem :)
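For reference, this is roughly how I read the two files back at the moment (my own helper, not part of pyss3):

def load_multilabel(text_path, labels_path, sep=";"):
    """Read one document per line from text_path and the matching
    sep-separated label list from labels_path."""
    with open(text_path) as tf, open(labels_path) as lf:
        x = [line.rstrip("\n") for line in tf]
        y = [line.rstrip("\n").split(sep) for line in lf]
    return x, y

x_data, y_data = load_multilabel("text.txt", "labels.txt")
# x_data == ["Text 1", "Text 2", "Text 3"]
# y_data == [["label1", "label2"], ["label2", "label3"], ["label1"]]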

Thanks for your hard work!

Custom metrics for evaluation

Hi!
A way to pass a custom scorer function (e.g. one built with sklearn's make_scorer) to Evaluation would make pyss3 even greater.

Any plans on this?
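To make it concrete, something along these lines is what I mean (the scoring argument on Evaluation.test is hypothetical, just to illustrate the hook):

from sklearn.metrics import make_scorer, f1_score
from pyss3 import SS3
from pyss3.util import Evaluation

clf = SS3()
clf.fit(["a good movie", "a bad movie"], ["pos", "neg"])

macro_f1 = make_scorer(f1_score, average="macro")  # standard sklearn scorer

# hypothetical hook (does not exist today): let Evaluation use the custom scorer
Evaluation.test(clf, ["what a nice movie"], ["pos"], scoring=macro_f1)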

This is a very interesting project.
Thank you!

Custom preprocessing in Live Test

@sergioburdisso
It would be a great feature to have custom preprocessing in the Live Test.
This would enable us to visually understand which words, sentences, and paragraphs helped the model classify a particular document after custom preprocessing has been applied.
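Something like the following is what I have in mind (the prep_func argument is purely hypothetical; only Live_Test.run(clf, x_test, y_test) exists today):

import re
from pyss3 import SS3
from pyss3.server import Live_Test

def my_prep(doc):
    """Example custom preprocessing: lowercase and strip URLs."""
    return re.sub(r"https?://\S+", " ", doc.lower())

clf = SS3()
clf.fit(["a great film http://example.com", "a terrible film"], ["pos", "neg"])

# prep_func is the hypothetical hook this request is about
Live_Test.run(clf, ["an ok film http://example.com"], ["pos"], prep_func=my_prep)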

[joss] feature request: accessible utility to import a dataset

openjournals/joss-reviews#3934

This package has good documentation. Going through the examples, I came up with a feature request that would greatly benefit newcomers and code prototyping.

I'd like the first example in the README to be straightforward and copy-paste ready, which is not the case here (looking at the missing code ...).

How about implementing some import_dataset(url) / download(url) functionality in utils or Dataset that would, for example, download the dataset .zip file and unpack it (sample code), so that one can load the data in example code like this:

from pyss3 import SS3
from pyss3.util import Dataset  # Dataset lives in pyss3.util

Dataset.import_dataset("https://github.com/sergioburdisso/pyss3/blob/master/examples/datasets/movie_review.zip")  # proposed function
x_train, y_train = Dataset.load_from_files("movie_review/train")
x_test, y_test = Dataset.load_from_files("movie_review/test")

clf = SS3()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

Implementation details and naming may vary, but it would be nice to easily run code from README.
