Cherry - Text classification with no machine learning knowledge needed
Cherry | Windson |
---|---|
Download | https://pypi.python.org/pypi/cherry |
Source | https://github.com/Windsooon/cherry |
Keywords | text classification |
Text classification in 5 minutes, no machine learning knowledge needed. We also provide extra features for users who want to improve their model.
Cherry has three built in English models, you can use them before use your dataset. For more info, checkout Built in model.
Cherry provide performence() and display() api to help you debug and improve the model.
- Python (3.6, 3.7, 3.8)
Install using pip
pip install cherry
# Cherry use nltk for text tokenizer
pip install nltk
# After install nltk, You need to download punkt
>>> import nltk
>>> nltk.download('punkt')
or clone the project from github.
git clone [email protected]:Windsooon/cherry.git
-
These datasets contain 11,315 news. they were organized into 20 different newsgroups, each corresponding to one of the below topic:
- alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware - comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos - rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt - sci.electronics, sci.med, sci.space, soc.religion.christian - talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc
-
These datasets contain 108,463 reviews from the Goodreads book review website, Every book review also has rating from 0 point to 5 points.
-
These datasets contain 5,578 SMS messages manually extracted from the Grumbletext Web site and randomly chosen ham messages of the NUS SMS Corpus (NSC).
Cherry has three built in models, email
, review
and newsgroups
. For instance, in the Comics & Graphic book review datasets, every book review also has rating from 0 point to 5 points. If you want to predict rating based on the book review:
This is an extremely entertaining and often insightful collection by Nobel physicist Richard Feynman drawn from slices of his life experiences. Some might believe that the telling of a physicist’s life would be droll fare for anyone other than a fellow scientist, but in this instance, nothing could be further from the truth.
After finish Installation, in your project path run cherry.train('review')
# You only need to run this line of code at the first time.
# This line of code will:
# 1. Download `review` datasets from remote server (User in China may need use VPN)
# 2. Train datasets using default settings ([Countvectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html))
>>> cherry.train('review')
Then you can use classify()
to predict the rating now.
>>> res = cherry.classify('review', text='This is an extremely entertaining and often
insightful collection by Nobel physicist Richard Feynman drawn from slices of his life
experiences. Some might believe that the telling of a physicist’s life would be droll
fare for anyone other than a fellow scientist, but in this instance, nothing could be
further from the truth.')
The return res
is a Classify object has two built-in method. get_probability()
will return an array contains the probability of each category. The order of the return array depend on category name, in this case would be 0, 1, 2, 3, 4.
# The probability of this review had been rating as 4 points is 99.6%
>>> res.get_probability()
array([[6.99908424e-11, 2.48677319e-11, 6.17978214e-06, 3.39472694e-03,
9.96313288e-01, 2.85805135e-04]])
Another method get_word_list()
return a list that contains words that Cherry use for classifying.
>>> res.get_word_list()
[[(2, 'physicist'), (2, 'life'), (1, 'truth'), (1, 'telling'), (1, 'slices'), (1, 'scientist'), (1, 'richard'), (1, 'nobel'), (1, 'instance'), (1, 'insightful'), (1, 'feynman'), (1, 'fellow'), (1, 'fare'), (1, 'extremely'), (1, 'experiences'), (1, 'entertaining'), (1, 'droll'), (1, 'drawn'), (1, 'collection'), (1, 'believe')]]
As you can see, some of the words in the review didin't show up here. There are two reasons for this 1) The training data didn't contain that word. For instance, The word Backend
and Engineer
never show up in training data. 2) the word is a stop word. 'you', 'your' are stop words by default, you can find all stop words Cherry use at here.
Create a folder your_model_name
under datasets in project path
├── project path
│ ├── datasets
| │ ├── your_model_name
| │ │ ├── category1
| | │ ├── file_1
| | │ ├── file_2
| | │ ├── …
| │ │ ├── category2
| | │ ├── file_10
| | │ ├── file_11
| | │ ├── …
Train you datasets:
# By default, encoding will be utf-8,
# You only need to run `train` at the first time
>>> cherry.train('your_model_name', encoding='your_encoding')
# Classify text, `text` can be a list of text too.
>>> res = cherry.classify('your_model_name', text='text to be classified')
Let's build an email classifier from sketch, cherry will use this model to predict an email is spam or not.
mkdir tutorial
cd tutorial
# Create a virtual environment to isolate our package dependencies locally
python3 -m venv env
source env/bin/activate # On Windows use `env\Scripts\activate`
# Install cherry and nltk
pip install cherry
pip install nltk
>>> import nltk
>>> nltk.download('punkt')
# Create a new folder for email dataset
mkdir -p datasets/email_tutorial
-
Download the datasets from SMS Spam Collection v. 1 then unzip it and put it inside
tutorial/datasets/email_tutorial
folder, now you got a file namedSMSSpamCollection.txt
which contains lots of emails. -
Create a folder name
ham
andspam
insideemail_tutorial
dir. -
Create a script
email.py
in the same folder using code below to extract the email content and group them by category. every file would only contain text.import os import json ham_counter = 0 spam_counter = 0 with open('SMSSpamCollection.txt', 'r') as f: for line in f.readlines(): if line.startswith('ham'): ham_counter += 1 with open(os.path.join('ham', str(ham_counter)), 'w') as nf: _, text = line.split('ham', 1) nf.write(text.strip()) else: spam_counter += 1 with open(os.path.join('spam', str(spam_counter)), 'w') as nf: _, text = line.split('spam', 1) nf.write(text.strip())
-
Now your folder structure should look like this:
tutorial ├── dataset │ ├── email_tutorial | | ├── email.py | | ├── SMSSpamCollection.txt │ │ ├── ham │ │ ├── spam
-
Run
python email.py
-
Delete
SMSSpamCollection.txt
andemail.py
-
Back to the path of
tutorial
, Likecd path_to/tutorial
-
Train the email model:
>>> import cherry >>> cherry.train('email_tutorial', encoding='latin1')
-
Inside
email_tutorial
folder you can findclf.pkz
,ve.pkz
,email_tutorial.pkz
which Cherry will use them for classify later.>>> res = cherry.classify('email_tutorial', 'Thank you for your interest in cherry! We wanted to let you' 'know we received your application for Backend Engineer, and we are delighted that you' 'would consider joining our team.') # 99.9% is a ham email >>> res.get_probability() array([[9.99985571e-01, 1.44288379e-05]]) >>> res.get_word_list() [[(1, 'wanted'), (1, 'thank'), (1, 'team'), (1, 'received'), (1, 'let'), (1, 'joining'), (1, 'consider'), (1, 'application')]]
-
If you want to know good your model did, you can use performance() which will use k-fold cross validation (By default, K equals to 10):
>>> res = cherry.performance('email_tutorial', encoding='latin1', output='files') >>> res.get_score()
The report will be save in
report
files, you can find the precision, recall, and f1-score.precision recall f1-score support 0 0.99 1.00 0.99 485 1 0.97 0.95 0.96 73 accuracy 0.99 558 macro avg 0.98 0.97 0.98 558
weighted avg 0.99 0.99 0.99 558
If you want to know which text had been clasiify wrong:
>>> res = cherry.performance('email_tutorial', encoding='latin1') >>> res.get_score() Text: Dhoni have luck to win some big title.so we will win:) has been classified as: 1 should be: 0 Text: Back 2 work 2morro half term over! Can U C me 2nite 4 some sexy passion B4 I have 2 go back? Chat NOW 09099726481 Luv DENA Calls £1/minMobsmoreLKPOBOX177HP51FL has been classified as: 0 should be: 1 Text: Latest News! Police station toilet stolen, cops have nothing to go on! has been classified as: 0 should be: 1 ...
-
To display the graph, you can use
>>> res.display('email_tutorial', encoding='latin1')
-
If you want to improve your model, you can use search method.
>>> parameters = {'clf__alpha': [0.1, 0.5, 1],'clf__fit_prior': [True, False]} >>> cherry.search('email_tutorial', parameters)
def train(model, language='English', preprocessing=None, categories=None, encoding='utf-8', vectorizer=None, vectorizer_method='Count', clf=None, clf_method='MNB', x_data=None, y_data=None)
-
model (String)
The name of the model, you can use build-in models
email
,review
andnewsgroups
, or pass the folder name of your dataset. -
language (String)
The language of the training dataset. Cherry supports
English
andChinese
. -
preprocessing (function)
The function will be called once for every input data before training.
-
categories (List)
Specify the training directory, for instance ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc'].
-
encoding (String)
The encoding of the dataset.
-
vectorizer (Sklearn object)
Feature extraction function use to convert the data into vertcor,by default is
CountVectorizer()
. you can pass different feature extraction function from Sklearn.For some long texts you can use
TfidfVectorizer()
,If you need to save memory you can useHashingVectorizer()
, (get_word_list() function wouldn't work at this case) -
vectorizer_method (String)
Cherry supports shortcut to set up feature extraction function when
vectorizer
isNone
.Count
corresponds toCountVectorizer(tokenizer=tokenizer, stop_words=get_stop_words(model))
,Tfidf
corresponds toTfidfVectorizer
andHashing
corresponds toHashingVectorizer
. -
clf (Sklearn object)
Classify function, by default is
MultinomialNB()
. You can pass classify function from Sklearn. -
clf_method (String)
Cherry supports shortcut to set up classify function when
clf
isNone
,MNB
corresponds toMultinomialNB(alpha=0.1)
,SGD
corresponds toSGDClassifier
,RandomForest
corresponds toRandomForestClassifier
,AdaBoost
corresponds toAdaBoostClassifier
. -
x_data (numpy array)
training text data, if
x_data
andy_data
is None, cherry will try to find the text files data inmodel
-
y_data (numpy array)
correspond labels data, if
x_data
andy_data
is None, cherry will try to find the text files data inmodel
-
model (String)
The name of the model, you can use build-in models
email
,review
andnewsgroups
, or pass the folder name of your dataset. -
text (List / String)
the text to be classify.
def performance(model, language='English', preprocessing=None, categories=None, encoding='utf-8', vectorizer=None, vectorizer_method='Count', clf=None, clf_method='MNB', x_data=None, y_data=None, n_splits=10, output='Stdout')
Just as same as train()
API
-
n_splits (Integer)
number of folds. Must be at least 2.
-
output ('Stdout' or 'Files')
'Stdout' will print the scores to standerd output and 'Files' will store the scores into a local file named 'report'.
def search(model, parameters, language='English', preprocessing=None, categories=None, encoding='utf-8', vectorizer=None, vectorizer_method='Count', clf=None, clf_method='MNB', x_data=None, y_data=None, method='RandomizedSearchCV', cv=3, n_jobs=-1):
def display(model, language='English', preprocessing=None, categories=None, encoding='utf-8', vectorizer=None, vectorizer_method='Count', clf=None, clf_method='MNB', x_data=None, y_data=None)
Just as same as train()
API
>>> python runtests.py