Git Product home page Git Product logo

gzip-classifier's Introduction

Gzip Classifier

A python implementation of a gzip-based text classification system based on the algorithm described in "Less is More: Parameter-Free Text Classification with Gzip" by Zhiying Jiang, Matthew Y.R. Yang, Mikhail Tsirlin, Raphael Tang, and Jimmy Lin.

This project is in active development. Please don't get mad if it doesn't work.

Installation

pip install git+https://github.com/Sonictherocketman/gzip-classifier

Installation via PyPi is coming soon.

There are no dependencies! gzip-classifier uses only the Python Standard Library.

Usage

There are two steps needed to use the classifier. First you need to train it using some pre-existing data. This data must be classified into a set of discrete labels which will be used to classify new data once the model is trained.

Once the model is trained it can be either exported and saved for later use OR it can be used immediately.

If you have any further questions, please refer to the tests for examples on using the classifier.

Training the Model

The training process requires two sets of data: a list of sample text snippets and a list of the labels that correspond to those snippets.

Consider this sample dataset:

label, text
-----------
science,      Physicists discover new spacetime geometry
arts-culture, Famous author releases new book
world,        War in <country> is now over!

These are fed into the classifier as shown here:

from gzip_classifier import Classifier

training_data = [...]
labels = [...]

classifier = Classifier()
classifier.train(training_data, labels)

# OR

with ParallelClassifier(processes=4) as classifier:
    classifier.train(training_data, labels)

Using the Model

Once trained, the model can be used to classify new text.

For insight on the k parameter, please refer to this Wikipedia article. In short, the value depends on your data. Try playing around with integer values greater than one to see what best fits your data. I'm sure there are ways to determine the best value for this, but I don't know enough about it to tell you. I just played with it.

classifier.classify('some new text', k=20)
>> 'a label'

You can also do bulk classification using the following method:

samples = [
    'some new text',
    'more new text',
]

for label in classifier.classify_bulk(samples, k=20)
    print(label)

>> 'a label'
>> 'another label'

Improving Performance

The gzip classifier is highly parallelizable and so included in the package is a ParallelClassifier that uses Python's built in multiprocessing pool to delegate training and classifying samples across cores.

The parallel classifier is best used as a context manager:

with ParallelClassifier(processes=4) as classifier:
    classifier.train(training_data, labels)

    sample = 'some text'
    classifier.classify(sample, k=10)

Exporting & Importing the Model

Once trained, the model can be exported and saved for later use.

# Normal export

model_contents = classifier.model
with open('model.txt', 'w') as f:
    f.write(model_contents)

# Compressed export

model_contents = classifier.compact_model
with open('model.gz', 'wb') as f:
    f.write(model_contents)

You can easily load a model from disk like this:

# Normal import

with open('model.txt', 'r') as f:
    classifier = Classifier.using_model(f.read())

# Compressed import

model_contents = classifier.compact_model
with open('model.gz', 'rb') as f:
    classifier = Classifier.using_compact_model(f.read())

Testing

Run the tests using the following command.

python -m pytest tests/

Contributing

Contributions are welcome. Just don't be a jerk.

License

This code is released under the MIT license. Use it for whatever.

gzip-classifier's People

Contributors

sonictherocketman avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.