LibShortText is an open source library for short-text classification
(http://www.csie.ntu.edu.tw/~cjlin/libshorttext). Please read the COPYRIGHT file
before using LibShortText.

To get started, please read the ``Quick Start'' section first.  

For developers, please check our document at 
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/doc/ for integrating
LibShortText in your software.

Table of Contents
=================

- Installation and Data Format
- Quick Start
- Command-line Usage
- More Examples about Command-line Usage 
- Interactive Error Analysis 
- Additional Information


Installation and Data Format
============================

LibShortText requires a UNIX system with Python 2.6 or newer.
Python 2.7 is recommended for better efficiency.

On Unix systems, type

    $ make

to install the package. For training and test data, every line in the file
contains a label and a short text in the following format:

    <label><TAB><text>

A TAB character is between <label> and <text>. Both the label and the text can 
contain space characters. Here are some examples.

    Jewelry & Watches	handcrafted two strand multi color bead necklace
    Books	big bike magazine february 1973

Two sample sets included in this package are `train_file' and `test_file'.
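
As a minimal sketch of the data format described above, the following Python
snippet splits each line at the first TAB only, so that spaces inside the label
or the text are preserved. The helper name `parse_line' is hypothetical and is
not part of LibShortText.

```python
def parse_line(line):
    """Split one `<label><TAB><text>' line into (label, text)."""
    # Split on the first TAB only; spaces inside the label or text are kept.
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text


sample = [
    "Jewelry & Watches\thandcrafted two strand multi color bead necklace\n",
    "Books\tbig bike magazine february 1973\n",
]
pairs = [parse_line(line) for line in sample]
for label, text in pairs:
    print(repr(label), "->", repr(text))
```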

Quick Start
===========

To run a demonstration, type

    $ cd demo
    $ ./demo.sh

LibShortText provides a simple training-prediction workflow:

short texts ============> model ==============> predictions
            text-train.py       text-predict.py

The command `text-train.py' trains a model from a set of short texts. For
example, the following command generates `train_file.model' from the
given `train_file'.

    $ python text-train.py train_file
    [output skipped]

`text-predict.py' predicts a test file using the trained model. For example, the
following command predicts `test_file' with `train_file.model' and stores the
results in `predict_result'.

    $ python text-predict.py test_file train_file.model predict_result
    Accuracy = 87.1800% (4359/5000)

Once predict_result is obtained, LibShortText provides several handy utilities 
to conduct error analysis in the Python interactive shell. Please see
the section `Interactive Error Analysis' for more details. 


Command-line Usage
==================
            
-`text-train.py' Usage

    `text-train.py' obtains a model by training either a short-text dataset
    or a LIBSVM-format data set generated by `text2svm.py'.

    Usage: text-train.py [options] training_file [model]
    
    options: 
        -P {0|1|2|3|4|5|6|7|converter_directory}
            Preprocessor options. The options include stopword removal,
            stemming, and bigram. (default 1)
            0   no stopword removal, no stemming, unigram
            1   no stopword removal, no stemming, bigram
            2   no stopword removal, stemming, unigram
            3   no stopword removal, stemming, bigram
            4   stopword removal, no stemming, unigram
            5   stopword removal, no stemming, bigram
            6   stopword removal, stemming, unigram
            7   stopword removal, stemming, bigram
            If a preprocessor directory is given instead, then it is assumed
            that the training data is already in LIBSVM format. The preprocessor
            will be included in the model for use at test time.
        -G {0|1}
            Grid search for the parameter C in linear classifiers. (default 0)
            0   disable grid search (faster)
            1   enable grid search (slightly better results)
        -F {0|1|2|3}
            Feature representation. (default 0)
            0   binary feature
            1   word count 
            2   term frequency
            3   TF-IDF (term frequency + IDF)
        -N {0|1}
            Instance-wise normalization before training/test.
            (default 1 to conduct normalization)
        -A extra_svm_file
            Append extra LIBSVM-format data. This option can be given several
            times if more than one extra LIBSVM-format data set needs to be appended.
        -L {0|1|2|3}
            Classifier. (default 0)
            0   support vector classification by Crammer and Singer
            1   L1-loss support vector classification
            2   L2-loss support vector classification
            3   logistic regression
        -f
            Overwrite the existing model file.
    Examples:
        text-train.py -L 3 -F 1 -N 1 raw_text_file model_file
        text-train.py -P text2svm_converter -L 1 converted_svm_file
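
The four `-F' feature representations listed above can be illustrated on a toy
example. This is a sketch under stated assumptions, not LibShortText's own
code: term frequency is taken to be count divided by text length, and IDF is
taken to be log(number of texts / number of texts containing the token). The
library's exact formulas may differ, and the function name `represent' is
hypothetical.

```python
import math
from collections import Counter


def represent(tokens, mode, doc_freq=None, n_docs=None):
    """Return a {token: value} dict for one text.

    mode mirrors the -F option numbers: 0 binary, 1 word count,
    2 term frequency, 3 TF-IDF. The TF and IDF formulas are assumptions.
    """
    counts = Counter(tokens)
    if mode == 0:
        return {t: 1.0 for t in counts}
    if mode == 1:
        return {t: float(c) for t, c in counts.items()}
    total = sum(counts.values())
    tf = {t: c / total for t, c in counts.items()}
    if mode == 2:
        return tf
    if mode == 3:
        # Assumed IDF: log(number of texts / texts containing the token).
        return {t: v * math.log(n_docs / doc_freq[t]) for t, v in tf.items()}
    raise ValueError("mode must be 0, 1, 2, or 3")


# Two toy texts, already split into tokens.
docs = [["big", "bike", "magazine", "big"], ["bike", "necklace"]]
doc_freq = Counter(t for d in docs for t in set(d))
for mode in range(4):
    print(mode, represent(docs[0], mode, doc_freq, len(docs)))
```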

-`text-predict.py' Usage

    `text-predict.py' predicts labels for a test dataset with a trained model. 

    Usage: text-predict.py [options] test_file model output
    
    options:
        -f
            Overwrite the existing output file.
        -a {0|1}
            Output options. (default 1)
            0   Store only predicted labels. The information is NOT sufficient 
                for interactive analysis. Use this option if you would like to get 
                only accuracy.
            1   More information is stored. The output provides information for 
                interactive analysis, but the size of output can become much larger.
        -A extra_svm_file
            Append extra LIBSVM-format data. This option can be given several
            times if more than one extra LIBSVM-format data set needs to be appended.

-`text2svm.py' Usage

    `text2svm.py' generates a directory containing needed information for
    converting short texts to LIBSVM format. An output file in LIBSVM format is
    also generated.

    Usage: text2svm.py [options] text_src [output]
    
    options:
        -P {0|1|2|3|4|5|6|7}
            Preprocessor options. The options include stopword removal,
            stemming, and bigram. (default 1)
            0   no stopword removal, no stemming, unigram
            1   no stopword removal, no stemming, bigram
            2   no stopword removal, stemming, unigram
            3   no stopword removal, stemming, bigram
            4   stopword removal, no stemming, unigram
            5   stopword removal, no stemming, bigram
            6   stopword removal, stemming, unigram
            7   stopword removal, stemming, bigram
    The default output is a file "text_src.svm" and a directory
    "text_src.text_converter". If "output" is specified, the outputs are
    "output" and "output.text_converter".

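The eight `-P' values in the option tables above follow a bit pattern: 4 stands
for stopword removal, 2 for stemming, and 1 for bigram. The sketch below decodes
a value on that basis. It is an observation drawn from the tables rather than a
documented interface, and the function name `decode_preprocessor_option' is
hypothetical.

```python
def decode_preprocessor_option(p):
    """Decode a -P value (0-7) into its three switches."""
    if not 0 <= p <= 7:
        raise ValueError("-P must be between 0 and 7")
    return {
        "stopword_removal": bool(p & 4),
        "stemming": bool(p & 2),
        "bigram": bool(p & 1),
    }


for p in range(8):
    print(p, decode_preprocessor_option(p))
```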

More Examples about Command-line Usage 
======================================

We use the following questions/answers to demonstrate some examples.

Q: `text-train.py' provides many parameters. How should I choose them for a
   first trial?
A: Although `text-train.py' has several parameters to tune, the defaults were
   chosen carefully based on a study of short-text classification [2].
   Running `text-train.py' without options generally gives good
   classification accuracy. It is equivalent to the following
   command, in which the default parameters are explicitly specified.

   $ python text-train.py -P 1 -G 0 -F 0 -N 1 -L 0 train_file
 
   Meaning of each parameter:

   -P 1: no stemming, no stopword removal, bigram features
   -G 0: no LIBLINEAR parameter selection
   -F 0: binary feature representation
   -N 1: each instance is normalized to unit length
   -L 0: use Crammer and Singer's multi-class method. 
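
   The defaults `-P 1 -F 0 -N 1' can be pictured with a small sketch: binary
   unigram and bigram features, scaled so that the vector has unit length. It
   assumes that "unit length" means Euclidean (L2) norm and that bigrams are
   pairs of adjacent words; the helper name is hypothetical.

```python
import math


def normalize_to_unit_length(features):
    """Scale a sparse {feature: value} vector so its Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in features.values()))
    if norm == 0.0:
        return dict(features)  # an empty or all-zero vector is left unchanged
    return {f: v / norm for f, v in features.items()}


# Binary unigram and bigram features for a four-word text (as with -P 1 -F 0).
tokens = "big bike magazine february".split()
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
binary = {t: 1.0 for t in tokens + bigrams}
unit = normalize_to_unit_length(binary)
print(len(unit), round(unit["big bike"], 4))
```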

Q: How to select the parameter C in LIBLINEAR automatically?
A: By default, LIBLINEAR (and `text-train.py') sets the parameter C to 1. 
   You can automatically select the best parameter C by using `-G 1'.

Q: How to generate different models using the same training data?
A: Internally, text-train.py converts data to LIBSVM format and applies 
   LIBLINEAR for training. To reuse the pre-processed data, LibShortText 
   provides another workflow:

short texts ==========> LIBSVM format data ============> model ==============> result
            text2svm.py                    text-train.py       text-predict.py

   The following command generates a LIBSVM-format file `train_file.svm' and a directory
   `train_file.text_converter' containing information for the conversion.

   $ python text2svm.py train_file 
   [`train_file.text_converter' and `train_file.svm' are generated.]

   We then generate two models using the same LIBSVM-format file.

   $ python text-train.py -P train_file.text_converter -L 3 train_file.svm lr.model
   [A logistic regression model, `lr.model', is generated.]

   $ python text-train.py -P train_file.text_converter -L 2 train_file.svm l2svm.model
   [An L2-loss linear SVM model, `l2svm.model', is generated.]
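
   When several models are trained from the same converted data, the commands
   can be assembled in a loop. The sketch below only builds and prints the
   argument lists, using the options shown above; the helper name
   `train_command' is hypothetical.

```python
import sys


def train_command(converter_dir, svm_file, classifier, model_file):
    """Build the argument list for one text-train.py run on converted data."""
    return [
        sys.executable, "text-train.py",
        "-P", converter_dir,
        "-L", str(classifier),
        svm_file, model_file,
    ]


# -L 3 is logistic regression and -L 2 is L2-loss SVC, as in the usage text.
jobs = {"lr.model": 3, "l2svm.model": 2}
commands = [
    train_command("train_file.text_converter", "train_file.svm", solver, model)
    for model, solver in jobs.items()
]
for cmd in commands:
    print(" ".join(cmd[1:]))
    # To run it from the LibShortText directory: subprocess.run(cmd, check=True)
```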

Q: How to overwrite existing models or prediction results?
A: If the specified model or output file already exists, by default neither
   `text-train.py' nor `text-predict.py' overwrites it. Specify `-f' to
   overwrite the existing file.
   
   $ python text-train.py -f train_file
   $ python text-predict.py -f test_file train_file.model predict_result

Q: Why is the file of prediction results so large?
A: By default, some additional information for analysis is stored. If you
   need only the classification accuracy, you can specify `-a 0' to save disk
   space. For example,

   $ python text-predict.py -a 0 test_file train_file.model predict_result

Q: I am an experienced LIBLINEAR user. How do I specify options
   for LIBLINEAR and `grid.py'?
A: For LIBLINEAR, pass the parameters in a double-quoted string after `-L',
   prefixed with the special character `@'. For example, if you want to
   use L2-regularized Logistic Regression as the classifier, set the parameter
   C to 0.5, and append a bias term to each instance, you can type

   $ python text-train.py -L @"-s 3 -c 0.5 -B 1" train_file

   To show parameters provided by LIBLINEAR/grid, use

   $ python text-train.py -x liblinear
   $ python text-train.py -x grid

   For `grid.py', specify the range of C with `-G @"-log2c begin,end,step"'.
   For example, the following command selects the best C among
   [2^-2, 2^-1, 2^0, 2^1] by cross-validation accuracy.

   $ python text-train.py -G @"-log2c -2,1,1" train_file
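
   The range notation can be expanded as in the sketch below. It assumes that
   `end' is inclusive, which is what the example above implies; the function
   name `log2c_grid' is hypothetical and this is not the code of `grid.py'.

```python
def log2c_grid(begin, end, step):
    """Expand `-log2c begin,end,step' into the list of candidate C values."""
    if step <= 0:
        raise ValueError("this sketch only handles a positive step")
    values = []
    exponent = begin
    while exponent <= end:
        values.append(2.0 ** exponent)
        exponent += step
    return values


print(log2c_grid(-2, 1, 1))  # prints [0.25, 0.5, 1.0, 2.0]
```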

Q: I have more features for the texts. How can I add them in LibShortText?
A: You can use the `-A' option in `text2svm.py', `text-train.py', and
   `text-predict.py' to append feature files, and the option can be given more
   than once. If we have 20 features stored in two files, `train_feats1' and
   `train_feats2', then we can use these files in the training stage by

   $ python text-train.py -A train_feats1 -A train_feats2 train_file

   The features you use in the training stage should be identical to those in
   the predict stage. Assume that `test_feats1' and `test_feats2' are feature
   files corresponding to `train_feats1' and `train_feats2', respectively. To
   predict a test file you should use

   $ python text-predict.py -A test_feats1 -A test_feats2 test_file train_file.model predict_result

   The analyzer is used in the same way as before. The extra features are
   represented in the following format.

   <feat_filename>:<feat_idx>

Q: I already have some LIBSVM-format features. How can I include these
   features when training the model?
A: You can use the -A option in the command line mode. For example, if you have
   two extra svm files `extra_train_1' and `extra_train_2' in LIBSVM-format, 
   then use:
   
   $ python text-train.py -A extra_train_1 -A extra_train_2 train_file
   
   Note that `train_file', `extra_train_1', and `extra_train_2' must
   have the same number of instances. Then use the following command to
   predict:

   $ python text-predict.py -A extra_test_1 -A extra_test_2 test_file train_file.model predict_result
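
   Before appending extra files, it can help to confirm that they line up with
   the text file. The sketch below parses LIBSVM-format lines
   (`label index:value ...') and checks that every extra file has one line per
   instance. The helper names and the feature values are made up for
   illustration.

```python
def parse_libsvm_line(line):
    """Parse `label index:value index:value ...' into (label, {index: value})."""
    parts = line.split()
    label = parts[0]
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def check_same_length(text_lines, *extra_files):
    """Raise ValueError unless every extra file has one line per text line."""
    n = len(text_lines)
    for i, extra in enumerate(extra_files, start=1):
        if len(extra) != n:
            raise ValueError(
                "extra file %d has %d instances, expected %d" % (i, len(extra), n)
            )
    return n


train = ["Books\tbig bike magazine february 1973",
         "Jewelry & Watches\thandcrafted bead necklace"]
extra_1 = ["1 1:0.5 3:1", "2 2:1"]      # made-up feature values
extra_2 = ["1 1:1", "2 1:0.2 2:0.8"]    # made-up feature values
print(check_same_length(train, extra_1, extra_2))
print(parse_libsvm_line(extra_1[0]))
```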
   


Interactive Error Analysis 
==========================

We provide interactive tools to analyze prediction results. First, generate a
file of prediction results with the commands introduced in the section `Quick
Start'. Note that you CANNOT specify `-a 0' to `text-predict.py'; otherwise the
prediction result cannot be analyzed.

You then enter Python, import the module, load the prediction results, and
create an object of `Analyzer' by reading a model.

    $ python
    >>> from libshorttext.analyzer import *
    >>> predict_result = InstanceSet('predict_result')
    >>> analyzer = Analyzer('train_file.model')
    
You can select a subset of test data for analysis using the following options. 

    `wrong'
        Select wrongly predicted instances.
        
    `with_labels(labels, target)'
        If `target' is `true', then instances whose true labels are in the set
        `labels' are selected. If `target' is `predict', those predicted to be
        in `labels' are chosen. `target' can also be `both' or `or', which
        select the intersection and the union of `true' and `predict',
        respectively. The default value of `target' is `both'.
        
    `sort_by_dec'
        Sort instances by decision values.

    `subset(amount, method)'
        Get a specific amount of data by the method `top' or `random'. The 
        default value of `method' is `top'.

For example, among wrongly predicted instances with labels 'Books', 'Music',
'Art', and 'Baby', to get the 100 with the highest decision values, you can use

    >>> insts = predict_result.select(wrong, with_labels(['Books', 'Music', 'Art', 'Baby']), sort_by_dec, subset(100))

The following operation shows details of the selected instances.

    >>> analyzer.info(insts)
    Number of instances: 100
    Accuracy: 0.0 (0/100) 
    True labels: "Baby"  "Art"  "Books"  "Music"
    Predicted labels: "Baby"  "Music"  "Books"  "Art"
    Text source: /home/user/libshorttext-1.0/test_file
    Selectors: 
    -> Select wronly predicted instances
    -> labels: "Books", "Music", "Art", "Baby"
    -> Sort by maximum decision values.
    -> Select 100 instances in top.

The following command generates a confusion table on the selected instances:

    >>> analyzer.gen_confusion_table(insts)
             Art  Books  Music  Baby
    Art        0     15      4     5
    Books     10      0     17     3
    Music     10     21      0     3
    Baby       1      7      4     0
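
A confusion table is a count of (true label, predicted label) pairs. The
following sketch builds one in plain Python from made-up pairs. It puts true
labels on the rows, which is an assumption, and the function names are
hypothetical rather than part of the `Analyzer' interface.

```python
from collections import Counter


def confusion_table(pairs, labels):
    """Count (true label, predicted label) pairs into a nested dict."""
    counts = Counter(pairs)
    return {t: {p: counts[(t, p)] for p in labels} for t in labels}


def format_table(table, labels):
    """Render the table as text with one row per true label."""
    width = max(len(name) for name in labels) + 2
    lines = ["".ljust(width) + "".join(name.rjust(width) for name in labels)]
    for t in labels:
        cells = "".join(str(table[t][p]).rjust(width) for p in labels)
        lines.append(t.ljust(width) + cells)
    return "\n".join(lines)


labels = ["Art", "Books", "Music", "Baby"]
# Hypothetical (true, predicted) pairs; real ones would come from predict_result.
pairs = [("Art", "Books"), ("Art", "Books"), ("Books", "Music"), ("Baby", "Art")]
table = confusion_table(pairs, labels)
print(format_table(table, labels))
```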

To analyze a single short text, first load the texts by

    >>> insts.load_text()
    
Then you can print information for each single text in `insts'.

    >>> print(insts[61])
    text = avengers assemble 4 panini uk collector s edition nm 2012
    true label = Books
    predicted label = Music

You can print model weights corresponding to tokens of a short text. The
following operation prints weights of the three classes with the highest 
decision values. (To print weights in all classes, you can change 3 to 0.)

    >>> analyzer.analyze_single(insts[61], 3)
                        Music       Books    Antiques
    edition        -5.232e-02   8.869e-01  -1.303e-01
    s edition      -2.219e-02   1.527e-01  -4.077e-02
    nm              7.269e-01   6.048e-02  -1.495e-01
    collector      -5.253e-02  -5.208e-02   8.804e-02
    uk              9.466e-01  -2.089e-01   2.683e-02
    collector s    -3.174e-02   6.389e-02   9.963e-02
    4              -2.011e-01  -2.062e-01   1.526e-01
    2012           -1.173e-01   2.663e-01  -1.369e-01
    s              -5.142e-02   1.485e-01   1.757e-01
    **decval**      3.816e-01   3.705e-01   2.842e-02
    True label: Books
 
You can also analyze an arbitrary short text.

    >>> analyzer.analyze_single('beatles help longbox sealed usa 3 cd single', 3)
                      Music      Crafts      Travel
    sealed        4.828e-01   1.050e-03  -5.383e-02
    cd            2.872e+00  -1.032e-01  -1.723e-01
    cd single     1.663e-01  -5.181e-03  -6.558e-03
    single        4.375e-01  -6.953e-02  -9.960e-02
    usa           2.247e-01   3.530e-02   2.657e-02
    beatles       5.050e-01  -5.710e-02  -6.933e-02
    3 cd          1.320e-02  -3.837e-02  -7.793e-20
    3             3.057e-02   4.712e-02   1.402e-01
    **decval**    1.673e+00  -6.716e-02  -8.299e-02
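
The `**decval**' rows in the two tables above are consistent with the default
settings (binary features, normalization to unit length): each of the n listed
features has value 1/sqrt(n), so the decision value is the sum of the weights
divided by sqrt(n). The sketch below checks this for the Music column of both
tables. It assumes that only the listed tokens contribute, and it is a reading
of the printed numbers rather than LibShortText's own code.

```python
import math

# Music-column weights copied from the two tables above.
first = [-5.232e-02, -2.219e-02, 7.269e-01, -5.253e-02, 9.466e-01,
         -3.174e-02, -2.011e-01, -1.173e-01, -5.142e-02]
second = [4.828e-01, 2.872e+00, 1.663e-01, 4.375e-01, 2.247e-01,
          5.050e-01, 1.320e-02, 3.057e-02]


def decision_value(weights):
    """Sum the weights of binary features scaled to unit length."""
    # Each of the n binary features has value 1/sqrt(n) after normalization.
    return sum(weights) / math.sqrt(len(weights))


print("%.3e" % decision_value(first))   # compare with 3.816e-01 in the table
print("%.3e" % decision_value(second))  # compare with 1.673e+00 in the table
```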

Additional Information
======================

[1] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A Library 
for Short-text Classification.

[2] H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin. Product 
title classification versus text classification.

For any questions and comments, please email
[email protected]
