The libshorttext from phecy

LibShortText is an open source library for short-text classification
(http://www.csie.ntu.edu.tw/~cjlin/libshorttext). Please read the COPYRIGHT file
before using LibShortText.

To get started, please read the ``Quick Start'' section first.  

For developers, please see the documentation at
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/doc/ for integrating
LibShortText in your software.

Table of Contents
=================

- Installation and Data Format
- Quick Start
- Command-line Usage
- More Examples about Command-line Usage 
- Interactive Error Analysis 
- Additional Information


Installation and Data Format
============================

LibShortText requires UNIX systems with Python 2.6 or newer versions.  
The latest version (Python 2.7) is recommended for better efficiency. 

On Unix systems, type

    $ make

to install the package. For training and test data, every line in the file
contains a label and a short text in the following format:

    <label><TAB><text>

A TAB character is between <label> and <text>. Both the label and the text can 
contain space characters. Here are some examples.

    Jewelry & Watches	handcrafted two strand multi color bead necklace
    Books	big bike magazine february 1973

Two sample sets included in this package are `train_file' and `test_file'.
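
As a minimal sketch of the format above (this helper is not part of
LibShortText; `parse_line' is a hypothetical name), a line can be split at the
first TAB so that spaces inside the label and the text are preserved. It
assumes the label itself contains no TAB character.

```python
def parse_line(line):
    """Split one '<label><TAB><text>' line into (label, text)."""
    # Strip only the line ending; spaces inside the fields are meaningful.
    line = line.rstrip("\r\n")
    # Split at the first TAB only (assumes the label contains no TAB).
    label, sep, text = line.partition("\t")
    if not sep:
        raise ValueError("missing TAB between label and text: %r" % line)
    return label, text


if __name__ == "__main__":
    sample = "Jewelry & Watches\thandcrafted two strand multi color bead necklace\n"
    print(parse_line(sample))
```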

Quick Start
===========

You can run

    $ cd demo
    $ ./demo.sh

to run a demonstration.

LibShortText provides a simple training-prediction workflow:

short texts ============> model ==============> predictions
            text-train.py       text-predict.py

The command `text-train.py' trains a text set to obtain a model. For
example, the following command generates `train_file.model' for the
given `train_file'.

    $ python text-train.py train_file
    [output skipped]

`text-predict.py' predicts a test file using the trained model. For example, the
following command predicts `test_file' with `train_file.model' and stores the
results in `predict_result'.

    $ python text-predict.py test_file train_file.model predict_result
    Accuracy = 87.1800% (4359/5000)
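
The accuracy line above is the fraction of test instances whose predicted label
equals the true label. A minimal sketch of that computation follows;
`accuracy_line' is a hypothetical helper written for illustration, not a
LibShortText function.

```python
def accuracy_line(true_labels, predicted_labels):
    """Format accuracy in the style of the line printed by text-predict.py."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists differ in length")
    total = len(true_labels)
    # An instance counts as correct when both labels are identical strings.
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    percent = 100.0 * correct / total if total else 0.0
    return "Accuracy = %.4f%% (%d/%d)" % (percent, correct, total)
```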

Once predict_result is obtained, LibShortText provides several handy utilities 
to conduct error analysis in the Python interactive shell. Please see
the section `Interactive Error Analysis' for more details. 


Command-line Usage
==================
            
-`text-train.py' Usage

    `text-train.py' obtains a model by training either a short-text dataset
    or a LIBSVM-format data set generated by `text2svm.py'.

    Usage: text-train.py [options] training_file [model]
    
    options: 
        -P {0|1|2|3|4|5|6|7|converter_directory}
            Preprocessor options. The options include stopword removal,
            stemming, and bigram. (default 1)
            0   no stopword removal, no stemming, unigram
            1   no stopword removal, no stemming, bigram
            2   no stopword removal, stemming, unigram
            3   no stopword removal, stemming, bigram
            4   stopword removal, no stemming, unigram
            5   stopword removal, no stemming, bigram
            6   stopword removal, stemming, unigram
            7   stopword removal, stemming, bigram
            If a preprocessor directory is given instead, then it is assumed
            that the training data is already in LIBSVM format. The preprocessor
            will be included in the model for testing.
        -G {0|1}
            Grid search for the parameter C in linear classifiers. (default 0)
            0   disable grid search (faster)
            1   enable grid search (slightly better results)
        -F {0|1|2|3}
            Feature representation. (default 0)
            0   binary feature
            1   word count 
            2   term frequency
            3   TF-IDF (term frequency + IDF)
        -N {0|1}
            Instance-wise normalization before training/test.
            (default 1 to conduct normalization)
        -A extra_svm_file
            Append extra LIBSVM-format data. This option can be given multiple
            times if more than one extra LIBSVM-format data set needs to be appended.
        -L {0|1|2|3}
            Classifier. (default 0)
            0   support vector classification by Crammer and Singer
            1   L1-loss support vector classification
            2   L2-loss support vector classification
            3   logistic regression
        -f
            Overwrite the existing model file.
    Examples:
        text-train.py -L 3 -F 1 -N 1 raw_text_file model_file
        text-train.py -P text2svm_converter -L 1 converted_svm_file
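
The numeric values of `-P' in the table above follow a bit pattern: 4 turns on
stopword removal, 2 turns on stemming, and 1 turns on bigrams. The sketch below
decodes a value under that reading of the table; `decode_preprocessor_option'
is a hypothetical helper, not part of LibShortText.

```python
def decode_preprocessor_option(p):
    """Decode a -P value (0..7) into the three switches listed in the table."""
    if not 0 <= p <= 7:
        raise ValueError("-P must be between 0 and 7")
    return {
        "stopword_removal": bool(p & 4),  # values 4..7
        "stemming": bool(p & 2),          # values 2, 3, 6, 7
        "bigram": bool(p & 1),            # odd values
    }


if __name__ == "__main__":
    for p in range(8):
        print(p, decode_preprocessor_option(p))
```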

-`text-predict.py' Usage

    `text-predict.py' predicts labels for a test dataset with a trained model. 

    Usage: text-predict.py [options] test_file model output
    
    options:
        -f
            Overwrite the existing output file.
        -a {0|1}
            Output options. (default 1)
            0   Store only predicted labels. The information is NOT sufficient 
                for interactive analysis. Use this option if you would like to get 
                only accuracy.
            1   More information is stored. The output provides information for 
                interactive analysis, but the size of output can become much larger.
        -A extra_svm_file
            Append extra LIBSVM-format data. This option can be given multiple
            times if more than one extra LIBSVM-format data set needs to be appended.

-`text2svm.py' Usage

    `text2svm.py' generates a directory containing needed information for
    converting short texts to LIBSVM format. An output file in LIBSVM format is
    also generated.

    Usage: text2svm.py [options] text_src [output]
    
    options:
        -P {0|1|2|3|4|5|6|7}
            Preprocessor options. The options include stopword removal, 
            stemming, and bigram. (default 1)
            0   no stopword removal, no stemming, unigram
            1   no stopword removal, no stemming, bigram
            2   no stopword removal, stemming, unigram
            3   no stopword removal, stemming, bigram
            4   stopword removal, no stemming, unigram
            5   stopword removal, no stemming, bigram
            6   stopword removal, stemming, unigram
            7   stopword removal, stemming, bigram
    Default output will be a file "text_src.svm" and a directory
    "text_src.text_converter." If "output" is specified, the output will be
    "output" and "output.text_converter."


More Examples about Command-line Usage 
======================================

We use the following questions/answers to demonstrate some examples.

Q: Given many parameters provided by `text-train.py', how to choose the 
   parameters at the first trial? 
A: Although `text-train.py' has several parameters to tune, we carefully
   chose default parameters based on a study on short-text classification [2].
   Running `text-train.py' without parameters can deliver good
   classification accuracy in general. It is equivalent to the following 
   command, in which default parameters are explicitly specified.

   $ python text-train.py -P 1 -G 0 -F 0 -N 1 -L 0 train_file
 
   Meaning for each parameter:

   -P 1: no stemming, no stopword removal, bigram features
   -G 0: no LIBLINEAR parameter selection
   -F 0: binary feature representation
   -N 1: each instance is normalized to unit length
   -L 0: use Crammer and Singer's multi-class method. 
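
   To make the `-F' and `-N' settings concrete, here is a minimal sketch of the
   four feature representations followed by unit-length normalization. It is
   not LibShortText code: the function names are hypothetical, and the exact
   term-frequency and IDF formulas are assumptions (common textbook forms),
   since this README does not define them. Euclidean length is likewise assumed
   for the normalization.

```python
import math
from collections import Counter


def featurize(tokens, mode, doc_freq=None, num_docs=None):
    """Map a token list to {token: value} for -F style modes 0..3."""
    counts = Counter(tokens)
    total = float(sum(counts.values()))
    feats = {}
    for tok, c in counts.items():
        if mode == 0:      # binary feature
            value = 1.0
        elif mode == 1:    # word count
            value = float(c)
        elif mode == 2:    # term frequency (assumed: count / document length)
            value = c / total
        elif mode == 3:    # TF-IDF (assumed: tf * log(N / df))
            value = (c / total) * math.log(float(num_docs) / doc_freq[tok])
        else:
            raise ValueError("mode must be 0, 1, 2, or 3")
        feats[tok] = value
    return feats


def normalize(feats):
    """Scale an instance to unit Euclidean length, as -N 1 describes."""
    norm = math.sqrt(sum(v * v for v in feats.values()))
    if norm == 0.0:
        return dict(feats)
    return dict((tok, v / norm) for tok, v in feats.items())
```

   With binary features and normalization, an instance with four distinct
   tokens gets the value 0.5 for each token, because 1/sqrt(4) = 0.5.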

Q: How to select the parameter C in LIBLINEAR automatically?
A: By default, LIBLINEAR (and `text-train.py') sets the parameter C to 1. 
   You can automatically select the best parameter C by using `-G 1`. 

Q: How to generate different models using the same training data?
A: Internally, text-train.py converts data to LIBSVM format and applies 
   LIBLINEAR for training. To reuse the pre-processed data, LibShortText 
   provides another workflow:

short texts ==========> LIBSVM format data ============> model ==============> result
            text2svm.py                    text-train.py       text-predict.py

   The following command generates a LIBSVM-format file `train_file.svm' and a directory
   `train_file.text_converter' containing information for the conversion.

   $ python text2svm.py train_file 
   [`train_file.text_converter' and `train_file.svm' are generated.]

   We then generate two models using the same LIBSVM-format file.

   $ python text-train.py -P train_file.text_converter -L 3 train_file.svm lr.model
   [A logistic regression model, `lr.model', is generated.]

   $ python text-train.py -P train_file.text_converter -L 2 train_file.svm l2svm.model
   [An L2-loss linear SVM model, `l2svm.model', is generated.]

Q: How to overwrite existing models or prediction results?
A: If the specified model or output file exists, by default, neither `text-train.py'
   nor `text-predict.py' overwrites it. Specify `-f' to overwrite the existing
   model or prediction output.
   
   $ python text-train.py -f train_file
   $ python text-predict.py -f test_file train_file.model predict_result

Q: Why is the file of prediction results so large?
A: By default, some additional information for analysis is stored. If you 
   need to get only classification accuracy, you can specify `-a 0' to save disk 
   space. For example,

   $ python text-predict.py -a 0 test_file train_file.model predict_result

Q: If I am an experienced LIBLINEAR user, how should I specify options 
   for LIBLINEAR and `grid.py'?
A: For LIBLINEAR, you can easily pass LIBLINEAR parameters in a double quoted 
   string after `-L' with a special character `@'. For example, if you want to
   use L2-regularized Logistic Regression as the classifier, set the parameter
   C to 0.5, and append a bias term to each instance, you can type

   $ python text-train.py -L @"-s 3 -c 0.5 -B 1" train_file

   To show parameters provided by LIBLINEAR/grid, use

   $ python text-train.py -x liblinear
   $ python text-train.py -x grid

   For `grid.py', use `-G @"-log2c begin,end,step"' to specify the range of C.
   For example, the following command selects the best C among 
   [2^-2, 2^-1, 2^0, 2^1] in terms of cross validation rates.

   $ python text-train.py -G @"-log2c -2,1,1" train_file
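
   The candidate values of C implied by `-log2c begin,end,step' can be listed
   with a short sketch. `expand_log2c' is a hypothetical helper; treating `end'
   as inclusive is inferred from the example above, where `-2,1,1' yields four
   values.

```python
def expand_log2c(begin, end, step):
    """List the C values 2^begin, 2^(begin+step), ..., up to and including 2^end."""
    if step == 0:
        raise ValueError("step must be non-zero")
    values = []
    exponent = begin
    # Walk the exponent range in the direction given by the sign of step.
    while (step > 0 and exponent <= end) or (step < 0 and exponent >= end):
        values.append(2.0 ** exponent)
        exponent += step
    return values


if __name__ == "__main__":
    print(expand_log2c(-2, 1, 1))  # prints [0.25, 0.5, 1.0, 2.0]
```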

Q: I have more features for texts. How can I add them in LibShortText?
A: You can use the `-A' option in `text2svm.py', `text-train.py', and
   `text-predict.py' to append feature files. Note that you can use multiple
   feature files. If we have 20 features, and these features are included in
   two files, `train_feats1' and `train_feats2', then we can use these files in
   the training stage by

   $ python text-train.py -A train_feats1 -A train_feats2 train_file

   The features you use in the training stage should be identical to those in
   the predict stage. Assume that `test_feats1' and `test_feats2' are feature
   files corresponding to `train_feats1' and `train_feats2', respectively. To
   predict a test file you should use

   $ python text-predict.py -A test_feats1 -A test_feats2 test_file train_file.model predict_result

   The usage of analyzer is the same as before. The features will be
   represented in the following format.

   <feat_filename>:<feat_idx>

Q: I already have some LIBSVM-format features. How can I include these
   features when training the model?
A: You can use the -A option in the command line mode. For example, if you have
   two extra svm files `extra_train_1' and `extra_train_2' in LIBSVM-format, 
   then use:
   
   $ python text-train.py -A extra_train_1 -A extra_train_2 train_file
   
   Note that `train_file', `extra_train_1', and `extra_train_2' should 
   have the same number of instances. And then use the following command to 
   predict:

   $ python text-predict.py -A extra_test_1 -A extra_test_2 test_file train_file.model predict_result
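
   The requirement that the text file and every extra file have the same
   number of instances can be checked before training. The sketch below
   assumes one instance per line in every file; the helper names are
   hypothetical and not part of LibShortText.

```python
def count_instances(path):
    """Count lines in a file, assuming one instance per line."""
    with open(path) as f:
        return sum(1 for _ in f)


def find_size_mismatches(main_file, extra_files):
    """Return (path, count) for each extra file whose size differs from main_file."""
    expected = count_instances(main_file)
    mismatches = []
    for path in extra_files:
        n = count_instances(path)
        if n != expected:
            mismatches.append((path, n))
    return mismatches
```

   An empty result from `find_size_mismatches' means all files line up, for
   example find_size_mismatches('train_file', ['extra_train_1', 'extra_train_2']).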
   


Interactive Error Analysis 
==========================

We provide interactive tools to analyze prediction results. First, you generate a
file of prediction results by the commands introduced in section `Quick Start.'
Note that you CANNOT specify `-a 0' to `text-predict.py'; otherwise the prediction
result will not be analyzable.

You then enter Python, import the module, load the prediction results, and
create an object of `Analyzer' by reading a model.

    $ python
    >>> from libshorttext.analyzer import *
    >>> predict_result = InstanceSet('predict_result')
    >>> analyzer = Analyzer('train_file.model')
    
You can select a subset of test data for analysis using the following options. 

    `wrong'
        Select wrongly predicted instances.
        
    `with_labels(labels, target)'
        If `target' is `true', then instances with labels in the set `labels'
        are selected. If `target' is `predict', those predicted to be in
        `labels' are chosen. `target' can also be `both' or `or'. `both' and
        `or' find the intersection and the union of `true' and `predict',
        respectively. The default value of `target' is `both'.
        
    `sort_by_dec'
        Sort instances by decision values.

    `subset(amount, method)'
        Get a specific amount of data by the method `top' or `random'. The 
        default value of `method' is `top'.

For example, among wrongly predicted instances with labels 'Books', 'Music', 
'Art', and 'Baby', to get those having the highest 100 decision values, you can use

    >>> insts = predict_result.select(wrong, with_labels(['Books', 'Music', 'Art', 'Baby']), sort_by_dec, subset(100))
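
The selectors are applied in order, each narrowing or reordering the previous
result. The following plain-Python sketch mimics that pipeline on a list of
dictionaries. It is an illustration only, not the library's implementation:
the keys `true', `predicted', and `decval' and all function names are
hypothetical, sorting is assumed to be in descending order, and `both' is
taken to mean that the true and the predicted label are both in `labels'.

```python
def is_wrong(inst):
    """True when the predicted label differs from the true label."""
    return inst["true"] != inst["predicted"]


def has_labels(inst, labels, target="both"):
    """Label filter with the four target modes described above."""
    in_true = inst["true"] in labels
    in_pred = inst["predicted"] in labels
    if target == "true":
        return in_true
    if target == "predict":
        return in_pred
    if target == "both":
        return in_true and in_pred
    if target == "or":
        return in_true or in_pred
    raise ValueError("unknown target: %r" % target)


def select_top_wrong(insts, labels, amount, target="both"):
    """Chain: wrong -> with_labels -> sort by decision value -> top `amount`."""
    chosen = [i for i in insts if is_wrong(i) and has_labels(i, labels, target)]
    chosen.sort(key=lambda i: i["decval"], reverse=True)
    return chosen[:amount]
```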

You can run the following operations to know details of the selected instances.

    >>> analyzer.info(insts)
    Number of instances: 100
    Accuracy: 0.0 (0/100) 
    True labels: "Baby"  "Art"  "Books"  "Music"
    Predicted labels: "Baby"  "Music"  "Books"  "Art"
    Text source: /home/user/libshorttext-1.0/test_file
    Selectors: 
    -> Select wronly predicted instances
    -> labels: "Books", "Music", "Art", "Baby"
    -> Sort by maximum decision values.
    -> Select 100 instances in top.

The following command generates a confusion table on the selected instances:

    >>> analyzer.gen_confusion_table(insts)
             Art  Books  Music  Baby
    Art        0     15      4     5
    Books     10      0     17     3
    Music     10     21      0     3
    Baby       1      7      4     0
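
A confusion table counts how often each pair of labels occurs. The sketch below
builds one from (true, predicted) pairs. It assumes rows are true labels and
columns are predicted labels, an orientation this README does not state;
`confusion_table' is a hypothetical helper, not the analyzer's code.

```python
from collections import Counter


def confusion_table(pairs, labels):
    """Return a row-per-true-label matrix of counts for (true, predicted) pairs."""
    counts = Counter(pairs)
    # Missing pairs count as 0 because Counter returns 0 for unseen keys.
    return [[counts[(t, p)] for p in labels] for t in labels]


if __name__ == "__main__":
    pairs = [("Art", "Books"), ("Art", "Books"), ("Books", "Art")]
    for label, row in zip(["Art", "Books"], confusion_table(pairs, ["Art", "Books"])):
        print("%-6s %s" % (label, row))
```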

To analyze a single short text, you first load it by

    >>> insts.load_text()
    
Then you can print information for each single text in `insts'.

    >>> print(insts[61])
    text = avengers assemble 4 panini uk collector s edition nm 2012
    true label = Books
    predicted label = Music

You can print model weights corresponding to tokens of a short text. The
following operation prints weights of the three classes with the highest 
decision values. (To print weights in all classes, you can change 3 to 0.)

    >>> analyzer.analyze_single(insts[61], 3)
                        Music       Books    Antiques
    edition        -5.232e-02   8.869e-01  -1.303e-01
    s edition      -2.219e-02   1.527e-01  -4.077e-02
    nm              7.269e-01   6.048e-02  -1.495e-01
    collector      -5.253e-02  -5.208e-02   8.804e-02
    uk              9.466e-01  -2.089e-01   2.683e-02
    collector s    -3.174e-02   6.389e-02   9.963e-02
    4              -2.011e-01  -2.062e-01   1.526e-01
    2012           -1.173e-01   2.663e-01  -1.369e-01
    s              -5.142e-02   1.485e-01   1.757e-01
    **decval**      3.816e-01   3.705e-01   2.842e-02
    True label: Books
 
You can also analyze an arbitrary short text.

    >>> analyzer.analyze_single('beatles help longbox sealed usa 3 cd single', 3)
                      Music      Crafts      Travel
    sealed        4.828e-01   1.050e-03  -5.383e-02
    cd            2.872e+00  -1.032e-01  -1.723e-01
    cd single     1.663e-01  -5.181e-03  -6.558e-03
    single        4.375e-01  -6.953e-02  -9.960e-02
    usa           2.247e-01   3.530e-02   2.657e-02
    beatles       5.050e-01  -5.710e-02  -6.933e-02
    3 cd          1.320e-02  -3.837e-02  -7.793e-20
    3             3.057e-02   4.712e-02   1.402e-01
    **decval**    1.673e+00  -6.716e-02  -8.299e-02
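
The `**decval**' row is consistent with the per-token weights in the same
column. Assuming the default binary features (`-F 0'), unit-length
normalization (`-N 1'), and no bias term, each of the n listed features has the
value 1/sqrt(n), so the decision value is the sum of the weights divided by
sqrt(n). This relation is inferred from the numbers shown above and is not a
statement about the library's internals; `decision_value' is a hypothetical
helper.

```python
import math


def decision_value(weights):
    """Sum of weights times the binary feature value after unit-length scaling."""
    n = len(weights)
    if n == 0:
        return 0.0
    feature_value = 1.0 / math.sqrt(n)  # binary features, unit Euclidean length
    return feature_value * sum(weights)


if __name__ == "__main__":
    # Music column of the 'beatles help longbox ...' example above.
    music = [4.828e-01, 2.872e+00, 1.663e-01, 4.375e-01,
             2.247e-01, 5.050e-01, 1.320e-02, 3.057e-02]
    print("%.3e" % decision_value(music))  # prints 1.673e+00
```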

Additional Information
======================

[1] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A Library 
for Short-text Classification.

[2] H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin. Product 
title classification versus text classification.

For any questions and comments, please email
[email protected]