phecy / libshorttext
This project is forked from izimobile/libshorttext.
My fork of libshorttext
License: BSD 3-Clause "New" or "Revised" License
LibShortText is an open source library for short-text classification
(http://www.csie.ntu.edu.tw/~cjlin/libshorttext). Please read the COPYRIGHT
file before using LibShortText.

To get started, please read the ``Quick Start'' section first. For
developers, please check our document at
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/doc/ for integrating
LibShortText in your software.

Table of Contents
=================

- Installation and Data Format
- Quick Start
- Command-line Usage
- More Examples about Command-line Usage
- Interactive Error Analysis
- Additional Information

Installation and Data Format
============================

LibShortText requires UNIX systems with Python 2.6 or newer versions. The
latest version (Python 2.7) is recommended for better efficiency. On Unix
systems, type

$ make

to install the package.

For training and test data, every line in the file contains a label and a
short text in the following format:

<label><TAB><text>

A TAB character separates <label> and <text>. Both the label and the text
can contain space characters. Here are some examples.

Jewelry & Watches	handcrafted two strand multi color bead necklace
Books	big bike magazine february 1973

Two sample sets included in this package are `train_file' and `test_file'.

Quick Start
===========

You can run

$ cd demo
$ ./demo.sh

to run a demonstration.

LibShortText provides a simple training-prediction workflow:

short texts =============> model ===============> predictions
            text-train.py        text-predict.py

The command `text-train.py' trains a text set to obtain a model. For
example, the following command generates `train_file.model' for the given
`train_file'.

$ python text-train.py train_file
[output skipped]

`text-predict.py' predicts a test file using the trained model. For
example, the following command predicts `test_file' with
`train_file.model' and stores the results in `predict_result'.
$ python text-predict.py test_file train_file.model predict_result
Accuracy = 87.1800% (4359/5000)

Once `predict_result' is obtained, LibShortText provides several handy
utilities to conduct error analysis in the Python interactive shell.
Please see the section `Interactive Error Analysis' for more details.

Command-line Usage
==================

- `text-train.py' Usage

`text-train.py' obtains a model by training either a short-text dataset
or a LIBSVM-format data set generated by `text2svm.py'.

Usage: text-train.py [options] training_file [model]

options:
    -P {0|1|2|3|4|5|6|7|converter_directory}
        Preprocessor options. The options include stopword removal,
        stemming, and bigram. (default 1)
        0   no stopword removal, no stemming, unigram
        1   no stopword removal, no stemming, bigram
        2   no stopword removal, stemming, unigram
        3   no stopword removal, stemming, bigram
        4   stopword removal, no stemming, unigram
        5   stopword removal, no stemming, bigram
        6   stopword removal, stemming, unigram
        7   stopword removal, stemming, bigram
        If a preprocessor directory is given instead, then it is assumed
        that the training data is already in LIBSVM format. The
        preprocessor will be included in the model for testing.
    -G {0|1}
        Grid search for the parameter C in linear classifiers. (default 0)
        0   disable grid search (faster)
        1   enable grid search (slightly better results)
    -F {0|1|2|3}
        Feature representation. (default 0)
        0   binary feature
        1   word count
        2   term frequency
        3   TF-IDF (term frequency + IDF)
    -N {0|1}
        Instance-wise normalization before training/test.
        (default 1 to conduct normalization)
    -A extra_svm_file
        Append extra LIBSVM-format data. This parameter can be applied
        many times if more than one extra svm-format data set needs to
        be appended.
    -L {0|1|2|3}
        Classifier. (default 0)
        0   support vector classification by Crammer and Singer
        1   L1-loss support vector classification
        2   L2-loss support vector classification
        3   logistic regression
    -f  Overwrite the existing model file.
Examples:
    text-train.py -L 3 -F 1 -N 1 raw_text_file model_file
    text-train.py -P text2svm_converter -L 1 converted_svm_file

- `text-predict.py' Usage

`text-predict.py' predicts labels for a test dataset with a trained model.

Usage: text-predict.py [options] test_file model output

options:
    -f  Overwrite the existing output file.
    -a {0|1}
        Output options. (default 1)
        0   Store only predicted labels. The information is NOT
            sufficient for interactive analysis. Use this option if you
            would like to get only the accuracy.
        1   More information is stored. The output provides information
            for interactive analysis, but the size of the output can
            become much larger.
    -A extra_svm_file
        Append extra LIBSVM-format data. This parameter can be applied
        many times if more than one extra svm-format data set needs to
        be appended.

- `text2svm.py' Usage

`text2svm.py' generates a directory containing the information needed to
convert short texts to LIBSVM format. An output file in LIBSVM format is
also generated.

Usage: text2svm.py [options] text_src [output]

options:
    -P {0|1|2|3|4|5|6|7}
        Preprocessor options. The options include stopword removal,
        stemming, and bigram. (default 1)
        0   no stopword removal, no stemming, unigram
        1   no stopword removal, no stemming, bigram
        2   no stopword removal, stemming, unigram
        3   no stopword removal, stemming, bigram
        4   stopword removal, no stemming, unigram
        5   stopword removal, no stemming, bigram
        6   stopword removal, stemming, unigram
        7   stopword removal, stemming, bigram

By default, the output will be a file "text_src.svm" and a directory
"text_src.text_converter". If "output" is specified, the output will be
"output" and "output.text_converter".

More Examples about Command-line Usage
======================================

We use the following questions/answers to demonstrate some examples.

Q: Given the many parameters provided by `text-train.py', how should one
   choose parameters for the first trial?
A: Although `text-train.py' has several parameters to tune, we carefully
   chose the default parameters based on a study on short-text
   classification [2]. Running `text-train.py' without parameters
   generally delivers good classification accuracy. It is equivalent to
   the following command, in which the default parameters are explicitly
   specified.

   $ python text-train.py -P 1 -G 0 -F 0 -N 1 -L 0 train_file

   Meaning of each parameter:
   -P 1: no stemming, no stopword removal, bigram features
   -G 0: no LIBLINEAR parameter selection
   -F 0: binary feature representation
   -N 1: each instance is normalized to unit length
   -L 0: use Crammer and Singer's multi-class method

Q: How to select the parameter C in LIBLINEAR automatically?

A: By default, LIBLINEAR (and `text-train.py') sets the parameter C to 1.
   You can automatically select the best parameter C by using `-G 1'.

Q: How to generate different models using the same training data?

A: Internally, `text-train.py' converts data to LIBSVM format and applies
   LIBLINEAR for training. To reuse the pre-processed data, LibShortText
   provides another workflow:

   short texts ===========> LIBSVM-format data ===========> model ===========> result
               text2svm.py                    text-train.py      text-predict.py

   The following command generates a LIBSVM-format file `train_file.svm'
   and a directory `train_file.text_converter' containing information for
   the conversion.

   $ python text2svm.py train_file
   [`train_file.text_converter' and `train_file.svm' are generated.]

   We then generate two models using the same LIBSVM-format file.

   $ python text-train.py -P train_file.text_converter -L 3 train_file.svm lr.model
   [A logistic regression model, `lr.model', is generated.]

   $ python text-train.py -P train_file.text_converter -L 2 train_file.svm l2svm.model
   [An L2-loss linear SVM model, `l2svm.model', is generated.]

Q: How to overwrite existing models or prediction results?
A: If the specified model or output file exists, by default neither
   `text-train.py' nor `text-predict.py' overwrites it. You can force new
   models/prediction outputs to be generated with `-f'.

   $ python text-train.py -f train_file
   $ python text-predict.py -f test_file train_file.model predict_result

Q: Why is the file of prediction results so large?

A: By default, some additional information for analysis is stored. If you
   only need the classification accuracy, you can specify `-a 0' to save
   disk space. For example,

   $ python text-predict.py -a 0 test_file train_file.model predict_result

Q: If I am an experienced LIBLINEAR user, how should I specify options
   for LIBLINEAR and `grid.py'?

A: For LIBLINEAR, you can pass LIBLINEAR parameters in a double-quoted
   string after `-L', prefixed with the special character `@'. For
   example, if you want to use L2-regularized logistic regression as the
   classifier, set the parameter C to 0.5, and append a bias term to each
   instance, you can type

   $ python text-train.py -L @"-s 3 -c 0.5 -B 1" train_file

   To show the parameters provided by LIBLINEAR/grid, use

   $ python text-train.py -x liblinear
   $ python text-train.py -x grid

   For `grid.py', to specify the range of C, use
   `-G @"-log2c begin,end,step"'. For example, the following command
   selects the best C among [2^-2, 2^-1, 2^0, 2^1] in terms of cross
   validation rates.

   $ python text-train.py -G @"-log2c -2,1,1" train_file

Q: I have more features for the texts. How can I add them in
   LibShortText?

A: You can use the `-A' option in `text2svm.py', `text-train.py', and
   `text-predict.py' to append feature files. Note that you can use
   multiple feature files. If we have 20 features, and these features are
   included in two files, `train_feats1' and `train_feats2', then we can
   use these files in the training stage by

   $ python text-train.py -A train_feats1 -A train_feats2 train_file

   The features you use in the training stage should be identical to
   those used in the prediction stage.
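   An extra feature file such as `train_feats1' is a plain LIBSVM-format
   file: one line per training instance (in the same order as
   `train_file'), each line listing sparse index:value pairs. A minimal
   sketch of writing such a file in Python (the feature values below are
   made up for illustration, not taken from any real data set):

   ```python
   def to_libsvm_line(features):
       """Render a dict {feature_index: value} as one sparse LIBSVM line.

       LIBSVM format lists index:value pairs with indices in ascending
       order; zero-valued features are simply omitted.
       """
       return " ".join("%d:%g" % (i, v) for i, v in sorted(features.items()))

   # One dict per training instance; indices conventionally start at 1.
   extra_features = [
       {1: 0.5, 3: 2.0},   # features of instance 1
       {2: 1.0},           # features of instance 2
       {1: 0.1, 2: 0.3},   # features of instance 3
   ]

   with open("train_feats1", "w") as f:
       for feats in extra_features:
           f.write(to_libsvm_line(feats) + "\n")
   ```

   The number of lines in the generated file must equal the number of
   instances in `train_file', or the `-A' option will not line up the
   extra features with the right instances.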
   Assume that `test_feats1' and `test_feats2' are feature files
   corresponding to `train_feats1' and `train_feats2', respectively. To
   predict a test file, you should use

   $ python text-predict.py -A test_feats1 -A test_feats2 test_file train_file.model predict_result

   The usage of the analyzer is the same as before. The features will be
   represented in the following format.

   <feat_filename>:<feat_idx>

Q: I already have some LIBSVM-format features. How can I include these
   features when training the model?

A: You can use the `-A' option in the command-line mode. For example, if
   you have two extra LIBSVM-format files `extra_train_1' and
   `extra_train_2', then use:

   $ python text-train.py train_file -A extra_train_1 -A extra_train_2

   Note that `train_file', `extra_train_1', and `extra_train_2' should
   have the same number of instances. Then use the following command to
   predict:

   $ python text-predict.py test_file -A extra_test_1 -A extra_test_2 train_file.model predict_result

Interactive Error Analysis
==========================

We provide interactive tools to analyze prediction results. First,
generate a file of prediction results with the commands introduced in the
section `Quick Start.' Note that you CANNOT specify `-a 0' to
`text-predict.py', or the prediction result will not be analyzable.

You then enter Python, import the module, load the prediction results,
and create an `Analyzer' object by reading a model.

$ python
>>> from libshorttext.analyzer import *
>>> predict_result = InstanceSet('predict_result')
>>> analyzer = Analyzer('train_file.model')

You can select a subset of the test data for analysis using the following
options.

`wrong'
    Select wrongly predicted instances.
`with_labels(labels, target)'
    If `target' is `true', then instances with labels in the set
    `labels' are selected. If `target' is `predict', those predicted to
    be in `labels' are chosen. `target' can also be `both' or `or'.
    `both' and `or' find the intersection and the union of `true' and
    `predict', respectively. The default value of `target' is `both'.
`sort_by_dec'
    Sort instances by decision values.
`subset(amount, method)'
    Get a specific amount of data by the method `top' or `random'. The
    default value of `method' is `top'.

For example, among wrongly predicted instances with labels 'Books',
'Music', 'Art', and 'Baby', to get those having the highest 100 decision
values, you can use

>>> insts = predict_result.select(wrong, with_labels(['Books', 'Music', 'Art', 'Baby']), sort_by_dec, subset(100))

You can run the following operations to see the details of the selected
instances.

>>> analyzer.info(insts)
Number of instances: 100
Accuracy: 0.0 (0/100)
True labels: "Baby" "Art" "Books" "Music"
Predicted labels: "Baby" "Music" "Books" "Art"
Text source: /home/user/libshorttext-1.0/test_file
Selectors:
-> Select wrongly predicted instances
-> labels: "Books", "Music", "Art", "Baby"
-> Sort by maximum decision values.
-> Select 100 instances in top.

The following command generates a confusion table on the selected
instances:

>>> analyzer.gen_confusion_table(insts)
       Art  Books  Music  Baby
Art      0     15      4     5
Books   10      0     17     3
Music   10     21      0     3
Baby     1      7      4     0

To analyze a single short text, first load the texts by

>>> insts.load_text()

Then you can print the information for each single text in `insts'.

>>> print(insts[61])
text = avengers assemble 4 panini uk collector s edition nm 2012
true label = Books
predicted label = Music

You can print the model weights corresponding to the tokens of a short
text. The following operation prints the weights of the three classes
with the highest decision values. (To print the weights of all classes,
you can change 3 to 0.)
>>> analyzer.analyze_single(insts[61], 3)
                  Music      Books   Antiques
edition      -5.232e-02  8.869e-01 -1.303e-01
s edition    -2.219e-02  1.527e-01 -4.077e-02
nm            7.269e-01  6.048e-02 -1.495e-01
collector    -5.253e-02 -5.208e-02  8.804e-02
uk            9.466e-01 -2.089e-01  2.683e-02
collector s  -3.174e-02  6.389e-02  9.963e-02
4            -2.011e-01 -2.062e-01  1.526e-01
2012         -1.173e-01  2.663e-01 -1.369e-01
s            -5.142e-02  1.485e-01  1.757e-01
**decval**    3.816e-01  3.705e-01  2.842e-02
True label: Books

You can also analyze an arbitrary short text.

>>> analyzer.analyze_single('beatles help longbox sealed usa 3 cd single', 3)
                 Music     Crafts     Travel
sealed       4.828e-01  1.050e-03 -5.383e-02
cd           2.872e+00 -1.032e-01 -1.723e-01
cd single    1.663e-01 -5.181e-03 -6.558e-03
single       4.375e-01 -6.953e-02 -9.960e-02
usa          2.247e-01  3.530e-02  2.657e-02
beatles      5.050e-01 -5.710e-02 -6.933e-02
3 cd         1.320e-02 -3.837e-02 -7.793e-20
3            3.057e-02  4.712e-02  1.402e-01
**decval**   1.673e+00 -6.716e-02 -8.299e-02

Additional Information
======================

[1] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A
    Library for Short-text Classification.

[2] H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin.
    Product title classification versus text classification.

For any questions and comments, please email [email protected]
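As a supplement, the confusion table shown in the `Interactive Error
Analysis' section can also be computed outside LibShortText from any list
of (true label, predicted label) pairs. A minimal standalone sketch in
plain Python (the pairs below are made up for illustration; this is not
LibShortText's own implementation):

```python
from collections import Counter

def confusion_table(pairs):
    """Count (true, predicted) label pairs into a nested dict:
    table[true_label][predicted_label] -> number of instances."""
    counts = Counter(pairs)
    labels = sorted({label for pair in pairs for label in pair})
    return {t: {p: counts.get((t, p), 0) for p in labels} for t in labels}

# Hypothetical prediction results: (true label, predicted label) pairs.
pairs = [("Books", "Books"), ("Books", "Music"),
         ("Music", "Music"), ("Music", "Books"), ("Music", "Music")]

table = confusion_table(pairs)
print(table["Music"]["Books"])  # Music instances predicted as Books
```

With real data, the pairs could be read from a prediction output and the
original test file; the diagonal entries table[l][l] count the correctly
classified instances of label l.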