

Chinese Text Classification

  • Author: XJay18
  • Last Updated Time: July, 2020

Overview

This repo was originally an assignment for my NLP course. The objective of the assignment is to perform text classification on a Chinese text dataset.

A typical sample in the dataset has the form data_id\tlabel\ttext. However, you can easily modify the code to suit your own dataset format, for example data_id\ttext\tlabel or label|text. Search for # Specify your data format here. # in this repo to find the places to customize.
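As a rough illustration of what such a customization involves, here is a minimal sketch of a line parser with a configurable delimiter and column order. The names Sample and parse_samples are hypothetical and do not come from this repo; the actual loading code may look different.

```python
from typing import Iterable, Iterator, NamedTuple, Sequence


class Sample(NamedTuple):
    data_id: str
    label: str
    text: str


def parse_samples(
    lines: Iterable[str],
    delimiter: str = "\t",
    columns: Sequence[str] = ("data_id", "label", "text"),
    skip_first: bool = False,
) -> Iterator[Sample]:
    """Yield Sample records from delimited lines.

    `columns` names the fields in the order they appear on each line.
    A missing data_id column is filled with an empty string.
    """
    for index, raw in enumerate(lines):
        if skip_first and index == 0:
            continue  # header row
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # Split at most len(columns) - 1 times, so a delimiter that occurs
        # inside the last field is kept as part of that field.
        parts = line.split(delimiter, len(columns) - 1)
        if len(parts) != len(columns):
            raise ValueError(
                f"line {index + 1}: expected {len(columns)} fields, got {len(parts)}"
            )
        fields = dict(zip(columns, parts))
        yield Sample(fields.get("data_id", ""), fields["label"], fields["text"])
```

For a label|text file with a header row, the call would be parse_samples(lines, delimiter="|", columns=("label", "text"), skip_first=True).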

Best F1-Score (Example on a private dataset)

The best performance on the val dataset is achieved with BERT.

Below are the experiments and their corresponding performance. The best results are in the BERT row.

Model          Precision  Recall  F1     Config
MultinomialNB  81.06      80.05   79.38  alpha=0.0001
SVM            82.94      81.60   80.94  rbf kernel, C=1.8
AdaBoostDT     80.50      79.85   79.29  n_estimators=1000
BERT           94.55      94.50   94.48  click me
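The table does not say how the scores were averaged across classes. Since the F1 of a single precision/recall pair always lies between the two, rows where F1 is below both (e.g. MultinomialNB) suggest per-class averaging. The sketch below computes macro-averaged scores under that assumption; macro_prf is a hypothetical helper, not a function from this repo.

```python
from typing import Hashable, List, Sequence, Tuple


def macro_prf(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float, float]:
    """Return macro-averaged (precision, recall, F1).

    Each metric is computed per class and then averaged with equal weight
    over every class that appears in y_true or y_pred.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred), key=str)
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels (not from the private dataset): 8 samples of class 0, 2 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
precision, recall, f1 = macro_prf(y_true, y_pred)
```

On this toy input the macro F1 comes out lower than both the macro precision and the macro recall, the same pattern seen in the table.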

P-R Curve Comparison (Example on a private validation dataset)

[Figure: P-R curve comparison]


How to run it?

Requirements

Please refer to this file for the dependencies of this project. Some of the important packages I use in this project are listed below:

Prepare dataset for testing

  • Please note that you need to replace ./path/to/data in the code with appropriate paths.
  • It is recommended that you store your dataset in a subdirectory under data/.

Run traditional machine learning algorithm on the dataset

  1. To use a traditional machine learning algorithm, run the following command:

    python ml.py -v

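    By default, this uses MultinomialNB for classification. Note that the usage text below lists -d DATA as a required argument, so you may also need to pass the data directory.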

    The full settings of the ml.py script are as follows (you can also run python ml.py --help to see this information):

    usage: ml.py [-h] -d DATA [-c CONFIG] [-r RECORD] [-v] [-i]
    
    Run final result of text classification.
    
    optional arguments:
    -h, --help            show this help message and exit
    -d DATA, --data DATA  Specify the directory for 'train.bunch', 'val.bunch'
                            data files.
    -c CONFIG, --config CONFIG
                            Specify the config file to run the process.
    -r RECORD, --record RECORD
                            Specify the directory for records.
    -v, --verbose         Whether to show some of the estimations for val
                            samples.
    -i, --image           Whether to show the P-R Curve of this model.
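For readers who want to adapt the command line, the help text above could be produced by an argparse declaration along the following lines. This is a sketch reconstructed from the usage text only; the real ml.py may differ, and in particular its default values for --config and --record are not shown in the help text, so they are left unset here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the usage text of ml.py (reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="ml.py",
        description="Run final result of text classification.",
    )
    # -d appears without brackets in the usage line, i.e. it is required.
    parser.add_argument(
        "-d", "--data", required=True,
        help="Specify the directory for 'train.bunch', 'val.bunch' data files.",
    )
    # Defaults for -c and -r are unknown; None is an assumption.
    parser.add_argument(
        "-c", "--config",
        help="Specify the config file to run the process.",
    )
    parser.add_argument(
        "-r", "--record",
        help="Specify the directory for records.",
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="Whether to show some of the estimations for val samples.",
    )
    parser.add_argument(
        "-i", "--image", action="store_true",
        help="Whether to show the P-R Curve of this model.",
    )
    return parser


# "data/demo" is a placeholder directory, not a path from this repo.
args = build_parser().parse_args(["-d", "data/demo", "-v"])
```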
    
  2. Config file

    Note that this section covers the config file for the ml.py script only.

    The structure of the config file is:

    model:
        name: MODEL_NAME
        other_arguments: VALUE

    I have provided three samples for running MultinomialNB, SVM and AdaBoostDecisionTree, which you can use directly. Please refer to the sklearn documentation for the arguments each model accepts.

    model:
        name: AdaBoostDT
        n_estimators: 1000

    model:
        name: MultinomialNB
        alpha: 0.0001

    model:
        name: SVM
        C: 1.8
        kernel: rbf

    The project provides these models for this task:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import SGDClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    
    ML_MODELS = {
        "MultinomialNB": MultinomialNB,
        "RandomForest": RandomForestClassifier,
        "AdaBoostDT": AdaBoostClassifier,
        "GradientBoostDT": GradientBoostingClassifier,
        "SGD": SGDClassifier,
        "MLP": MLPClassifier,
        "SVM": SVC
    }
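One plausible way to connect a config file to such a dictionary is to look up the class by name and pass the remaining keys as keyword arguments. The sketch below shows that pattern; build_model and DummySVM are hypothetical names, and the stand-in class is used only so the example runs without scikit-learn. How ml.py actually does this is not shown in this README.

```python
from typing import Any, Callable, Dict, Mapping


def build_model(
    config: Mapping[str, Any],
    registry: Mapping[str, Callable[..., Any]],
) -> Any:
    """Instantiate the class named by config['model']['name'].

    All other keys under 'model' are forwarded as keyword arguments.
    """
    params: Dict[str, Any] = dict(config["model"])
    if "name" not in params:
        raise KeyError("config['model'] must contain a 'name' entry")
    name = params.pop("name")
    if name not in registry:
        known = ", ".join(sorted(registry))
        raise KeyError(f"unknown model {name!r}; expected one of: {known}")
    return registry[name](**params)


class DummySVM:
    """Stand-in for sklearn.svm.SVC so this sketch has no dependencies."""

    def __init__(self, C: float = 1.0, kernel: str = "rbf") -> None:
        self.C = C
        self.kernel = kernel


# Parsed form of the SVM sample config shown earlier.
config = {"model": {"name": "SVM", "C": 1.8, "kernel": "rbf"}}
model = build_model(config, {"SVM": DummySVM})
```

With scikit-learn installed, passing the ML_MODELS dictionary as the registry would build the real estimator in the same way.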
  3. Results

    After running the command above, you should find all the results under the runs/ directory, like this:

    ChineseTextClassification/
        - .../
        - runs/
            - MultinomialNB/
                - YYYY-MM-DD...HH.MM.SS/
                    - P-R.png
                    - pr.values
                    - records.txt
                    - results.txt
            - AdaBoostDT/
                - YYYY-MM-DD...HH.MM.SS/
                    - P-R.png
                    - pr.values
                    - records.txt
                    - results.txt
            - .../
    
    • YYYY-MM-DD...HH.MM.SS (or a similar string) is the experiment id that identifies an experiment.
    • P-R.png is the P-R Curve for that particular experiment.
    • pr.values stores the [precision, recall] value pairs so the P-R Curve can be re-plotted later. These values are obtained by calling sklearn.metrics.precision_recall_curve; refer to its documentation for how to use this file.
    • records.txt stores the standard output produced when running the result.py script.
    • results.txt stores the predictions in the format data_id|predicted_label|text
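To post-process the predictions, results.txt can be read back using the data_id|predicted_label|text format described above. The following is a minimal sketch; parse_results and label_counts are hypothetical helpers, and it assumes one record per line.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Record = Tuple[str, str, str]  # (data_id, predicted_label, text)


def parse_results(lines: Iterable[str]) -> List[Record]:
    """Parse lines of the form data_id|predicted_label|text."""
    records: List[Record] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # Split only twice so that a '|' inside the text is preserved.
        parts = line.split("|", 2)
        if len(parts) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(parts)}")
        data_id, label, text = parts
        records.append((data_id, label, text))
    return records


def label_counts(records: Iterable[Record]) -> Counter:
    """Count how many samples were assigned to each predicted label."""
    return Counter(label for _, label, _ in records)


# Made-up lines in the documented format, for illustration only.
sample_lines = ["17|1|这部电影很好看\n", "18|0|质量太差|不推荐\n", "19|1|非常满意\n"]
records = parse_results(sample_lines)
counts = label_counts(records)
```

In practice the lines would come from an open file handle; the file encoding is not stated in this README, so UTF-8 would be an assumption.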

Run neural network (typically, BERT) on the dataset

Thanks to the DL community, advanced DL models are now easy to use and have pushed forward performance on various tasks. Therefore, I also tried the well-known BERT model on this task.

  1. To test the performance with BERT, you can directly run the following command:

    python nn.py -s test

    By default, this will use this file as the protocol.

    Also, here are the full settings:

    usage: nn.py [-h] -s STAGE [-p PROTO] [-r RECORD] [-i]
    
    Run neural network for text classification.
    
    optional arguments:
    -h, --help            show this help message and exit
    -s STAGE, --stage STAGE
                            Specify the stage for the process, either 'train' or
                            'test'.
    -p PROTO, --proto PROTO
                            Specify the protocol file to run the process.
    -r RECORD, --record RECORD
                            Specify the directory for records.
    -i, --image           Whether to show the P-R Curve of this model.
    
  2. Protocol file

    Note that this section covers the protocol file for the nn.py script only.

    An example protocol looks like this:

    model:
        name: BERT
        num_classes: 2
    data:
        pad: 32
        train_path: ./path/to/data
        val_path: ./path/to/data
        train_batch_size: 100
        val_batch_size: 200
        skip_first: true  # Whether to skip the first line in the dataset
        delimiter: "\t"  # The delimiter in the dataset, like '\t' or '|'
    optimizer:
        lr: 0.00003
        weight_decay: 0.0005
    scheduler:
        name: MultiStepLR
        milestones: [8]
        gamma: 0.1
    num_steps: 1000
    device: 0
    log_steps: 100
    val_steps: 200
    id: YYYY-MM-DD...HH.MM.SS
    ckpt_fold: runs

    Since you will probably only evaluate the model on this task rather than train it on the dataset, you just need to set id to select an existing model. By default, I have put the best model's id in that field, so you can safely skip this section.
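For those who do train, the scheduler entry in the example protocol can be read as follows, assuming it maps to PyTorch's MultiStepLR: the learning rate is multiplied by gamma each time a milestone is reached. The sketch below reproduces that rule in plain Python. The function name multistep_lr is hypothetical, and whether nn.py advances the scheduler per epoch or per training step is not stated here.

```python
from bisect import bisect_right
from typing import Sequence


def multistep_lr(
    base_lr: float, milestones: Sequence[int], gamma: float, epoch: int
) -> float:
    """Learning rate in effect after `epoch` scheduler steps.

    The rate is base_lr * gamma ** k, where k is the number of
    milestones that are less than or equal to `epoch`.
    """
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


# Values taken from the example protocol: lr=0.00003, milestones=[8], gamma=0.1.
before = multistep_lr(0.00003, [8], 0.1, epoch=7)
after = multistep_lr(0.00003, [8], 0.1, epoch=8)
```

With the example values, the rate stays at the initial lr until the scheduler has been stepped 8 times and is one tenth of it afterwards.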

  3. Results

    After running the command above, you should find all the results under the runs/ directory, like this:

    ChineseTextClassification/
        - .../
        - runs/
            - BERT/
                - YYYY-MM-DD...HH.MM.SS/
                    - best_model.bin
                    - latest_model.bin
                    - protocol.yml
                    - P-R.png *
                    - pr.values *
                    - records.txt
                    - results.txt *
                    - tests.txt *
                - YYYY-MM-DD...HH.MM.SS/
                    - ...
    
    • Files marked with * are generated during testing (i.e., after running python nn.py -s test).
    • Therefore, check the files marked with * to see the performance on the val dataset.

Epilogue

If you are confused about the project's settings or have any questions, feel free to open an issue or email me: [email protected].
