Comment removal

This repo explores the comment removal prediction task using a sentence embedding mechanism followed by a classifier of choice.

For example, encoding the Reddit comments with LASER and feeding the embeddings to different classifiers (MLP, SVM, or random forest).

The focus is on assessing how different embedding choices affect classification; little effort goes into finding the best classifier or fine-tuning its parameters.

Additionally, a Transformer language model with a classifier head is also explored.
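As a minimal sketch of this two-stage approach, with random vectors standing in for the actual LASER or LSI embeddings and scikit-learn's SVC as the classifier of choice:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in data: in the repo, each comment is encoded into a 1024-d
# sentence embedding by LASER (or LSI); here we use random vectors
# purely to illustrate the pipeline shape.
rng = np.random.RandomState(0)
X = rng.randn(60, 1024)       # one embedding per comment
y = rng.randint(0, 2, 60)     # 1 = removed by moderators, 0 = kept

# Any off-the-shelf classifier can consume the embeddings directly.
clf = SVC().fit(X[:50], y[:50])
preds = clf.predict(X[50:])
print(preds.shape)  # (10,)
```

Swapping the classifier (MLP, random forest) only changes the `clf` line; the embedding stage is untouched, which is what makes the embedding comparison clean.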

Dataset

We use the Reddit comment removal dataset.

Content

The dataset is a CSV of about 30k Reddit comments made in /r/science between January 2017 and June 2018. 10k of the comments were removed by moderators; the original text of these comments was recovered using the pushshift.io API. Each comment is a top-level reply to the parent post and has a comment score of 14 or higher.

The dataset comes from Google BigQuery, Reddit, and Pushshift.io.

Exploration

In scripts/explore_dataset.ipynb there's an overview of the dataset: class counts, input lengths, and a sentiment analysis of a small random sample grouped by label.

The codebase tries to make few assumptions and avoids hand-crafted features, but the exploration is helpful for understanding the nature of the data at hand.
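A rough sketch of that kind of overview, using a tiny inline frame; the column names here are assumptions and the real CSV schema may differ:

```python
import pandas as pd

# Toy stand-in for data/reddit_train.csv; the real schema may differ.
df = pd.DataFrame({
    "body": ["Great study!", "spam spam spam", "Interesting result"],
    "REMOVED": [0, 1, 0],  # hypothetical label column: 1 = removed
})

class_counts = df["REMOVED"].value_counts()          # class balance
df["n_tokens"] = df["body"].str.split().str.len()    # input lengths
print(class_counts.to_dict())
print(df.groupby("REMOVED")["n_tokens"].mean().to_dict())
```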

Structure

# tree -L 3 --dirsfirst -I "*.pyc|*cache*|*init*|*.npy|*.png|*.pkl"
.
├── comment_removal
│   ├── utils
│   │   ├── batchers.py
│   │   ├── loaders.py
│   │   ├── metrics.py
│   │   ├── mutils.py
│   │   ├── plotting.py
│   │   └── text_processing.py
│   ├── encoders.py
│   ├── laser_classifier.py
│   └── transformer_classifier.py
├── data
│   ├── reddit_test.csv
│   └── reddit_train.csv
├── external        # external model checkpoints and modified model definitions
│   ├── models
│   │   ├── LASER           # LASER encoder checkpoints
│   │   ├── transformer     # openAI Transformer checkpoints
│   │   ├── laser.py        # extended LASER model definition
│   │   └── transformer.py  # extended Transformer model definition
│   └── pyBPE          # BPE encoding codebase dependency for LASER encoding
├── results
│   └── test_predictions.csv
├── scripts
│   ├── init_LASER.sh           # Download pyBPE and LASER weights
│   ├── init_transformer.sh     # Download Transformer weights
│   └── explore_dataset.ipynb
├── tests
│   └── test_embeddings.py
├── workdir
├── README.md
├── requirements.txt
└── setup.cfg

How To

Installation

LASER encoder

First download the pretrained models and additional external code:

    ./scripts/init_LASER.sh

Follow the instructions in external/pyBPE to install the pyBPE tool.

Then, install the Python dependencies:

    pip install -r requirements.txt

Transformer model

Download the pre-trained weights:

    ./scripts/init_transformer.sh

Run

The codebase offers two choices:

  1. Embeddings (LASER, LSI) + a choice of classifiers (MLP, RandomForest, SVC).
  2. A Transformer model.

Train

  • Training an MLP classifier on LASER-encoded inputs:
    python -m comment_removal.laser_classifier train \
            --encoder-type laser \
            --clf-type mlp

If you prefer to skip encoding and training, pre-trained models are available. More specifically:

  • LASER-encoded inputs + RandomForest
  • LASER-encoded inputs + MLP
  • 300 dimensional LSI-encoded inputs + RandomForest
  • 300 dimensional LSI-encoded inputs + MLP

To download the above:

    ./scripts/download_LASER_classifiers.sh

  • Training the transformer model:
    python -m comment_removal.transformer train

Alternatively, you can open the IPython notebook in Colab, which is recommended as it is self-contained and benefits from GPU acceleration.

This uses the pre-trained weights from OpenAI's implementation loaded into a PyTorch implementation of the model.

Eval

To evaluate one of the previously encoded inputs and trained models, for example LASER-encoded inputs and a RandomForest classifier:

    python -m comment_removal.laser_classifier eval \
        --encoder-type laser \
        --clf-type randomforest \
        --predictions-file results/LASER_randomforest_predictions.csv

This will try to load the encoded inputs from workdir/test_laser-comments.npy and the model from workdir/laser_randomforest.npy.
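Roughly, the eval step amounts to loading the cached encodings and the fitted model, then predicting. A self-contained sketch, with fabricated tiny arrays instead of the actual workdir/ files and a freshly fitted forest standing in for the saved one:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-ins: in the real eval these would come from
#   X_test = np.load("workdir/test_laser-comments.npy")
# and the deserialized classifier checkpoint.
rng = np.random.RandomState(0)
X_test = rng.randn(10, 1024)
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_test, np.arange(10) % 2)   # hypothetical fitted model

# Probability of the "removed" class for each comment.
probs = clf.predict_proba(X_test)[:, 1]
print(probs.shape)  # (10,)
```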

Results

The codebase compares the following configurations:

Embeddings:

  • LSI:

    • keep_n = 10000 words, without filtering by frequency of appearance.

    • num_topics (the number of latent dimensions): two configurations are tested, 300 and 1024. Embeddings with 300 latent dimensions perform better, but we also test 1024 to match the dimensionality of the LASER-encoded inputs, and hence the classifier capacity. We use the 300-dimensional LSI embeddings as the baseline.

  • LASER: Using a BiLSTM trained on 93 languages (see the original repository). Similarly, we use the 93-language joint vocabulary and BPE codes.
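The LSI configuration above can be sketched with scikit-learn's TruncatedSVD over a bag-of-words matrix (one common way to compute LSI embeddings; the repo's actual implementation may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus; the repo would use the Reddit comment bodies.
docs = ["comment about science", "another science comment",
        "off topic chatter"] * 10

# keep_n = 10000: cap the vocabulary at the 10k most frequent words.
bow = CountVectorizer(max_features=10000).fit_transform(docs)

# num_topics = 300 in the repo; capped here because the toy vocabulary
# is tiny (TruncatedSVD needs n_components < n_features).
n_topics = min(300, bow.shape[1] - 1)
embeddings = TruncatedSVD(n_components=n_topics, random_state=0).fit_transform(bow)
print(embeddings.shape)  # (30, 6) for this toy corpus
```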

Classifiers:

  • MLP:

    • 3 hidden layers: (1024, 512, 128)
    • ReLU activation units
    • Trained with the Adam optimizer
    • Early stopping
  • RandomForest:

    • Number of estimators: 1000
    • Maximum depth: 100
    • Max features: 100
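Expressed as scikit-learn estimators (a plausible reading of the configurations listed above; the repo may construct them differently):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(1024, 512, 128),  # 3 hidden layers
    activation="relu",
    solver="adam",                        # Adam optimizer
    early_stopping=True,
)
forest = RandomForestClassifier(
    n_estimators=1000,
    max_depth=100,
    max_features=100,
)
print(mlp.hidden_layer_sizes, forest.n_estimators)
```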

Comparison

  • 1024-dimensional LSI embeddings + MLP (1024-LSI + MLP)

  • 300-dimensional LSI embeddings + MLP (300-LSI + MLP)

  • LASER embeddings + MLP (LASER + MLP)

As can be seen, using large pre-trained embedding models achieves performance similar to other baselines found in these Kaggle kernels, whilst involving little training and no hand-crafted feature extraction. Note that there is no lowercasing, word replacement, or any other text processing beyond tokenization and BPE encoding for the LASER embeddings.

Limitations

The following limitations are acknowledged as areas for improvement:

  • Configuration flexibility for the embeddings and classifiers
  • Proper experimentation logging (Sacred or similar)
  • Unit testing
  • Code documentation and Typing
