This repo explores the comment-removal prediction task using a sentence embedding mechanism followed by a classifier of choice.
For example, encoding the Reddit comments using LASER as inputs to different classifiers (MLP, SVM, or random forest).
The focus is on assessing how different embedding choices affect the classification; little effort goes into finding the best classifier or fine-tuning its parameters.
Additionally, a Transformer language model with a classifier head is also explored.
We use the Reddit comment removal dataset. The dataset is a CSV of about 30k Reddit comments made in /r/science between January 2017 and June 2018. 10k of the comments were removed by
moderators; the original text for these comments was recovered using the pushshift.io API.
Each comment is a top-level reply to the parent post and has a comment score of 14 or higher.
The dataset comes from Google BigQuery, Reddit, and Pushshift.io.
In scripts/explore_dataset.ipynb there's an overview of the dataset: class counts, input lengths, and a sentiment analysis of a small random sample grouped by label.
The codebase tries to make few assumptions and avoids hand-crafted features, but the notebook is helpful for understanding the nature of the data at hand.
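The kind of overview computed in the notebook can be sketched as below. This is an illustrative helper, not the notebook's actual code, and the `(text, label)` pair shape is an assumption; check the CSV header for the real column names.

```python
# Sketch of a dataset overview: class counts and mean input length
# (in words) grouped by the removed/kept label.
from collections import Counter

def summarize(rows):
    """rows: iterable of (comment_text, removed_label) pairs."""
    counts = Counter(label for _, label in rows)
    lengths = {}
    for text, label in rows:
        lengths.setdefault(label, []).append(len(text.split()))
    avg_len = {lbl: sum(v) / len(v) for lbl, v in lengths.items()}
    return counts, avg_len

# Tiny synthetic stand-in for data/reddit_train.csv:
rows = [("a short comment", 0),
        ("another kept comment here", 0),
        ("this one was removed", 1)]
counts, avg_len = summarize(rows)
```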
# tree -L 3 --dirsfirst -I "*.pyc|*cache*|*init*|*.npy|*.png|*.pkl"
.
├── comment_removal
│ ├── utils
│ │ ├── batchers.py
│ │ ├── loaders.py
│ │ ├── metrics.py
│ │ ├── mutils.py
│ │ ├── plotting.py
│ │ └── text_processing.py
│ ├── encoders.py
│ ├── laser_classifier.py
│ └── transformer_classifier.py
├── data
│ ├── reddit_test.csv
│ └── reddit_train.csv
├── external # external model checkpoints and modified model definitions
│ ├── models
│ │ ├── LASER # LASER encoder checkpoints
│ │ ├── transformer # openAI Transformer checkpoints
│ │ ├── laser.py # extended LASER model definition
│ │ └── transformer.py # extended Transformer model definition
│ └── pyBPE # BPE encoding codebase dependency for LASER encoding
├── results
│ └── test_predictions.csv
├── scripts
│ ├── init_LASER.sh # Download pyBPE and LASER weights
│ ├── init_transformer.sh # Download Transformer weights
│ └── explore_dataset.ipynb
├── tests
│ └── test_embeddings.py
├── workdir
├── README.md
├── requirements.txt
└── setup.cfg
First download the pretrained models and additional external code:
./scripts/init_LASER.sh
Follow the instructions in external/pyBPE to install the pyBPE tool.
Then, install the python dependencies:
pip install -r requirements.txt
Download the pre-trained weights:
./scripts/init_transformer.sh
The codebase offers two choices:

- Embeddings (LASER, LSI) + a choice of classifiers (MLP, RandomForest, SVC)
- Transformer model
- Training an MLP classifier on LASER-encoded inputs:
python -m comment_removal.laser_classifier train \
--encoder-type laser \
--clf-type mlp
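Conceptually, the command above runs an embed-then-classify pipeline. A minimal sketch, using random 1024-dimensional vectors as stand-ins for the LASER sentence embeddings (producing real LASER encodings requires the downloaded checkpoints) and an sklearn MLP with the hidden-layer sizes used in this repo:

```python
# Embed-then-classify sketch: fixed-size sentence embeddings fed to an MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024))   # stand-in for LASER-encoded comments
y = rng.integers(0, 2, size=200)   # 0 = kept, 1 = removed

clf = MLPClassifier(hidden_layer_sizes=(1024, 512, 128),
                    activation="relu", solver="adam",
                    early_stopping=True, max_iter=20)
clf.fit(X, y)
preds = clf.predict(X)
```

Because the classifier only ever sees fixed-size vectors, swapping LASER for LSI (or MLP for a random forest) changes nothing downstream of the encoding step.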
If you prefer to skip encoding and training, pre-trained models are available. More specifically:

- LASER-encoded inputs + RandomForest
- LASER-encoded inputs + MLP
- 300-dimensional LSI-encoded inputs + RandomForest
- 300-dimensional LSI-encoded inputs + MLP
To download the above:
./scripts/download_LASER_classifiers.sh
- Training the transformer model:
python -m comment_removal.transformer train
Alternatively, you can open the notebook in Colab, which is recommended as it is self-contained and benefits from GPU acceleration.
This uses the pre-trained weights from the OpenAI implementation loaded into a PyTorch implementation of the model.
To evaluate one of the previously encoded inputs and trained models, for example LASER-encoded inputs and a RandomForest classifier:
python -m comment_removal.laser_classifier eval \
--encoder-type laser \
--clf-type randomforest \
--predictions-file results/LASER_randomforest_predictions.csv
This will try to load the encoded inputs from workdir/test_laser-comments.npy and the model from workdir/laser_randomforest.npy.
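The cached inputs use NumPy's .npy serialization; a round-trip can be illustrated as below. The in-memory buffer stands in for the workdir files, and the array shape is illustrative.

```python
# .npy round-trip sketch for the cached encoded comments.
import io
import numpy as np

X = np.zeros((4, 1024))            # stand-in for encoded comments
buf = io.BytesIO()                 # stands in for workdir/test_laser-comments.npy
np.save(buf, X)
buf.seek(0)
X_loaded = np.load(buf)
```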
The codebase compares the following configurations:

- LSI:
    - keep_n = 10000 words, without filtering by frequency of appearance
    - num_topics (number of latent dimensions): two configurations are tested, 300 and 1024. Embeddings with 300 latent dimensions perform better, but we test with 1024 too so we can compare by matching the dimensionality of the LASER-encoded inputs and hence the classifier capacity. We use the 300-dimensional LSI embeddings as the baseline.
- LASER: using a BiLSTM trained on 93 languages (see the original repository). Similarly, we use the 93-language joint vocabulary and BPE codes.
- MLP:
    - 3 hidden layers: (1024, 512, 128)
    - ReLU activation units
    - Trained with the Adam optimizer
    - Using early stopping
- RandomForest:
    - Number of estimators: 1000
    - Maximum depth: 100
    - Max features: 100
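The LSI configuration above amounts to a truncated SVD of a term-document matrix. The repo's exact implementation isn't reproduced here; this sketch uses sklearn's CountVectorizer and TruncatedSVD as stand-ins, with a small num_topics to suit the toy corpus (the repo uses 300 or 1024):

```python
# LSI sketch: bag-of-words counts projected onto a few latent dimensions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["the study shows an effect",
          "no effect was found in the study",
          "completely unrelated comment"]
counts = CountVectorizer(max_features=10000).fit_transform(corpus)  # keep_n analogue
lsi = TruncatedSVD(n_components=2, random_state=0)                  # num_topics analogue
X = lsi.fit_transform(counts)   # one dense embedding row per comment
```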
1024-dimensional LSI embeddings + MLP:

300-dimensional LSI embeddings + MLP:
As can be seen, using large pre-trained embedding models achieves performance similar to other baselines found in these Kaggle kernels, while involving little training and no hand-crafted feature extraction. Note that there's no lowercasing, word replacement, or any other type of text processing other than tokenization and BPE encoding for the LASER embeddings.
The following limitations are acknowledged:
- Configuration flexibility for the embeddings and classifiers
- Proper experimentation logging (Sacred or similar)
- Unit testing
- Code documentation and Typing