Implementation of a sarcasm detection model using deep learning with BERT to detect sarcasm in news headlines and social media posts
By: Nanna Hannesdottir, Kevin Vogt-Lowell, Cole Hunter
- PyTorch 1.11
- CUDA 11.2
- numpy, pandas, matplotlib
- io, os
- sklearn
- optuna
- tqdm
- json
Additionally, we used several classes and functions from the Hugging Face transformers library (installation commands are included in the notebooks).
We used the following two datasets in our experimentation, both available on Kaggle.
- SARC Reddit data: https://www.kaggle.com/datasets/danofer/sarcasm (see the NOTE below)
- News Headlines data: https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection
The News Headlines dataset can also be found in this repository under the data/ folder. The Reddit data is too large to upload to GitHub directly, so it must be downloaded from Kaggle.
Under data/ you can also find a notebook with our preliminary inspections of the datasets. This notebook is exploratory only and not required to replicate the results.
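For orientation, here is a minimal sketch of loading the News Headlines data outside the notebooks. It assumes the file is in the JSON Lines format distributed on Kaggle, with one record per line containing `headline` and `is_sarcastic` fields. The function name `load_headlines` and whatever path you pass to it are illustrative and are not part of this repository.

```python
import json


def load_headlines(path):
    """Read a JSON Lines file into parallel lists of headlines and labels.

    Assumes each non-empty line is a JSON object with a "headline" string
    and an "is_sarcastic" flag (0 or 1), as in the Kaggle download.
    """
    headlines, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            headlines.append(record["headline"])
            labels.append(int(record["is_sarcastic"]))
    return headlines, labels
```

The returned lists can be passed to a tokenizer and wrapped in a dataset class; the notebooks remain the reference for how the data is actually prepared.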
- Main script for training and testing: basic_BERT_reddit.ipynb
- Functions and classes: defined in reddit_bert_functions.py, bert_sarcasm_model.py
- Main script with all functionality: multihead_bert_reddit.ipynb. Just run cells in order.
- Main script including all classes and functions for training and testing: Headlines_model.ipynb
- Main script with all functionality: multihead_bert_headlines.ipynb. Just run cells in order.
**NOTE: For Reddit, we used train-balanced-sarcasm.csv exclusively. The Reddit dataset ships with separate training and test sets, and we had originally planned to use those splits as provided. However, closer analysis showed that the data in the test set appeared to bear no relation, in context or format, to the data in the training set. Given the very large size of the dataset, we resolved this by treating the provided training set as our full Reddit dataset and creating our own training, validation, and test splits from it. Judging by notebooks available on Kaggle, other users of this dataset have frequently taken the same approach.
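The re-splitting described in the note can be sketched as follows. This is an illustration only: the 80/10/10 proportions, the random seed, the `label` column name, and the helper name `split_dataset` are assumptions, and the notebooks remain the reference for the splits we actually used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df, label_col="label", val_frac=0.1, test_frac=0.1, seed=42):
    """Split a DataFrame into stratified train/validation/test frames.

    The fractions and seed are illustrative defaults, not the values
    used in the notebooks.
    """
    # Convert fractions to absolute row counts so the two-stage split
    # yields exactly the intended sizes.
    n_test = round(test_frac * len(df))
    n_val = round(val_frac * len(df))

    # Stage 1: hold out the test set, stratifying on the label so the
    # sarcastic/non-sarcastic ratio is preserved.
    train_val, test = train_test_split(
        df, test_size=n_test, stratify=df[label_col], random_state=seed
    )
    # Stage 2: carve the validation set out of the remaining rows.
    train, val = train_test_split(
        train_val, test_size=n_val, stratify=train_val[label_col], random_state=seed
    )
    return train, val, test
```

With the Reddit data this could be called as `split_dataset(pd.read_csv("train-balanced-sarcasm.csv"))`, assuming the CSV has been downloaded from Kaggle into the working directory.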