hate-speech-classification's Introduction

Hate Speech Classification / Twitter Sentiment Classification

We developed a hate speech classification pipeline for tweets. It comprises preprocessing, dataset splitting, and three classifiers: Naive Bayes, a Support Vector Machine (SVM), and a neural network. Hyperparameter tuning was used to maximize the precision of each classifier. The best and final model was the SVM.

For the classification, we used a Jupyter notebook (.ipynb). This gave us quick access to various Python libraries while keeping a clear structure. The format is also well suited to working in a group via Google Colab.

Structure

Jupyter Notebook

  1. Imports & read data: Python libraries are imported and the training data is loaded
  2. Basic inspection of dataset
  3. Preprocessing:
    1. The placeholders "@USER", "RT" and "{{URL}}" are removed
    2. Stemming and lemmatization
    3. Conversion to lowercase
  4. Train/test split: Division into a training set (80%) and a test set (20%)
  5. Naïve Bayes
    1. Train model incl. hyperparameter tuning
    2. Evaluation (classification_report, ConfusionMatrixDisplay.from_estimator)
    3. Apply to test set
  6. SVM
    1. Train model incl. hyperparameter tuning
    2. Evaluation (classification_report, ConfusionMatrixDisplay.from_estimator)
    3. Apply to test set
  7. Neural Network
    1. Train model incl. hyperparameter tuning
    2. Evaluation (classification_report, ConfusionMatrixDisplay.from_estimator)
    3. Apply to test set
  8. Export results: The predictions of the classifier with the highest accuracy (in our case, the SVM) are exported
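
The preprocessing, split and SVM steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the TF-IDF features, the use of `LinearSVC` and `GridSearchCV`, the `C` grid, the stratified split, the label encoding and the tiny made-up corpus are all assumptions. Stemming and lemmatization are left out here; a library such as NLTK would typically handle that step.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder tokens named in the preprocessing step.
PLACEHOLDERS = re.compile(r"@USER|\bRT\b|\{\{URL\}\}")


def preprocess(tweet: str) -> str:
    """Drop placeholder tokens, collapse whitespace, and lowercase."""
    cleaned = PLACEHOLDERS.sub(" ", tweet)
    return " ".join(cleaned.split()).lower()


# Made-up toy corpus (assumption); label 1 = offensive, 0 = not offensive.
hostile = [
    "RT @USER you are an awful idiot {{URL}}",
    "@USER what a stupid awful take",
    "you idiot, nobody likes you @USER",
    "RT such a stupid, awful person",
]
friendly = [
    "@USER have a lovely day {{URL}}",
    "RT what a great and lovely game",
    "thanks @USER, that was great",
    "lovely weather and great coffee today",
]
texts = (hostile + friendly) * 5  # repeated only to have enough rows
labels = ([1] * 4 + [0] * 4) * 5

X = [preprocess(t) for t in texts]

# 80% training data, 20% held-out test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# Grid search over the regularization strength, scored by precision.
search = GridSearchCV(
    pipeline, {"svm__C": [0.1, 1.0, 10.0]}, scoring="precision", cv=2
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

Because the toy corpus repeats the same eight sentences, the scores it prints say nothing about real performance; the point is only the order of the steps.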

Dataset

train.tsv

The train.tsv file is the dataset used for training. It contains over 18,000 labeled tweets in tab-separated values (TSV) format.
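
As a rough illustration of reading such a file, the sketch below parses tab-separated rows with Python's standard `csv` module. The column names `text` and `label`, the header row and the sample rows are hypothetical, since the actual layout of train.tsv is not described here, and the notebook may well load the data differently (for example with pandas).

```python
import csv
import io

# Stand-in for train.tsv; column names and rows are hypothetical.
SAMPLE_TSV = (
    "text\tlabel\n"
    "@USER have a nice day\t0\n"
    "RT you are awful {{URL}}\t1\n"
)


def read_labeled_tweets(handle):
    """Return (texts, labels) from a tab-separated file with a header row."""
    # QUOTE_NONE keeps quotation marks inside tweets as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    texts, labels = [], []
    for row in reader:
        texts.append(row["text"])
        labels.append(int(row["label"]))
    return texts, labels


texts, labels = read_labeled_tweets(io.StringIO(SAMPLE_TSV))
print(len(texts), labels)
```

For a file on disk, the same function could be called with `open("train.tsv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.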

test.tsv.dist

The test.tsv.dist file is our evaluation set. It contains almost 5,000 unlabeled tweets whose labels are to be predicted by our trained models. Unfortunately, we don't have access to the corresponding dataset with the actual labels. We do know that our SVM labeled about 94% of them correctly.
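
Writing the predicted labels back out could look something like the sketch below. The output layout (a header row with `text` and `label`, one tweet per line) and the function name `write_predictions` are assumptions for illustration; the export format actually used is not specified here.

```python
import csv
import io


def write_predictions(handle, tweets, predicted_labels):
    """Write one tweet and its predicted label per row, tab-separated."""
    if len(tweets) != len(predicted_labels):
        raise ValueError("tweets and predicted_labels must have the same length")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["text", "label"])  # hypothetical header
    writer.writerows(zip(tweets, predicted_labels))


# In-memory buffer instead of a real file, to keep the example self-contained.
buffer = io.StringIO()
write_predictions(buffer, ["have a nice day", "you are awful"], [0, 1])
print(buffer.getvalue())
```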

Authors

This was a group project by Jan Bode and Jan Dorn, done as undergraduate students at the Karlsruhe Institute of Technology (KIT) in January 2022 as a bonus for the lecture 'Introduction to Artificial Intelligence'. The project is released under the standard MIT License.
