Real-time Twitter hate speech detection
This is the master branch of the repo. It contains the codebase for the machine learning approach we followed to train our model.
The other branch, webapp, mainly contains the codebase for deployment.
We combined several open-source datasets from different hackathons and competitions into one big dataset containing a wide variety of tweets. The dataset focuses mainly on English.
Dataset Exploration has the code for exploring the datasets and shows how we concatenated them.
Dataset Preprocessing contains the code for cleaning the dataset, since raw tweets cannot be fed directly to machine learning models. It also covers the techniques we used to extract useful features from the text, such as hashtags and user mentions.
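As a rough illustration of this kind of cleaning, here is a minimal sketch using only Python's standard library. The regular expressions and the function names extract_features and clean_tweet are assumptions made for this example; the actual rules in Dataset Preprocessing may differ.

```python
import re

# Hypothetical patterns for illustration; the repo's real rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_features(tweet: str) -> dict:
    """Collect hashtags and user mentions (lowercased, without '#' or '@')."""
    return {
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(tweet)],
        "mentions": [m[1:].lower() for m in MENTION_RE.findall(tweet)],
    }


def clean_tweet(tweet: str) -> str:
    """Reduce a raw tweet to lowercase words separated by single spaces."""
    text = URL_RE.sub(" ", tweet)             # drop links
    text = MENTION_RE.sub(" ", text)          # drop @user mentions
    text = HASHTAG_RE.sub(r"\1", text)        # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # drop digits, punctuation, emoji
    return re.sub(r"\s+", " ", text).strip().lower()


if __name__ == "__main__":
    raw = "@User1 Check this out! #HateSpeech https://t.co/abc 123"
    print(extract_features(raw))
    print(clean_tweet(raw))
```

Keeping feature extraction separate from cleaning means hashtags and mentions can be used as features of their own before they are stripped from the text.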
We trained our model using two prominent ML algorithms for binary classification: Multinomial Naive Bayes and Logistic Regression (LR).
The final saved model is the LR classifier trained with n-grams of range (1,3) as lexical features.
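A setup like this can be sketched with scikit-learn as follows. This is an assumption-laden example, not the repo's training script: the choice of CountVectorizer (rather than, say, TF-IDF), the max_iter value, the use of joblib for saving, the file name, and the toy tweets are all placeholders chosen for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data for illustration only; 1 = hateful, 0 = not hateful.
texts = [
    "i hate you and people like you",
    "you people are disgusting",
    "what a lovely morning",
    "great game last night",
]
labels = [1, 1, 0, 0]

# Unigrams, bigrams and trigrams as lexical features, fed to LR.
model = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 3))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Save and reload the fitted pipeline (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "lr_ngram_model.joblib")
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# Probability of each class for a new tweet, in the order [0, 1].
proba = restored.predict_proba(["i hate you people"])
```

Bundling the vectorizer and classifier in one pipeline means the saved file carries the n-gram vocabulary along with the weights, so new tweets can be scored without refitting anything.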
The training set classification report was:
The test set classification report was:
The AUC-ROC curve for the test set was:
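Metrics of this kind can be computed with scikit-learn along the following lines. The labels and scores below are made-up toy values used only to show the calls; they are not the project's results, and the 0.5 decision threshold and class names are assumptions.

```python
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Toy values for illustration only, NOT results from this project.
y_true = [0, 0, 1, 1, 0, 1]                 # 1 = hateful, 0 = not hateful
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # predicted probability of class 1
y_pred = [int(s >= 0.5) for s in y_score]   # assumed 0.5 decision threshold

# Per-class precision, recall and F1 as a printable text table.
report = classification_report(
    y_true, y_pred, target_names=["not hate", "hate"], zero_division=0
)

# Area under the ROC curve, computed from scores rather than hard labels.
auc = roc_auc_score(y_true, y_score)

# Points of the ROC curve, ready to plot as tpr against fpr.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

if __name__ == "__main__":
    print(report)
    print(f"AUC-ROC: {auc:.3f}")
```

With a fitted model, y_score would come from predict_proba(...)[:, 1] on the held-out test tweets.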