Twitter dialect classification

Description

In this project I used the QADI dataset to classify the different dialects of the Arabic language. I used machine learning methods, specifically logistic regression, linear SVM and naive Bayes. I also compared those methods with deep learning models, namely LSTM and word embeddings. I ran many of my experiments locally on my machine. I deployed the project using Flask.
There are 4 main files in this project:

  • 4 .py scripts (Data_fetching.py, Data_pre_processing.py, Model_training.py and app.py) with the final results/code

There are also

  • 2 Jupyter notebooks (fetching and processing.ipynb, and ML and DL training.ipynb) with the detailed code of all the experiments that I ran
  • a dicts.py file, which is only there to help with prediction in the Flask deployment. It contains a processing function that processes the user's input text
  • a picture final model.png of the final deep learning model
  • the Flask App's files like HTML and CSS files
  • a presentation file Tweeter dialect classification.pptx
  • The dataset
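
The processing function in dicts.py is not reproduced in this README. As a rough illustration of the kind of cleaning an Arabic tweet usually needs before it is passed to a classifier, here is a minimal sketch that uses only the standard library. The name clean_tweet and every rule in it are assumptions made for this example, not the project's actual code; the project lists farasapy and PyArabic as dependencies, and those libraries likely cover these steps more thoroughly.

```python
import re

# Hypothetical cleaning rules for illustration; dicts.py may do something different.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic diacritics (U+064B..U+0652) and the tatweel/kashida stretch character (U+0640).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is not an Arabic letter (U+0621..U+064A) or whitespace.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")


def clean_tweet(text: str) -> str:
    """Return the tweet reduced to plain Arabic words separated by single spaces."""
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    text = DIACRITICS_RE.sub("", text)  # strip diacritics and tatweel without splitting words
    text = NON_ARABIC_RE.sub(" ", text) # drop digits, Latin letters, emoji, punctuation
    return " ".join(text.split())       # collapse repeated whitespace
```

Diacritics are removed before the non-Arabic filter on purpose: they sit outside the letter range, so filtering first would replace them with spaces and split words apart.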

Getting Started

Dependencies

  • Anaconda is a must
  • tensorflow
  • flask
  • farasapy (Java must be installed for it to work)
  • PyArabic
  • gensim

Installing

  • This code was run successfully on my Windows machine.
  • It's recommended to create a new Anaconda environment with
conda create -n tf tensorflow
conda activate tf
  • you need to install the dependencies
conda install pandas
conda install scikit-learn 
conda install -c anaconda flask
conda install -c anaconda gensim
conda install tensorflow
conda install -c conda-forge matplotlib
pip install farasapy
pip install PyArabic
  • If running Jupyter fails with the message "'jupyter' is not recognized as an internal or external command", install Jupyter:
    pip install notebook
  • Install Java, which farasapy needs in order to work.
  • To train the models that require pretrained word embeddings, download the embeddings from
  1. Mazajak, specifically the CBOW model trained on 100M tweets (this is required to run Model_training.py)
  2. AraVec, specifically the Unigrams CBOW models with a vector size of 100
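
After installing, a quick check can confirm that the packages are importable and that Java is on the PATH. The sketch below is not part of the project. The import names are assumptions: farasapy is normally imported as farasa, PyArabic as pyarabic, and scikit-learn as sklearn. The function name check_environment is hypothetical.

```python
import importlib.util
import shutil

# Assumed import names for the dependencies listed above.
REQUIRED_MODULES = [
    "tensorflow", "flask", "farasa", "pyarabic",
    "gensim", "pandas", "sklearn", "matplotlib",
]


def check_environment(modules=REQUIRED_MODULES):
    """Map each module name (plus 'java') to True if it was found, else False."""
    status = {}
    for name in modules:
        try:
            # find_spec locates the package without importing it.
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            status[name] = False
    # farasapy calls out to Java, so the java executable must be on the PATH.
    status["java"] = shutil.which("java") is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Run it inside the activated conda environment; anything reported as MISSING still needs to be installed.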

Executing program

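  • Run the .py scripts from the command line in the right order, that is, the order in which they are listed in the Description (Data_fetching.py, Data_pre_processing.py, then Model_training.py).
  • If you're interested, you can open the Jupyter notebooks for the full, detailed code and experiments.
  • Type flask run in the command line to start the Flask app, then open it in your browser.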
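
For convenience, the scripts can also be chained from a small driver. This is a sketch under one assumption: the order is taken from the order in which the scripts are listed in the Description (fetch, pre-process, train). The helper names build_commands and run_pipeline are hypothetical and not part of the project.

```python
import subprocess
import sys

# Assumed order, taken from the order the scripts are listed in this README.
PIPELINE = ["Data_fetching.py", "Data_pre_processing.py", "Model_training.py"]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one [interpreter, script] command per script, in order."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE):
    """Run each script in turn and stop at the first failure."""
    for command in build_commands(scripts):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError if a script exits with an error,
        # so later steps never run on incomplete data.
        subprocess.run(command, check=True)
```

Call run_pipeline() from the project root with the conda environment active, then start the app with flask run.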

Results

  • Results of the deep learning models on the validation set
| Model name | Accuracy | F1 score |
|---|---|---|
| Embedding layer without LSTM from scratch | 0.523 | 0.494 |
| LSTM from scratch | 0.455 | 0.399 |
| Embedding layer with finetuned AraVec | 0.524 | 0.493 |
| Embedding layer with finetuned Mazajak | 0.526 | 0.497 |
| LSTM with fixed pretrained embedding Mazajak | 0.313 | 0.181 |
| LSTM with fixed pretrained embedding AraVec | 0.125 | 0.012 |
  • Results of the machine learning models on the validation set
| Model name | Accuracy | F1 score |
|---|---|---|
| uni-gram (TF-IDF) SVM | 0.512 | 0.478 |
| two-gram (TF-IDF) SVM | 0.538 | 0.507 |
  • Comparison of deep learning and machine learning on the test set
| Model name | Accuracy | F1 score |
|---|---|---|
| two-gram SVM | 0.5388 | 0.5072 |
| Embedding layer with finetuned Mazajak | 0.5288 | 0.5024 |
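
To make the two metric columns concrete, here is a small pure-Python sketch of accuracy and F1. The README does not say how the F1 score was averaged across dialects, so the macro average used here (the unweighted mean of per-class F1) is an assumption, and the labels in the example are made-up toy values rather than project data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); a class with no counts scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy labels for illustration only.
y_true = ["EG", "EG", "SA", "SA", "LB", "LB"]
y_pred = ["EG", "SA", "SA", "SA", "LB", "EG"]
print(round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

With many dialect classes, a macro-averaged F1 can sit below accuracy when some classes are predicted poorly, which would be consistent with the gap between the two columns in the tables.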

Author

name: Bassel Ali Mahmoud
email: [email protected]
