
aoc_id's Introduction

Deep Models for Arabic Dialect Identification on Benchmarked Data. [Paper].

This is a simple text classification library focused on dialect identification, built on Keras. It also includes some Arabic text normalization utilities.
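
As a rough illustration of what Arabic text normalization typically involves, here is a minimal sketch. The rules chosen (stripping diacritics, removing tatweel, unifying alef variants) and the function name `normalize_arabic` are assumptions made for this example; the utilities shipped with this library may apply different or additional rules.

```python
import re

# Assumed rules for illustration only; not taken from the library's code.
_DIACRITICS = re.compile("[\u064B-\u0652]")            # fathatan through sukun
_TATWEEL = "\u0640"                                    # elongation character
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")    # alef with madda / hamza
_PLAIN_ALEF = "\u0627"


def normalize_arabic(text: str) -> str:
    """Apply a few common Arabic normalization steps (hypothetical helper)."""
    text = _DIACRITICS.sub("", text)              # drop short-vowel marks
    text = text.replace(_TATWEEL, "")             # drop elongation
    text = _ALEF_VARIANTS.sub(_PLAIN_ALEF, text)  # unify alef forms
    return " ".join(text.split())                 # collapse whitespace


if __name__ == "__main__":
    # "Ahmad" written with a hamza-carrying alef and diacritics.
    print(normalize_arabic("\u0623\u064E\u062D\u0652\u0645\u064E\u062F"))
```

Normalizing text this way reduces spelling variation, which matters most when pretrained embeddings were themselves built from normalized text.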

Current Implemented Models:

1- Word-level CNN, based on [Convolutional Neural Network for Text Classification].

2- Word-level C-LSTM, based on [A C-LSTM Neural Network for Text Classification].

3- Recurrent networks and their variants (BiLSTM, LSTM, GRU, BiGRU, Attention-BiLSTM).

4- Models that are implemented but not yet exposed as options (Attention-LSTM, Attention-BiGRU).

5- Implemented but not yet tested: character-level CNN.

Requirements

- keras (2.0 or above)
- gensim
- numpy
- pandas

General Usage:

- Tested with Python 3.4.
- python test_baselines.py --train training_file --Ar='True' --dev Dev_File --test test_file --model_type=model_selection --static=Trainable_embeddings --rand=Random_Embeddings --embedding=External_Embedding_model --model_file=Output_model_file_inJson
- Put your training labels in conf/label_list: [link](https://github.com/UBC-NLP/aoc_id/edit/master/conf/label_list).
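
The command above can also be assembled programmatically. The sketch below only uses the flags shown in the usage line; the helper name `build_command`, the file names, and the choice to omit `--embedding` when no external model is given are assumptions for illustration, not behavior confirmed by the script.

```python
import shlex

# Model keys taken from the model_type option described in this README.
SUPPORTED_MODELS = {"cnn", "clstm", "lstm", "blstm", "bigru", "attbilstm"}


def build_command(train, dev, test, model_type, model_file,
                  embedding=None, static=True, arabic=True):
    """Return the argument list for test_baselines.py (hypothetical helper)."""
    if model_type not in SUPPORTED_MODELS:
        raise ValueError("unsupported model_type: %r" % model_type)
    cmd = [
        "python", "test_baselines.py",
        "--train", train,
        "--Ar=" + str(arabic),
        "--dev", dev,
        "--test", test,
        "--model_type=" + model_type,
        "--static=" + str(static),
        # Assumption: random initialization whenever no embedding is supplied.
        "--rand=" + str(embedding is None),
    ]
    if embedding is not None:
        cmd.append("--embedding=" + embedding)
    cmd.append("--model_file=" + model_file)
    return cmd


if __name__ == "__main__":
    # File names here are placeholders.
    example = build_command("train.csv", "dev.csv", "test.csv",
                            "attbilstm", "model.json",
                            embedding="aoc_embedding.bin", static=False)
    print(shlex.join(example))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with file paths.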

Options details

  • train: training file in CSV format, with two columns: text, label.
  • Ar: if True, Arabic normalization is applied (should be True when external embeddings are used).
  • dev: development file in CSV format.
  • test: test file in CSV format.
  • model_type: the model to train. Currently supported values: cnn (word-level CNN), clstm (word-level C-LSTM), lstm (vanilla LSTM), blstm (vanilla bidirectional LSTM), bigru (vanilla bidirectional GRU), attbilstm (BiLSTM with self-attention).
  • static: used with external embeddings. If True, the external embeddings are not fine-tuned during training; if False, they are fine-tuned.
  • rand: if True, no external embedding is used and the embedding layer is randomly initialized.
  • embedding: external embedding model in gensim format.
  • model_file: output model file in JSON.
  • EMB_type: the type of the external embedding: fastText, CBOW, or skip-gram.
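
To make the expected input format concrete, here is a minimal sketch that reads a two-column (text, label) CSV and checks each label against a label list. It assumes the file has no header row; the helper name `read_examples` and the labels `MSA` and `EGY` are placeholders invented for this example, not values taken from conf/label_list.

```python
import csv
import io


def read_examples(handle, allowed_labels):
    """Read (text, label) rows from a CSV handle (hypothetical helper)."""
    examples = []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError("line %d: expected 2 columns, got %d"
                             % (line_no, len(row)))
        text, label = row[0].strip(), row[1].strip()
        if label not in allowed_labels:
            raise ValueError("line %d: unknown label %r" % (line_no, label))
        examples.append((text, label))
    return examples


# Text containing a comma must be quoted so it stays in one column.
sample = io.StringIO('"some text, with a comma",MSA\nanother comment,EGY\n')
examples = read_examples(sample, {"MSA", "EGY"})
print(len(examples), "examples loaded")
```

Running a check like this on the train, dev, and test files before training catches malformed rows and labels missing from the label list early.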

Note: the final model score is written to a file named name_of_model_score, which contains both the dev and test scores.

Example Project (Arabic Dialect Identification with Deep Models)

  • This project applies 6 deep learning models to the Arabic Online Commentary (AOC) dataset. [Paper]; [Dataset].

  • Make sure to cite the original AOC paper if you use the dataset in your work.

  • This work has been accepted to the VarDial Workshop 2018, co-located with COLING 2018, under the title "Deep Models for Arabic Dialect Identification on Benchmarked Data" (paper soon).

  • Training data: [link]

  • Dev data: [link]

  • Test data: [link]

  • An example on how to use it is in: [link]

  • If you follow up on this project, please cite this work using the following BibTeX entry:

@inproceedings{Elaraby2018,
  title={Deep Models for Arabic Dialect Identification on Benchmarked Data},
  author={Elaraby, Mohamed and Abdul-Mageed, Muhammad},
  booktitle={Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial5)},
  year={2018}
}

External Embedding Models

  • For Arabic dialects, we release 2 embedding models:
  • AOC embedding: [Download URL]
  • Twitter Embedding Model: [Download URL]
  • Cite the following paper if you plan to use the city-level dialect embedding model:
@inproceedings{mageedYouTweet2018,
  title={You Tweet What You Speak: A City-Level Dataset of Arabic Dialects},
  author={Abdul-Mageed, Muhammad and Alhuzali, Hassan and Elaraby, Mohamed},
  booktitle={LREC},
  pages={3653--3659},
  year={2018}
}

aoc_id's People

Contributors

engsalem, mageed, mohamedhama

aoc_id's Issues

Licence information

I was wondering under what license the data is available. The paper states "Creative Commons Attribution 4.0" but is this also true for the dataset as such?
