Kurdish Tokenization

A Tokenization System for the Kurdish Language (Sorani & Kurmanji dialects)

This repository contains the data for the tokenization system described in the paper "A Tokenization System for the Kurdish Language". The paper proposes an approach to tokenizing the Sorani and Kurmanji dialects of Kurdish using a lexicon and a morphological analyzer. The tokenizer is available as a module in the Kurdish Language Processing Toolkit (KLPT).

Gold-standard Datasets

In addition to the tokenization tool, we provide a gold-standard dataset in the data folder containing 100 Sorani and Kurmanji sentences in the Text Corpus Format. These sentences are manually tokenized and can therefore be used for evaluation.
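As a rough illustration of how manually tokenized sentences can serve as an evaluation reference, the sketch below scores a predicted tokenization against a gold one by comparing token boundaries. This is a minimal sketch under stated assumptions, not necessarily the metric used in the paper: the function names `token_spans` and `score` are hypothetical, the example sentence is made up for illustration, and loading sentences from the Text Corpus Format files is not shown.

```python
def token_spans(tokens):
    """Return the set of (start, end) character spans covered by each token.

    Offsets are counted over the non-space characters only, so a
    multi-word token such as "riswa kirin" covers the same characters
    as the two separate tokens "riswa" and "kirin".
    """
    spans = set()
    start = 0
    for token in tokens:
        length = len(token.replace(" ", ""))
        spans.add((start, start + length))
        start += length
    return spans


def score(gold_tokens, predicted_tokens):
    """Compute precision, recall and F1 over exactly matching token spans."""
    gold = token_spans(gold_tokens)
    predicted = token_spans(predicted_tokens)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the gold annotation keeps the compound as one token,
# while the prediction splits it on whitespace.
gold = ["ew", "riswa kirin"]
predicted = ["ew", "riswa", "kirin"]
precision, recall, f1 = score(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Span matching is used here instead of comparing token strings by position, because a single split or merge would otherwise shift every later token and count it as an error.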

Annotated Lexicons

We also provide a set of manually annotated lexicons for this tool, which are continually updated and extended. These lexicons contain word lemmata in Kurdish along with hyphen-separated multi-word expressions. The current version contains lexicographic data provided by the FreeDict project and Wîkîferheng, the Kurdish Wiktionary. The transliteration of the Arabic-based script of Kurdish into the Latin-based one is carried out using Wergor. If you would like to help enrich these resources, please follow the instructions of the Kurdish Language Processing Toolkit (KLPT).

The following shows two lemmata in the Kurmanji lexicon where the possible writings of a compound word-form are provided in the token_forms field.

"riswa": [],
"riswa-kirin": {
    "token_forms": ["riswakirin", "riswa kirin"]
}
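To show how the token_forms field might be used, the sketch below builds a reverse index from each written form to its lemma and then merges adjacent words that together match a known compound. It is only an illustration under stated assumptions and not the algorithm from the paper or the KLPT implementation: the function names `build_form_index` and `merge_compounds`, the `max_words` parameter, and the example sentence are hypothetical, and the lexicon fragment is wrapped in braces so that it parses as JSON.

```python
import json

# The two entries shown above, wrapped in braces to form valid JSON.
LEXICON_JSON = """
{
  "riswa": [],
  "riswa-kirin": {
    "token_forms": ["riswakirin", "riswa kirin"]
  }
}
"""


def build_form_index(lexicon):
    """Map every written form (and each lemma itself) to its lemma."""
    index = {}
    for lemma, entry in lexicon.items():
        index[lemma] = lemma
        # Only entries stored as objects carry a token_forms list.
        if isinstance(entry, dict):
            for form in entry.get("token_forms", []):
                index[form] = lemma
    return index


def merge_compounds(words, index, max_words=3):
    """Greedy longest match: join adjacent words that form a known compound."""
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest window first and fall back to a single word.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in index:
                tokens.append(index.get(candidate, candidate))
                i += n
                break
    return tokens


index = build_form_index(json.loads(LEXICON_JSON))
# Made-up input: whitespace splitting separates the compound into two words.
print(merge_compounds("ew riswa kirin".split(), index))
```

With this index, both written forms of the compound ("riswakirin" and "riswa kirin") resolve to the lemma "riswa-kirin", while words that are not in the lexicon are passed through unchanged.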

For researchers

If you would like to extend the current study, you can find the trained models in the models directory. Please use the corresponding libraries to import the models into your pipelines. The output of the models is also available in the experiments folder.

Contribute

Are you interested in this project? Please follow the instructions of the Kurdish Language Processing Toolkit (KLPT) to get involved. Open-source is fun! 😊

Cite this paper

Please consider citing this paper if you use any part of the data or the tool (bib file):

@inproceedings{ahmadi2020tokenization,
  title={{A Tokenization System for the Kurdish Language}},
  author={Ahmadi, Sina},
  booktitle={Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020)},
  pages={},
  year={2020}
}

License

The annotated resources by Sina Ahmadi are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, which means:

  • You are free to share, copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, even commercially.
  • You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
