arabicsos's Introduction

ArabicSOS

Segmenter and Orthography Standardizer (SOS) for Classical Arabic (CA)

This is the beta version of the Arabic Segmenter for segmenting Classical Arabic texts. It is currently trained on a subset of the Al-Manar corpus created by Dr. Emad Mohamed.

Related Paper: Arabic-SOS: Segmentation, Stemming, and Orthography Standardization for Classical and pre-Modern Standard Arabic

Disclaimer: This package is still in the early stages of development, so the documentation is sparse. It has been tested in standard use cases, but the code may still contain bugs. Please back up your data before using the package. We greatly appreciate any feedback sent to the email addresses listed in the Contributors section.

Requirements:

  1. Python 3.x
  2. pandas >= 0.23.4 (pip install pandas)
  3. catboost >= 0.11.2 (pip install catboost)
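
The minimum versions listed above can be checked before running the segmenter. The following is a minimal sketch, not part of the package: the helper names (`as_tuple`, `meets_minimum`, `check_requirements`) are hypothetical, and it assumes Python 3.8 or later so that `importlib.metadata` is available.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the Requirements list.
MINIMUM_VERSIONS = {"pandas": "0.23.4", "catboost": "0.11.2"}


def as_tuple(version_string):
    """Turn '0.23.4' into (0, 23, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically."""
    return as_tuple(installed) >= as_tuple(minimum)


def check_requirements():
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems
```

Calling `check_requirements()` returns an empty list when both packages satisfy the stated minimums.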

Model Files: Please download the model file into the models folder (segmenter.py loads it from models/catboost_1.model) using the following command, run from inside that folder:
wget -v -O catboost_1.model -L https://iu.box.com/shared/static/mcu4frnipinfw7ery0wetrax7u7zzsxp.model
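
Where wget is not available, the same download can be sketched with the Python standard library. This is an illustration under assumptions rather than a tested replacement: the function names are hypothetical, and whether the shared link serves the file directly to a plain HTTP client has not been verified here. The destination models/catboost_1.model is the path that segmenter.py loads.

```python
import shutil
import urllib.request
from pathlib import Path

# URL copied from the wget command in this README.
MODEL_URL = "https://iu.box.com/shared/static/mcu4frnipinfw7ery0wetrax7u7zzsxp.model"


def model_path(repo_root="."):
    """Location where segmenter.py expects the model, relative to the repo root."""
    return Path(repo_root) / "models" / "catboost_1.model"


def download_model(url=MODEL_URL, repo_root="."):
    """Stream the model file to models/catboost_1.model and return that path."""
    destination = model_path(repo_root)
    # Create the models folder if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

Running `download_model()` from the repository root should leave the file where the segmenter looks for it, provided the link behaves as assumed.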

Usage: python segmenter.py input_file_path -o output_file_path

  • The input file path is mandatory.
  • If you do not provide an output file path, the output file is created in the same directory as the input file, and its name is the input file name with ".segmented" appended.
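
The argument handling described above can be sketched as follows. This is not the code in segmenter.py; it is a minimal reconstruction that assumes the attribute names `in_file` and `out_file`, with `build_parser` and `resolve_output_path` as hypothetical helper names.

```python
import argparse


def build_parser():
    """Command line with a mandatory input path and an optional -o output path."""
    parser = argparse.ArgumentParser(description="Segment Classical Arabic text.")
    parser.add_argument("in_file", help="path to the input file (mandatory)")
    parser.add_argument("-o", dest="out_file", default=None,
                        help="path to the output file (optional)")
    return parser


def resolve_output_path(in_file, out_file=None):
    """Fall back to '<input path>.segmented' when no output path is given."""
    if out_file:
        return out_file
    return in_file + ".segmented"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_output_path(args.in_file, args.out_file))
```

Because the suffix is appended to the full input path, the default output lands in the same directory as the input file.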

Note: The package assumes that every line in the file contains a single sentence.
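
If the input is not already one sentence per line, it can be preprocessed first. The sketch below is a deliberately naive illustration and not part of the package: it assumes that sentences end with ".", "!", "?" or the Arabic question mark "؟" followed by whitespace. That assumption often does not hold for Classical Arabic texts, where punctuation is sparse. The function name `one_sentence_per_line` is hypothetical.

```python
import re

# Split at whitespace that follows a sentence-ending mark (assumed set of marks).
SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def one_sentence_per_line(text):
    """Return text with one sentence per line, dropping empty lines."""
    sentences = []
    for line in text.splitlines():
        for piece in SENTENCE_END.split(line.strip()):
            if piece:
                sentences.append(piece)
    return "\n".join(sentences)
```

Lines that contain no sentence-ending mark are passed through unchanged, so text that already has one sentence per line is not altered apart from trimming whitespace and removing blank lines.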

Example: The sample folder contains a file named P105.txt with raw Arabic text. We can segment it as follows:
python segmenter.py sample/P105.txt -o sample/my_segmented_file.txt
Or simply:
python segmenter.py sample/P105.txt
which creates the file P105.txt.segmented in the sample folder.
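
To segment several files at once, the command above can be wrapped in a small driver script. This is a sketch under assumptions, not a feature of the package: `build_command` and `segment_folder` are hypothetical names, and the script is assumed to run from the repository root so that segmenter.py and the models folder resolve correctly.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_path, output_path=None):
    """Build the segmenter command line for one file."""
    command = [sys.executable, "segmenter.py", str(input_path)]
    if output_path is not None:
        command += ["-o", str(output_path)]
    return command


def segment_folder(folder, pattern="*.txt"):
    """Run the segmenter on every matching file and return the expected outputs."""
    outputs = []
    for input_path in sorted(Path(folder).glob(pattern)):
        # check=True raises CalledProcessError if the segmenter exits with an error.
        subprocess.run(build_command(input_path), check=True)
        # With no -o argument, the output is the input path plus ".segmented".
        outputs.append(Path(str(input_path) + ".segmented"))
    return outputs
```

For example, `segment_folder("sample")` would process P105.txt and any other .txt file in the sample folder.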

Contributors:

  1. Zeeshan Ali Sayyed ([email protected])
  2. Emad Mohamed ([email protected])

Acknowledgment: “This project was made possible by NPRP grant NPRP10-0115-170163 from the Qatar National Research Fund (a member of Qatar Foundation). The findings achieved herein are solely the responsibility of the authors”.

arabicsos's Issues

Segmentation into texts or words?

Thanks for sharing.
Does Arabic need segmentation the way Chinese does?
That is, when doing an NLP task with Arabic, is it necessary to split the text into words first?
Thanks!

Model file doesn't exist: models/catboost_1.model

Hello, thank you so much for sharing your tool.

I'm trying to get it working. I followed the instructions and installed the requirements,
then ran this command: wget -v -O catboost_1.model https://iu.box.com/shared/static/mcu4frnipinfw7ery0wetrax7u7zzsxp.model

I could not run it with the -L parameter; that gave me an error message.

The download printed messages for a while and then finished.
Then I tried this command:
python segmenter.py sample/wiki_books_test_0.txt -o sample/my_segmented_wiki_books_test_0.txt

I got this error message:

Traceback (most recent call last):
  File "segmenter.py", line 46, in <module>
    segment(args.in_file, args.out_file)
  File "segmenter.py", line 24, in segment
    model.load_model("models/catboost_1.model")
  File "c:\users\twins\anaconda3\lib\site-packages\catboost\core.py", line 2589, in load_model
    self._load_model(fname, format)
  File "c:\users\twins\anaconda3\lib\site-packages\catboost\core.py", line 1315, in _load_model
    self._object._load_model(model_file, format)
  File "_catboost.pyx", line 4681, in _catboost._CatBoost._load_model
  File "_catboost.pyx", line 4684, in _catboost._CatBoost._load_model
_catboost.CatBoostError: c:/program files (x86)/go agent/pipelines/buildmaster/catboost.git/catboost/libs/model/model_import_interface.h:19: Model file doesn't exist: models/catboost_1.model

Could you help me, please? I really need this tool.
Thank you.
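
The traceback shows segmenter.py loading the relative path models/catboost_1.model, so one plausible reading is that the file was downloaded into the current directory rather than into the models folder, or that the script was started from a different directory. The following sketch checks for those cases before the segmenter is run. It is a diagnostic aid written for this thread, not part of the package, and `find_model_problem` is a hypothetical name.

```python
from pathlib import Path


def find_model_problem(model_path="models/catboost_1.model"):
    """Return a description of what is wrong with the model file, or None."""
    path = Path(model_path)
    if not path.exists():
        message = f"{path} does not exist relative to {Path.cwd()}."
        # A common cause: the download landed next to segmenter.py instead.
        stray = Path(path.name)
        if stray.exists():
            message += f" Found {stray} in the current directory; move it to {path}."
        return message
    if path.stat().st_size == 0:
        return f"{path} is empty; the download may have failed."
    return None


if __name__ == "__main__":
    problem = find_model_problem()
    print(problem or "Model file found.")
```

Run it from the same directory in which you start segmenter.py, since the model path is resolved relative to the current working directory.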
