
demfier / ebm-sanskrit-word-segmentation


Python implementation of "Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit," EMNLP 2018

Home Page: https://zenodo.org/record/1035413/#.XGZGj7pKhCV

Jupyter Notebook 2.34% Python 97.66%
sanskrit-segmentation sequence-to-sequence energy-based-model deep-learning natural-language-processing morphological-tagging ebm numpy

Word Segmentation and Morphological Tagging in Sanskrit Using Energy Based Models

Code for our paper: Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit, accepted at EMNLP 2018, Brussels, Belgium.

The pre-trained model and other data files are distributed on Zenodo [link]. Additional helper modules are provided in this repo, inside the helpers and outdated folders.

Team members:

Amrith Krishna, Bishal Santra, Sasi Prasanth Bandaru, Gaurav Sahu, Vishnu Dutt Sharma, Pavankumar Satuluri and Pawan Goyal.

Getting Started

Please download the two compressed files, dir.zip and wordsegmentation.rar, to your working directory and extract them into folders named dir and wordsegmentation respectively.

Your working directory should then look as follows:

Working Directory
| -- wordsegmentation
| ----- skt_dcs_DS.bz2_4K_bigram_mir_10K
| ----- skt_dcs_DS.bz2_4K_bigram_mir_heldout
| -- dir
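
Before training, it can help to confirm that the extraction produced the layout above. The following is a minimal sketch and not part of the repository: the function name `missing_paths` is hypothetical, and the expected folder names are simply copied from the tree above.

```python
from pathlib import Path

# Folder names copied from the directory tree above.
EXPECTED = [
    "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K",
    "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_heldout",
    "dir",
]


def missing_paths(root="."):
    """Return the expected folders that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).is_dir()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Working directory layout looks complete.")
```

Run it from the working directory; an empty result means all three folders were found.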

Prerequisites

Instructions for Training

Change your current directory to dir.

Run Train_clique.py using the following command:

  • python Train_clique.py

To train on a different input feature (BM2, BM3, BR2, BR3, PM2, PM3, PR2 or PR3), modify the bz2_input_folder value in the main function before starting the training.

Feature   bz2_input_folder
BM2       wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K/
BM3       wordsegmentation/skt_dcs_DS.bz2_1L_bigram_mir_10K
BR2       wordsegmentation/skt_dcs_DS.bz2_4K_bigram_rfe_10K/
BR3       wordsegmentation/skt_dcs_DS.bz2_1L_bigram_rfe_10K/
PM2       wordsegmentation/skt_dcs_DS.bz2_4K_pmi_mir_10K/
PM3       wordsegmentation/skt_dcs_DS.bz2_1L_pmi_mir_10K2/
PR2       wordsegmentation/skt_dcs_DS.bz2_4K_pmi_rfe_10K/
PR3       wordsegmentation/skt_dcs_DS.bz2_1L_pmi_rfe_10K/
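
The table above can be kept as a lookup so that switching features is a one-line change. This is a sketch under assumptions: `FEATURE_FOLDERS` and `folder_for` are hypothetical names that do not appear in Train_clique.py, and the paths are copied verbatim from the table, including their inconsistent trailing slashes.

```python
# Paths copied verbatim from the feature table above.
FEATURE_FOLDERS = {
    "BM2": "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K/",
    "BM3": "wordsegmentation/skt_dcs_DS.bz2_1L_bigram_mir_10K",
    "BR2": "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_rfe_10K/",
    "BR3": "wordsegmentation/skt_dcs_DS.bz2_1L_bigram_rfe_10K/",
    "PM2": "wordsegmentation/skt_dcs_DS.bz2_4K_pmi_mir_10K/",
    "PM3": "wordsegmentation/skt_dcs_DS.bz2_1L_pmi_mir_10K2/",
    "PR2": "wordsegmentation/skt_dcs_DS.bz2_4K_pmi_rfe_10K/",
    "PR3": "wordsegmentation/skt_dcs_DS.bz2_1L_pmi_rfe_10K/",
}


def folder_for(tag):
    """Return the input folder for a feature tag (case-insensitive)."""
    try:
        return FEATURE_FOLDERS[tag.upper()]
    except KeyError:
        valid = ", ".join(sorted(FEATURE_FOLDERS))
        raise ValueError(f"Unknown feature tag {tag!r}; expected one of: {valid}") from None
```

With such a helper, the assignment in the main function could read `bz2_input_folder = folder_for("BR2")`; whether the path needs a prefix relative to dir depends on how the script resolves it.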

Instructions for Testing

After training, please modify the modelList dictionary in test_clique.py with the name of the neural network that has been saved during training. While testing for a feature, please provide the name of the neural net which was trained for the same feature.

We only provide the trained model for BM2, which was our best-performing feature. If the name of the neural net is not changed, testing will be performed on the pre-trained BM2 model provided in outputs/train_t7978754709018.

To test with a particular feature vector, pass the tag of the feature at execution:

  • python test_clique.py -t <tag>

For example:

  • python test_clique.py -t BM2
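
For readers wiring the same flag into their own scripts, a `-t <tag>` option could be parsed along the following lines. This is only a sketch using the standard argparse module: the README does not show how test_clique.py actually reads the flag, so `parse_tag` and `VALID_TAGS` are hypothetical names, and restricting the choices to the eight tags listed earlier is an assumption.

```python
import argparse

# Feature tags listed in the training section of this README.
VALID_TAGS = ["BM2", "BM3", "BR2", "BR3", "PM2", "PM3", "PR2", "PR3"]


def parse_tag(argv=None):
    """Parse `-t <tag>` and return the selected feature tag."""
    parser = argparse.ArgumentParser(description="Select the feature to test.")
    parser.add_argument(
        "-t",
        dest="tag",
        choices=VALID_TAGS,
        required=True,
        help="feature tag, e.g. BM2",
    )
    return parser.parse_args(argv).tag


# Example: parse_tag(["-t", "BM2"]) returns "BM2";
# calling parse_tag() with no argument reads sys.argv instead.
```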

After testing has finished, run the following command to see the precision and recall values for both the word and word++ prediction tasks:

  • python evaluate.py <tag>

For example:

  • python evaluate.py BM2
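
For reference, precision and recall over predicted versus gold items can be computed as in the sketch below. It is a generic illustration and not the logic of evaluate.py: the function name `precision_recall`, the multiset treatment of repeated items, and the placeholder tokens are all assumptions made for the example.

```python
from collections import Counter


def precision_recall(predicted, gold):
    """Return (precision, recall) for two lists of hashable items.

    Repeated items are matched as a multiset: an item predicted twice
    counts as correct twice only if it also occurs twice in `gold`.
    """
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection keeps the minimum count of each shared item.
    correct = sum((pred_counts & gold_counts).values())
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Placeholder tokens: two of four predictions are correct,
# and two of three gold items are recovered.
p, r = precision_recall(["w1", "w2", "w3", "w4"], ["w1", "w2", "w5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

The items can be any hashable objects, so the same function works for plain words or for tuples that pair a word with additional information.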

Reference:

If you find any part of our code useful, please cite:

@article{krishna2018free,
  title={Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit},
  author={Krishna, Amrith and Santra, Bishal and Bandaru, Sasi Prasanth and Sahu, Gaurav and Sharma, Vishnu Dutt and Satuluri, Pavankumar and Goyal, Pawan},
  journal={arXiv preprint arXiv:1809.01446},
  year={2018}
}
