STIF-Indonesia

An implementation of "Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation".

This repository also contains an Indonesian informal-formal parallel corpus.

Description

We study how to transform an informal Indonesian sentence into its formal form. Our work treats style transfer from informal to formal Indonesian as a low-resource machine translation problem, and we benchmark several strategies for performing it.

In this repository, we provide the Phrase-Based Statistical Machine Translation (PBSMT) system, which achieved the best results in our experiments. Note that our data is extremely low-resource and domain-specific (customer service domain), so the system might not be robust to out-of-domain input. Our future work includes exploring more robust style transfer. Stay tuned!

Paper

You can access our paper below:

Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation (IALP 2020)

Medium Article: Mengubah Bahasa Indonesia Informal Menjadi Baku Menggunakan Kecerdasan Buatan (In Indonesian)

Requirements

We use the Moses RELEASE 4.0 build for Ubuntu 17.04+, which only works on that OS.

We have not tested the code on other operating systems (e.g., OS X and Windows). To run the source code, use Ubuntu 17.04+. On Windows, we advise running the code under WSL 2.

In this experiment, we wrap the Moses code using Python's subprocess module, so a Python installation is required. The system is tested on Python 3.9. We recommend installing Python with Miniconda by following this link: https://docs.conda.io/en/latest/miniconda.html
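To illustrate what wrapping a command-line tool with subprocess can look like, here is a minimal sketch. It is not the repository's actual wrapper: the helper name `run_tool` is hypothetical, and the example calls the current Python interpreter as a stand-in for a Moses binary.

```python
import subprocess
import sys
from typing import List


def run_tool(command: List[str]) -> str:
    """Run an external command and return its standard output as text.

    Hypothetical helper for illustration. Raises
    subprocess.CalledProcessError if the command exits with a non-zero status.
    """
    completed = subprocess.run(
        command,
        check=True,           # fail loudly on a non-zero exit code
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the captured bytes to str
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in command: the current Python interpreter, not a Moses binary.
    print(run_tool([sys.executable, "-c", "print('hello from a subprocess')"]))
```

Passing the command as a list of arguments avoids shell quoting issues, and `check=True` surfaces a failing tool as an exception instead of silently continuing.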

How To Run

First, clone the repository:

git clone https://github.com/haryoa/stif-indonesia.git

Then run the Moses downloader. It is a .sh script, so use a shell that can execute it. From the project root directory, run:

bash scripts/download_moses.sh

The script downloads the Moses toolkit and extracts it automatically.

Run Supervised Experiments

To run the supervised experiment, run:

python -m stif_indonesia --exp-scenario supervised

It will read the experiment config in experiment-config/00001_default_supervised_config.json

Run Semi-Supervised Experiments

To run the semi-supervised experiment, run:

python -m stif_indonesia --exp-scenario semi-supervised

It will read the experiment config in experiment-config/00002_default_semi_supervised_config.json
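Before launching a run, it can help to look at which settings a config file contains. The sketch below loads one of the two default config files named above and lists its top-level keys. The helper names are hypothetical, and the code assumes each file holds a single JSON object; it makes no claim about which keys exist.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Default config paths named in this README, keyed by the --exp-scenario value.
DEFAULT_CONFIGS = {
    "supervised": "experiment-config/00001_default_supervised_config.json",
    "semi-supervised": "experiment-config/00002_default_semi_supervised_config.json",
}


def load_config(path: str) -> Dict[str, Any]:
    """Parse a JSON config file (assumed to contain one JSON object)."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def top_level_keys(config: Dict[str, Any]) -> List[str]:
    """Return the config's top-level keys in sorted order."""
    return sorted(config.keys())


if __name__ == "__main__":
    # Run from the project root so the relative paths resolve.
    for scenario, path in DEFAULT_CONFIGS.items():
        if Path(path).exists():
            print(scenario, top_level_keys(load_config(path)))
        else:
            print(scenario, "config not found at", path)
```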

Output

  1. The training process will output the log of the experiment in log.log
  2. The output of the model will be produced in the output folder

Supervised output

It will output evaluation, lm, and train. evaluation is the prediction result on the test set, lm is the output of the trained language model (LM), and train is the model produced by the Moses toolkit.

Semi-supervised output

It will output agg_data, best_model_dir, and produced_tgt_data. agg_data is the result of the forward-iteration data synthesis. best_model_dir is the best model produced by the training process, and produced_tgt_data is the prediction output of the test set.
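A quick way to confirm that a run finished is to check that the expected entries exist. The sketch below uses the output names listed above for each scenario. It assumes they sit directly under the directory you pass in, which may not match the actual layout inside the output folder; the function name `missing_outputs` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Output names taken from this README, keyed by experiment scenario.
EXPECTED_OUTPUTS: Dict[str, List[str]] = {
    "supervised": ["evaluation", "lm", "train"],
    "semi-supervised": ["agg_data", "best_model_dir", "produced_tgt_data"],
}


def missing_outputs(output_dir: str, scenario: str) -> List[str]:
    """Return the expected entries that do not exist under output_dir.

    Assumption: the entries are direct children of output_dir.
    """
    root = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS[scenario] if not (root / name).exists()]


if __name__ == "__main__":
    # "output" is the folder name mentioned in this README; adjust as needed.
    for scenario in EXPECTED_OUTPUTS:
        print(scenario, "missing:", missing_outputs("output", scenario))
```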

Score

Please check the log.log file, which is written by the training process.
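Since the log can be long, a small filter can help locate the score lines. The sketch below returns every line of a log file that contains a keyword, ignoring case. The log format is not documented here, so the keyword "bleu" in the example is only a guess, and `find_log_lines` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def find_log_lines(log_path: str, keyword: str) -> List[str]:
    """Return the lines of log_path that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with Path(log_path).open(encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if needle in line.lower()]


if __name__ == "__main__":
    # "bleu" is an assumed keyword; change it to match what the log contains.
    if Path("log.log").exists():
        for line in find_log_lines("log.log", "bleu"):
            print(line)
    else:
        print("log.log not found in the current directory")
```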

Additional Information

If you want to replicate the dictionary-based method, you can use any informal-formal or slang dictionary available on the internet.

For example, you can use this dictionary.
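As a rough illustration of a dictionary-based approach, the sketch below replaces each whitespace-separated token that appears in an informal-to-formal dictionary and leaves the rest unchanged. This is a simplified word-by-word lookup and not necessarily the exact procedure used in the paper. The three dictionary entries and the name `formalize` are illustrative assumptions; a real run would load a full slang dictionary.

```python
from typing import Dict

# Tiny illustrative dictionary (informal -> formal); not the one used in the paper.
EXAMPLE_DICTIONARY: Dict[str, str] = {
    "gak": "tidak",
    "udah": "sudah",
    "gue": "saya",
}


def formalize(sentence: str, dictionary: Dict[str, str]) -> str:
    """Replace each token found in the dictionary with its formal form.

    Lookup is case-insensitive; tokens without an entry are kept as written.
    Punctuation attached to a token is not handled in this sketch.
    """
    tokens = sentence.split()
    return " ".join(dictionary.get(token.lower(), token) for token in tokens)


if __name__ == "__main__":
    print(formalize("gue udah makan", EXAMPLE_DICTIONARY))  # saya sudah makan
```

A lookup like this cannot reorder words or resolve ambiguous slang, which is one reason to compare it against translation-based approaches.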

If you want to replicate our GPT-2 experiment, you can use a pre-trained Indonesian GPT-2 such as this one, or train one yourself on the OSCAR corpus. After that, you can fine-tune it with the dataset that we have provided here. Follow the paper for how to transform the data before fine-tuning.

We use Hugging Face's off-the-shelf implementation to train the model.

Team

  1. Haryo Akbarianto Wibowo @ Kata.ai
  2. Tatag Aziz Prawiro @ Universitas Indonesia
  3. Muhammad Ihsan @ Bina Nusantara
  4. Alham Fikri Aji @ Kata.ai
  5. Radityo Eko Prasojo @ Kata.ai
  6. Rahmad Mahendra @ Universitas Indonesia
