This is a project on product title translation from English to Thai. See the documentation for the description of steps, data and algorithms used.
Install the following Python libraries:
- nltk (http://www.nltk.org/install.html)
- googletrans (https://github.com/ssut/py-googletrans)
- kenlm (https://github.com/kpu/kenlm) (install both the C++ library and the Python wrapper)
Additionally, the following are required for pre-processing the Wikipedia corpus:
- regex >= 2016.6.24
- lxml >= 3.3.3
- pythai >= 0.1.3
Take the first two columns (Category and Product name) of the provided sample and test data and convert them into tab-separated files. Put them in a directory named data.
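The conversion above can be sketched in Python as follows. This is a minimal sketch, not the project's own tooling: it assumes the provided data is a comma-separated file with Category and Product name as its first two columns, and the function name, the skip_header flag and the file names in the usage note are hypothetical.

```python
import csv


def first_two_columns_to_tsv(src_path, dst_path, delimiter=",", skip_header=False):
    """Write the first two columns of a delimited file as tab-separated lines.

    Assumption: the source is a delimited text file (CSV by default) whose
    first two columns are the category and the product name.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        if skip_header:
            next(reader, None)  # drop the header row if the file has one
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or incomplete rows
            # Collapse tabs, newlines and repeated spaces inside a field so
            # that every output line has exactly one tab separator.
            category, name = (" ".join(field.split()) for field in row[:2])
            dst.write(f"{category}\t{name}\n")
```

For example, first_two_columns_to_tsv("sample.csv", "data/sample.txt") would produce the data/sample.txt file used in the commands below, assuming the data directory already exists. Whether the header row should be kept depends on what the later scripts expect, which is not stated here.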
Train a 3-gram language model based on Thai Wikipedia:
cd lm
bash prepro.sh
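Build the lexicon from the sample and test data, then translate the test data (the paths below are relative to the project root, so return there first):
python build_lexicon.py data/sample.txt data/test.txt
python translate.py data/test.txt lm/th.txt.arpa test.output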
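Before running the translation, it may be worth checking that the language model file really contains 3-grams. The sketch below reads only the header of an ARPA file, which starts with a \data\ section listing one "ngram N=count" line per order. The function name is hypothetical and the check is not part of the project's scripts.

```python
def read_arpa_ngram_counts(arpa_path):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file."""
    counts = {}
    in_data = False
    with open(arpa_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line == "\\data\\":
                in_data = True
                continue
            if not in_data or not line:
                continue  # ignore anything before the header and blank lines
            if line.startswith("ngram "):
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            else:
                break  # reached the first n-gram section, e.g. \1-grams:
    return counts
```

Calling read_arpa_ngram_counts("lm/th.txt.arpa") and checking that the largest key is 3 confirms that the model is a 3-gram model. If the kenlm Python wrapper is installed, kenlm.Model("lm/th.txt.arpa") should also load the same file, and its score method returns a log10 probability for a space-separated sentence.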