Mongolian BERT models

This repository contains pre-trained Mongolian BERT models trained by tugstugi, enod and sharavsambuu. Special thanks to nabar, who provided 5x TPUs.

This repository is based on the following open source projects: google-research/bert, huggingface/pytorch-pretrained-BERT and yoheikikuta/bert-japanese.

Models

SentencePiece with a vocabulary size of 32000 is used as the text tokenizer. You can use the masked language model notebook (Open in Colab) to test how well the pre-trained models predict masked Mongolian words.
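
If you want to try masked word prediction locally instead of in the notebook, a minimal sketch using the huggingface/transformers fill-mask pipeline is shown below; the Hub model identifier tugstugi/bert-base-mongolian-cased is an assumption, so replace it with wherever you obtained the PyTorch weights.

# Minimal sketch of masked word prediction with huggingface/transformers.
# Assumption: the cased PyTorch model is available under this identifier,
# or pass a local directory containing the converted PyTorch checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tugstugi/bert-base-mongolian-cased")
for prediction in fill_mask("[MASK] хот бол Монгол Улсын нийслэл юм."):
    print(prediction["token_str"], prediction["score"])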

Cased BERT-Base

Download either the TensorFlow checkpoint or the PyTorch model. Eval results:

global_step = 4000000
loss = 1.3476765
masked_lm_accuracy = 0.7069192
masked_lm_loss = 1.2822781
next_sentence_accuracy = 0.99875
next_sentence_loss = 0.0038988923

Uncased BERT-Base

Download either the TensorFlow checkpoint or the PyTorch model. Eval results:

global_step = 4000000
loss = 1.3115116
masked_lm_accuracy = 0.7018335
masked_lm_loss = 1.3155857
next_sentence_accuracy = 0.995
next_sentence_loss = 0.015816934

Loading in TensorFlow 2.x

Only small changes are needed to load the weights as a Keras layer in TensorFlow 2.x (Open in Colab).
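
For orientation, the sketch below only shows how to read the TensorFlow 1.x checkpoint variables from TensorFlow 2.x; the checkpoint path is a hypothetical local path, and mapping the tensors onto Keras layer weights is left to the notebook.

# Minimal sketch: inspect the TF1 BERT checkpoint from TensorFlow 2.x.
# Assumption: the cased checkpoint has been downloaded and unpacked locally;
# the path below is hypothetical.
import tensorflow as tf

reader = tf.train.load_checkpoint("mn_cased_bert_base/model.ckpt-4000000")
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)
# Individual tensors can then be assigned to matching Keras layer weights, e.g.
word_embeddings = reader.get_tensor("bert/embeddings/word_embeddings")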

Finetuning

This repo contains only pre-trained BERT models; for finetuning, see:
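
Independently of those projects, a minimal fine-tuning starting point with huggingface/transformers could look like the sketch below; the Hub identifier, the tokenizer options and the number of labels are assumptions for illustration only.

# Hedged sketch: load the pre-trained Mongolian BERT for sequence classification.
# Assumptions: the PyTorch weights are available under this Hub identifier
# (or replace it with a local directory) and the downstream task has 2 labels.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "tugstugi/bert-base-mongolian-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Энэ кино үнэхээр гоё байлаа.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) before any fine-tuning
# From here, fine-tune with the usual transformers Trainer or a plain PyTorch loop.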

Pre-Training

This repo already provides pre-trained models. If you really want to pre-train from scratch, you will need a TPU. A base model can be trained in 13 days (4M steps) on a TPUv2; a big model will need more than a month. We used max_seq_length=512 throughout, instead of training first with max_seq_length=128 and then with max_seq_length=512, because it gave better masked LM accuracy.

Install

Check out the project and install the dependencies:

git clone --recursive https://github.com/tugstugi/mongolian-bert.git
cd mongolian-bert
pip3 install -r requirements.txt

Data preparation

Download the Mongolian Wikipedia and the 700 million word Mongolian news data set and pre-process them into the directory mn_corpus/:

# Mongolian Wikipedia
python3 datasets/dl_and_preprop_mn_wiki.py
# 700 million words Mongolian news data set
python3 datasets/dl_and_preprop_mn_news.py

After pre-processing, the dataset will contain around 500M words.
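
As a quick sanity check of that figure, you can count the whitespace-separated words in the preprocessed files; this is only a rough sketch and assumes the scripts above wrote plain text files to mn_corpus/*.txt.

# Rough sanity check: count whitespace-separated words in the preprocessed corpus.
# Assumption: the preprocessing scripts above wrote plain text files to mn_corpus/.
import glob

total_words = 0
for path in glob.glob("mn_corpus/*.txt"):
    with open(path, encoding="utf-8") as f:
        total_words += sum(len(line.split()) for line in f)
print(f"~{total_words / 1e6:.0f}M words")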

Train SentencePiece vocabulary

Now train the cased SentencePiece model, i.e. with a vocabulary size of 32000:

cd sentencepiece
cat ../mn_corpus/*.txt > all.txt
python3 train_sentencepiece.py --input all.txt --vocab-size 32000 --prefix mn_cased

If the training was successful, the following files should be created: mn_cased.model and mn_cased.vocab. You can also test whether the SentencePiece model is working as intended:

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor()
>>> s.Load('mn_cased.model')
>>> s.EncodeAsPieces('Мөнгөө тушаачихсаныхаа дараа мэдэгдээрэй')
['▁Мөнгөө', '▁тушаа', 'чихсан', 'ыхаа', '▁дараа', '▁мэдэгд', 'ээрэй']

For an uncased SentencePiece model, convert the content of all.txt to lower case and train with:

python3 train_sentencepiece.py --input all.txt --vocab-size 32000 --prefix mn_uncased
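
The lower casing itself can be done in a streaming way, for example as in the sketch below; writing the result to a separate file and passing it via --input (instead of overwriting all.txt) is an assumption, so adjust the file names to your setup.

# Minimal sketch: write a lower-cased copy of all.txt for the uncased model.
# Assumption: the lower-cased file is then passed to train_sentencepiece.py
# via --input (or moved over all.txt) to match the command above.
with open("all.txt", encoding="utf-8") as src, \
        open("all_uncased.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.lower())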

Create/Upload TFRecord files

Create TFRecord files for cased:

python3 create_pretraining_data_helper.py --max_seq_length=512 --max_predictions_per_seq=77 --cased

Upload to your GCloud bucket:

gsutil cp mn_corpus/maxseq512*.tfrecord gs://YOUR_BUCKET/data-cased/

For the uncased model, adjust the steps above accordingly.

Train a model

To train, e.g., the uncased BERT-Base on a TPUv2, use the following command:

export INPUT_FILES=gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_1.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_10.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_11.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_12.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_13.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_14.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_15.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_16.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_17.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_18.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_19.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_2.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_3.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_4.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_5.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_6.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_7.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_8.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_news_700m_9.tfrecord,gs://YOUR_BUCKET/data-uncased/maxseq512-mn_wiki.tfrecord
python3 bert/run_pretraining.py \
  --input_file=$INPUT_FILES \
  --output_dir=gs://YOUR_BUCKET/uncased_bert_base \
  --use_tpu=True \
  --tpu_name=YOUR_TPU_ADDRESS \
  --num_tpu_cores=8 \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=bert_configs/bert_base_config.json \
  --train_batch_size=256 \
  --max_seq_length=512 \
  --max_predictions_per_seq=77 \
  --num_train_steps=4000000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4

For a large model, use bert_config_file=bert_configs/bert_large_config.json and train_batch_size=32.

Citation

@misc{mongolian-bert,
  author = {Tuguldur, Erdene-Ochir and Gunchinish, Sharavsambuu and Bataa, Enkhbold},
  title = {BERT Pretrained Models on Mongolian Datasets},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tugstugi/mongolian-bert/}}
}
