
IndicBART

Pre-trained, multilingual sequence-to-sequence models for Indian languages

IndicBART is part of the AI4Bharat tools for Indian languages. You can read more about it in the IndicBART paper (arXiv:2109.02903).

Installation

  1. Install the YANMTT toolkit and check out the v1.0 release via "git checkout v1.0". Create a fresh conda or virtual environment to ensure things work smoothly.

  2. Download the pre-trained IndicBART model checkpoint (indicbart_model.ckpt) and the vocabulary (albert-indicunified64k.zip).

  3. Decompress the vocabulary zip: unzip albert-indicunified64k.zip
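Putting steps 1-3 together, a minimal setup might look like the following sketch (the environment name and Python version are illustrative; the checkpoint and vocabulary from step 2 must be downloaded first):

conda create -n yanmtt python=3.8      # fresh environment; name and version are illustrative
conda activate yanmtt
git clone https://github.com/prajdabre/yanmtt
cd yanmtt
git checkout v1.0                      # pin to the v1.0 release
pip install -r requirements.txt        # or install dependencies as described in the YANMTT README
cd ..
unzip albert-indicunified64k.zip       # decompress the downloaded vocabulary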

Finetuning IndicBART for NMT

  • Sample training corpora
  • Sample development set (3-way parallel: en, hi, bn)

Script conversion

  • The Indic side of the data needs to be converted to the Devanagari script. You may use the indic_scriptmap.py script.
    • This script depends on the Indic NLP Library and Indic NLP Resources, which must be installed manually.
    • After installing them, update the paths on lines 13 and 16 of indic_scriptmap.py accordingly.
    • Usage: python indic_scriptmap.py <input_file> <output_file> <source_language> <target_language>
      • Example: python indic_scriptmap.py input.txt output.txt ta hi
      • This will map the script in the input.txt file from Tamil to Hindi.
  • The sample data provided above has already been converted to the Devanagari script, so you can use it as is. To convert your own data in bulk, see the sketch below.
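A minimal batch-conversion sketch (the Tamil-side file names here are hypothetical; as in the example above, "ta hi" maps Tamil script to Devanagari):

for f in train.en-ta.ta dev.ta; do                # hypothetical Tamil-side files
    python indic_scriptmap.py $f $f.deva ta hi    # map Tamil script to Devanagari
done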

Fine-tuning command

python PATH-TO-YANMTT/train_nmt.py --train_slang hi,bn --train_tlang en,en  \
    --dev_slang hi,bn --dev_tlang en,en --train_src train.en-hi.hi,train.en-bn.bn \
    --train_tgt train.en-hi.en,train.en-bn.en --dev_src dev.hi,dev.bn --dev_tgt dev.en,dev.en \
    --model_path model.ft --encoder_layers 6 --decoder_layers 6 --label_smoothing 0.1 \
    --dropout 0.1 --attention_dropout 0.1 --activation_dropout 0.1 --encoder_attention_heads 16 \
    --decoder_attention_heads 16 --encoder_ffn_dim 4096 --decoder_ffn_dim 4096 \
    --d_model 1024 --tokenizer_name_or_path albert-indicunified64k --warmup_steps 16000 \
    --weight_decay 0.00001 --lr 0.001 --max_gradient_clip_value 1.0 --dev_batch_size 128 \
    --port 22222 --shard_files --hard_truncate_length 256 --pretrained_model indicbart_model.ckpt &> log

At the end of training, you should find the model with the highest BLEU score for each language pair. It will be named model.ft.best_dev_bleu.<language>-en, where <language> is hi or bn. The training log records the iteration at which this best-performing checkpoint was saved.
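For example, if the Hindi-English direction scored best on its dev set, the checkpoint to decode with would be named as follows (pattern from above; substitute bn for the Bengali-English pair):

decmod=model.ft.best_dev_bleu.hi-en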

Decoding command

decmod=BEST-CHECKPOINT-NAME
    
python PATH-TO-YANMTT/decode_nmt.py --model_path $decmod --slang hi --tlang en \
    --test_src dev.hi --test_tgt dev.trans --port 23352 --encoder_layers 6 --decoder_layers 6 \
    --encoder_attention_heads 16 --decoder_attention_heads 16 --encoder_ffn_dim 4096 \
    --decoder_ffn_dim 4096 --d_model 1024 --tokenizer_name_or_path albert-indicunified64k \
    --beam_size 4 --length_penalty 0.8
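To score the decoded output, you can compare dev.trans against the dev.en reference with an external scorer such as sacrebleu (not part of YANMTT; an illustrative check that assumes sacrebleu is installed, e.g. via pip install sacrebleu):

sacrebleu dev.en -i dev.trans          # corpus BLEU of dev.trans against the reference dev.en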

Notes:

  1. If you want to use an IndicBART model with language-specific scripts, we provide that variant as well (see the separate Vocabulary and Model downloads).

  2. If you want to perform additional pre-training of IndicBART or train your own model, follow the instructions in: https://github.com/prajdabre/yanmtt/blob/main/examples/train_mbart_model.sh

  3. For advanced training options, look at the examples in: https://github.com/prajdabre/yanmtt/blob/main/examples

Finetuning IndicBART for Summarization

Sample Corpus

Fine-tuning command

python PATH-TO-YANMTT/train_nmt.py --train_slang hi --train_tlang hi --dev_slang hi --dev_tlang hi \
    --train_src train.text.hi --train_tgt train.summary.hi --dev_src dev.text.hi \
    --dev_tgt dev.summary.hi --model_path model.ft --encoder_layers 6 --decoder_layers 6 \
    --label_smoothing 0.1 --dropout 0.1 --attention_dropout 0.1 --activation_dropout 0.1 \
    --encoder_attention_heads 16 --decoder_attention_heads 16 --encoder_ffn_dim 4096 \
    --decoder_ffn_dim 4096 --d_model 1024 --tokenizer_name_or_path albert-indicunified64k \
    --warmup_steps 16000 --weight_decay 0.00001 --lr 0.0003 --max_gradient_clip_value 1.0 \
    --port 22222 --shard_files --hard_truncate_length 512 \
    --pretrained_model indicbart_model.ckpt --max_src_length 384 --max_tgt_length 40 \
    --is_summarization --dev_batch_size 64 --max_decode_length_multiplier -60 \
    --min_decode_length_multiplier -10 --no_repeat_ngram_size 4 --length_penalty 1.0 \
    --max_eval_batches 20

Decoding command

decmod=BEST-CHECKPOINT-NAME
    
python PATH-TO-YANMTT/decode_nmt.py --model_path $decmod --slang hi --tlang hi \
    --test_src dev.text.hi --test_tgt dev.trans --port 23352 --encoder_layers 6 \
    --decoder_layers 6 --encoder_attention_heads 16 --decoder_attention_heads 16 \
    --encoder_ffn_dim 4096 --decoder_ffn_dim 4096 --d_model 1024 \
    --tokenizer_name_or_path albert-indicunified64k --beam_size 4 \
    --max_src_length 384 --max_decode_length_multiplier -60 --min_decode_length_multiplier -10 \
    --no_repeat_ngram_size 4 --length_penalty 1.0 --hard_truncate_length 512 
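As a quick sanity check on the generated summaries, you can inspect a few source/summary pairs side by side; dev.text.hi and dev.trans are the input and output files of the command above:

paste dev.text.hi dev.trans | head -n 5    # first 5 tab-separated (article, summary) pairs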

Contributors

Citing

If you use IndicBART, please cite the following paper:

@inproceedings{dabre-etal-2022-indicbart,
      title = "IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages",
      author = "Dabre, Raj and Shrotriya, Himani and Kunchukuttan, Anoop and Puduppully, Ratish and Khapra, Mitesh M. and Kumar, Pratyush",
      booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
      year = "2022",
      publisher = "Association for Computational Linguistics",
}

License

IndicBART is licensed under the MIT License.
