ai4bharat / indictrans

IndicTrans v1 - Machine Translation for 11 Indic languages. For the latest v2, check: https://github.com/AI4Bharat/IndicTrans2

Home Page: https://ai4bharat.iitm.ac.in/indic-trans

License: MIT License

Python 11.81% Shell 5.51% Jupyter Notebook 81.53% HTML 1.16%
translation pytorch multilingual-translations indic-nlp indian-languages

indictrans's Introduction

IndicTrans

Website | Paper | Video | Demo Resources

🚩 NOTE 🚩 IndicTrans2 is now available. It supports 22 Indian languages and offers better translation quality than IndicTrans1. We recommend using IndicTrans2.

IndicTrans is a Transformer-4x (~434M parameters) multilingual NMT model trained on the Samanantar dataset, which was the largest publicly available parallel corpora collection for Indic languages at the time of writing (14 April 2021). It is a single-script model, i.e., we convert all the Indic data to the Devanagari script, which allows better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary between Indic languages, and allows using a smaller subword vocabulary. We currently release two models - Indic to English and English to Indic - and support the following 11 Indic languages:

Assamese (as) Hindi (hi) Marathi (mr) Tamil (ta)
Bengali (bn) Kannada (kn) Odia (or) Telugu (te)
Gujarati (gu) Malayalam (ml) Punjabi (pa)

Benchmarks

We evaluate the IndicTrans model on the WAT2021, WAT2020, WMT (2014, 2019, 2020), UFAL, PMI (a subset of the PMIndia dataset created by us for Assamese) and FLORES benchmarks. It outperforms all publicly available open-source models. It also outperforms commercial systems like Google and Bing Translate on most datasets and performs competitively on FLORES. Here are the results that we obtain:

WAT2021
        bn    gu    hi    kn    ml    mr    or    pa    ta    te
IN-EN   29.6  40.3  43.9  36.4  34.6  33.5  34.4  43.2  33.2  36.2
EN-IN   15.3  25.6  38.6  19.1  14.7  20.1  18.9  33.1  13.5  14.1

WAT2020
        bn    gu    hi    ml    mr    ta    te
IN-EN   20.0  24.1  23.6  20.4  20.4  18.3  18.5
EN-IN   11.4  15.3  20.0  7.2   12.7  6.2   7.6

WMT
        hi    gu    ta
IN-EN   29.7  25.1  24.1
EN-IN   25.5  17.2  9.9

UFAL
        ta
IN-EN   30.2
EN-IN   10.9

PMI
        as
IN-EN   29.9
EN-IN   11.6

FLORES-101
        as    bn    gu    hi    kn    ml    mr    or    pa    ta    te
IN-EN   23.3  32.2  34.3  37.9  28.8  31.7  30.8  30.1  35.8  28.6  33.5
EN-IN   6.9   20.3  22.6  34.5  18.9  16.3  16.1  13.9  26.9  16.3  22.0

Updates

21 June 2022

- Add more documentation on hosted API usage

18 December 2021

- Tutorials updated with latest model links

26 November 2021

- v0.3 models are now available for download

27 June 2021

- Updated links for indic to indic model
- Add more comments to training scripts
- Add link to [Samanantar Video](https://youtu.be/QwYPOd1eBtQ?t=383)
- Add folder structure in readme
- Add python wrapper for model inference

09 June 2021

- Updated links for models
- Added Indic to Indic model

09 May 2021

- Added fix for finetuning on datasets where some language pairs are not present. Previously, the script assumed the finetuning dataset would have data for all 11 Indic language pairs
- Added colab notebook for finetuning instructions

Table of contents

Resources

Try out model online (Huggingface spaces)

Download model

Indic to English: v0.3

English to Indic: v0.3

Indic to Indic: v0.3

Mirror links for the IndicTrans models

STS Benchmark

Download the human annotations for STS benchmark here

Using hosted APIs

Try out our models at IndicTrans Demos


Refer to this colab notebook for how to use Python to hit the API endpoints --> Open In Colab

Accessing on ULCA

You can try out our models at ULCA and filter for IndicTrans models.

Running Inference

Command line interface

The model is trained on single sentences; hence, users need to split paragraphs into sentences before running the translation when using our command line interface (the Python interface has a translate_paragraph method to handle multi-sentence translations).

Note: IndicTrans is trained with a max sequence length of 200 tokens (subwords). If your sentence is too long (> 200 tokens), the sentence will be truncated to 200 tokens before translation.

Here is an example snippet to split paragraphs into sentences for English and Indic languages supported by our model:

# install these libraries
# pip install mosestokenizer
# pip install indic-nlp-library

from mosestokenizer import MosesSentenceSplitter
from indicnlp.tokenize import sentence_tokenize

INDIC = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]

def split_sentences(paragraph, language):
    if language == "en":
        with MosesSentenceSplitter(language) as splitter:
            return splitter([paragraph])
    elif language in INDIC:
        return sentence_tokenize.sentence_split(paragraph, lang=language)

split_sentences("""COVID-19 is caused by infection with the severe acute respiratory
syndrome coronavirus 2 (SARS-CoV-2) virus strain. The disease is mainly transmitted via the respiratory
route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing. """, language='en')

>> ['COVID-19 is caused by infection with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus strain.',
 'The disease is mainly transmitted via the respiratory route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing.']

split_sentences("""இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது. இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.
அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.""",
 language='ta')

>> ['இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.',
 'இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது.',
 'இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.',
 'அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.']

Follow the colab notebook to set up the environment, download the trained IndicTrans models, and translate your own text.

Colab notebook for command line inference --> Open In Colab

Python Inference

Colab notebook for python inference --> Open In Colab

The Python interface is useful in case you want to reuse the model for multiple translations and do not want to reinitialize the model each time.
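
Below is a minimal sketch of how the Python interface can be used. The expdir path is an assumption (point it at the directory containing the downloaded model and vocab folders), the argument order of translate_paragraph is assumed to mirror batch_translate, and the script must be run from inside the cloned indicTrans directory so the inference package resolves:

# minimal sketch of the Python inference interface
# assumption: the en-indic model archive is extracted to ../en-indic and this script
# is run from inside the cloned indicTrans directory
from inference.engine import Model

en2indic_model = Model(expdir='../en-indic')

# translate a batch of individual sentences
hi_sentences = en2indic_model.batch_translate(
    ["This bicycle is too small for you!",
     "I will directly meet you at the airport."],
    'en', 'hi',
)

# translate a multi-sentence paragraph in one call
# (translate_paragraph handles the sentence splitting internally)
hi_paragraph = en2indic_model.translate_paragraph(
    "COVID-19 is caused by the SARS-CoV-2 virus. It is mainly transmitted via the respiratory route.",
    'en', 'hi',
)

print(hi_sentences)
print(hi_paragraph)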

Training model

Setting up your environment

# clone the repository (if you have not already done so) and the helper repos inside it
git clone https://github.com/AI4Bharat/indicTrans.git
cd indicTrans
git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
git clone https://github.com/rsennrich/subword-nmt.git
# install required libraries
pip install sacremoses pandas mock sacrebleu tensorboardX pyarrow indic-nlp-library

# Install fairseq from source
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable ./

Details of models and hyperparameters

  • Architecture: IndicTrans uses 6 encoder and 6 decoder layers, input embeddings of size 1536 with 16 attention heads and a feedforward dimension of 4096, for a total of ~434M parameters
  • Loss: Cross entropy loss
  • Optimizer: Adam
  • Label Smoothing: 0.1
  • Gradient clipping: 1.0
  • Learning rate: 5e-4
  • Warmup_steps: 4000

Please refer to sections 4 and 5 of our paper for more details on the training/experimental setup.
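
For illustration, here is a sketch of how a transformer_4x architecture with these dimensions can be registered with fairseq through --user-dir. The repository's actual definition lives in model_configs/custom_transformer.py and may differ in detail; treat this as an assumption-laden example rather than the exact code:

# illustrative sketch only -- the real architecture is defined in model_configs/custom_transformer.py
from fairseq.models import register_model_architecture
from fairseq.models.transformer import base_architecture


@register_model_architecture("transformer", "transformer_4x")
def transformer_4x(args):
    # 6 encoder/decoder layers, 1536-dim embeddings, 16 heads, 4096-dim FFN (~434M parameters)
    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 1536)
    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 4096)
    args.encoder_attention_heads = getattr(args, "encoder_attention_heads", 16)
    args.encoder_layers = getattr(args, "encoder_layers", 6)
    args.decoder_embed_dim = getattr(args, "decoder_embed_dim", 1536)
    args.decoder_ffn_embed_dim = getattr(args, "decoder_ffn_embed_dim", 4096)
    args.decoder_attention_heads = getattr(args, "decoder_attention_heads", 16)
    args.decoder_layers = getattr(args, "decoder_layers", 6)
    base_architecture(args)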

Training procedure and code

The high-level steps we follow for training are as follows:

Organize the training data as en-X folders, where each folder has two text files containing parallel data for the en-X language pair.

# final_data
# ├── en-as
# │   ├── train.as
# │   └── train.en
# ├── en-bn
# │   ├── train.bn
# │   └── train.en
# ├── en-gu
# │   ├── train.en
# │   └── train.gu
# ├── en-hi
# │   ├── train.en
# │   └── train.hi
# ├── en-kn
# │   ├── train.en
# │   └── train.kn
# ├── ....

Organize the development set and test set of multiple benchmarks as follows:

<all devtest dir>
├──<benchmark 1>
|    ├── en-as
|    ├── en-bn
|    ├── en-gu
|    └── en-hi
|        ├── test.en
|        ├── test.hi
|        ├── dev.en
|        └── dev.hi
├──<benchmark 2>
|
...

Removing dev and test set overlaps from training data: refer to the "Training Data" subsection in section 4 of our paper for more details on how we use a strict overlap removal method.

python3 remove_train_devtest_overlaps.py <train_data_dir> <all devtest dir>
^ if you are only training for en-x

python3 remove_train_devtest_overlaps.py <train_data_dir> <all devtest dir> true
^ if you are training many2many model

Prepare the experiment folder and create the binarized data required for fairseq

<exp dir>             # named like indic-en-exp for indic-en training or en-indic-exp for en-indic training
├──<devtest>
    └── all
        ├── en-as
            ├── dev.en      # merge all en files for en-as dev sets
            ├── dev.as      # merge all as files for en-as dev sets
            ├── test.en     # merge all en files for en-as test sets
            └── test.as     # merge all as files for en-as test sets
        ├── en-bn
        ├── en-gu
        ├── ...
        └── en-hi
   ├── en-as
   ├── en-bn
   ├── ...
   └── en-te
        ├── train.en      # merged en train set for en-te with all devtest overlaps removed
        └── train.te      # merged te train set for en-te with all devtest overlaps removed

# Using exp dir, prepare the training data as required for Fairseq using prepare_data_joint_training.sh

# prepare_data_joint_training.sh takes exp dir, src_lang, tgt_lang as input
# This does preprocessing, building vocab, binarization for joint training

# Creating the vocabulary will take a while if the dataset is huge. To make it faster, run it on a multicore system
bash prepare_data_joint_training.sh '../indic-en-exp' 'indic' 'en'

Start training with the fairseq-train command. Please refer to the fairseq documentation to learn more about each of these options.

# some notable args:
# --max-update          -> maximum update steps the model will be trained for
# --arch=transformer_4x -> we use a custom transformer model and name it transformer_4x (4 times the parameter size of transformer base)
# --user-dir            -> we define the custom transformer arch in the model_configs folder and pass it to --user-dir so fairseq registers this architecture
# --lr                  -> learning rate. From our limited experiments, we find that lower learning rates like 3e-5 work best for finetuning.
# --max-tokens          -> max tokens per batch. Lower this value if you get OOM errors.
# --update-freq         -> gradient accumulation steps

fairseq-train ../indic-en-exp/final_bin \
--max-source-positions=210 \
--max-target-positions=210 \
--max-update=<max_updates> \
--save-interval=1 \
--arch=transformer_4x \
--criterion=label_smoothed_cross_entropy \
--source-lang=SRC \
--lr-scheduler=inverse_sqrt \
--target-lang=TGT \
--label-smoothing=0.1 \
--optimizer adam \
--adam-betas "(0.9, 0.98)" \
--clip-norm 1.0 \
--warmup-init-lr 1e-07 \
--lr 0.0005 \
--warmup-updates 4000 \
--dropout 0.2 \
--save-dir ../indic-en-exp/model \
--keep-last-epochs 5 \
--patience 5 \
--skip-invalid-size-inputs-valid-test \
--fp16 \
--user-dir model_configs \
--wandb-project <wandb_project_name> \
--update-freq=<grad_accumulation_steps> \
--distributed-world-size <num_gpus> \
--max-tokens <max_tokens_in_a_batch>

The above steps are further documented in our colab notebook

Open In Colab

Please refer to this issue to see the discussion of our training hyperparameters.

WandB plots

IndicTrans en-indic model

IndicTrans indic-en model

Evaluating trained model

The trained model will get saved in the experiment directory. It will have the following files:

 en-indic/                              # en to indic experiment directory
 ├── final_bin                          # contains fairseq dictionaries
 │   ├── dict.SRC.txt
 │   └── dict.TGT.txt
 ├── model                              # contains model checkpoint(s)
 │   └── checkpoint_best.pt
 └── vocab                              # contains BPE codes for src and tgt (since we train separate vocabularies) generated with subword_nmt
     ├── bpe_codes.32k.SRC
     ├── bpe_codes.32k.TGT
     ├── vocab.SRC
     └── vocab.TGT

To test the models after training, you can use joint_translate.sh to get output predictions and compute_bleu.sh to compute BLEU scores.

# joint_translate takes src_file, output_fname, src_lang, tgt_lang, model_folder as inputs
# src_file -> input text file to be translated
# output_fname -> name of the output file (will get created) containing the model predictions
# src_lang -> source lang code of the input text ( in this case we are using en-indic model and hence src_lang would be 'en')
# tgt_lang -> target lang code of the input text ( tgt lang for en-indic model would be any of the 11 indic langs we trained on:
#              as, bn, hi, gu, kn, ml, mr, or, pa, ta, te)
# supported languages are:
#              as - assamese, bn - bengali, gu - gujarati, hi - hindi, kn - kannada,
#              ml - malayalam, mr - marathi, or - oriya, pa - punjabi, ta - tamil, te - telugu

# model_folder -> the directory containing the model and the vocab files ( the model is stored in exp_dir/model)



# here we are translating the english sentences to hindi and model_folder contains the model checkpoint
bash joint_translate.sh <path to test.en> en_hi_outputs.txt 'en' 'hi' model_folder

# to compute BLEU scores for the predictions with a reference file, use the following command
# arguments:
# pred_fname: file that contains model predictions
# ref_fname: file that contains references
# src_lang and tgt_lang : the source and target language

bash compute_bleu.sh en_hi_outputs.txt <path to test.hi reference file> 'en' 'hi'

Detailed benchmarking results

Refer to Benchmarks for the results of the IndicTrans model on various benchmarks. Please refer to tables 6 and 7 of our paper for comparison with other open-source and commercial models, and to section 6 for a detailed discussion of the results.

Finetuning model on your data

The high-level steps for finetuning on your own dataset are:

Organize the training data as en-X folders, where each folder has two text files containing parallel data for the en-X language pair.

# final_data
# ├── en-as
# │   ├── train.as
# │   └── train.en
# ├── en-bn
# │   ├── train.bn
# │   └── train.en
# ├── en-gu
# │   ├── train.en
# │   └── train.gu
# ├── en-hi
# │   ├── train.en
# │   └── train.hi
# ├── en-kn
# │   ├── train.en
# │   └── train.kn
# ├── ....

Organize the development set and test set of multiple benchmarks as follows:

<all devtest dir>
├──<benchmark 1>
|    ├── en-as
|    ├── en-bn
|    ├── en-gu
|    └── en-hi
|        ├── test.en
|        ├── test.hi
|        ├── dev.en
|        └── dev.hi
├──<benchmark 2>
|
...

Removing dev and test set overlaps from training data: refer to the "Training Data" subsection in section 4 of our paper for more details on how we use a strict overlap removal method.

python3 remove_train_devtest_overlaps.py <train_data_dir> <all devtest dir>
^ if you are only training for en-x

python3 remove_train_devtest_overlaps.py <train_data_dir> <all devtest dir> true
^ if you are training many2many model

After removing the dev and test set overlaps, you can move the train files and benchmark files (refer to colab notebook below for more details) to the experiment directory. This will have the trained checkpoint and the following structure:

# prepare the experiment folder

 <exp dir>                              # experiment directory
 ├── final_bin                          # contains fairseq dictionaries which we will use to binarize the new finetuning data
 │   ├── dict.SRC.txt
 │   └── dict.TGT.txt
 ├── model                              # contains model checkpoint(s)
 │   └── checkpoint_best.pt
 └── vocab                              # contains BPE codes for src and tgt (since we train separate vocabularies) generated with subword_nmt
     ├── bpe_codes.32k.SRC
     ├── bpe_codes.32k.TGT
     ├── vocab.SRC
     └── vocab.TGT

# We will use fairseq-train to finetune the model:


# some notable args:
# --max-update=1000     -> for this example, to demonstrate how to finetune, we are only training for 1000 steps. You should increase this when finetuning
# --arch=transformer_4x -> we use a custom transformer model and name it transformer_4x (4 times the parameter size of transformer base)
# --user-dir            -> we define the custom transformer arch in the model_configs folder and pass it to --user-dir so fairseq registers this architecture
# --lr                  -> learning rate. From our limited experiments, we find that lower learning rates like 3e-5 work best for finetuning.
# --restore-file        -> reload the pretrained checkpoint and start training from here (change this path for indic-en. Currently it is set to en-indic)
# --reset-*             -> reset and do not reuse the lr scheduler, dataloader, optimizer etc. of the older checkpoint
# --max-tokens          -> max tokens per batch


fairseq-train <exp_dir>/final_bin \
--max-source-positions=210 \
--max-target-positions=210 \
--max-update=1000 \
--save-interval=1 \
--arch=transformer_4x \
--criterion=label_smoothed_cross_entropy \
--source-lang=SRC \
--lr-scheduler=inverse_sqrt \
--target-lang=TGT \
--label-smoothing=0.1 \
--optimizer adam \
--adam-betas "(0.9, 0.98)" \
--clip-norm 1.0 \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--dropout 0.2 \
--tensorboard-logdir <exp_dir>/tensorboard-wandb \
--save-dir <exp_dir>/model \
--keep-last-epochs 5 \
--patience 5 \
--skip-invalid-size-inputs-valid-test \
--fp16 \
--user-dir model_configs \
--update-freq=2 \
--distributed-world-size 1 \
--max-tokens 256 \
--lr 3e-5 \
--restore-file <checkpoint exp_dir>/model/checkpoint_best.pt \
--reset-lr-scheduler \
--reset-meters \
--reset-dataloader \
--reset-optimizer

The above steps (setting up the environment, downloading the trained IndicTrans models and preparing your custom dataset for finetuning) are further documented in our colab notebook Open In Colab

Please refer to this issue for some tips on finetuning.

Note: Since this is a big model (~400M params), you might not be able to train with reasonable batch sizes on the free Google Colab tier. We are planning to release smaller models (after pruning / distillation) soon.

Folder Structure


IndicTrans
│   .gitignore
│   apply_bpe_traindevtest_notag.sh         # apply bpe for joint vocab (Train, dev and test)
│   apply_single_bpe_traindevtest_notag.sh  # apply bpe for separate vocab (Train, dev and test)
│   binarize_training_exp.sh                # binarize the training data after preprocessing for fairseq-training
│   compute_bleu.sh                         # Compute BLEU scores with postprocessing after translating with `joint_translate.sh`
│   indictrans_fairseq_inference.ipynb      # colab example to show how to use model for inference
│   indicTrans_Finetuning.ipynb             # colab example to show how to use model for finetuning on custom domain data
│   joint_translate.sh                      # used for inference (see colab inference notebook for more details on usage)
│   learn_bpe.sh                            # learning joint bpe on preprocessed text
│   learn_single_bpe.sh                     # learning separate bpe on preprocessed text
│   LICENSE
│   prepare_data.sh                         # prepare data given an experiment dir (this does preprocessing,
│                                           # building vocab, binarization ) for bilingual training
│   prepare_data_joint_training.sh          # prepare data given an experiment dir (this does preprocessing,
│                                           # building vocab, binarization ) for joint training
│   README.md
│
├───legacy                                  # old unused scripts
├───model_configs                           # custom model configurations are stored here
│       custom_transformer.py               # contains custom 4x transformer models
│       __init__.py
├───inference
│       custom_interactive.py               # for python wrapper around fairseq-interactive
│       engine.py                           # python interface for model inference
└───scripts                                 # stores python scripts that are used by other bash scripts
    │   add_joint_tags_translate.py         # add lang tags to the processed training data for bilingual training
    │   add_tags_translate.py               # add lang tags to the processed training data for joint training
    │   clean_vocab.py                      # clean vocabulary after building with subword_nmt
    │   concat_joint_data.py                # concatenates lang pair data and creates text files to keep track
    │                                       # of number of lines in each lang pair.
    │   extract_non_english_pairs.py        # Mining Indic to Indic pairs from the English-centric corpus
    │   postprocess_translate.py            # Postprocesses translations
    │   preprocess_translate.py             # Preprocesses text for translation and script conversion (from Indic scripts to Devanagari)
    │   remove_large_sentences.py           # to remove large sentences from training data
    └───remove_train_devtest_overlaps.py    # Finds and removes overlapping data of train with dev and test sets

Citing our work

If you are using any of the resources, please cite the following article:

@article{10.1162/tacl_a_00452,
    author = {Ramesh, Gowtham and Doddapaneni, Sumanth and Bheemaraj, Aravinth and Jobanputra, Mayank and AK, Raghavan and Sharma, Ajitesh and Sahoo, Sujit and Diddee, Harshita and J, Mahalakshmi and Kakwani, Divyanshu and Kumar, Navneet and Pradeep, Aswin and Nagaraj, Srihari and Deepak, Kumar and Raghavan, Vivek and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh Shantadevi},
    title = "{Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {10},
    pages = {145-162},
    year = {2022},
    month = {02},
    abstract = "{We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the Web, resulting in a 4× increase. We mine the parallel sentences from the Web by combining many corpora, tools, and methods: (a) Web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validate the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence
                    pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at Samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.}",
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00452},
    url = {https://doi.org/10.1162/tacl\_a\_00452},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00452/1987010/tacl\_a\_00452.pdf},
}

We would like to hear from you if:

  • You are using our resources. Please let us know how you are putting these resources to use.
  • You have any feedback on these resources.

License

The IndicTrans code (and models) are released under the MIT License.

Contributors

Contact

Acknowledgements

We would like to thank EkStep Foundation for their generous grant which helped in setting up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank The Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.

indictrans's People

Contributors

ankunchu, anoopkunchukuttan, aswinpradeep, gowtham1997, masonreznov, sumanthd17


indictrans's Issues

Chunks lost in translation

Hi,
Thank you for your work on indicTrans. I have been using it to translate some short paragraphs (3-4 sentences) in various supported Indic languages. I noticed that a certain amount of content gets lost in translation. For example, I am trying to translate this English sentence to Tamil:
"In order to make the French capital safer, quieter and less dirty, a speed limit of 30 kmph for cars came into force in Paris on Monday"
This is translated as:
பிரான்ஸ் தலைநகர் பாரிஸில் கார்களுக்கு மணிக்கு 30 கிலோமீட்டர் வேகத்தில் செல்லலாம் என்ற கட்டுப்பாடு விதிக்கப்பட்டுள்ளது

The chunk "In order to make the French capital safer, quieter and less dirty" is lost in the translation.

I assumed that with the Transformer architecture, long sentences too could be translated more accurately.

I would like to know what could be done to fix this issue.

Indic-Eng model error while parsing long sentence

"(indic2en_model.batch_translate(ta_sent, 'ta', 'en')"

While passing Tamil sentence of length more than 723 we get a error

'''num_words = len(sent.split())
61 if num_words > MAX_SEQ_LEN:
---> 62 print_str = " ".join(num_words[:5]) + " .... " + " ".join(num_words[-5:])
63 sent = " ".join(num_words[:MAX_SEQ_LEN])
64 print(

TypeError: 'int' object is not subscriptable'''

Instead of indexing the list of words, the code indexes num_words, which is the integer returned by len(sent.split()) and is therefore not subscriptable.
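
A possible fix, assuming the surrounding variables from the snippet above (sent, MAX_SEQ_LEN), is to keep the split word list and index that instead of the integer count:

# possible fix: index the word list, not the integer word count
words = sent.split()
num_words = len(words)
if num_words > MAX_SEQ_LEN:
    print_str = " ".join(words[:5]) + " .... " + " ".join(words[-5:])
    sent = " ".join(words[:MAX_SEQ_LEN])
    print("Sentence truncated to {} tokens: {}".format(MAX_SEQ_LEN, print_str))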

ModuleNotFoundError: No module named 'inference'

Throwing error

from indicTrans.inference.engine import Model

showing

ModuleNotFoundError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_1140/1581464475.py in
----> 1 from indicTrans.inference.engine import Model

D:\datascienceprojects\project-new\indicTrans\inference\engine.py in
13 from indicnlp.tokenize import sentence_tokenize
14
---> 15 from inference.custom_interactive import Translator
16
17

ModuleNotFoundError: No module named 'inference'
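
A common workaround, inferred from the traceback above (engine.py imports inference.custom_interactive relative to the repository root), is to put the indicTrans directory itself on sys.path before importing the engine; this is a sketch, not an official fix:

# workaround sketch: make the indicTrans repo root importable so that the
# top-level 'inference' package used inside engine.py resolves
import sys
sys.path.append(r"D:\datascienceprojects\project-new\indicTrans")  # adjust to your clone path

from inference.engine import Model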

ULCA API Help

Hi,
As I am looking to use the ULCA API for Indian language translation, I explored the Bhashini platform. Bhashini
I checked the ULCA API with this link for different models, but I was not able to find proper documentation on the API call details or any sample examples. Although the work done on this platform is extremely useful, it would be more fruitful if you could help us understand those APIs with some documentation or simple examples, especially since I am looking into the Translations API.

Thanks

Unable to Download benchmark.zip while training the data

wget https://storage.googleapis.com/samanantar-public/benchmarks.zip
--2023-09-04 18:52:18-- https://storage.googleapis.com/samanantar-public/benchmarks.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.66.16, 142.250.183.208, 142.250.183.176, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.66.16|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-09-04 18:52:19 ERROR 404: Not Found.

If unicode and text together are sent for translation, the model outputs junk which contains variants of 'Narendra Modi'

The text list is "chk_list = ['hello', '\ue806x it tomorrow it doesn’t matter how well you have scheduled']" and the returned translated list is ['ನಮಸ್ಕಾರ.', 'narendramodi narendramodi syndramodi narendramodi syndramodi narendramodi syndramodi narendramodi syndramodi narendramodi narendramodi syndramodi syndramodi syndramodi syndramodi syndramodi syndramodi narendramodi narendramodi narendramodi narendramodi narendramodi narendramodi narendramodi'] when translated to kannada language. I am curious why translated list is junk and contains variants of Narendra Modi

Translation Scores

Hi,
I have been using the model to generate a dataset. Is there a way to find out the translation score that fairseq provides from the indictrans scripts themselves, since the model is built on top of the fairseq module?

Bad Alignment

Src: This mir@@ rors the development from seed to leaf , and from leaf to bu@@ d , and from bu@@ d to flower .
Tgt: यह बीज से प@@ त्ती तक , प@@ त्ती से कली तक , और कली से फूल तक विकास का दर्@@ पण है ।
alignment: (2-0) (8-1) (0-2) (10-3) (0-4) (0-5) (14-6) (14-7) (0-8) (0-9) (16-10) (0-11) (0-12) (21-13) (21-14) (0-15) (24-16) (0-17) (6-18) (3-19) (3-20) (0-21) (0-22) (0-23)

Hi,
The above example shows the alignment of each target word to the source word using the indicTrans model and, as can be seen, the alignment is pretty bad. My question is: how is the alignment this bad for such a good-quality translation? Can you please help me understand this? Or correct me if I am making a mistake in getting this alignment info.

I am using fairseq-interactive "--print-alignment" tag to get this alignment info.

TypeError: Descriptors cannot not be created directly while importing indicTrans

While referring to the jupyter notebook here

from indicTrans.inference.engine import Model

will get the following issue

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

May I know a possible solution for this issue?
Thanks

Inappropriate hindi transliteration

Code:

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
from indicnlp import loader
from indicnlp import common
common.set_resources_path(INDIC_RESOURCES_PATH)
loader.load()
ItransTransliterator.to_itrans('मैं आज आपकी किस प्रकार सहायता कर सकता हूँ?', 'hi')

Output:

mai.m aaja aapakii kisa prakaara sahaayataa kara sakataa huuँ?

Output using google translator:

jaanakar achchha laga. main aaj aapakee kis prakaar sahaayata kar sakata hoon?

There is unnecessary use of '.' and 'uँ' in the romanization. What would be the best solution that gives an appropriate and presentable transliterated output?

Update download links in the notebooks.

Hi,

I was trying out the models in colab notebooks and noticed the links are not up to date. After some digging, I was able to make the wget command work with the following links.

# download the indictrans model


# downloading the indic-en model
!wget https://ai4b-my.sharepoint.com/:u:/g/personal/sumanthdoddapaneni_ai4bharat_org/ETnq-z4aHXFAjDF1Te3AZ20BaZ59PwlKlzSemEHhrmYJ3w?download=1 -c -O 'indic-en.zip'
!unzip indic-en.zip

# downloading the en-indic model
!wget https://ai4b-my.sharepoint.com/:u:/g/personal/sumanthdoddapaneni_ai4bharat_org/EUOJ3irrwzFGnEnlPWHgaYkBugAQz25bPFgRvCPW8k7qtg?download=1 -c -O 'en-indic.zip'
!unzip en-indic.zip

# downloading the indic-indic model
!wget https://ai4b-my.sharepoint.com/:u:/g/personal/sumanthdoddapaneni_ai4bharat_org/Eajn_jJIp5NEqeyqZ0GW4FgBdiANlZNQiy7dlwkaNr8DHw?download=1 -c -O 'm2m.zip'
!unzip m2m.zip

Will be great if you could update them.

Unable to download model files

I am getting the error below when trying to download the models:
(screenshot of the download error)

Could you please upload the models to alternative services and provide the links?

Inference Time

I tried to calculate the inference time for a batch size of 8 (each sentence about 100 words); it turned out to be 6-9 s, which is quite slow. Can you please help me improve it?

docker file to support self hosting requirement

I was trying to install indicTrans on my local machine (macOS 11.6 Big Sur) and then wanted to try it on a Linux server.
There were some challenges related to Python dependencies and CUDA support. There were also torch version dependencies giving errors (maybe it is because of my Python version as well; I am using 3.9).

I could not find a Dockerfile that would support a self-hosted installation. It would be great if you could add a Dockerfile for self-hosting.

(screenshots of the installation errors)

Not able to remove logs of fairseq as they occupying so much space

Discussed in #62

Originally posted by Akhil-VSSG April 8, 2024
Hi,
I am trying to resolve an issue. I am using indictrans, and every time I use it, it logs the fairseq output like this:
(screenshot of the fairseq log output)

It is taking a lot of space, so I wanted to get rid of these logs; I went into fairseq_task.py, present in the tasks folder of the fairseq package.
(screenshot)

Even though I commented out the logger calls in that specific file, I can still see those messages getting logged. Can you please help me remove those logs from my log file so that they won't take up all the space?
Thank you.
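
One way to silence these messages without editing fairseq's source files is to raise the log level of the fairseq loggers before running inference. This is a sketch using Python's standard logging module (the logger names are assumptions based on the messages above, and it only helps when fairseq runs in the same Python process as your code, e.g. via the Python interface):

# sketch: raise the log level of fairseq's loggers instead of editing fairseq_task.py;
# child loggers such as fairseq.tasks.fairseq_task inherit this level
import logging

for name in ("fairseq", "fairseq_cli"):
    logging.getLogger(name).setLevel(logging.ERROR)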

ONNX Conversion Of IndicTrans Model

Greetings,
We have been trying to convert the IndicTrans models to ONNX due to their large size and other factors, but it seems to be too complex as it is an ensemble model and not a plain PyTorch model.
Please let us know if there is an easier way to convert this model to ONNX.
Thanks!

Unable to access models

Hello, I am trying to use the pre-trained models and I am unable to download them; the error is attached below.
(screenshot)

At the same time, in the demo colab notebook, while trying to run the imports cell, I encounter the following error. I restarted the runtime but am not sure why this is occurring.
(screenshot)

I could run the same notebook previously without any issues. I tried to use an older commit of fairseq from here; in that case I encounter the same error as above. Not sure if this is a problem with fairseq or something else.

When I tried to pip install fairseq instead of using the source repository I could move forward without the import issue but still the models are not available.
Let me know how I can fix this and if I am missing anything.
Thank you.

ImportError: cannot import name 'convert_namespace_to_omegaconf' from 'fairseq.dataclass.utils'

I am trying to run the translation code in my IDE. The python code is placed under indicTrans folder.
The statement
from inference.engine import Model
throws an error :

  File "Run.py", line 1, in <module>  
    from inference.engine import Model  
  File "/home/sneha/Documents/indic/indicTrans/inference/engine.py", line 15, in <module>  
    from inference.custom_interactive import Translator  
  File "/home/sneha/Documents/indic/indicTrans/inference/custom_interactive.py", line 17, in <module>  
    from fairseq.dataclass.utils import convert_namespace_to_omegaconf  
ImportError: cannot import name 'convert_namespace_to_omegaconf' from 'fairseq.dataclass.utils' (/home/sneha/.local/lib/python3.8/site-packages/fairseq/dataclass/utils.py)

May I know how this can be solved ?
Thank you

Mixed language translation issue

Hi,

When I translate a source sentence containing mix of English and Hindi words into target language Hindi, random words appear for Hindi words.
For eg.

High Commission of India,
India House,
माझगाव डॉक शिपबिल्डर्स लिमिटेड
Mazagon Dock Shipbuilders Limited

Corresponding Hindi Translation

भारतीय उच्चायोग,
इंडिया हाउस,
एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज एक्सरसाइज
मझगांव डॉक शिपबिल्डर्स लिमिटेड

I used joint_translate.sh script for translation.

Thanks

Unable to use model for inference

Hi, I am trying to follow indictrans_fairseq_inference.ipynb for inference using the pretrained models for English to Hindi, but the generated output file is empty. On running the command bash joint_translate.sh en_sentences.txt hi_outputs.txt 'en' 'hi' '../en-indic', the following logs show up:

Wed Aug 18 12:03:02 EDT 2021
Applying normalization and script conversion
100%|#######################################################################################################################################################| 4/4 [00:00<00:00, 35.06it/s]
Number of sentences in input: 4
Applying BPE
Decoding
Extracting translations, script conversion and detokenization
Translation completed

However, when I see the hi_outputs.txt.log :

2021-08-18 12:06:40 | INFO | fairseq.tasks.translation | [SRC] dictionary: 32104 types
2021-08-18 12:06:40 | INFO | fairseq.tasks.translation | [TGT] dictionary: 35848 types
2021-08-18 12:06:40 | INFO | fairseq_cli.interactive | loading model(s) from ../en-indic/model/checkpoint_best.pt
2021-08-18 12:06:54 | INFO | fairseq_cli.interactive | Sentence buffer size: 2500
2021-08-18 12:06:54 | INFO | fairseq_cli.interactive | NOTE: hypothesis and token scores are output in base 2
2021-08-18 12:06:54 | INFO | fairseq_cli.interactive | Type the input sentence and press return:
S-0	__src__en__ __tgt__hi__ hello
W-0	0.342	seconds
inside decode_fn
Traceback (most recent call last):
  File "/u/ttater24/miniconda3/envs/indictrans/bin/fairseq-interactive", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-interactive')())
  File "/dccstor/cssblr/rmurthyv/MWE/indicTrans/training/fairseq/fairseq_cli/interactive.py", line 317, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/dccstor/cssblr/rmurthyv/MWE/indicTrans/training/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/dccstor/cssblr/rmurthyv/MWE/indicTrans/training/fairseq/fairseq_cli/interactive.py", line 283, in main
    print("H-{}\t{}\t{}".format(id_, score, hypo_str)) 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-30: ordinal not in range(128)

I get this error, and if I comment out the print lines on line 283 and line 285 in fairseq/fairseq_cli/interactive.py, it does not show any error but the output file still comes out empty.

logs if i comment out the print statements :

2021-08-18 12:10:33 | INFO | fairseq.tasks.translation | [SRC] dictionary: 32104 types
2021-08-18 12:10:33 | INFO | fairseq.tasks.translation | [TGT] dictionary: 35848 types
2021-08-18 12:10:33 | INFO | fairseq_cli.interactive | loading model(s) from ../en-indic/model/checkpoint_best.pt
2021-08-18 12:10:49 | INFO | fairseq_cli.interactive | Sentence buffer size: 2500
2021-08-18 12:10:49 | INFO | fairseq_cli.interactive | NOTE: hypothesis and token scores are output in base 2
2021-08-18 12:10:49 | INFO | fairseq_cli.interactive | Type the input sentence and press return:
S-0	__src__en__ __tgt__hi__ hello
W-0	0.341	seconds
inside decode_fn
P-0	-2.3383 -0.0889 -0.4440
S-1	__src__en__ __tgt__hi__ This bicycle is too small for you ! !
W-1	0.341	seconds
inside decode_fn
P-1	-0.4985 -0.4906 -0.0768 -0.7661 -0.2664 -0.4335 -0.2313 -0.1962 -0.2364 -0.7059
S-2	__src__en__ __tgt__hi__ I will directly meet you at the airport .
W-2	0.341	seconds
inside decode_fn
P-2	-0.7890 -1.7070 -0.8908 -0.2296 -0.8053 -1.8046 -0.3992 -0.1925 -0.3338 -0.1554
S-3	__src__en__ __tgt__hi__ If COVID-19 is spreading in your community , stay safe by taking some simple precautions , such as physical distancing , wearing a mask , keeping rooms well ventilated , avoiding crowds , cleaning your hands , and coughing into a bent elbow or tissue
W-3	0.341	seconds
inside decode_fn
P-3	-1.0445 -0.2641 -0.2060 -0.1771 -0.4343 -0.1578 -0.1135 -1.0917 -0.1419 -0.1706 -0.6793 -0.1694 -0.9022 -1.4178 -1.5907 -0.0508 -0.1238 -0.6169 -0.8115 -0.8431 -0.0594 -1.6192 -0.1850 -2.0033 -0.6672 -0.1500 -0.1400 -0.2116 -0.1118 -0.1452 -1.0625 -0.0479 -0.1315 -0.3330 -1.7769 -0.2528 -0.2888 -0.3009 -0.0194 -0.4282 -0.1504 -0.1208 -0.2184 -0.2383 -0.0699 -0.1597 -0.3872 -0.2935 -0.4527 -0.5734 -0.1968 -0.3366 -0.5575 -0.0954 -0.3563 -0.9439 -0.8385 -1.2941 -0.7543 -1.0392 -0.1206 -2.6353 -0.5275 -0.2655 -0.1371 -0.4036 -0.5091
2021-08-18 12:10:51 | INFO | fairseq_cli.interactive | Total time: 18.184 seconds; translation time: 1.364

In this case, the consolidated_testoutput in postprocess_translate.py is :
['', '', '', '']
I am unable to understand why the output is an empty file and how to use the model for inference

Fairseq installation error in ubuntu 20.04 python3.8

This is the error I get when I install fairseq using the command below:

pip install --editable ./

Running setup.py develop for fairseq
  error: subprocess-exited-with-error

  × python setup.py develop did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      running develop
      running egg_info
      writing fairseq.egg-info/PKG-INFO
      writing dependency_links to fairseq.egg-info/dependency_links.txt
      writing entry points to fairseq.egg-info/entry_points.txt
      writing requirements to fairseq.egg-info/requires.txt
      writing top-level names to fairseq.egg-info/top_level.txt
      reading manifest file 'fairseq.egg-info/SOURCES.txt'
      adding license file 'LICENSE'
      running build_ext
      cythoning fairseq/data/data_utils_fast.pyx to fairseq/data/data_utils_fast.cpp
      cythoning fairseq/data/token_block_utils_fast.pyx to fairseq/data/token_block_utils_fast.cpp
      building 'fairseq.libbleu' extension
      x86_64-linux-gnu-gcc: fatal error: cannot execute ‘cc1plus’: execvp: No such file or directory
      compilation terminated.
      /tmp/pip-build-env-geo_ny44/overlay/lib/python3.8/site-packages/setuptools/command/easy_install.py:160: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and rds-based tools.
        warnings.warn(
      /tmp/pip-build-env-geo_ny44/overlay/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standools.
        warnings.warn(
      error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
WARNING: No metadata found in /root/IndicTranslation/venv/lib/python3.8/site-packages
Rolling back uninstall of fairseq
Moving to /root/IndicTranslation/venv/bin/fairseq-eval-lm
 from /tmp/pip-uninstall-x78vx6r1/fairseq-eval-lm
Moving to /root/IndicTranslation/venv/bin/fairseq-generate
 from /tmp/pip-uninstall-x78vx6r1/fairseq-generate
Moving to /root/IndicTranslation/venv/bin/fairseq-interactive
 from /tmp/pip-uninstall-x78vx6r1/fairseq-interactive
Moving to /root/IndicTranslation/venv/bin/fairseq-preprocess
 from /tmp/pip-uninstall-x78vx6r1/fairseq-preprocess
Moving to /root/IndicTranslation/venv/bin/fairseq-score
 from /tmp/pip-uninstall-x78vx6r1/fairseq-score
Moving to /root/IndicTranslation/venv/bin/fairseq-train
 from /tmp/pip-uninstall-x78vx6r1/fairseq-train
Moving to /root/IndicTranslation/venv/bin/fairseq-validate
 from /tmp/pip-uninstall-x78vx6r1/fairseq-validate
Moving to /root/IndicTranslation/venv/lib/python3.8/site-packages/fairseq-0.10.2.dist-info/
 from /root/IndicTranslation/venv/lib/python3.8/site-packages/~airseq-0.10.2.dist-info
Moving to /root/IndicTranslation/venv/lib/python3.8/site-packages/fairseq/
 from /root/IndicTranslation/venv/lib/python3.8/site-packages/~airseq
Moving to /root/IndicTranslation/venv/lib/python3.8/site-packages/fairseq_cli/
 from /root/IndicTranslation/venv/lib/python3.8/site-packages/~airseq_cli
error: subprocess-exited-with-error

× python setup.py develop did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
  running develop
  running egg_info
  writing fairseq.egg-info/PKG-INFO
  writing dependency_links to fairseq.egg-info/dependency_links.txt
  writing entry points to fairseq.egg-info/entry_points.txt
  writing requirements to fairseq.egg-info/requires.txt
  writing top-level names to fairseq.egg-info/top_level.txt
  reading manifest file 'fairseq.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  running build_ext
  cythoning fairseq/data/data_utils_fast.pyx to fairseq/data/data_utils_fast.cpp
  cythoning fairseq/data/token_block_utils_fast.pyx to fairseq/data/token_block_utils_fast.cpp
  building 'fairseq.libbleu' extension
  x86_64-linux-gnu-gcc: fatal error: cannot execute ‘cc1plus’: execvp: No such file or directory
  compilation terminated.
  /tmp/pip-build-env-geo_ny44/overlay/lib/python3.8/site-packages/setuptools/command/easy_install.py:160: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and othebased tools.
    warnings.warn(
  /tmp/pip-build-env-geo_ny44/overlay/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards.
    warnings.warn(
  error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
WARNING: You are using pip version 22.0.3; however, version 22.0.4 is available.
You should consider upgrading via the '/root/IndicTranslation/venv/bin/python -m pip install --upgrade pip' command.

self hosted models to translate from Indic to English translation fails with no output

Error:

I get the error "AttributeError: 'dict' object has no attribute '_get_node_flag'" when translating sentences from Indic to English.

Environment details

Installation was done successfully as per this: https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_python_interface.ipynb#scrollTo=E_4JxNdRlPQB

Server: Ubuntu
GPU: Cuda 11.x
Python 3.8

Terminal Output from python run

(langtrans) sysops@das:~/a1bharat/indicTrans$ python app.py
शिर्डी - श्री साईबाबा संस्‍थानचे जानेवारी २०२३ चे उत्‍कृष्‍ट विभाग प्रमुख व उत्‍कृष्‍ट कर्मचारी म्‍हणुन मुख्‍यलेखाधिकारी तथा प्र.प्रशासकीय अधिकारी कैलास खराडे व प्र.लेखाधिकारी साहेबराव लंके यांचा संस्‍थानचे प्र.मुख्‍य कार्यकारी अधिकारी राहुल...
Tue Mar 7 14:58:26 CST 2023
Applying normalization and script conversion
100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25.62it/s]
Number of sentences in input: 1
Applying BPE
Decoding
Extracting translations, script conversion and detokenization
Translation completed

Logs from: en_outputs.txt.log:

2023-03-07 14:58:32 | INFO | fairseq_cli.interactive | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': 'model_configs', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False}, 'common_eval': {'_name': None, 'path': '../indic-en/model/checkpoint_best.pt', 'post_process': 'subword_nmt', 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'pytorch_ddp', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'tpu': False, 'distributed_num_procs': 1}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': True, 'max_tokens': None, 'batch_size': 64, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 64, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.25], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'model_parallel_size': 1, 'distributed_rank': 0}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 
0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 2500, 'input': 'en_outputs.txt.bpe'}, 'model': None, 'task': {'_name': 'translation', 'data': '../indic-en/final_bin', 'source_lang': 'SRC', 'target_lang': 'TGT', 'load_alignments': False, 'left_pad_source': True, 'left_pad_target': False, 'max_source_positions': 1024, 'max_target_positions': 1024, 'upsample_primary': -1, 'truncate_source': False, 'num_batch_buckets': 0, 'train_subset': 'train', 'dataset_impl': None, 'required_seq_len_multiple': 1, 'eval_bleu': False, 'eval_bleu_args': '{}', 'eval_bleu_detok': 'space', 'eval_bleu_detok_args': '{}', 'eval_tokenized_bleu': False, 'eval_bleu_remove_bpe': None, 'eval_bleu_print_samples': False}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': True}, 'optimizer': None, 'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [0.25]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
2023-03-07 14:58:32 | INFO | fairseq.tasks.translation | [SRC] dictionary: 35904 types
2023-03-07 14:58:32 | INFO | fairseq.tasks.translation | [TGT] dictionary: 32088 types
2023-03-07 14:58:32 | INFO | fairseq_cli.interactive | loading model(s) from ../indic-en/model/checkpoint_best.pt
Traceback (most recent call last):
File "/home/sysops/langtrans/bin/fairseq-interactive", line 8, in
sys.exit(cli_main())
File "/home/sysops/langtrans/lib/python3.8/site-packages/fairseq_cli/interactive.py", line 312, in cli_main
distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
File "/home/sysops/langtrans/lib/python3.8/site-packages/fairseq/distributed/utils.py", line 364, in call_main
main(cfg, **kwargs)
File "/home/sysops/langtrans/lib/python3.8/site-packages/fairseq_cli/interactive.py", line 145, in main
models, _model_args = checkpoint_utils.load_model_ensemble(
File "/home/sysops/langtrans/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 297, in load_model_ensemble
ensemble, args, _task = load_model_ensemble_and_task(
File "/home/sysops/langtrans/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 339, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/home/sysops/langtrans/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 271, in load_checkpoint_to_cpu
overwrite_args_by_name(state["cfg"], arg_overrides)
File "/home/sysops/langtrans/lib/python3.8/site-packages/fairseq/dataclass/utils.py", line 427, in overwrite_args_by_name
with open_dict(cfg):
File "/usr/lib/python3.8/contextlib.py", line 113, in enter
return next(self.gen)
File "/home/sysops/langtrans/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 669, in open_dict
prev_state = config._get_node_flag("struct")
AttributeError: 'dict' object has no attribute '_get_node_flag'

Empty string

So I followed the exact procedure as mentioned in the colab notebook for running on the command prompt. But unfortunately, I am not able to get anything printed in the final output file. Can someone please help me with this?
Thanks

finetuning for indic2indic model

The current finetuning script does not support finetuning the indic-indic model (which is one of the IndicTrans models), but I am able to finetune the en-indic and indic-en models. Could you provide the script or directions to finetune that specific model?

Demo Colab notebook issue.

Hello I am trying to run the demo colab notebook but when I try to import the fairseq dependencies I am getting the below error.

I tried restarting the runtime as mentioned, installed fairseq using pip, and also tried a different checkout version of faireq from here and here but still no help.

I am not sure if I am missing something. Attaching a screenshot for reference; kindly help.
[Screenshot of the import error attached: Screen Shot 2022-01-08 at 2 42 14 AM]

P.S.: When installing with pip I do not get this error, but a different error (attached below) when trying to import the IndicTrans Model.

[Screenshot of the second error attached: Screen Shot 2022-01-08 at 2 45 27 AM]

Issue in using Standalone En2Indic Model

Hi,
If I download the En2Indic model and try to use it, I get the following error:

ModuleNotFoundError: No module named 'inference'

However, if I first download the Indic2En model and then download the En2Indic model, things work fine. Can you please check whether you see the same behaviour, or am I doing something wrong?

Colab Links for your reference:

Working only after first downloading the Indic2En model:
https://colab.research.google.com/drive/1HxG1_lvZ7XDD89QbVikajg0HJnQvNrgR?usp=sharing

Not working when only the En2Indic model is downloaded:
https://colab.research.google.com/drive/1QAQg0557YVgoPLfC7MMtbnfHNNmFpu-m?usp=sharing

Params for Training en-indic model

We are trying to replicate the results from the Samanantar/IndicTrans paper. We are training the model only for en-hi translation, and we are currently using the following parameters, following the paper:
fairseq-train ../en_hi_4x/final_bin --max-source-positions=210 --max-target-positions=210 --save-interval-updates=10000 --arch=transformer_4x --criterion=label_smoothed_cross_entropy --source-lang=SRC --lr-scheduler=inverse_sqrt --target-lang=TGT --label-smoothing=0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 1.0 --warmup-init-lr 1e-07 --lr 0.0005 --warmup-updates 4000 --dropout 0.2 --save-dir ../en_hi_4x/model --keep-last-epochs 5 --patience 5 --skip-invalid-size-inputs-valid-test --fp16 --user-dir model_configs --wandb-project 'train_1' --max-tokens 300

Can you please share the parameters you used for training the en-indic model, or specifically whether you have tried en-hi separately?

Indic-to-Indic BPE codes

Hi,

Great model, thank you for the awesome work! I'm trying to do Indic-to-Indic translation. From joint_translate.sh I can see that for the En-Indic (and vice versa) models the BPE codes are split into "bpe_codes.32k.SRC" and "bpe_codes.32k.TGT", but the Indic-to-Indic model ships only a single file, "bpe_codes.32k.SRC_TGT", which breaks model creation at inference time.

Did I miss some documentation on how to modify the code for Indic-to-Indic inference? (Otherwise it returns a No such file or directory: '../m2m/vocab/bpe_codes.32k.SRC' error.)

Thanks!
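Not an official answer, but one possible workaround sketch while waiting for documentation: since the m2m model ships a single joint BPE file, pointing the per-side filenames the script expects at that shared file may unblock inference. Whether the joint codes are directly usable for both SRC and TGT is an assumption here, and the vocab path below is the one from the error message.

```python
from pathlib import Path

# Assumed layout of the downloaded indic-indic (m2m) model.
vocab_dir = Path("../m2m/vocab")
joint_codes = vocab_dir / "bpe_codes.32k.SRC_TGT"

# joint_translate.sh looks for per-side files; create relative symlinks to the
# shared joint file so both lookups resolve.
for side in ("SRC", "TGT"):
    per_side = vocab_dir / f"bpe_codes.32k.{side}"
    if not per_side.exists():
        per_side.symlink_to(joint_codes.name)
```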

Reducing BLEU score

I tried to finetune the IndicTrans model on 1603080 en-hi sentences from WAT 2021. Initially I trained it for 3 epochs over 9 hours 20 minutes on a GPU and got a BLEU score of 37.1. The next day I continued training by restoring the last checkpoint; it ran for 6 more epochs over roughly 19 hours and ended with a BLEU score of 36.2. None of the epochs on the second day produced a new best checkpoint, and the loss stayed around 3.1 for the entire period.
What could be the problem, and how can I solve it?

TypeError: cannot unpack non-iterable NoneType object while importing fairseq

Hi Team

I am trying to reproduce indicTrans_python_interface.ipynb, but I am not able to import the fairseq library. Below is the error I am facing.

[Screenshots of the import error attached: Screenshot 2022-09-19 at 3 59 03 PM]

I can see that the same issue is still open in the fairseq GitHub repo. I tried installing the torch and torchvision packages as mentioned in the link below, but I am still facing the same issue.

facebookresearch/fairseq#4214

This issue is also blocking me from training my own model using IndicTrans_training.ipynb.

I can see that the import is successful in indicTrans_python_interface.ipynb, with a few warnings.

Below is a link to my notebook:

https://colab.research.google.com/drive/1e0G_jDe8_0hd-xtj1e4zhwqJQNtE86EC?usp=sharing

Can you please help me here?

Regards
Subbu
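As a first diagnostic step (a sketch, not a guaranteed fix, since this import error is typically a version mismatch between torch and fairseq), it can help to confirm which versions the runtime actually resolved:

```python
# Quick version check inside the Colab runtime; the import error reported
# above usually points at an incompatibility between these two packages.
import torch
print("torch   :", torch.__version__)

try:
    import fairseq
    print("fairseq :", fairseq.__version__)
except TypeError as exc:
    # The "cannot unpack non-iterable NoneType object" error surfaces here.
    print("fairseq import failed:", exc)
```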

Running into errors using Colab for inference

Hi

Apologies for creating an issue out of the blue. I came across this codebase via https://indicnlp.ai4bharat.org/indic-trans/ and I am a big fan of this project.

I am running into some issues while running the code. I followed the code in the Colab notebook linked in the README (as well as the PR with a few fixes: #6).

In particular,

  1. The script joint_translate.sh expects a file ../en-indic/vocab/bpe_codes.32k.SRC_TGT, which does not exist. I assume the required file is ../en-indic/vocab/bpe_codes.32k.SRC instead.

  2. After changing the file above, I still get this error while running the model:

AssertionError: Could not infer model type from Namespace(_name='transformer_4x', ...

I assume this is because transformer_4x is a custom architecture that is not among the models registered in fairseq/fairseq/models/__init__.py here.

Would it be possible to make sure that the code runs in the Colab? Thank you so much!
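On the second point, the transformer_4x architecture is provided by the repository's model_configs folder, which fairseq only picks up when told where to look. A minimal sketch (assuming you run from the repository root, where model_configs lives) is to pass --user-dir model_configs on the command line, or equivalently to register it programmatically before loading the checkpoint:

```python
from argparse import Namespace
from fairseq import utils

# Register the custom transformer_4x architecture shipped in model_configs
# (path assumed relative to the indicTrans repository root) so that
# checkpoint loading can resolve the architecture name.
utils.import_user_module(Namespace(user_dir="model_configs"))
```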

Handling of URL in source sentence

Hi, sentences containing a URL do not get translated correctly:
Input: If you develop these symptoms in someone close to you, staying at home can help prevent the spread of Coronavirus infection. For details visit https://mohfw.gov.in/
Output (Hindi): यदि आप अपने किसी करीबी में ये लक्षण विकसित करते हैं, तो घर पर रहना कोरोना वायरस संक्रमण के प्रसार को रोकने में मदद कर सकता है। अधिक जानकारी के लिए https:// www. mohfw. gov. in/pdf पर जाएं। //mohfw. gov. in/पर उपलब्ध है।

How can this be handled?
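The model is not trained to copy URLs verbatim, so they tend to be segmented and garbled. One common mitigation (a sketch, not part of the IndicTrans codebase) is to mask URLs with placeholder tokens before translation and restore them afterwards; the placeholder may still occasionally be altered by the model, so this is a best-effort heuristic. The translate_fn argument below is a hypothetical callable wrapping whatever translation interface you use.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def translate_with_url_masking(text: str, translate_fn) -> str:
    """Mask URLs before translation and restore them afterwards.

    translate_fn is any callable that translates a plain string,
    e.g. a thin wrapper around the inference engine.
    """
    urls = URL_RE.findall(text)
    masked = text
    for i, url in enumerate(urls):
        masked = masked.replace(url, f"URL{i}", 1)
    translated = translate_fn(masked)
    for i, url in enumerate(urls):
        translated = translated.replace(f"URL{i}", url, 1)
    return translated
```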

Model only works inside the indicTrans folder.

Hi, I did the setup as explained in the Python inference demo Colab notebook, and it completed successfully. However, the model only works inside the indicTrans folder; accessing it from anywhere outside results in a ModuleNotFoundError or FileNotFoundError.

ModuleNotFoundError: No module named 'indicTrans'

I have a FastAPI backend where I need to access the model from outside indicTrans, but it's not working. The code I tried is given below:

import sys
sys.path.append(path_to_indictrans_folder)
from indicTrans.inference.engine import Model

indic2en_model = Model(expdir='../indic-en')

[The same code, without the sys.path.append, works inside indicTrans.]

Is there a mistake in how I am accessing the model with this code, or is there another approach?

OS Used: Linux Mint
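One workaround sketch, under two assumptions that match the errors reported above: the repository has an inference/engine.py package inside it, and its code resolves helper scripts and vocab files relative to the working directory. In that case, put the repository itself on sys.path, change into it once at startup, and pass absolute paths for the model directory. REPO_ROOT and MODEL_DIR below are hypothetical paths.

```python
import os
import sys

REPO_ROOT = "/home/user/indicTrans"   # hypothetical path to the cloned repo
MODEL_DIR = "/home/user/indic-en"     # hypothetical path to the downloaded model

sys.path.insert(0, REPO_ROOT)
os.chdir(REPO_ROOT)                   # relative paths inside the repo then resolve

from inference.engine import Model    # import path as used from inside the repo

indic2en_model = Model(expdir=MODEL_DIR)
```

In a FastAPI backend, changing the working directory is a blunt instrument; doing it once at startup, before loading the model, is the least intrusive option.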

RAM issue while deploying IndicTrans model on AWS

We're using the IndicTrans translation models, and the Hugging Face API is throwing a runtime error. We set up a Docker image for the model on AWS, but it uses around 16 GB of RAM (requiring a t2 xlarge tier, which is quite expensive). We want to use the translation model in production in our mobile app. Any ideas on what can be done to reduce costs or make the API calls easier?
