Inspired by Thai Sentence Vector Benchmark, we decided to embark on the journey of training Indonesian sentence embedding models!
To the best of our knowledge, there is no official benchmark on Indonesian sentence embeddings. We hope this repository can serve as a benchmark for future research on Indonesian sentence embeddings.
We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach of the Thai Sentence Vector Benchmark project and translated the STS-B dev and test sets to Indonesian via the Google Translate API. We use this dataset to evaluate our models by their Spearman correlation score on the translated test set.
You can find the translated dataset on 🤗 HuggingFace Hub.
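As a rough illustration of the metric, the sketch below scores each sentence pair by the cosine similarity of its two embeddings and then takes the Spearman correlation between those scores and the gold similarity labels. It uses only the standard library, and the toy vectors, the gold scores, and the function names (`sts_spearman` and its helpers) are made up for this example; they are not taken from the repository's evaluation script.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def sts_spearman(embeddings_a, embeddings_b, gold_scores):
    """Spearman correlation (in %) between cosine scores and gold labels."""
    predicted = [cosine(u, v) for u, v in zip(embeddings_a, embeddings_b)]
    return 100 * spearman(predicted, gold_scores)


# Toy stand-ins for model outputs on three sentence pairs (hypothetical data).
emb_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
emb_b = [[1.0, 0.1], [1.0, 1.0], [0.0, 1.0]]
gold = [4.8, 2.5, 0.2]  # STS-B style similarity labels on a 0-5 scale

score = sts_spearman(emb_a, emb_b, gold)
print(f"Spearman correlation: {score:.2f}%")
```

Because Spearman correlation only compares rankings, the scale of the cosine scores does not matter; what counts is whether the model orders the pairs the same way the annotators did.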
Moreover, we will further evaluate the transferability of our models on downstream tasks (e.g. text classification and natural language inference) and compare them with existing pre-trained language models (PLMs).
For text classification, we perform emotion classification and sentiment analysis on the EmoT and SmSA subsets of IndoNLU, respectively. Following the Thai Sentence Vector Benchmark, we simply fit a Linear SVC on the sentence representations of our texts and their corresponding labels. Thus, unlike conventional fine-tuning, where the backbone model is also updated, the Sentence Transformer stays frozen in our case; only the classification head is trained.
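For reference, the two metrics reported for these classification tasks can be computed as in the sketch below. It covers only the scoring step: the predictions would come from the linear classifier fitted on frozen sentence embeddings. The label strings and predictions are toy values invented for illustration, and the helper names `accuracy` and `macro_f1` are ours.

```python
def accuracy(y_true, y_pred):
    """Percentage of examples whose predicted label matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100 * correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, as a percentage."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return 100 * sum(f1_scores) / len(f1_scores)


# Hypothetical gold labels and classifier predictions for six texts.
y_true = ["happy", "sad", "anger", "happy", "sad", "anger"]
y_pred = ["happy", "sad", "happy", "happy", "anger", "anger"]

print(f"Accuracy: {accuracy(y_true, y_pred):.2f}%")
print(f"F1 Macro: {macro_f1(y_true, y_pred):.2f}%")
```

Macro averaging gives every class the same weight regardless of how many examples it has, so it can differ noticeably from accuracy when the label distribution is imbalanced.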
We followed SimCSE: Simple Contrastive Learning of Sentence Embeddings and trained a sentence embedding model in an unsupervised fashion. Unsupervised SimCSE lets us leverage unlabeled text, which is plentiful, and contrastively learn sentence representations by encoding the same sentence with different dropout masks. This suits the current situation, in which supervised Indonesian sentence similarity datasets are scarce, so SimCSE is a natural first move into this field. We used the Sentence Transformers implementation of SimCSE.
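To make the training objective concrete, here is a minimal, standard-library sketch of the in-batch contrastive loss behind unsupervised SimCSE. Each sentence is assumed to have been encoded twice with different dropout masks, giving two "views"; the matching view is the positive and every other sentence in the batch is a negative. The toy vectors and the function name `simcse_loss` are illustrative, and this is not the Sentence Transformers code used for training. The temperature of 0.05 is the value reported in the SimCSE paper; whether our training runs used the same value is an assumption here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean cross-entropy of picking view_b[i] as the match for view_a[i].

    view_a[i] and view_b[i] embed the same sentence under two dropout masks;
    view_b[j] for j != i serves as an in-batch negative.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two sentences; the second view is a slightly perturbed copy,
# standing in for the effect of a different dropout mask (made-up numbers).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = simcse_loss(view_a, view_b)
shuffled = simcse_loss(view_a, list(reversed(view_b)))
print(f"aligned views:  {aligned:.4f}")
print(f"shuffled views: {shuffled:.4f}")
```

The loss is near zero when each sentence is closest to its own second view and grows when the views are mismatched, which is what pushes the encoder to map a sentence to a stable representation.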
Like SimCSE, ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation is another unsupervised technique for training a sentence embedding model. Since it is in part a distillation method, ConGen relies on a teacher model whose knowledge is distilled into a student model. The original paper proposes back-translation as the best data augmentation technique. However, due to limited resources, we implemented word deletion, which was found to be on par with back-translation despite being trivial. We used the official ConGen implementation, which is written on top of the Sentence Transformers library.
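Word deletion can be sketched in a few lines. The version below drops each whitespace-separated word independently with some probability and keeps at least one word so the augmented sentence is never empty. The function name `delete_words`, the default deletion probability of 0.1, and the example sentence are assumptions made for illustration; the augmentation script we actually ran may differ in its details.

```python
import random


def delete_words(sentence, deletion_prob=0.1, rng=None):
    """Return a copy of `sentence` with words randomly removed.

    Each word is dropped independently with probability `deletion_prob`.
    If every word would be dropped, one randomly chosen word is kept.
    """
    if rng is None:
        rng = random.Random()
    words = sentence.split()
    if not words:
        return sentence
    kept = [word for word in words if rng.random() >= deletion_prob]
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)


# Example sentence ("I like reading books in the city library").
original = "Saya suka membaca buku di perpustakaan kota"
augmented = delete_words(original, deletion_prob=0.3, rng=random.Random(42))
print(original)
print(augmented)
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across runs, which helps when comparing students trained on the same augmented corpus.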
Model | Spearman's Correlation (%) | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
---|---|---|---|---|---|---|
SimCSE-IndoBERT Lite Base | 44.08 | 12M | IndoBERT Lite Base | N/A | Wikipedia | |
SimCSE-IndoRoBERTa Base | 61.26 | 125M | IndoRoBERTa Base | N/A | Wikipedia | |
SimCSE-IndoBERT Base | 70.13 | 125M | IndoBERT Base | N/A | Wikipedia | |
ConGen-IndoBERT Lite Base | 79.97 | 12M | IndoBERT Lite Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
ConGen-IndoBERT Base | 80.47 | 125M | IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
ConGen-SimCSE-IndoBERT Base | 81.16 | 125M | SimCSE-IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
S-IndoBERT Base mMARCO | 72.95 | 125M | IndoBERT Base | N/A | mMARCO | ✅ |
distiluse-base-multilingual-cased-v2 | 75.08 | 134M | DistilBERT Base Multilingual | mUSE | See: SBERT | ✅ |
paraphrase-multilingual-mpnet-base-v2 | 83.83 | 125M | XLM-RoBERTa Base | paraphrase-mpnet-base-v2 | See: SBERT | ✅ |
Model | EmoT Accuracy (%) | EmoT F1 Macro (%) |
---|---|---|
SimCSE-IndoBERT Lite Base | 41.13 | 40.70 |
SimCSE-IndoRoBERTa Base | 50.45 | 50.75 |
SimCSE-IndoBERT Base | 55.45 | 55.78 |
ConGen-IndoBERT Lite Base | 58.18 | 58.84 |
ConGen-IndoBERT Base | 57.04 | 57.06 |
ConGen-SimCSE-IndoBERT Base | 59.54 | 60.37 |
S-IndoBERT Base mMARCO | 48.86 | 47.92 |
distiluse-base-multilingual-cased-v2 | 63.63 | 64.13 |
paraphrase-multilingual-mpnet-base-v2 | 63.18 | 63.78 |
Model | SmSA Accuracy (%) | SmSA F1 Macro (%) |
---|---|---|
SimCSE-IndoBERT Lite Base | 68.8 | 63.37 |
SimCSE-IndoRoBERTa Base | 76.2 | 70.42 |
SimCSE-IndoBERT Base | 85.6 | 81.50 |
ConGen-IndoBERT Lite Base | 81.2 | 75.59 |
ConGen-IndoBERT Base | 85.4 | 82.12 |
ConGen-SimCSE-IndoBERT Base | 83.0 | 78.74 |
S-IndoBERT Base mMARCO | 80.2 | 75.73 |
distiluse-base-multilingual-cased-v2 | 78.8 | 73.64 |
paraphrase-multilingual-mpnet-base-v2 | 89.6 | 86.56 |
@misc{Thai-Sentence-Vector-Benchmark-2022,
author = {Limkonchotiwat, Peerat},
title = {Thai-Sentence-Vector-Benchmark},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark}}
}
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@inproceedings{gao2021simcse,
title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
year={2021}
}
@inproceedings{limkonchotiwat-etal-2022-congen,
title = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
author = "Limkonchotiwat, Peerat and
Ponwitayarat, Wuttikorn and
Lowphansirikul, Lalita and
Udomcharoenchaikit, Can and
Chuangsuwanich, Ekapol and
Nutanong, Sarana",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
year = "2022",
publisher = "Association for Computational Linguistics",
}