
Indonesian Sentence Embeddings

Inspired by Thai Sentence Vector Benchmark, we decided to embark on the journey of training Indonesian sentence embedding models!

To the best of our knowledge, there is no official benchmark on Indonesian sentence embeddings. We hope this repository can serve as a benchmark for future research on Indonesian sentence embeddings.

Evaluation

Machine Translated STS-B

We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach of the Thai Sentence Vector Benchmark project and translated the STS-B dev and test sets to Indonesian via the Google Translate API. The translated test set is used to evaluate our models' Spearman correlation scores.

You can find the translated dataset on the 🤗 HuggingFace Hub.
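For illustration, a minimal sketch of how such an evaluation could be run with Sentence Transformers is shown below. The dataset ID, model ID, and column names are placeholders rather than values taken from this repository.

```python
# Hedged sketch: the dataset/model IDs and column names below are assumptions.
from datasets import load_dataset
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

stsb = load_dataset("LazarusNLP/stsb_mt_id", split="test")      # hypothetical dataset ID
model = SentenceTransformer("LazarusNLP/congen-indobert-base")  # hypothetical model ID

emb1 = model.encode(stsb["text_1"], convert_to_tensor=True)     # assumed column names
emb2 = model.encode(stsb["text_2"], convert_to_tensor=True)
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()

# Spearman's rank correlation between cosine similarities and gold scores, in %.
print(spearmanr(cosine_scores, stsb["correlation"]).correlation * 100)
```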

Moreover, we will further evaluate the transferability of our models on downstream tasks (e.g. text classification, natural language inference, etc.) and compare them with existing pre-trained language models (PLMs).

Text Classification

For text classification, we perform emotion classification and sentiment analysis on the EmoT and SmSA subsets of IndoNLU, respectively. Following the Thai Sentence Vector Benchmark, we simply fit a Linear SVC on the sentence representations of our texts and their corresponding labels. Thus, unlike conventional fine-tuning, where the backbone model is also updated, the Sentence Transformer stays frozen in our case, with only the classification head being trained.
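A minimal sketch of this frozen-encoder setup might look like the following; the model ID and column names are assumptions, not values defined by this repository.

```python
# Hedged sketch of the frozen-encoder classification setup.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

emot = load_dataset("indonlp/indonlu", "emot")                  # EmoT subset of IndoNLU
model = SentenceTransformer("LazarusNLP/congen-indobert-base")  # hypothetical model ID

# The Sentence Transformer stays frozen: it is only used to extract features.
X_train = model.encode(emot["train"]["tweet"])                  # assumed column names
X_test = model.encode(emot["test"]["tweet"])
y_train, y_test = emot["train"]["label"], emot["test"]["label"]

# Only the Linear SVC classification head is trained.
clf = LinearSVC().fit(X_train, y_train)
preds = clf.predict(X_test)
print(accuracy_score(y_test, preds), f1_score(y_test, preds, average="macro"))
```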

Methods

(Unsupervised) SimCSE

We followed SimCSE: Simple Contrastive Learning of Sentence Embeddings and trained a sentence embedding model in an unsupervised fashion. Unsupervised SimCSE lets us leverage an unsupervised corpus -- which is plentiful -- and, using different dropout masks in the encoder, contrastively learn sentence representations. This fits the current lack of supervised Indonesian sentence similarity datasets, making SimCSE a natural first step into this field. We used the Sentence Transformers implementation of SimCSE.
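A minimal sketch of this training recipe with Sentence Transformers is shown below, assuming a one-sentence-per-line Wikipedia dump; the corpus path, base model checkpoint, and hyperparameters are placeholders.

```python
# Hedged sketch of unsupervised SimCSE training with Sentence Transformers.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, models, losses

# Sentence embedding model: IndoBERT Base (assumed checkpoint) + mean pooling.
word_embedding = models.Transformer("indobenchmark/indobert-base-p1", max_seq_length=32)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# Each sentence is paired with itself; different dropout masks in the encoder
# yield two distinct views, which are contrasted against in-batch negatives
# (the unsupervised SimCSE objective).
with open("wikipedia_id_sentences.txt", encoding="utf-8") as f:  # placeholder corpus path
    train_examples = [
        InputExample(texts=[line.strip(), line.strip()]) for line in f if line.strip()
    ]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, show_progress_bar=True)
model.save("simcse-indobert-base")
```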

ConGen

Like SimCSE, ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation is another unsupervised technique for training a sentence embedding model. Since it is in part a distillation method, ConGen relies on a teacher model which is then distilled into a student model. The original paper proposes back-translation as the best data augmentation technique. However, due to limited resources, we implemented word deletion instead, which was found to be on par with back-translation despite being trivial. We used the official ConGen implementation, which is built on top of the Sentence Transformers library.
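For illustration, a trivial word-deletion augmenter could look like the sketch below; this is not the official ConGen code, just an assumed stand-in showing the idea.

```python
# Hedged sketch of word-deletion augmentation (not the official ConGen code).
import random

def word_deletion(sentence: str, deletion_prob: float = 0.15) -> str:
    """Randomly drop each whitespace-separated token with probability `deletion_prob`."""
    tokens = sentence.split()
    kept = [tok for tok in tokens if random.random() > deletion_prob]
    # Never return an empty augmentation; fall back to one random token.
    return " ".join(kept) if kept else random.choice(tokens)

print(word_deletion("model kalimat bahasa Indonesia untuk penelitian"))
```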

Results

Machine Translated Indonesian Semantic Textual Similarity Benchmark (STSB-MT-ID)

| Model | Spearman's Correlation (%) | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
| --- | --- | --- | --- | --- | --- | --- |
| SimCSE-IndoBERT Lite Base | 44.08 | 12M | IndoBERT Lite Base | N/A | Wikipedia | |
| SimCSE-IndoRoBERTa Base | 61.26 | 125M | IndoRoBERTa Base | N/A | Wikipedia | |
| SimCSE-IndoBERT Base | 70.13 | 125M | IndoBERT Base | N/A | Wikipedia | |
| ConGen-IndoBERT Lite Base | 79.97 | 12M | IndoBERT Lite Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| ConGen-IndoBERT Base | 80.47 | 125M | IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| ConGen-SimCSE-IndoBERT Base | 81.16 | 125M | SimCSE-IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| S-IndoBERT Base mMARCO | 72.95 | 125M | IndoBERT Base | N/A | mMARCO | ✅ |
| distiluse-base-multilingual-cased-v2 | 75.08 | 134M | DistilBERT Base Multilingual | mUSE | See: SBERT | ✅ |
| paraphrase-multilingual-mpnet-base-v2 | 83.83 | 125M | XLM-RoBERTa Base | paraphrase-mpnet-base-v2 | See: SBERT | ✅ |

Emotion Classification (EmoT)

| Model | Accuracy (%) | F1 Macro (%) |
| --- | --- | --- |
| SimCSE-IndoBERT Lite Base | 41.13 | 40.70 |
| SimCSE-IndoRoBERTa Base | 50.45 | 50.75 |
| SimCSE-IndoBERT Base | 55.45 | 55.78 |
| ConGen-IndoBERT Lite Base | 58.18 | 58.84 |
| ConGen-IndoBERT Base | 57.04 | 57.06 |
| ConGen-SimCSE-IndoBERT Base | 59.54 | 60.37 |
| S-IndoBERT Base mMARCO | 48.86 | 47.92 |
| distiluse-base-multilingual-cased-v2 | 63.63 | 64.13 |
| paraphrase-multilingual-mpnet-base-v2 | 63.18 | 63.78 |

Sentiment Analysis (SmSA)

| Model | Accuracy (%) | F1 Macro (%) |
| --- | --- | --- |
| SimCSE-IndoBERT Lite Base | 68.8 | 63.37 |
| SimCSE-IndoRoBERTa Base | 76.2 | 70.42 |
| SimCSE-IndoBERT Base | 85.6 | 81.50 |
| ConGen-IndoBERT Lite Base | 81.2 | 75.59 |
| ConGen-IndoBERT Base | 85.4 | 82.12 |
| ConGen-SimCSE-IndoBERT Base | 83.0 | 78.74 |
| S-IndoBERT Base mMARCO | 80.2 | 75.73 |
| distiluse-base-multilingual-cased-v2 | 78.8 | 73.64 |
| paraphrase-multilingual-mpnet-base-v2 | 89.6 | 86.56 |

References

@misc{Thai-Sentence-Vector-Benchmark-2022,
  author = {Limkonchotiwat, Peerat},
  title = {Thai-Sentence-Vector-Benchmark},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark}}
}
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
@inproceedings{gao2021simcse,
  title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
  author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2021}
}
@inproceedings{limkonchotiwat-etal-2022-congen,
  title = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
  author = "Limkonchotiwat, Peerat  and
    Ponwitayarat, Wuttikorn  and
    Lowphansirikul, Lalita and
    Udomcharoenchaikit, Can  and
    Chuangsuwanich, Ekapol  and
    Nutanong, Sarana",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
  year = "2022",
  publisher = "Association for Computational Linguistics",
}
