Git Product home page Git Product logo

multilingual_similarity_compare's Introduction

Goals

In this repository, we will see different methods to achieve similarity search across multiple languages. We aim to compare:

  • LASER (Baseline)
  • XLM MLM 100 languages
  • XLM-R
  • sentence transformers with multilingual models

We compare the below models on the publicly available newstest2012 from WMT 2012.

For each sentence in the source language, we calculate the closest sentence in the joint embedding space in the target language. If this sentence has the same index in the file, it is considered as correct, and as an error else wise. Therefore, the N-way parallel corpus should not contain duplicates.

  1. First, get the data from the section "Get data".
  2. Second, install the necessary tools such as tokenizers and BPE encoder.
  3. Third, preprocess the data (this step is not necessary for all models but interesting for the seek of experimenting).
  4. Then, get sentence embeddings with the method of your choice. In this section, we have not optimized the code so it can be very slow for XLM models. It is quite fast with sentence-transformers though. It could be useful to improve this part to get sentence embeddings faster ==> We could use larger batches (batch version is available but needs to be optimized). Feel free to raise an issue with suggestions.
  5. Finally, compute similarity search for each pair of languages.

Results

Baseline (LASER)

cs de en es fr avg
cs 0.00% 0.70% 0.90% 0.67% 0.77% 0.76%
de 0.83% 0.00% 1.17% 0.90% 1.03% 0.98%
en 0.93% 1.27% 0.00% 0.83% 1.07% 1.02%
es 0.53% 0.77% 0.97% 0.00% 0.57% 0.71%
fr 0.50% 0.90% 1.13% 0.60% 0.00% 0.78%
avg 0.70% 0.91% 1.04% 0.75% 0.86% 1.06%

Sentence-transformers with 'distiluse-base-multilingual-cased'

Czech language is currently not covered by sentence-transformers. More experiments are coming to do a thorough comparison. Hence, the following is a better view of what performance can achieve sentence-transformers using as base model 'distiluse-base-multilingual-cased'

cs de en es fr avg
cs NA NA NA NA NA NA
de NA 0.00% 1.40% 1.60% 1.53% TODO
en NA 1.47% 0.00% 1.33% 0.97% TODO
es NA 1.80% 1.10% 0.00% 1.03% TODO
fr NA 1.73% 0.97% 1.07% 0.00% TODO
avg NA TODO TODO TODO TODO TODO

Sentence-transformers outperform LASER on the pair (French, English), (English, French).
Otherwise, LASER shows higher performance.

Russian is considered in our analysis. LASER has not published in their repo the results for Russian. We will reproduce their experiments soon and include Russian for rigurous analysis.

langs de en es fr ru avg
de 0.00% 1.40% 1.60% 1.53% 2.60% 1.78%
en 1.47% 0.00% 1.33% 0.97% 0.50% 1.07%
es 1.80% 1.10% 0.00% 1.03% 2.13% 1.52%
fr 1.73% 0.97% 1.07% 0.00% 2.00% 1.44%
ru 2.76% 0.50% 2.23% 2.03% 0.00% 1.88%
avg 1.94% 0.99% 1.56% 1.39% 1.81% 1.54%

We can also do a Zero-Shot encoding and consider CS language.
The following shows such results:

langs cs de en es fr ru avg
cs 0.00% 39.46% 37.63% 38.20% 38.99% 41.69% 39.19%
de 36.10% 0.00% 1.40% 1.60% 1.53% 2.60% 8.64%
en 34.17% 1.47% 0.00% 1.33% 0.97% 0.50% 7.69%
es 35.36% 1.80% 1.10% 0.00% 1.03% 2.13% 8.29%
fr 35.96% 1.73% 0.97% 1.07% 0.00% 2.00% 8.34%
ru 37.96% 2.76% 0.50% 2.23% 2.03% 0.00% 9.10%
avg 35.91% 9.44% 8.32% 8.88% 8.91% 9.78% 13.54%

DistilBERT-base-multilingual-cased

With mean pooling, MAX_LEN = 100:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   84.85%   87.71%   86.85%   88.71%   87.03%
de    75.06%    0.00%   64.20%   70.36%   69.13%   69.69%
en    74.16%   54.28%    0.00%   24.84%   29.14%   45.60%
es    75.99%   81.62%   40.86%    0.00%   50.95%   62.35%
fr    90.08%   78.16%   37.53%   56.88%    0.00%   65.66%
avg   78.82%   74.73%   57.58%   59.73%   59.48%   66.07%

With mean pooling, MAX_LEN = 50:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   84.18%   85.98%   82.65%   89.08%   85.47%
de    74.79%    0.00%   61.51%   65.23%   67.27%   67.20%
en    75.72%   55.04%    0.00%   26.71%   31.60%   47.27%
es    78.39%   81.75%   37.40%    0.00%   55.54%   63.27%
fr    75.36%   63.20%   35.16%   34.33%    0.00%   52.01%
avg   76.07%   71.05%   55.01%   52.23%   60.87%   63.05%

With cls pooling, MAX_LEN = 100:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   89.84%   98.07%   89.88%   91.94%   92.43%
de    72.86%    0.00%   88.68%   71.33%   73.63%   76.62%
en    81.55%   64.47%    0.00%   42.99%   51.85%   60.21%
es    77.46%   78.82%   85.55%    0.00%   60.11%   75.48%
fr    75.42%   76.76%   83.82%   50.18%    0.00%   71.55%
avg   76.82%   77.47%   89.03%   63.59%   69.38%   75.26%

With cls pooling, MAX_LEN = 50:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   91.51%   98.30%   91.04%   93.81%   93.66%
de    75.02%    0.00%   89.68%   73.16%   76.92%   78.70%
en    82.78%   67.87%    0.00%   49.32%   56.51%   64.12%
es    78.69%   81.68%   86.38%    0.00%   64.84%   77.90%
fr    76.72%   80.22%   85.18%   54.18%    0.00%   74.08%
avg   78.31%   80.32%   89.89%   66.92%   73.02%   77.69%

XLM with MLM covering 100 languages

With max_len = 100, (the higher the better according to experiments)

langs cs de en es fr ru avg
cs 0.00% 74.46% 74.63% 71.86% 47.45% 42.72% 62.22%
de 48.42% 0.00% 32.83% 35.23% 35.30% 40.93% 38.54%
en 53.28% 37.70% 0.00% 22.91% 25.57% 24.18% 32.73%
es 40.26% 48.68% 19.35% 0.00% 25.87% 36.56% 34.15%
fr 65.73% 50.35% 20.85% 37.30% 0.00% 31.73% 41.19%
ru 41.56% 42.09% 21.51% 34.23% 34.73% 0.00% 34.83%
avg 49.85% 50.66% 33.83% 40.31% 33.79% 35.22% 40.61%

Without russian, (for easier comparison with Baseline)

langs cs de en es fr avg
cs 0.00% 74.46% 74.63% 71.86% 47.45% 67.10%
de 48.42% 0.00% 32.83% 35.23% 35.30% 37.95%
en 53.28% 37.70% 0.00% 22.91% 25.57% 34.87%
es 40.26% 48.68% 19.35% 0.00% 25.87% 33.54%
fr 65.73% 50.35% 20.85% 37.30% 0.00% 43.56%
avg 51.92% 52.80% 36.91% 41.82% 33.55% 43.40%

XLM-RoBERTa

@TODO: CHECK WHY XLM-R shows such poor performance

Mean Pooling Strategy is the strategy with the best performance. Refer to this issue for comparison when using CLS Pooling Strategy. Input: MAX_LEN = 40, Mean Pooling Strategy

sh similarity_XLM-R_batch.sh 40 mean True 

Output:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   91.31%   97.64%   98.37%   94.34%   95.41%
de    93.84%    0.00%   88.05%   92.87%   95.40%   92.54%
en    91.21%   77.79%    0.00%   72.56%   94.77%   84.08%
es    95.47%   93.14%   61.84%    0.00%   90.28%   85.18%
fr    91.97%   81.25%   77.12%   71.96%    0.00%   80.58%
avg   93.12%   85.87%   81.16%   83.94%   93.70%   87.56%

Input: MAX_LEN = 100, Mean Pooling Strategy

sh similarity_XLM-R_batch.sh 100 mean True 

Output:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   91.24%   97.77%   97.80%   92.11%   94.73%
de    94.31%    0.00%   86.55%   92.44%   94.67%   91.99%
en    91.08%   74.93%    0.00%   71.56%   77.69%   78.81%
es    95.37%   88.38%   58.54%    0.00%   90.38%   83.17%
fr    92.71%   91.58%   87.41%   89.14%    0.00%   90.21%
avg   93.36%   86.53%   82.57%   87.74%   88.71%   87.78%

In the following, we show how to replicate such results.

General

Get data

For linux users: Use script wmt.sh

sh wmt.sh 

Install tools

We assume that tools like torch, tqdm, etc are already installed. On Google Collab, it is the case. (cf. examples --coming soon)
Use script install-tools.sh

sh install-tools.sh 

Preprocess data (not necessary)

When preprocessing data prior to encoding, we experienced lower performance. Hence, we do not recommend to do it. If one still wants to preprocess data, please use prepare-data-wmt bash file. Then, some small changes need to be added, so that input file can be found in the proper directory. We may add an argument for --input_file_name in the future.

Use script prepare-data-wmt.sh

sh prepare-data-wmt.sh

Replicate Results: Sentence-transformers

At the moment, sentence-transformers on "distiluse-base-multilingual-cased" only covers Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
Hence, Czech is not covered. We still decide to print the errors for language Czech for the seek of experiment on pure Zero-Shot learning. However, for comparison with baseline, we focus on other pairs of languages.

Run similarity search

sh similarity_sentenceBERT.sh

Replicate Results: XLM

Run similarity search

In this section, we still need to update source code to make it faster. At the moment, we iterate over each line and build encoding on the fly which takes too much time.

MAX_LEN is an integer parameter that is describing the number of tokens to consider when encoding.
For each sentence with less tokens than MAX_LEN, we use zero-padding.
For each sentence with more token than MAX_LEN, we ignore all tokens after MAX_LEN index.
Default value is 40. Based on prior statistics on preprocessed data for French, we have:

  • mean_value of length sentence: 26.6
  • std_value of length sentence: 15.4
  • max_value of length sentence: 145

We recommend to increase MAX_LEN for experiments. Larger MAX_LEN gives better performance but slower computation.

sh similarity_XLM.sh MAX_LEN

Replicate Results: XLM-R

Run similarity search

In this section, we still need to update source code to make it faster. At the moment, we iterate over each line and build encoding on the fly which takes too much time.

Parameters:

  • --max_len: maximum length of the sentences
  • --pooling_strat: pooling strategy in the set {cls, mean} at the moment. Needs to be optimized. Default value is cls embedding.
  • --gpu: Use GPU if --gpu set to True and GPU support in your machine
sh similarity_XLM-R.sh MAX_LEN POOLING_STRAT

Batch version:

sh similarity_XLM-R_batch.sh MAX_LEN POOLING_STRAT GPU

multilingual_similarity_compare's People

Contributors

mastafaf avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

multilingual_similarity_compare's Issues

Test XLM-R with mean pooling after solving issue on mean pooling

Input:

sh similarity_XLM-R_batch.sh 40 mean True 

Output:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   91.31%   97.64%   98.37%   94.34%   95.41%
de    93.84%    0.00%   88.05%   92.87%   95.40%   92.54%
en    91.21%   77.79%    0.00%   72.56%   94.77%   84.08%
es    95.47%   93.14%   61.84%    0.00%   90.28%   85.18%
fr    91.97%   81.25%   77.12%   71.96%    0.00%   80.58%
avg   93.12%   85.87%   81.16%   83.94%   93.70%   87.56%

Input:

sh similarity_XLM-R_batch.sh 100 mean True 

Output:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   91.24%   97.77%   97.80%   92.11%   94.73%
de    94.31%    0.00%   86.55%   92.44%   94.67%   91.99%
en    91.08%   74.93%    0.00%   71.56%   77.69%   78.81%
es    95.37%   88.38%   58.54%    0.00%   90.38%   83.17%
fr    92.71%   91.58%   87.41%   89.14%    0.00%   90.21%
avg   93.36%   86.53%   82.57%   87.74%   88.71%   87.78%

Why is mean pooling giving such bad performance?

! sh similarity_distilBERT_batch.sh 50 mean True 

Output:

Confusion matrix:
langs   cs       de       en       es       fr       avg     
cs     0.00%   99.97%   99.97%   99.97%   99.97%   99.97%
de   100.00%    0.00%  100.00%  100.00%  100.00%  100.00%
en    99.97%   99.97%    0.00%   99.97%   99.97%   99.97%
es    99.97%   99.97%   99.93%    0.00%  100.00%   99.97%
fr    99.97%   99.97%   99.97%   99.97%    0.00%   99.97%
avg   99.98%   99.97%   99.97%   99.98%   99.98%   99.97%

Time comparison between GPU and CPU

  • distilBERT with CPU, MAX_LEN = 100
    100% 3003/3003 [06:54<00:00, 7.30it/s]
    100% 3003/3003 [07:15<00:00, 7.25it/s]
    100% 3003/3003 [07:06<00:00, 7.75it/s]
    100% 3003/3003 [06:35<00:00, 7.91it/s]
    100% 3003/3003 [06:31<00:00, 8.07it/s]

  • distilBERT with GPU, MAX_LEN = 100:
    LENGTH OF TOTAL DOCUMENT 3003
    -- Encoder: 3003 sentences in 35s
    LENGTH OF TOTAL DOCUMENT 3003
    -- Encoder: 3003 sentences in 35s
    LENGTH OF TOTAL DOCUMENT 3003
    -- Encoder: 3003 sentences in 35s
    LENGTH OF TOTAL DOCUMENT 3003
    --Encoder: 3003 sentences in 35s
    LENGTH OF TOTAL DOCUMENT 3003
    -- Encoder: 3003 sentences in 34s

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.