
sacrebleu's Introduction

sacreBLEU


SacreBLEU (Post, 2018) provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.

The official version is hosted at https://github.com/mjpost/sacrebleu.

Motivation

Comparing BLEU scores is harder than it should be. Every decoder has its own implementation, often borrowed from Moses, but maybe with subtle changes. Moses itself has a number of implementations as standalone scripts, with little indication of how they differ (note: they mostly don't, but multi-bleu.pl expects tokenized input). Different flags passed to each of these scripts can produce wide swings in the final score. All of these may handle tokenization in different ways. On top of this, downloading and managing test sets is a moderate annoyance.

Sacre bleu! What a mess.

SacreBLEU aims to solve these problems by wrapping the original reference implementation (Papineni et al., 2002) together with other useful features. The defaults are set the way that BLEU should be computed, and furthermore, the script outputs a short version string that allows others to know exactly what you did. As an added bonus, it automatically downloads and manages test sets for you, so that you can simply tell it to score against wmt14, without having to hunt down a path on your local file system. It is all designed to take BLEU a little more seriously. After all, even with all its problems, BLEU is the default and---admit it---well-loved metric of our entire research community. Sacre BLEU.

Features

  • It automatically downloads common WMT test sets and processes them to plain text
  • It produces a short version string that facilitates cross-paper comparisons
  • It properly computes scores on detokenized outputs, using WMT (Conference on Machine Translation) standard tokenization
  • It produces the same values as the official script (mteval-v13a.pl) used by WMT
  • It outputs the BLEU score without the comma, so you don't have to remove it with sed (Looking at you, multi-bleu.perl)
  • It supports different tokenizers for BLEU including support for Japanese and Chinese
  • It supports chrF, chrF++ and Translation error rate (TER) metrics
  • It performs paired bootstrap resampling and paired approximate randomization tests for statistical significance reporting

Breaking Changes

v2.0.0

As of v2.0.0, the default output format has changed to JSON for a less painful parsing experience. This means that software that parses the output of sacreBLEU should be modified to either (i) parse the JSON, for example with the jq utility, or (ii) pass -f text to sacreBLEU to preserve the old textual output. The latter can also be made persistent by exporting SACREBLEU_FORMAT=text in the relevant shell configuration files.

Here's an example of parsing the score key of the JSON output using jq:

$ sacrebleu -i output.detok.txt -t wmt17 -l en-de | jq -r .score
20.8
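
If you prefer to stay in Python, the same JSON can be parsed with the standard library. This is a minimal sketch (not part of sacreBLEU itself), assuming the sacrebleu executable is on your PATH and using the same placeholder file names:

import json
import subprocess

# Run sacreBLEU and capture its default JSON output (v2.0.0+)
result = subprocess.run(
    ["sacrebleu", "-i", "output.detok.txt", "-t", "wmt17", "-l", "en-de"],
    capture_output=True, text=True, check=True,
)
scores = json.loads(result.stdout)
print(scores["score"], scores["signature"])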

Installation

Install the official Python module from PyPI (Python>=3.6 only):

pip install sacrebleu

To install Japanese tokenizer support through mecab-python3, run the following command instead, which performs a full installation with dependencies:

pip install "sacrebleu[ja]"

To install Korean tokenizer support through pymecab-ko, run the following command instead, which performs a full installation with dependencies:

pip install "sacrebleu[ko]"

Command-line Usage

You can get a list of available test sets with sacrebleu --list. Please see DATASETS.md for an up-to-date list of supported datasets. You can also list available test sets for a given language pair with sacrebleu --list -l en-fr.

Basics

Downloading test sets

Downloading is triggered when you request a test set. If the dataset is not available, it is downloaded and unpacked.

E.g., you can use the following commands to download the source, pass it through your translation system in translate.sh, and then score it:

$ sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en
$ cat wmt17.en-de.en | translate.sh | sacrebleu -t wmt17 -l en-de

Some test sets also contain the outputs of systems that were submitted to the task, for example the wmt21/systems test set.

$ sacrebleu -t wmt21/systems -l zh-en --echo NiuTrans

This provides a convenient way to score:

$ sacrebleu -t wmt21/systems -l zh-en --echo NiuTrans | sacrebleu -t wmt21/systems -l zh-en

You can see a list of the available outputs by passing an invalid value to `--echo`.

JSON output

As of version `>=2.0.0`, sacreBLEU prints the computed scores in JSON format to make parsing less painful:

$ sacrebleu -i output.detok.txt -t wmt17 -l en-de


{
 "name": "BLEU",
 "score": 20.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}

If you want to keep the old behavior, you can pass -f text or export SACREBLEU_FORMAT=text:

$ sacrebleu -i output.detok.txt -t wmt17 -l en-de -f text
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

Scoring

(All examples below assume old-style text output for a compact representation that saves space.)

Let's say that you just translated the en-de test set of WMT17 with your fancy MT system and the detokenized translations are in a file called output.detok.txt:

# Option 1: Redirect system output to STDIN
$ cat output.detok.txt | sacrebleu -t wmt17 -l en-de
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

# Option 2: Use the --input/-i argument
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

You can obtain a short version of the signature with --short/-sh:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -sh
BLEU|#:1|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

If you only want the score to be printed, you can use the --score-only/-b flag:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -b
20.8

The precision of the scores can be configured via the --width/-w flag:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -b -w 4
20.7965

Using your own reference file

SacreBLEU knows about common test sets (as detailed in the --list example above), but you can also use it to score system outputs with arbitrary references. In this case, do not forget to provide detokenized reference and hypotheses files:

# Let's save the reference to a text file
$ sacrebleu -t wmt17 -l en-de --echo ref > ref.detok.txt

# Option 1: Pass the reference file as a positional argument to sacreBLEU
$ sacrebleu ref.detok.txt -i output.detok.txt -m bleu -b -w 4
20.7965

# Option 2: Redirect the system into STDIN (Compatible with multi-bleu.perl way of doing things)
$ cat output.detok.txt | sacrebleu ref.detok.txt -m bleu -b -w 4
20.7965
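
The same can be done from Python. Here is a minimal sketch using the object-oriented API described further below; the file names are the ones from the shell example, and the files are assumed to be line-aligned and detokenized:

from sacrebleu.metrics import BLEU

# Read line-aligned, detokenized hypotheses and references
with open("output.detok.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("ref.detok.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

bleu = BLEU()
# corpus_score() expects a list of reference streams, hence the extra list around refs
print(bleu.corpus_score(hyps, [refs]))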

Using multiple metrics

Let's first compute BLEU, chrF and TER with the default settings:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf ter
        BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 <stripped>
      chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 52.0
TER|nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 69.0

Let's now enable chrF++, which is a revised version of chrF that also takes word n-grams into account. Observe how nw:0 changes to nw:2 in the signature:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf ter --chrf-word-order 2
        BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 <stripped>
    chrF2++|nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0 = 49.0
TER|nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 69.0

Metric-specific arguments are detailed in the output of --help:

BLEU related arguments:
  --smooth-method {none,floor,add-k,exp}, -s {none,floor,add-k,exp}
                        Smoothing method: exponential decay, floor (increment zero counts), add-k (increment num/denom by k for n>1), or none. (Default: exp)
  --smooth-value BLEU_SMOOTH_VALUE, -sv BLEU_SMOOTH_VALUE
                        The smoothing value. Only valid for floor and add-k. (Defaults: floor: 0.1, add-k: 1)
  --tokenize {none,zh,13a,char,intl,ja-mecab,ko-mecab}, -tok {none,zh,13a,char,intl,ja-mecab,ko-mecab}
                        Tokenization method to use for BLEU. If not provided, defaults to `zh` for Chinese, `ja-mecab` for Japanese, `ko-mecab` for Korean and `13a` (mteval) otherwise.
  --lowercase, -lc      If True, enables case-insensitivity. (Default: False)
  --force               Insist that your tokenized input is actually detokenized.

chrF related arguments:
  --chrf-char-order CHRF_CHAR_ORDER, -cc CHRF_CHAR_ORDER
                        Character n-gram order. (Default: 6)
  --chrf-word-order CHRF_WORD_ORDER, -cw CHRF_WORD_ORDER
                        Word n-gram order (Default: 0). If equals to 2, the metric is referred to as chrF++.
  --chrf-beta CHRF_BETA
                        Determine the importance of recall w.r.t precision. (Default: 2)
  --chrf-whitespace     Include whitespaces when extracting character n-grams. (Default: False)
  --chrf-lowercase      Enable case-insensitivity. (Default: False)
  --chrf-eps-smoothing  Enables epsilon smoothing similar to chrF++.py, NLTK and Moses; instead of effective order smoothing. (Default: False)

TER related arguments (The defaults replicate TERCOM's behavior):
  --ter-case-sensitive  Enables case sensitivity (Default: False)
  --ter-asian-support   Enables special treatment of Asian characters (Default: False)
  --ter-no-punct        Removes punctuation. (Default: False)
  --ter-normalized      Applies basic normalization and tokenization. (Default: False)

Version Signatures

As you may have noticed, sacreBLEU generates version strings such as BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 for reproducibility reasons. It's strongly recommended to share these signatures in your papers!

Outputting other metadata

Sacrebleu knows about metadata for some test sets, and you can output it like this:

$ sacrebleu -t wmt21 -l en-de --echo src docid ref | head -n 2
Couple MACED at California dog park for not wearing face masks while having lunch (VIDEO) - RT USA News	rt.com.131279	Paar in Hundepark in Kalifornien mit Pfefferspray besprüht, weil es beim Mittagessen keine Masken trug (VIDEO) - RT USA News
There's mask-shaming and then there's full on assault.	rt.com.131279	Masken-Shaming ist eine Sache, Körperverletzung eine andere.

If multiple fields are requested, they are output as tab-separated columns (a TSV).

To see the available fields, add --echo asdf (or some other garbage data):

$ sacrebleu -t wmt21 -l en-de --echo asdf
sacreBLEU: No such field asdf in test set wmt21 for language pair en-de.
sacreBLEU: available fields for wmt21/en-de: src, ref:A, ref, docid, origlang

Translationese Support

If you are interested in the translationese effect, you can evaluate BLEU on a subset of sentences with a given original language (identified based on the origlang tag in the raw SGM files). E.g., to evaluate only against originally German sentences translated to English use:

$ sacrebleu -t wmt13 -l de-en --origlang=de -i my-wmt13-output.txt

and to evaluate against the complement (i.e. everything that was not originally German) use:

$ sacrebleu -t wmt13 -l de-en --origlang=non-de -i my-wmt13-output.txt

Please note that the evaluator will return a BLEU score only on the requested subset, but it expects you to pass in the entire translated test set.

Languages & Preprocessing

BLEU

  • You can compute case-insensitive BLEU by passing --lowercase to sacreBLEU
  • The default tokenizer for BLEU is 13a which mimics the mteval-v13a script from Moses.
  • Other tokenizers are:
    • none which will not apply any kind of tokenization at all
    • char for language-agnostic character-level tokenization
    • intl applies international tokenization and mimics the mteval-v14 script from Moses
    • zh separates out Chinese characters and tokenizes the non-Chinese parts using 13a tokenizer
    • ja-mecab tokenizes Japanese inputs using the MeCab morphological analyzer
    • ko-mecab tokenizes Korean inputs using the MeCab-ko morphological analyzer
    • flores101 and flores200 use the SentencePiece models built from the Flores-101 and Flores-200 datasets, respectively. Note: the canonical .spm file will be automatically fetched if not found locally.
  • You can switch tokenizers using the --tokenize flag of sacreBLEU. Alternatively, if you provide language-pair strings using --language-pair/-l, the zh, ja-mecab and ko-mecab tokenizers will be selected automatically if the target language is zh, ja or ko, respectively.
  • Note that there is no automatic language detection from the hypotheses, so you need to make sure that you select the correct tokenizer for Japanese, Korean and Chinese.

The default 13a tokenizer will produce poor results for Japanese:

$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja -b
2.1

Let's use the ja-mecab tokenizer:

$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja --tokenize ja-mecab -b
14.5

If you provide the language-pair, sacreBLEU will use ja-mecab automatically:

$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja -l en-ja -b
14.5
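
The same tokenizer selection is available from the Python API. A minimal sketch, assuming the BLEU class accepts tokenize and trg_lang keyword arguments that mirror the CLI behavior (and that the [ja] extra is installed):

from sacrebleu.metrics import BLEU

with open("kyoto-test.hyp.ja", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("kyoto-test.ref.ja", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

# Request the Japanese tokenizer explicitly ...
print(BLEU(tokenize="ja-mecab").corpus_score(hyps, [refs]))

# ... or let sacreBLEU pick it based on the target language
print(BLEU(trg_lang="ja").corpus_score(hyps, [refs]))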

chrF / chrF++

chrF applies minimal to no pre-processing as it deals with character n-grams (a Python sketch of the corresponding options follows the list):

  • If you pass --chrf-whitespace, whitespace characters will be preserved when computing character n-grams.
  • If you pass --chrf-lowercase, sacreBLEU will compute case-insensitive chrF.
  • If you enable non-zero --chrf-word-order (pass 2 for chrF++), a very simple punctuation tokenization will be internally applied.
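
Here is the promised Python sketch. It assumes the CHRF class exposes keyword arguments (word_order, lowercase, whitespace) corresponding to the flags above:

from sacrebleu.metrics import CHRF

hyps = ["The cat was sitting on the mat."]
refs = [["The cat sat on the mat."]]

# Default chrF2: character n-grams only
print(CHRF().corpus_score(hyps, refs))

# chrF++: additionally use word bigrams (same as --chrf-word-order 2)
print(CHRF(word_order=2).corpus_score(hyps, refs))

# Case-insensitive variant that keeps whitespace in the character n-grams
print(CHRF(lowercase=True, whitespace=True).corpus_score(hyps, refs))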

TER

Translation Error Rate (TER) has its own special tokenizer that you can configure through the command line. The defaults replicate the behavior of the upstream TER implementation (TERCOM), but you can modify it with the following flags (a Python sketch follows the list):

  • TER is by default case-insensitive. Pass --ter-case-sensitive to enable case-sensitivity.
  • Pass --ter-normalize to apply a general Western tokenization
  • Pass --ter-asian-support to enable the tokenization of Asian characters. If provided with --ter-normalize, both will be applied.
  • Pass --ter-no-punct to strip punctuation.
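
A corresponding Python sketch, assuming the TER class takes keyword arguments (case_sensitive, normalized, no_punct, asian_support) that mirror these flags:

from sacrebleu.metrics import TER

hyps = ["The cat was sitting on the mat."]
refs = [["The cat sat on the mat."]]

# TERCOM-compatible defaults
print(TER().corpus_score(hyps, refs))

# Case-sensitive TER with basic normalization and punctuation stripped
print(TER(case_sensitive=True, normalized=True, no_punct=True).corpus_score(hyps, refs))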

Multi-reference Evaluation

All three metrics support the use of multiple references during evaluation. Let's first pass all references as positional arguments:

$ sacrebleu ref1 ref2 -i system -m bleu chrf ter
        BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 61.8 <stripped>
      chrF2|nrefs:2|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 75.0
TER|nrefs:2|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 31.2

Alternatively (though less recommended), the references can be concatenated into a single tab-delimited file. Don't forget to pass --num-refs/-nr in this case!

$ paste ref1 ref2 > refs.tsv

$ sacrebleu refs.tsv --num-refs 2 -i system -m bleu
BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 61.8 <stripped>
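
From Python, the references need to be organized per reference set rather than per sentence, so a paste-style TSV has to be transposed first. A minimal sketch, reusing the refs.tsv and system files from the shell example and assuming every line has the same number of references:

from sacrebleu.metrics import BLEU

with open("system", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]

# Each line of refs.tsv holds the tab-separated references for one sentence
with open("refs.tsv", encoding="utf-8") as f:
    per_sentence = [line.rstrip("\n").split("\t") for line in f]

# Transpose: refs[i][j] is the i-th reference of the j-th sentence
refs = [list(column) for column in zip(*per_sentence)]

print(BLEU().corpus_score(hyps, refs))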

Multi-system Evaluation

As of version >=2.0.0, SacreBLEU supports evaluation of an arbitrary number of systems for a particular test set and language-pair. This has the advantage of seeing all results in a nicely formatted table.

Let's pass all system output files that match the shell glob newstest2017.online-* to sacreBLEU for evaluation:

$ sacrebleu -t wmt17 -l en-de -i newstest2017.online-* -m bleu chrf
╒═══════════════════════════════╤════════╤═════════╕
│                        System │  BLEU  │  chrF2  │
╞═══════════════════════════════╪════════╪═════════╡
│ newstest2017.online-A.0.en-de │  20.8  │  52.0   │
├───────────────────────────────┼────────┼─────────┤
│ newstest2017.online-B.0.en-de │  26.7  │  56.3   │
├───────────────────────────────┼────────┼─────────┤
│ newstest2017.online-F.0.en-de │  15.5  │  49.3   │
├───────────────────────────────┼────────┼─────────┤
│ newstest2017.online-G.0.en-de │  18.2  │  51.6   │
╘═══════════════════════════════╧════════╧═════════╛

-----------------
Metric signatures
-----------------
 - BLEU       nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
 - chrF2      nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0

You can also change the output format to latex:

$ sacrebleu -t wmt17 -l en-de -i newstest2017.online-* -m bleu chrf -f latex
\begin{tabular}{rcc}
\toprule
                        System &  BLEU  &  chrF2  \\
\midrule
 newstest2017.online-A.0.en-de &  20.8  &  52.0   \\
 newstest2017.online-B.0.en-de &  26.7  &  56.3   \\
 newstest2017.online-F.0.en-de &  15.5  &  49.3   \\
 newstest2017.online-G.0.en-de &  18.2  &  51.6   \\
\bottomrule
\end{tabular}

...

Confidence Intervals for Single System Evaluation

When enabled with the --confidence flag, SacreBLEU will print (1) the actual system score, (2) the true mean estimated from bootstrap resampling, and (3) the 95% confidence interval around the mean. By default, the number of bootstrap resamples is 1000 (bs:1000 in the signature); it can be changed with --confidence-n:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf --confidence -f text --short
   BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 22.675 (μ = 22.669 ± 0.598) ...
chrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 51.953 (μ = 51.953 ± 0.462)

NOTE: Although provided as a functionality, having access to confidence intervals for just one system may not reveal much information about the underlying model. It often makes more sense to perform paired statistical tests across multiple systems.

NOTE: When resampling, the seed of numpy's random number generator (RNG) is fixed to 12345. If you want to set your own seed instead, export the environment variable SACREBLEU_SEED as an integer. Alternatively, you can export SACREBLEU_SEED=None to skip seeding the RNG and allow non-deterministic behavior.

Paired Significance Tests for Multi System Evaluation

Ideally, one would have access to many systems in cases such as (1) investigating whether a newly added feature yields significantly different scores than the baseline or (2) evaluating submissions for a particular shared task. SacreBLEU offers two different paired significance tests that are widely used in MT research.

Paired bootstrap resampling (--paired-bs)

This is an efficient implementation of the paper Statistical Significance Tests for Machine Translation Evaluation and is result-compliant with the reference Moses implementation. The number of bootstrap resamples can be changed with the --paired-bs-n flag and its default is 1000.

When launched, paired bootstrap resampling will perform:

  • Bootstrap resampling to estimate 95% CI for all systems and the baseline
  • A significance test between the baseline and each system to compute a p-value.

Paired approximate randomization (--paired-ar)

Paired approximate randomization (AR) is another type of paired significance test that is claimed to be more accurate than paired bootstrap resampling with respect to Type-I errors (Riezler and Maxwell III, 2005), i.e. falsely rejecting the null hypothesis when it is actually true. In other words, AR should in theory be more robust to subtle changes across systems.

Our implementation is verified to be result-compliant with the Multeval toolkit, which also uses the paired AR test for pairwise comparison. The number of approximate randomization trials is set to 10,000 by default and can be changed with the --paired-ar-n flag.

Running the tests

  • The first system provided to --input/-i will be automatically taken as the baseline system against which you want to compare other systems.
  • When --input/-i is used, the system output files will be automatically named according to the file paths. For the sake of simplicity, SacreBLEU will automatically discard the baseline system if it also appears amongst the other systems. This is useful if you would like to run the tool by passing -i systems/baseline.txt systems/*.txt: here, the baseline.txt file will not also be considered as a candidate system.
  • Alternatively, you can also use a tab-separated input file redirected to SacreBLEU. In this case, the first column hypotheses will be taken as the baseline system. However, this method is not recommended as it won't allow naming your systems in a human-readable way. It will instead enumerate the systems from 1 to N following the column order in the tab-separated input.
  • On Linux and macOS, you can launch the tests on multiple CPUs by passing the flag --paired-jobs N. If N == 0, SacreBLEU will launch one worker for each pairwise comparison; if N > 0, N worker processes will be spawned. This substantially speeds up the runtime, especially if you want the TER metric to be computed.

Example: Paired bootstrap resampling

In the example below, we select newstest2017.LIUM-NMT.4900.en-de as the baseline and compare it to 4 other WMT17 submissions using paired bootstrap resampling. According to the results, the null hypothesis (i.e. the two systems being essentially the same) could not be rejected (at the significance level of 0.05) for the following comparison:

  • 0.1 BLEU difference between the baseline and the online-B system (p = 0.3077)

$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.online-* -m bleu chrf --paired-bs
╒════════════════════════════════════════════╤═════════════════════╤══════════════════════╕
│                                     System │  BLEU (μ ± 95% CI)  │  chrF2 (μ ± 95% CI)  │
╞════════════════════════════════════════════╪═════════════════════╪══════════════════════╡
│ Baseline: newstest2017.LIUM-NMT.4900.en-de │  26.6 (26.6 ± 0.6)  │  55.9 (55.9 ± 0.5)   │
├────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│              newstest2017.online-A.0.en-de │  20.8 (20.8 ± 0.6)  │  52.0 (52.0 ± 0.4)   │
│                                            │    (p = 0.0010)*    │    (p = 0.0010)*     │
├────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│              newstest2017.online-B.0.en-de │  26.7 (26.6 ± 0.7)  │  56.3 (56.3 ± 0.5)   │
│                                            │    (p = 0.3077)     │    (p = 0.0240)*     │
├────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│              newstest2017.online-F.0.en-de │  15.5 (15.4 ± 0.5)  │  49.3 (49.3 ± 0.4)   │
│                                            │    (p = 0.0010)*    │    (p = 0.0010)*     │
├────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│              newstest2017.online-G.0.en-de │  18.2 (18.2 ± 0.5)  │  51.6 (51.6 ± 0.4)   │
│                                            │    (p = 0.0010)*    │    (p = 0.0010)*     │
╘════════════════════════════════════════════╧═════════════════════╧══════════════════════╛

------------------------------------------------------------
Paired bootstrap resampling test with 1000 resampling trials
------------------------------------------------------------
 - Each system is pairwise compared to Baseline: newstest2017.LIUM-NMT.4900.en-de.
   Actual system score / bootstrap estimated true mean / 95% CI are provided for each metric.

 - Null hypothesis: the system and the baseline translations are essentially
   generated by the same underlying process. For a given system and the baseline,
   the p-value is roughly the probability of the absolute score difference (delta)
   or higher occurring due to chance, under the assumption that the null hypothesis is correct.

 - Assuming a significance threshold of 0.05, the null hypothesis can be rejected
   for p-values < 0.05 (marked with "*"). This means that the delta is unlikely to be attributed
   to chance, hence the system is significantly "different" than the baseline.
   Otherwise, the p-values are highlighted in red.

 - NOTE: Significance does not tell whether a system is "better" than the baseline but rather
   emphasizes the "difference" of the systems in terms of the replicability of the delta.

-----------------
Metric signatures
-----------------
 - BLEU       nrefs:1|bs:1000|seed:12345|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
 - chrF2      nrefs:1|bs:1000|seed:12345|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0

Example: Paired approximate randomization

Let's now run the paired approximate randomization test for the same comparison. The findings are compatible with the paired bootstrap resampling test. However, the p-value for the baseline vs. online-B comparison is much higher (0.8066) than with paired bootstrap resampling.

(Note that the AR test does not provide confidence intervals around the true mean as it does not perform bootstrap resampling.)

$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.online-* -m bleu chrf --paired-ar
╒════════════════════════════════════════════╤═══════════════╤═══════════════╕
│                                     System │     BLEU      │     chrF2     │
╞════════════════════════════════════════════╪═══════════════╪═══════════════╡
│ Baseline: newstest2017.LIUM-NMT.4900.en-de │     26.6      │     55.9      │
├────────────────────────────────────────────┼───────────────┼───────────────┤
│              newstest2017.online-A.0.en-de │     20.8      │     52.0      │
│                                            │ (p = 0.0001)* │ (p = 0.0001)* │
├────────────────────────────────────────────┼───────────────┼───────────────┤
│              newstest2017.online-B.0.en-de │     26.7      │     56.3      │
│                                            │ (p = 0.8066)  │ (p = 0.0385)* │
├────────────────────────────────────────────┼───────────────┼───────────────┤
│              newstest2017.online-F.0.en-de │     15.5      │     49.3      │
│                                            │ (p = 0.0001)* │ (p = 0.0001)* │
├────────────────────────────────────────────┼───────────────┼───────────────┤
│              newstest2017.online-G.0.en-de │     18.2      │     51.6      │
│                                            │ (p = 0.0001)* │ (p = 0.0001)* │
╘════════════════════════════════════════════╧═══════════════╧═══════════════╛

-------------------------------------------------------
Paired approximate randomization test with 10000 trials
-------------------------------------------------------
 - Each system is pairwise compared to Baseline: newstest2017.LIUM-NMT.4900.en-de.
   Actual system score is provided for each metric.

 - Null hypothesis: the system and the baseline translations are essentially
   generated by the same underlying process. For a given system and the baseline,
   the p-value is roughly the probability of the absolute score difference (delta)
   or higher occurring due to chance, under the assumption that the null hypothesis is correct.

 - Assuming a significance threshold of 0.05, the null hypothesis can be rejected
   for p-values < 0.05 (marked with "*"). This means that the delta is unlikely to be attributed
   to chance, hence the system is significantly "different" than the baseline.
   Otherwise, the p-values are highlighted in red.

 - NOTE: Significance does not tell whether a system is "better" than the baseline but rather
   emphasizes the "difference" of the systems in terms of the replicability of the delta.

-----------------
Metric signatures
-----------------
 - BLEU       nrefs:1|ar:10000|seed:12345|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
 - chrF2      nrefs:1|ar:10000|seed:12345|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0

Using SacreBLEU from Python

For evaluation, it may be useful to compute BLEU, chrF or TER from a Python script. The recommended way of doing this is to use the object-oriented API, by creating an instance of the metrics.BLEU class for example:

In [1]: from sacrebleu.metrics import BLEU, CHRF, TER
   ...:
   ...: refs = [ # First set of references
   ...:          ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
   ...:          # Second set of references
   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
   ...:        ]
   ...: sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

In [2]: bleu = BLEU()

In [3]: bleu.corpus_score(sys, refs)
Out[3]: BLEU = 48.53 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)

In [4]: bleu.get_signature()
Out[4]: nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0

In [5]: chrf = CHRF()

In [6]: chrf.corpus_score(sys, refs)
Out[6]: chrF2 = 59.73
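
The metric objects also provide a sentence_score() method for scoring a single hypothesis against its references. A minimal, self-contained sketch; since BLEU is a corpus-level metric, effective n-gram order (the eff field in the signatures above) is enabled here for single segments:

from sacrebleu.metrics import BLEU, CHRF

hyp = "It wasn't surprising."
sent_refs = ["It was not unexpected.", "No one was surprised."]

# Effective order avoids zero higher-order precisions for short segments (eff:yes)
bleu = BLEU(effective_order=True)
print(bleu.sentence_score(hyp, sent_refs))

print(CHRF().sentence_score(hyp, sent_refs))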

Variable Number of References

Let's now remove the first reference sentence for the first system sentence The dog bit the man. by replacing it with either None or the empty string ''. This allows using a variable number of reference segments per hypothesis. Observe how the signature changes from nrefs:2 to nrefs:var:

In [1]: from sacrebleu.metrics import BLEU, CHRF, TER
   ...:
   ...: refs = [ # First set of references
   ...:          # 1st sentence does not have a ref here
   ...:          ['', 'It was not unexpected.', 'The man bit him first.'],
   ...:          # Second set of references
   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
   ...:        ]
   ...: sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

In [2]: bleu = BLEU()

In [3]: bleu.corpus_score(sys, refs)
Out[3]: BLEU = 29.44 82.4/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)

In [4]: bleu.get_signature()
Out[4]: nrefs:var|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0

Compatibility API

You can also use the compatibility API, which provides wrapper functions around the object-oriented API to compute sentence-level and corpus-level BLEU, chrF and TER. (Note that this API may be removed in future releases.)

In [1]: import sacrebleu
   ...: 
   ...: refs = [ # First set of references
   ...:          ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
   ...:          # Second set of references
   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
   ...:        ]
   ...: sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

In [2]: sacrebleu.corpus_bleu(sys, refs)
Out[2]: BLEU = 48.53 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)
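
Besides corpus_bleu, the compatibility layer also wraps the other metrics and sentence-level scoring. The exact set of wrapper functions may vary across versions, so treat the following as a sketch continuing the session above (outputs omitted):

# Corpus-level chrF and TER through the compatibility API
sacrebleu.corpus_chrf(sys, refs)
sacrebleu.corpus_ter(sys, refs)

# Sentence-level scoring takes a single hypothesis and a flat list of references
sacrebleu.sentence_bleu(sys[1], [refs[0][1], refs[1][1]])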

License

SacreBLEU is licensed under the Apache 2.0 License.

Credits

This was all Rico Sennrich's idea. Originally written by Matt Post. New features and ongoing support provided by Martin Popel (@martinpopel) and Ozan Caglayan (@ozancaglayan).

If you use SacreBLEU, please cite the following:

@inproceedings{post-2018-call,
  title = "A Call for Clarity in Reporting {BLEU} Scores",
  author = "Post, Matt",
  booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers",
  month = oct,
  year = "2018",
  address = "Belgium, Brussels",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/W18-6319",
  pages = "186--191",
}

Release Notes

Please see CHANGELOG.md for release notes.

sacrebleu's People

Contributors

abcdenis, ales-t, cfedermann, dustalov, guitaricet, heiwais25, hmaarrfk, jhcross, loicbarrault, louismartin, martinpopel, mayhewsw, me-manikanta, mjpost, morinoseimorizo, mtresearcher, neubig, ozancaglayan, pmichel31415, pogayo, polm, rbawden, sn1c, sukuya, thammegowda, tholiao, thomaszen, tirkarthi, tuetschek, zjaume


sacrebleu's Issues

refactor tokenization classes

Tokenization is all hacked together with a dictionary of functions. This has caused problems with the Mecab tokenizer. It would be good to factor this out into a simple class hierarchy.

Documentation error

Documentation for extract_ngrams suggests two arguments:

In [26]: print(sacrebleu.extract_ngrams.__doc__)                                                                       
Extracts all the ngrams (1 <= n <= NGRAM_ORDER) from a sequence of tokens.

    :param line: a segment containing a sequence of words
    :param max_order: collect n-grams from 1<=n<=max
    :return: a dictionary containing ngrams and counts

but it looks like it actually needs a min and a max (just providing min appears to give max a default of 4).

Add version information in __version__

Expected behavior of the dunder version attribute (PEP 396; not a must, but common in most libraries):

>>> import sacrebleu
>>> sacrebleu.__version__
'1.4.4'

Currently,

>>> import sacrebleu
>>> sacrebleu.__version__
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<...> in <module>
      1 import sacrebleu
----> 2 sacrebleu.__version__

AttributeError: module 'sacrebleu' has no attribute '__version__'

but the version information is kept in the sacrebleu.VERSION attribute:

>>> import sacrebleu
>>> sacrebleu.VERSION
'1.4.4'

v1.4.6 broke some imports

Previously, I relied on being able to import DATASETS, SACREBLEU_DIR, _clean, and smart_open from sacrebleu. The refactor in #66 (9baed3e) broke the latter three imports.

While I admit that relying on underscore function imports is hacky, I think SACREBLEU_DIR is clearly something that should be importable. The case of smart_open is more ambiguous: there is a reasonable argument to be made for keeping it "under the hood", and the code is trivial to duplicate, but the backwards incompatibility is both severe and unexpected.

Add sanity check for cached test sets

I interrupted sacreBLEU during the download of a WMT16 test set and did not realize that it caches results, including incomplete ones. Later usage kept using only 155 lines of the partially downloaded test set. Maybe sacreBLEU should check whether the cached version has the required number of lines?

Running two or more instances of sacrebleu in a pipe does not lock temp files

Hi,
When I run a command where sacrebleu is used multiple times like this (I do this all the time):

./sacrebleu.py -t wmt16 -l ende --echo src | ./generic_translator | ./sacrebleu.py -t wmt16 -l ende

the two instances may overwrite each other's temporary files when the test set is being downloaded for the first time. Both instances then start downloading, unpacking, etc. They may end up corrupting the files.

sentence_bleu return type change

Quick question:

since this PR:
061175e

sentence_bleu no longer returns a float but a BLEU object, unlike sentence_chrf which returns a float. Is this API change intended?

As context this will break the Reranker in Sockeye. We can of course branch around whether BLEU or chrf is used, but before we do this I wanted to check whether this was intended.

Cleaner checking of Chinese charset

Maybe this would be easier to maintain for checking the Chinese character set; the core function doesn't change if we add ranges to or remove them from the charset:

chinese_charset = [
(u'\u3400', u'\u4db5'),  # CJK Unified Ideographs Extension A, release 3.0
(u'\u4e00', u'\u9fa5'),  # CJK Unified Ideographs, release 1.1
(u'\u9fa6', u'\u9fbb'),  # CJK Unified Ideographs, release 4.1
(u'\uf900', u'\ufa2d'),  # CJK Compatibility Ideographs, release 1.1
(u'\ufa30', u'\ufa6a'),  # CJK Compatibility Ideographs, release 3.2
(u'\ufa70', u'\ufad9'),  # CJK Compatibility Ideographs, release 4.1
(u'\u20000', u'\u2a6d6'),  # (UTF16) CJK Unified Ideographs Extension B, release 3.1
(u'\u2f800', u'\u2fa1d'),  # (UTF16) CJK Compatibility Supplement, release 3.1
(u'\uff00', u'\uffef'),  # Full width ASCII, full width of English punctuation, half width Katakana, half wide half width kana, Korean alphabet
(u'\u2e80', u'\u2eff'),  # CJK Radicals Supplement
(u'\u3000', u'\u303f'),  # CJK punctuation mark
(u'\u31c0', u'\u31ef'),  # CJK stroke  
(u'\u2f00', u'\u2fdf'),  # Kangxi Radicals
(u'\u2ff0', u'\u2fff'),  # Chinese character structure
(u'\u3100', u'\u312f'),  # Phonetic symbols
(u'\u31a0', u'\u31bf'),  # Phonetic symbols (Taiwanese and Hakka expansion)
(u'\ufe10', u'\ufe1f'),
(u'\ufe30', u'\ufe4f'),
(u'\u2600', u'\u26ff'),
(u'\u2700', u'\u27bf'),
(u'\u3200', u'\u32ff'),
(u'\u3300', u'\u33ff')
]

def isChineseChar(uchar):
    for start, end in chinese_charset:
        if start <= uchar <= end:
            return True
    return False

Compatibility Issue w/ mecab-python3==1.0.0

I get the following error when using sacrebleu==1.4.10 with -tok ja-mecab and mecab-python3==1.0.0:

Failed when trying to initialize MeCab. Some things to check:

    - If you are not using a wheel, do you have mecab installed?

    - Do you have a dictionary installed? If not do this:

        pip install unidic-lite

    - If on Windows make sure you have this installed:

        https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads

    - Try creating a Model with the same arguments as your Tagger; that may
      give a more descriptive error message.

If you are still having trouble, please file an issue here:

    https://github.com/SamuraiT/mecab-python3/issues
Traceback (most recent call last):
  File "env/bin/sacrebleu", line 8, in <module>
    sys.exit(main())
  File "env/lib/python3.5/site-packages/sacrebleu/sacrebleu.py", line 1003, in main
    bleu = corpus_bleu(system, refs, smooth_method=args.smooth, smooth_value=args.smooth_value, force=args.force, lowercase=args.lc, tokenize=args.tokenize)
  File "env/lib/python3.5/site-packages/sacrebleu/sacrebleu.py", line 637, in corpus_bleu
    output, *refs = [TOKENIZERS[tokenize](x.rstrip()) for x in lines]
  File "env/lib/python3.5/site-packages/sacrebleu/sacrebleu.py", line 637, in <listcomp>
    output, *refs = [TOKENIZERS[tokenize](x.rstrip()) for x in lines]
  File "env/lib/python3.5/site-packages/sacrebleu/tokenizer.py", line 253, in tokenize
    self.load()
  File "env/lib/python3.5/site-packages/sacrebleu/tokenizer.py", line 239, in load
    self.tagger = MeCab.Tagger("-Owakati")
  File "env/lib/python3.5/site-packages/MeCab/__init__.py", line 105, in __init__
    super(Tagger, self).__init__(args)
RuntimeError

Installing unidic-lite results in a failing assertion:

Traceback (most recent call last):
  File "env/bin/sacrebleu", line 8, in <module>
    sys.exit(main())
  File "env/lib/python3.5/site-packages/sacrebleu/sacrebleu.py", line 1003, in main
    bleu = corpus_bleu(system, refs, smooth_method=args.smooth, smooth_value=args.smooth_value, force=args.force, lowercase=args.lc, tokenize=args.tokenize)
  File "env/lib/python3.5/site-packages/sacrebleu/sacrebleu.py", line 637, in corpus_bleu
    output, *refs = [TOKENIZERS[tokenize](x.rstrip()) for x in lines]
  File "env/lib/python3.5/site-packages/sacrebleu/sacrebleu.py", line 637, in <listcomp>
    output, *refs = [TOKENIZERS[tokenize](x.rstrip()) for x in lines]
  File "env/lib/python3.5/site-packages/sacrebleu/tokenizer.py", line 253, in tokenize
    self.load()
  File "env/lib/python3.5/site-packages/sacrebleu/tokenizer.py", line 242, in load
    assert d.size == 392126, "Please make sure to use IPA dictionary for MeCab"
AssertionError: Please make sure to use IPA dictionary for MeCab

With mecab-python3==0.996.5 it works just fine.

Scores per item

It would be useful to have the individual scores for each hypothesis.

I know that the sentence_bleu function says that "computing BLEU on the sentence level is not its intended use, BLEU is a corpus-level metric", but it might be useful to check for outliers.

Porting sacrebleu to linux package manager(s)

Hello, thank you very much for developing this package, it is super useful!

I noticed that sacrebleu has 46 releases currently and probably has a fast development rate. I was wondering if you might be interested in porting sacrebleu into user-based linux package repositories (on top of pip). Compared to installing/updating separately with pip, this would make keeping everything up-to-date much easier by simply using your standard package manager.

If this is of interest, I would be glad to port sacrebleu into the Arch Linux User Repository (AUR) here. I can also make a PR to add some scripts that will keep this updated every time there is a new release on your repository.

[Question] How to interpret the evaluation metric for translation?

I can find the evaluation metric in the README file as below.
BLEU+case.mixed+lang.de-en+test.wmt17 = 32.97 66.1/40.2/26.6/18.1 (BP = 0.980 ratio = 0.980 hyp_len = 63134 ref_len = 64399)

For 32.97, I think that it is a cased BLEU4.
I wonder about the meaning of the other performance metrics.

Is 66.1/40.2/26.6/18.1 also a BLEU score? Could you explain the difference between these BLEU scores and BLEU4(32.97)?
Could you explain the meaning of BP and ratio?

International tokenization

First congrats: great idea, Rico! great implementation (and name), Matt.

The best practice when using mteval (by the way the newest version is mteval-v14.pl, the differences from v13a are not important, just the number is nicer) is to use --international-tokenization.
Of course, there won't be ever a consensus on the "ideal" tokenization (and Chinese, Japanese, Thai etc. will need a special approach anyway) and BLEU needs just "reasonable enough", but consistent tokenization. However, without the international-tokenization the correlation of BLEU with humans is much lower for languages with non-ascii punctuation, which includes also English with “typographic” quotes.

If you are interested I can send a PR with Python implementation of BLEU's international-tokenization.

Perhaps operate on arbitrary sequences?

Sometimes one doesn't have a string, but a sequence of e.g. integers, maybe corresponding to torch.argmax or the like. AFAIK there's no reason to limit the BLEU calculations to strings, it generalizes perfectly to sequences (any type that supports sequential indexing) of any type for which equality is defined. Might be worthwhile to call list() on the inputs right when they're passed in: keeps all functionality but can be run on other sequence-similarity-ish things.
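
(A workaround that is already possible with the current string-based API, sketched here purely as an illustration rather than an official feature: map each element to a token string and disable tokenization.)

from sacrebleu.metrics import BLEU

hyp_ids = [[3, 17, 42, 9], [5, 11, 2]]
ref_ids = [[3, 17, 42, 9], [5, 12, 2]]

# Render integer sequences as whitespace-joined pseudo-sentences
hyps = [" ".join(map(str, seq)) for seq in hyp_ids]
refs = [[" ".join(map(str, seq)) for seq in ref_ids]]

# tokenize="none" keeps the whitespace-split tokens untouched
print(BLEU(tokenize="none").corpus_score(hyps, refs))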

About Translationese

Is scoring translationese ready for release? I did not find any code for this feature in sacrebleu.py except for "--origlang=de". Also, could you kindly provide the reference upon which the translationese scoring is implemented?

Most recent release only works with Python 3.6+

I installed the latest PyPI release of sacrebleu today, and got the following error:

$ sacrebleu
File "/home/user/mmueller/mtrain-toy-models/venvs/mtrain3/bin/sacrebleu", line 6, in <module>
from sacrebleu import main
File "/home/user/mmueller/mtrain-toy-models/venvs/mtrain3/lib/python3.5/site-packages/sacrebleu.py", line 1144
precisions = "/".join(f"{p:.1f}" for p in self.precisions)

After researching for a while, it turns out an expression like this only works with Python 3.6 or higher:

$ pyenv shell 3.5.0
$ python -c 'precisions = "/".join(f"{p:.1f}" for p in [3, 2, 1])'
  File "<string>", line 1
    precisions = "/".join(f"{p:.1f}" for p in [3, 2, 1])
                                   ^
SyntaxError: invalid syntax
$ pyenv shell 3.6.0
$ python -c 'precisions = "/".join(f"{p:.1f}" for p in [3, 2, 1])'
$ pyenv shell 3.7.0
$ python -c 'precisions = "/".join(f"{p:.1f}" for p in [3, 2, 1])'

@matt Can you reproduce this?

I think as many versions of Python 3 as possible should be supported.

documentation: add example of using sacrebleu as a library

Currently, it is not clear how to use sacrebleu for validation inside python code.
Docstrings for raw_corpus_bleu and corpus_bleu do not provide examples or explain what sys_stream and ref_streams actually are.

I tried to use it like this:

ref_t = [['this is an example translation', 'this is an example'], ['this is another example but only with one reference translation']]                                                                                

pred_t = ['this is an example translation', 'this is another example translation']                                                                                                                                     

sacrebleu.raw_corpus_bleu(pred_t, ref_t)
# or
sacrebleu.corpus_bleu(pred_t, ref_t)

Which is consistent with function signature

sys_stream: Union[str, Iterable[str]],
Union[str, List[Iterable[str]]]

But this error happens:
EOFError: Source and reference streams have different lengths!

Some examples or more detailed parameter explanation would really help.

And thank you for your work on standardization of BLEU computation!

UPD:
figured out how to use corpus_bleu with one reference translation for each sentence (but not multiple)

ref_t = ['this is an example translation', 'this is another example but only with one reference translation']             

pred_t = ['this is an example translation', 'this is another example translation']                                                                                                                                     

sacrebleu.corpus_bleu(pred_t, [ref_t])

wmt17/dev checksum failed

Hi,
When I test the zh-en translation on the wmt17 dev file,

cat my_file | sacrebleu -t wmt17/dev -l zh-en

it says the checksum failed. I reran multiple times; it is the same problem.
Could you please help check this problem?

sacreBLEU: Downloading http://data.statmt.org/wmt17/translation-task/dev.tgz to /var/storage/shared/sdrgvc/sys/jobs/application_1545092712078_50932/.sacrebleu/wmt17/dev/dev.tgz
sacreBLEU: Fatal: MD5 sum of downloaded file was incorrect (got 9b1aa63c1cf49dccdd20b962fe313989, expected 4a3dc2760bb077f4308cce96b06e6af6).
sacreBLEU: Please manually delete "/var/storage/shared/sdrgvc/sys/jobs/application_1545092712078_50932/.sacrebleu/wmt17/dev/dev.tgz" and rerun the command.
sacreBLEU: If the problem persists, the tarball may have changed, in which case, please contact the SacreBLEU maintainer.

Error caused by mecab-python3

I get an error trying to
pip install sacrebleu

I wonder whether it is this commit that now prevents me from installing sacrebleu on Windows either in the system install or in Anaconda.
Here's the main part of the error:

    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\David\Anaconda3\envs\OpenNMT_tf\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\David\\AppData\\Local\\Temp\\pip-install-7ez0vltn\\mecab-python3\\setup.py'"'"'; __file__='"'"'C:\\Users\\David\\AppData\\Local\\Temp\\pip-install-7ez0vltn\\mecab-python3\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\David\AppData\Local\Temp\pip-record-qbdi82e5\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\David\Anaconda3\envs\OpenNMT_tf\Include\mecab-python3'
         cwd: C:\Users\David\AppData\Local\Temp\pip-install-7ez0vltn\mecab-python3\
    Complete output (9 lines):
    running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    creating build\lib.win-amd64-3.7\MeCab
    copying src\MeCab\__init__.py -> build\lib.win-amd64-3.7\MeCab
    running build_ext
    error: [WinError 2] The system cannot find the file specified
    ----------------------------------------

> ERROR: Command errored out with exit status 1: 'C:\Users\David\Anaconda3\envs\OpenNMT_tf\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\David\\AppData\\Local\\Temp\\pip-install-7ez0vltn\\mecab-python3\\setup.py'"'"'; __file__='"'"'C:\\Users\\David\\AppData\\Local\\Temp\\pip-install-7ez0vltn\\mecab-python3\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\David\AppData\Local\Temp\pip-record-qbdi82e5\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\David\Anaconda3\envs\OpenNMT_tf\Include\mecab-python3' Check the logs for full command output.

I don't think that this is the fault of Sacrebleu though - I can't find a way to install mecab-python3 on my machine at all. I need a 64 bit python since it is required by TensorFlow that is required by OpenNMT-tf. The site for Mecab says that it is only available on 32bit Python. So if that's the case has this change prevented the use of "pip install sacrebleu" on all 64bit python installations?

Originally posted by @davidbaines in #61 (comment)

Deprecation warning due to invalid escape sequences

Deprecation warnings are raised due to invalid escape sequences. This can be fixed by using raw strings or escaping the literals. pyupgrade also helps in automatic conversion : https://github.com/asottile/pyupgrade/

find . -iname '*.py' | grep -Ev 'test.py' | xargs -P4 -I{} python3.8 -Wall -m py_compile {}
./sacrebleu/sacrebleu.py:772: DeprecationWarning: invalid escape sequence \w
  return list(filter(lambda x: re.match('\w\w\-\w\w', x), DATASETS.get(testset, {}).keys()))

feature support for the significance test by bootstrapping?

Hi, team. Thanks for the cool library.

I'm looking for the feature of computing a BLEU score with a confidence interval, like that of Moses. I noticed that a similar request has been raised in an earlier issue.

I just want to check if any new updates have been made regarding this feature in sacreBLEU.

Thank you.

automatically pick tokenizer when known

Hi, when adding the en-zh direction to my work setup, I got this message:

sacreBLEU: You should also pass "--tok zh" when scoring Chinese...

This is causing problems because it makes the command line irregular, in that for some languages, a different command-line pattern is needed. This makes it unnecessarily complex to wrap it in scripts that accept e.g. the language pair as an argument. Such scripts are often just one-line pipelines that detokenize and then pipe the result to Sacrebleu, passing some of their arguments directly down to the tools they invoke without further processing.

Could Sacrebleu please automatically choose the correct default tokenizer based on the language, whenever it has enough information to do so?

Thanks!

Speed up (w/ numpy)

Can we make SacreBLEU faster, possibly using numpy, multithreading or even GPU? And still keep it reliable and easy to install?

This issue should serve for sharing ideas and coordinating our efforts (PRs).

I am not aware of any particular numpy BLEU implementation. I just know (and I guess @mjpost too) that the chrF implementation in SacreBLEU is taken from Sockeye, but it uses List[float] instead of np.array. I am not sure whether this has any substantial impact on the speed.
I have not done profiling, but I guess most time is spent with the tokenization and maybe n-gram extraction and intersection, which could be substituted with Counter intersection similarly to the chrF implementation, supposing that Python3's Counter is C-optimized and faster.

Numpy can be useful if bootstrap resampling is added (cf. #40, #11).

The international tokenization has been optimized using lru_cache. However, there is still a cycle through all Unicode code points in _property_chars for each execution of sacrebleu, which could be prevented by adding the regex dependency (importing it conditionally, only if --tok intl is required).

(minor) factor tokenizer code

Hi, there are three tokenizers, and at least two of them (tokenize_13a() and tokenize_zh()) contain the same processing steps for tokenizing Western-script characters. The code is not even identically written, although the regexes seem identical in the end. Code duplication is dangerous, in case future modifications are ever made to this code. I suggest to factor out this code.

variable number of gold annotations

The example in the introduction assumes that there is a fixed # of gold annotations for each generated text:

import sacrebleu
refs = [['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
        ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.']]
sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
bleu = sacrebleu.corpus_bleu(sys, refs)
print(bleu.score)

Is there a call that accepts a variable number of gold texts for each generated text?

sacreBLEU silently fails on some systems (such as Red Hat scientific linux)

Hey,

sacreBLEU silently produces very short files on some systems. I believe it is some sort of locale issue, but I am not exactly sure what's happening. Here's an example:

[cs-bogo1@login-e-6 sacretest]$ LC_ALL=C.UTF-8 ../sacreBLEU/sacrebleu.py -t wmt13 -l en-de --echo ref > testvalid.de
[cs-bogo1@login-e-6 sacretest]$ LC_ALL=C.UTF-8 ../sacreBLEU/sacrebleu.py -t wmt13 -l en-de --echo src > testvalid.en
[cs-bogo1@login-e-6 sacretest]$ wc -l *
   1 testvalid.de
  70 testvalid.en
  71 total
[cs-bogo1@login-e-6 sacretest]$ rm -rf *
[cs-bogo1@login-e-6 sacretest]$ ../sacreBLEU/sacrebleu.py -t wmt13 -l en-de --echo ref > testvalid.de
[cs-bogo1@login-e-6 sacretest]$ ../sacreBLEU/sacrebleu.py -t wmt13 -l en-de --echo src > testvalid.en
[cs-bogo1@login-e-6 sacretest]$ wc -l *
   1 testvalid.de
  70 testvalid.en
  71 total

This happens on a number of systems and I am unable to find a pattern as of yet. My affected system has the following python version:
Python 3.6.4 (default, Jan 18 2018, 00:44:15)
[GCC 5.4.0] on linux
My working system has the following python version:
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux

At the very least there should be an error and not a silent failure.

Cheers,

Nick

sentence_bleu calculation fails

Sentence-level BLEU calculation from the Python API gave a 0.0 score on arbitrary samples in a dataset, while the corpus-level calculation works correctly.


reproducible versions: 1.2.11 - 1.4.2.

Using own test set data

Apologies if this is a really simple question. I am doing an NMT project for a student science exhibition. I am trying to debug a high BLEU score problem I have (calculated by Fairseq).

I have my own dev set with my decoder's predictions too. My language pair (en-ga [Irish]) is not available in any of the pre-processed sets. How may I use my own material with sacrebleu? I only ask because I find the readme just a small bit unclear on that (unless I'm missing something).

Thanks.

Unit tests failing with sacrebleu 1.4.2

The latest version on pypi (1.4.2) has 15 unit tests failing. Most of them for this reason:

    @pytest.mark.parametrize("hypotheses, references, expected_score", test_cases_keep_whitespace)
    def test_chrf_keep_whitespace(hypotheses, references, expected_score):
        score = sacrebleu.corpus_chrf(hypotheses, references, 6, 3, remove_whitespace=False)
>       assert abs(score - expected_score) < EPSILON
E       TypeError: unsupported operand type(s) for -: 'CHRF' and 'float'
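The metric functions now return score objects rather than floats, so the failing assertion presumably needs the object's .score attribute, e.g.:

import sacrebleu

chrf = sacrebleu.corpus_chrf(['the cat sat on the mat'],
                             [['there is a cat on the mat']])
print(chrf.score)  # the numeric value lives on the .score attribute
# i.e. in the test: assert abs(score.score - expected_score) < EPSILON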

Add checksumming

Occasionally a file is only partially processed if sacrebleu is interrupted during download or conversion. There should be a checksum to ensure that cached files are intact. We took a step in this direction by complaining when the line counts of the system output and reference differ, but this isn't failsafe, and it also doesn't guard against use of --echo.
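A sketch of the kind of check that could be added (names are hypothetical; this is not sacrebleu's downloader):

import hashlib

def sha256_of(path, bufsize=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(bufsize), b''):
            digest.update(block)
    return digest.hexdigest()

# Hypothetical usage: compare against a digest recorded with the dataset entry.
expected_sha256 = '0000000000000000000000000000000000000000000000000000000000000000'
if sha256_of('wmt13.tgz') != expected_sha256:
    raise RuntimeError('corrupt or truncated download, please re-run')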

Unicode characters in error message cause problems with LC_ALL=C

Hi,
Two things, same cause, I believe.

ubuntu@ip-172-31-65-80:~$ LC_ALL=C sacrebleu --help
Traceback (most recent call last):
  File "/usr/local/bin/sacrebleu", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/sacrebleu.py", line 762, in main
    args = arg_parser.parse_args()
  File "/usr/lib/python3.5/argparse.py", line 1735, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "/usr/lib/python3.5/argparse.py", line 1767, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/usr/lib/python3.5/argparse.py", line 1973, in _parse_known_args
    start_index = consume_optional(start_index)
  File "/usr/lib/python3.5/argparse.py", line 1913, in consume_optional
    take_action(action, args, option_string)
  File "/usr/lib/python3.5/argparse.py", line 1841, in take_action
    action(self, namespace, argument_values, option_string)
  File "/usr/lib/python3.5/argparse.py", line 1025, in __call__
    parser.print_help()
  File "/usr/lib/python3.5/argparse.py", line 2367, in print_help
    self._print_message(self.format_help(), file)
  File "/usr/lib/python3.5/argparse.py", line 2373, in _print_message
    file.write(message)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 454: ordinal not in range(128)

and installation with pip (not pip3, in the same environment with LC_ALL=C):

ubuntu@ip-172-31-65-80:~$ sudo -H pip install sacrebleu
Collecting sacrebleu
  Using cached sacrebleu-1.0.3.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-iHSRfo/sacrebleu/setup.py", line 13, in <module>
        import sacrebleu
      File "/tmp/pip-build-iHSRfo/sacrebleu/sacrebleu.py", line 3
    SyntaxError: Non-ASCII character '\xc3' in file /tmp/pip-build-iHSRfo/sacrebleu/sacrebleu.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-iHSRfo/sacrebleu/

Works fine with:

LC_ALL=en_US.UTF8 sacrebleu --help

Can't have nice things :)
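The pip failure, at least, has a well-known cause: under Python 2, a source file containing non-ASCII characters needs an explicit encoding declaration (PEP 263) at the top of the module, e.g. in sacrebleu.py:

# -*- coding: utf-8 -*-
# Python 2 otherwise rejects non-ASCII bytes in source files; Python 3 assumes UTF-8.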

The Choice of Detokenization Methods

Hi,
Just a couple of questions since we need to detokenize data before evaluation.
Will the choice of detokenization method affect the BLEU score?
Which detokenization method is the best choice?

Floor smoothing doesn't work in the command line

When using --smooth floor, the floor defaults to 0.0 (cf. https://github.com/mjpost/sacreBLEU/blob/master/sacrebleu.py#L1179), with no way of specifying any other threshold (https://github.com/mjpost/sacreBLEU/blob/master/sacrebleu.py#L1489).


Reproduction:

echo "yes" > hyp.txt
echo "no" > ref.txt
sacrebleu -s floor ref.txt < hyp.txt

Gives:

BLEU+case.mixed+numrefs.1+smooth.floor+tok.13a+version.1.2.20 = 0.0 0.0/0.0/0.0/0.0 (BP = 1.000 ratio = 1.000 hyp_len = 1 ref_len = 1)

Should be (I think):

BLEU+case.mixed+numrefs.1+smooth.floor+tok.13a+version.1.2.20 = 1.0 1.0/1.0/1.0/1.0 (BP = 1.000 ratio = 1.000 hyp_len = 1 ref_len = 1)
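For reference, later versions of the Python API expose the floor value directly; a sketch mirroring the reproduction above (parameter names as in recent releases):

import sacrebleu

bleu = sacrebleu.corpus_bleu(['yes'], [['no']],
                             smooth_method='floor', smooth_value=0.01)
print(bleu.score, bleu.precisions)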

Option to output only final score to stdout

Hi,
it would be useful to have an option that outputs only the score, e.g.

22.17

instead of

BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt13+tok.13a+version.1.1.4 = 22.17 55.8/28.2/16.7/10.2 (BP = 0.974 ratio = 0.975 hyp_len = 62113 ref_len = 63737)

Python 2 compatibility

It would be nice to support Python 2, I guess. This would require at least unicode handlers for file streams, likely different downloading code, and maybe some other small things.

Remove typing module from dependencies

The external typing module (now redundant, since typing has been part of the standard library since Python 3.5) is installed along with sacrebleu, which conflicts with the standard-library typing module on recent Python versions.

Here is an example of a conflict this caused in one of my projects:

    File "/root/.cache/pypoetry/virtualenvs/feedly.ml-summarization-py3.7/li  
b/python3.7/site-packages/typing.py", line 1357, in <module>                  
      class Callable(extra=collections_abc.Callable, metaclass=CallableMeta)  
:                                                                             
    File "/root/.cache/pypoetry/virtualenvs/feedly.ml-summarization-py3.7/li  
b/python3.7/site-packages/typing.py", line 1005, in __new__                   
      self._abc_registry = extra._abc_registry                                
  AttributeError: type object 'Callable' has no attribute '_abc_registry'     

This dependency is listed in the project's dependencies here:

install_requires = ['typing', 'portalocker', 'mecab-python3'],

I believe the possible solutions are:

  • conditionally installing the external typing module for users with Python < 3.5 only
  • removing this dependency altogether, if you do not plan to support Python < 3.5 anyway
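A sketch of the first option, using a PEP 508 environment marker in setup.py (other setup() arguments omitted):

from setuptools import setup

setup(
    name='sacrebleu',
    install_requires=[
        'portalocker',
        'mecab-python3',
        'typing; python_version < "3.5"',  # install the backport only where needed
    ],
)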

Cheers!

SGML entities de-escaping in tokenization

https://github.com/mjpost/sacreBLEU/blob/b38690e1537cd4719c3517ef77c8255c5a107cc8/sacrebleu.py#L396-L399

First, there is a bug: all four entities &quot; &amp; &lt; &gt; are converted to double quotes, probably a copy-paste error (it is not present in the original Perl implementation).
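For reference, the mapping the de-escaping presumably intends (a sketch of the standard replacements, not the quoted source):

def deescape(line):
    return (line.replace('&quot;', '"')
                .replace('&amp;', '&')
                .replace('&lt;', '<')
                .replace('&gt;', '>'))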

Second, I think we should delete this completely. The de-escaping was needed in the original implementation because the translation and reference files were in SGML (*.sgm) format. SacreBLEU expects plain-text input (or API calls), so it is not needed. I think it is the responsibility of a modern MT system to clean its data (ideally the training data) and produce human-readable sentences (i.e. without escaped HTML/SGML entities).

Similarly, if the input format is expected to be one sentence per line, there is no need for replace('-\n', ''); it is harmless when there are no newlines in the string, but it obfuscates the code.

And guess what is my opinion on replace('<skipped>', '').

sacreBLEU computation for multiple target translations?

Hi, is it possible to compute sacreBLEU with multiple reference/target translations for each hypothesis translation in the corpus? Normal BLEU can be computed in such a scenario by providing a list of references for each hypothesis; however, when I tried the same with sacreBLEU, it breaks.

Any inputs on the same, would be really helpful! Thanks!
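The usual stumbling block is the orientation of the references: corpus_bleu() expects one stream per reference set, each parallel to the hypotheses, rather than a list of references per hypothesis. A sketch with made-up data:

import sacrebleu

sys = ['hypothesis for sentence 1', 'hypothesis for sentence 2']
refs_per_hyp = [
    ['reference A for sentence 1', 'reference B for sentence 1'],
    ['reference A for sentence 2', 'reference B for sentence 2'],
]

# Transpose into reference streams: ref_streams[j][i] is reference j of sentence i.
ref_streams = [list(stream) for stream in zip(*refs_per_hyp)]

print(sacrebleu.corpus_bleu(sys, ref_streams).score)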

Feature Request: Port the significance test?

In Moses, there is this script for determining the "true" BLEU score within a confidence interval. Unfortunately, it does not offer the configurability that sacreBLEU has.

In order to compare systems with regard to statistical significance, it would be nice to have a similar script, but supporting sacreBLEU.

Add LDC datasets

If $LDC were defined, locally installed LDC datasets could be extracted in a similar fashion without violating any licenses. However, this would require writing a parser for the SGML-formatted reference files in order to handle multiple references, since Python 3 does not ship one.

Crashes when run without installation

Hi,
It used to be possible to run sacrebleu directly from the cloned repository, like:

git clone https://github.com/mjpost/sacrebleu
./sacrebleu/sacrebleu/sacrebleu.py -h

This now dies with:

Traceback (most recent call last):
  File "./sacrebleu/sacrebleu/sacrebleu.py", line 41, in <module>
    from .tokenizer import TOKENIZERS, TokenizeMeCab
SystemError: Parent module '' not loaded, cannot perform relative import
