metadata

title: Google BLEU
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  The BLEU score has some undesirable properties when used for single sentences,
  as it was designed to be a corpus measure. We therefore  use a slightly
  different score for our RL experiments which we call the 'GLEU score'. For the
  GLEU score, we record all sub-sequences of 1, 2, 3 or 4 tokens in output and
  target sequence (n-grams). We then compute a recall, which is the ratio of the
  number of matching n-grams to the number of total n-grams in the target
  (ground truth) sequence, and a precision, which is the ratio of the number of
  matching n-grams to the number of total n-grams in the generated output
  sequence. Then GLEU score is simply the minimum of recall and precision. This
  GLEU score's range is always between 0 (no matches) and 1 (all match) and it
  is symmetrical when switching output and target. According to our experiments,
  GLEU score correlates quite well with the BLEU metric on a corpus level but
  does not have its drawbacks for our per sentence reward objective.

Metric Card for Google BLEU

Metric Description

The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. The Google BLEU score is designed to limit these undesirable properties when used for single sentences.

To calculate this score, all sub-sequences of 1, 2, 3 or 4 tokens in output and target sequence (n-grams) are recorded. The precision and recall, described below, are then computed.

precision: the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence
recall: the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence

The minimum value of precision and recall is then returned as the score.

Intended Uses

This metric is generally used to evaluate machine translation models. It is especially used when scores of individual (prediction, reference) sentence pairs are needed, as opposed to when averaging over the (prediction, reference) scores for a whole corpus. That being said, it can also be used when averaging over the scores for a whole corpus.

Because it performs better on individual sentence pairs as compared to BLEU, Google BLEU has also been used in RL experiments.

How to Use

This metric takes a list of predicted sentences, as well as a list of references.

sentence1 = "the cat sat on the mat"
sentence2 = "the cat ate the mat"
google_bleu = evaluate.load("google_bleu")
result = google_bleu.compute(predictions=[sentence1], references=[[sentence2]])
print(result)
>>> {'google_bleu': 0.3333333333333333}

Inputs

predictions (list of str): list of translations to score.
references (list of list of str): list of lists of references for each translation.
tokenizer : approach used for tokenizing predictions and references. The default tokenizer is tokenizer_13a, a minimal tokenization approach that is equivalent to mteval-v13a, used by WMT. This can be replaced by any function that takes a string as input and returns a list of tokens as output.
min_len (int): The minimum order of n-gram this function should extract. Defaults to 1.
max_len (int): The maximum order of n-gram this function should extract. Defaults to 4.

Output Values

This metric returns the following in a dict:

google_bleu (float): google_bleu score

The output format is as follows:

{'google_bleu': google_bleu score}

This metric can take on values from 0 to 1, inclusive. Higher scores are better, with 0 indicating no matches, and 1 indicating a perfect match.

Note that this score is symmetrical when switching output and target. This means that, given two sentences, sentence1 and sentence2, whatever score is output when sentence1 is the predicted sentence and sencence2 is the reference sentence will be the same as when the sentences are swapped and sentence2 is the predicted sentence while sentence1 is the reference sentence. In code, this looks like:

predictions = "the cat sat on the mat"
references = "the cat ate the mat"
google_bleu = evaluate.load("google_bleu")
result_a = google_bleu.compute(predictions=[predictions], references=[[references]])
result_b = google_bleu.compute(predictions=[predictions], references=[[references]])
print(result_a == result_b)
>>> True

Values from Popular Papers

Examples

Example with one reference per sample:

>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat'], ['he was interested in world history because he read the book']]
>>> google_bleu = evaluate.load("google_bleu")
>>> results = google_bleu.compute(predictions=predictions, references=references)
>>> print(round(results["google_bleu"], 2))
0.44

Example with multiple references for the first sample:

>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references  = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> google_bleu = evaluate.load("google_bleu")
>>> results = google_bleu.compute(predictions=predictions, references=references)
>>> print(round(results["google_bleu"], 2))
0.61

Example with multiple references for the first sample, and with min_len adjusted to 2, instead of the default 1, which means that the function extracts n-grams of length 2:

>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> google_bleu = evaluate.load("google_bleu")
>>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2)
>>> print(round(results["google_bleu"], 2))
0.53

Example with multiple references for the first sample, with min_len adjusted to 2, instead of the default 1, and max_len adjusted to 6 instead of the default 4:

>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> google_bleu = evaluate.load("google_bleu")
>>> results = google_bleu.compute(predictions=predictions,references=references, min_len=2, max_len=6)
>>> print(round(results["google_bleu"], 2))
0.4

Limitations and Bias

The GoogleBLEU metric does not come with a predefined tokenization function; previous versions simply used split() to split the input strings into tokens. Using a tokenizer such as the default one, tokenizer_13a, makes results more standardized and reproducible. The BLEU and sacreBLEU metrics also use this default tokenizer.

Citation

@misc{wu2016googles,
title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
author={Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens and George Kurian and Nishant Patil and Wei Wang and Cliff Young and Jason Smith and Jason Riesa and Alex Rudnick and Oriol Vinyals and Greg Corrado and Macduff Hughes and Jeffrey Dean},
year={2016},
eprint={1609.08144},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

Further References

This Hugging Face implementation uses the nltk.translate.gleu_score implementation