lvwerra (HF staff) committed
Commit d807f7c
1 Parent(s): 617842a

Update Space (evaluate main: 828c6327)

Files changed (5):
  1. README.md +129 -4
  2. app.py +6 -0
  3. google_bleu.py +156 -0
  4. requirements.txt +4 -0
  5. tokenizer_13a.py +100 -0
README.md CHANGED
@@ -1,12 +1,137 @@
---
- title: Google_bleu
- emoji: 📈
- colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference

---
+ title: Google BLEU
+ emoji: 🤗
+ colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
+ tags:
+ - evaluate
+ - metric
---

+ # Metric Card for Google BLEU
+
+ ## Metric Description
+ The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. The Google BLEU (GLEU) score is designed to limit these undesirable properties when used for single sentences.
+
+ To calculate this score, all sub-sequences of 1, 2, 3 or 4 tokens in the output and target sequences (n-grams) are recorded. The precision and recall, described below, are then computed.
+
+ - **precision:** the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence
+ - **recall:** the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence
+
+ The minimum of precision and recall is then returned as the score.
+
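For intuition, here is a minimal hand-rolled sketch of that per-sentence computation. It is illustrative only: the module itself delegates to `nltk.translate.gleu_score`, and the helper names below (`ngram_counts`, `sentence_google_bleu`) are made up for this sketch.

```python
from collections import Counter

def ngram_counts(tokens, min_len=1, max_len=4):
    # Count every sub-sequence (n-gram) of length min_len..max_len.
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i : i + n])] += 1
    return counts

def sentence_google_bleu(prediction, reference, min_len=1, max_len=4):
    pred_counts = ngram_counts(prediction.split(), min_len, max_len)
    ref_counts = ngram_counts(reference.split(), min_len, max_len)
    # n-grams appearing in both, clipped to the smaller count
    matches = sum((pred_counts & ref_counts).values())
    precision = matches / max(sum(pred_counts.values()), 1)
    recall = matches / max(sum(ref_counts.values()), 1)
    return min(precision, recall)

# 6 matching n-grams; precision = 6/18, recall = 6/14, so the score is 6/18 ≈ 0.33
print(sentence_google_bleu("the cat sat on the mat", "the cat ate the mat"))
```
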
+ ## Intended Uses
+ This metric is generally used to evaluate machine translation models. It is especially useful when scores for individual (prediction, reference) sentence pairs are needed, as opposed to averaging the (prediction, reference) scores over a whole corpus. That being said, it can also be used when averaging over the scores for a whole corpus.
+
+ Because it behaves better than BLEU on individual sentence pairs, Google BLEU has also been used as a per-sentence reward in RL experiments.
+
+ ## How to Use
+ This metric takes a list of predicted sentences, as well as a list of references.
+
+ ```python
+ import evaluate
+
+ sentence1 = "the cat sat on the mat"
+ sentence2 = "the cat ate the mat"
+ google_bleu = evaluate.load("google_bleu")
+ result = google_bleu.compute(predictions=[sentence1], references=[[sentence2]])
+ print(result)
+ >>> {'google_bleu': 0.3333333333333333}
+ ```
+
+ ### Inputs
+ - **predictions** (list of str): list of translations to score.
+ - **references** (list of list of str): list of lists of references for each translation.
+ - **tokenizer**: approach used for tokenizing `predictions` and `references`.
+ The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT. It can be replaced by any function that takes a string as input and returns a list of tokens as output (see the sketch after this list).
+ - **min_len** (int): the minimum order of n-gram this function should extract. Defaults to 1.
+ - **max_len** (int): the maximum order of n-gram this function should extract. Defaults to 4.
+
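For example, a custom tokenizer is just a callable with that string-in, token-list-out signature. The sketch below passes a naive lowercasing whitespace tokenizer; it is purely illustrative, and the default `tokenizer_13a` is usually the better choice.

```python
import evaluate

def whitespace_tokenizer(text):
    # Naive tokenization: lowercase and split on whitespace.
    return text.lower().split()

google_bleu = evaluate.load("google_bleu")
result = google_bleu.compute(
    predictions=["The cat sat on the mat."],
    references=[["The cat ate the mat."]],
    tokenizer=whitespace_tokenizer,
)
print(result)  # the exact value depends on the tokenization
```
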
+ ### Output Values
+ This metric returns the following in a dict:
+ - **google_bleu** (float): google_bleu score
+
+ The output format is as follows:
+ ```python
+ {'google_bleu': google_bleu score}
+ ```
+
+ This metric can take on values from 0 to 1, inclusive. Higher scores are better, with 0 indicating no matches and 1 indicating a perfect match.
+
+ Note that this score is symmetrical when switching output and target. This means that, given two sentences, `sentence1` and `sentence2`, the score obtained when `sentence1` is the predicted sentence and `sentence2` is the reference sentence is the same as when `sentence2` is the predicted sentence and `sentence1` is the reference sentence. In code, this looks like:
+
+ ```python
+ import evaluate
+
+ sentence1 = "the cat sat on the mat"
+ sentence2 = "the cat ate the mat"
+ google_bleu = evaluate.load("google_bleu")
+ result_a = google_bleu.compute(predictions=[sentence1], references=[[sentence2]])
+ result_b = google_bleu.compute(predictions=[sentence2], references=[[sentence1]])
+ print(result_a == result_b)
+ >>> True
+ ```
+
+ #### Values from Popular Papers
+
+ ### Examples
+ Example with one reference per sample:
+ ```python
+ >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
+ >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat'], ['he was interested in world history because he read the book']]
+ >>> google_bleu = evaluate.load("google_bleu")
+ >>> results = google_bleu.compute(predictions=predictions, references=references)
+ >>> print(round(results["google_bleu"], 2))
+ 0.44
+ ```
+
+ Example with multiple references for the first sample:
+ ```python
+ >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
+ >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
+ >>> google_bleu = evaluate.load("google_bleu")
+ >>> results = google_bleu.compute(predictions=predictions, references=references)
+ >>> print(round(results["google_bleu"], 2))
+ 0.61
+ ```
+
+ Example with multiple references for the first sample, and with `min_len` adjusted to `2` instead of the default `1`, so that only n-grams of length 2 up to the default `max_len` of 4 are extracted:
+ ```python
+ >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
+ >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
+ >>> google_bleu = evaluate.load("google_bleu")
+ >>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2)
+ >>> print(round(results["google_bleu"], 2))
+ 0.53
+ ```
+
+ Example with multiple references for the first sample, with `min_len` adjusted to `2` instead of the default `1`, and `max_len` adjusted to `6` instead of the default `4`:
+ ```python
+ >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
+ >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
+ >>> google_bleu = evaluate.load("google_bleu")
+ >>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2, max_len=6)
+ >>> print(round(results["google_bleu"], 2))
+ 0.4
+ ```
+
+ ## Limitations and Bias
+
+ The GoogleBLEU metric itself does not prescribe a tokenization function; previous versions of this implementation simply used `split()` to break the input strings into tokens. Using a tokenizer such as the default one, `tokenizer_13a`, makes results more standardized and reproducible. The BLEU and sacreBLEU metrics also use this tokenizer by default.
+
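As a small illustration of why the choice of tokenizer matters (purely a sketch; the exact scores depend on the inputs):

```python
import evaluate

google_bleu = evaluate.load("google_bleu")
preds = ["the cat sat on the mat."]
refs = [["the cat sat on the mat ."]]

# The default tokenizer_13a separates the trailing period, so prediction and
# reference tokenize identically and the score is 1.0.
print(google_bleu.compute(predictions=preds, references=refs))

# Plain whitespace splitting (the behaviour of previous versions) keeps "mat."
# as a single token, so fewer n-grams match and the score drops.
print(google_bleu.compute(predictions=preds, references=refs, tokenizer=str.split))
```
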
+ ## Citation
+ ```bibtex
+ @misc{wu2016googles,
+       title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
+       author={Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens and George Kurian and Nishant Patil and Wei Wang and Cliff Young and Jason Smith and Jason Riesa and Alex Rudnick and Oriol Vinyals and Greg Corrado and Macduff Hughes and Jeffrey Dean},
+       year={2016},
+       eprint={1609.08144},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```
+
+ ## Further References
+ - This Hugging Face implementation uses the [nltk.translate.gleu_score implementation](https://www.nltk.org/_modules/nltk/translate/gleu_score.html).
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("google_bleu")
+ launch_gradio_widget(module)
google_bleu.py ADDED
@@ -0,0 +1,156 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ Google BLEU (aka GLEU) metric. """
+
+ from typing import Dict, List
+
+ import datasets
+ from nltk.translate import gleu_score
+
+ import evaluate
+ from evaluate import EvaluationModuleInfo
+
+ from .tokenizer_13a import Tokenizer13a
+
+
+ _CITATION = """\
+ @misc{wu2016googles,
+       title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
+       author={Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey
+       and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin
+       Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto
+       Kazawa and Keith Stevens and George Kurian and Nishant Patil and Wei Wang and Cliff Young and
+       Jason Smith and Jason Riesa and Alex Rudnick and Oriol Vinyals and Greg Corrado and Macduff Hughes
+       and Jeffrey Dean},
+       year={2016},
+       eprint={1609.08144},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ """
+
+ _DESCRIPTION = """\
+ The BLEU score has some undesirable properties when used for single
+ sentences, as it was designed to be a corpus measure. We therefore
+ use a slightly different score for our RL experiments which we call
+ the 'GLEU score'. For the GLEU score, we record all sub-sequences of
+ 1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then
+ compute a recall, which is the ratio of the number of matching n-grams
+ to the number of total n-grams in the target (ground truth) sequence,
+ and a precision, which is the ratio of the number of matching n-grams
+ to the number of total n-grams in the generated output sequence. Then
+ GLEU score is simply the minimum of recall and precision. This GLEU
+ score's range is always between 0 (no matches) and 1 (all match) and
+ it is symmetrical when switching output and target. According to
+ our experiments, GLEU score correlates quite well with the BLEU
+ metric on a corpus level but does not have its drawbacks for our per
+ sentence reward objective.
+ """
+
+ _KWARGS_DESCRIPTION = """\
+ Computes corpus-level Google BLEU (GLEU) score of translated segments against one or more references.
+ Instead of averaging the sentence level GLEU scores (i.e. macro-average precision), Wu et al. (2016) sum up the matching
+ tokens and the max of hypothesis and reference tokens for each sentence, then compute using the aggregate values.
+
+ Args:
+     predictions (list of str): list of translations to score.
+     references (list of list of str): list of lists of references for each translation.
+     tokenizer : approach used for tokenizing `predictions` and `references`.
+         The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT.
+         This can be replaced by any function that takes a string as input and returns a list of tokens as output.
+     min_len (int): The minimum order of n-gram this function should extract. Defaults to 1.
+     max_len (int): The maximum order of n-gram this function should extract. Defaults to 4.
+
+ Returns:
+     'google_bleu': google_bleu score
+
+ Examples:
+     Example 1:
+         >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', \
+         'he read the book because he was interested in world history']
+         >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat'], \
+         ['he was interested in world history because he read the book']]
+         >>> google_bleu = evaluate.load("google_bleu")
+         >>> results = google_bleu.compute(predictions=predictions, references=references)
+         >>> print(round(results["google_bleu"], 2))
+         0.44
+
+     Example 2:
+         >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', \
+         'he read the book because he was interested in world history']
+         >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', \
+         'It is a guide to action that ensures that the rubber duck will never heed the cat commands', \
+         'It is the practical guide for the rubber duck army never to heed the directions of the cat'], \
+         ['he was interested in world history because he read the book']]
+         >>> google_bleu = evaluate.load("google_bleu")
+         >>> results = google_bleu.compute(predictions=predictions, references=references)
+         >>> print(round(results["google_bleu"], 2))
+         0.61
+
+     Example 3:
+         >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', \
+         'he read the book because he was interested in world history']
+         >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', \
+         'It is a guide to action that ensures that the rubber duck will never heed the cat commands', \
+         'It is the practical guide for the rubber duck army never to heed the directions of the cat'], \
+         ['he was interested in world history because he read the book']]
+         >>> google_bleu = evaluate.load("google_bleu")
+         >>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2)
+         >>> print(round(results["google_bleu"], 2))
+         0.53
+
+     Example 4:
+         >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', \
+         'he read the book because he was interested in world history']
+         >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', \
+         'It is a guide to action that ensures that the rubber duck will never heed the cat commands', \
+         'It is the practical guide for the rubber duck army never to heed the directions of the cat'], \
+         ['he was interested in world history because he read the book']]
+         >>> google_bleu = evaluate.load("google_bleu")
+         >>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2, max_len=6)
+         >>> print(round(results["google_bleu"], 2))
+         0.4
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class GoogleBleu(evaluate.EvaluationModule):
+     def _info(self) -> EvaluationModuleInfo:
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string", id="sequence"),
+                     "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
+                 }
+             ),
+         )
+
+     def _compute(
+         self,
+         predictions: List[str],
+         references: List[List[str]],
+         tokenizer=Tokenizer13a(),
+         min_len: int = 1,
+         max_len: int = 4,
+     ) -> Dict[str, float]:
+         references = [[tokenizer(r) for r in ref] for ref in references]
+         predictions = [tokenizer(p) for p in predictions]
+         return {
+             "google_bleu": gleu_score.corpus_gleu(
+                 list_of_references=references, hypotheses=predictions, min_len=min_len, max_len=max_len
+             )
+         }
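The `_compute` method above is a thin wrapper around `nltk.translate.gleu_score.corpus_gleu`. The corpus-level aggregation it performs differs from simply averaging per-sentence GLEU scores, as this rough sketch shows (using plain whitespace tokenization instead of the module's default `tokenizer_13a`; the toy sentences are made up):

```python
from nltk.translate import gleu_score

predictions = ["the cat sat on the mat", "he read the book"]
references = [["the cat ate the mat"], ["he read the book"]]

# Whitespace tokenization for illustration only.
tok_preds = [p.split() for p in predictions]
tok_refs = [[r.split() for r in ref] for ref in references]

# Corpus-level score: matching n-gram counts are summed over all sentences
# before the ratio is taken, as in `_compute` above.
corpus = gleu_score.corpus_gleu(list_of_references=tok_refs, hypotheses=tok_preds)

# Macro-averaging the per-sentence scores generally gives a different number.
per_sentence = [gleu_score.sentence_gleu(ref, pred) for ref, pred in zip(tok_refs, tok_preds)]
print(corpus, sum(per_sentence) / len(per_sentence))
```
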
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ nltk
tokenizer_13a.py ADDED
@@ -0,0 +1,100 @@
+ # Source: https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_13a.py
+ # Copyright 2020 SacreBLEU Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import re
+ from functools import lru_cache
+
+
+ class BaseTokenizer:
+     """A base dummy tokenizer to derive from."""
+
+     def signature(self):
+         """
+         Returns a signature for the tokenizer.
+         :return: signature string
+         """
+         return "none"
+
+     def __call__(self, line):
+         """
+         Tokenizes an input line with the tokenizer.
+         :param line: a segment to tokenize
+         :return: the tokenized line
+         """
+         return line
+
+
+ class TokenizerRegexp(BaseTokenizer):
+     def signature(self):
+         return "re"
+
+     def __init__(self):
+         self._re = [
+             # language-dependent part (assuming Western languages)
+             (re.compile(r"([\{-\~\[-\` -\&\(-\+\:-\@\/])"), r" \1 "),
+             # tokenize period and comma unless preceded by a digit
+             (re.compile(r"([^0-9])([\.,])"), r"\1 \2 "),
+             # tokenize period and comma unless followed by a digit
+             (re.compile(r"([\.,])([^0-9])"), r" \1 \2"),
+             # tokenize dash when preceded by a digit
+             (re.compile(r"([0-9])(-)"), r"\1 \2 "),
+             # one space only between words
+             # NOTE: Doing this in Python (below) is faster
+             # (re.compile(r'\s+'), r' '),
+         ]
+
+     @lru_cache(maxsize=2**16)
+     def __call__(self, line):
+         """Common post-processing tokenizer for `13a` and `zh` tokenizers.
+         :param line: a segment to tokenize
+         :return: the tokenized line
+         """
+         for (_re, repl) in self._re:
+             line = _re.sub(repl, line)
+
+         # no leading or trailing spaces, single space within words
+         # return ' '.join(line.split())
+         # This line is changed with regards to the original tokenizer (seen above) to return individual words
+         return line.split()
+
+
+ class Tokenizer13a(BaseTokenizer):
+     def signature(self):
+         return "13a"
+
+     def __init__(self):
+         self._post_tokenizer = TokenizerRegexp()
+
+     @lru_cache(maxsize=2**16)
+     def __call__(self, line):
+         """Tokenizes an input line using a relatively minimal tokenization
+         that is however equivalent to mteval-v13a, used by WMT.
+
+         :param line: a segment to tokenize
+         :return: the tokenized line
+         """
+
+         # language-independent part:
+         line = line.replace("<skipped>", "")
+         line = line.replace("-\n", "")
+         line = line.replace("\n", " ")
+
+         if "&" in line:
+             line = line.replace("&quot;", '"')
+             line = line.replace("&amp;", "&")
+             line = line.replace("&lt;", "<")
+             line = line.replace("&gt;", ">")
+
+         return self._post_tokenizer(f" {line} ")
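For reference, a quick sketch of what this tokenizer does to a raw segment (assuming the file above is saved locally as `tokenizer_13a.py` so it can be imported directly):

```python
from tokenizer_13a import Tokenizer13a

tokenizer = Tokenizer13a()

# Punctuation is split off into its own token:
print(tokenizer("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']

# HTML entities such as &lt; are unescaped before tokenization.
print(tokenizer("2 &lt; 3, obviously"))
```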