BLEU (Bilingual Evaluation Understudy)

See also: Machine learning terms

Introduction

The Bilingual Evaluation Understudy (BLEU) is an automatic evaluation metric used in the field of Natural Language Processing (NLP) to measure the quality of machine-generated translations. Developed by IBM Research in 2002, it compares translations generated by a machine with a set of human-generated reference translations. BLEU scores are widely used in the evaluation of machine translation systems and other NLP tasks that involve generating text, such as text summarization and image captioning.

Algorithm

Overview

The BLEU metric is based on the concept of modified n-gram precision. An n-gram is a contiguous sequence of n words or tokens in a given text. Modified precision is computed by counting the n-grams of the machine-generated translation that also appear in the reference translations, with each candidate n-gram's count clipped to the maximum number of times it occurs in any single reference; this prevents a translation from earning credit by repeating a correct word many times. A higher proportion of overlapping n-grams indicates greater similarity between the machine-generated translation and the reference translations, and therefore a higher BLEU score.
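
As a concrete illustration, the following Python sketch computes modified n-gram precision for one candidate sentence against one or more tokenized references. The function names and the example sentences are purely illustrative, not part of any standard library.

from collections import Counter

def ngrams(tokens, n):
    # Counter of all contiguous n-grams (as tuples) in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    # Clipped precision: each candidate n-gram counts at most as many times
    # as it appears in any single reference translation
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Illustrative example: "the" appears 3 times in the candidate but is clipped to 2
candidate = "the the the cat".split()
references = ["the cat sat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 0.75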

Calculating BLEU

The BLEU metric calculation involves the following steps:

1. Compute modified n-gram precision: For each n-gram length (typically 1 to 4), calculate the proportion of candidate n-grams that also appear in the reference translations, clipping each n-gram's count as described above.

2. Apply brevity penalty: Since shorter translations tend to have higher n-gram precision, a brevity penalty is applied when the candidate translation is shorter than the reference translations. The brevity penalty ensures that a system cannot score well merely by producing short, high-precision phrases.

3. Compute geometric mean: Take the geometric mean of the modified n-gram precisions from step 1 and multiply it by the brevity penalty from step 2. This yields the final BLEU score (see the sketch below).

The BLEU score ranges from 0 to 1, with 0 indicating no overlap between the candidate translation and the reference translations, and 1 indicating a perfect match. In practice, however, even high-quality translations rarely achieve a BLEU score of 1 due to the inherent variability in human translation.
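
Putting the three steps together, the sketch below (reusing the modified_precision helper from the previous section) computes a sentence-level BLEU score: the geometric mean of the 1- to 4-gram precisions multiplied by the brevity penalty. Without smoothing the score collapses to zero whenever any n-gram order has no matches, which is one reason smoothed variants or established implementations such as NLTK's sentence_bleu or sacreBLEU are normally used in practice. The example sentences are illustrative only.

import math

def brevity_penalty(candidate, references):
    # BP = 1 if the candidate is longer than the (closest) reference,
    # otherwise exp(1 - r / c)
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    # Geometric mean of modified 1..max_n-gram precisions, times the brevity penalty;
    # modified_precision is defined in the earlier sketch
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no matches
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)

# Illustrative example
candidate = "the cat sat on a mat".split()
references = ["the cat sat on the mat".split()]
print(round(bleu(candidate, references), 3))  # about 0.537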

Limitations

Although the BLEU metric has been widely adopted for evaluating machine-generated translations, it has some limitations:

  • Lack of semantic understanding: BLEU is a purely statistical metric that does not account for the meaning of the text. As a result, it may not always correlate with human judgments of translation quality.
  • Sensitivity to reference translations: The quality of the reference translations can significantly affect the BLEU score. If the reference translations are of low quality or not representative of the target language, the BLEU score may not accurately reflect the quality of the machine-generated translation.
  • Inability to capture reordering: BLEU primarily focuses on local n-gram matches and may not adequately capture long-range reordering or paraphrasing, which are essential aspects of high-quality translation.

Explain Like I'm 5 (ELI5)

BLEU is a way to check how well a computer can translate text from one language to another. It looks at the words and phrases the computer uses and compares them to translations made by humans. If the computer's translation has a lot of the same words and phrases as the human translations, the BLEU score is higher. But if the computer's translation is very different from the human translations, the BLEU score is lower. However, BLEU has some problems. It doesn't always understand the meaning of the words, and it can be affected by the quality of the human translations it compares to.