METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching.
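As an illustration of this weighting, and assuming the 9:1 recall-to-precision weighting used in the original METEOR formulation, the unigram precision P and recall R are combined as a weighted harmonic mean:

F_{\text{mean}} = \frac{10\,P\,R}{R + 9\,P}

For example, with P = 0.8 and R = 0.9 this gives F_mean ≈ 0.89, which lies much closer to the recall than to the precision, reflecting the heavier weight on recall.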
The metric was designed to fix some of the problems found in the more popular BLEU metric, and also to produce good correlation with human judgement at the sentence or segment level; this differs from BLEU, which seeks correlation at the corpus level. Results have been presented giving correlation with human judgement of up to 0.964 at the corpus level, compared to BLEU's 0.817 on the same data set. At the sentence level, the maximum correlation with human judgement achieved was 0.403.