Pado et al. 2009b
S. Pado, M. Galley, D. Jurafsky, and C. Manning: Robust Machine Translation Evaluation with Entailment Features. Proceedings of ACL 2009. Singapore.
Existing evaluation metrics for machine translation lack crucial robustness: their correlations with human quality judgments vary considerably across languages and genres. We believe that the main reason is their inability to properly capture meaning: A good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that evaluates MT output based on a rich set of features motivated by textual entailment, such as lexical-semantic (in-)compatibility and argument structure overlap. We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings. The combination metric outperforms the individual scores, but is bested by the entailment-based metric. Combining the entailment and traditional features yields further improvements.
@InProceedings{pado-EtAl:2009:ACLIJCNLP,
  author    = {Pado, Sebastian and Galley, Michel and
               Jurafsky, Dan and Manning, Christopher D.},
  title     = {Robust Machine Translation Evaluation with Entailment Features},
  booktitle = {Proceedings of the Joint Conference of the 47th Annual Meeting
               of the ACL and the 4th International Joint Conference on
               Natural Language Processing of the AFNLP},
  month     = {August},
  year      = {2009},
  address   = {Suntec, Singapore},
  publisher = {Association for Computational Linguistics},
  pages     = {297--305},
  url       = {http://www.aclweb.org/anthology/P/P09/P09-1034}
}