Padó et al. 2009b

S. Padó, M. Galley, D. Jurafsky, and C. Manning: Robust Machine Translation Evaluation with Entailment Features. Proceedings of ACL 2009. Singapore.

Existing evaluation metrics for machine translation lack crucial robustness: their correlations with human quality judgments vary considerably across languages and genres. We believe that the main reason is their inability to properly capture meaning: A good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that evaluates MT output based on a rich set of features motivated by textual entailment, such as lexical-semantic (in-)compatibility and argument structure overlap. We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings. The combination metric outperforms the individual scores, but is bested by the entailment-based metric. Combining the entailment and traditional features yields further improvements.

@InProceedings{pado-EtAl:2009:ACLIJCNLP,
  author    = {Pado, Sebastian  and  Galley, Michel  and  
               Jurafsky, Dan  and  Manning, Christopher D.},
  title     = {Robust Machine Translation Evaluation with Entailment Features},
  booktitle = {Proceedings of the Joint Conference of the 47th Annual Meeting 
               of the ACL and the 4th International Joint Conference on 
               Natural Language Processing of the AFNLP},
  month     = {August},
  year      = {2009},
  address   = {Suntec, Singapore},
  publisher = {Association for Computational Linguistics},
  pages     = {297--305},
  url       = {http://www.aclweb.org/anthology/P/P09/P09-1034}
}

Sebastian Padó