M. Faruqui, P. Majumder, and S. Pado. Soundex-based translation correction in Urdu–English cross-language information retrieval. Proceedings of the IJCNLP Workshop on Cross-Lingual Information Retrieval. Chiang Mai, Thailand, 2011.
Cross-language information retrieval is difficult for languages with few processing tools or resources, such as Urdu. Google Translate provides an easy way of translating content words, but due to lexicon limitations, named entities (NEs) are transliterated letter by letter. The resulting NE errors (zynydyny zdn for Zinedine Zidane) hurt retrieval. We propose to replace English non-words in the translation output. First, we determine phonetically similar English words with the Soundex algorithm. Then, we choose among them using a modified Levenshtein distance that models correct transliteration patterns. This strategy yields an improvement of about 4 points MAP (from 41.2 to 45.1; monolingual: 51.4) on the FIRE-2010 dataset.
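The two-step correction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses standard American Soundex and a plain Levenshtein distance in place of the paper's modified, transliteration-aware distance, and the vocabulary and the function names (soundex, levenshtein, correct) are hypothetical.

```python
def soundex(word: str) -> str:
    """Standard American Soundex code (one letter plus three digits)."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset the previous code
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance; the paper uses a modified variant not reproduced here."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1,                # deletion
                               new_row[j - 1] + 1,        # insertion
                               row[j - 1] + (ca != cb)))  # substitution
        row = new_row
    return row[-1]


def correct(token: str, vocabulary: list[str]) -> str:
    """Replace a non-word by the closest vocabulary word with the same Soundex code."""
    if token.lower() in {w.lower() for w in vocabulary}:
        return token  # already a known word, leave it alone
    key = soundex(token)
    candidates = [w for w in vocabulary if soundex(w) == key]
    if not candidates:
        return token  # no phonetically similar word found
    return min(candidates, key=lambda w: levenshtein(token.lower(), w.lower()))


# Hypothetical toy vocabulary; a real system would use a large English word list.
VOCAB = ["Zinedine", "Zidane", "Zone", "Sudan"]
print(correct("zynydyny", VOCAB), correct("zdn", VOCAB))
```

In this toy setup, Soundex alone already narrows each non-word to a single candidate; the edit distance only matters when several vocabulary words share a code, which is where the paper's modified distance comes in.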
@InProceedings{faruqui-majumder-pado:2011:CLIA5,
author = {Faruqui, Manaal and Majumder, Prasenjit and Pado, Sebastian},
title = {Soundex-based Translation Correction in Urdu--English
Cross-Language Information Retrieval},
booktitle = {Proceedings of the Fifth International Workshop On
Cross Lingual Information Access},
year = {2011},
address = {Chiang Mai, Thailand},
pages = {25--29},
url = {http://www.aclweb.org/anthology/W11-3605}
}