Pado 2007 (PhD Thesis)

Sebastian Padó: Cross-Lingual Annotation Projection Models for Role-Semantic Information. May 31st, 2007. Download the PDF file (A4).

Volume 21, Saarbrücken Dissertations in Computational Linguistics and Language Technology. German Research Center for Artificial Intelligence and Saarland University. ISBN 978-3-933218-20-9. Order hardcopy here.

Abstract

Due to the high cost of manual annotation, resources with role-semantic annotation exist only for a small number of languages, notably English. This thesis addresses the resulting resource scarcity problem for languages where such resources are not available by developing methods which automatically induce role-semantic annotations for these languages.

We address the induction task by taking advantage of the resource gradient between languages, extracting annotations from existing resources (e.g., for English) and transferring them to new languages. We effect the transfer using annotation projection, a general procedure to exchange linguistic information between aligned sentences in a parallel corpus. Basic annotation projection is a knowledge-lean approach, and thus applicable even to resource-poor languages. Specifically, we apply projection to semantic annotation in the frame semantics paradigm. Frame-semantic annotation consists of two annotation layers: semantic classes for predicates, and semantic roles linking predicates to their arguments. We evaluate our approach by using FrameNet, a large English resource for frame semantics, to induce frame-semantic annotation for two target languages, German and French.
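As a rough illustration of basic annotation projection, the following minimal sketch carries role annotations across a word alignment. The function name, the data layout, and the example sentence pair are assumptions made for illustration; they are not taken from the thesis.

```python
from typing import Dict, List, Set, Tuple


def project_roles(
    source_roles: Dict[str, Set[int]],
    alignment: List[Tuple[int, int]],
) -> Dict[str, Set[int]]:
    """Project each role's source token indices onto target token indices.

    `alignment` is a list of (source_index, target_index) word links.
    """
    # Map every source token index to the target indices it is aligned with.
    src_to_tgt: Dict[int, Set[int]] = {}
    for src, tgt in alignment:
        src_to_tgt.setdefault(src, set()).add(tgt)

    projected: Dict[str, Set[int]] = {}
    for role, src_tokens in source_roles.items():
        tgt_tokens: Set[int] = set()
        for s in src_tokens:
            # Unaligned source tokens contribute nothing to the projection.
            tgt_tokens |= src_to_tgt.get(s, set())
        projected[role] = tgt_tokens
    return projected


# Hypothetical example pair (not from the thesis):
#   English: Peter(0) has(1) bought(2) a(3) book(4)
#   German:  Peter(0) hat(1) ein(2) Buch(3) gekauft(4)
english_roles = {"Buyer": {0}, "Goods": {3, 4}}
word_links = [(0, 0), (1, 1), (2, 4), (3, 2), (4, 3)]
german_roles = project_roles(english_roles, word_links)
```

With a complete alignment, the Goods role lands on "ein Buch". If the link for "a" were missing, only "Buch" would receive the role, which is the kind of alignment omission that makes role projection harder than this sketch suggests.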

In the first part of this thesis, we assess a prerequisite for annotation projection, namely the degree of parallelism between monolingual reference annotations in a parallel corpus. Inspection of a manually annotated sample corpus shows that frame-semantic annotation in fact exhibits a substantial degree of parallelism, both with respect to semantic classes and semantic roles. This result holds for both language pairs we consider, namely English–French and English–German.

The two central parts of the thesis are concerned with the actual projection step for individual predicates. We project semantic classes and roles in two separate steps, since the two tasks have different profiles. The projection of semantic classes can be realised simply by using correspondences between predicates, which are usually single words. We identify translational shifts as the central problem of this task, i.e., translations which change the semantic class (frame) of the original predicate. We demonstrate that knowledge-lean filtering mechanisms relying on distributional properties are sufficient to induce high-precision seed lexicons. In contrast, the projection of semantic roles relies mainly on clean correspondences between sentential constituents (i.e., role-bearing phrases). The latter are difficult to obtain due to errors and omissions in the word alignment. We show that better constituent alignments can be obtained by formalising the task as a graph matching problem which can integrate knowledge about syntactic bracketings. The linguistic information encoded in the bracketings alleviates word alignment errors and results in high-precision projections even for noisy input data (e.g., resulting from automatic shallow semantic parsing).
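To make the graph matching view of constituent alignment concrete, here is a minimal sketch under stated assumptions. The overlap-based similarity, the brute-force search over assignments, and all names are illustrative choices, not the similarity function or matching algorithm used in the thesis. Constituents are represented as sets of token indices, as they might be read off syntactic bracketings.

```python
from itertools import permutations
from typing import Dict, FrozenSet, List, Set, Tuple

Span = FrozenSet[int]
Link = Tuple[int, int]


def overlap(src: Span, tgt: Span, links: Set[Link]) -> float:
    """Share of the links touching either span that connect the two spans."""
    inside = sum(1 for s, t in links if s in src and t in tgt)
    touching = sum(1 for s, t in links if s in src or t in tgt)
    return inside / touching if touching else 0.0


def best_matching(
    src_spans: List[Span], tgt_spans: List[Span], links: Set[Link]
) -> Dict[int, int]:
    """Return the one-to-one assignment of source to target constituents
    with the highest total overlap (brute force, small inputs only).

    Assumes len(src_spans) <= len(tgt_spans).
    """
    best_score, best = -1.0, {}
    for perm in permutations(range(len(tgt_spans)), len(src_spans)):
        score = sum(
            overlap(src_spans[i], tgt_spans[j], links)
            for i, j in enumerate(perm)
        )
        if score > best_score:
            best_score, best = score, dict(enumerate(perm))
    return best


# Hypothetical example (not from the thesis):
#   English: Peter(0) has(1) bought(2) a(3) book(4)
#   German:  Peter(0) hat(1) ein(2) Buch(3) gekauft(4)
# The word link a(3) -> ein(2) is deliberately missing.
links = {(0, 0), (1, 1), (2, 4), (4, 3)}
src_spans = [frozenset({0}), frozenset({3, 4})]  # "Peter", "a book"
tgt_spans = [
    frozenset({0}),           # "Peter"
    frozenset({2, 3}),        # "ein Buch"
    frozenset({1, 2, 3, 4}),  # "hat ein Buch gekauft"
]
matching = best_matching(src_spans, tgt_spans, links)
```

In this example, "a book" is matched to the whole constituent "ein Buch" even though "a" is unaligned, which illustrates how bracketing information can compensate for alignment omissions. The exhaustive search is only practical for a handful of constituents; an efficient assignment algorithm would be needed for realistic sentence lengths.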

In the last part of the thesis, we approach the problem of translational shifts. We identify a subclass of cases for which parallelism can be restored by considering groups of more than one frame. These frame group paraphrases are amenable to a generalised version of annotation projection, and we provide a semi-supervised algorithm for their corpus-based acquisition. In a manual pilot study, we show that the acquisition algorithm results in linguistically plausible frame group paraphrases which can furthermore account for a large portion of translational shifts in our sample.

In sum, the results of this thesis indicate that the semantic generalisations made by frame semantics carry over to a considerable degree from English to other languages. The projection methods we have developed can be applied to robustly and automatically create frame-semantic resources for new languages.

Sebastian Padó
