Enhancing Semantic Annotation through Coreference Chaining: An Ontology-based Approach

Till Christopher Lech
CognIT as, Oslo, Norway
[email protected]

Koenraad de Smedt
Universitetet i Bergen, Norway
[email protected]

Abstract. Semantic annotation of natural language text requires a certain degree of understanding of the document in question. The resolution of unclear reference in particular is a major challenge when detecting relevant information units in a document. The ongoing KunDoc project examines how domain-specific ontologies can support the task of coreference chaining in order to enhance applications such as automatic annotation, information extraction or automatic summarization. In this paper, we present a robust methodology for the acquisition of semantic contexts that does not depend on thorough syntactic parsing, since the necessary tools are often unavailable for “smaller” languages. Based on a shallow corpus analysis, verb-subject relations constitute the framework for the extraction of semantic contexts. Our approach either adds the semantic contexts to concepts and instances in an existing ontology or builds up the domain knowledge necessary for coreference chaining from scratch.

Introduction

Automatic semantic annotation of natural language text such as web documents requires a certain degree of text understanding, which in turn is dependent on resolving anaphoric expressions and coreference chains. In the current NLP landscape there are numerous approaches to anaphora resolution based on either heuristics, such as (Mitkov 1998) and (Stuckardt 2000), or statistics, such as (Soon, Ng et al. 2001) or (Ng and Cardie 2003). However, there are cases that can be resolved only through the use of domain-specific knowledge, as in Example (1):

(1) a. The police officer was searching for the suspect.
    b. He has been investigating the murder since Tuesday.
    b'. He committed the second murder on Tuesday.

Traditional methods such as (Mitkov 1998) in particular are limited in that they cannot resolve the distinction illustrated in (1). Only a few efforts have been made to explore background knowledge stored in ontologies in order to resolve unclear reference. The aim of the ongoing KunDoc project is to examine how ontologies can be acquired, enhanced and reused for detecting coreference chains in natural language text. In this paper, we describe a methodology for the learning and use of domain-specific ontologies in support of coreference chaining. We present a methodology for the acquisition of semantic contexts that can either be added to existing ontologies or constitute a starting point for ontology engineering, based on verb-subject relations extracted from a domain-specific text corpus.

Acquisition of Semantic Contexts

The idea of deriving semantic classes from noun phrase/verb co-occurrences is not new in itself. Most of the work in this area is based on the distributional hypothesis, i.e. that nouns are similar to the extent that they share contexts. We assume that certain actions, denoted by verbs, are typically performed by a semantically restricted set of entities. In one of the first significant attempts to exploit the distributional hypothesis, (Hindle 1990) describes a methodology for generating semantic classes based on predicate-argument structures. The starting point for Hindle's approach is the pointwise mutual information of verb-object and verb-subject co-occurrences. According to (Manning and Schütze 1999), mutual information is a symmetric, non-negative measure of the common information in two variables, as in (1):

$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \qquad (1)$$

where P(x, y) is the joint probability of events x and y, and P(x) and P(y) are the independent probabilities. In order to calculate a weighting for each verb-subject pair, Hindle derives a score (2) from the observed frequencies of verb-subject co-occurrences:

$$C_{subj}(n, v) = \log_2 \frac{f(n, v)/N}{\big(f(n)/N\big)\,\big(f(v)/N\big)} \qquad (2)$$

where f(n, v) is the frequency of a noun n occurring as the subject of verb v, and N is the total number of verb-subject pairs in the data set.
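As an illustration only (not part of the original system), the score in (2) can be computed directly from raw verb-subject counts. The following Python sketch assumes the (noun, verb) subject pairs have already been extracted from the corpus; all function and variable names are ours.

```python
from collections import Counter
from math import log2

def cooccurrence_scores(pairs):
    """Compute the association score of equation (2) for every observed
    (noun, verb) subject pair. `pairs` is a list of (noun, verb) tuples
    extracted from the corpus; the names here are purely illustrative."""
    pair_freq = Counter(pairs)                 # f(n, v)
    noun_freq = Counter(n for n, _ in pairs)   # f(n)
    verb_freq = Counter(v for _, v in pairs)   # f(v)
    total = len(pairs)                         # N: total number of verb-subject pairs

    scores = {}
    for (n, v), f_nv in pair_freq.items():
        # C_subj(n, v) = log2( (f(n,v)/N) / ((f(n)/N) * (f(v)/N)) )
        scores[(n, v)] = log2((f_nv / total) /
                              ((noun_freq[n] / total) * (verb_freq[v] / total)))
    return scores
```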


Based on this verb-subject co-occurrence weighting, Hindle computes the similarity of two nouns occurring as subjects of a given verb:

$$SIM_{subj}(v_i, n_j, n_k) =
\begin{cases}
\min\big(C_{subj}(v_i, n_j),\, C_{subj}(v_i, n_k)\big) & \text{if } C_{subj}(v_i, n_j) > 0 \text{ and } C_{subj}(v_i, n_k) > 0 \\[4pt]
\big|\max\big(C_{subj}(v_i, n_j),\, C_{subj}(v_i, n_k)\big)\big| & \text{if } C_{subj}(v_i, n_j) < 0 \text{ and } C_{subj}(v_i, n_k) < 0 \\[4pt]
0 & \text{otherwise}
\end{cases} \qquad (3)$$

Subsequently, analogously to Hindle's method, we derive a measure for noun similarity that sums the respective subject similarities over all verbs for a pair of nouns:

$$SIM(n_1, n_2) = \sum_{i=0}^{N} SIM_{subj}(v_i, n_1, n_2) \qquad (4)$$
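A minimal sketch of how (3) and (4) could be evaluated on top of the scores from (2) is given below; again, the function names and the default value of 0 for unseen pairs are our assumptions, not prescribed by Hindle's or the project's implementation.

```python
def sim_subj(score_j, score_k):
    """Similarity of two nouns with respect to a single verb, as in (3).
    `score_j` and `score_k` are C_subj values for (v_i, n_j) and (v_i, n_k)."""
    if score_j > 0 and score_k > 0:
        return min(score_j, score_k)
    if score_j < 0 and score_k < 0:
        return abs(max(score_j, score_k))
    return 0.0

def sim(noun1, noun2, scores, verbs):
    """Noun similarity as the sum over all verbs, as in (4). `scores` maps
    (noun, verb) pairs to C_subj values; unseen pairs default to 0."""
    return sum(
        sim_subj(scores.get((noun1, v), 0.0), scores.get((noun2, v), 0.0))
        for v in verbs
    )
```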

Although aimed at the generation of classes rather than taxonomies, Hindle's method provides a framework for the initial experiments described in the following paragraphs. The extraction of predicate-argument structures (PAS) requires reasonably accurate parses of the sentences in the corpus, which can be problematic due to the limited availability of the necessary tools for various languages. Problems with the availability of parsing tools for Norwegian have been discussed in (Eiken 2005). One of the available tools for Norwegian is the LFG-based NORGRAM parser. The problem with NORGRAM is that its lexicon is incomplete, which makes it difficult to use the system on real-world corpora. As an alternative, and more robust, approach, a shallower parsing of the text was chosen, using the Oslo-Bergen Tagger. Tests were executed on data extracted from two different text collections; the first collection comprises newspaper articles on two murder cases in Norway.

The Oslo-Bergen Tagger (OBT) is a PoS tagger developed in a cooperation between the Universities of Oslo and Bergen, Norway. The OBT consists of a preprocessor for tokenisation and sentence boundary detection, a morphological tagger, and a Constraint Grammar (CG) based module for the disambiguation of tags. The CG component retains all readings that cannot be excluded, which yields fair recall but low precision. The CG module also delivers an annotation of sentence constituents such as subjects, objects or modifiers. The annotation of syntactic functions is far from exact, as shown in (2) and its annotation (2a).

(2)

Medelevene tente lys for Anne Slåtten under dagens minnestund. (Classmates lit candles for Anne Slåtten during today's obsequies.)

(2a) "" "" "" "" "" "" "" "" ""

"medelev" subst appell mask be fl @obj@subj "tenne" verb pret tr1 tr11 pa5 tr15 @fv "lys" subst appell nøyt ub fl @obj @subj "lys" subst appell nøyt ub ent @obj @subj "for" prep @adv "Anne" subst prop fem @
