LEARNING VERB INFERENCE RULES FROM LINGUISTICALLY-MOTIVATED EVIDENCE

Bar-Ilan University Department of Computer Science

Learning Verb Inference Rules from Linguistically-Motivated Evidence

by

Hila Weisman

Advisors: Prof. Ido Dagan and Dr. Idan Szpektor

Submitted in partial fulfillment of the requirements for the Master’s degree in the department of Computer Science

Ramat-Gan, Israel
March 2013 (Adar)

Copyright 2013

This work was carried out under the supervision of Prof. Ido Dagan and Dr. Idan Szpektor, Department of Computer Science, Bar-Ilan University.

Acknowledgments I would like to take this opportunity to thank the people who made this work possible. First, my advisor Ido Dagan. Ido gave me a chance to delve deeper into the fascinating field of NLP. He is a brilliant researcher, whose guidance has taught me how to observe, analyze and scientifically formalize ideas. I would also like to deeply thank Idan Szpektor, my second advisor. His creative ideas and broad knowledge in the field have been of tremendous help to me and my research. I wish to thank Jonathan Berant, my de-facto third supervisor. Working with Jonathan on a daily basis was an illuminating and educating experience. I have learned a lot about truly creative scientific research and diligence through working with him in ’Team Verb’. Thanks to my colleagues at the Bar-Ilan NLP lab: Ofer Bronstien, Naomi Zeichner, Asher Stern, Eyal Shnerch, Meni Adler, Erel Segal, Amnon Lotan, Lili Kotlerman and Chaya Lebeskind for illuminating coffee breaks and for their helpful and kind nature. Thanks to Mindel, Udi, Batels and Yotam for their support, patience and love during the whole time I was working on my thesis. And to Papa, in memoriam, for always pushing me to work hard and pursue my dreams. This work was partially supported by the Israel Science Foundation grant 1112/08, the PASCAL-2 Network of Excellence of the European Community FP7-ICT2007-1-216886, and the European Community Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287923 (EXCITEMENT).

Contents

1 Introduction
2 Background
  2.1 Distributional Similarity
  2.2 Co-Occurrence Methods
  2.3 Integrating Distributional Similarity with Co-occurrence
  2.4 Linguistic Background
    2.4.1 Fine-Grained Definition of Entailment
    2.4.2 Discourse Relations and Markers
    2.4.3 Verb Classes
3 Linguistically-Motivated Cues for Entailment
  3.1 Verb Co-occurrence
  3.2 Verb Semantics Heuristics
  3.3 Typed Distributional Similarity
4 An Integrative Framework for Verb Entailment Learning
  4.1 Representation and Model
    4.1.1 Sentence-level co-occurrence
    4.1.2 Features Adapted from Prior Work
    4.1.3 Document-level co-occurrence
    4.1.4 Corpus-level statistics
    4.1.5 Feature Selection and Analysis
  4.2 Learning Framework
5 Evaluation
  5.1 Evaluation on a Manually Annotated Verb Pair Dataset
    5.1.1 Experimental Setting
    5.1.2 Feature selection and analysis
    5.1.3 Results and Analysis
  5.2 Evaluation within Automatic Content Extraction task
    5.2.1 Experimental Setting
    5.2.2 Building a Verb Entailment Resource for ACE
    5.2.3 Evaluation Metrics and Results
    5.2.4 Error Analysis
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
Appendix A VerbOcean Web Patterns Experiment
  A.1 Experimental Setting
  A.2 Results
Appendix B Fine-grained Verb Entailment Annotation Guidelines
Appendix C Verb Syntax and Semantics
  C.1 Auxiliary and Light Verbs
  C.2 Verb Syntax
  C.3 Levin Classes

List of Figures

2.1 Related clauses, via the anchor 'Mr. Brown', and their extracted syntactic templates in Pekar (2008)
2.2 An automatically learned 'Prosecution' narrative chain in Chambers and Jurafsky (2008)
2.3 Classification of verb entailment to sub-types according to Fellbaum (1998)
2.4 A Sample Rhetorical Structure Tree, connecting four discourse units via rhetorical relations in Mann and Thompson (1988)
2.5 A mapping of discourse relations to relations proposed by other researchers: M&T - Mann and Thompson (1988); Ho - Hobbs (1990); A&L - Asher and Lascarides (2003); K&S - Knott and Sanders (1998)
3.1 A dependency parse of the sentence "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas"
3.2 Typed Distributional Similarity Feature vectors for the verb "see"
3.3 Adverb-typed Distributional Similarity vectors for the verbs "whisper" and "talk"
4.1 The backward elimination procedure, adapted from Rakotomamonjy (2003)
5.1 Evaluated rank-based methods and their Micro-average F1 as a function of top K rules
5.2 Co-occurrence Level classifiers and their Micro-average F1 as a function of top K rules

List of Tables

2.1 VerbOcean's semantic relations with an example of their associated pattern and an instantiating verb pair
4.1 Discourse relations and their mapped markers or connectives
5.1 Top 10 positively correlated features according to the Pearson correlation score and their co-occurrence level
5.2 Top 10 negatively correlated features according to the Pearson correlation score and their co-occurrence level
5.3 Top 10 features according to their single feature classifier F1 score and their co-occurrence level
5.4 Feature ranking according to minimal feature weights, with the feature's verb co-occurrence level
5.5 Average precision, recall and F1 for the evaluated models
5.6 Average precision, recall, F1, and the statistical significance level of the difference in F1 for each combination of co-occurrence levels
5.7 10 randomly selected correct entailment rules learned by our model but missed by TDS
5.8 Average number of positive verb pairs per seed with their frequency compared to the average total of 875 candidate rules per seed
5.9 MAP results of our model and the decision-rule based resources
5.10 Error types and their frequency from a random sample of 100 false matches
5.11 Example errors due to the naive matching mechanism
5.12 Example errors due to seed ambiguity

Abstract

Many natural language processing (NLP) applications need to perform semantic inference in order to complete their task. Such inference relies on knowledge that can be represented by inference rules between atomic units of language such as words or phrases. In this work, we focus on automatically learning one of the most important rule types, namely, entailment rules between verbs. Most prior work on learning such rules focused on a rather narrow set of information sources: mainly distributional similarity, and to a lesser extent manually constructed verb co-occurrence patterns. In this work, we claim that it is imperative to utilize information from various textual scopes: verb co-occurrence within a sentence, within a document, as well as overall corpus statistics. To this end, we propose an integrative model that improves rule learning by exploiting entailment cues: linguistically motivated indicators that are specific to verbs. These cues operate at different levels of verb co-occurrence and signal the existence (or non-existence) of the entailment relation between any verb pair. These novel cues are encoded as features together with other useful features adapted from prior work. Two experiments performed in different settings show that our proposed model outperforms, with a statistically significant improvement, state-of-the-art algorithms. Further feature analysis indicates that our novel linguistic features substantially improve performance and that features at all co-occurrence levels, and especially at the sentence level, contain valuable information for verb entailment learning.

Chapter 1 Introduction Semantic inference is the task of conducting inferences over natural language utterances. Many Natural Language Processing (NLP) applications need to perform semantic inferences in order to complete their task. For example, given a collection of documents, a Question Answering (QA) system needs to retrieve valid answers to questions posed in natural language. Such a system should infer that the sentence “Churros are coated with sugar” contains a valid answer for the question “What are Churros covered with?”. In Information Extraction (IE), systems need to find instances of target relations in text. For example, an IE system aiming to extract the relation “is enrolled at” would have to recognize that the sentence “Michelle studies History at NYU” implies that ’Michelle’ is the enrolee and that she is enrolled at ’NYU’. In order to make such inferences, these systems depend on knowledge which can be represented by inference or entailment rules. An entailment rule specifies a directional inference relation between two text fragments T and H. Entailment rules can relate lexical elements or words, e.g., the rule ‘Honda → car’ specifies that the meaning of the word ’Honda’ implies the meaning of the word ’car’. They may include slots for variables, e.g., ‘X bought Y → X purchased Y’ and also represent syntactic transformations, e.g., change a sentence from passive form to active form. In this work we will focus on automatically learning verb entailment rules, which describe a lexical entailment relation between two verbs, e.g., ‘whisper →


talk’, ‘win → play’ and ‘buy → own’. Verbs are the primary vehicle for describing events and relations between entities and as such, are arguably the most important lexical and syntactic category of a language (Fellbaum, 1990). Their semantic importance has led to active research in automatic learning of entailment rules between verbs or verb-like structures (Chklovski and Pantel, 2004; Zanzotto et al., 2006; Abe et al., 2008). Large knowledge bases of entailment rules can be constructed both manually or automatically. Manually-created rule bases provide extremely accurate semantic knowledge, but their construction is both time-consuming and their coverage limited. On the other hand, automated corpus-based rule learning methods are able to utilize the large amounts of text available today to provide broad coverage at minimum cost. Most prior efforts to automatically learn verb entailment rules from large corpora employed distributional similarity methods, assuming that verbs are semantically similar if they occur in similar contexts (Lin, 1998; Berant et al., 2012). This led to the automatic acquisition of large scale knowledge bases, but with limited precision. Fewer works, such as VerbOcean (Chklovski and Pantel, 2004), focused on identifying verb entailment through verb instantiation of manually constructed patterns. For example, the sentence “he scared and even startled me” implies that ‘startle → scare’. This led to more precise rule extraction, but with poor coverage since contrary to nouns, in which patterns are common (Hearst, 1992), verbs do not co-occur often within rigid patterns. However, verbs do tend to co-occur in the same document, and also in different clauses of the same sentence. In this work, we claim that on top of standard pattern-based and distributional similarity methods, corpus-based learning of verb entailment can greatly benefit from exploiting additional linguistically-motivated cues that are specific to verbs. For instance, when verbs co-occur in different clauses of the same sentence, the syntactic relation between the clauses can be viewed as a proxy for the semantic relation between the verbs. Moreover, we empirically show that in order to improve performance, it is crucial to combine information sources from different textual scopes such as verb co-occurrence within a sentence or within a document and distributional similarity over the entire corpus. 2

Our goal in this work is thus two-fold. First, to create a novel set of entailment cues that help detect the likelihood of lexical verb entailment. Our novel cues are specific to verbs and linguistically-motivated. Second, to encode these cues as features within a supervised classification framework and integrate them with other standard features adapted from prior work. This results in a supervised corpus-based learning method that combines verb entailment information at the sentence, document and corpus levels. We demonstrate the added power of these cues in two distinct evaluation settings, where our integrated model achieves a substantial and statistically significant improvement over state-of-the-art algorithms. First, we show a 17% improvement in F 1 measure compared to the best performing prior work on a manually annotated dataset of over 800 verb pairs. Then we utilize our verb entailment model in an applied Information Extraction setting and show marked improvement over several automated learning algorithms as well as manually created lexicons. In addition, we ascertain the importance of using features at different levels of co-occurrence by performing ablation tests and applying feature selection techniques. In Chapter 2 we provide background on prior work in the field of learning semantic relations between verbs. In Chapter 3, we introduce linguistically motivated cues that are specific to verbs and signal the entailment relation between verb pairs. In Chapter 4 we describe how these cues are encoded as features within a supervised classification framework. In Chapter 5, we evaluate our proposed model in two different settings: a manually annotated dataset and an applicationoriented setting. Then, we conclude with a short discussion about contributions and possible extensions to our work.


Chapter 2

Background

One can view the task of learning verb entailment rules as part of a more general task of learning semantic relations between lexical units. In the first part of this chapter we will survey the main approaches for learning verb entailment and other semantic relations between verbs. In the second part, we will describe works from both theoretical and applied linguistics which are required for the full understanding of the motivations underlying our proposed model.

2.1 Distributional Similarity

The main approach for learning lexical entailment rules between verbs has employed the distributional hypothesis (Harris, 1954) which states that two words are likely to have similar meanings if they co-occur, i.e., appear together, with similar sets of words in a corpus (a large, unstructured volume of text). For instance, we expect the verbs ‘buy’ and ‘purchase’ to appear with human beings as their subjects and inanimate commodities or companies as their objects, e.g., “He bought a new car” and “Shari Arison purchased a penthouse”. Relying on this hypothesis, distributional similarity methods measure the semantic relatedness of words by comparing the contexts in which the words appear in. The compared words are termed elements and each element is represented according to its context words. The contexts are in turn represented by feature vectors, which are lists of numerical values expressing a weighted frequency mea-


sure between the element and the feature. The feature vectors vary from purely lexical, based on words with no syntactic information, to purely syntactic, which are based on dependency relations between the element and the context. For example, suppose the element is the verb 'ask' and it appears in the following sentence: "He asked his mother for some help". Then the purely lexical representation of the verb 'ask' will have the features 'he', 'mother', 'help'; the lexical-syntactic representation will have the features 'subject-he', 'object-mother', 'preposition complement-help'; and the purely syntactic representation will have the features 'subject', 'object', 'preposition complement'. Once these feature vectors are constructed, a feature weighting function is computed between each feature and its corresponding element. Last, a similarity measure is computed between the weighted vectors in order to establish the degree of semantic relatedness between the two elements.

In his seminal paper, Lin (1998) proposed an information-theoretic distributional similarity measure to compute the semantic relatedness of two words. The presented framework starts by extracting dependency triples from a text corpus. A dependency triple consists of an element, a context word (feature), and the grammatical relationship between them in a given sentence. For example, given the sentence "He likes to read mystery novels" and the element 'novel', the following triples will be extracted: (mystery, adjective, novel), (novel, object, read). Then, a Mutual Information measure is applied to weigh the frequency counts of each feature f with element e:

MI(e, f) = log2 [ P(e, f) / (P(e) · P(f)) ]

Intuitively, this measures the probability that the common appearance of e and f is not random. Last, the author presents a novel lexical distributional similarity measure, referred to as the Lin similarity measure, which is based on the author's information-theoretic similarity theorem: "The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are". Thus, given two element words e1 and e2, the Lin measure is computed as the ratio between the information shared by the features of both words and the sum over the features of each word:

Lin(e1, e2) = [ Σ_{f ∈ F(e1) ∩ F(e2)} (MI(e1, f) + MI(e2, f)) ] / [ Σ_{f ∈ F(e1)} MI(e1, f) + Σ_{f ∈ F(e2)} MI(e2, f) ]

The above-mentioned method deals with a symmetric notion of similarity, for instance the word pairs ("dog", "animal") and ("animal", "dog") will receive high distributional similarity scores, regardless of their order. Yet, in order to deal with non-symmetric semantic relations, such as entailment, there is a need to devise directional distributional similarity measures. Using such measures, there will be a difference in the similarity score between word pairs, such that ("dog", "animal") will be scored higher than ("animal", "dog"). These measures generally follow the notion of distributional inclusion: if the contexts of a word w1 are properly included in the contexts of a word w2, then this might imply that w1 → w2. For instance, if the context vector representing the word 'startle' is subsumed by the context vector of the word 'surprise' then this might imply that 'startle → surprise'. Weeds and Weir (2003) have shown that the inclusion of one distributional vector into another correlates well with human intuitions about the relations between general and specific words, i.e., specific is subsumed by general ('startle' is subsumed by 'surprise'). Geffet and Dagan (2005) further refined the distributional similarity measure, relating distributional behaviour with lexical entailment. Later, Szpektor et al. (2007) proposed a directional distributional similarity measure termed "Balanced Inclusion", which balances the notion of directionality in entailment with the common notion of symmetric semantic similarity.

In general, using these methods results in broad-coverage resources for semantically related lexical items, but the relatedness is somewhat more loose than expected. In symmetric distributional similarity, it has been shown (Lin et al., 2003; Geffet and Dagan, 2005) that distributionally similar words tend to include not only synonyms but also antonyms, e.g., 'buy' and 'sell', and co-hyponyms, words that share a common hypernym, e.g., 'cruise' and 'raft'. In directional distributional similarity, high scores are often given to rules with obscure entailing verbs (Kotlerman et al., 2010). An orthogonal approach that mitigates these drawbacks is that of word co-occurrence.
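To make these measures concrete, the following is a minimal sketch of MI weighting and the Lin similarity computed over dependency-triple counts. The toy counts, feature labels and the positive-MI filter are illustrative assumptions, not the setup used in this thesis.

```python
import math
from collections import Counter

def mi_weights(cooc, elem_totals, feat_totals, total):
    """MI(e, f) = log2( P(e,f) / (P(e) * P(f)) ), keeping only positive weights."""
    weights = {}
    for (e, f), c in cooc.items():
        mi = math.log2((c / total) / ((elem_totals[e] / total) * (feat_totals[f] / total)))
        if mi > 0:
            weights[(e, f)] = mi
    return weights

def lin_similarity(e1, e2, weights):
    """Ratio of MI mass on shared features to the total MI mass of both elements."""
    f1 = {f for (e, f) in weights if e == e1}
    f2 = {f for (e, f) in weights if e == e2}
    shared = sum(weights[(e1, f)] + weights[(e2, f)] for f in f1 & f2)
    total = sum(weights[(e1, f)] for f in f1) + sum(weights[(e2, f)] for f in f2)
    return shared / total if total else 0.0

# Toy dependency-triple counts: (element, feature) -> frequency.
cooc = Counter({("buy", "obj:car"): 12, ("buy", "subj:he"): 20,
                ("purchase", "obj:car"): 6, ("purchase", "subj:he"): 4,
                ("sleep", "adv:soundly"): 9})
elem_totals, feat_totals = Counter(), Counter()
for (e, f), c in cooc.items():
    elem_totals[e] += c
    feat_totals[f] += c

w = mi_weights(cooc, elem_totals, feat_totals, sum(cooc.values()))
print(lin_similarity("buy", "purchase", w))
```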

2.2 Co-Occurrence Methods

In contrast to distributional methods, co-occurrence methods capture more exact and fine-tuned semantic relations by looking at the candidate lexical items as they appear together in varying scopes. However, their accuracy comes at a cost: as we limit the co-occurrence scope, we obtain higher precision and lower recall with regards to the target semantic relation. In her seminal paper in the field, Hearst (1992) describes a pattern-based method for the automatic acquisition of the Hyponymy1 (is-a) relation between nouns, e.g., Honda is a hyponym of car. She manually created lexical-syntactic patterns, which include both syntactic information consisting of Part of Speech (POS) slots, e.g., Noun Pharse (NP), as well as lexical information in the form of specific words. The underlying idea is that certain patterns can be thought of as surface cues to the semantic relation of hyponymy. For example, the patterns “N Py such as N Px ” and “N Px and other N Py ” often imply that N Px is a hyponym of N Py , as exemplified in “Bananas and other fruit” and “Shawn Penn and other actors”. Snow et al. (2005) generalize Hearst’s work by automatically identifying, instead of manually creating, lexico-syntactic patterns indicative of hyponymy. They start with a labeled set of example noun pairs. For instance, the pair (Honda, Car) will have a positive label since ’Honda’ is a hyponym of ’Car’ and the pair (Sun, Football) will have a negative label since ’Sun’ is not a hyponym of ’Football’. Next, they collect all the dependency paths connecting these labelled pairs in a large newsire corpus. They then use machine learning methods that learn the effectiveness of each such path (or pattern) in recognizing hyponymy and apply these patterns to recognize novel hyponymy noun pairs. Intuitively, working with noun patterns seems more promising than working with verb patterns, since nouns are naturally ordered in a hierarchy (Fellbaum, 1998) and co-occur more often within the same text fragment. Indeed, less work has been done using verb pattern-based methods. However, one prominent work taking this approach is Chklovsky and Pantel’s VerbOcean (2004). In VerbOcean, the authors use lexical-syntactic patterns to discover semantic relations between 1

The converse relation is termed Hypernymy


verbs. Similarly to Hearst, they manually construct 33 patterns which are divided into five pattern groups, with each group corresponding to a different semantic relation: similarity, strength, antonymy, enablement and happens-before. Table 2.1 presents the relations alongside a sample indicative lexical-syntactic pattern and an ordered verb pair instantiating the pattern.

Semantic Relation    Example Pattern        Example Verb Pair        Symmetric
Similarity           X and Y                (transform, integrate)   Yes
Strength             X and even Y           (wound, kill)            No
Antonymy             either X or Y          (open, close)            Yes
Enablement           Yed by Xing the        (fight, win)             No
Happens-before       X and eventually Y     (buy, sell)              No

Table 2.1: VerbOcean's Semantic relations with an example of their associated pattern and an instantiating verb pair

In order to construct a large resource of fine-grained semantically-related verbs, they use highly associated verb pairs (30K verb pairs from Dekang and Pantel (2001)) and for each verb pair, test its association strength with the aforementioned semantic relations using the following information-theoretic measure:

Sp(v1, v2) = P(v1, p, v2) / (P(p) · P(v1) · P(v2))

Where (v1 , v2 ) is an ordered pair of verbs, p is a specific pattern out of the 33 available patterns (e.g., “X and eventually Y”) and the probabilities in the above measure estimated by Google hit counts. It is worth noting that while working with web counts largely increase the statistics, the counts are volatile and noisy (Lapata and Keller, 2005), and our preliminary empirical results have corroborated this (see Chapter 4.1.2). The use of lexical-syntactic patterns helps achieve high precision, since it correlates well with specific semantic relations. However this comes at a price of low recall, since patterns impose considerable restrictions on the form of verb cooccurrence in corpus. Thus, a major challenge for pattern-based methods is to 8
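The measure can be sketched as follows. This is a toy illustration rather than the VerbOcean implementation: the counts are invented and, unlike the original, are not estimated from Google hit counts.

```python
def pattern_association(v1, v2, pattern, joint_counts, pattern_counts, verb_counts, n):
    """Sp(v1, v2) = P(v1, p, v2) / (P(p) * P(v1) * P(v2)), with all probabilities
    estimated from the supplied frequency counts over a collection of size n."""
    p_joint = joint_counts.get((v1, pattern, v2), 0) / n
    if p_joint == 0.0:
        return 0.0
    p_pattern = pattern_counts[pattern] / n
    return p_joint / (p_pattern * (verb_counts[v1] / n) * (verb_counts[v2] / n))

# Illustrative counts only.
joint = {("buy", "X and eventually Y", "sell"): 120}
patterns = {"X and eventually Y": 50_000}
verbs = {"buy": 2_000_000, "sell": 1_500_000}
N = 10_000_000_000

print(pattern_association("buy", "sell", "X and eventually Y", joint, patterns, verbs, N))
```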

mitigate the sparseness problem. One possible solution is to adopt a more relaxed notion of co-occurrence, i.e., utilize less rigid and more "loose" surface cues that exist when two verbs co-occur within a broader, more varied scope. Tremper and Frank (2011) suggest a fine-grained supervised method for learning the presupposition semantic relation. The authors utilize surface cues from the co-occurrence of verbs in the same sentence, without the need of specific rigid patterns, and discern between five fine-grained semantic relations: presupposition, entailment, temporal inclusion, antonymy and synonymy (see 2.4.1 for definitions) using a weakly supervised learning approach. They utilize several shallow syntactic features for verb pairs that co-occur in the same sentence: the verb’s form (e.g., the tense of the verb) and part-of-speech contexts (the part-of-speech of the two words preceding and following each verb), alongside more deep syntactic features such as co-reference binding between the verbs’ arguments, i.e., two verbs sharing the same subject or object. Pekar (2008) further broadens the scope of verb co-occurrence for the task of event entailment. Event entailment is a specific form of verb entailment, where one event is the antecedent, denoted as vq and another is the consequent, denoted as vp . If the antecedent occurred than it implies that the consequent also occurred, for example ‘buy → belong’. Pekar focuses on verb co-occurrence in the same discourse unit. A discourse unit is defined as a paragraph or a few adjacent clauses, where a clause is defined as a verb with its dependent constituents. For example, in the sentence “Although John bought a new laptop, he was unhappy” there are two clauses “John bought a new laptop” and “he was unhappy”. Intuitively, Pekar postulates that if two verbs co-occur within a locally coherent text, they are more likely to stand in an event entailment relation. This assumption is based on Centering Theory (Grosz et al., 1995), a linguistic theory of discourse, which states that a coherent text segment tends to focus on a specific entity, here termed as anchor. Building an unsupervised classifier for the task of recognizing event entailment, the proposed algorithm first identifies pairs of clauses that are related in the local discourse. Then, it uses features such as textual proximity, constituents referring to the anchor and pronouns as anchors. Next, it constructs templates from these clauses and chooses verb pairs that instantiate the templates (see Figure 2.1). 9

Figure 2.1: Related clauses, via the anchor ’Mr. Brown’, and their extracted syntactic templates in Pekar (2008) Last, it calculates an information-theoretic measure between the verbs to test their association and discover the direction of entailment in each pair. Given an ordered verb pair (vq , vp ) the following information-theoretic score is computed:

Score_Pekar(vq, vp) = [ P(vq | vp) · log( P(vq | vp) / P(vq) ) ] / [ Σ_i P(vq_i | vp) · log( P(vq_i | vp) / P(vq_i) ) ]

Figure 2.2: An automatically learned ’Prosecution’ narrative chain in Chambers and Jurafsky (2008)

2.3

Integrating Distributional Similarity with Co-occurrence

An important line of work that naturally follows from the two above mentioned approaches, namely Distributional Similarity and Co-occurrence, is to integrate them in order to exploit the benefits provided by these orthogonal approaches (i.e., high precision vs. high recall). Observing that the two approaches are largely complementary, Mirkin et al. (2006) introduce a system for extracting lists of entailment rules between nouns, e.g., ‘police → organization’. The authors rely on two knowledge extractors: a pattern-based extractor, using Hearst patterns, and a distributional extractor applied to a set of entailment seeds. They create a novel feature set based on both pattern-based and distributional data, e.g., a feature that computes the probability of a noun pair to appear in a specific Hearst pattern, or an intersection feature whose value is ’1’ if the noun pair is identified by both the 11

pattern-based and distributional methods and ’0’ otherwise. Later, Pennacchiotti and Pantel (2009) worked within the field of Information Extraction and expanded Mirkin et al.’s work by augmenting the feature space with a large set of features extracted from a web-crawl. Hagiwara et al. (2009) propose an integrated approach for the task of Synonym identification. They suggest that in order to fully integrate these two orthogonal methods, one needs to reformulate the distributional similarity approach. Thus, instead of using common features in different vectors in order to compare the commonality of the elements’ context, the authors propose to directly create vectors of distributional features, which represent the degree of context commonality shared by the elements. For example, given the elements “opera” and “ballet”, instead of constructing two different vectors for each element and comparing them according to some similarity measure, the authors propose to build a single feature vector, where each entry represents a contexts the elements share, e.g., given the sentences “He is going to the opera” and “The Welches are going to the ballet”, the feature ’go-obj’ will appear in the pair’s joint vector. In this thesis we aim to follow this line of work and encode co-occurrence cues at both the sentence level and document level as well as distributional similarity information, and incorporate these cues as features within a supervised framework for learning lexical entailment rules between verbs.

2.4

Linguistic Background

As mentioned before, we aim to devise novel and valuable entailment cues that are specific to verbs and are linguistically-motivated. In the following section, we will review the related literature in linguistics that has inspired the construction of many of our novel features.

2.4.1

Fine-Grained Definition of Entailment

The entailment relation between verbs may hold for many different reasons: one verb may represent an action or a state whose manner is more specific than the 12

other, e.g., ’dancing’ is more specific than ’moving’ and ‘dance → move’. Or it may be that one verb denotes an action or state which presupposes another action/state, e.g., ’winning’ a tournament presupposes ’playing’ in a tournament and ‘win → play’. Fellbaum (1998) proposed a verb entailment hierarchy, presented in Figure 2.3, which captures these fine-grain differences. Here, the entailment relation is first divided in two according to the Temporal Inclusion criteria. Then, if the the criteria is met, yet another division is performed, this time according to the Troponymy relation:

Figure 2.3: Classification of verb entailment to sub-types according to Fellbaum (1998)

Temporal Inclusion: A verb V1 is said to temporally include a verb V2 if there is some stretch of time during which the activities or states denoted by the two verbs co-occur, but there exists no time during which V2 occurs and V1 does not. Proper Inclusion: A verb V1 is said to properly include a verb V2 if it temporally includes it and there exists a stretch of time during which V1 occurs but V2 does not, i.e., V1 and V2 are not temporally co-existent. The direction of entailment here is V2 → V1 . Troponymy: A verb V1 is said to be a troponym of V2 if it is temporally (but not properly) included in V2 . In addition, this relation can be expressed by the formula “To V1 is to V2 in some particular manner” and is comparable to the 13

semantic relation of Hyponymy between nouns. The direction of entailment here is V1 → V2 . To illustrate the temporal inclusion division, let us examine two examples: 1. ‘dance → move’: the activity of ’dancing’ in temporally included in the activity of ’moving’, as dancing is an activity that occurs while moving. Furthermore, dance is not properly included in move, since they are always co-existent: one must necessarily be moving every instant that one is dancing. And indeed, to dance is to move in some particular manner. 2. ‘dream → sleep’: the activity of ’dreaming’ is temporally included in the activity of ’sleeping’, as ’dreaming’ is an activity that occurs while ’sleeping’. Furthermore, dream is properly included in sleep, since one cannot dream whilst one is not sleeping and indeed there exists a stretch of time when one sleeps and does not dream. There are two remaining relations in the entailment hierarchy that do not confirm to the Temporal Inclusion criteria: Backward Presupposition and Causation. Backward Presupposition: a verb V1 is said to backward presuppose a verb V2 if the occurrence of the activity denoted by V2 occurs before V1 and is a prerequisite of the activity denoted by V1 , with the direction of entailment being V1 → V2 . For example, ’win’ presupposes ’play’ since the event of ’winning’ occurred after the event of ’playing’ and indeed one must have played in order to win, and ‘win → play’. We note that the reverse relation is called Presupposition, with the direction of entailment being reversed to V2 → V1 . Causation: is a relation between a verb V1 , the cause, and a verb V2 , the effect. This relation holds when V1 is a causative verb of change that occurs before V2 to produce it. V2 is the result of V1 and is a resultative verb, with the direction of entailment being V1 → V2 . For example, the event of showing an object to a person causes this person to see the object and the corresponding entailment rule is ‘show → see’. 14

2.4.2

Discourse Relations and Markers

Discourse relations between textual units are an essential way for humans to properly interpret or produce text. In their seminal paper, Mann and Thompson (1988) present a theory of text organization named Rhetorical Structure Theory (RST). RST is a static, descriptive linguistic theory that deals with what makes a text coherent. It defines a text as coherent due to the discourse (rhetorical) relations which link together its various segments. Figure 2.4 presents such tree structure, which first connects four discourse units into two units by using the ’Purpose’ and ’Non-volitional cause’ discourse relations and subsequently merges the two units into one coherent discourse unit by using the ’Elaboration’ discourse relation.

Figure 2.4: A Sample Rhetorical Structure Tree, connecting four discourse units via rhetorical relations in Mann and Thompson (1988) Discourse relations are semantic relations that connect two text segments, where the smallest segments are clauses and the largest are paragraphs. These relations are conceptual and can be marked explicitly in text by discourse markers (also called ’connectives’) e.g., “because”, “so”, “however”, “although”. The basic idea is that a discourse marker serves to relate the content of connected segments in a specific type of relationship between the text segments. Many researchers (Hobbs, 1979; Rosner and Stede, 1992; Knott and Dale, 1994, 1996) have created various taxonomies of discourse markers. Of special interest to us are the works that try to automatically learn the discourse relations present in a given text. Marcu and Echihabi (2002) use large amounts of labelled data to automatically learn four discourse relations: contrast, explanation-evidence, condition 15

and elaboration. These four relations have been chosen as a sufficient taxonomy of discourse relations and the authors provide a mapping to other taxonomies proposed in previous works (see Figure 2.5). In our work with adopt their taxonomy, with some alterations, and aim to use these markers as insight to the deeper semantic relations between clauses and their corresponding main verbs (see Chapter 3 for more details).

Figure 2.5: A mapping of discourse relations to relations proposed by other researchers: M&T - Mann and Thompson (1988); Ho - Hobbs (1990); A&L - Asher and Lascarides (2003); K&S - Knott and Sanders (1998) Lapata and Lascarides (2004) propose a data-intensive and unsupervised approach that automatically captures the temporal discourse relation between clauses of the same sentence. To that end, the authors suggest a probabilistic framework where the temporal relations are learned by gathering informative features, such as the verb’s tense, aspect, modality and voice, in order to infer a likely ordering of the two clauses. For instance, in the sentence “Many employees will lose their job once the merge is completed”, ’lose’ which is the verb in the main clause appears with a future modality, active voice and imperfective aspect, while ’complete’ which is the verb in the subordinate clause appears with no modality, passive voice and imperfective aspect. In our work, we utilize similar temporal features in order to infer the prevalent temporal relation between verb pairs, which may assist in detecting the direction of entailment between the verb pair (see Chapter 4.1.2).

16

2.4.3

Verb Classes

There are two predominant taxonomies of verbs semantics: WordNet (Fellbaum, 1998) and Levin Classes (Levin, 1993). WordNet is an on-line English lexicon which groups lexicalized concepts to sets of synonyms, termed as synsets. These synsets are in turn connected in a hierarchy by means of lexical and semantic relations, such as Synonymy, Antonymy and Troponymy. Some relations are specific to verbs, such as Troponymy and Entailment and they assist our annotation process (see Chapter 5.1.1) and also serve as a baseline in one of our evaluation settings (see Chapter 5.2). In contrast, the Levin verb classification aims to capture the close relationship between the syntax and semantics of verbs. The underlying assumption is that verb meaning and argument realization, which are alterations in the expression of verbal arguments such as subject, object and prepositional complement, are jointly linked . Thus, looking at verbs with shared patterns of argument realization provides a way of classifying verbs in a meaningful way (see Appendix C for more details). Levin’s coupling of a verb’s behaviour to its meaning is a predominant insight that is utilized throughout this work, and we specifically utilize the idea of verb classification as one of our novel entailment cues (see Chapter 3.2).

17

Chapter 3 Linguistically-Motivated Cues for Entailment In order to fully understand our model and its underlying ideas, we will now outline the motivations for our proposed novel entailment cues: linguistically motivated indicators that are specific to verbs, operate at different levels of verb cooccurrence and cue the existence (or non-existence) of the entailment relation between verb pairs.

3.1

Verb Co-occurrence

Our focus is on utilizing the information embedded when two verbs appear together (co-occur) within a textual unit. We aim to capture this information in different levels of co-occurrence, with an emphasis on devising novel cues at the sentence level, since we believe that co-occurrence at this level bears important information that has been previously overlooked. Sentences are hierarchically structured grammatical units, composed of one or more clauses. We conjecture that the hierarchical organization of a sentence may be pertinent to the semantic organization of the verbs heading the corresponding clauses. The clauses can be coordinated such as in “The lizard moved and raised its head” or subordinate such as in “If we want to win the tournament, we must practice vigorously”. In the latter case, the clauses can be linked explicitly via a

18

subordinating conjunction. More formally: Subordination: one clause is subordinate to another, if it semantically depends on it and cannot be fathomed without it. The dependent clause is called a subordinate clause and the independent clause is called the superordinate (or main) clause. Subordinating Conjunction: a word that functions to link a subordinate clause to its superordinate clause. Many discourse markers are also subordinating conjunctions such as because, unless, since, if. As described in 2.4.2, Discourse markers are lexical terms such as ‘because’ and ‘however’ that indicate a semantic relation between discourse fragments (propositions or speech acts). We suggest that discourse markers may signal the semantic relation between the main verbs of the connected clauses. For example, in the sentence “He always snores while he sleeps”, the marker ‘while’ indicates a temporal inclusion relation between the clauses, indicating that ‘snore → sleep’. Often the relation between clauses is not expressed explicitly via an overt conjuncture, but is still implied by the syntactic structure of the sentence. To that end, we aim to utilize the grammatical relations as labelled by dependency parsers. Dependency parsers offer a functional view of the grammatical relationships in a sentence, where the sentence components are termed “constituents” and are ordered in a hierarchy, where each edge is labelled with the grammatical dependency between the constituents. The hierarchy is rooted by the main verb, which is either the only non-auxiliary1 verb in the sentence, or the non-auxiliary verb of the superordinate clause. For example, given the sentence “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas”, a dependency parser will output a labelled hierarchy (or tree) such as in Figure 3.1. We conjecture that the dependency relation between co-occurring verbs in a sentence can give insight to the semantic relation between them. For instance, verbs can be connected via labeled dependency edges expressing that one clause is an adverbial adjunct2 of the other. Such co-occurrence structure does not indicate 1

see Appendix C for more details on auxiliary verbs. Adverbial: clause elements that typically refer to circumstances of time, space, reason and manner. Adjunct: an optional constituent of a construction, providing auxiliary information. 2

19

Figure 3.1: A dependency parse of the sentence “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas” a deep semantic relation, such as entailment, between the two verbs. For example, in the sentence “David saw Tamar as she was walking down the street”, the verbs ’see’ and ’walk down’ are connected via an adverbial adjunct relation and indeed the two events are temporally connected, but neither verb entails the other.

3.2

Verb Semantics Heuristics

Levin (1993) hypothesized that “the behaviour of a verb...is to a large extent determined by its meaning. Thus verb behaviour can be used effectively to probe for linguistically relevant, pertinent aspects of verb meaning”. We aim to utilize this hypothesis and look at the behaviour of verbs in a corpus in order to gain insight to their semantics, i.e., whether a verb describes a state or event, whether it is general or specific etc. This, in turn, provides some notion of the verb’s likelihood to participate in an entailment relation. Verb generality Verb-particle constructions (VPC’s) are multi-word expressions consisting of a head verb and a particle, e.g., switch off (Baldwin and Villavicencio, 2002). We conjecture that the more general a verb is, the more likely it is to appear with many different particles. In particular, detecting verb generality 20

may assist in tackling an infamous property of distributional similarity methods, namely, the difficulty in detecting the direction of entailment (Berant et al., 2012). For example, the verb ’cover’ appears with many different particles such as ’up’ and ’for’, while the verb ’coat’ does not. Thus, assuming we have evidence for an entailment relation between the two verbs, this indicator can help us discern the direction of entailment and determine that ‘coat → cover’. On the other hand, if the entailing verb (v1 ) is more general than the entailed verb (v2 ) this could be a cue for non-entailment, since we assume that entailment is a relation than goes from specific to general. Inspired by Levin’s hypothesis, we look at the number of different VPC’s a verb appears in, and deduce its generality/specificity from its propensity to appear in such constructions. Verb classes As described in Chapter 2.4.3, verb classes are sets of semanticallyrelated verbs sharing some linguistic properties. We wished to utilize the Levin classification in order to gain insight to the semantics of verbs and their propensity to entail. However, the Levin classes have been reported to have inconsistences, with verbs appearing in multiple classes (Kipper et al., 2000). In addition, due to the granularity of the classes (48 main classes and 191 subdivided classes), it was unclear whether there existed a mapping from the classes with their combinations L × L, to their propensity to entail. Due to these reasons, we chose to classify verbs according to the major conceptual categories of ’State’ and ’Event’, adopted by both Jackendoff (1983) and Dowty (1979): Stative verbs are verbs that express a state, a stable situation which holds for a certain time interval, e.g., “own”, “see”, “think”, “believe”. Event verbs are verbs that express an event, a varying situation which evolves through a series of temporal phases, e.g., “buy”, “run”, “take”. With the fine-grained definition of entailment in mind (Chapter 2.4.1), we conjecture that the verbs in an entailing verb pair should usually belong to the same verb class: Temporal Inclusion and Backward Presupposition deal mostly with events and Troponymy can hold between two events or two states. In Causation, on the other hand, the cause verb has a complex event structure (Levin, 1993) and thus belongs to the Event class while the effect verb, denoting the change caused 21

by the causative verb, is usually a State verb, e.g., ‘show → see’, ‘buy → own’. Conversely, if v1 belongs to the ’State’ class and v2 belongs to the ’Event’ class then we deduce that the verbs are less likely to entail.
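The two heuristics could be encoded roughly as follows. This is a sketch that assumes particle sets per verb and a small State/Event lexicon are already available; all data shown is invented.

```python
EVENT, STATE = "event", "state"

def generality_cue(particles_v1, particles_v2):
    """Number of distinct particles each verb appears with; a much more
    particle-rich v1 than v2 argues against the direction v1 -> v2."""
    return len(particles_v1), len(particles_v2)

def class_compatibility(class_v1, class_v2):
    """Entailing pairs are expected to share a class, or to go Event -> State
    (Causation, e.g. 'buy' -> 'own'); State -> Event is a negative cue."""
    if class_v1 == STATE and class_v2 == EVENT:
        return -1.0
    if class_v1 == class_v2 or (class_v1 == EVENT and class_v2 == STATE):
        return 1.0
    return 0.0

# Illustrative data only.
particles = {"cover": {"up", "for", "over"}, "coat": {"with"}}
classes = {"buy": EVENT, "own": STATE}

print(generality_cue(particles["coat"], particles["cover"]))  # (1, 3): 'coat' is the more specific verb
print(class_compatibility(classes["buy"], classes["own"]))    # 1.0: Event -> State fits Causation
```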

3.3

Typed Distributional Similarity

As discussed in Chapter 2.1, distributional similarity is the most common source of information for learning semantic relations between verbs. Yet, we suggest that on top of standard distributional similarity measures, which take several verbal arguments (such as subject, object) into account simultaneously, we should also focus on each type of argument independently. Figure 3.2 shows our proposed representation of the verb “see” with five feature vectors: one for each verbal dependent: adverb, subject, object, preposition complement, and a feature vector which holds all the dependents together (termed ’all’). This feature vector corresponds to the typical feature vector representation in standard distributional similarity approaches.

Figure 3.2: Typed Distributional Similarity Feature vectors for the verb "see"

In this work, we apply this representation to compute distributional similarity between verbs based on the set of adverbs that modify them. Our hypothesis, which is based on Distributional Inclusion (see Chapter 2.1), is that adverbs may contain relevant information for capturing the direction of entailment: if a verb appears with a small set of adverbs, it is more likely to be a specific verb that already conveys a specific action or state, making some adverbs redundant. For example, the verb 'whisper' conveys a specific manner of talking and usually does not appear with the adverbs 'loudly', 'clearly', 'openly' and so forth. Since we adopt the Inclusion Hypothesis of Entailment (Dagan and Glickman, 2004), i.e., that the specific verb entails the general verb, the verb with a smaller set of adverb modifiers should entail the verb with a larger set of adverb modifiers. For instance, as shown in Figure 3.3, the verb 'whisper' is represented by a small feature vector of adverbs, while the verb 'talk' is represented by a large feature vector of adverbs, indicating that the entailment direction for the verb pair is 'whisper → talk'. We surmise that measuring directional similarity based solely on adverb modifiers could reveal this phenomenon and assist in establishing the direction of entailment.

Figure 3.3: Adverb-typed Distributional Similarity vectors for the verbs “whisper” and “talk”
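A rough sketch of this directional cue, assuming the adverb modifiers of each verb have already been collected from a parsed corpus (the adverb sets below are invented):

```python
def adverb_inclusion(adverbs_v1, adverbs_v2):
    """Fraction of v1's adverb modifiers that also modify v2.
    A high value for (v1, v2) together with a low value for (v2, v1) suggests v1 -> v2."""
    if not adverbs_v1:
        return 0.0
    return len(adverbs_v1 & adverbs_v2) / len(adverbs_v1)

# Invented adverb sets for illustration only.
whisper = {"softly", "quietly", "urgently"}
talk = {"softly", "quietly", "urgently", "loudly", "openly", "clearly", "freely"}

print(adverb_inclusion(whisper, talk))  # high: 'whisper' is covered by 'talk'
print(adverb_inclusion(talk, whisper))  # low: the reverse direction
```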


Chapter 4

An Integrative Framework for Verb Entailment Learning

In the previous chapter we discussed linguistic observations regarding novel cues that may help in detecting entailment relations between verbs. We next describe how we incorporated these cues as features within a supervised framework for learning lexical entailment rules between verbs. We follow prior work on supervised lexical semantics (Mirkin et al., 2006; Hagiwara et al., 2009; Tremper and Frank, 2011) and address the rule learning task as a classification task. The classification task includes two distinct phases: a training phase, where we obtain a labelled set of representative verb pairs as a training set for the verb entailment classifier, and a testing phase, where we apply the classifier on new unlabelled verb pairs in order to estimate their entailment likelihood. In both training and testing, verb pairs are represented as feature vectors. We next describe the feature space model and then proceed to detail the construction of our verb entailment classifier.

4.1 Representation and Model

We first discuss how our novel indicators, as well as other diverse sources of information adapted from prior verb semantics works, are encoded as features. Most of our features are based on information extracted from the target verb pair co-occurring within varying textual scopes (sentence, document, corpus). Hence, we group the features according to their related scope. Naturally, when the scope is small, i.e., at a sentence level, the semantic relation between the verbs is easier to discern but the information may be sparse. Conversely, when co-occurrence is loose the relation is harder to discern but coverage is increased.

Discourse Relations    Discourse Markers
Contrast               although, despite, but, whereas, notwithstanding, though
Cause                  because, therefore, thus
Condition              if, unless
Temporal               whenever, after, before, until, when, finally, during, afterwards, meanwhile

Table 4.1: Discourse relations and their mapped markers or connectives.

4.1.1 Sentence-level co-occurrence

Discourse markers As discussed in Chapter 3, discourse markers may signal relations between the main verbs of adjacent clauses. The literature is abundant with taxonomies that classify markers to various discourse relations (Mann and Thompson, 1988; Hovy and Maier, 1993; Knott and Sanders, 1998). Inspired by Marcu and Echihabi (2002), we employ markers that are mapped to four discourse relations ’Contrast’, ’Cause’, ’Condition’ and ’Temporal’, as specified in Table 4.1. We chose this rather concise classification, as we did not wish to conflate the feature space with features relating to overly-specified and at times irrelevant1 discourse relations. For a target verb pair (v1 , v2 ) and each discourse relation r, we count the number of times that v1 is the main verb in the main clause, v2 is the main verb in the subordinate clause, and the clauses are connected via a marker mapped to the discourse relation r. For example, given the sentence “I stayed here because I’ve never known a place as beautiful as this” the verb pair (‘stay’,‘enjoy’) appears 1

1 In the sense that the relation does not link clauses, but sentences, e.g., the Elaboration relation.


in the ’Causation’ relation, indicated by the marker ‘because’, where ‘stay’ is in the main clause and “enjoy” is in the subordinate clause. To establish that the marker does indeed link the appropriate clauses, we verified that the marker is directly linked to at least one of the verbs in the dependency tree. Next, each count is normalized by the total number of times (v1 , v2 ) appear with any discourse marker. The same procedure is done when v1 is in the subordinate clause and v2 in the main clause. We term the features by the relevant discourse relation, e.g., ‘v1-contrast-v2’ refers to v1 being in the main clause and connected to the subordinate clause via a ’Contrast’ discourse marker. Dependency relations between clauses As noted in Chapter 3, the syntactic structure of verb co-occurrence can indicate the existence or lack of entailment. In dependency parsing this may be expressed via the label of the dependency relation connecting the main and subordinate clauses. In our experiments we used the UkWac2 corpus which was parsed by the dependency parser MALT (Nivre et al., 2006). Thus, we identified three pertinent MALT dependency relations connecting verbs of two clauses. Other relations proved to be mainly erroneous, either at the dependency parser level or the POS tagging level. The first relation is the object complement relation ‘obj’. In this case the subordinate clause acts as a clause complement to the main clause. For example, in the sentence “it surprised me that the lizard could talk”, the verb pair (‘surprise’,‘talk’) is connected via the ‘obj’ relation. The second relation is the adverbial adjunct relation ‘adv’, in which the subordinate clause is an adverbial adjunct of the main clause e.g., “he gave his consent without thinking about the repercussions”. The third and last relation is the coordination relation ‘coord’, in which the two clauses form a coordinated structure. In a coordinated structure, as opposed to a subordinate structure, the sub-units are combined in a non-hierarchal manner (Bluhdorn, 2008), e.g., “every night my dog Lucky sleeps on the bed and my cat Flippers naps in the bathtub”. We conjecture that the first two relations could provide information for identifying non-entailment, i.e., be a negative feature of entailment, while the third relation could provide information for identifying entailment, i.e., be a positive feature of entailment. 2

http://wacky.sslmit.unibo.it/doku.php?id=corpora

26

Similar to discourse markers, we compute for each verb pair (v1 ,v2 ) and each dependency label d the proportion of times that v1 is the main verb of the main clause, v2 is the main verb of the subordinate clause, and the clauses are connected via a dependency relation d, out of all the times they are connected by any dependency relation. We term the features by the dependency label, e.g., ‘v1-adv-v2’ refers to v1 being in the main clause and connected to the subordinate clause via an adverbial adjunct.
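To make the computation of these sentence-level structure features concrete, the following is a minimal sketch in Python. The input format and helper names (e.g., the pre-extracted clause pairs and the small marker-to-relation map) are assumptions for illustration, not the exact implementation used in this work:

```python
from collections import defaultdict

# Illustrative subset of a marker-to-relation mapping (cf. Table 4.1).
MARKER_TO_RELATION = {"because": "cause", "although": "contrast",
                      "if": "condition", "after": "temporal"}

def cooccurrence_structure_features(clause_pairs):
    """clause_pairs: iterable of (v1, v2, marker, dep_label) tuples, one per
    sentence in which v1 heads the main clause and v2 the subordinate clause."""
    marker_counts = defaultdict(lambda: defaultdict(int))   # (v1, v2) -> relation -> count
    dep_counts = defaultdict(lambda: defaultdict(int))      # (v1, v2) -> dep label -> count
    for v1, v2, marker, dep_label in clause_pairs:
        if marker in MARKER_TO_RELATION:
            marker_counts[(v1, v2)][MARKER_TO_RELATION[marker]] += 1
        if dep_label in {"obj", "adv", "coord"}:
            dep_counts[(v1, v2)][dep_label] += 1

    features = {}
    for pair, counts in marker_counts.items():
        total = sum(counts.values())
        for rel, c in counts.items():
            # proportion out of all co-occurrences with any discourse marker
            features[(pair, f"v1-{rel}-v2")] = c / total
    for pair, counts in dep_counts.items():
        total = sum(counts.values())
        for dep, c in counts.items():
            # proportion out of all co-occurrences with any dependency label
            features[(pair, f"v1-{dep}-v2")] = c / total
    return features
```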

4.1.2 Features Adapted from Prior Work

In order to create an integrative model for automatically learning verb entailment rules, we explored ways of utilizing ideas from previous work on automatically learning various semantic relations between verbs, altering them to better suit the task of learning verb entailment rules.

VerbOcean Patterns  We follow Chklovski and Pantel (2004) and extract occurrences of VerbOcean patterns that are instantiated by the target verb pair. As mentioned in Chapter 2, VerbOcean patterns were originally grouped into five semantic classes. Based on a preliminary study we conducted (see Appendix A for the full experiment), we decided to utilize only four strength-class patterns as positive indicators for entailment, e.g., "he scared and even startled me", and three antonym-class patterns as negative cues for entailment, e.g., "you can either open or close the door". We note that these patterns are also commonly used by RTE systems (see http://aclweb.org/aclwiki/index.php?title=RTE_Knowledge_Resources#Ablation_Tests). Since the corpus pattern counts were very sparse, we defined for a target verb pair (v1, v2) two binary features: the first denotes whether the verb pair instantiates at least one positive pattern, and the second denotes whether the verb pair instantiates at least one negative pattern. For example, given the aforementioned sentences, the value of the positive feature for the verb pair ('startle','scare') is '1'. Patterns are directional, and so the value of ('scare','startle') is '0'.
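As an illustration only, here is a sketch of how such binary pattern features could be instantiated with two surface patterns. The patterns below are stand-ins for the actual strength- and antonym-class pattern inventory, and matching a real corpus would require handling all verb inflections:

```python
import re

def verbocean_features(v1, v2, sentences):
    """Binary pattern features for the directed pair (v1, v2)."""
    # "scared and even startled" -> positive evidence for ('startle', 'scare')
    positive = re.compile(rf"\b{re.escape(v2)}\w*\s+and\s+even\s+{re.escape(v1)}\w*\b", re.I)
    # "either open or close" -> negative (antonym-like) evidence
    negative = re.compile(rf"\beither\s+{re.escape(v1)}\w*\s+or\s+{re.escape(v2)}\w*\b", re.I)
    pos = any(positive.search(s) for s in sentences)
    neg = any(negative.search(s) for s in sentences)
    return {"vo-positive": int(pos), "vo-negative": int(neg)}
```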

Different Polarity  Inspired by the use of verb polarity in Lapata and Lascarides (2004) and Tremper and Frank (2011), we compute the proportion of times that a target verb pair (v1, v2) appears in different polarity. For example, in "he didn't say why he left", the verb 'say' appears in negative polarity and the verb 'leave' in positive polarity. Such a change in polarity could be an indicator of non-entailment between the two verbs.

Tense ordering  The temporal relation between verbs may provide information about their semantic relation (Lapata and Lascarides, 2004). Thus, for each verb pair co-occurrence we extract the verbs' tenses and order them as follows: past < present < future. We then add the features 'tense-v1 < tense-v2', 'tense-v1 = tense-v2' and 'tense-v1 > tense-v2', corresponding to the proportion of times the tense of v1 is smaller than, equal to, or greater than the tense of v2. For instance, given the sentence "He was talking to Sarah when the alarm went off", both the verb 'talk' and the VPC 'go off' appear in past tense and the extracted feature will be 'tense-v1 = tense-v2'. This ordering indicates the prevalent temporal relation between the verbs in the corpus and may assist in detecting the direction of entailment:

• 'tense-v1 = tense-v2' could be indicative of Troponymy or Temporal Inclusion.
• 'tense-v1 > tense-v2' could be indicative of Presupposition, or a negative feature.

In order to deal with the phantom future tense, we heuristically classify a verb as future tense if the verb itself appears in present simple form and is preceded by a modal such as "will", "would" or "shall". For instance, given the verb pair (leave, talk) and the sentence "He will talk to her before he leaves", the extracted tense ordering feature will be 'tense-v1 < tense-v2'.

Distance between verbs  We also measure how far apart v1 and v2 appear within the sentence. The grammatical distance is the number of dependency edges on the path connecting v1 and v2 in the dependency parse of the sentence, binned into d ≤ 3, 3 < d ≤ 7 and d > 7. We normalize the counts by the number of times the verbs co-occur within a sentence and term the features according to their associated bin, e.g., 'gram-path-upto-3' refers to v1 and v2 appearing in a sentence with at most three dependency edges connecting them in the dependency parse of the sentence. Similar features are computed for the distance in words (lexical distance), with bins 0 < d < 5, 5 ≤ d ≤ 10 and d > 10. These features provide insight into the relatedness of the verbs, and we hypothesize that the larger the distance of appearance, the less meaningful the co-occurrence.

Sentence-level PMI  The pointwise mutual information (PMI) between v1 and v2 is computed, where the co-occurrence scope is a sentence. Higher PMI should hint at semantically related verbs.

\[
PMI_{sen}(v_1, v_2) = \log \frac{P_{sen}(v_1, v_2)}{P_{sen}(v_1, v^*) \cdot P_{sen}(v_2, v^*)}
\]

where P_sen(·) is the probability of two verbs co-occurring in a sentence and v* is any non-auxiliary verb in the UkWac corpus. The P_sen(·) probabilities are estimated as:

\[
P_{sen}(v_i, v_j) = \frac{count_{sen}(v_i, v_j)}{\sum_{v, v^* \in V} count_{sen}(v, v^*)}
\]

where count_sen(vi, vj) is the number of sentences in which vi and vj co-occur, and V is the set of all non-auxiliary verbs in the UkWac corpus.
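A minimal sketch of this computation, assuming the sentence-level co-occurrence counts have already been collected into a dictionary (the data structures are illustrative, not the actual implementation):

```python
import math
from collections import defaultdict

def sentence_pmi(pair_counts):
    """pair_counts: dict mapping frozenset({v1, v2}) to the number of
    sentences in which the two (non-auxiliary) verbs co-occur."""
    total = sum(pair_counts.values())
    marginal = defaultdict(int)          # sentences in which v co-occurs with any verb
    for pair, c in pair_counts.items():
        for v in pair:
            marginal[v] += c

    def pmi(v1, v2):
        joint = pair_counts.get(frozenset({v1, v2}), 0)
        if joint == 0:
            return float("-inf")         # the pair never co-occurs at this scope
        p_joint = joint / total
        p1, p2 = marginal[v1] / total, marginal[v2] / total
        return math.log(p_joint / (p1 * p2))

    return pmi
```

The document-level variant described later is identical in form, with counts collected per document instead of per sentence.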

4.1.3 Document-level co-occurrence

The following group of features addresses co-occurrence of a target verb pair within the same document. These features are less sparse, but tend to capture coarser semantic relations between the target verbs. Narrative score Chambers and Jurafsky (2008) suggested a method for learning sequences of actions or events (expressed by verbs) in which a single entity is involved. They proposed a PMI-like narrative score:

\[
PMI_{Cham}(e(v_1, r_1), e(v_2, r_2)) = \log \frac{P(e(v_1, r_1), e(v_2, r_2))}{P(e(v_1, r_1)) \cdot P(e(v_2, r_2))}
\]

where e(v1, r1) represents the event denoted by the verb v1 and the dependency relation r1, e.g., e(push, subject), and similarly e(v2, r2) represents the event denoted by the verb v2 and the dependency relation r2, e.g., e(drive, object). Ultimately, this score estimates whether a pair consisting of a verb and one of its dependency relations (v1, r1) is narratively related to another such pair (v2, r2). Their estimation is based on quantifying the likelihood that two verbs will share an argument that instantiates both the dependency positions (v1, r1) and (v2, r2) within documents in which the two verbs co-occur. For example, given the document "Lindsay Lohan was prosecuted for DUI. Lindsay Lohan was convicted of DUI.", the pairs ('prosecute','subj') and ('convict','subj') share the argument 'Lindsay Lohan' and are thus part of a narrative chain. Such narrative relations may provide cues to the semantic relatedness of the verb pair.

We thus compute for every target verb pair nine features using their narrative score. In four features, r1 = r2 and the common dependency is either a subject, an object, a preposition complement (e.g., "we meet at the station."), or an adverb, termed chambers-subj, chambers-obj, and so on. In the next three features, r1 ≠ r2 and r1, r2 denote either a subject, object, or preposition complement (adverbs never instantiate the subject, object or preposition complement positions), termed chambers-subj-obj and so on. Last, we add as two features the average of the four features where r1 = r2 (termed chambers-same), and the average of the three features where r1 ≠ r2 (termed chambers-diff).

Document-level PMI  Similar to sentence-level PMI, we compute the PMI between v1 and v2, but this time the co-occurrence scope is a document.

\[
PMI_{doc}(v_1, v_2) = \log \frac{P_{doc}(v_1, v_2)}{P_{doc}(v_1, v^*) \cdot P_{doc}(v_2, v^*)}
\]

where P_doc(·) is the probability of two verbs co-occurring in a document and v* is any non-auxiliary verb in the UkWac corpus. The P_doc(·) probabilities are estimated as:

\[
P_{doc}(v_i, v_j) = \frac{count_{doc}(v_i, v_j)}{\sum_{v, v^* \in V} count_{doc}(v, v^*)}
\]

where count_doc(vi, vj) is the number of documents in which vi and vj co-occur, and V is the set of all non-auxiliary verbs in the UkWac corpus.
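A rough sketch of how a narrative score in this spirit could be estimated from per-document extractions. The input format and normalization are simplifying assumptions for illustration, not the exact procedure used for the thesis features:

```python
import math
from collections import Counter

def narrative_pmi(doc_events):
    """doc_events: list of documents, each a list of (verb, dep_relation, argument)
    triples, e.g., ("prosecute", "subj", "Lindsay Lohan")."""
    pair_counts, single_counts, total = Counter(), Counter(), 0
    for doc in doc_events:
        for i, (v1, r1, a1) in enumerate(doc):
            single_counts[(v1, r1)] += 1
            total += 1
            for v2, r2, a2 in doc[i + 1:]:
                if a1 == a2 and v1 != v2:          # shared argument -> narrative link
                    pair_counts[((v1, r1), (v2, r2))] += 1

    def score(e1, e2):
        joint = pair_counts.get((e1, e2), 0) + pair_counts.get((e2, e1), 0)
        if joint == 0:
            return float("-inf")
        p_joint = joint / total
        p1 = single_counts[e1] / total
        p2 = single_counts[e2] / total
        return math.log(p_joint / (p1 * p2))

    return score
```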

4.1.4 Corpus-level statistics

The final group of features ignores sentence or document boundaries and is based on overall corpus statistics.

Distributional similarity  Following our hypothesis regarding typed distributional similarity (Chapter 3.3), we represent the verbs as vectors in five different feature spaces, where each vector space corresponds to a different argument or modifier: the verb's subjects and objects (the proper noun 'Dmitri' is the subject and 'book' is the object of the verb 'read' in the sentence "Dmitri reads an interesting book"), prepositional complements (the noun 'station' in the sentence "We meet at every station"), adverbs (the adverb 'exactly' in the sentence "They look exactly alike"), and the joint vector of the aforementioned vectors. Following this representation, we first compute for each verb and each argument a separate vector that counts the number of times each word in the corpus instantiates the argument position of that verb. We also compute a vector that is the concatenation of the previous separate vectors, which captures the standard distributional similarity statistics. We then apply three state-of-the-art distributional similarity measures, Lin (Lin, 1998), Weeds precision (Weeds et al., 2004) and BInc (Szpektor and Dagan, 2008), to compute for every verb pair a similarity score between each of the five count vectors (we employ the common practice of using the PMI between a verb and an argument, rather than the argument count, as the argument's weight). In total, there are 15 typed distributional similarity features, for all combinations of five argument types and three distributional similarity measures. We term each feature by its combined distributional similarity measure and argument type, e.g., weeds-prep and lin-all represent the Weeds measure over prepositional complements and the Lin measure over all arguments, respectively.

Verb classes  Following our discussion in Chapter 3.2, we designed a feature f which heuristically estimates for each target verb v ∈ (v1, v2) its likelihood of belonging to the 'Stative' verb class, by computing the proportion of times v appears in progressive tense out of all of v's corpus occurrences. The intuition is that stative verbs usually do not appear in the progressive tense, e.g., the progressive form of the stative verb 'know', 'knowing', has a low corpus frequency. Then, given a verb pair (v1, v2) and their corresponding stative features f1 and f2, we add two features, f1 · f2 and f1/f2, which capture the interaction between the verb classes of the two verbs. As previously discussed, we hypothesize that certain class configurations relate to entailment while others relate to non-entailment. For instance, a higher f1 · f2 means that both verbs lean towards the 'Event' verb class, and as such are more likely to entail. A lower f1/f2 means that v1 is highly associated with the 'State' verb class while v2 is highly associated with the 'Event' verb class, and as such they are less likely to entail.
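As an illustration of the typed similarity computation, here is a minimal sketch of the directional Weeds precision measure over PMI-weighted argument vectors; it is a simplified stand-in for the exact formulations cited above:

```python
def weeds_precision(u_weights, v_weights):
    """Directional similarity of u -> v over one argument type.
    u_weights, v_weights: dicts mapping argument words to PMI weights
    (e.g., the PMI-weighted object vector of a verb)."""
    shared = sum(w for arg, w in u_weights.items() if arg in v_weights and w > 0)
    total = sum(w for w in u_weights.values() if w > 0)
    return shared / total if total > 0 else 0.0

# A typed feature such as 'weeds-obj' would call this on the two verbs'
# object vectors; 'weeds-all' on the concatenation of all argument vectors.
```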

Verb generality  For each verb v ∈ (v1, v2), we add as a feature the number of different particles it appears with in the corpus, cv, following the hypothesis that this is a cue to its generality (see Chapter 3.2). Then, given a verb pair (v1, v2) and their corresponding counts cv1 and cv2, we add the feature cv1/cv2. We expect that when cv1/cv2 is high, v1 is more general than v2, which is a negative cue for entailment.

In summary, we compute for each verb pair (v1, v2) 63 features, which are in turn passed as input to a machine learning classifier. We next describe a framework for selecting and analyzing the predictive power of these features.
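Before moving on, a small sketch of how the verb-class and generality interaction features above could be derived from raw corpus counts. The statistics dictionary and its field names are hypothetical placeholders for the underlying corpus counts:

```python
def class_and_generality_features(v1_stats, v2_stats):
    """Each stats dict is assumed to hold: 'progressive_count', 'total_count',
    and 'num_particles' (number of distinct particles the verb appears with)."""
    def stative_score(s):
        # proportion of progressive occurrences; low values suggest a stative verb
        return s["progressive_count"] / max(s["total_count"], 1)

    f1, f2 = stative_score(v1_stats), stative_score(v2_stats)
    eps = 1e-9  # avoid division by zero for verbs never seen in the progressive
    return {
        "stative-product": f1 * f2,          # high -> both verbs lean 'Event'
        "stative-ratio": f1 / (f2 + eps),    # low  -> v1 'State', v2 'Event'
        "generality-ratio": v1_stats["num_particles"] / max(v2_stats["num_particles"], 1),
    }
```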

4.1.5 Feature Selection and Analysis

Since our model contains many novel features, it was important to investigate their utility for detecting verb entailment. To that end, we implemented several feature selection methods, as suggested by Guyon and Elisseeff (2003). Feature ranking methods are widely used mainly because of their computational efficiency, as they require only the computation of n scores (n being the number of features in the feature space). However, Guyon and Elisseeff (2003) outline several examples where a feature that is completely 'useless' on its own (i.e., ranked very low in the feature ranking procedure) provides a significant performance improvement when joined with other features. Thus, it is useful to look at subsets of features together, and not just in isolation. These methods belong to the following approaches:

Feature Ranking Approach  Given a set of labelled examples, feature ranking methods make use of a scoring function S computed between each example's feature values and its corresponding label (in our case +1 if 'v1 → v2' and -1 otherwise). To that end, we implemented two methods for feature ranking: the first uses the Pearson correlation criterion as its scoring function and the second uses the performance of a classifier built with a single feature as its scoring function. The Pearson correlation criterion is defined as:

\[
R(i) = \frac{cov(X_i, Y)}{\sqrt{var(X_i)\,var(Y)}}
\]

where X_i is the random variable corresponding to the i-th component of an input feature vector and Y is the random variable corresponding to the appropriate label ∈ {1, −1}. The estimate we used in order to approximate the Pearson criterion is:

\[
R^*(i) = \frac{\sum_{k=1}^{m}(x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m}(x_{k,i} - \bar{x}_i)^2 \; \sum_{k=1}^{m}(y_k - \bar{y})^2}}
\]

where x_{k,i} is the value of the i-th feature in example k, x̄_i is the average of the values of feature i, y_k is the label of example k and ȳ is the average of the labels.

Single-feature classifier  The underlying idea here is to select features according to their individual predictive power, using the performance of a classifier built with only a single feature as the ranking criterion. To that end, we performed the following:

1. For each feature:
   (a) Sort the feature values of all examples, in both descending and ascending order (to account for class polarity).
   (b) Go through the sorted list, consider each example as the separating hyperplane and compute the corresponding F1 measure.
   (c) We now have two maxima, one for each class polarity. Compare them and choose the maximal one.
2. Sort the features according to their maximal F1 value, to create a ranked list of features.

We now have two ranked lists of our 63 features, which we utilize in order to evaluate the discriminative power of each feature. These rankings also allow us to corroborate or refute our hypotheses regarding the polarity of each feature, i.e., whether it is a positive (or negative) cue for entailment, as we conjectured in the previous chapter.
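A minimal sketch of the single-feature ranking procedure above, with binary labels in {+1, −1} and each observed value treated as a candidate cut-off (a quadratic-time illustration rather than the incremental scan described in step 1b):

```python
def single_feature_f1(values, labels):
    """Best F1 achievable by thresholding a single feature, trying both polarities
    (predicting 'entailing' either above or below the threshold)."""
    def f1(preds):
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == -1)
        fn = sum(1 for p, y in zip(preds, labels) if p == -1 and y == 1)
        if tp == 0:
            return 0.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    best = 0.0
    for threshold in sorted(set(values)):          # each example value is a candidate hyperplane
        above = [1 if v >= threshold else -1 for v in values]
        below = [1 if v <= threshold else -1 for v in values]
        best = max(best, f1(above), f1(below))     # two class polarities
    return best

# Features are then ranked by their single_feature_f1 score.
```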

[Figure 4.1: The backward elimination procedure, adapted from Rakotomamonjy (2003)]

Feature Subset Selection  The feature subset selection approach examines features in the context of classification, in order to find a subset of features that together have good predictive power. Guyon and Elisseeff describe three methodologies for feature subset selection, and we chose to implement a variant of the embedded methodology, as it offers a simple yet powerful way to address the problem of feature selection. We implemented a greedy algorithm, backward elimination, which removes the feature with the minimal weight (each feature f is assigned a weight |wf| according to its relevance and role in the classification process) at each subset size of features. Figure 4.1 delineates the backward elimination procedure. As a result of applying the procedure to our feature set, we obtain a ranked list of features according to their effect on the classifier (Rakotomamonjy, 2003). Looking at the top of the list allows us to discern redundant features, while looking at the bottom of the list provides us with a different view of the features' utility, as there might be features that did not show high correlation and usefulness on their own, but which, when applied in tandem with other features, provide useful discriminative information.
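A sketch of greedy backward elimination driven by the magnitude of linear-SVM weights, using scikit-learn's LinearSVC as a hedged stand-in for the SVM implementation actually used:

```python
import numpy as np
from sklearn.svm import LinearSVC

def backward_elimination(X, y, feature_names):
    """Repeatedly train a linear SVM and drop the feature with the smallest |weight|.
    Returns the features in the order they were eliminated (last = most useful)."""
    remaining = list(range(X.shape[1]))
    elimination_order = []
    while len(remaining) > 1:
        clf = LinearSVC().fit(X[:, remaining], y)
        weights = np.abs(clf.coef_[0])
        drop = remaining[int(np.argmin(weights))]   # minimal-weight feature
        elimination_order.append(feature_names[drop])
        remaining.remove(drop)
    elimination_order.append(feature_names[remaining[0]])
    return elimination_order
```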

4.2 Learning Framework

We next describe our procedure for learning a generic verb entailment classifier, which can be used to estimate the entailment likelihood for any given verb pair (v1, v2). Supervised approaches learn from a set of positive and negative examples of entailment. We construct such a set by starting with a list of candidate verbs, termed seeds. For each seed, we extract the top k candidates for entailment using either a pre-computed distributional similarity score or an outside resource such as WordNet. We then have a list of verb pairs of the form (seed, candidate) and (candidate, seed), which are annotated for entailment in order to create a labelled set of examples (see Chapter 5.1.1). This set is utilized as a training set for a Support Vector Machines (SVM) classifier. SVM is a prominent learning method used for binary classification, introduced by Cortes and Vapnik (1995). The basic idea is to find a hyperplane (termed the separating hyperplane) which separates the d-dimensional data (where d is the number of features) into two classes, with a maximal margin between them.
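End to end, the framework amounts to training a binary classifier over the 63-dimensional feature vectors. A minimal sketch, again with scikit-learn's LinearSVC standing in for the SVM-perf implementation used in the actual experiments:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_entailment_classifier(feature_vectors, labels):
    """feature_vectors: (n_pairs, 63) array of per-pair features;
    labels: +1 if 'v1 -> v2' holds, -1 otherwise."""
    X, y = np.asarray(feature_vectors), np.asarray(labels)
    clf = LinearSVC().fit(X, y)
    return clf

# At prediction time, clf.decision_function(x) gives a score that can be
# thresholded (or used for ranking) to estimate the entailment likelihood
# of a new verb pair.
```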

Chapter 5

Evaluation

In this chapter, we evaluate our proposed model and establish that our linguistically motivated entailment cues, and their integration as features in a supervised learning framework, significantly outperform previous state-of-the-art baselines. There does not exist, at present, a common framework for evaluating entailment rules and resources (Mirkin et al., 2009). However, one option for evaluating entailment rules is to ask annotators to directly judge the correctness of each entailment pair, which yields an explicit and optimal judgment of rules (Dagan and Glickman, 2004) but can prove to be time- and resource-consuming. Another option is to measure the rules' impact on an end task within an NLP application. This evaluation approach shows the resource's importance and its relevancy in a real-world application setting. In this chapter we utilize both of these evaluation settings and demonstrate the improved performance of our learning model.

5.1 Evaluation on a Manually Annotated Verb Pair Dataset

As argued by Dagan and Glickman (2004), the best judgment for a given entailment pair can be provided by human experts, since "people attribute meanings to texts and can perform inferences over them". In order to obtain a direct assessment of the results, we conducted a manual evaluation of our model based on human judgments (Szpektor et al., 2007).

5.1.1 Experimental Setting

To evaluate our model, we constructed a dataset containing verb pairs annotated for entailment and non-entailment. We started by randomly sampling 50 verbs from a list of the 3500 most common words in English, according to Rundell and Fox (2003), which we denoted as seed verbs. Next, we extracted the 20 most similar verbs to each seed verb according to the Lin similarity measure (Lin, 1998), which was computed on the RCV1 corpus (http://trec.nist.gov/data/reuters/reuters.html). Then, for each seed verb vs and each of its extracted similar verbs vsi we generated the two directed pairs (vs, vsi) and (vsi, vs), which represent the candidate rules 'vs → vsi' and 'vsi → vs' respectively. To reduce noise, we filtered out verb pairs where one of the verbs is an auxiliary or a light verb such as 'do', 'get' and 'have' (see Appendix C for the full list). This resulted in 812 verb pairs as our dataset (available at http://www.cs.biu.ac.il/~nlp/downloads/verb-pair-annotation.html), which were manually annotated by two annotators as representing either a valid entailment rule or not, according to the following framework:

Annotation Framework  Supervised approaches learn from a set of positive and negative examples for entailment. Thus, we need to label the candidate verb pairs as either positive or negative examples of entailment. A positive example is an ordered verb pair (v1, v2), where v1 entails v2. A negative example is an ordered verb pair (v1, v2), where v1 does not entail v2. To perform the annotation, we generally followed the rule-based approach for entailment rule annotation (Lin and Pantel, 2001; Szpektor et al., 2004) with the following guidelines:

1. The example rule 'v1 → v2' is judged as correct if the annotator could think of reasonable contexts under which the rule holds, i.e., v1 entails v2. The entailment relation does not have to hold under every possible context, as long as the context for which it holds is a natural one and not too obscure or anecdotal. For example, the verb 'win' can be used in the context of winning a war or winning a game. The latter use is frequent enough to infer that the rule 'win → play' is correct, since in order to win a game you must have first played the game, despite the fact that "winning" a war does not entail "playing" a war.

2. The rule 'v1 → v2' is judged as incorrect if no reasonable context can be found as evidence for the correctness of the entailment rule, i.e., v1 does not entail v2.

We wanted to ascertain that the manual annotations comply with WordNet. We therefore verified that verb pairs appearing under certain WordNet relations were also manually annotated as entailing. These relations are either directly mapped to entailment through the 'Entailment' and 'Cause' relations, or mapped to more general relations that are commonly used by RTE systems (http://aclweb.org/aclwiki/index.php?title=RTE_Knowledge_Resources): 'Hyponymy' and 'Synonymy'. In total, 315 verb pairs were labeled as entailing (the rule 'v1 → v2' was judged as correct) and 497 verb pairs were labeled as non-entailing (the rule 'v1 → v2' was judged as incorrect). The Inter-Annotator Agreement (IAA) for a random sample of 100 pairs was moderate (0.47), as expected from the rule-based approach (Szpektor et al., 2007).

For each verb pair, all 63 features within our model were computed using the UkWac corpus (Baroni et al., 2009). UkWac is the largest freely available resource for English, with over 2 billion words, complete with POS tags and dependency parsing. For classification, we utilized SVM-perf's (Joachims, 2005) linear SVM implementation with default parameters (maximizing F1), and evaluated our model by performing 10-fold cross validation over the labeled dataset.

5.1.2 Feature selection and analysis

As discussed in Chapter 4.1.5, we followed the feature ranking methods proposed by Guyon and Elisseeff (2003) to investigate the utility of our proposed features. Table 5.1 depicts the 10 most positively correlated features with entailment according to the Pearson correlation measure.

Rank | Top Positive Feature | Pearson Score | Co-occurrence Level
---- | -------------------- | ------------- | -------------------
1    | Weeds-adverb         | 0.1945        | Corpus
2    | Chambers-obj         | 0.1646        | Document
3    | v1-coord-v2          | 0.1529        | Sentence
4    | Weeds-pmod           | 0.1431        | Corpus
5    | Binc-adverb          | 0.1405        | Corpus
6    | Weeds-all            | 0.1337        | Corpus
7    | Binc-obj             | 0.1198        | Corpus
8    | v2-coord-v1          | 0.1153        | Sentence
9    | Sentence-level PMI   | 0.1082        | Sentence
10   | Weeds-sbj            | 0.1024        | Corpus

Table 5.1: Top 10 positively correlated features according to the Pearson correlation score and their co-occurrence level.

From the table, it is clear that distributional similarity features, appearing at ranks 1, 4-7 and 10, are amongst the most positively correlated with entailment, which is in line with prior work (Geffet and Dagan, 2005; Kotlerman et al., 2010). Looking more closely, our suggestion for typed distributional similarity proved to be useful, and indeed most of the highly correlated distributional similarity features are typed measures. Standing out are the adverb-typed measures, with two features in the top 10, including the highest, 'Weeds-adverb', as well as 'BInc-adverb'. We also note that the highly correlated distributional similarity measures, Weeds and BInc, are directional. The table also indicates that document and sentence-level co-occurrence contribute positively to entailment detection. This includes both the Chambers narrative measure, with the typed feature Chambers-obj, and the coordinated clauses, with 'v1-coord-v2'. Finally, we note that PMI at the sentence level is correlated with entailment even more highly than at the document level (PMI documents is further down the list, at rank 22), since the local textual scope is more indicative, though sparser.

Table 5.2 depicts the 10 most negatively correlated features with entailment according to the Pearson correlation measure.

Rank | Top Negative Feature    | Pearson Score | Co-occurrence Level
---- | ----------------------- | ------------- | -------------------
1    | v1-adv-v2               | -0.1056       | Sentence
2    | v2-obj-v1               | -0.0860       | Sentence
3    | v1-obj-v2               | -0.0846       | Sentence
4    | v2-obj-v1               | -0.0841       | Sentence
5    | tense-v1 > tense-v2     | -0.0751       | Sentence
6    | lexical-distance 0-5    | -0.0710       | Sentence
7    | verb generality (f1/f2) | -0.0688       | Corpus
8    | tense-v1 < tense-v2     | -0.0638       | Sentence
9    | VerbOcean positive      | -0.0619       | Sentence
10   | v1-temporal-v2          | -0.0549       | Sentence

Table 5.2: Top 10 negatively correlated features according to the Pearson correlation score and their co-occurrence level.

Looking at the table, one immediately sees that many of our novel co-occurrence features at the sentence level contribute useful information for identifying non-entailment. For example, as hypothesized in Chapter 4.1.1, verbs connected via an adverbial adjunct ('v2-adv-v1' and 'v1-adv-v2') or an object complement ('v1-obj-v2' and 'v2-obj-v1') are negatively correlated with entailment. In addition, the novel 'verb generality' feature, as well as the tense difference feature ('tense-v1 > tense-v2'), also prove to be relatively strong cues for non-entailment. We note, however, that the averaged absolute value of the negative scores (0.0755) is somewhat smaller than the averaged value of the positive scores (0.1158), which could slightly decrease the impact of the negative signals.

Table 5.3 presents the ranking of features according to the single-feature classifier method. Looking at the results, we note that the features 'v1-coord-v2', 'v2-coord-v1' and 'Sentence-level PMI' are both positively correlated with entailment and achieve high F1 scores in the single-feature classification. We also note that all presented features relate either to the sentence or the document level of verb co-occurrence. Specifically, 50% of the features are variants of the narrative score (Chapter 4.1.3), which relates to verb co-occurrence at the document level.

In order to fully grasp a feature's contribution it is crucial to view it in the context of other subsets of features. To that end, we implemented the backward elimination algorithm, as described in Chapter 4.1.5.

Rank | Single Feature      | F1 of Feature | Co-occurrence Level
---- | ------------------- | ------------- | -------------------
1    | Chambers-obj        | 0.5784        | Document
2    | v1-coord-v2         | 0.5748        | Sentence
3    | Sentence-level PMI  | 0.5695        | Sentence
4    | Chambers-adverb     | 0.5680        | Document
5    | v2-coord-v1         | 0.5644        | Sentence
6    | Chambers-same dep   | 0.5607        | Document
7    | Chambers-subject    | 0.5599        | Document
8    | Chambers-pmod       | 0.5598        | Document
9    | PMI documents       | 0.559         | Document
10   | v1-condition-v2     | 0.559         | Sentence

Table 5.3: Top 10 features according to their single-feature classifier F1 score and their co-occurrence level.

Table 5.4 shows the features found to be useful by the SVM classifier at almost every subset size, i.e., features that remained until the last 10 iterations (Maximal Weight). The table also presents the features whose removal caused the minimal change in the first 10 iterations (Minimal Weight). Looking at the maximal-weight features, we note again the even distribution of features across all co-occurrence levels: Sentence, Document and Corpus. We also notice that most of these features have appeared in the previous top-10 feature tables. Three features, however, have consistently been at the bottom of the feature ranking methods: 'Tense-v1=Tense-v2', 'Chambers-obj-pmod' and 'Positive VerbOcean'. We see the converse phenomenon with the top 10 minimal-weight features, where four features appear both in the previous top-feature ranking lists and in the minimal-weight feature list. In order to further understand these results, we checked the Pearson correlation score between the features, and saw that there are interactions between the features that influence their utility for the classifier. For instance, the feature ranked 2nd in the list, 'v2-adv-v1', has a high correlation with the lexical distance feature 0 < dist < 5, in accordance with the adverbial adjunct structure, which usually occurs when the verbs appear close to each other, e.g., "David saw Tamar as she was walking down the street".

Max. Weight Feature | Co-occurrence Level | Min. Weight Feature | Co-occurrence Level
------------------- | ------------------- | ------------------- | -------------------
PMI documents       | Document            | Weeds-obj           | Corpus
Binc-adv            | Corpus              | v2-adv-v1           | Sentence
Lin-sbj             | Corpus              | diff-polarity       | Sentence
Weeds-adv           | Corpus              | v1-temporal-v2      | Sentence
Positive VerbOcean  | Sentence            | gram-path-upto-3    | Sentence
PMI Sentences       | Sentence            | v2-coord-v1         | Document
Chambers-obj        | Document            | Chambers-adv        | Corpus
Chambers-obj-pmod   | Document            | v1-adv-v2           | Sentence
Tense-v1=Tense-v2   | Sentence            | v2-generality       | Corpus
Tense-v1>Tense-v2   | Sentence            | Binc-all            | Corpus

Table 5.4: Feature ranking according to minimal feature weights, with each feature's verb co-occurrence level.

To conclude, our feature analysis shows that features at all three levels (sentence, document and corpus) contain valuable information for verb entailment learning and detection, and thus should be combined together. Furthermore, many of our novel features are amongst the most highly correlated features, showing that our intuitions when devising a rich set of verb-specific and linguistically-motivated features were correct. Our feature subset selection analysis, however, shows that the learning process is more complex, with many interactions and correlations between features. Thus, it is best to judge the features' contribution quantitatively within the classifier, for example using evaluation metrics such as Precision and F1 on a test set. We shall next describe the results of such evaluation settings, starting with the evaluation results on the manually annotated dataset.

5.1.3 Results and Analysis

We next present the experimental results and analysis that show the improved performance of our novel verb entailment model compared to the following baselines, which were mostly taken from or inspired by prior work:

Random: A simple decision rule: for any pair (v1, v2), randomly classify as "yes" with a probability equal to the proportion of entailing verb pairs out of all verb pairs in the labeled dataset (i.e., 315/812 = 0.387).

VO-KB: A simple unsupervised rule: for any pair (v1, v2), classify as "yes" if the pair appears in the strength relation in the VerbOcean knowledge base, which was computed over web counts and is commonly used in RTE systems (see http://aclweb.org/aclwiki/index.php?title=RTE_Knowledge_Resources#Ablation_Tests).

VO-UkWac: A simple unsupervised rule: for any pair (v1, v2), classify as "yes" if the value of the positive VerbOcean feature is '1' (see Chapter 4.1.2 for details on the VerbOcean feature construction).

TDS: Includes the 15 distributional similarity features in our supervised model. This baseline extends Berant et al. (2012), who trained an entailment classifier over several distributional similarity features. This baseline provides an evaluation of the discriminative power of distributional similarity alone, without co-occurrence features.

TDS+VO: Includes the 15 distributional similarity features and the two VerbOcean features in our supervised model. This baseline is inspired by Mirkin et al. (2006), who combined distributional similarity features and Hearst patterns (Hearst, 1992) for learning entailment between nouns.

All: Our full-blown model, including all features described in Chapter 4.1.

For all tested methods, we performed 10-fold cross validation and macro-averaged Precision, Recall and F1 over the 10 folds. Table 5.5 presents the results of our full-blown model as well as the baselines. We first note that, as expected, the VerbOcean baselines (VO-KB and VO-UkWac) provide low recall, due to the sparseness of rigid pattern instantiations for verbs both in the UkWac corpus and on the web. They do, however, provide precise rules, with VO-KB being the most precise baseline.

Method   | P%    | R%    | F1%
-------- | ----- | ----- | -----
All      | 48.73 | 85.09 | 61.81
TDS      | 47.59 | 60.44 | 52.94
TDS+VO   | 46.55 | 59.73 | 52.02
Random   | 36.27 | 35.23 | 35.74
VO-KB    | 53.57 | 14.28 | 22.55
VO-UkWac | 22.64 | 3.80  | 6.52

Table 5.5: Average precision, recall and F1 for the evaluated models.

Second, it is the combination of all types of information sources that yields the best performance: our complete model, employing the full set of features, outperforms all other models in terms of both recall and F1. Its improvement in terms of F1 over the second-best model (TDS), which includes all distributional similarity features, is 17% and is statistically significant according to the Wilcoxon signed-rank test at the 0.01 level (Wilcoxon, 1945). This result shows the benefits of integrating linguistically motivated co-occurrence features with traditional pattern-based and distributional similarity information. Another interesting observation is that the VO features seem to slightly decrease precision and recall, possibly due to their sparsity.

To further investigate the contribution of features at various co-occurrence levels, we trained and tested our model with all possible combinations of feature groups corresponding to a certain co-occurrence scope: sentence, document and corpus. Table 5.6 presents the results of these tests. The most notable result of this analysis is that sentence-level features play an important role within our model. Sentence-level features alone (Sent-level) provide the best discriminative power for verb entailment, with a statistically significant improvement in F1 compared to the document and corpus levels. Yet, we note that sentence-level features alone do not capture all the information within our model, and they should be combined with one of the other feature groups to reach performance close to the complete model.

Co-occurrence Level         | P%    | R%    | F1%   | Statistical Diff.
--------------------------- | ----- | ----- | ----- | -----------------
Sent+Doc+Corpus-level (ALL) | 48.73 | 85.09 | 61.81 | -
Sent+Corpus-level           | 47.75 | 84.44 | 60.88 | -
Sent+Doc-level              | 47.82 | 84.20 | 60.82 | -
Doc+Corpus-level            | 48.15 | 83.22 | 60.74 | -
Sent-level                  | 44.71 | 84.65 | 58.27 | 0.01
Doc-level                   | 44.03 | 79.22 | 56.39 | 0.01
Corpus-level                | 45.73 | 66.36 | 53.9  | 0.01

Table 5.6: Average precision, recall, F1, and the statistical significance level of the difference in F1 for each combination of co-occurrence levels.

Furthermore, it is interesting to examine the statistical significance level: our full model outperforms classifiers built with features from a single level of co-occurrence with a statistical significance of 0.01, while for classifiers composed of features from two different levels of co-occurrence, the difference is not statistically significant. This shows that by combining only two co-occurrence levels, we almost reach the performance of the full model. Specifically, as can be seen in Table 5.6, the Sent+Doc-level model performs almost as well as the full model, with no statistically significant difference. This subset may be a good substitute for the full model, since its features are easier to extract from large corpora, as they may be computed in an on-line fashion, processing one document at a time, as opposed to corpus-level features whose computation must be done off-line.

As a final analysis (Table 5.7), we randomly sampled 20 correct entailment rules learned by our model but missed by the typed distributional similarity classifier (TDS). Looking at Table 5.7, our overall impression is that employing co-occurrence information helps to better capture entailment sub-types other than 'synonymy' and 'troponymy'. For example, our model recognizes 'acquire → own', which corresponds to the 'cause-effect' entailment relation sub-type, and 'stand → stand up', which corresponds to the 'causation' entailment relation sub-type.

Entailment Rule    | Entailment Relation
------------------ | -------------------
abuse → harm       | Troponymy
examine → study    | Synonymy
acquire → own      | Cause-Effect
carry → transport  | Troponymy
ruin → destroy     | Synonymy
identify → examine | Presupposition
abuse → harass     | Troponymy
begin → start      | Synonymy
eliminate → defeat | Troponymy
identify → detect  | Presupposition

Table 5.7: 10 randomly selected correct entailment rules learned by our model but missed by TDS.

5.2 Evaluation within the Automatic Content Extraction Task

A natural way to evaluate entailment resources is to test their utility in a real-world NLP application that aims to solve an end task, such as an Information Extraction (IE) system. We tested our verb entailment model and its learned entailment rules by using them as a simple IE system. In our experiments we utilized the ACE 2005 event detection task, a standard IE benchmark, in order to compare our verb entailment model against state-of-the-art baselines. We will next describe and demonstrate how our proposed model can be utilized to improve performance on the ACE event detection task. We note that the results are not claimed to be in any way optimal, since our emphasis is on using the ACE task as a comparative tool, rather than on building an effective and complete IE system that solves the task.

The rest of the section is organized as follows: in 5.2.1 we describe how we utilized the ACE dataset for evaluating verb entailment rules, and in 5.2.2 how we built a verb entailment resource for the ACE task. Then, we describe the evaluation metrics and results of comparing our model against two types of baselines: the first type is rank-based and the second type is decision-rule-based. In total, we compare our model against six resources and present the results, along with a detailed analysis, in 5.2.3.

5.2.1 Experimental Setting

In the ACE event detection task, when given a set of sentences, an IE system needs to identify mentions of certain events in each sentence. The task utilizes the ACE 2005 annotated event corpus as its training set (http://www.itl.nist.gov/iad/mig//tests/ace/ace05/resource/). This corpus contains 15,724 sentences, each annotated with one or more events out of 33 possible event types such as "Attack", "Divorce" and "Start-Position". For instance, the sentence "They have been married for three years." is annotated as mentioning the event "Marry".

In order to utilize the ACE dataset for evaluating verb entailment rules, we generally follow the evaluation methodology described in Szpektor and Dagan (2008), where the authors worked with 26 event types (after filtering out events that ambiguously correspond to multiple distinct verbs) and represented each ACE event type with a few typical verbs, termed seeds, that were selected from the textual description of the event in the ACE guidelines (http://projects.ldc.upenn.edu/ace/docs/English-Events-Guidelines_v5.4.3.pdf). For instance, for the event "Start-Position" the seed verbs are "start", "found" and "launch". We start our evaluation by learning, for each seed, lexical entailment rules of the form 'candidate → seed'. We then use these rules to retrieve all sentences containing either seed or candidate (this is referred to as a match), and label the retrieved sentences with the event mapped to the seed. For example, if our model learns the entailment rule 'maim → injure' for the seed 'injure', which is in turn mapped to the 'Injure' event, then all sentences containing the candidate verb 'maim' will also be labelled as 'Injure' event mentions. We then compute various evaluation metrics on these matches and compare our resource's performance to other state-of-the-art baselines. We next elaborate on how we constructed a verb entailment resource for the ACE task.

5.2.2 Building a Verb Entailment Resource for ACE

The resource is constructed based on the UkWac corpus as follows:

(a) Construct a list of verb pairs of the form (candidate, seed) for each seed.
(b) Represent each (candidate, seed) pair as a vector of features, as described in Chapter 4.1.
(c) Train a classifier on an augmented labelled dataset.
(d) Retrieve a ranked list of rules of the form 'candidate → seed' according to the verb pair's classification score.

(a) Constructing a list of candidate verb pairs  The candidates for each seed s are verbs that co-occur with s in at least 5 sentences in the UkWac corpus (we filtered out the verbs 'be', 'do', 'will', 'take', 'give' and 'get'). This corpus was constructed by extracting text from crawled HTML pages, which meant that we encountered many non-words in the corpus, such as banner advertisements and HTML code. To filter this noise, the candidate verbs were required to also appear in WordNet as non-auxiliary verbs. Last, since the number of candidates was very large (~70,000), we filtered out verb pairs with negative 'PMI sentences' scores, so as to remain with a smaller number of candidates that are semantically related and have meaningful sentence-level co-occurrences. In total, we now have 40,285 semantically related verb pairs of the form (candidate, seed).

(b) Representing verb pairs as feature vectors  Each (candidate, seed) pair is converted into a feature vector that includes our manually engineered features. As before, we collect the feature values from the UkWac corpus and perform normalizations as described in Section 4.1.

(c) Training a classifier on an augmented labelled dataset  Since this task deals with the classification of over 40,000 verb pairs, we wanted to expand our training set such that our model would be trained on a representative and sufficiently large number of examples. We thus augmented the original 812-pair list (Chapter 5.1.1) with positive and negative examples for entailment, retrieved from WordNet (Fellbaum, 1998). A pair of verbs is considered a positive example for entailment if the verbs appear under the following WordNet relations: Entailment, Cause, Hyponymy and Synonymy. A pair of verbs is considered a negative example for entailment if the verbs appear as Antonyms (e.g., buy and sell) or Co-hyponyms, i.e., words that share a common Hypernym; e.g., "cruise" and "raft" are co-hyponyms since they share the common hypernym "transport". In total, we had 3174 verb pairs, which were manually annotated for entailment in the manner described in 5.1.1. We then used the annotated verb pairs as a training set for SVM-perf's (Joachims, 2005) linear SVM implementation with default parameters, maximizing Precision.

(d) Ranking the verb pairs  The classifier gives a score representing the verb pair's likelihood of belonging to the entailing class. To illustrate, suppose the verb pair (remand, arrest) was given a high score by our classifier. This means that the classifier is relatively certain that remand entails arrest. We utilize these classification scores in order to rank the list of entailing candidates for each event.

5.2.3 Evaluation Metrics and Results

After building a verb entailment resource for the ACE task, our goal now is to evaluate the resource's performance in comparison to other state-of-the-art resources. We will use two types of resources as baselines: the first is rank-based, e.g., a classifier built with distributional similarity features, where the pairs are ranked according to the appropriate classification scores. The second type is a decision-rule-based resource, such as VerbOcean, where the pairs are given a binary score according to the following decision rule: '1' if they appear under certain VerbOcean patterns, '0' otherwise. To evaluate the rank-based type, we will create resources at different cut-off points of the ranked list of verb pairs. This method is not suitable for evaluating the decision-rule-based resources, since they are inherently sparse and have a binary scoring method which does not create a natural ranked list. Thus, we will evaluate these sparse baselines by aggregating scores for all 40,275 proposed verb pairs and computing the overall MAP measure.

Evaluation of Rank-based Baselines  Following Szpektor and Dagan (2008), we test each learned ranked resource by taking the top K entailment rules for each seed verb, where K ranges from 0 to 100. This parameter controls the number of rules that can be used. For example, if K = 20 then for every seed s we will consider only the 20 rules with the highest score that expand s. Naturally, when K is small the recall will decrease, but precision will increase. For each such resource we perform the following: we first search for all sentences containing either the seed verbs or the candidate expansion verbs of an event e. We mark each retrieved sentence as correct if e has been annotated as having an event mention in said sentence. We next compute the following standard evaluation metrics:

\[
Precision(e) = \frac{\#\text{correct matches of } e\text{'s verbs}}{\#\text{total matches of } e\text{'s verbs}}
\]

\[
Recall(e) = \frac{\#\text{correct matches of } e\text{'s verbs}}{\#\text{total annotations of } e}
\]

\[
F1(e) = \frac{2 \cdot Precision(e) \cdot Recall(e)}{Precision(e) + Recall(e)}
\]

where e's verbs are either the seed verbs associated with it or the entailing candidate expansions. To aggregate the score, we computed micro-averaged Recall, Precision and F1 over all event types.
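A compact sketch of this per-event scoring; the data structures are assumptions for illustration (`matches[e]` holding the retrieved sentence ids and `gold[e]` the sentence ids annotated with event e):

```python
def event_scores(matches, gold):
    """matches: dict event -> set of retrieved sentence ids;
    gold: dict event -> set of sentence ids annotated with that event."""
    scores = {}
    for e, retrieved in matches.items():
        correct = len(retrieved & gold.get(e, set()))
        precision = correct / len(retrieved) if retrieved else 0.0
        recall = correct / len(gold[e]) if gold.get(e) else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        scores[e] = (precision, recall, f1)
    return scores
```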

Evaluation Results  We compare our model to other rank-based resources, with a randomly created resource as a lower bound:

(a) DistSim: Includes the three distributional similarity features (Lin, Weeds and Balanced Inclusion) used in our supervised model. This baseline extends Berant et al. (2012), who trained an entailment classifier over several distributional similarity features. This baseline provides an evaluation of the discriminative power of distributional similarity alone, without co-occurrence features.

(b) DistSim+VO: Includes the three distributional similarity features and the two VerbOcean features in our supervised model. This baseline is inspired by Mirkin et al. (2006), who combined distributional similarity features and Hearst patterns (Hearst, 1992) for learning entailment between nouns.

(c) Random: A simple decision rule: for any pair (candidate, seed), randomly classify the pair as entailing with a probability equal to the ratio of entailing verb pairs out of all verb pairs in the labeled training set (i.e., 1379/3174 = 0.434).

(d) All: Our full-blown model, including all features described in Chapter 4.1.

[Figure 5.1: Evaluated rank-based methods and their Micro-average F1 as a function of top K rules]

Figure 5.1 presents micro-averaged Precision, Recall and F1 of the evaluated methods for different cutoff points. We first notice that all resources provide rules that are less precise than using the seeds without expansions (K = 0). A possible reason for this phenomenon is that many of the surface verb seeds are ambiguous, having multiple meanings. For instance, the seed verb 'release' can be used in many contexts, yet in ACE it is defined to be used only in the sense of releasing a criminal from detention. A possible solution to the ambiguity problem is to only apply entailment rules in specific contexts (as exemplified in Szpektor et al. (2008)), but this extension falls outside the scope of our work. Thus, since we learn entailment rules without a specific context and meaning, a rule which is correct for a certain meaning of the seed but incorrect for the specific event-mapped meaning has the potential of incurring many false matches, resulting in a decreased precision rate. For example, the rule 'unleash → release' is a valid rule in most contexts of the seed verb 'release', but not in the specific context of the event type 'Release-Parole' in ACE. For a detailed analysis of this phenomenon, we refer the reader to our Error Analysis section (5.2.4).

The graphs in Figure 5.1 show that using just the distributional similarity measures of Lin, Weeds and Balanced Inclusion increases recall, but dramatically decreases precision already at the cutoff point of K = 20. As a result, F1 only decreases for this method. Adding the VerbOcean features slightly increases the performance of the DistSim model, and shows the complementary nature of the distributional similarity and pattern-based approaches, as discussed in previous works (Mirkin et al., 2006; Hagiwara et al., 2009). But it is our verb entailment model that achieves the best performance, with a statistically significant improvement in terms of both Precision and F1 at the 0.01 level compared to all baselines, except DistSim+VO, where we achieve statistical significance at the 0.05 level. Our model yields a more accurate resource, as evident from the much slower precision descent rate compared to the other baselines. Our model's recall increases moderately compared to the other baselines, but it is still substantially better than not using any rules, with no statistically significant difference compared to the other baselines, apart from the DistSim baseline, where the difference is at the level of 0.1.

In order to further investigate the contribution of features from various verb co-occurrence levels, we compared the results of three rank-based resources, each comprising features from a distinct level of co-occurrence. Figure 5.2 presents the results of taking the top K rules for each seed verb, where K ranges from 0 to 100.

[Figure 5.2: Co-occurrence Level classifiers and their Micro-average F1 as a function of top K rules]

Looking at the graphs, we notice that as the level of co-occurrence gets broader, from the sentence level through the document level and ending at the corpus level, the resulting resource is more broad-scale but less precise, showing an increase in recall and a decrease in precision. This is in accordance with our assertions about the nature of verb co-occurrence at different levels (see Chapter 3.1). The sentence-level resource is the best performing resource in terms of F1, and it also contains the vast majority (66%) of our novel features. This affirms the importance of our linguistic features in capturing valuable positive and negative information of entailment that exists when two verbs co-occur within the same sentence.

Evaluation of Decision Rule-based Baselines  We next compare our model against the more sparse resources, which were either constructed by using a strict decision rule, e.g., the verb pair appears with certain VerbOcean patterns, or by using an existing lexical resource such as WordNet. As mentioned, due to the inherent sparsity of these decision-rule-based resources and their binary scoring method, we cannot use the cutoff method (see Table 5.8 for a comparison of the average sparseness of these methods). Instead, we will take into account all 40,275 proposed verb pairs and compute the Mean Average Precision (MAP) measure to compare the quality of the entailment rule bases.

Resource  | Avg. # of Positive Rules | Avg. # Positive / Avg. # Candidate Rules
--------- | ------------------------ | ----------------------------------------
All Model | 80                       | 0.0893
VO-KB     | 3                        | 0.0033
VO-Ukwac  | 9                        | 0.0089
WordNet   | 23                       | 0.0256

Table 5.8: Average number of positive verb pairs per seed and their frequency, compared to the average total of 875 candidate rules per seed.

MAP is a common evaluation metric when comparing several ranked lists, which computes the average of the Average Precision (AP) scores of each list. Here, each event has a list of rules ranked by the resource, and we compute the event's AP by averaging the precision values when truncating the ranked list after each correct match, i.e., after we found a sentence containing a verb associated with event e and that sentence was indeed annotated as mentioning event e.

Evaluation Results  We compare our model against the following sparse resources:

(a) VO-KB: a pair (candidate, seed) is scored 1 if it appears under the "stronger-than" relation in the original VerbOcean knowledge base, as constructed by Chklovski and Pantel (2004), and 0 otherwise.

(b) VO-Ukwac: a pair (candidate, seed) is scored 1 if it appears under certain VerbOcean patterns in the UkWac corpus, as described in 4.1.2, and 0 otherwise.

(c) WordNet: a pair (candidate, seed) is scored 1 if it appears under certain WordNet relations, as described in 5.1.1, and 0 otherwise.

Table 5.9 presents the results of comparing the ranks of all 40,275 verb pairs by computing the MAP measure. Looking at the table, our model outperforms the VerbOcean baselines with a statistically significant improvement at the 0.01 level, and the WordNet baseline at the 0.05 level. It is also interesting to note the improvement of the VO-KB resource over the VO-Ukwac resource, which demonstrates that using a large amount of web queries yields more robust results than using a limited-size corpus.
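A minimal sketch of the AP/MAP computation described above, where each event contributes one ranked list of matches marked correct or incorrect:

```python
def average_precision(is_correct):
    """is_correct: list of booleans for a ranked list of matches (best first)."""
    hits, precisions = 0, []
    for rank, correct in enumerate(is_correct, start=1):
        if correct:
            hits += 1
            precisions.append(hits / rank)   # precision at each correct match
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(ranked_lists):
    """ranked_lists: list of per-event ranked match lists (booleans)."""
    aps = [average_precision(lst) for lst in ranked_lists]
    return sum(aps) / len(aps) if aps else 0.0
```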

Resource  | MAP
--------- | -----
All Model | 0.139
VO-KB     | 0.117
VO-Ukwac  | 0.103
WordNet   | 0.124

Table 5.9: MAP results of our model and the decision-rule-based resources.

To conclude, our ACE experiments demonstrate that our model creates an accurate and broader-coverage resource, compared to state-of-the-art models and knowledge bases. We also ascertain the importance of using features at different levels of co-occurrence and show marked improvement as a result of using them. However, our system, like all other evaluated systems, suffers from incorrect event matches. In the next section we examine these incorrect event matches and highlight the reasons behind them.

5.2.4 Error Analysis

In order to perform error analysis, we randomly sampled from the corpus 100 false-positive examples, i.e., sentences that according to our model contained a mention of an event e, but which the ACE annotators did not annotate as containing a mention of event e. The investigation of the sampled examples assists us in identifying reasons for decision errors, whose distribution is presented in Table 5.10.

Naïve matching mechanism  Our evaluation framework involves a simple matching algorithm: first each sentence is tokenized, then each token is lemmatized according to a few manually constructed rules. Once we have these lemmatized tokens, we look for an appearance of the seed or candidate verb.
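A sketch of this naive lemma-matching step; the lemmatizer here is a trivial placeholder for the handful of manual rules actually used, and the data structures are assumptions for illustration:

```python
def naive_lemmatize(token):
    # Placeholder for the manually constructed lemmatization rules.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def match_events(sentence, event_verbs):
    """event_verbs: dict event -> set of seed and candidate verb lemmas.
    Returns the events whose verbs appear (by lemma) in the sentence."""
    lemmas = {naive_lemmatize(tok.lower()) for tok in sentence.split()}
    return {event for event, verbs in event_verbs.items() if lemmas & verbs}
```

Note that such matching has no part-of-speech information, so the noun "convicts" is lemmatized to "convict" and matched as if it were the verb, which is exactly the kind of error analyzed below.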

Type of Error                           | Mentions
--------------------------------------- | --------
Naïve matching mechanism                | 46
Incorrect rule learned                  | 32
Context-related expanding verb          | 13
Ambiguous seed                          | 6
Valid entailing verb, not substitutable | 3

Table 5.10: Error types and their frequency from a random sample of 100 false matches.

This requires little pre-processing, at the cost of many wrong matches, since we lack part-of-speech annotations: 44% of the false-positive matches were due to the fact that the verb lemmas were matched to either nouns or adjectives. In addition, some errors were caused by the lemmatizer, such as the "bite - bit" error presented in Table 5.11, where the noun 'bit' was mistaken for a past tense occurrence of the verb 'bite'.

Event     | Verb      | Sentence Fragment
--------- | --------- | -----------------
Convict   | convict   | those favoring the rule say it gives convicts incentive to behave..
Injure    | bite      | a little bit right after the war started..
Start-Org | institute | from Russian institutes they visited and built..
End-Org   | end       | had returned to their jobs since the end of the war..

Table 5.11: Example errors due to the naïve matching mechanism.

Context-related expanding verb  The errors here are due to verbs that are invalid as a direct substitute for the seed, but which do share a semantic relatedness with the seed and the contexts in which it appears; e.g., the expanding verb "accuse" can appear in contexts of a 'Charge-Indict' event, because usually when a person is charged with a crime, he is accused of committing the crime, but this does not strictly comply with the Lexical Entailment definition. Other examples include 'arrest → convict', 'kill → attack' and 'sue → jail'.

Seed ambiguity  Some ACE seeds are ambiguous to begin with. Thus, verbs that would lead to correct event detection for at least one of the seed's senses sometimes lead to wrong matches. For instance, the seed 'appeal' can either mean 'take a court case to a higher court for review', or it can mean 'request earnestly'.

If we look at the second sense, then 'plead' is a correct expansion of 'appeal', but the ACE task deals only with the first sense. Table 5.12 presents examples of this error type.

Event          | Entailment rule       | Sentence Fragment
-------------- | --------------------- | -----------------
Release-Parole | unleash → release     | President Bush, who unleashed hell acting on faulty intelligence..
Appeal         | plead → appeal        | the Canadian province of British Columbia pleaded no contest to driving drunk..
Execute        | carry out → execute   | a huge step forward in carrying out the US-backed Middle East policy..

Table 5.12: Example errors due to seed ambiguity.

Invalid expanding words the expanding words do not have an evident connection to the seed’s contexts, such as ‘hit → arrest’, ‘follow → arrest’. Valid entailing verbs, not substitutable These errors are due to the sometimes non-overlapping definition of Lexical Entailment and the ACE event detection task. A system is required to identify mentions of an event that are evident in the sentence, with the help of matching certain content words. While in Entailment, we are interested in the events that can be inferred from certain content words. For example, the learned rule ‘divorce → marry’ follows the presupposition sub-type of entailment, since if a person has divorced his spouse, he must have been first married to her. But for the ACE sentence “The Welches disclosed their plans to divorce a year ago” the event of ’Marry’ was not mentioned in the sentence, although one can infer that it did occur due to the appearance of the verb “divorce”. To conclude, out error analysis shows that: (a) Almost half of the errors were caused by the naïve matching procedure. (b) Correct rules with entailing verbs, applied in inappropriate contexts and contextrelated rules, yield about fourth of the errors. 58

(c) Only 32% of errors originate from using invalid expansion rules. Thus, the potential of our learned entailment rule resource is much higher than shown by the evaluation results. For instance, if we were to use a more sophisticated matching algorithm or add a disambiguation module to handle context and seed ambiguity, we could avoid many of these errors and increase performance.
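To illustrate the kind of more sophisticated matching mentioned above, the following is a minimal sketch (in Python, using spaCy as one possible tagger) of a matcher that fires a rule only when the matched token is actually used as a verb; this alone would filter many of the noun/adjective and lemmatizer confusions shown in Table 5.11. The pipeline name and the rule format are illustrative assumptions, not the matching procedure used in our evaluation.

import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with POS tags and lemmas

def rule_matches(sentence, rule_lhs):
    # Fire the rule only on verbal occurrences of rule_lhs.
    # A naive lemma/string match would also fire on nominal uses ("convicts",
    # "the end of the war") or on lemmatizer errors ("bit" -> "bite").
    doc = nlp(sentence)
    return any(tok.pos_ == "VERB" and tok.lemma_ == rule_lhs for tok in doc)

# Expected to be False with a typical tagger: 'end' is used as a noun here.
print(rule_matches("They had returned to their jobs since the end of the war.", "end"))
# Expected to be True: 'end' is used as a verb here.
print(rule_matches("The organization ended its operations in May.", "end"))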


Chapter 6

Conclusions and Future Work

6.1 Conclusions

The main contribution of this thesis is the design of novel linguistically motivated cues that capture both the positive and the negative entailment information embedded when two verbs co-occur at the sentence, document and corpus levels. Our model incorporates these novel cues as features, together with useful features adapted from prior work, to combine co-occurrence and distributional similarity information about verb pairs and create an integrative framework for the detection of verb entailment.

An experiment over a manually labeled dataset showed that our model outperforms state-of-the-art algorithms with a statistically significant improvement. Further feature analysis indicated that features at all levels (sentence, document and corpus) contain valuable information for verb entailment learning and detection, and thus should be combined. Furthermore, many of our novel features were amongst the most highly correlated features, showing that our intuitions when devising a rich set of verb-specific and linguistically-motivated features were correct.

In a second experiment, we performed application-oriented evaluations of our novel verb entailment model in an applied NLP setting and demonstrated that our model creates an accurate and broad-coverage resource compared to state-of-the-art models and knowledge bases. We also ascertained the importance of using features at different levels of co-occurrence and showed a marked improvement as a result of using them.


To conclude, this work demonstrates that by exploiting the linguistic properties of verb co-occurrence, one can achieve more precise and scalable methods for verb entailment rule learning and detection. We believe that this opens the door for further integration of linguistic insights into the applied field of NLP.

6.2 Future Work

Our study highlighted the potential of integrating linguistically-motivated cues at different levels of co-occurrence. Yet, there are several extensions we have in mind for future work in this direction.

Fine-grained Classification As explained in Chapter 2.4.1, entailment is a semantic relation comprising several relation sub-types, such as troponymy, temporal inclusion and backward presupposition. We hypothesize that these sub-types manifest themselves differently in corpora and that by training a different classifier for each sub-type, we will add discriminative power which will, in turn, improve the entailment model. We believe that since our features were designed with this entailment hierarchy in mind, they correspond to different sub-types of entailment, and a natural research direction would be to train a multi-class classifier that classifies every pair of verbs (v1, v2) into one of the sub-types of entailment. We can then deduce whether 'v1 → v2' based on these scores. Similarly, we can change our annotation guidelines to be specific for every sub-type of entailment and then aggregate the fine-grained annotations to see if we obtain a more cohesive annotated dataset. A first attempt at this approach can be seen in Appendix B.

Utilizing Typed Distributional Similarity As discussed in Chapter 3.3, we can change the traditional representation of verbs in distributional similarity approaches and represent them as vectors in different feature spaces, with each vector space corresponding to a different argument or modifier, thereby improving the discriminative power of the feature space. We can, in a similar way, represent nouns as vectors in different feature spaces, for instance to compare a noun's adjectives. An interesting research question is whether this new typed representation can improve the performance of a model which learns entailment rules between nouns; a minimal sketch of such a per-slot representation is given at the end of this section.

Extending Our Model to Predicates Predicates are grammatical constructions which are an extension of verbs. Simply put, a predicate is the main verb together with any auxiliaries that accompany it in a given sentence. In many predicate structures the main verb is a light verb such as 'take', e.g., the predicate 'take a picture' in "Shiri took a picture of Yoav", or an auxiliary verb such as 'is', e.g., the predicate 'is born in' in "Sheila was born in Hampstead". It would be interesting to see if our model boosts performance on the task of predicative entailment, which is a more general notion of verb entailment and has the potential to directly benefit Open Information Extraction systems (see Fader et al. (2011)).

Adding a Paragraph Co-occurrence Level As exemplified in our ablation tests in Chapters 5.1.3 and 5.2.3, the use of different levels of verb co-occurrence is imperative for detecting the entailment relation between verb pairs. A natural extension to this approach is to utilize yet another co-occurrence level: the paragraph level. This level is interesting since it presents a broader scope than the sentence level (thus increasing recall), while, since a paragraph can be thought of as a single discourse unit, it presents a tighter and more semantically related scope than the document level (as shown in Pekar (2008)).

Utilizing RTE as an Additional Evaluation Setting An RTE system has to judge whether a given text segment entails a hypothesis sentence. One can utilize this framework to test our generated rules by providing them as an additional knowledge resource to an entailment system, for instance the system developed at the Bar-Ilan NLP lab (Stern and Dagan, 2011). The evaluation would compare our classifier's entailment rule-set with other state-of-the-art rule-sets, to measure the added value of our classifier in this framework. This task is broader in scale than ACE, but since it involves many components and knowledge resources, it is relatively harder to empirically ascertain whether the rule-set contributes to the performance of the RTE system. Still, it would be beneficial to see the effect of using our model in such a broad-scale and important task.
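To make the typed representation concrete, the following is a minimal sketch (in Python) of the per-slot idea discussed above under 'Utilizing Typed Distributional Similarity': each verb is represented by a separate count vector per argument slot, and per-slot similarities are then combined. The toy counts, the slot inventory and the simple average of cosines are illustrative assumptions, not the implementation used in this thesis.

from collections import Counter
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two sparse count vectors (Counters).
    num = sum(a[k] * b[k] for k in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# slot -> verb -> count vector of argument fillers (toy counts, for illustration)
typed_space = {
    "subj": {
        "snore": Counter({"man": 5, "grandfather": 2}),
        "sleep": Counter({"man": 7, "baby": 4, "grandfather": 1}),
    },
    "obj": {
        "snore": Counter(),   # intransitive: no direct objects observed
        "sleep": Counter(),
    },
}

def typed_similarity(v1, v2, space):
    # Average the similarities computed separately in each slot-specific space,
    # ignoring slots in which neither verb was observed.
    sims = []
    for slot, verbs in space.items():
        a, b = verbs.get(v1, Counter()), verbs.get(v2, Counter())
        if a or b:
            sims.append(cosine(a, b))
    return sum(sims) / len(sims) if sims else 0.0

print(typed_similarity("snore", "sleep", typed_space))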


Bibliography

1. Shuya Abe, Kentaro Inui, and Yuji Matsumoto. Acquiring event relation knowledge by learning co-occurrence patterns and fertilizing co-occurrence samples with verbal nouns. In Proceedings of the 3rd International Joint Conference on Natural Language Processing, 2008.
2. Nicholas Asher and Alex Lascarides. Logics of Conversation. Cambridge University Press, 2003.
3. Timothy Baldwin and Aline Villavicencio. Extracting the unextractable: A case study on verb-particles. In Proceedings of the International Conference on Computational Linguistics, 2002.
4. Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009.
5. Jonathan Berant, Ido Dagan, and Jacob Goldberger. Learning entailment relations by global graph structure optimization. Computational Linguistics, 38(1):73–111, 2012.
6. Hardarik Bluhdorn. Subordination and Coordination in Syntax, Semantics and Discourse: Evidence from the Study of Connectives. John Benjamins Publishing Company, 2008.
7. Joan L. Bybee and Suzanne Fleischman. Modality in Grammar and Discourse. John Benjamins Publishing Company, 1995.


8. Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative event chains. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2008.
9. Timothy Chklovski and Patrick Pantel. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2004.
10. Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
11. Ido Dagan and Oren Glickman. Probabilistic textual entailment: Generic applied modeling of language variability. In Proceedings of the Workshop on Learning Methods for Text Understanding and Mining (PASCAL), 2004.
12. Dekang Lin and Patrick Pantel. DIRT - Discovery of inference rules from text. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2001.
13. David R. Dowty. Word Meaning and Montague Grammar. Springer, 1979.
14. Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011.
15. Christiane Fellbaum. English verbs as a semantic net. International Journal of Lexicography, 3(4):278–301, 1990.
16. Christiane Fellbaum. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, 1998.
17. Maayan Geffet and Ido Dagan. The distributional inclusion hypotheses and lexical entailment. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2005.
18. Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225, 1995.

19. Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
20. Masato Hagiwara, Yasuhiro Ogawa, and Katsuhiko Toyama. Supervised synonym acquisition using distributional features and syntactic patterns. Information and Media Technologies, 4(2):558–582, 2009.
21. Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.
22. Marti Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, 1992.
23. Jerry Hobbs. Coherence and coreference. Cognitive Science, 3:67–90, 1979.
24. Jerry R. Hobbs. Literature and Cognition. CSLI Publications, 1990.
25. Eduard Hovy and Elisabeth Maier. Organizing Discourse Structure Relations using Metafunctions. Pinter Publishing, 1993.
26. Ray Jackendoff. Semantics and Cognition. The MIT Press, 1983.
27. T. Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
28. Karin Kipper, Hoa Trang Dang, and Martha Palmer. Class-based construction of a verb lexicon. In Proceedings of the National Conference on Artificial Intelligence, 2000.
29. Alistair Knott and Robert Dale. Using linguistic phenomena to motivate a set of coherence relations. Discourse Processes, 18(1):35–62, 1994.
30. Alistair Knott and Robert Dale. Choosing a set of coherence relations for text generation: A data-driven approach. Trends in Natural Language Generation: An Artificial Intelligence Perspective, 1:47–67, 1996.
31. Alistair Knott and Ted Sanders. The classification of coherence relations and their linguistic markers. Journal of Pragmatics, 30(2):135–175, 1998.

32. Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389, 2010.
33. Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
34. Mirella Lapata and Frank Keller. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1):3–34, 2005.
35. Mirella Lapata and Alex Lascarides. Inferring sentence-internal temporal relations. In Proceedings of the North American Chapter of the ACL: Human Language Technologies, 2004.
36. Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, 1993.
37. Dekang Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, 1998.
38. Dekang Lin, Shaojun Zhao, Lijuan Qin, and Ming Zhou. Identifying synonyms among distributionally similar words. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
39. William Mann and Sandra Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.
40. Daniel Marcu and Abdessamad Echihabi. An unsupervised approach to recognizing discourse relations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2002.
41. Shachar Mirkin, Ido Dagan, and Maayan Geffet. Integrating pattern-based and distributional similarity methods for lexical entailment acquisition. In Proceedings of the International Conference on Computational Linguistics, 2006.


42. Shachar Mirkin, Ido Dagan, and Eyal Shnarch. Evaluating the inferential utility of lexical-semantic resources. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 2009.
43. Joakim Nivre, Johan Hall, and Jens Nilsson. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the International Conference on Language Resources and Evaluation, 2006.
44. Viktor Pekar. Discovery of event entailment knowledge from text corpora. Computer Speech and Language, 22(1):1–16, 2008.
45. Marco Pennacchiotti and Patrick Pantel. Entity extraction via ensemble semantics. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2009.
46. Alain Rakotomamonjy. Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3:1357–1370, 2003.
47. Dietmar Rosner and Manfred Stede. Customizing RST for the automatic production of technical manuals. Aspects of Automated Natural Language Generation, 1:199–214, 1992.
48. Michael Rundell and Gwyneth Fox. Macmillan Essential Dictionary for Learners of English. Macmillan Education, 2003.
49. R. Snow, D. Jurafsky, and A. Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In Proceedings of the 17th Conference on Advances in Neural Information Processing Systems, 2005.
50. Asher Stern and Ido Dagan. A confidence model for syntactically-motivated entailment proofs. In Proceedings of the Conference on Recent Advances in Natural Language Processing, 2011.
51. Idan Szpektor and Ido Dagan. Learning entailment rules for unary templates. In Proceedings of the International Conference on Computational Linguistics, 2008.

52. Idan Szpektor, Hristo Tanev, and Ido Dagan. Scaling web-based acquisition of entailment relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004.
53. Idan Szpektor, Eyal Shnarch, and Ido Dagan. Instance-based evaluation of entailment rule acquisition. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 2007.
54. Idan Szpektor, Ido Dagan, Roy Bar-Haim, and Jacob Goldberger. Contextual preferences. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2008.
55. Galina Tremper and Anette Frank. Extending semantic relation classification to presupposition relations between verbs. In Proceedings of the NP Syntax and Information Structure Workshop, 2011.
56. Julie Weeds and David Weir. A general framework for distributional similarity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.
57. Julie Weeds, David Weir, and Diana McCarthy. Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
58. Thomas A. Werner. Deducing the Future and Distinguishing the Past: Temporal Interpretation in Modal Sentences in English. Rutgers University, 2003.
59. Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
60. Fabio Massimo Zanzotto, Marco Pennacchiotti, and Maria Teresa Pazienza. Discovering asymmetric entailment relations between verbs using selectional preferences. In Proceedings of the 23rd International Conference on Computational Linguistics, 2006.


Appendix A

VerbOcean Web Patterns Experiment

As mentioned in Chapter 4.1.2, we performed a preliminary study of the VerbOcean patterns (Chklovski and Pantel, 2004) in the context of Entailment. Our goal was to find out whether these lexical-syntactic patterns can be used to detect lexical entailment between verbs. To that end, we ran an experiment using the Web, which we describe next.

A.1 Experimental Setting

1. We chose 50 commonly used verbs as seeds.

2. For each of the 50 verbs we took the top 20 most similar verbs according to the Lin similarity measure. This resulted in 1000 verb pairs with some semantic association between them.

3. We manually annotated the 1000 pairs in a four-way manner (v1 → v2, v2 → v1, v1 ↔ v2, non-entailment). This resulted in 1952 ordered verb pairs, annotated in a binary way (after removing duplicates).

4. For each (v1, v2) pair and each of the 33 VerbOcean patterns, we ran Bing queries of the form "v1 pattern v2". For each pattern p, we computed Recall, Precision and F1 measures to check whether pattern p is a good indicator of verb entailment (i.e., whether verb pairs that get a high count when appearing with p are more likely to entail than not to entail).¹

We then represented each verb pair (v1, v2) as a vector of features. The features are:

1. For each pattern, a binary feature denoting whether the pair got a zero or non-zero count with the pattern;

2. For each pattern, the actual count that the pair got with the pattern;

3. For each pattern, the log(count + 1) of the count that the pair got with the pattern;

4. For each pattern, the normalized count.

This resulted in 264 features (132 for each verb pair ordering: 33 patterns × 4 feature types).
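As a rough illustration of this feature construction, the sketch below (in Python) derives the four feature types from a list of per-pattern web counts for one ordering of a verb pair. The counts are assumed to be given (e.g., returned by some search-API wrapper), and the normalization shown (dividing by the pair's total count over all patterns) is an illustrative choice rather than the exact normalization used in the experiment.

from math import log

def pattern_features(counts):
    # counts[i]: web hit count of the query "v1 pattern_i v2" for one pair ordering.
    # For each of the 33 patterns we emit 4 features, i.e. 132 features per ordering
    # and 264 for both orderings of the pair.
    total = sum(counts)
    feats = []
    for c in counts:
        feats.append(1.0 if c > 0 else 0.0)        # binary: zero vs. non-zero count
        feats.append(float(c))                     # raw count
        feats.append(log(c + 1))                   # log(count + 1)
        feats.append(c / total if total else 0.0)  # count normalized over all patterns
    return feats

toy_counts = [0] * 30 + [12, 3, 1]     # hypothetical counts for 33 patterns
assert len(pattern_features(toy_counts)) == 33 * 4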

A.2 Results

We compared the Recall-Precision scores of the patterns against that of the baseline (which outputs 'yes' on all candidate verb pairs given by the Lin similarity measure), whose precision is 14.4, recall is 100 and F1 is 25.2. We checked where each curve began to look similar to the baseline (i.e., where the precision was close to 14.4): if the recall at that point was low, we concluded that the pattern was akin to a random pattern and thus does not give a meaningful signal of entailment. For patterns that are supposed to be a negative signal, the Recall-Precision curve was not suitable for comparison. Instead, we plotted the ROC curve and computed the AUC (area under the ROC curve); in this case, we looked for patterns with a high AUC. The patterns that look promising are:

1. Positive signals for entailment: "Xed even Yed" (Y → X), "Xed and even Yed" (Y → X), "Xed or at least Yed" (X → Y), "Xed by Ying the" (X → Y).

2. Negative signals for entailment: "either Xing or Ying" (Y → X) (AUC 0.55), "Xed but Yed" (AUC 0.57 for both directions), "to X but Y" (Y → X) (AUC 0.569).
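The AUC screening of the negative-signal patterns can be reproduced schematically as follows (in Python, using scikit-learn); the labels and counts below are toy values, not the annotated pairs from this experiment.

from sklearn.metrics import roc_auc_score

# 1 = ordered pair annotated as entailing, 0 = non-entailing (toy labels)
labels = [1, 0, 0, 1, 0, 0, 1, 0]
# web counts of each pair with a candidate pattern such as "Xed but Yed"
counts = [0, 4, 7, 1, 3, 9, 0, 2]

# For a negative signal, high counts should go with NON-entailing pairs, so we
# measure how well the count ranks the negative class (equivalently, 1 - AUC
# with respect to the positive class).
auc_as_negative_signal = roc_auc_score([1 - y for y in labels], counts)
print(round(auc_as_negative_signal, 3))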

¹ For the Antonymy relation patterns, high scores should mean non-entailment.

Appendix B

Fine-grained Verb Entailment Annotation Guidelines

In this task, you are given an ordered pair of verbs (v1, v2) and you need to decide whether v1 entails v2. We adopt the textual entailment framework, where a text fragment T is said to entail another text fragment H if a person reading T will infer that H is most likely true. In our case this means that v1 entails v2 if you can think of reasonable contexts under which, for the given pair, one of the following statements is true:

1. v1 is a verb expressing a specific manner or elaboration of v2. The two verbs must occur at the same time. The test frame to use in this case is: to v1 is a way of v2-ing (v2 is in gerund form). For example, to march is a way of walking, to bake is a way of cooking, and to donate is a way of giving. In all these examples v1 is a particular way of performing v2 and both verbs occur at the same time.

2. v1 describes an event or state that occurs while v2 occurs and rarely occurs without v2. For example, for the pair (snore, sleep) we note that snoring almost always occurs while sleeping; that is, people will infer that if someone is snoring, then that someone is most likely sleeping. Other examples include (walk, step) and (ambush, wait). However, in (beep, ring) one could say that a beep is a part of a ring, but beep is more general than just a sound that occurs when ringing and can often appear outside the event of ring, so the relation does not hold for this pair.

3. v2 happens before v1 and must have occurred in order for v1 to occur (i.e., v1 presupposes v2). For example, win entails play, since the event of play happens before win and in order for the event of winning to occur, one must have played. Please note that win is an ambiguous verb and under some contexts this rule will not hold; see the final paragraph for a clarification on this matter. Other examples: if you employ someone, it presupposes that he was hired, and so employ entails hire. However, although you can say that if you detect something you must have noticed it first, the two events are not disjoint and cannot be temporally separated (it is not the case that the event of noticing ended and only then you detected the object), so the relation does not hold for this pair.

4. v1 is a causative verb of change (a verb that causes a change in state, position, etc.) and v2 is a non-causative verb that occurred because v1 occurred before it (i.e., the two events do not occur at the same time). For example, the event of buying something causes the buyer to own that something, and giving X to person Y causes person Y to have X.

5. v1 is a synonym of v2 (and vice versa), i.e., v1 and v2 express the same meaning and can be considered alternative ways of conveying the same information.

In all cases, ambiguity might arise since we look at the verbs out of context, and each verb can be used in different contexts and have several different meanings (senses). For example, the verb 'win' can be used as winning a war or winning a game. The latter use is frequent enough to infer that winning presupposes playing (since in order to win a game you must have first played the game). The rule is that v1 entails v2 if there are reasonable, non-anecdotal contexts in which one of the previously described cases holds.


Appendix C

Verb Syntax and Semantics

C.1 Auxiliary and Light Verbs

An auxiliary verb is a verb which adds functional or grammatical meaning to the clause in which it appears, e.g., to express tense, aspect or voice. Auxiliary verbs usually accompany a main verb, which provides the main semantic content of the clause. A light verb is a verb which carries little semantic content on its own and usually forms light-verb constructions (LVCs) with other verbs or nouns. In our work, when we look at the meaningful co-occurrence of verbs, we take into account only non-auxiliary and non-light verbs, and manually filter out, when needed, the auxiliary verbs 'be', 'have' and 'do', and the light verbs 'have', 'get', 'take' and 'give'.
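A minimal sketch of this filtering step (in Python) is given below; the lemma lists follow the text above, and the (lemma, POS tag) input format is an illustrative assumption about the parser output rather than the exact interface used in our pipeline.

AUXILIARY_VERBS = {"be", "have", "do"}
LIGHT_VERBS = {"have", "get", "take", "give"}
SKIP = AUXILIARY_VERBS | LIGHT_VERBS

def content_verbs(tagged_tokens):
    # tagged_tokens: iterable of (lemma, POS tag) pairs produced by a parser.
    # Keep only verbal tokens whose lemma is neither auxiliary nor light.
    return [lemma for lemma, pos in tagged_tokens
            if pos.startswith("VB") and lemma not in SKIP]

tokens = [("the", "DT"), ("suspect", "NN"), ("be", "VBD"), ("arrest", "VBN"),
          ("and", "CC"), ("give", "VBD"), ("a", "DT"), ("confession", "NN")]
print(content_verbs(tokens))   # ['arrest'] - 'be' and 'give' are filtered out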

C.2 Verb Syntax

As mentioned in Chapter 1, verbs are the primary vehicle for describing events and expressing relations between entities. Raising verbs are a special type of verb, atypical in that they do not express an event or a state and, as such, do not stand in a semantic relation with any arguments. Thus, the subject of the sentence is semantically related to the main verb of the subordinate clause, and not to the main verb of the main (superordinate) clause, as is usually the case. This effect is illustrated in the following sentences (with the raising verbs in bold):

1. Shlomi seems to be happy.
2. It appears that Noga left.
3. Sagi happens to be a trombone player.
4. Shmuel used to chew tobacco.

Similar to raising verbs, control verbs can take an infinitive clause complement, but unlike raising verbs, control verbs have semantic content; they semantically select their arguments, that is, their appearance strongly influences the nature of the arguments they take.

1. Sharon wants to fly to Japan next year.
2. Modi refused to join the strike.
3. Sara will try to finish her assignments by Friday.

Control verbs appear in contexts that look just like the contexts for raising verbs, i.e., they are the main verbs in the main clause, with an infinitive clause as a subordinate. However, a control verb can be viewed as expressing an intent about an activity or event denoted by its complement, whereas raising verbs cannot be viewed as linked in a meaningful way to their complements.

C.3 Levin Classes

As mentioned in Chapter 2.4.3, Levin's underlying assumption is that verb meaning and argument realization (alternations in the expression of verbal arguments such as subject, object and prepositional complement) are jointly linked. To illustrate, let us examine the following examples. The verbs "spray" and "load" can realize their arguments in two different ways (the locative alternation):

1. Gideon sprayed water on the plants.
2. Gideon sprayed the plants with water.

1. Dana loaded apples into the cart.
2. Dana loaded the cart with apples.

However, semantically related verbs such as "pour" and "dump" can realize their arguments in only one way:

1. Gideon poured water on the plants.
2. (UNGRAMMATICAL) Gideon poured the plants with water.

1. Dana dumped apples into the cart.
2. (UNGRAMMATICAL) Dana dumped the cart with apples.

Thus, we can group the first two verbs into one class and the last two verbs into another.

