Establishing relevant features for identification of multiword expressions

Begoña Villada Moirón, Alfa-Informatica, RuG
CLS Nijmegen – March 9, 2006


Overview
• What are multiword expressions?
• Rationale and motivation behind the identification task
  - Why do we care to identify these expressions automatically?
• Previous attempts to solve the identification problem
  - Hybrid approaches: linguistic insights + statistical methods
• A new proposal
  - Focus on three properties of mwes
  - Empirical measurements of these properties
  - Identification task = learning the concept mwe?
  - Results and evaluation
  - Discussion and future improvements


A first glance at multiword expressions

Structure        Examples
adj+noun         een gesloten boek, een dooie boel, de eerste burger, mobiele banditen
prep+noun        in zwang, op reis
prep+noun+prep   ter wille van, in ruil voor
noun+verb        gebruik maken, stand houden, rekening houden met
adj+verb         vast stellen, goed keuren, bloot stellen, los laten
pp+verb          ter sprake komen, ten val komen, door de bocht gaan, in het geweer komen
xp+pp+verb       als [eerlijk] te boek staan, iem. het hemd van zijn lijf vragen, iem. in het zadel helpen, vast in het zadel zitten, alsof de duivel iem. op de hielen zitten, iem. op de hoogte brengen/stellen van

Lawlessness in the mwes population
“In investigating the properties of lexical items one is looking at the inmates of a prison (the lexicon) who have in common only the fact of their lawlessness [Di Sciullo and Williams, 1987].”

• How are multiword expressions different from regular and productive phrases and expressions?
  - lexical level
  - morphology
  - syntax
  - semantics


Semantics
Mismatch between the literal sense of individual lexemes, the linguistic context and the discourse marks out the idiosyncratic expression and triggers the peculiar (idiomatic) interpretation.

(1) Maar schaatsen zit ons in het bloed, zegt Karlstad.
    ‘But skating is in our blood, says Karlstad.’

Meaning ranges from fully transparent to opaque (compositional to non-compositional).

(2) a. iets bij de hand hebben
       ‘to have something at hand’
    b. het paard achter de wagen spannen
       ‘to put/set the cart before the horse (to reverse the natural or proper order)’
    c. een loer draaien
       ‘play a nasty trick on someone’


‘Denoting’ and ‘non-denoting’ words:

(3) a. de laatste hand aan iets leggen
       ‘to put the finishing touches on something’
    b. een vergadering sluiten
       ‘to adjourn a meeting’

(4) a. de plaat poetsen
       ‘to depart unnoticed’
    b. in petto hebben/houden
       ‘to reserve’


Lexical level
Lexical fixedness results from the lexical co-occurrence of two or more lexemes. Selectional restrictions not (necessarily) of a semantic nature, rather arbitrary.

(5) a. de laatste hand aan iets leggen
    b. de laatste hand *op/aan iets *plaatsen/leggen
    c. *de hand aan iets leggen
    d. een vergadering sluiten/*dichtmaken

Sets of semantically related fixed expressions exist. A light verb is often involved.

(6) a. aan de gang zijn/blijven
    b. iets aan de gang houden/brengen


Morphology
Some nouns have singular/plural/diminutive variants. Within some multiword expressions only one form is plausible.

(7) a. Dat houdt de kosten binnen de perken/*perk
       ‘That limits the costs’
    b. Keurmeesters houden een extra ?oog/oogje in het zeil
       ‘Inspectors keep a very careful look-out.’
    c. . . . die ook Douglas tijdlang aan het *lijn/lijntje houdt.
       ‘. . . who also keeps Douglas dangling’.

Archaic forms in nouns or determiners in some mwes.

(8) a. ten kost-e van, ten last-e van, van hart-e
       ‘at the expense of, at the expense of, honestly’
    b. zich van de/den domme houden
       ‘play ignorant/innocent’


However, evidence of ‘productive’ inflectional and derivational morphology is found. Note that verbs typically allow tense inflection.

(9) a. . . . om genetisch gemanipuleerd voedsel buiten de EU-deur te houden
    b. . . . om deze ongewenste gast buiten de deuren van het ziekenhuis te houden


Syntax
Morpho-syntactic structure: regular, syntactically marked and irregular.

(10) a. een vergadering sluiten
     b. naar huis (gaan)
     c. bij lange na ‘far from (a)’, om en nabij ‘around’, al met al ‘all in all’

Internal variation: modification, quantification, determiner alternation.

(11) a. In die kringen houden ze allemaal de hand boven elkaars hoofd en gaan weer over tot de orde van de dag.
     b. Directeur Bontrop steekt niet onder stoelen of banken dat hij op deze manier ook een wetenschappelijke oog in het zeil kan houden.

Other: limited syntactic versatility (topicalization, passive, clefting, extraction); non-homomorphism at the syntax-semantics interface.


A working definition
A list of properties and idiosyncrasies of expressions that (most people would agree) qualify as lexical units (extended lexical units, mwes). However, not all idiosyncrasies are observed in a mwe and not all mwes show the same idiosyncrasies.

“A combination of two or more lexemes that must at least satisfy condition (a) and perhaps, but not necessarily, condition (b) and/or (c):
(a) the lexemes mutually select each other;
(b) the combination as a whole has a non-compositional or partially compositional meaning;
(c) the syntactic/morphological behavior of the mwe and/or its parts is not to be expected given the syntactic/morphological behavior of the individual lexemes or the combination as a whole (adapted from [Everaert, 1993]).”


Non-negligible phenomenon
Multiword expressions (mwes) abound in many languages.
“The number of mwes in a speaker’s lexicon is of the same order of magnitude as the number of single words [Jackendoff, 1997, p.156].”
“This seems likely to be an underestimate, even if we only include lexicalized phrases [Sag et al., 2001].”

(Taken from the 300M word Twente Nieuws Corpus.)
expression                    translation                frequency
iets in de gaten houden       keep an eye on something   4,328
iemand voor de gek houden     to fool someone            421
iemand op de hoogte stellen   to inform someone          1,021
aan de slag gaan              get started                3,697
voet bij stuk houden          stick to one’s decision    380
om de tafel gaan zitten       to begin negotiations      403

Towards a language model of multiword expressions
Three good reasons to treat mwes seriously:
• mwes are numerous in many languages,
• some mwes are used very frequently and,
• a system of regular productive rules (morphological, syntactic) cannot always account for their behavior.

Aiming at building a (computational) language model that accounts for mwes, we need
(i)   an inventory of the expressions (lexicon),
(ii)  a description of their linguistic behavior (grammar) and,
(iii) a theoretical framework (frame).

Towards an inventory of Dutch mwes
Within the STEVIN project Identification and Representation of Multiword Expressions (IRME) a large lexical database of Dutch mwes is being developed. We investigate
• hybrid methods to automatically compile a dictionary of mwes from corpora and,
• a standard for the representation of mwes in the dictionary.

Project partners:
• Alfa-Informatica, Groningen: Gertjan van Noord, Gosse Bouma and Begoña Villada Moirón.
• UIL-OTS, Utrecht: Jan Odijk and Nicole Grégoire.
• Van Dale Lexicografie


Automatic identification of mwes
• We know that multiword expressions abound in large corpora.
• From previous studies of the linguistic phenomenon we know (some of) the descriptive properties of idiomatic expressions, phrasal verbs and other types of mwes.

Task:
• Design a corpus-based method to identify and extract word combinations that ought to be described as mwes in a lexicon.

Objective:
• a tool that provides a list of mwes,
• classification into subgroups (phrasal verbs, support verb constructions, idioms, adjective noun combinations, etc.).


Previous approaches to mwe identification in written text
• Two main approaches:

Purely statistical approach: raw corpora, lexical affinity, ngram models, stop lists, seed terms, anticollocations, manual evaluation

Hybrid approach: corpora annotated with linguistic information, syntagmatic relationship between components, capture other properties, bigram statistics and more expressive probabilistic models, automatic evaluation


Hybrid approach
• Earlier work concentrates on extracting collocations (een vergadering sluiten).
  descriptive property: lexical affinity
  experimental measures: co-occurrence frequency, log-likelihood, pointwise mutual information, dice coefficient, etc.

• More recent work aimed at collocations [Krenn and Evert, 2001, Wanner, 2004] but also at phrasal verbs (look up) [Baldwin and Villavicencio, 2002], complex prepositions [Bouma and Villada, 2002], determinerless pps [Baldwin et al., 2005], support verb constructions (make progress) [Villada Moirón, 2004, Stevenson et al., 2004], prepositional verbs (take sth into account) [Baldwin, 2005], idiomatic verb phrases [Villada Moiron and Tiedemann, To appear, Fazly and Stevenson, 2006], non-compositional phrases [Baldwin et al., 2003], etc.
  descriptive properties: lexical affinity, non-compositional semantics, modifiability, lexical variation inside arguments.
  experimental measures: probabilistic measures tailored to capture linguistic properties, log-linear models, well-known association measures, latent semantic analysis, etc.
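
Two of the association measures named on this slide can be computed directly from corpus counts. The following is a minimal sketch in Python of pointwise mutual information and the Dice coefficient; the function names and the count arguments (c_xy, c_x, c_y, n) are illustrative assumptions and do not come from the slides.

```python
import math


def pmi(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) ).

    c_xy: co-occurrence count of x and y
    c_x, c_y: individual counts of x and y
    n: total number of observed pairs
    """
    return math.log2((c_xy * n) / (c_x * c_y))


def dice(c_xy: int, c_x: int, c_y: int) -> float:
    """Dice coefficient: 2 * c(x,y) / (c(x) + c(y))."""
    return 2 * c_xy / (c_x + c_y)
```

Both functions assume non-zero counts; a real extraction pipeline would also need a policy for unseen pairs.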


• In general, methods that incorporate more linguistic information into the identification model seem to be more successful.

Procedure
• corpora annotated with linguistic information (part-of-speech tags, phrase chunks, full syntactic analysis)
• extract patterns that satisfy specific morpho-syntactic constraints ([verb noun phrase], [adjective noun], [verb prepositional complement])
• apply quantitative techniques to capture linguistic properties
• if reference data is available, evaluation can be automatic

• Limitations: annotation tools are needed, size problem and low frequency data.


Identification seen as classification task
Procedure:
• Select three characteristics that (may) set mwes apart from regular verb pp phrases.
• Measure the extent to which a characteristic is observed in the candidate expressions.
• Use quantitative measurements as evidence in a classification task.
• If a candidate expression shows mwe characteristics, label it yes, else no.


What characteristics?
• Lexical affinity between the component words in a mwe. We expect that word combinations that form a mwe show stronger lexical affinity than productive word combinations.
• Head dependence
• Preference for specific syntactic contexts.


Head dependence
Head dependence is a linguistic diagnostic used to distinguish real arguments from optional adjuncts of a verb [Merlo and Leybold, 2001, Merlo, 2003]. A pp phrase selected by a large number of verbs is likely to be an optional adjunct. A pp phrase selected by one or a few verbs is probably a required argument of its selecting verb.

pp                # selecting verbs   verbs
van wijs          4                   brengen,laten,maken
aan pols          3                   houden,komen,geven
bij paaltje       2                   komen,terugschrikken
in hongerstaking  2                   gaan,zijn
met vut           4                   gaan,kunnen,moeten,aflopen
uit gezin         6                   komen,doen,geven,trouwen
in cassatie       2                   gaan,maken
aan beurt         3                   komen,zijn,inspringen
in jaar           823                 zijn,brengen,doen,gaan,nemen
aan kant          249                 zijn,houden,staan
op markt          173                 brengen,komen
in brief          145                 schrijven,stellen
van ministerie    43                  krijgen,hebben,zijn,vragen,weten

Preference for specific syntactic contexts
pp complements occur closer to the verb in v-final contexts than adverbial pps. The restriction applies when adverbials are realized by pps, not by other adverbs [Broekhuis, 2004]. Arguments in fixed expressions tend to immediately precede the verb group in v-final contexts (Jack Hoeksema (p.c.)).

Deze 440 is comfortabel geveerd; hij laat zich te gauw [van de wijs] brengen door hobbels en bobbels in de weg.
Maar als puntje [bij paaltje] komt, is de Nederlander toch een koopman en zal hij Duits praten.
Elias zou [in een brief] aan de hoofdredactie hebben geschreven dat ...
Toen de werknemers moesten inleveren op de WAO stonden de werkgevers [aan de kant] te roepen dat het hard nodig was.
Over Toni Morrison werd [in deze krant] al uitgebreid geschreven.
Minister Ter Beek (defensie) zegt vandaag in een interview [in deze krant] blij te zijn dat...


Quantifying linguistic properties
Lexical affinity measured as
• co-occurrence frequency: c(V,P,N),
• log-likelihood score: log-lik(V,PN) and
• salience: sal(V,PN)

word combination      frequency   log-lik    salience
(in,jaar,zijn)        189         4.5073     4.1910
(van,wijs,brengen)    124         806.5314   45.9399
(aan,pols,houden)     87          564.5474   41.2776
(aan,kant,staan)      85          153.4448   16.5825
(bij,paaltje,komen)   58          228.0177   28.0868
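
The slides do not say which formulation of the log-likelihood score was used. Purely as an illustration, the sketch below assumes Dunning's log-likelihood ratio (G²) over a 2×2 contingency table of verb and pp counts; the function name and the cell names k11 to k22 are hypothetical.

```python
import math


def log_likelihood(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 = 2 * sum(O * ln(O / E)) over the four cells of a 2x2 table.

    Assumed cell layout (not taken from the slides):
      k11: verb V together with pp (P,N)
      k12: verb V with any other pp
      k21: any other verb with pp (P,N)
      k22: any other verb with any other pp
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2 * g2
```

When verb and pp are independent the observed and expected counts coincide and the score is 0; the stronger the association, the larger the score.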

Head dependence measured as
• number of pp selecting verbs: hd_d_int and entropy of the distribution of pp selecting verbs (see 2)

\[ \mathit{hd\_d\_int}(P,N)_i = |\{\text{verbs selecting } (P,N)_i\}| \tag{1} \]

\[ H(P,N)_i = - \sum_{j} \frac{f((P,N)_i, V_j)}{f((P,N)_i)} \log \frac{f((P,N)_i, V_j)}{f((P,N)_i)} \tag{2} \]

word combination      hd_d_int   entropy
(bij,paaltje,komen)   2          0.086
(aan,pols,houden)     3          0.123
(van,wijs,brengen)    4          0.172
(aan,kant,staan)      249        4.037
(in,jaar,zijn)        823        4.903
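
The two head dependence measures can be sketched in Python as follows. The input format (a mapping from each selecting verb to its co-occurrence frequency with one pp) and the natural logarithm are assumptions; the slides do not state the base of the logarithm.

```python
import math


def head_dependence(verb_counts: dict) -> tuple:
    """Return (number of selecting verbs, entropy) for one pp (P,N).

    verb_counts maps each verb V_j to f((P,N), V_j), the frequency with
    which that verb selects the pp. The log base is an assumption.
    """
    counts = [c for c in verb_counts.values() if c > 0]
    total = sum(counts)  # f((P,N))
    if total == 0:
        raise ValueError("the pp must occur with at least one verb")
    probabilities = [c / total for c in counts]
    entropy = sum(-p * math.log(p) for p in probabilities)
    return len(counts), entropy
```

A pp that occurs with a single verb gets entropy 0, and a pp spread evenly over many verbs gets a high entropy, which is the contrast the diagnostic relies on.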

Preference for specific syntactic context measured as
• relative frequency of seeing the pp immediately preceding the verb group in an observed verb pp,

\[ f((V,P,N), \mathit{pos} = \text{'ipr'}) = \frac{c((V,P,N), \mathit{pos} = \text{'ipr'})}{c((V,P,N))} \tag{3} \]

word combination      f((V,P,N), pos='ipr')
(van,wijs,brengen)    1.00
(bij,paaltje,komen)   1.00
(aan,pols,houden)     0.92
(aan,kant,staan)      0.73
(in,jaar,zijn)        0.16
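
The relative frequency of the 'ipr' position can be sketched as a pair of counters. The input format, a list of (triple, position) observations from verb-final clauses, and the position label "other" are assumptions made for illustration; only the label 'ipr' appears in the slides.

```python
from collections import Counter


def ipr_relative_frequencies(occurrences):
    """Map each (V, P, N) triple to c(triple, pos='ipr') / c(triple).

    occurrences: iterable of (triple, position_label) pairs, one per
    observed verb pp occurrence in a verb-final context (assumed format).
    """
    total = Counter()
    ipr = Counter()
    for triple, position in occurrences:
        total[triple] += 1
        if position == "ipr":
            ipr[triple] += 1
    return {triple: ipr[triple] / total[triple] for triple in total}
```

A value near 1.00 means the pp almost always sits directly before the verb group, as in the first rows of the table above.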

The classification task
Classification task: candidate expressions categorically split into mwes and productive combinations.
Given a collection of candidate expressions that specify some attributes (characteristics), find a learning function that is able to infer to which class each candidate expression belongs.

expression         a1     a2      a3     a4       a5          class
breng_van_wijs,    124,   1.00,   4,     0.172,   806.5314,   ?
kom_bij_paaltje,   58,    1.00,   2,     0.086,   228.0177,   ?
houd_aan_pols,     87,    0.92,   3,     0.123,   564.5474,   ?
sta_aan_kant,      85,    0.73,   249,   4.037,   153.4448,   ?
ben_in_jaar,       189,   0.16,   823,   4.903,   4.5073,     ?

Decision tree learning: a supervised machine learning method.


Settings and data
Candidate expressions:
• mwes including a pp argument.
• Representation: verb pp (other args in vp ignored). E.g. een vinger aan de pols houden: houd aan pols

Resources:
• Corpus: CLEF data
  - text of 2 newspapers (2 years); fully parsed with Alpino
• Candidate extraction:
  - extract dependency triples from parsed sentences
  - collect pps sharing a dependency relation with a verb
• Learning algorithm
  - Weka machine learning algorithms: decision trees algorithm (J4.8)
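
The candidate extraction step can be pictured as counting triples and keeping the frequent ones. The sketch below assumes the parser output has already been reduced to (verb, preposition, noun) tuples; the actual Alpino output format is not shown in the slides, and the function name is hypothetical.

```python
from collections import Counter


def count_candidates(triples, min_freq=50):
    """Count (verb, prep, noun) triples and keep those with freq > min_freq.

    triples: iterable of (verb, prep, noun) tuples, one per dependency
    relation between a verb and a pp found in the parsed corpus
    (assumed input format).
    """
    counts = Counter(triples)
    return {triple: c for triple, c in counts.items() if c > min_freq}
```

The default threshold follows the "freq > 50" selection used for the dataset; the kept counts correspond to the co-occurrence frequency c(V,P,N).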


Learning attributes
Attributes: co-occurrence frequency, frequency of pp in ’ipr’ position in verb final context, head dependence (integer and entropy) and log-likelihood/salience.

Data representation
houd_aan_afspraak,   251, 0.88,  19, 0.825, 1428.3465, ?
heb_in_handen,       207, 1.00,   7, 1.215,  576.6651, ?
kom_tot_resultaat,    59, 0.98,   5, 0.942,  113.5017, ?
sta_ter_discussie,    85, 0.98,   2, 0.445,  276.8422, ?
ga_op_dag,            67, 0.48, 328, 4.610,    9.0425, ?
sta_in_krant,         67, 0.71, 124, 3.676,  159.0070, ?
ga_met_gulden,        65, 0.33,  51, 2.853,   95.5512, ?
krijg_de_tijd,       368, 0.97,   3, 1.059, 1237.7703, ?
krijg_bij_er,        472, 0.14,  82, 2.633,  520.3565, ?
heb_van_doen,        390, 1.00,   1, 0.000, 2993.8551, ?
kom_naar_land,        63, 0.97,  76, 3.034,    5.3596, ?
laat_met_rust,        72, 1.00,   9, 0.920,  788.4035, ?
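
Rows like these resemble the data section of a Weka ARFF file, where "?" marks a missing value. A minimal sketch of writing such a file from Python follows; the relation name, the attribute names and the reading of the last numeric column as the log-likelihood score are assumptions, since the slides show only the data rows.

```python
def to_arff(rows, relation="mwe_candidates"):
    """Serialize candidate rows as ARFF text.

    rows: iterable of 7-tuples (expression, cooc_freq, ipr_rel_freq,
    hd_int, hd_entropy, log_likelihood, class_label). All attribute
    names here are hypothetical.
    """
    header = [
        f"@relation {relation}",
        "@attribute expression string",
        "@attribute cooc_freq numeric",
        "@attribute ipr_rel_freq numeric",
        "@attribute hd_int numeric",
        "@attribute hd_entropy numeric",
        "@attribute log_likelihood numeric",
        "@attribute class {yes,no}",
        "@data",
    ]
    data = [",".join(str(value) for value in row) for row in rows]
    return "\n".join(header + data) + "\n"


# One row copied from the slide; the class is still unknown ("?").
example = to_arff([("houd_aan_afspraak", 251, 0.88, 19, 0.825, 1428.3465, "?")])
```

Training rows would carry yes or no in the last field instead of the question mark.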


Dataset

Candidates   Types    Tokens
all          455985   912970
freq > 50    1068

• Selected triples freq > 50: 1068 types
• 25% Training data
  - 200 candidates manually classified as mwe: yes/no
  - Checked in newest Van Dale’s dictionary
• 75% Testing data


Evaluation

Classifier              Training data   Testing data
Baseline (salience)     80.20%          75.89%
Decision trees (J4.8)   87.81%          89.01%
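
A salience baseline of this kind can be sketched as a single cutoff on the salience score. The slides do not say how the cutoff was chosen, so the value below is made up for illustration, and the function names are hypothetical; the salience scores themselves are taken from the lexical affinity table.

```python
def salience_baseline(salience: float, cutoff: float) -> str:
    """Label a candidate 'yes' (mwe) if its salience reaches the cutoff."""
    return "yes" if salience >= cutoff else "no"


def accuracy(predicted, gold):
    """Fraction of candidates whose predicted label matches the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Salience scores from the slides; the cutoff is a hypothetical value.
scores = {
    "(in,jaar,zijn)": 4.1910,
    "(van,wijs,brengen)": 45.9399,
    "(aan,pols,houden)": 41.2776,
    "(aan,kant,staan)": 16.5825,
    "(bij,paaltje,komen)": 28.0868,
}
HYPOTHETICAL_CUTOFF = 20.0
predictions = {
    name: salience_baseline(score, HYPOTHETICAL_CUTOFF)
    for name, score in scores.items()
}
```

Comparing such predictions with manually assigned labels through `accuracy` gives a single-feature reference point against which a multi-feature classifier can be judged.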

Results (1)

Features used                                              Precision
Baseline
  salience (a6)                                            75.89
Decision trees classifier
  a2+a3+a4+a5b+a6                                          89.01
Effect of lexical affinity
  none                        a3+a4+a5                     84.40
  no c(V,P,N)                 a3+a4+a5+a6                  88.18
  no sal(V,PN)                a2+a3+a4+a5                  86.05
Effect of head dependence
  none                        a2+a3+a6                     89.01
  no hd_d_int(P,N)            a2+a3+a5+a6                  89.01
  no H(P,N)                   a2+a3+a4+a6                  89.01
Effect of syntactic context preference
  no f((V,P,N), pos=’ipr’)    a2+a4+a5+a6                  74.23

a2=co-occurrence frequency, a3=rel. freq. of pp in ’ipr’ position in v-final context, a4=head dependence (integer), a5=head dependence (entropy), a6=salience


Results (2)
As the decision tree algorithm suggests,
• the head dependence features do not improve the task of classification

In our experiments, the concept of mwe can best be approximated with
• relative frequency of seeing a pp in ’ipr’ position in a verb final context and the lexical affinity measured with the salience score.

Where does the classifier go wrong?
Correctly classified     89.01
Incorrectly classified   10.99

False negatives

kom_tot_akkoord       y   n   unknown   201   0.91    4   0.191   758.896
kom_in_bezit          y   n   unknown   118   0.80   21   1.925    83.444
ben_in_bezit          y   n   unknown    69   0.90   21   1.925   119.343
stel_onder_curatele   y   n   unknown    91   0.92    5   0.500   650.311

• A modifier of the pp occurs before the verb. Technique to extract position information needs to be improved.
  onder aandacht brengen
• Candidates that select a lexicalized np. The np immediately precedes the verb.
  van nood maken (Dan is het goed van de nood een deugd te maken)


kom_te_wereld        y   n   unknown    60   0.94    61   3.000    66.047
heb_in_dienst        y   n   unknown    78   0.95    47   2.039    58.102
ben_in_dienst        y   n   unknown    98   0.96    47   2.039    40.818
ga_in_werk           y   n   unknown   156   0.97   103   2.450    94.847
houd_in_handen       y   n   unknown    96   0.97     7   1.215    68.071
neem_in_hand         y   n   unknown   111   0.99   101   2.776    89.122
kom_aan_slag         y   n   unknown   191   1.00    22   1.316    27.135
kom_in_verleiding    y   n   unknown    50   1.00    10   1.093   102.598
krijg_in_bezit       y   n   unknown    53   1.00    21   1.925    52.413
krijg_in_gaten       y   n   unknown   133   1.00     4   0.603    71.675
neem_in_aanmerking   y   n   unknown    84   1.00    28   0.256     3.589

• pps may occur with many different verbs (support or not),
• verb pp combination has one literal use and one (or more) idiomatic uses,
• in these cases salience score lower than cutoff


False positives
• half of the false positives are correctly classified but were wrongly labelled in our reference data
• directional PPs (naar huis), figurative (tot NP (resultaat, succes))
• locative PPs: in (regering, rolstoel, positie), bij kas
• doubtful cases: naar beurs breng/ga, tot daad kom, onder druk kom


Discussion and conclusions
• An advantage of using the supervised machine learning method is being able to systematically test the effect on classifier accuracy of empirical features that attempt to capture linguistic properties of mwes.
• No limit on the number of features. No restrictions on the information encoded as features.
• A disadvantage is that one needs to annotate some training data. It’s worth the trouble: with only 25% of training data, a reasonable accuracy (89.01%) was obtained.
• Further tests are needed to see if results are valid (across other types of expressions) or a side-effect of the experimental procedure.
• The routine provides relevant features to be used as input for a Maximum Entropy model on a bigger dataset.


Future improvements
• Improve approximation of characteristics
  - position of pp in verb final context
  - effect on head dependence features
• Include other attributes:
  - pp dependency relation label proposed by parser
  - NP/XP argument
  - idiomaticity [Villada Moiron and Tiedemann, To appear]
  - variation within pp (adjective, determiner, other mods.)
• Combine efforts of various classifiers. Expressions labelled as mwes by the first classifier are input to another classifier that issues a subtype label (support verb construction, phrasal verb, etc.)


References
T. Baldwin, C. Bannard, T. Tanaka, and D. Widdows. An Empirical Model of Multiword Expressions Decomposability. In Proc. of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96, Sapporo, Japan, 2003.
T. Baldwin, J. Beavers, L. van der Beek, F. Bond, D. Flickinger, and I.A. Sag. In search of a systematic treatment of Determinerless PPs. Computational Linguistics Dimensions of Syntax and Semantics of Prepositions. Kluwer Academic, 2005.
Timothy Baldwin. Looking for prepositional verbs in corpus data. In Proc. of the 2nd ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their use in computational linguistics formalisms and applications, Colchester, UK, 2005.
Timothy Baldwin and Aline Villavicencio. Extracting the unextractable: A case study on verb-particles. In Proceedings of CoNLL-2002, pages 98–104. Taipei, Taiwan, 2002.
Gosse Bouma and Begoña Villada. Corpus-based acquisition of collocational prepositional phrases. In M. Theune, A. Nijholt, and H. Hondorp, editors, Computational Linguistics in the Netherlands 2001. Selected Papers from the Twelfth CLIN Meeting, pages 23–37, Amsterdam, 2002. Rodopi.
Hans Broekhuis. Het voorzetselvoorwerp. Nederlandse Taalkunde, 9(2):97–131, 2004.
A. M. Di Sciullo and E. Williams. On the definition of Word. MIT Press, Cambridge Ma., 1987.
Martin Everaert. Vaste verbindingen (in woordenboeken). Spektator, 3:3–27, 1993.
A. Fazly and S. Stevenson. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), Trento, Italy, 2006.
Ray Jackendoff. The architecture of the language faculty. MIT Press, Cambridge, MA, 1997.
Brigitte Krenn and Stefan Evert. Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL workshop on Collocations, Toulouse, 2001.
Paola Merlo. Generalised pp-attachment disambiguation using corpus-based linguistic diagnostics. In Procs of the EACL’03, Budapest, Hungary, 2003.
Paola Merlo and Matthias Leybold. Automatic distinction of arguments and modifiers: the case of prepositional phrases. In Procs of the Fifth Computational Natural Language Learning Workshop (CoNLL-2001), pages 121–128, Toulouse, France, 2001.
Ivan Sag, T. Baldwin, F. Bond, A. Copestake, and D. Flickinger. Multiword expressions: a pain in the neck for NLP. LinGO Working Paper No. 2001-03, 2001.
Suzanne Stevenson, Afsaneh Fazly, and Ryan North. Statistical measures of the semi-productivity of light verb constructions. In Takaaki Tanaka, Aline Villavicencio, Francis Bond, and Anna Korhonen, editors, Second ACL Workshop on Multiword Expressions: Integrating Processing, pages 1–8, Barcelona, Spain, July 2004. Association for Computational Linguistics.
Begoña Villada Moirón. Discarding noise in an automatically acquired lexicon of support verb constructions. In Proceedings of 4th International Conference on Language Resources and Evaluation 2004, volume V, pages 1859–1862, Portugal, 2004.
Begoña Villada Moiron and Jörg Tiedemann. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, Trento, Italy, To appear.
Leo Wanner. Towards automatic fine-grained semantic classification of verb-noun collocations. Natural Language Engineering, 10(2):95–143, 2004.
