FINITE-STATE COMPOSITION OF FRENCH VERB MORPHOLOGY

Jean-Pierre Chanod
Rank Xerox Research Centre
6 Chemin de Maupertuis
F-38240 Meylan, France

7 November 1994

Summary

This paper gives a novel account of French verb morphology as a composition of finite-state constraints. It consists of a cascade of transductions, each one dedicated to a specific and linguistically motivated task. Each transducer expresses a number of linguistic generalizations that can be shared between various inflectional paradigms, traditionally classified separately. The sequence of ordered transductions is finally composed into a single lexical transducer which can be used both for analysis and generation. This results in a system which is transparent to the lexicographer, as transductions have a clear interpretation at every stage of the cascade, whereas systems limited to one stage of transductions tend to be opaque and difficult to debug. We show how this approach was applied to an extensive description of the rich morphology of French verbs, without even using diacritic signs pointing to traditional inflectional paradigms. This work was implemented with Xerox finite-state tools and is currently used in various higher level NLP projects.

Subject Areas: Morphology, finite-state technology, transducers
Word Count: 3778

1. INTRODUCTION

In their 1992 COLING paper, L. Karttunen, R. Kaplan and A. Zaenen (Karttunen et al., 1992) describe the construction of a lexical transducer for French. One component of that transducer carries out the analysis and generation of French verb forms. This lexicon was created by Carol Neidle and Annie Zaenen. Recent advances in lexical technology have made it possible to provide a simpler and linguistically more appealing description of the system.

2. TWO-LEVEL MORPHOLOGY

In the classical two-level morphology model introduced by K. Koskenniemi (Koskenniemi, 1983; Antworth, 1990), the lexicon may be viewed as a simple minimized finite-state machine that describes lexical forms, while at run time, rule transducers match surface forms with their corresponding lexical forms. Morphological information is stored as separate information strings. The Xerox lexical transducers model (Karttunen, 1991; Karttunen et al., 1992; Karttunen, 1994) introduces some major improvements compared to the classical approach, namely:

Lexical representations consist of a canonical form followed by morphological tags. The canonical form can be the infinitive for verbs, the masculine singular for nouns and adjectives (if applicable in a given language), etc. The lexical form can then be viewed as a string of symbols that represents the base form as well as morphological features (number, gender, tense, etc.) or part of speech. For instance, the lexical representation for the second person singular imperative of the French verb danser is danser +Imp +SG +P2 +Verb, which corresponds to the surface string danse. Previous systems, on the contrary, accept ad hoc lexical forms in order to recover the associated surface form.

The lexicon and the two-level rules are composed into a single structure (a lexical transducer) that encodes both the lexical and the surface level. As a consequence, there is no need to apply the rules at run time, and the same description can be used for analysis (i.e. recovering the lexical form associated with a given surface form) or for generation.

In a sense, the classical Bescherelle conjugation books (Bescherelle, 1990) could be viewed as lexical transducers: a given verb (identified by its canonical infinitive form) and its tense, person, number and gender indices point to a particular inflected form and vice versa.

In the initial implementations of this model, the rules were difficult to write because of the distance between the lexical and the surface forms, especially in a rich inflectional system like that of French verbs. One had to implement a large set of two-level rules, with the well-known side effects of the parallel decomposition of complex relations: rules apply outside their intended domain, the modification or introduction of a rule has unpredictable effects on the whole system, tracing is cumbersome, etc. As a result, systems limited to one stage of transductions tend to be opaque and difficult to debug.

Recent advances in lexical technology (Karttunen & Beesley, 1992; Karttunen & Yampol, 1992; Karttunen, 1993) allowed us to overcome these difficulties. The new approach combines the two-level rule formalism and the cascade model discussed in (Kaplan & Kay, 1981; Kaplan & Kay, forth.). The implementation uses both the intersection and the composition operations, as described in (Karttunen et al., 1992). The final result is a two-level system (i.e. a lexical transducer as described above), but it does not have to be directly constructed as such in one operation. Indeed, the various morphological alternations of the French verb system are described as linguistically motivated alternations that are logically ordered into a sequence of transductions. In other words, the lexicographic description of French verbs is not limited to two levels. Most alternations and apparent irregularities of French verbs are described in a tractable and compact way using intermediate levels between the lexical and surface levels. Each level is associated with a small set of two-level rules corresponding to a particular type of transformation, as will be detailed below.

By doing so, we depart from the Bescherelle-like conjugation approach. Rules apply according to a particular context defined by sequences of letters and/or morphological tags, regardless of the conjugation class of the verb. Indeed, our description of French verbs does not refer to numerous conjugation classes. The description of over 6000 verbs was implemented with the Xerox finite-state tools. The different levels introduced in the description are compiled into a final lexical transducer, in a way which is transparent to the lexicographer.
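As a rough illustration of how a cascade can collapse into one bidirectional relation, the sketch below models each stage as a finite set of string pairs and composes them. It is a toy under stated assumptions, not the Xerox implementation: the helper names (compose, invert, apply_down) and the two intermediate strings for danser are hypothetical, and real finite-state tools compose automata rather than enumerated pairs.

```python
# Toy model: a "transducer" is a finite set of (upper, lower) string pairs.

def compose(r1, r2):
    """Relate x to z whenever r1 relates x to y and r2 relates y to z."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

def invert(r):
    """Swap the upper and lower sides of a relation."""
    return {(lower, upper) for (upper, lower) in r}

def apply_down(r, upper):
    """Return every lower string paired with the given upper string."""
    return sorted(lower for (u, lower) in r if u == upper)

LEXICAL = "danser +Imp +SG +P2 +Verb"

# Hypothetical intermediate levels, chosen only for illustration.
stems = {(LEXICAL, "dans +Imp +SG +P2 +Verb")}
inflections = {("dans +Imp +SG +P2 +Verb", "dans e")}
alternations = {("dans e", "danse")}

# The cascade is composed once; the result has only two levels.
lexical_transducer = compose(compose(stems, inflections), alternations)

generated = apply_down(lexical_transducer, LEXICAL)          # generation
analysed = apply_down(invert(lexical_transducer), "danse")   # analysis
```

The same composed relation serves both directions: generation reads it top-down, and analysis reads its inverse.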

3. REPRESENTATION OF FRENCH VERBS

The lexical representation of the verbs is a string of symbols that may consist of the concatenation of:

- a lexical description of prefixed clitics
- a lexical description of derivational prefixes
- a base form (the infinitive of the verb)
- a sequence of morphological features and a part-of-speech tag
- a lexical description of suffixed clitics

The verb surface forms can be decomposed as a sequence <stem> <inflection>, where <stem> represents the verb stem, and where <inflection> is a realization of the tense/person dependent morpheme. For instance, dansons (let's dance) is decomposed as <dans> <ons>. (For the sake of simplicity, we will ignore the transduction of the clitics (e.g. dis-le-moi / tell it to me) and inflectional prefixes (e.g. re-raconter / to tell again), unless otherwise specified.) It is then natural in the finite-state framework to try and establish a correspondence between the lexical and surface representations of verbs, as illustrated in the following diagram.

    Lexical level:     <base form>     <morphological features>
                            ↕                      ↕
    Surface level:       <stem>              <inflection>

According to this scheme, the base form is transduced into a stem and the sequence of morphological features is transduced into a tense/person inflection. However natural this approach may seem, its implementation is not straightforward, because it must account for the large number of irregularities in the French verb system.

4. MORPHOLOGICAL PROPERTIES OF FRENCH VERBS

Before describing how we implemented a two-level representation of French verbs as sketched above, we will briefly draw a general picture of the morphological system of French verbs. We limit ourselves to properties that are of particular interest for a finite-state representation of this system. By doing so, we depart from Bescherelle representations, according to which verbs are merely classified among about 100 distinct inflectional paradigms.

Indeed, it is the diverse irregularities attached to a given verb that lead to the traditional notion of inflectional paradigm. But actually, each particular pattern follows general properties that are not specific to any given classical paradigm (consider for instance the extra i appearing at the same persons of the present indicative, subjunctive or imperative of verbs like venir, acquérir, asseoir: e.g. viens, acquiers, assieds). Such particularities can be handled conveniently by contextual rules, in the general framework of finite-state lexicons.

Most irregularities result from phonological or orthographic alternations at the junction of stems and inflections (the same would be true of clitics, as in chanté-je, vas-y, donnes-en, chante-t-il). Alternations are not specific to a given inflectional paradigm, but rather to a particular context, identified by sub-strings of symbols (letters extracted from the base form, sequences of morphological features, verb endings, etc.). Those alternations tend to be regularly distributed over specific tenses and persons.

Let us consider for instance verbs ending in ir or re in the infinitive. The present indicative for such verbs consists of a stem (derived from the infinitive by deletion of the ir or re infinitive ending) followed by the suffixes -s, -s, -t, -ons, -ez, -ent (for the 1st, 2nd and 3rd persons, singular and plural respectively). Example:

    infinitive:          courir      conclure
    1st pers sg pres:    cours       conclus
    2nd pers sg pres:    cours       conclus
    3rd pers sg pres:    court       conclut
    1st pers pl pres:    courons     concluons
    2nd pers pl pres:    courez      concluez
    3rd pers pl pres:    courent     concluent
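The regular scheme just described (delete the ir or re ending, then append a person suffix) can be sketched in a few lines. This is only an illustration of the general pattern: the function name regular_present and the SG/PL and P1/P2/P3 keys are assumptions modelled on the tags used in this paper, and no alternation rule is applied.

```python
# Present indicative suffixes for verbs in -ir / -re, as listed in the text.
SUFFIXES = {
    ("SG", "P1"): "s", ("SG", "P2"): "s", ("SG", "P3"): "t",
    ("PL", "P1"): "ons", ("PL", "P2"): "ez", ("PL", "P3"): "ent",
}

def regular_present(infinitive, number, person):
    """Build a present indicative form with the regular scheme only."""
    if not infinitive.endswith(("ir", "re")):
        raise ValueError("this sketch only covers infinitives in -ir or -re")
    stem = infinitive[:-2]                 # delete the infinitive ending
    return stem + SUFFIXES[(number, person)]

forms = [regular_present("courir", n, p) for (n, p) in SUFFIXES]
```

Dictionaries preserve insertion order, so forms lists the six persons of courir in the order of the table above.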

This scheme is no longer applicable if we consider, among many others, verbs such as:

    infinitive:          battre      vendre      dormir
    1st pers sg pres:    bats        vends       dors
    2nd pers sg pres:    bats        vends       dors
    3rd pers sg pres:    bat         vend        dort
    1st pers pl pres:    battons     vendons     dormons
    2nd pers pl pres:    battez      vendez      dormez
    3rd pers pl pres:    battent     vendent     dorment

However, in the case of battre and vendre, the apparent irregularities (deletion of t in bats, bat, vend) can be accounted for by the general principle in French that two dentals (such as tt or dt) do not occur at the end of a word or just before a final s. In the case of dormir, the alternation results from the impossibility of having a nasal m followed by a final s or t. Such alternations can be easily described by two-level rules, without reference to the particular verb to which they apply. Provided we apply a composition operation between the finite-state lexicon and these alternation rules, the verbs battre, vendre and dormir follow the general scheme. For instance, in the case of dormir, the following sequence of operations produces the desired surface form (dors):

    Lexical form:                    dormir   +IndP +SG +P1
    (stem)                              ↕           ↕
    Intermediate form:               dorm     +IndP +SG +P1
    (inflection)                        ↕           ↕
    Intermediate form:               dorm           s
    (deletion of m followed by s)       ↕           ↕
    Surface form:                    dor            s

The up-and-down arrow recalls that the transduction is a bi-directional operation. The same description can be used in analysis (computation of the lexical form from the surface form) and in generation.

Actually, the system of French verbs is more complicated than the above schema, as in many cases the stem for the present indicative cannot be derived from the infinitive by merely deleting the ir or re ending. In that sense, the stem is not predictable, given only the lexical form of the verb. For instance:

    infinitive:            peindre      écrire
    1st pers. sg. pres.:   peins        écris
    2nd pers. sg. pres.:   peins        écris
    3rd pers. sg. pres.:   peint        écrit
    1st pers. pl. pres.:   peignons     écrivons
    2nd pers. pl. pres.:   peignez      écrivez
    3rd pers. pl. pres.:   peignent     écrivent

The predicted stem for peindre would be peind, while we find pein for the three singular persons and peign for the three plural persons. As for écrire, the predicted stem écri only appears in the singular, the plural stem being écriv. But if one defines the stems for these two verbs to be peign and écriv respectively, for all persons of the present indicative, the general scheme applies again as described earlier. All we need is to add some more phonological or orthographic constraints, namely that v and gn must be followed by a vowel or deleted (actually only the g of gn is deleted, the palatal gn turning into n). The following sequence of operations produces the desired surface form for peindre:

    Lexical form:                peindre   +IndP +SG +P1
    (stem)                          ↕            ↕
    Intermediate form:           peign     +IndP +SG +P1
    (inflection)                    ↕            ↕
    Intermediate form:           peign           s
    (depalatalisation of gn)        ↕            ↕
    Surface form:                pein            s
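The contextual constraints discussed in this section (dentals, nasal m, gn and v) can be approximated by a handful of context-dependent rewrites that never mention a particular verb. The sketch below is an assumption-laden stand-in: it uses ordered regular-expression substitutions rather than compiled two-level rules, and the boundary markers ^ and #, the VOWELS set and the function name surface are all hypothetical.

```python
import re

VOWELS = "aeiouyàâéèêëîïôöûü"

# Ordered rewrites over "stem^suffix#" (^ = morpheme boundary, # = word end).
ALTERNATIONS = [
    (r"([td])\^t#", r"\1^#"),         # dental stem + t suffix: the suffix t is dropped
    (r"tt(\^s?#)", r"t\1"),           # tt at the end of a word or before a final s
    (r"m(\^[st]#)", r"\1"),           # m is deleted before a final s or t
    (rf"gn(\^[^{VOWELS}])", r"n\1"),  # gn not followed by a vowel: the g is deleted
    (rf"v(\^[^{VOWELS}])", r"\1"),    # v not followed by a vowel: v is deleted
]

def surface(stem, suffix):
    """Apply every alternation to stem + suffix and return the surface form."""
    form = f"{stem}^{suffix}#"
    for pattern, replacement in ALTERNATIONS:
        form = re.sub(pattern, replacement, form)
    return form.replace("^", "").replace("#", "")

singular_1 = [surface(s, "s") for s in ("batt", "vend", "dorm", "peign", "écriv")]
```

Every stem goes through the same list of rules, so no entry needs to name its conjugation class. The direction is generation only; a real transducer would also run in analysis.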

The stem is identical for all persons: it depends on the tense only. Orthographic alternations of the stem can be handled by two-level rules that are independent of particular inflectional paradigms. Similar observations can be extended to most French verbs. For instance, connaître is associated with the stem connaiss in the present indicative. The double s is deleted when the inflectional suffix begins with a consonant, to generate forms like the first person singular connais (i.e. <connaiss> + <s> → <connais>). This is again a regular pattern. For the third person singular, the deletion of the s before the t suffix requires that the i be rewritten with a circumflex, in order to produce connaît. Our two-level rules thus reproduce a well-known diachronic phenomenon.

5. SEQUENCE OF COMPOSITIONS

The complete cascade of two-level transformations implemented in our description of French verbs looks as follows:

    Lexical form
         ⇕
    Stem, clitics and prefixes
         ⇕
    Stem alternations
         ⇕
    Inflection realization
         ⇕
    Inflection/clitic alternations
         ⇕
    Surface form

At the upper level, lexical forms are encoded in a finite-state lexicon. This lexicon describes the base forms of French verbs, with the possible combinations of clitics and morphological features attached to them. Entries in the source lexicon may also specify a transduction between the lexical form and a preliminary intermediate form. We use this feature to associate the base form with a tense-specific stem. For instance, the lexical form peindre +IndP +SG +P1 is related to peign +IndP +SG +P1 in the lexicon, where peign is the present indicative stem. The transduction between the lexical and surface forms of the clitics and prefixes is also defined at this stage.

The other steps of the cascade (alternations of stems, transduction of morphological features into the corresponding inflections, alternations of inflections connected to clitics) are handled by three separate sets of two-level rules. Rules that handle stem alternations as described in the previous section translate straightforwardly into the formalism of the Xerox two-level rule compiler (Karttunen & Beesley, 1992). For instance, a rule that deletes the character n preceded by g and followed by s or t (as in the case of peign s) is written as:

    n:0 <=> g _ [s | t]

Here n:0 means that n is transduced to the empty symbol (written as zero) in the context described on the right side of the formula. The underscore indicates the position of the transduced n.

Alternations at the clitic boundaries, as in chanté-je, vas-y, donnes-en, chante-t-il, are described by similar rules at a later stage of the cascade. For instance, the é of chanté-je is produced by:

    e:é <=> _ -je
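To make the reading of such a rule concrete, the sketch below checks whether an aligned lexical/surface pair respects a rule of the shape upper:lower <=> _ right-context, taking the rule that turns e into é before -je as the example. It rests on simplifying assumptions: the two strings are aligned character by character with no insertions or deletions, the context is matched on the lexical side only, and the function name satisfies is hypothetical. It is not how the two-level rule compiler works internally.

```python
def satisfies(lexical, surface, upper, lower, right_context):
    """Check 'upper:lower <=> _ right_context' on two equal-length strings.

    The rule holds if every lexical `upper` is realised as `lower`
    exactly when it is immediately followed by `right_context`.
    """
    if len(lexical) != len(surface):
        raise ValueError("this sketch needs a one-to-one alignment")
    for i, (lex_char, surf_char) in enumerate(zip(lexical, surface)):
        if lex_char != upper:
            continue
        in_context = lexical.startswith(right_context, i + 1)
        if (surf_char == lower) != in_context:
            return False    # realised outside the context, or not realised inside it
    return True

ok = satisfies("chante-je", "chanté-je", "e", "é", "-je")
missing = satisfies("chante-je", "chante-je", "e", "é", "-je")
outside = satisfies("chante", "chanté", "e", "é", "-je")
```

The final e of je stays unchanged because it is not followed by -je, which is why the first pair is accepted.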

As for describing inflections using two-level rules, we had to define a transduction for every sequence of morphological features. Roughly speaking, most inflectional suffixes can be decomposed into a tense-dependent and a person-dependent part. A reduced set of two-level rules provides a complete description of inflectional suffixes, some rules transducing the tense features into tense-dependent inflections, others transducing the person features into person-dependent realizations. Exceptions to this scheme are easily controlled by constraints expressed in the two-level rule formalism.

French verbs can be grouped into three main classes characterized by different inflectional suffixes (e.g. er, ir and e respectively for infinitives, as in danser, finir and battre). Although stem alternations may apply across classes, we chose to split the description of French verbs into three independent transducers, each one being associated with one of the three inflectional classes. This simplifies the transduction of morphological tags into verb endings. For instance, the sequence of features +IndP +SG +P1 (first person singular of the present indicative) generally gets realized as e for the first class and as s for the second and third classes. Once we have constructed a separate lexical transducer for each of the three verb classes, we merge them (via a union operation) into a single transducer that represents the global set of French verbs.
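The class split and the final union can be pictured with plain sets of string pairs. In this sketch only the endings (e for the first class, s for the second and third) come from the text; the class names, the build_class helper and the stems assigned to each verb are assumptions made for the example, and a single feature sequence stands in for the full paradigm.

```python
TAGS = "+IndP +SG +P1"

# Realisation of +IndP +SG +P1 per class, as stated in the text.
ENDING = {"first": "e", "second": "s", "third": "s"}

def build_class(name, stems):
    """Build the (lexical form, surface form) pairs for one verb class."""
    return {(f"{infinitive} {TAGS}", stem + ENDING[name])
            for infinitive, stem in stems.items()}

# Hypothetical stems, chosen so that no alternation rule is needed.
first = build_class("first", {"danser": "dans", "chanter": "chant"})
second = build_class("second", {"finir": "fini"})
third = build_class("third", {"conclure": "conclu"})

# The three class transducers are merged by a union operation.
french_verbs = first | second | third
```

Because each class carries its own ending table, the mapping from tags to endings stays a simple lookup within a class, and the union restores a single description of all verbs.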

6. DESIGN OF THE SOURCE LEXICON

Some lexical patterns are explicitly coded in the source lexicon. Such patterns include all canonical base forms, combinations of morphological features and clitics, and also irregularities that do not follow any predictable scheme. Such irregularities include in particular tense-dependent stems. One should emphasize the fact that the lexicographer need not explicitly encode all lexical forms in the source lexicon. Sequences of symbols shared by series of entries may be grouped into the same continuation class (in other words, a shared sub-lexicon). All forms that share the same continuation class can be encoded in a compact way: only the specific information (i.e. the infinitive form) is explicitly coded, the rest of the lexical description being handled by a pointer to the appropriate sub-lexicon.

For instance, verbs from the first class (infinitive ending in er) can be encoded as follows:

    chanter    firstclass
    danser     firstclass
    manger     firstclass
    ...

where firstclass is defined as the common continuation class of all possible sequences of morphological features, such as +IndP +SG +P1 +Verb for the first person singular of the present indicative.

In fact, continuation classes are used in a somewhat more elaborate way, as the lexicon accounts not only for the base form and the morphological features (i.e. for the lexical level of the two-level representation), but also for tense-dependent stems (i.e. for the first step of transduction in the cascade described above). For example, if we consider verbs like peindre, joindre, craindre, they can be decomposed into a word-specific sub-string (pei, joi, crai) and a variable sub-string ndre. These three verbs are coded as follows:

    pei     NDRE
    crai    NDRE
    joi     NDRE

where NDRE is a common continuation class that points to the following sub-lexicon:

    ndre:gn     PresentIndic
    ndre:gn     ImparfaitIndic
    ndre:gn     PresentSubj
    ndre:ndr    Future
    ndre:ndr    Conditional
    ...         ...
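The way continuation classes expand into lexical/intermediate pairs can be mimicked with a small recursive generator. The sketch is a simplification under stated assumptions: the dictionary layout, the Root sub-lexicon name, the expand function and the +Fut tag are hypothetical, only two tenses and one person are listed, and the real lexicon compiler builds a transducer instead of enumerating strings.

```python
# Each sub-lexicon lists (upper, lower, continuation) entries;
# a continuation of None ends the word.
LEXICON = {
    "Root": [("pei", "pei", "NDRE"), ("crai", "crai", "NDRE"), ("joi", "joi", "NDRE")],
    "NDRE": [("ndre", "gn", "PresentIndic"), ("ndre", "ndr", "Future")],
    "PresentIndic": [(" +IndP +SG +P1", " +IndP +SG +P1", None)],
    "Future": [(" +Fut +SG +P1", " +Fut +SG +P1", None)],   # +Fut is an assumed tag
}

def expand(name):
    """Yield every (lexical, intermediate) pair reachable from a sub-lexicon."""
    if name is None:
        yield "", ""
        return
    for upper, lower, continuation in LEXICON[name]:
        for rest_upper, rest_lower in expand(continuation):
            yield upper + rest_upper, lower + rest_lower

pairs = dict(expand("Root"))
```

Three short root entries and one shared NDRE sub-lexicon are enough to produce the tense-specific stems of all three verbs, which is the compaction the continuation classes are meant to provide.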

This sub-lexicon defines the tense-specific alternations of the variable part of the stem in terms of transductions (the transduction between two forms is marked by a colon). For instance, the sub-string ndre transduces into gn in the present indicative. PresentIndic is itself a continuation class, common to all verbs, and points to the sequences of morphological features for the present indicative. Extra compaction is possible, as similar stem alternations are actually common to several tenses within the NDRE lexicon. Regular stem alternations (such as deleting the final e of the infinitive form in order to produce the stem for the future and conditional) can be handled by two-level rules as well as within the lexicon.

For the sake of completeness, one must mention stem alternations that are not predictable given the context. For instance, verbs ending in eter in the infinitive get either a geminated t or an open e (è) at certain tenses and persons:

    jeter   +IndP +SG +P3  ↔  jette
    acheter +IndP +SG +P3  ↔  achète

We mark such arbitrary alternations by diacritic symbols in the stem. Those symbols mark the position and the type of the alternation. For instance, the present indicative stem of jeter is jet+GEM, where +GEM marks the possible gemination. This same symbol is shared by verbs that otherwise belong to distinct paradigms. Two-level rules transduce the +GEM symbol into the geminated consonant wherever required, or into the empty (void) symbol. There is no need to mark the e ↔ è transduction of acheter, as the absence of the +GEM symbol is in itself a sufficient sign. About half a dozen symbols suffice to describe the set of possible arbitrary alternations in the whole verb system.

All other alternations are handled by contextual two-level rules that require no particular marking. For instance, in our lexicon, verbs like placer, manger, peser, modeler, assiéger, payer, broyer, which belong to distinct paradigms in Bescherelle, are coded exactly the same way as the typical regular verb aimer. All irregularities (such as the è in achète, the ç in plaçons or the e in mangeait) are described by fewer than 20 contextual rules.
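The role of the +GEM diacritic can be sketched as follows. Only the marker name and the examples jette and achète come from the text; the triggering context (a stem whose last vowel is e, followed by one of the endings e, es, ent), the MUTE_ENDINGS set and the function name realise are assumptions made for this illustration.

```python
GEM = "+GEM"
MUTE_ENDINGS = {"e", "es", "ent"}   # assumed contexts that trigger the alternation

def realise(stem, ending):
    """Realise a first-class stem, resolving the +GEM diacritic if present."""
    marked = stem.endswith(GEM)
    base = stem[:-len(GEM)] if marked else stem
    if ending in MUTE_ENDINGS and len(base) >= 2 and base[-2] == "e":
        if marked:
            base = base + base[-1]               # +GEM becomes the doubled consonant
        else:
            base = base[:-2] + "è" + base[-1]    # unmarked stems get an open è
    return base + ending                         # elsewhere +GEM is simply dropped

third_singular = [realise(s, "e") for s in ("jet+GEM", "achet", "pes", "aim")]
first_plural = [realise(s, "ons") for s in ("jet+GEM", "achet", "pes", "aim")]
```

Only jeter needs a mark in its stem; acheter and peser are written like aimer and receive their è from the same contextual rule.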

7. CONCLUSION

French verbs constitute a complex system. The approach described here shows that the apparent complexity can be reduced to a limited set of inflectional rules coupled with a small set of phonological alternations. Regular and irregular phenomena can be split out and described separately. The initial source lexicon encodes arbitrary signs (base forms, stems, unpredictable alternations, clitics and prefix positions and surface realizations), while the cascade of three two-level rule systems describes elementary alternations. These rules express broad linguistic generalizations about the phonological and orthographic system of French. Irregularities do not appear as capricious variations but rather as natural alternations easily described as finite-state transductions.

This methodology provides for a tractable and compact representation of the whole verb system. The cascade of transducers brings the advantage of rule ordering without the drawback of being unidirectional: the resulting lexical transducer can still be used both for generation and analysis. We believe this general architecture is reusable for other languages.

The composition of the initial lexicon with three sets of rules yields an extensive description of French verbs. The final lexical transducer has over 6000 states, 16000 arcs and over 10 million words (this last figure accounts for the productivity of prefixes and clitics), but is held in 50 kilobytes of disk space. Tractability is essential in order to develop morphological tools that can be easily maintained and integrated into more advanced NLP applications. The French lexical transducer is currently used in various projects, including tagging, finite-state syntax and the description of idioms and multiple-word expressions in the framework of the Locolex LRE project.

Acknowledgments: I benefited from the groundwork of Carol Neidle and Annie Zaenen on French morphology, of Ken Beesley on Spanish morphology, and from extensive advice from Lauri Karttunen, as well as comments from Ted Briscoe and Gregory Grefenstette.

REFERENCES

Antworth E. L. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing, No. 16, Summer Institute of Linguistics, Dallas, 1990.

Bescherelle. La Conjugaison. Hatier, Paris, 1990.

Kaplan R., Kay M. Phonological Rules and Finite-State Transducers. Linguistic Society of America Meeting Handbook, 1981.

Kaplan R., Kay M. Regular Models of Phonological Rule Systems. To appear in Computational Linguistics.

Karttunen L. KIMMO: a general morphological processor. Texas Linguistics Forum, 1983.

Karttunen L. Finite-State Constraints. Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1990.

Karttunen L., Kaplan R., Zaenen A. Two-level morphology with constraints. Proceedings of the fourteenth International Conference on Computational Linguistics, Nantes, France, 1992.

Karttunen L., Beesley K. R. Two-Level Rule Compiler. Technical report, Xerox Palo Alto Research Center, 1992.

Karttunen L., Yampol T. Interactive Finite-State Calculus. Technical report, Xerox Palo Alto Research Center, 1993.

Karttunen L. Finite-State Lexicon Compiler. Technical report, Xerox Palo Alto Research Center, 1993.

Karttunen L. Constructing Lexical Transducers. Forthcoming in the proceedings of the 15th International Conference on Computational Linguistics, Coling, Kyoto, 1994.

Koskenniemi K. Two-level morphology. A general computational model for word-form recognition and production. University of Helsinki, 1983.
