A Morphological Parser for Sinhala Verbs

T.N.E. Fernando, A.R. Weerasinghe
University of Colombo School of Computing, 35, Reid Avenue, Colombo 07, Sri Lanka.
Email: [email protected], [email protected]
Address: 161/14A, Lumbini Mavatha, Dalugama, Kelaniya, Sri Lanka

Abstract
This paper presents a morphological parser capable of analyzing and generating Sinhala verbs. Morphological analysis and generation play a vital role in many applications related to natural language processing, such as spell checkers, grammar checkers, intelligent information retrieval, machine translation and other complex applications. The parser consists of a lexicon of more than 400 verb stems and handles 45 inflectional rules for each stem. Analysis produces the verb stem together with feature tags depicting verb class, person, number, tense, gender, mood, voice, etc. The parser is modelled in the framework of two-level morphology using the Xerox finite state morphology tools. To our knowledge, this is the first such parser for Sinhala verbs.

Keywords: Morphology, Natural Language Processing, Parsing, Sinhala

1. Introduction
Morphology is an important area of computational linguistics that studies word structure: how words are formed and how they are related to other words. Morphological parsing is a computation which takes as input a derived form of a word and outputs the dictionary form of the word, and vice versa. It yields information that is useful in many NLP applications such as spell checkers, grammar checkers, machine translation and intelligent information retrieval. This paper presents a morphological parser that takes a Sinhala verb as input and outputs its stem plus detailed feature structure information. In generation mode, this functionality is reversed.

1.1 The Sinhala Language
Sinhala, a language belonging to the Indo-Aryan branch of the Indo-European languages, has a rich system of morphology. It is written in its own syllabic script, which is an offspring of the Brahmi script.

1.2 The Sinhala Verb Morphology
Morphological analysis, particularly in a morphologically rich language such as Sinhala, can reveal much information. For example, the verb නටමි natəmi (I dance) implies that the subject of the sentence from which this verb was extracted must be first person, singular. The Sinhala verb chiefly comprises a verb root and a set of auxiliaries which enhance the meaning given in the verb root. For a word to be identified as a verb, it must contain a verb root and its last morpheme (suffix) must denote action, i.e., it must be a 'verb suffix'. The suffix must be taken into account because even nouns can be created using verb roots [1]. For example, consider the following nouns, which all share the verb root බල balə (to look): බලන්නා balanna: (the person who looks), බැලීම bæli:mə (the act of looking), බලන්ෙනක් balannek (a person who looks). Owing to the complexity of Sinhala morphology, an exhaustive lexical listing is impractical: a single stem can generate more than 45 inflected verb forms. This is why there is a pressing need for a parser that uses the morphological system to compute the part of speech and inflectional categories of Sinhala words.

1.3 Approach
The parser was modelled using Koskenniemi's two-level morphology (TWOL) approach [2] and implemented with the Xerox finite state tool [3] (xfst version 8.1.3, non-commercial). The lexicon is programmed in the lexc language and contains over 400 verb stems collected from a section of the corpus. Orthographic and phonological rules are modelled in the regex language and cover 45 basic inflectional forms.


1.4 Scope
One reason for limiting this study to verbs is that covering the whole Sinhala language (nouns, adjectives, etc.) would require a very large amount of effort, resources and time, and thus falls outside the scope of an undergraduate study. In addition, a morphological parser for Sinhala nouns has already been developed by the LTRC, University of Colombo School of Computing [4]. The scope is based on the language rules presented by Prof. J. B. Disanayake in his book Kriya Pathaya, as it is one of the most comprehensive published studies of Sinhala verbs [1].

2. Morphology of Sinhala verbs

2.1 Types of Sinhala Verbs
Contemporary Sinhala linguistics categorizes verbs into several categories. According to Prof. J. B. Disanayake [1], verbs can be classified according to six factors. Of these, the following two were chosen as the main dividing factors for the verbs in this project: 1) Pure (ශුද්ධ) vs. Derived (සාධිත), and 2) Finite (අවසාන) vs. Non-Finite (අනවසාන). Several of the other categories were represented as morphological features. Four types of verbs were therefore considered for the parser: finite-pure, finite-derived, non-finite-pure and non-finite-derived. This grouping was chosen because the morphotactics and the orthographic rules that affect the Sinhala verb are broadly similar within each of these four classes.

Pure verbs vs. Derived verbs. Pure verbs are formed when a verb root (kriya: prəkurthi) such as balə is combined with a verb suffix (kriya: prathyə). Table 1 lists several examples of this type of verb.

Verb Root | Verb Suffix | Pure Verb
balə | mi (1st Person Singular) | baləmi
balə | mu (1st Person Plural) | baləmu
balə | i:mi (1st Person Singular) | bæli:mi
Table 1 : Pure verbs

A specialty of Sinhala is that not only roots but also a certain kind of verb can be used to form verbs. These verbs are called Krudanthə verbs. A Krudanthə verb is formed when a verb root is combined with a Krudanthə suffix; for example, nə, və, unə and unu are Krudanthə suffixes. Derived (sa:dhithə) verbs are formed by combining such a Krudanthə verb with a verb suffix. Table 2 lists several examples of this type of verb [1].

Krudanthə form | Verb Suffix | Derived Verb
baləna | emi (1st Person Singular) | balənnemi
bælu: | emi (1st Person Singular) | bæluvemi
baləna | e: | balənne:
Table 2 : Derived verbs

From the examples in Tables 1 and 2 it is clear that the suffixes used in the two verb forms also differ.

Finite verbs vs. Non-finite verbs. In Sinhala, the finite forms of a verb are the forms in which the verb shows tense, person, gender or number. A finite form must agree with the sentence's subject, and such verbs can form independent clauses which stand on their own as complete sentences. In contrast, non-finite verb forms show no person, tense, gender or number, so there is no agreement relation between a non-finite verb and the sentence's subject. For example, the sentence 'මම බත් කමි - I eat rice' contains the finite verb කමි (eat), and the sentence 'මා බත් කද්දී අම්මා මට කතා කළාය - Mom called me when I was eating rice' contains the non-finite verb කද්දී (eating).

2.2 Phonological & Orthographic Rules
In many languages, how roots and suffixes are fused together to form valid words is governed by a set of elaborate phonological rules [5]. Sinhala verb roots can be categorized into four classes depending on the way the verb roots are inflected [1]. Table 3 shows four examples, one from each of these classes.

Root Category | Singular Present | Plural Present | Singular Past | Plural Past
balə | balayi (yi) | baləthi (athi) | bæli: (i:) | bæli: (i:)
adi | adiyi (yi) | adithi (athi) | ædi: (i:) | ædi: (i:)
æle | æleyi (yi) | ælethi (athi) | æli: (i:) | æli: (i:)
ka | ka: (a:) | kathi (athi) | kæ: (æ:) | kæ: (æ:)
Table 3 : Verb forms of Sinhala

2.3 Verb Roots
Categorizing verb roots. Sinhala verb roots can be categorized into four basic groups according to their last vowel: the 'a group', 'e group', 'i group' and the 'irregular group'. The 'a group' consists of verb roots that end with the vowel 'a', the 'e group' of roots ending with 'e' and the 'i group' of roots ending with 'i'. The 'irregular group' is the collection of renegade roots that do not conform to any of the previous three categories [1]. Although the suffix list is broadly similar across the groups, the orthographic rules employed to fuse the roots and the suffixes together differ. The organization of the lexicon in the parser therefore follows this classification of verb roots.

3. Two Level Morphology for Sinhala
It has been shown in [6] that concatenation, composition and iteration are sufficient means for describing the morphology of languages with concatenative morphological processes. Therefore it was decided to employ the TWOL approach to model the Sinhala verb morphology.

3.1 Methodology
The methodology of encoding Sinhala verb morphology in the two-level morphology model is summarized as follows. The lexicon stores the collection of known Sinhala verb roots, such as balə and natə, and suffixes, such as mi, mu and emi. It also dictates the rules that specify the legal combinations of morphemes (the morphotactics) and is encoded as a finite-state network. The rules that determine the form of each morpheme (orthographical alternations), i.e., the spelling rules that model the changes that occur when two morphemes combine - for example the Sandhi rules concerning Sinhala verbs - are implemented as finite-state transducers. The lexical network and the rule transducers are then composed into a single network, a 'lexical transducer', that incorporates all the morphological information about the language, including the lexicon of morphemes, derivation, inflection, etc. Figure 1 shows the basic view of the morphological parser.

Figure 1 : Basic view of the parser
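The construction described above can be pictured as a short xfst script along the following lines. This is only a sketch: the file names, the variable names and the assumption that each rule file compiles to a single network are illustrative, not the project's actual build procedure.

    clear stack
    read lexc < verbs-a-group.lexc
    define Lexicon
    read regex < alternations-a-group.regex
    define Rules
    read regex Lexicon .o. Rules ;
    save stack sinhala-verbs.fst

Here read lexc compiles the morphotactics of one root class into a network, define pops it off the stack under a name, the alternation rules are compiled in the same way, and the composition Lexicon .o. Rules produces the single lexical transducer that is saved to disk.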

3.2 Xerox Finite State Transducer
Xerox has created an integrated set of software tools to facilitate linguistic development. These tools (xfst, twolc, lexc) are built on top of a software library that provides algorithms for creating automata from regular expressions and equivalent formalisms, and contains both classical operations, such as union and composition, and newer algorithms, such as replacement and local sequentialisation. Over the years the products of this research have come to be used all over the world in many linguistic applications such as morphological analysis, tokenization and shallow parsing of a wide variety of natural languages. The xfst tool has been licensed to over 70 universities world-wide, and many components have been incorporated into commercial software [7]. For this project, the non-commercial version was used.

Corpus. A corpus is important for training and testing the parser because it helps identify the forms actually in use in the contemporary language. For example, even though Sinhala textbooks list words such as kərəhi and baləhu (the 2nd person forms), these words are no longer in everyday usage. The project used the 7-million-word Sinhala corpus, containing 312,000 distinct words, put together by the University of Colombo School of Computing Language Research Laboratory. Since the corpus contains a mixture of all kinds of words (nouns, verbs, adjectives, etc.), the distinct verb forms were manually filtered out.

4. Implementation of the Parser

4.1 System Architecture
The steps involved in the operation of the parser can be summarized as follows:
1. The input text file is fed to the transliteration module. This file contains the Sinhala words that need to be analyzed or generated, and the text should be in Sinhala Unicode.
2. The transliterator converts the contents of the input text file into Romanized Sinhala text.
3. xfst (the Xerox finite state tool) is invoked and the compiled finite state network is loaded onto the stack. The transliterated input text file is given as the input to xfst's 'apply up' or 'apply down' command, depending on analysis or generation mode.
4. xfst analyzes/generates the input strings and the output is written to a text file in ANSI encoding.
5. The output file is processed by the result formatter and a formatted text file in Unicode encoding is produced.
6. The formatted output file is input to the transliterator and the transliterated output file is produced.
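Steps 3 and 4 correspond to an xfst session roughly like the one below. The network file name, the plain-ASCII spellings and the exact tag order are illustrative assumptions, not output copied from the system; real input first passes through the project's transliteration scheme.

    xfst[0]: load stack sinhala-verbs.fst
    xfst[1]: apply up balami
    bala+VFM+Pure+Pres+1P+Sg
    xfst[1]: apply down bala+VFM+Pure+Pres+1P+Pl
    balamu

apply up maps a surface verb to its stem and feature tags (analysis), while apply down runs the same network in the opposite direction (generation).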

4.2 Transliterator
Because the Xerox finite state tool version 8.1.3 does not support Unicode, an alternative was needed for representing the Sinhala script. A transliteration scheme was therefore designed for representing Sinhala characters in Roman notation inside the system. As a standardized transliteration scheme for Sinhala does not exist, the scheme used here is unique to this system. The transliteration program is written in Java; it takes the input text file and converts the Sinhala characters to Romanized script. Figure 2 shows the structure of the transliterator.

Figure 2 : Structure of Transliterator

The transliteration scheme is roughly based on the Roman notation for Devanagari presented in Natural Language Processing: A Paninian Perspective [8]. The complete transliteration scheme used in this project is listed in Appendix B.

4.3 Lexicon
The lexicon comprises several files programmed in lexc (the lexicon compiler). There is a lexicon file for each class of verb root, and these contain the verb roots and morphemes recorded during implementation and training. As mentioned in section 2.3, four basic verb root classes were identified. However, some of these groups can be further divided into subgroups depending on their orthography. For example, the roots adi අදි and ari අරි both belong to the 'i group', yet they behave differently in their past tense forms: adi අදි becomes ædda ඇද්ද while ari අරි becomes æriya ඇරිය. This phenomenon is not isolated, and since other roots display similar behaviour [1], such roots were placed in separate subgroups and a separate lexicon was created for each. Altogether there are 11 such lexicon files. The format of the lexicon files programmed in lexc is given in Figure 3.

Figure 3 : Format of the lexicon

The Multichar_Symbols declaration defines the tag set used inside the file. LEXICON Root is a reserved name corresponding to the start of the network. Other LEXICONs are defined as needed and are named according to the requirements of the grammar. The optional END keyword terminates the lexc description [3]. All morphemes (prefixes, roots, suffixes, etc.) used to build verbs are organized into sub-lexicons which reside inside the core lexicon. Lexicons are chiefly used for describing non-phonological stem-end alternations, so the language denoted by the lexicon is typically unfinished orthographically. That is, the lexical strings produced by the lexc grammars are actually intermediate strings and need some degree of modification before they can be candidates for the target language. These modifications may reflect orthographical conventions and/or phonological processes such as deletion, vowel frontation and epenthesis. An example is given in Figure 5.

Figure 5 : Intermediate strings in lexicon

The resulting surface string is transformed into a recognizable verb after passing through the phonological/orthographical layer containing the alteration rules, and the string further has to go through the filter layer to come out as a valid string in the target language. These two layers are explained in detail in sections 4.4 and 4.5. The separately compiled lexicon and rule network layers are subsequently composed together; Figure 4 illustrates this process.
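Before moving on to those layers, the lexc format can be made concrete with a heavily simplified fragment in the style described above. The continuation-class names, the tag order and the plain-ASCII spellings are illustrative assumptions and do not reproduce the project's actual lexicon files.

    Multichar_Symbols +VFM +Pure +Kru +Pres +1P +Sg +Pl

    LEXICON Root
    Verbs ;

    LEXICON Verbs
    bala   Suffixes ;                 ! verb root bala (to look)
    nata   Suffixes ;                 ! verb root nata (to dance)

    LEXICON Suffixes
    +VFM+Pure+Pres+1P+Sg:mi  # ;      ! lexical tags <-> surface balami
    +VFM+Pure+Pres+1P+Pl:mu  # ;      ! lexical tags <-> surface balamu
    +Kru:na+Kru              # ;      ! intermediate string balana+Kru

The last entry shows the kind of intermediate string mentioned above: its lower side still carries the +Kru marker, which is only removed by the replace rules of the next layer.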

Figure 4 : Parser as a transducer

4.4 Orthographical/Phonological Layer
The phonological/orthographical layer comprises .regex files that contain all the alternation rules needed to map the intermediate string to the ultimate target string (the surface string). These mappings are notated as xfst replace rules, compiled into transducers, and composed on the bottom of the lexical transducer. There are four such rule files, one for each verb category, dictating the orthographic changes related to that category. For example, in Figure 5 the first rule, [n-> {nn} || (vowel) %+Kru], dictates that every instance of the symbol 'n' that is followed by a vowel and/or the symbol '%+Kru' must be replaced by the symbols 'nn'. The second rule, [%+Kru->0], simply states that the intermediate tag '+Kru' is replaced by the empty string. The third rule, [vowel ->0 || vowel], instructs the program to delete any vowel that is followed by another vowel.
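Put together, a rule file of this kind can be sketched as a single composed regex. The reduced vowel inventory and the plain-ASCII spellings below are illustrative assumptions, not the project's actual rule files:

    define Vowel [ a | e | i | u ] ;
    read regex [ n -> {nn} || _ [ Vowel | %+Kru ] ]
           .o. [ %+Kru -> 0 ]
           .o. [ Vowel -> 0 || _ Vowel ] ;

Applying these rules in this order to an intermediate string such as balana+Kruemi (the Krudanthə form baləna, the marker +Kru and the suffix emi from Table 2) first geminates the n, then deletes the +Kru marker, and finally drops the vowel before emi, yielding balannemi, which matches the derived verb in Table 2.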

4.5 Morphotactic Filter Layer
It is impossible to lay down rigid rules describing the behaviour of a natural language, more so for a morphologically rich language such as Sinhala. The filter layer acts as a cleanup transducer on the surface side: it removes certain verbs that comply with the morphotactics and alternation rules but, due to irregularities, are nevertheless not part of the Sinhala language. For example, the verb root 'va' has a lexical form va+VFM+Derived+Past+3P+Sg+Mus (the finite derived verb form in past tense, 3rd person, singular, masculine) which corresponds to the surface strings 'vu:ve:yə' and 'vu:ye:yə'. However, there is only one surface form in the feminine version, namely 'vu:va:yə'. Even though the form 'vu:ya:yə' is generated according to the lexical rules, it is not an acceptable verb in the language. Trying to impose this irregularity inside the lexicon would break the design of the lexicon: such irregularities violate the principle that lexicons describe the rules of morphotactics while two-level rules formalize the distribution and the phonological relations of stem variants [3]. Stem-final alternations have often become individual properties of a word and are not predictable by phonological rules. Therefore a filter layer, filter.regex, is used on the surface side to remove the invalid verb form above.
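The mechanism can be sketched in xfst as composing the lexical transducer with a constraint that simply rejects the unwanted surface string. The network name and the flat ASCII spelling standing in for vu:ya:yə are hypothetical:

    read regex Lexical .o. ~$[ {vuuyaaya} ] ;
    save stack sinhala-verbs-filtered.fst

~$[...] denotes the set of strings that do not contain the given substring, so composing it on the lower side removes every path whose surface form contains it; the project's filter.regex presumably lists all such exceptional forms.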

4.6 The Tag Set
A set of multi-character symbol tags is used as the morphological and syntactic tags that convey information about part of speech, tense, mood, number, gender, etc. Multi-character symbols are treated as atomic entities by the regular expression compiler; that is, the multi-character symbol +VFM is stored and manipulated just like a, b and c. For example, the sigma alphabet of the network compiled from [{bala} "+VFM":0] consists of the symbols a, b, l and +VFM. Since a standardized tag set for Sinhala does not yet exist, a set of tags was formalized by consulting several Sinhala language textbooks. The complete set of multi-character symbol tags is given in Appendix A.
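This behaviour is easy to verify at the xfst prompt (an illustrative session):

    xfst[0]: read regex [ {bala} "+VFM":0 ] ;
    xfst[1]: print sigma

print sigma lists exactly the four symbols a, b, l and +VFM, confirming that +VFM is handled as a single atomic symbol rather than as a sequence of characters.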

4.7 Parser Output
The parser is capable of analyzing and generating strings at the word level, both in Unicode Sinhala and in Romanized form using the project's transliteration scheme. Samples of verbs that have been correctly analyzed are given below:
• කළ (kələ) => කර +Kru+Past - Krudantha Past tense of root කර (kərə).
• කෙළේ (kəle:) => කර+VNF+Derived+Dec - Non Finite, Derived, Declarative form of root කර (kərə).
• ගිෙය්ය (giye:yə) => ය +VFM+Derived+Past+3P+Sg+Mus - Finite, Derived, Past tense, 3rd Person, Singular, Masculine verb of root ය (yə).
Likewise, some examples of verb forms that have been accurately generated are given below:
• ලබ +VFM+Pure+Past+3P+Sg => ලැබීය (læbi:yə) - Finite, Pure, Past tense, 3rd Person, Singular verb of root ලබ (labə).
• හිත +VNF+Derived+Nom => හිතනවා (hithənəva:) - Non Finite, Derived, Nominative verb of root හිත (hithə).
• කිය+VFM+Derived+Past+3P+Sg+Fem => කීවාය (ki:va:yə) - Finite, Derived, Past tense, 3rd Person, Singular, Feminine verb of root කිය (kiyə).

4.8 Problems & Challenges
Insufficient domain knowledge: The primary difficulty faced was the lack of linguistic expertise. Since the author's knowledge of Sinhala verb structure was limited, the first task was to learn the language features. As the study and understanding of Sinhala's linguistic structures was very important in designing a linguistic model, a thorough study had to be done before going into implementation.

Lack of standards: There is a marked lack of standards for Sinhala linguistics. While numerous valuable textbooks exist, there is no agreed set of rules and conventions for the language. It was finally decided to adopt Prof. J. B. Disanayake's [1] structure for the project.

Absence of a tagged corpus: Because there is not yet a tagged corpus for the Sinhala language, considerable effort had to be spent on manually extracting the distinct verb list from the corpus. Due to the author's limited expertise on the Sinhala verb, some verbs may have been omitted from the verb list in the process, and several words that are not verbs may have been added.

Unicode support: The Xerox compiler version 8.1.3 does not have Unicode support. Since it was not possible to acquire a version that does, a way to incorporate the Sinhala language without using Unicode had to be found. A transliteration scheme was designed and implemented as the solution.

5. Experiments & Results

5.1 Training
Methodology. The initial verb roots and rules were acquired from the textbook Kriya Pathaya [1] and were used during the development of the parser. After completing the implementation, the corpus was used to acquire words for training. First, a list of verbs was manually filtered out of the corpus; since the corpus itself contained distinct words, the verb list also comprised distinct verbs. Next, 700 verbs out of the list of 1631 were chosen for training. The training set was acquired by grouping the corpus into 200-word sets and extracting every other set; thus, the training set was formed from the verbs 1-200, 401-600, 701-900, and 901-1000. This was done because an allowance for different data domains was needed: it was assumed that the corpus contained data from different domains and that words which are spatially close display similar morphotactic characteristics. The training results with respect to the number of errors are shown in Figure 6. Although this gives a general picture, it must be noted that the training set itself was not completely accurate, i.e., the verb list contained several manual processing errors such as non-verbs and compound words (e.g., සිදුකරන sidukərənə).

Figure 6 : Overall results in training

Corrections. The corrections that were performed were twofold: adding new verb roots and adding new rules. The first correction, adding a new root, was done whenever a variation of a verb root that was not in the lexicon was encountered. The new roots that were added are illustrated as a line chart in Figure 7.

Figure 7 : Roots added during training

The second correction, the addition of a new rule, was done when the rule encountered was a variation of a rule that was already compiled; if the rule was completely novel, it was ignored. For example, the non-finite, pure, Lakshya verb of the verb root ය 'ya' was initially defined as යන්ට 'yanta'. However, during training several other variations of this rule were found, and after the corrections the parser gives four generations for this form:
• යෑමට yæ:mətə
• යාමට ya:mətə
• යන්නට yannətə
• යන්ට yantə
Figure 8 shows the new rules that were added during training. The number of new rules encountered decreased as training progressed, while the number of new roots increased. Furthermore, the total number of new roots found far outweighs the total number of new rules added. This is due to the corpus structure, which is frequency based: the verbs at the top of the corpus are the highest-occurring verbs in the language.

Figure 8 : Rules added during training

5.2 Error Prediction
From a total of 700 words, the parser analyzed 495 words correctly, placing the number of errors at 205, excluding non-verbs. Using these statistics, the expected error for unseen data was predicted at 29%:

Predicted error = (Total number of errors / Total number of words parsed) x 100
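Substituting the counts reported above gives (205 / 700) x 100 ≈ 29.3%, which rounds to the quoted 29%.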

5.3 Testing


A set of unseen data was chosen from the remaining 200-word sets that were not used for training. The test data thus consisted of 300 unseen verbs. Without taking into account the number of non-verbs, which was found to be 6, the testing results are given in Table 4.

Verb set (in corpus) | Total number of errors | Number of errors due to new roots | Number of errors due to new rules
201-300 | 24 | 14 | 10
301-400 | 27 | 14 | 13
601-700 | 39 | 20 | 19
Table 4 : Testing results

According to these figures, out of the set of 300 test words there were 90 parse failures. After removing the non-verbs (25 in number) in the test set, the input set came down to 275 words. This gives an actual error rate of 32.73% and a parse rate of 67.27%. However, it is evident that a majority of the failures were caused by verb roots that were not in the system (see Table 4). Ideally, the verb roots in the language could have been obtained using a stemmer, but such a stemmer for Sinhala does not yet exist. Therefore, if the error caused by data sparseness, i.e., new roots, were to be eliminated, an estimate of the true error rate can be derived. This was calculated by removing the number of failures due to roots that were not in the lexicon, which resulted in an error rate of 15.27% and thus a success rate of 84.73%. This is summarized in Figure 9.
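As a cross-check of these figures from Table 4: the total errors are 24 + 27 + 39 = 90, of which 14 + 14 + 20 = 48 are due to new roots. On the 275-word input set this gives 90 / 275 ≈ 32.73% overall error (a 67.27% parse rate), and removing the 48 new-root failures leaves 42 errors, i.e., 42 / 275 ≈ 15.27%.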

Figure 9 : Results of testing

In the testing set too, the number of new roots encountered has increased. In accordance with the training results, this can be explained by examining the distribution of the corpus (see Figure 10).

Figure 10 : Verb distribution

Because the high-frequency verbs share common roots, the actual number of verb roots is lower at the top of the corpus than at the bottom.

5.4 Evaluation
The actual error rate of the system was found to be 32.73% (see section 5.3) while the predicted error was 29% (see section 5.2); the actual error thus differs from the predicted error by 3.73 percentage points. Several reasons for this deviation can be deduced from the results.

Firstly, since the corpus is organized in a frequency-based structure, the words at the top are the most frequently occurring words in the language. Furthermore, this frequency-based structure ensures that the top part of the corpus contains different verb forms sharing the same verb root. Therefore, the number of new rules learnt decreases as data from the lower portions of the corpus are trained, while the number of new roots learnt increases. Thus, the further down a word is situated in the corpus, the less likely it is to be covered by the lexicon.

Irregular verbs in the language could be another culprit for the deviation. It is not so much the existence of irregular verb roots, but rather the fact that even regular verb roots often have irregular forms associated with them. For example, the verb root දකි daki is a common and regular root in the language, yet it behaves irregularly in instances such as its Krudantha past tense form: the regular verb of this specification is දැක්කා dækka:, while it also has another form දුටු dutu. The other verb roots in this root's class, such as අදි adi, අමදි amadi and පිරිමදි pirimadi, do not display this behaviour. Such complicated irregularities are not easily tackled by linguistic generalizations.

The third cause for the deviation could be negated and compound verbs. In Sinhala, negation occurs when a 'no' prefix is added to the verb root. However, as negation is not handled in the current implementation, negated verbs fail at the analyzer even if the same verb form is accepted without the negation. For example, the verb කියා kiya: is correctly analyzed by the parser as kiyə+VNF+Pure+Pri, yet the parser fails on its negated form ෙනොකියා nokiya:. Furthermore, there is a debate about whether certain verbs in Sinhala should be treated as two separate words or one word. For example, the verb කරෙගන kərəgenə is sometimes written as one word, while in some texts it is given as two words: කර ෙගන kərə genə. Since the convention used in the parser is to treat it as two words, the parser fails when it encounters instances in the corpus where it is written as one word.

6. Related work
Several pieces of work on morphological parsing have been carried out for Sinhala and Sanskrit. For Sinhala, a parser for nouns has been developed by Herath et al. [4], which uses stemming for analysis. For Sanskrit, Goyal et al. [9] and Anupam [10] have shown that by employing deterministic finite automata (DFA) to construct the rule base it is possible to parse Sanskrit text efficiently, and their systems claim to handle 10 verb classes. Goyal et al. [9] used a separate 'Sandhi' module to break up the sandhis before inputting the text to the DFA. Girish et al. [11] used a strictly rule-based database and POS-tagging approach for parsing Sanskrit without using DFAs, and claim to cover the verb forms of 450 commonly used verb roots. Goyal et al. [9], Anupam [10] and Girish et al. [11] have all used the Devanagari format and the Paninian framework [8] in their construction of Sanskrit morphology. None, however, claims to function as a two-way parser, i.e., to generate as well as analyze.

7. Conclusions
The research work on the project concludes with several points supported by its experimental results. The main conclusion is that the two-level parsing method can successfully be applied to develop a morphological parser for Sinhala verbs. It is also evident that the Xerox finite state tools can be used to implement the lexical transducers that encode the language's morphology. At the time of writing, a tagged corpus for Sinhala does not exist; for the approach followed in this parser, a tagged corpus is not necessary since it is a rule-based method, so this is a suitable method for languages and applications where a tagged corpus is not available. Using the two-level morphology approach is beneficial as it supports both analysis and generation without the rules having to be implemented explicitly in both directions. It is also highly extensible, since new roots and new rules can easily be added to the system. In addition, extensive linguistic research and modeling is imperative for the success of this type of project.

Although the primary goal of the research was achieved, there are a number of areas where improvements and extensions can be added:
• The verb rules considered in this project are those documented in Prof. J. B. Disanayake's book Kriya Pathaya [1]. For practical purposes, thorough research needs to be done to derive a complete set of rules for the Sinhala verb.
• Currently, the parser fails for unknown verb roots, i.e., verb roots that are not stored in the lexicon. However, it may be possible to incorporate a guesser into the system that could give an approximate result by segmenting the input and trying to locate patterns according to a set of rules and known verb roots.
• Misspelled words and compound words are also not handled in the current implementation. A rule-based segmenting algorithm could be developed for the division of such compound words. For misspelled words, the parser could suggest known verbs that have the minimum edit distance from the misspelled string.
• The current system contains a single finite state network that encodes the morphological and grammatical information of Sinhala without concentrating on a specific domain. However, it would be beneficial to have several networks, with a core network encoding the basic morphology. The other networks would extend the core network and could be used to focus on specific domains, multiple orthographies, multiple levels of strictness, etc.

Appendix A
The Tag Set

+VFM = Finite
+VNF = Non Finite
+Pure = Pure
+Derived = Derived
+Kru = Krudantha
+Sg = Singular
+Pl = Plural
+1P = First Person
+2P = Second Person
+3P = Third Person
+Pres = Present Tense
+Past = Past Tense
+Command = Command
+Invol = Involitive
+Mus = Masculine
+Fem = Feminine
+Cont = Continuous
+Mix = Mixed
+Pos = Positive
+Neg = Negative
+Cond = Conditional
+Avas = Avasthika
+Lakshya = Lakshya
+Pri = Prior
+Bhv = Bhavaroopa
+Pra = Prayojya
+Anan = Ananthara
+Dec = Declarative
+Cau = Causative
+Nom = Nominative

Appendix B The Transliteration Scheme

◌ං අ ඇ ඉ

H a æ I

◌ඃ ආ ඈ ඊ

M A Æ I

ෙ◌ ෛ◌ ෙ◌ෝ ◌ෟ

e å O -

උ ඍ එ ඓ ඕ ක ඝ ච ඣ ඦ ඩ ඬ ද ඳ බ ඹ ල ෂ ළ ◌ා ◌ි ◌ූ

u R e ã O ka Ga ca Ja ôa da Fa xa Va ba Sa la Za La A I U

ඌ ඎ ඒ ඔ ඖ ඛ ඞ ඡ ඤ ට ඪ ත ධ ප භ ය ව ස ෆ ◌ැ ◌ී ◌ෘ

U RR E o au Ka ña Ca qa ta Da wa Xa pa Ba ya va sa fa æ I ß

◌ෳ ෙ◌ේ ෙ◌ො ෙ◌ෞ ◌ෲ ග ඟ ජ ඥ ඨ ණ ථ න ඵ ම ර ශ හ ◌් ◌ෑ ◌ු

E o à õ ga Ya ja Qa Ta Na Wa na Pa ma ra za ha Æ u

Acknowledgment
I am deeply indebted to Dr. A. R. Weerasinghe for supervising this project and offering me valuable insight from time to time. Many thanks to Dr. Lalith Premaratne for his guidance throughout the course of this work as the examiner of my project. I am sincerely grateful to Dr. Chamath Keppetiyagame for co-ordinating the research projects and advising on the Sinhala LaTeX system. I am also thankful to Mr. Dulip Herath and all the members of the Language Research Laboratory for helping me clarify many linguistic aspects of Sinhala grammar and giving me access to many resources, and to Mr. Harshula Jayasuriya for explaining the nuances of the LKLUG keyboard layout. Last but not least, my special thanks go to my family and all my colleagues for their encouragement and understanding.

References
[1] J. B. Disanayake. Basaka Mahima: 11 - Kriya Pathaya. S. Godage & Brothers, 2001.
[2] Kimmo Koskenniemi. A general computational model for word-form recognition and production. Stanford, California: Association for Computational Linguistics, 1984.
[3] Kenneth R. Beesley, Lauri Karttunen. Finite State Morphology. CSLI Publications, 2003.
[4] D. L. Herath, A. R. Weerasinghe. A stemming algorithm to analyze inflectional morphology of Sinhala nouns. Unpublished.
[5] Kimmo Koskenniemi. Two-level morphology: A general computational model for word-form recognition and production. University of Helsinki, Publication No. 11, 1983.
[6] Lauri Karttunen, Kenneth R. Beesley. A short history of two-level morphology. 2001.
[7] Xerox Research Centre. [Online] http://www.xrce.xerox.com/.
[8] A. Bharathi et al. Natural Language Processing: A Paninian Perspective. New Delhi: Prentice-Hall of India, 1996.
[9] Pawan Goyal, Vipul Arora, Laxmidhar Behera. Analysis of Sanskrit text: Parsing and semantic. Rocquencourt, France, 2007.
[10] Anupam. Sanskrit as Indian networking language: A Sanskrit parser. 2004.
[11] Girish Nath Jha, Muktanand Agrawal, Subash, Sudhir K. Mishra, Diwakar Mani, Diwakar. Inflectional Morphology Analyzer for Sanskrit. Rocquencourt, France, 2007.