A COMPUTATIONAL GRAMMAR OF SINHALA FOR ENGLISH-SINHALA MACHINE TRANSLATION

B. Hettige

(08/8021)

Degree of Master of Philosophy

Department of Information Technology

University of Moratuwa Sri Lanka

December 2010

A COMPUTATIONAL GRAMMAR OF SINHALA FOR ENGLISH-SINHALA MACHINE TRANSLATION

Budditha Hettige

(08/8021)

Thesis submitted in partial fulfillment of the requirements for the degree Master of Philosophy

Department of Information Technology

University of Moratuwa Sri Lanka

December 2010

Declaration of the Candidate and the Supervisor

I declare that this is my own work and this thesis does not incorporate, without acknowledgement, any material previously submitted for a Degree or Diploma in any other University or institute of higher learning, and to the best of my knowledge and belief it does not contain any material previously published or written by another person except where acknowledgement is made in the text. Also, I hereby grant to the University of Moratuwa the non-exclusive right to reproduce and distribute my thesis, in whole or in part, in print, electronic or other medium. I retain the right to use this content in whole or part in future works (such as articles or books).

Signed: ………………………………        Date: …………………..
Budditha Hettige (Candidate)

The above candidate has carried out research for the M. Phil. dissertation under my supervision.

……………………………………………..        Date: …………………..
Prof. Asoka S. Karunananda

…………………………………………….        Date: ……………..
Dr.

Abstract

Communication is fundamental to the evolution and development of all living beings. Without dispute, languages should be recognized as the most amazing artifacts ever developed by mankind to enable communication. The computer has also become a unique machine because of its capacity to communicate with humans through languages. It is worth noting that the languages understood by computers and by humans are quite different, yet people can still communicate with computers. This is possible because the computer is fundamentally an artifact that can translate one language into another. Therefore, computers should be more capable of language translation than of any other computing task. Nowadays, computing is evolving to enable machine-to-machine communication with little or no human intervention, yet humans continue to face what is called the language barrier. In particular, a vast collection of world knowledge written in English remains inaccessible to communities who cannot communicate in English, and such communities are unable to contribute to the development of world knowledge because of the language barrier. As a result, many people have embarked on research into computer-aided natural language translation, an area commonly known as Machine Translation. Among others, Apertium, Babel Fish, Google Translator, SYSTRAN, EDR, Anusaaraka, AngalaHindi, AngalaBharti and Mantra are examples of popular machine translation systems. These systems use various approaches, including Human-assisted, Rule-based, Corpus-based, Knowledge-based, Hybrid and Agent-based, to translate from one language to another. However, due to the inherent diversity of natural languages, a generic machine translation approach is far from reality.

This thesis presents a computational grammar for the Sinhala language, used to develop an English to Sinhala machine translation system with an underlying theoretical basis. The system is known as BEES, an acronym for Bilingual Expert for English to Sinhala machine translation. The concept of Varanegeema (conjugation) in the Sinhala language has been taken as the philosophical basis of this approach to the development of BEES. Varanegeema handles a large number of language primitives associated with nouns and verbs, such as person, gender, tense, number, preposition and subjectivity/objectivity. More importantly, Varanegeema allows all associated word forms to be derived from a given base word, which drastically reduces the size of the Sinhala dictionary. Since the concept of Varanegeema can be expressed as a set of rules, it fits naturally into a rule-based implementation of a machine translation system. BEES implements 85 grammar rules for Sinhala nouns and 18 rules for Sinhala verbs. BEES comprises seven modules, namely the English Morphological Analyzer, English Parser, English to Sinhala Base Word Translator, Sinhala Morphological Generator, Sinhala Parser, Transliteration module and Intermediate Editor. In addition to the main modules, the system comprises four dictionaries, namely the English dictionary, the Sinhala dictionary, the English-Sinhala Bilingual dictionary and the Concept dictionary. BEES primarily shares features with the Rule-based, Context-based and Human-assisted approaches to machine translation. BEES has been implemented using Java and SWI-Prolog to run on both Linux and Windows environments.

The English to Sinhala machine translation system BEES has been evaluated to test the hypothesis that the concept of Varanegeema can be used to drive English to Sinhala machine translation. The evaluation was carried out in three steps. As the first step, all the language processing modules, including the morphological analyzers, parsers, translator and transliteration module, were tested through a white-box testing approach. In order to test each module, several online testing tools, including the English morphological analyzer, English parser and Sinhala word generator, were implemented, and each module was tested completely against a carefully created test plan. In addition, an online evaluation test bed was implemented to continuously capture feedback from online users; it allows different types of sentences to be made from a given set of words. The Word Error Rate and the Sentence Error Rate were calculated from these evaluation results. Finally, intelligibility and accuracy tests were conducted with human support. In order to evaluate the intelligibility and accuracy of the system, two hundred sample sentences were collected and grouped into 20 sets (10 sentences per set). Each sentence was translated using the English to Sinhala machine translation system, and each set was given to human translators and scored. The intelligibility and accuracy were calculated from the above evaluation results. The experimental results show that the English morphological analyzer, English parser, English to Sinhala base word translator, Sinhala morphological generator and the Sinhala sentence generator each work with more than 90% accuracy. The overall result of the evaluation shows 89% accuracy, with a word error rate of 7.2% and a sentence error rate of 5.4%. BEES successfully translates English sentences with simple or complex subjects and objects, and handles the most commonly used tense patterns, including active and passive voice forms.


Acknowledgements

This thesis is the result of four years of devoted work, during which I have been accompanied and supported by many people. It is a pleasure to now have the opportunity to express my gratitude to all of them. I am grateful to the University of Moratuwa, especially to the Faculty of Information Technology, for providing me the opportunity to undertake this research. The first person I would like to thank is my supervisor, Prof. Asoka Karunananda, for whom a few lines are too short to give a complete account of my deep appreciation. This study would not have been such a success without his common-sense knowledge and perceptiveness, and I owe him much gratitude for showing me this way of research. Apart from being an excellent supervisor, Prof. Karunananda has been an understanding teacher who supported me in every aspect of this research. I am also grateful to Dr. Sarath Bannayake, Head, Department of Statistics and Computer Science, University of Sri Jayawardenepura, for the assistance he gave me during the research work. With great pleasure and a deep sense of gratitude, I acknowledge Mr. P. Dias, former Head and Senior Lecturer, Department of Statistics and Computer Science, University of Sri Jayawardenepura, for his great help in devising the evaluation method. I would also like to thank Mr. Niranjan Bandara, Lecturer, Department of Sinhala and Mass Communication, University of Sri Jayawardenepura, for his valuable support in resolving some Sinhala language issues. With great pleasure and a deep sense of gratitude, I also thank Venerable Kirioruwe Dhamananada Thera, Venerable Kukulpane Sudassi Thera and Venerable Mattumagala Chandanada Thera for their valuable support in solving Sinhala and English language problems by sharing their knowledge of Sinhala, Pali and Sanskrit language structures.


I am deeply indebted to Mr. Duminda de Silva, Head, Department of Mathematics and Computer Science, The Open University of Sri Lanka, for the encouragement he extended to me throughout this study. I wish to extend my sincere gratitude to Ms. G. S. Makalanda, Dr. T. G. I. Fernando and Dr. E. A. T. A. Edirisooriya for their great support and encouragement throughout this study. My deepest gratitude goes to my mother and my wife for their unconditional support; without them, this would have been impossible. Again, I must give a big thank you to my wife Lakshimi for tolerating the busy schedules caused by the research work. Last but not least, I thank everyone who supported me in making this work a success.

January 3, 2011

Budditha Hettige


Table of Contents

Declaration of the Candidate and the Supervisor
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
  1.1 Preamble
  1.2 English to Sinhala Machine Translation
  1.3 What are Machine Translation Systems?
  1.4 Aim of the Research
  1.5 Objectives of the Research
  1.6 Scope of the Project
  1.7 Hypothesis
  1.8 Structure of the Thesis
  1.9 Summary

Chapter 2 State of the Art of Machine Translations
  2.1 Introduction
  2.2 Fundamentals of Natural Language Processing
  2.3 Machine Translation Systems
  2.4 Current Approaches to Machine Translation
    2.4.1 Human-assisted Machine Translation
    2.4.2 Rule-based Machine Translation
      2.4.2.1 Transfer-based Machine Translation
      2.4.2.2 Interlingua Machine Translation
      2.4.2.3 Dictionary-based Machine Translation
    2.4.3 Statistical Machine Translation
    2.4.4 Example-based Machine Translation
    2.4.5 Knowledge-based Machine Translation
    2.4.6 Hybrid Machine Translation
    2.4.7 Agent-based Machine Translation
  2.5 Existing English to Sinhala Machine Translation Systems
  2.6 Concepts and Techniques for Machine Translation
    2.6.1 Morphological Analysis
    2.6.2 Syntax Analysis
  2.7 Problem Definition
  2.8 Summary

Chapter 3 Overview of the English and Sinhala Languages
  3.1 Introduction
  3.2 The English Language
  3.3 The English Language Morphology
    3.3.1 English Noun Morphology
    3.3.2 English Verb Morphology
    3.3.3 English Adjective Morphology
  3.4 Syntax of the English Language
    3.4.1 The English Sentence Subject
    3.4.2 The English Predicate
    3.4.3 Verb Tense
    3.4.4 The Complement
  3.5 Semantics of the English Language
    3.5.1 Word Level Semantics
    3.5.2 Sentence Level Semantics
    3.5.3 Paragraph Level Semantics
  3.6 The Sinhala Language
    3.6.1 Sinhala Alphabet
  3.7 Sinhala Language Morphology
    3.7.1 Sinhala Noun Morphology
    3.7.2 Sinhala Verb Morphology
  3.8 Syntax of the Sinhala Language
  3.9 Semantics of the Sinhala Language
  3.10 Comparison Between English and Sinhala
    3.10.1 Fundamental Differences
    3.10.2 Morphological Differences
    3.10.3 Syntax in the Two Languages
  3.11 Language Issues
    3.11.1 Grammatical Issues
    3.11.2 Text Manipulation Issues
  3.12 Challenges in English to Sinhala Machine Translation
    3.12.1 Word and Sentence Segmentation
    3.12.2 Lexical Selection
    3.12.3 Conjugation
    3.12.4 Tense Detection
    3.12.5 Article Insertion
    3.12.6 Sentence Boundaries
    3.12.7 Word Order
  3.13 Summary

Chapter 4 Novel Approach to Machine Translation
  4.1 Introduction
  4.2 A Theoretical-based Approach to Machine Translation
  4.3 Computational Model of Grammar for Sinhala
    4.3.1 Computational Model for Sinhala Morphology
    4.3.2 Context-Free Grammar for the Sinhala Language
  4.4 Hypothesis
  4.5 Approach in a Nutshell
  4.6 Features of BEES
  4.7 Input for BEES
  4.8 Output of BEES
  4.9 Process of BEES
  4.10 Summary

Chapter 5 Design of BEES
  5.1 Introduction
  5.2 Design of BEES
    5.2.1 English Morphological Analyzer
    5.2.2 English Parser
    5.2.3 English to Sinhala Base Word Translator
    5.2.4 Sinhala Morphological Generator
    5.2.5 Sinhala Parser
    5.2.6 Transliteration Module
    5.2.7 Intermediate Editor
    5.2.8 Lexical Resources
  5.3 Supporting Modules
    5.3.1 Dictionary Updater
    5.3.2 Sinhala Word Generator
    5.3.3 Online Search Module
  5.4 Summary

Chapter 6 Implementation
  6.1 Introduction
  6.2 Development Stages
  6.3 Implementation of BEES
    6.3.1 English Morphological Analyzer
    6.3.2 English Parser
    6.3.3 English to Sinhala Bilingual Translator
    6.3.4 Sinhala Morphological Generator
    6.3.5 Sinhala Sentence Composer
    6.3.6 Transliteration Module
    6.3.7 Intermediate Editor
    6.3.8 Lexical Resources
      6.3.8.1 English Dictionary
      6.3.8.2 Sinhala Dictionary
      6.3.8.3 English-Sinhala Bilingual Dictionary
      6.3.8.4 Concept Dictionary
  6.4 Supporting Modules
    6.4.1 Online Updater
    6.4.2 Sinhala Word Generator
    6.4.3 Online Search Module
  6.5 Summary

Chapter 7 BEES in Action
  7.1 Introduction
  7.2 BEES as an Online Translator
  7.3 BEES as a Web Page Translator
  7.4 BEES as a Selected Sentence Translator
  7.5 BEES as a Desktop Application
  7.6 Summary

Chapter 8 Evaluation
  8.1 Introduction
  8.2 Evaluation of MT Systems
  8.3 BEES Evaluation
  8.4 Stage 1: Module Testing
    8.4.1 English Morphological Analyzer
    8.4.2 English Parser
    8.4.3 English to Sinhala Base Word Translator
    8.4.4 Sinhala Morphological Generator
    8.4.5 Sinhala Sentence Composer
    8.4.6 Transliteration Module
  8.5 Stage 2: Performance Testing
  8.6 Stage 3: Accuracy Testing
  8.7 Result of the Experiments
  8.8 Summary

Chapter 9 Conclusion and Further Work
  9.1 Introduction
  9.2 Revisited Objectives
  9.3 Limitations
  9.4 Further Work
  9.5 Summary

References
Appendix A: English Morphological Analyzer - Test Plan
Appendix B: Conjugation Table for the Sinhala Language
Appendix C: Context-Free Grammar for the Sinhala Language
Appendix D: Finite State Transducer for Sinhala Transliteration
Appendix E: Sample Evaluation Form
Appendix F: Sample of Evaluators' Comments

List of Figures

Figure 2.1: Architecture for a rule-based machine translation system
Figure 4.1: Finite State Automata for Kaputu Ganaya
Figure 4.2: Parse tree for the sample sentence
Figure 5.1: Design of the BEES
Figure 5.2: FST for vowels in model 1 transliteration
Figure 5.3: Design of the three supporting modules
Figure 6.1: The Intermediate Editor
Figure 7.1: Web-based architecture for the BEES
Figure 7.2: User interface of the Online BEES
Figure 7.3: A web page translator
Figure 7.4: BEES as a web page translator
Figure 7.5: Selected sentence translator
Figure 7.6: Desktop screen for selected sentence translation
Figure 7.7: User interface of the BEES
Figure 8.1: English Morphological Analyzer with test results
Figure 8.2: Sinhala word conjugator
Figure 8.3: User interface of the evaluation test bed
Figure 8.4: Online evaluation form
Figure 8.5: Translation accuracy

List of Tables

Table 2.1: Existing machine translation systems
Table 3.1: Regular and irregular forms of the English noun
Table 3.2: English noun morphological rules
Table 3.3: English verb morphology
Table 3.4: Morphological rules for English verbs
Table 3.5: Tense patterns (active voice)
Table 3.6: The Sinhala alphabet
Table 3.7: Vocalic strokes and their positions
Table 3.8: The consonant 'l' with vocalic strokes
Table 3.9: Sample case markers in Sinhala
Table 3.10: Conjugation table for 'we;a' ganaya
Table 3.11: Inflection forms of the Sinhala verbs (active)
Table 3.12: Inflection forms of the Sinhala verbs (passive)
Table 4.1: Paradigm table for Kaputu Ganaya
Table 6.1: Grammatical notations for the English Dictionary
Table 8.1: Sample test plan for English Morphological Analyzer
Table 8.2: Sample test plan for English parser
Table 8.3: Sample Sinhala morphological rules
Table 8.4: Results for module testing
Table 8.5: Human evaluation results
Table 8.6: Accuracy results
Table 8.7: Final evaluation results

Chapter 1 INTRODUCTION

1.1 Preamble

A natural language is one of the most marvellous artifacts ever invented by mankind and is the cornerstone of all kinds of communication. Each natural language describes the thoughts of humans in a particular environment; as such, a natural language has a strong bearing on the culture and the environment within which a certain community lives. This is why a large number of different natural languages can be identified worldwide. Despite the differences in languages, people still want to communicate with those who use other languages, and these differences have become a barrier to cross-cultural communication. In particular, many nations cannot access the huge reservoir of world knowledge written in English unless they have a sound knowledge of English. On the other hand, people who do not know English are unable to contribute to world knowledge. The importance of the mother tongue for the discovery and creation of new systems of knowledge is indisputable. Consequently, this has resulted in what is called the language barrier to communication. In fact, this issue arises not only between English and other languages, but between any two languages. Of course, people have long practised a solution to the issue: translation between two languages by someone who knows both. However, can we really expect everyone to know every language? Undoubtedly, this is impractical. The emergence of digital computer technology in the early 1950s gave rise to the concept of machine translation, seeking assistance from computers to address this long-felt language need. Since then, hundreds of research projects have been conducted on translation between natural languages. Machine translation is a branch of Natural Language Processing, which comes under the broad area of Artificial Intelligence. It is commonly cited that machine translation has been one of the least achieved areas in Artificial Intelligence over the last sixty years. As such, a generic approach to machine translation has remained an unrealized dream of researchers, and machine translation approaches have become largely language specific.

1.2 English to Sinhala Machine Translation

This thesis presents research conducted to develop an English to Sinhala machine translation system. Sinhala is one of the Indo-Aryan family of languages and is spoken by 74% of the people in Sri Lanka. Sinhala is also one of the constitutionally recognized official languages of Sri Lanka [53]. Statistical results show that more than 80% of the Sinhala-speaking community cannot read and write English [46][126]. While encouraging the learning of English, one also cannot devalue the importance of the mother tongue for the discovery of knowledge for the betterment of mankind. In the Asian region, many countries including India, Thailand, Malaysia and Japan have conducted a considerable amount of research on machine translation. Although Sri Lanka has been working on various machine translation projects, it still lags a little behind similar research conducted in the Asian region. Weerasinghe [154] pioneered machine translation research in Sri Lanka. This project will therefore contribute to extending machine translation initiatives in Sri Lanka. The project presents a theoretically based translation approach, which would also be beneficial to machine translation projects that handle languages close to Sinhala. Before presenting the aim and objectives of the project, a brief introduction to the field of machine translation is given in Section 1.3.

1.3 What are Machine Translation Systems?

A machine translation system is computer software that translates text or voice from one natural language into another, with or without human assistance [73][154]. According to their design, machine translation systems can be broadly categorized into two groups, namely direct translation systems and indirect translation systems. A direct translation system translates the source language into the target language using word-to-word or phrase-to-phrase mapping. In contrast, indirect translation systems use an Interlingua or some kind of transfer method: they start with an analysis of the source text and then perform a synthesis to generate the corresponding text in the target language. Figure 1.1 gives the classic pyramid showing the relationship between these two approaches to machine translation.

Figure 1.1: Relationship between direct and indirect translations

Under the above two broad headings, several approaches have been used to develop hundreds of machine translation systems all over the world. Among others, the Human-assisted, Rule-based, Statistical, Example-based, Knowledge-based, Hybrid and Agent-based approaches are commonly cited as the most successful approaches to machine translation. Comparing the existing machine translation systems and their approaches, many of these systems use a sequential architecture for Natural Language Processing and machine translation [59]. This sequence comprises steps such as preprocessing, morphological analysis, syntax analysis, semantic analysis, pragmatic analysis and post-processing. Despite the many attempts to develop machine translation systems, the area has achieved relatively little to date. In fact, owing to the ever-present need for machine translation, some researchers have rushed to develop such systems without a proper conceptual or theoretical basis for their approaches. This has resulted in many machine translation systems that go through ad-hoc processes to translate between languages, and it has constrained development in the field of machine translation.

1.4 Aim of the Research

This thesis proposes to design and develop an English to Sinhala machine translation system with a theoretical basis.

1.5 Objectives of the Research

In order to reach the above aim, the following key objectives have been identified. These objectives range from a critical review of existing approaches to machine translation to the evaluation of the proposed theoretically based approach.

Objective 1: Critically review the existing systems, concepts and tools for machine translation.
Objective 2: Develop a computational grammar for the Sinhala language.
Objective 3: Design and develop an English to Sinhala machine translation system.
Objective 4: Evaluate the system.

1.6 Scope of the Project

The scope of the project is limited to developing a computational grammar for the Sinhala language, based on the concept of Varanegeema, to handle the most commonly used 27 noun forms and 36 verb forms.

1.7 Hypothesis

In order to achieve the above aim and objectives, the hypothesis employed in the thesis can be stated as follows: the concept of Varanegeema (conjugation) in the Sinhala language can be used to drive English to Sinhala machine translation.

1.8 Structure of the Thesis

The thesis has been structured into nine chapters. The following is the structure of the thesis, with a brief explanation of the contents of each chapter.

Chapter 1 has provided an overall introduction to the research project. It briefly explained the research problem addressed in the thesis, gave an overview of machine translation, and stated the aim, objectives and hypothesis employed in the thesis. Chapter 2 reports on the literature survey on machine translation, with a detailed description leading to the problem addressed in the thesis. The chapter also provides a detailed study of state-of-the-art Natural Language Processing, describing the different approaches adopted.


Chapter 3 gives an overview of the English and Sinhala languages with respect to the morphology, syntax and semantics of both languages. The chapter also gives a comparison between English and Sinhala, highlighting issues related to machine translation.

Chapter 4 discusses the novel approach taken to develop the English to Sinhala machine translation system. It first presents the hypothesis of the project, and then explains the mechanism of the translation process, the nature of the input and output, and the key features of the system. Chapter 5 describes the design of the proposed English to Sinhala machine translation system. Each module of the design is explained separately, describing its functionality and its relation to the other modules. Chapter 6 presents the implementation of the English to Sinhala machine translation system. This chapter gives implementation details of the Prolog-based modules, the Java-based user interface, the Intermediate Editor and the ontology of the lexical databases. Chapter 7 presents how BEES works in practice when translating a given English text. This chapter also explains the applications of BEES as a standalone translator, an on-demand translator, a web page translator and a selected-text translator. Chapter 8 reports the evaluation of the English to Sinhala machine translation system. The evaluation methodology, evaluation steps, participants and results are also given in this chapter. Chapter 9 concludes the thesis by referring to the achievement of each objective. The chapter also presents the limitations of the research and further work.


1.9 Summary

This chapter provided an overview of the entire project by describing the problem to be addressed and the aim, objectives and hypothesis employed in the thesis. It briefly explained the proposed English to Sinhala machine translation system, and the structure of the rest of the thesis was also presented. The next chapter reports on a critical review of the existing approaches to machine translation, together with the major machine translation systems based on these approaches.


Chapter 2 STATE OF THE ART OF MACHINE TRANSLATIONS

2.1 Introduction

The previous chapter presented an overview of the thesis. This chapter describes the state of the art of Natural Language Processing, with special attention to machine translation. Some related fundamental aspects of machine translation are also discussed in this chapter.

2.2 Fundamentals of Natural Language Processing

Natural Language Processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages [107]. It is also a subfield of Artificial Intelligence (AI) within Computer Science [128]. According to many electronic resources, the history of Natural Language Processing began with Turing's article "Computing Machinery and Intelligence" [151], which proposed what is now known as the Turing test as a criterion of intelligence. Later, in 1957, Noam Chomsky, regarded in the academic and scientific community as one of the fathers of modern linguistics, introduced syntactic structures for grammar [31]. This work is recognized as one of the most important texts in the field of linguistics; it became a fundamental theory for Natural Language Processing, and many machine translation systems use these syntactic structures [31][33]. NLP is used for several tasks, including machine translation, automatic summarization, information retrieval, optical character recognition, speech recognition and text-to-speech [107][128][147]. Depending on the task, Natural Language Processing systems must address several issues, such as natural language understanding, natural language generation, speech and text segmentation, part-of-speech tagging and word sense disambiguation [84].

2.3 Machine Translation Systems

A machine translation system is computer software that translates text or speech from one natural language to another [161][162]. Machine translation is a subarea of Natural Language Processing that was identified during the early days of Artificial Intelligence (AI). Due to various reasons associated with the complexity of languages, machine translation has for more than sixty years been identified as one of the least achieved areas in computing [74]. The issues range from the morphology to the semantics of the source and target languages. The history of machine translation dates back to the late 1940s. A look-up dictionary built at Birkbeck College in London in 1948 has been cited as an early work on machine translation. From 1950 to 1960, many researchers attempted to develop machine translation systems using a trial-and-error approach [75], especially for Russian to English. In 1950, the first machine translation system was developed to translate Russian sentences into English, and in 1958 the first practical machine translation system was implemented by the IBM Corporation for the US Air Force under the direction of Gilbert King [76]. This system translated Russian text into English and worked successfully until 1970. In the meantime, the RAND Corporation drew on current linguistic theory and emphasized statistical analysis; they prepared bilingual glossaries with grammatical information and grammar rules, together with the first parser based on dependency grammar. In 1970, SYSTRAN [144] implemented a new Russian-English machine translation system as a replacement for the previous US Air Force system; it translated more than 100,000 pages per year. In the meantime, many other researchers were attempting to develop machine translation systems. Among others, a syntactic transfer system for English-French was one of the strongest research efforts in the field. Further, the principal experimental effort focused on Interlingua approaches, with more attention paid to syntactic aspects [75].


In the 1980s, many computer companies attempted to develop computer-aided translation, especially for Japanese-English. These were low-level direct translation systems confined to morphological and syntactic analysis. After 1980, machine translation research developed in many directions, and the corpus-based approach has been the most popular approach up to now. However, due to the complexity of natural languages, the development of machine translation systems remains a research challenge. In addition, many researchers have noted that operational syntax, idioms and universal syntactic categories are some of the still-unsolved linguistic problems in machine translation [171].

2.4 Current Approaches to Machine Translation

Considering their translation approaches, machine translation systems can be classified into seven categories, namely Human-assisted, Rule-based, Statistical, Example-based, Knowledge-based, Hybrid and Agent-based. The Statistical, Example-based, Knowledge-based and Hybrid approaches use corpora for machine translation and are therefore collectively referred to as corpus-based approaches. All of these approaches have their own strengths and weaknesses, and the success rate of a translation depends on the approach used. Each approach to machine translation is discussed below.

2.4.1 Human-assisted Machine Translation

Human-assisted machine translation is an approach used in particular by the Indian family of machine translation systems. The human-assisted approach uses human interaction at the pre-editing, post-editing and/or intermediate-editing stages [85], and relies on human support for semantic handling in the translation. A number of machine translation systems have been developed using this approach.


In the Indian region, a number of machine translation systems have used this approach, including Anusaaraka, ManTra, MaTra and Angalabharti [133][38][146]. Anusaaraka [4][7] is a popular human-assisted translation system for Indian languages that makes text in one Indian language accessible in another Indian language. This system uses the Paninian Grammar model [6] for its language analysis. The Anusaaraka project [16] has been developed to translate the Punjabi, Bengali, Telugu, Kannada and Marathi languages into Hindi, and English-Hindi Anusaaraka translates English text into Hindi. The approach and lexicon are general, but the system has mainly been applied to children's stories [95]. MaTra is a human-assisted, transfer-based translation system for English to Hindi [11]. The system uses general-purpose lexicons and is applied mainly in the news domain. MaTra follows a structural and lexical transfer approach and aims to produce understandable output with wide coverage, rather than perfect output for a limited range of sentences. Mantra [106] is a machine-assisted translation tool that translates English text into Hindi in several domains. ManTra is based on Tree Adjoining Grammar (TAG), and the system started with the translation of administrative documents, such as appointment letters, notifications and circulars issued by the central government, from English to Hindi. Angalabharti [103] is also a human-assisted machine translation system used in India. Since India has many languages, there is a variety of machine translation systems; for example, Angalahindi [133] translates English to Hindi using a machine-aided translation methodology. Human-aided machine translation is a common feature of most Indian machine translation systems, and these systems also use both pre-editing and post-editing as the means of human intervention. Chandrashekhar Research Centre [20] has developed a machine-aided translation system for Tamil to Hindi. The Tamil to Hindi translator is based on the Anusaaraka machine translation system; the input text is in Tamil and the output is Hindi text. Stand-alone, API and web-based online versions have been developed, and a Tamil morphological analyzer and a Tamil-Hindi bilingual dictionary are by-products of this system [133]. In addition to the above, KSHALT is a human-assisted machine translation system that translates English to Korean [85]. This translation system contains four phases, namely the English parser, English analyzer, English to Korean transfer and Korean generation.

2.4.2 Rule-based Machine Translation

The rule-based approach is yet another approach to machine translation. It produces grammatically correct translations by using a set of rules. Basically, a rule-based machine translation system contains a source language morphological analyzer, a source language parser, a translator, a target language morphological analyzer, a target language parser and several lexical dictionaries. The source language morphological analyzer analyzes a source language word and provides its morphological information. The source language parser is a syntax analyzer that analyzes source language sentences. The translator translates a source language word into the target language. The target language morphological analyzer works as a generator: it generates the appropriate target language word for the given grammatical information. The target language parser works as a composer that composes a suitable target language sentence. Furthermore, this type of machine translation system needs a minimum of three dictionaries, namely the source language dictionary, the bilingual dictionary and the target language dictionary. The source language morphological analyzer needs the source language dictionary for morphological analysis; the bilingual dictionary is used by the translator to translate the source language into the target language; and the target language morphological generator uses the target language dictionary to generate target language words. Figure 2.1 presents the general architecture of a rule-based machine translation system.


A number of machine translation systems have been designed using the rule-based approach. Among others, Apertium [18] is a rule-based machine translation system that translates between related languages. It is an open-source system that can be used to translate between any two related languages. The Apertium engine follows a shallow-transfer approach and consists of eight pipelined modules: a de-formatter, a morphological analyzer, a parts-of-speech (PoS) tagger, a lexical transfer module, a structural transfer module, a morphological generator, a post-generator and a re-formatter.

[Figure 2.1 shows the pipeline: source language text → Source Language Morphological Analyzer (using the Source Language Dictionary) → Source Language Parser → Bilingual Translator (using the Bilingual Dictionary) → Target Language Morphological Generator (using the Target Language Dictionary) → Target Language Sentence Generator → target language text.]

Figure 2.1: Architecture for a rule-based machine translation system
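To make the data flow of Figure 2.1 concrete, the following is a minimal SWI-Prolog sketch of such a rule-based pipeline. It is not the BEES implementation; the dictionary entries, feature labels and romanized Sinhala forms are illustrative stand-ins only.

```prolog
% Minimal rule-based translation pipeline (illustrative only, not BEES).
% Each dictionary below is a tiny hypothetical stand-in.

% Source language dictionary: surface word -> base form + features.
sl_dictionary(boys, boy,  noun(plural)).
sl_dictionary(rice, rice, noun(singular)).
sl_dictionary(eat,  eat,  verb(present)).

% Bilingual dictionary: source base word -> target base word (romanized).
bilingual(boy,  lamaya).
bilingual(rice, bath).
bilingual(eat,  kanawa).

% Target language generation rules: base word + features -> inflected form.
tl_generate(lamaya, noun(plural),   lamai).
tl_generate(bath,   noun(singular), bath).
tl_generate(kanawa, verb(present),  kanawa).

% One word through the pipeline:
% morphological analysis -> base word transfer -> morphological generation.
translate_word(Source, Target) :-
    sl_dictionary(Source, Base, Features),
    bilingual(Base, TargetBase),
    tl_generate(TargetBase, Features, Target).

% Sentence composition: translate each word, then reorder SVO to SOV,
% the usual Sinhala word order.
translate_svo([S, V, O], [TS, TO, TV]) :-
    translate_word(S, TS),
    translate_word(V, TV),
    translate_word(O, TO).

% ?- translate_svo([boys, eat, rice], Sinhala).
% Sinhala = [lamai, bath, kanawa].
```

In BEES, the analysis, transfer and generation stages are driven by the Varanegeema rules and the four dictionaries described later in the thesis; the sketch only illustrates the module chaining of Figure 2.1.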

Toshiba [145] is another rule-based machine translation system, for English to Japanese and vice versa. To translate a given source text, the system uses morphological analysis, syntax analysis, translation word selection, structural transformation, syntax transformation and morphological generation steps, and it can translate open-domain written texts using rules. The system uses three dictionaries, namely a common word dictionary, a technical-term dictionary and a user-defined dictionary. The common word dictionary covers both English-Japanese and Japanese-English translation, the technical-term dictionary includes domain-specific technical terms, and the user-defined dictionary stores user-provided information such as unknown words. Rule-based machine translation approaches can be further categorized into three groups, namely transfer-based, Interlingua and dictionary-based. The transfer-based and Interlingua approaches share the same idea: both use an intermediate representation that captures the "meaning" of the original sentence [10][84][56]. The difference is that an interlingua-based system uses a language-independent intermediate representation, whereas a transfer-based system uses a language-dependent one. Most of these machine translation systems include morphological analysis, lexical categorization, lexical transfer, structural transfer and morphological generation. A dictionary-based machine translation system uses a dictionary for its translation, with or without morphological or syntax analysis; this type of system is ideally suited to translating long lists of phrases. Numbers of machine translation systems have been developed under these three broad headings.

2.4.2.1 Transfer-based Machine Translation

Lavie and others [96] have applied the transfer-based approach to a Hindi-to-English translation system named Xfer, trained under an extremely limited data scenario. The Xfer system uses IIITMorpher (a morphological analyzer) [79] to analyze Hindi words into the root and other features such as gender, number and tense. The Xfer system uses 70 transfer rules, including a rather large verb paradigm with 58 verb sequence rules, ten recursive noun phrase rules and two prepositional phrase rules. The authors note that this approach is particularly suitable for languages with very limited data resources. An Arabic to English machine translation system named Npae-Rbmt has also been developed through the transfer-based approach [120]. Npae-Rbmt uses an intermediate representation that captures the "meaning" of the original sentence in order to generate the correct translation. The system was evaluated on 88 thesis and journal titles from the computer science domain, and the accuracy of the result was 94.6%. The Apertium platform follows a transfer-based machine translation model [18]; using this shallow-transfer approach, a Swedish to Danish machine translation system has been developed [125]. The Swedish to Danish system uses two morphological dictionaries for analysis and generation, and is the first free-software translator from Swedish to Danish. Using an affix-transfer-based approach, a Tagalog-to-Cebuano [170] unidirectional machine translation system has been developed. Its morphological analysis is based on TagSA (Tagalog Stemming Algorithm) and on an affix-correspondence-based POS (parts-of-speech) tagger. Opentrad is an open-source transfer-based machine translation system intended for related language pairs as well as not-so-similar pairs [3][48]. Opentrad uses different translation methods for each language pair: for related languages it uses shallow transfer, whereas for non-related pairs it uses deep transfer [49]. Opentrad also uses the open-source machine translation engine Matxin [101] as its translation engine. OpenLogos is the open-source version of the Logos Machine Translation System [122], one of the earliest and longest-running commercial machine translation products in the world. The system accepts documents in various formats and produces high-quality translations [136]. OpenLogos translates from English and German into the major European languages, including Spanish, Italian, French and Portuguese.

2.4.2.2 Interlingua Machine Translation

The Interlingua approach provides a language-independent meaning representation for source-to-target translation. The Interlingua gives one single meaning representation for all languages, and this has proved to be an extremely difficult task in practice [135]. However, the Interlingua approach has several advantages; among others, it provides an easier way of adding a new language than the other methods. It also has several disadvantages. The meaning representation is the critical issue in Interlingua: if the representation is too simple, meaning will be lost in the translation; on the other hand, if it is too complex, analysis and generation become too difficult. A number of machine translation systems have been developed through the Interlingua approach. Abdelhadi and others have developed an English to Arabic machine translation system based on the Interlingua approach [1]. They use a mapping system from Arabic to the intermediate representation, which contains three steps, namely selecting lexical items for each Interlingua concept, mapping the semantic roles, and mapping the semantic features of each Interlingua concept to the appropriate syntactic features in the feature structure. Among others, ICENT is an interlingua-based Chinese-English natural language translation system [167]. It introduces a realization mechanism for Chinese language analysis, which contains syntactic parsing and semantic analysis, and gives the design of the Interlingua in detail. A Thai to English machine translation system is another successful Interlingua-based system [29]; it translates Thai sentences into an Interlingua of a Thai LFG tree using an LFG grammar and a bottom-up parser.

2.4.2.3 Dictionary-based Machine Translation

Dictionary-based machine translation systems are commonly used in cross-language retrieval systems [77]. The dictionary-based approach uses a dictionary to generate the equivalent target query for a given source language query. Mandal and others [105] have developed a cross-language retrieval system for retrieving English documents in response to queries in Bengali and Hindi; their dictionary-based machine translation is used to generate the equivalent English query from Indian-language topics. Thenmozhi and Aravindan have developed a Tamil-English Cross Lingual Information Retrieval System for the agriculture society [149]. The system was developed for the farmers of Tamil Nadu, helping them to specify their information need in Tamil and to retrieve documents in English. It uses a morphological analyzer to obtain the root terms of the source query, and retrieves pages with a mean average precision of 95%.

2.4.3 Statistical Machine Translation

The statistical machine translation approach is by far the most widely studied machine translation method in the field of natural language processing. It tries to generate translations using statistical methods based on bilingual text corpora [84]. Using this statistical approach, a large number of machine translation systems have been developed. Moses is a statistical machine translation system that allows translation models to be trained automatically for any language pair [108]. The Moses system has several features: it offers two types of translation models, namely phrase-based and tree-based, and it uses factored translation models, which enable the integration of linguistic and other information at the word level. Babel Fish [168] is a web-based application developed by AltaVista that translates text or web pages from one language into another. The translation technology for Babel Fish is provided by SYSTRAN [144], whose technology also powers the translator at Google and a number of other sites. It can translate among English, Simplified Chinese, Traditional Chinese, Dutch, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian and Spanish. A number of sites have sprung up that use the Babel Fish service to translate back and forth between one or more languages.


Bing Translator [112] is a service provided by Microsoft as part of its Bing services that allows users to translate texts or entire web pages into different languages. All translation pairs are powered by Microsoft Translation, developed by Microsoft Research, which uses Microsoft's own syntax-based statistical machine translation technology. Google Translator [51] translates a section of text, or a web page, into another language. It does not always deliver accurate translations and does not apply grammatical rules, since its algorithms are based on statistical analysis rather than traditional rule-based analysis. In the Indian region, Udupa and Faruquie have developed an English-Hindi statistical machine translation system [152] based on IBM Models 1, 2 and 3; the system has been tested on an English-Hindi parallel corpus consisting of 150,000 sentence pairs. Singh and Bandyopadhyay have developed a Manipuri-English bidirectional statistical machine translation system [133]. The system uses four useful translation factors, namely case markers and POS tag information on the source side, and suffixes and dependency relations on the target side. This translation system has been evaluated using the BLEU score.
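For orientation, the statistical systems above (including the IBM models) share a common core that is not spelled out in this chapter: translation is posed as a search for the most probable target sentence given the source sentence, usually factored in the noisy-channel form

\hat{t} \;=\; \arg\max_{t} P(t \mid s) \;=\; \arg\max_{t} P(s \mid t)\, P(t)

where P(s|t) is the translation model estimated from the bilingual corpus and P(t) is a target-language model; phrase-based and factored systems such as Moses refine how these two components are estimated and combined.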

2.4.4 Example-based Machine Translation

Example-based machine translation systems use a bilingual corpus with parallel text for machine translation. These systems are trained on bilingual parallel corpora containing sentence pairs. The example-based approach is particularly useful for detecting context in the text, and it also uses translation memories [13]. A number of machine translation systems have been developed all over the world using this approach. Among others, OpenMaTrEx is an open-source example-based machine translation system, freely available on the OpenMaTrEx web site [121]. OpenMaTrEx has been developed around the marker hypothesis and comprises a marker-driven chunker, a collection of chunk aligners and two engines. Kyoto-U is a successful example-based machine translation system that translates between English and Japanese [119]. It uses a morphological analyzer and a dependency analyzer to detect Japanese sentence structures and convert them into dependency structures; in addition, Japanese and English parsers and a bilingual dictionary are used as external resources. At present, many researchers are working to develop example-based machine translation systems that use the World Wide Web as a parallel corpus [55]; wEBMT is an example-based machine translation (EBMT) system that uses the World Wide Web as its parallel corpus [13].

2.4.5 Knowledge-based Machine Translation

The knowledge-based machine translation approach uses knowledge for machine translation and is an extension of the idea behind example-based machine translation. This approach uses linguistic and computational instructions supplied by a human. A number of commercial-quality machine translation systems have used the knowledge-based approach; among others, EDR [150] and KANT [86] are the major knowledge-based machine translation systems. EDR (Electronic Dictionary Research) [114], from Japan, is the most successful of these. It takes a knowledge-based approach in which the translation process is supported by several dictionaries and a huge corpus [115]. While using the knowledge-based approach, EDR is also governed by a process of statistical machine translation. Compared with other machine translation systems, EDR is more than a mere translation system and provides a great deal of related information.


KANT (Knowledge-based Accurate Natural-language Translation) is a knowledge-based machine translation system for specific domains [86]. A prototype of the KANT architecture translates French, German and Japanese successfully, and KANT is currently being extended in a large-scale commercial application [118]. The KANT prototype has been implemented in the domain of technical electronics manuals and translates from English to Japanese, French and German.

2.4.6 Hybrid Machine Translation

Hybrid machine translation systems combine the rule-based and statistical machine translation approaches, and this hybrid approach has several advantages. Among others, SYSTRAN is the market-leading provider of language translation software products and solutions for the desktop, enterprise and Internet, facilitating communication in 52 language combinations and in 20 vertical domains [124]. By combining self-learning and linguistic technologies, SYSTRAN has developed a hybrid machine translation system [144] named SYSTRAN Enterprise Server 7. An English to Arabic machine translation system has also been developed through the hybrid approach, combining the rule-based and example-based approaches [133].

2.4.7 Agent-based Machine Translation

Agent technology, more specifically multi-agent systems, has also been used to handle machine translation. A multi-agent system provides tools for building artificial Complex Adaptive Systems [131]. In general, any multi-agent system contains four key components, namely a multi-agent engine, a virtual world, ontologies and interfaces [130][131]. The multi-agent engine provides run-time support for agents and starts as the first step of the system. The virtual world is the environment of the multi-agent system; within it, agents cooperate and compete with each other as they construct and modify the current scene. The ontology contains the conceptual problem-domain knowledge of each agent. A number of NLP systems have been developed using multi-agent system technology [175][129][130][113][36], and most of them use agents to handle semantics in the translation. Minakow and others [113] have developed a multi-agent-based text understanding system for the car insurance domain. The system uses four steps for text understanding, namely morphological analysis, syntax analysis, semantic analysis and pragmatic analysis. For analysis, the whole text is divided into sentences and the first three stages are applied to each sentence; after each paragraph has been analyzed, the text is passed to pragmatic analysis. Stefanini and others have developed a multi-agent-based general natural language processing system named Talisman [141]. Talisman agents can communicate with each other without central control and are able to exchange information directly using an interaction language. The linguistic agents are governed by sets of local rules. TALISMAN deals with ambiguities and provides a distributed algorithm for resolving conflicts arising from uncertain information.

2.5 Existing English to Sinhala Machine Translation Systems

During the past few years, many Sri Lankan researchers have contributed to the development of machine translation systems for the local languages. Among others, the University of Colombo has recorded significant research towards English to Sinhala and Sinhala-Tamil machine translation, with several local language resources such as a Sinhala corpus [99][159], a Sinhala text-to-speech system [160], a part-of-speech tagger [45] and an OCR system for the Sinhala language [158]. As a first attempt, Weerasinghe and others have been researching a Sinhala to Tamil machine translation system through the corpus-based approach [157]. This translation system has been evaluated using the BLEU score metric [123], and reasonable results were achieved. At present they are researching an English to Sinhala machine translation system based on translation memories [156]. They have designed a translation tool named OpenTM, based on translation memories, and have noted that OpenTM is suitable for any language pair in which at least one language requires complex script support. Further, many other local researchers have developed prototype English to Sinhala machine translation systems through several approaches. In 2003, Vithanage and others developed an English to Sinhala machine translation system for the weather forecasting domain [153]. Vithanage's translation system can translate simple sentences and works on a limited set of words and a limited set of sentence patterns. It is fundamentally rule-based and uses paragraph and sentence tokenization, simple parsers (English and Sinhala), translators and Sinhala sentence generators for English to Sinhala translation. In 2008, Fernando and others developed an English to Sinhala machine translation system using Artificial Neural Networks [47]. A Probabilistic Neural Network, based on Bayesian classifiers, is used to identify the English grammar. This system achieved 50% accuracy in grammatical translation; it was tested on 84 test cases covering 12 tenses and is capable of translating only simple sentences. In addition to the above, researchers elsewhere have also attempted to develop machine translation systems for Sinhala. Among others, Hearth and others have attempted to develop a translation system from Japanese to modern Sinhalese [57]. The system has a limited vocabulary and handles translations only within its domain.

2.6 Concepts and Techniques for Machine Translation
The previous section discussed several existing approaches to machine translation. Many of these machine translation systems use morphological analysis and syntax analysis to analyze the source language. This morphological analysis and syntax analysis is performed by morphological analyzers and parsers, which carry out a major task in any machine translation system. Therefore, the following subsections give a brief description of morphological analysis and syntax analysis.

2.6.1 Morphological Analysis
Morphological analysis is the identification (analysis) of the structure of morphemes and other units of meaning in a language, such as words, affixes and parts of speech [84][162][176]. Historically, the first attempt at morphological analysis was made by the ancient Indian linguist Panini, who formulated the 3,959 rules of Sanskrit morphology (Vyakarana). This Panini grammar [24] is the basis of the whole Indian family of languages, including Hindi, Sinhala, Pali and Sanskrit. Using the Panini grammar model, many researchers have developed morphological analyzers for their own languages [5][6]. Morphological analyzers for English have been developed by many researchers. Koskenniemi's two-level morphology was the first practical and most general model in the history of computational linguistics for the analysis of morphologically complex languages [92][93]. Koskenniemi's Pascal implementation of morphological analysis was quickly followed by others. The most influential of them was the KIMMO system by Lauri Karttunen and his students at the University of Texas. PC-KIMMO is yet another morphological analysis tool, which was based on Koskenniemi's work and implemented in C [87]. Among others, PC-KIMMO is supposed to be the only freely available English morphological analyzer with wide coverage [34]. The lexicon used in PC-KIMMO covers verbs, pronouns, nouns, prepositions, adverbs and adjectives. The current version of PC-KIMMO is implemented in C and can be run on a PC [93]. PC-KIMMO accepts an input word from a user and provides all possible morphological details of the word. In addition, many European and Scandinavian countries have developed morphological analyzers for their languages.

These countries have exploited the real power of computer technology for machine translation.

Asian countries including India, Japan and Thailand have also developed morphological analyzers for computer-based natural language processing [5][6]. For example, the Anusaaraka system includes morphological analyzers for six Indian languages [16]. Anusaaraka has been designed to translate among the major Indian languages and its morphological analysis is based on paradigms. The paradigm is used both for word analysis and for word generation. Akshar Bharati and others have also developed a generic morphological analysis shell that can be used to build morphological analyzers for different minority languages [5]. This shell uses finite state transducers with features to give the analysis of a given word; further, it integrates paradigms with augmented FSTs. The current model has been developed for sample data of Hindi, Telugu, Tamil and Russian. This generic morphological analysis shell uses dictionaries, a paradigm table and paradigm classes.

2.6.2 Syntax Analysis
Syntax analysis is used to analyze the structure of a text and to determine whether or not the text conforms to an expected format [84][91]. From the machine translation point of view, syntax analysis is done by the parser, which analyzes the given text (sentences). To analyze the given text, parsers use several techniques that fall under top-down and bottom-up parsing. Top-down parsers analyze the input from left to right and search for parse trees using a top-down expansion [162]. Several types of parsers have been developed using the top-down approach, including the recursive descent parser, the LL parser, the Earley parser and the X-SAIGA parsers; each has its own properties in addition to the general top-down parsing features. The recursive descent parser is the most straightforward form of top-down parsing [97]. The LL parser also uses top-down parsing: it parses the input from left to right and constructs a leftmost derivation of the sentence. ANTLR [148] is a popular LL parser generator, used especially for compilers; an LL(k) parser uses these techniques to parse sentences without backtracking. Earley parsers are especially suitable for ambiguous grammars and are used for parsing in computational linguistics. Many of these parsers have already been implemented in the C, Java, Perl and Python languages. The X-Saiga parsers are developed under the X-Saiga project, whose aim is to create algorithms and implementations that enable the construction of language processors such as recognizers, parsers, interpreters and translators; several algorithms have been implemented at various stages of the X-Saiga project [166].
The bottom-up parser attempts to identify the most fundamental units first and then builds trees upward towards the start symbol. These parsers are used to analyze both natural languages and computer languages. Several types of parsers have been developed using the bottom-up approach, including operator precedence parsers, LR parsers and CYK parsers. The operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar [162]. The LR parser [132] also uses bottom-up parsing: it parses the input from left to right and constructs a rightmost derivation of the sentence. CYK parsers use the Cocke-Younger-Kasami algorithm, a bottom-up parsing technique that operates on context-free grammars given in Chomsky normal form (CNF) [31][32]. In addition, parsers have been developed using several programming languages, especially Prolog [25], and a number of tools are used to develop parsers, including ANTLR, Yacc and JavaCC. Using these programming languages and development tools, many parsers have been developed for natural languages as well as computer programming languages.
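As an illustration of top-down parsing, the following minimal Prolog DCG sketch parses a toy English sentence by expanding the start symbol from the top down. The grammar and words here are illustrative only and are not part of any of the systems surveyed above.

% A toy grammar, parsed top-down by Prolog's DCG mechanism.
sentence    --> noun_phrase, verb_phrase.
noun_phrase --> determiner, noun.
verb_phrase --> verb, noun_phrase.

determiner --> [the].
determiner --> [a].
noun       --> [boy].
noun       --> [book].
verb       --> [reads].

% Example query:
% ?- phrase(sentence, [the, boy, reads, a, book]).
% true.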

2.7 Problem Definition
The existing machine translation systems that use the stated approaches are not directly able to translate English text into Sinhala. Since each natural language is built on its own building blocks and structures, two languages cannot necessarily be handled in the same manner. Although some Indian languages share common features with Sinhala, they are not identical. On the other hand, such systems do not provide an underlying theory to generalize machine translation. As such, it is impossible to figure out which building block or structure should be customized to create an English to Sinhala machine translation system. This lack of a theoretically-based approach to machine translation has led to the development of ad-hoc translation systems.

2.8 Summary
This chapter gave a detailed discussion of machine translation systems and the approaches used. Table 2.1 shows selected successful machine translation systems with their language pair, approach and system type.

Table 2.1: Existing Machine translation systems

System              Language pair                 Approach & Type
Anusaaraka          Among Indian languages        Human-Assisted, Application
Angalabarath        English to Indian languages   Human-Assisted, Rule-based, Application
AngalaHindi         English to Hindi              Machine-aided, Rule-based/example-based, Web based
ManTra              English to Hindi              Human-aided, Web based
English to Urdu MT  English to Urdu               Example based, Application
Matra               English to Hindi              Human-aided, Transfer-based, Application
Google TR           Several languages             Statistical, Web-based
Bable fish          Several languages             Systran technology, Web based
Yahoo TR            Several languages             Statistical, Web-based
Aprtium             Related languages             Rule-based, Application
EDR                 English/Japanese              Knowledge based, Application

According to the literature survey, the author has identified that human-assisted and rule-based approaches are most suitable for non-related language pairs such as English and Sinhala. The next chapter reviews features of the English and Sinhala languages with a view to identifying issues related to machine translation from English to Sinhala.


Chapter 3
OVERVIEW OF THE ENGLISH AND SINHALA LANGUAGES
3.1 Introduction
The previous chapter discussed machine translation systems in detail. The author pointed out the issues in adapting an existing translation system for constructing an English to Sinhala machine translation system. The literature review also revealed that the development of a machine translation system depends heavily on the structure of the source and target languages. Therefore, this chapter studies the language primitives and structures of the English and Sinhala languages. This study provides an insight into how the translation from English to Sinhala can be done.

3.2 The English Language
English is the international language of communication and more than 53 countries already use it as an official language. It is a West Germanic language that originated from the Anglo-Frisian and Old Saxon dialects brought to Britain [162]. The English alphabet contains 26 letters, of which 5 are vowels [116]. The English language has eight parts of speech, namely noun, adjective, pronoun, verb, adverb, preposition, conjunction and interjection [8][165]. The rest of this chapter describes the morphology, syntax and semantics of the English language.

3.3 The English Language Morphology
Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes, where a morpheme is often defined as the minimal meaning-bearing unit in a language [84]. For example, the word boy consists of a single morpheme while the word boys consists of two morphemes, namely boy and -s. Further, from the morphological viewpoint there are two types of morphemes: stems and affixes. In the previous example the morpheme boy is a stem and -s is an affix. These stems and affixes participate in both inflection and derivation of words, which is called word formation [109]. Inflection provides the various forms of a single word, such as singular and plural (e.g. singular man, plural men in English). Derivation creates new words from old ones (e.g. the creation of dogcatcher from 'dog', 'catch' and 'er' is a derivational process) [117][84]. Compared with the other Indo-European languages, English grammar has minimal inflection. Therefore, English morphology is simpler than that of the other Indo-European languages. With the exception of pronouns, English words have relatively few forms.

3.3.1 English Noun Morphology
The English noun has two types of inflection: number and the possessive case. Nouns generally have only two forms for number, singular and plural. In the possessive case, words usually end in 's or ', for example boy's and boys'.
English nouns show both regular and irregular inflection. The regular inflection gives the general forms of the singular, plural and possessive cases. Table 3.1 shows regular and irregular nouns with their inflected forms.

Table 3.1: Regular and irregular forms of the English Noun

Grammar rule           Regular    Irregular
Singular               boy        man
Plural                 boys       men
Singular Possessive    boy's      man's
Plural Possessive      boys'      men's

Considering the morphology of the English noun, it has a very limited number of rules for noun inflection. Table 3.2 shows some morphological rules for the English noun. Basically, the plural noun is formed by adding a suffix such as s, es, ies or ves to the singular noun. The possessive case is formed by adding 's or '.

Table 3.2: English Noun Morphological rules

No   Morphological structure             Base word   Example
1    Singular noun                        boy         boy
2    Plural: Base + s                     boy         boys
3    Plural: Base + es                    class       classes
4    Plural: Base - y + ies               baby        babies
5    Plural: Base - f + ves               knife       knives
6    Singular Possessive: Base + 's       school      school's
7    Plural Possessive: Plural + '        boy         boys'
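The regular rules in Table 3.2 can be captured by a few suffix-rewriting clauses. The following Prolog sketch is illustrative only; the predicate plural/2 is hypothetical, is not the dictionary format used by BEES, and ignores further exceptions.

% plural(+Singular, -Plural): regular English pluralisation (illustrative only).
plural(Singular, Plural) :-                 % consonant + y: -y +ies (baby -> babies)
    atom_concat(Stem, y, Singular),
    sub_atom(Stem, _, 1, 0, C),
    \+ member(C, [a, e, i, o, u]),
    atom_concat(Stem, ies, Plural), !.
plural(Singular, Plural) :-                 % -fe +ves (knife -> knives)
    atom_concat(Stem, fe, Singular),
    atom_concat(Stem, ves, Plural), !.
plural(Singular, Plural) :-                 % +es after s, x, h (class -> classes)
    sub_atom(Singular, _, 1, 0, Last),
    member(Last, [s, x, h]),
    atom_concat(Singular, es, Plural), !.
plural(Singular, Plural) :-                 % default: +s (boy -> boys)
    atom_concat(Singular, s, Plural).

% ?- plural(baby, P).   gives P = babies.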

3.3.2 English Verb Morphology
The English verb has five inflected forms, namely the infinitive, simple present, past tense, past participle and present participle. In regular verbs, the 3rd person singular ends with 's', the past tense and past participle end with 'ed' and the present participle ends with 'ing'. Note that English has a large number of irregular verbs and these verbs do not fit this pattern. The personal pronoun has different forms depending on number (singular and plural), case (subject, object, possessive, etc.) and person (1st, 2nd and 3rd person); in the 3rd person singular there is gender as well. Table 3.3 shows the full set of verb forms available for the English verbs play (regular) and eat (irregular). From a morphological point of view, English regular verbs follow several simple morphological rules, shown in Table 3.4. Irregular verbs use different patterns, except for the simple present (adding s) and the present participle (adding ing) forms, which follow the regular pattern.

3.3.3 English Adjective Morphology
Adjectives have comparative and superlative forms: comparative adjectives end with 'er' and superlative adjectives end with 'est'. For example, higher and highest are the comparative and superlative forms of the adjective 'high'. The other parts of speech (adverb, preposition, conjunction and interjection) do not show inflection.

Table 3.3: English verb Morphology

Morphological structure   Regular verb   Irregular verb
Infinitive                play           eat
Past                      played         ate
Present Participle        playing        eating
Past Participle           played         eaten
Present:
  I                       play           eat
  You                     play           eat
  He, She, It             plays          eats
  We                      play           eat
  You                     play           eat
  They                    play           eat

Table 3.4: Morphological rules for English Verbs

No   Morphological structure             Regular verb   Irregular verb
1    Infinitive verb (Base verb)         play           eat
2    Simple present (base + s)           plays          eats
3    Past (base + ed)                    played         ate
4    Present Participle (base + ing)     playing        eating
5    Past Participle (base + ed)         played         eaten
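The regular patterns in Table 3.4 can likewise be written as simple suffixing rules. The sketch below is illustrative only (verb_form/3 is a hypothetical predicate, not the BEES dictionary format) and ignores spelling changes such as consonant doubling (stop -> stopping) and final-e dropping (make -> making).

% verb_form(+Base, +Form, -Inflected): regular English verb inflection (illustrative only).
verb_form(Base, infinitive, Base).
verb_form(Base, simple_present_3sg, V) :- atom_concat(Base, s, V).
verb_form(Base, past, V)               :- atom_concat(Base, ed, V).
verb_form(Base, past_participle, V)    :- atom_concat(Base, ed, V).
verb_form(Base, present_participle, V) :- atom_concat(Base, ing, V).

% ?- verb_form(play, past, V).   gives V = played.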

3.4 Syntax of the English Language
Syntax is the study of the rules that give the structure of sentences [162]. The English language has its own sentence structure, which differs from the Sinhala language syntax. The following description of English sentence syntax is based on the Scientific Psychic web site [172][174]. The English language contains four main sentence types, namely declarative, interrogative, imperative and conditional. An English sentence may be simple or compound; compound sentences consist of two or more simple sentences joined by conjunctions. A declarative sentence consists of a subject and a predicate. The subject may be a simple subject or a compound subject: a simple subject consists of a noun phrase or a nominative personal pronoun, while compound subjects are formed by combining several simple subjects with conjunctions. All the sentences in this paragraph are declarative sentences. Interrogative sentences are used to form questions. One form of an interrogative sentence is a declarative sentence followed by a question mark, and there are several other forms of interrogative sentence that start with what, who, which, etc. Imperative sentences are commands; they consist of predicates that contain only verbs in the infinitive form. Generally, imperative sentences are terminated with an exclamation mark instead of a period.
Conditional sentences are used to describe the consequences of a specific action, or the dependency between events or conditions. A conditional sentence consists of an independent clause and a dependent clause. In addition to the above, machine translation requires a deeper structural analysis of the English source sentence, in particular of the subject, object, predicate and sentence patterns. This information is very useful when handling English phrases.

3.4.1 The English Sentence Subject
The subject is the part of the sentence that performs an action or is associated with the action. The subject may be simple or compound. A simple subject may be a noun phrase or a nominative personal pronoun (the nominative personal pronouns are: I, you, he, she, it, we and they).

3.4.2 The English Predicate The predicate is the part of the sentence that contains a verb or verb phrase and its complements. English has three main kinds of verbs: auxiliary verbs, linking verbs, and action verbs.

3.4.3 Verb Tense Verb tenses are inflectional forms of verbs or verb phrases that are used to express time distinctions [8]. The table 3.5 shows the structure of some common tenses.

Table 3.5: Tense patterns (Active voice)

Tense                        Example
Simple present               I write a book
                             The boy sings a new song
Present continuous           I am writing a book
                             The boy is singing a new song
Present perfect              I have written a book
                             The boy has sung a new song
Present perfect continuous   I have been writing a book
                             The boy has been singing a new song
Past tense                   I wrote a book
                             The boy sang a new song
Past continuous              I was writing a book
                             The boy was singing a new song
Past perfect                 I had written a book
                             The boy had sung a new song
Past perfect continuous      I had been writing a book
                             The boy had been singing a new song
Future tense                 I will write a book
                             The boy will sing a new song
Future continuous            I shall be writing a book
                             The boy will be singing a new song
Future perfect               I shall have written a book
                             The boy will have sung a new song
Future perfect continuous    I shall have been writing a book
                             The boy will have been singing a new song
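The tense structures in Table 3.5 can be summarised as a choice of auxiliaries plus a required form of the main verb. The following Prolog facts are an illustrative encoding only; the predicate tense_pattern/2 is hypothetical and is not part of BEES.

% tense_pattern(Tense, AuxiliariesAndMainVerbForm) -- illustrative only.
tense_pattern(simple_present,     [main(simple_present)]).
tense_pattern(present_continuous, [aux(be, present), main(present_participle)]).
tense_pattern(present_perfect,    [aux(have, present), main(past_participle)]).
tense_pattern(past,               [main(past)]).
tense_pattern(past_continuous,    [aux(be, past), main(present_participle)]).
tense_pattern(past_perfect,       [aux(have, past), main(past_participle)]).
tense_pattern(future,             [aux(will), main(infinitive)]).
tense_pattern(future_continuous,  [aux(will), aux(be, infinitive), main(present_participle)]).
tense_pattern(future_perfect,     [aux(will), aux(have, infinitive), main(past_participle)]).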

3.4.4 The Complement
The predicate consists of a verb or verb phrase and its complements, if any. A verb that requires no complements is called intransitive; a verb that requires one or two complements is called transitive.

3.5 Semantics of English Language
Semantics is the study of meaning. It typically focuses on the relation between signifiers, such as words, phrases, signs and symbols, and what they stand for [162]. Semantics can be classified into three groups, namely word level meaning, sentence level meaning and paragraph level meaning.

3.5.1 Word Level Semantics
Word level semantics refers to meaning that is determined by the individual words in a sentence. As an example, consider the following sample sentences: "This is a red rose", "This paper is red", and "The supervisor flashes the red light for his student". The word 'red' gives a different meaning in each sentence.

3.5.2 Sentence Level Semantics
Sentence level semantics refers to meaning that depends on the sentence as a whole. Analyzing the sentence level semantics of a sentence is very important in many areas [37].

3.5.3 Paragraph Level Semantics
Paragraph level semantic analysis [173] is one solution for word sense ambiguity [80]. Many researchers have carried out research on analyzing paragraph level semantics [127].

3.6 The Sinhala Language
The Sinhala language is constitutionally recognized as an official language of Sri Lanka, along with Tamil. Sinhala is the mother tongue of the Sinhalese. The Sinhala language has its own writing system, which is an offspring of the Brahmi script [22]. Dhivehi, spoken in the Maldives, is the closest relative of Sinhala. Further, the Sinhala script has been ranked the world's 16th most creative alphabet among today's functional languages [35]. The most important Sinhalese historical chronicle, the Mahavansa [102], notes that prince Vijaya and his entourage, who came from India in the 5th century BC, merged with the native Hela tribes known as Yakka and Naga, who spoke the Elu language (the ancient form of the Sinhalese language), and the new nation called 'Sinhala' came into existence together with the Sinhala language. Further, Sinhala differs from all other Indo-Aryan languages. It contains a pair of vowel sounds that are unique to it: the short vowel ae and the long vowel aae. Sinhala also contains a set of five nasal sounds known as "half nasal" or "prenasalized stops", with the Romanized notations nng, ndj, nnd, nd and mb [88]. The next subsections briefly describe the Sinhala alphabet, morphology and syntax of the Sinhala language.

3.6.1 Sinhala Alphabet
The Sinhala alphabet consists of 61 letters, comprising 18 vowels, 41 consonants and 2 semi-consonants [40][22]. These symbols represent 40 sounds: 14 vowel sounds and 26 consonant sounds. This is quite similar to other Indic alphabets, as all of them appear to be offshoots of the Sanskrit alphabet [50]. Table 3.6 shows the Sinhala alphabet.

Table 3.6: The Sinhala Alphabet

Letter Type    Sinhala Letters
Vowels         w, wd, we, wE, b, B, W, W!, Ì, Ï, iD, iDD, t, ta, ft, T, ´, T!
Consonants     l, L, ., >, V, Õ, p, P, c, Cv, [, {, P, g, G, v, V, K, Ë, ;, :, o, O, k, |, m, M, n, N, u, U, h, r, ,, j, Y, I, i, y

    ...,
    (   Result = 'finish',
        add_eng_sen_results('sucess')
    ;   Result = 'error',
        add_eng_sen_results('error')
    ).

The Prolog predicate named eng_sentence_syntax_analysis/1 is used to analyze the sentence. Before the analysis, the EPA consults the ema_out.pl file by using the following code.

:- consult('c:/bees7/ema_out.pl').

Then the EPA clears all the variables and the previous data in the epa_out.pl file. The following rules are used to analyze simple and compound sentences.

english_sentence(Out, NL, []) :-
    simple_sentence(Out, NL, []).
english_sentence(Out, NL, []) :-
    compound_sentence(Out, NL, []).

A compound sentence may consist of two simple sentences joined by a conjunction.

compound_sentence(Out, Sen, End) :-
    simple_sentence(O1, Sen, S1),
    conjunction_simple_sentence(O2, S1, End),
    append(O1, O2, Out),
    add_eng_sentnce_info('compscs', Out).

A simple sentence may be one of four types, namely declarative, interrogative, imperative or conditional. The following code shows the implementation.

simple_sentence(Out, NL, End) :-
    declarative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :-
    interrogative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :-
    imperative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :-
    conditional_sentence(Out, NL, End).

The English parser analyzes the English sentence and produces the following information:
1. Type of the sentence
2. Tense of the sentence
3. Subject, complement, verb and predicate
The following results are given for the English sentence "A good boy and his friend read the books everyday".

eng_sen_verb([5000008]).
eng_sen_complement([3000003, 1000004, 3000016]).
eng_sen_subject([3000001, 3000004, 1000001, 3000027, 4000004, 1000011]).
eng_sen_predicate([5000008, 3000003, 1000004, 3000016]).
eng_sen_type(declarative).
eng_sen_ekeys([3000001, 3000004, 1000001, 3000027, 4000004, 1000011, 5000008, 3000003, 1000004, 3000016]).
eng_sen_tence(simplepresent).
eng_sen_result(sucess).

The Prolog predicate eng_sen_verb/1 gives the verb of the sentence. This verb id is equal to the verb id in the morphological analysis.

eng_verb([5000008], if, 'read').

The Prolog predicates eng_sen_complement/1, eng_sen_subject/1 and eng_sen_predicate/1 give information about the complement, subject and predicate of the input sentence.

eng_sen_type(declarative).
eng_sen_tence(simplepresent).
eng_sen_result(sucess).


The above three facts give the type, the tense and the result of the analysis. Note that this information is used to generate the corresponding Sinhala sentence.

6.3.3 English to Sinhala Bilingual Translator
The English to Sinhala bilingual translator is a Prolog-based module which is used to find a suitable Sinhala base word for a given English base word. The bilingual translator uses the output of the English morphological analysis, the output of the English syntax analysis, the English-Sinhala bilingual dictionary, the concept dictionary and the transliteration module to find the appropriate Sinhala base word. The following code shows how the bilingual translator consults these sources.

:- consult('c:/bees7/ema_out.pl').
:- consult('c:/bees7/epa_out.pl').
:- consult('c:/bees7/dic/eng_sin_word_dic.pl').
:- consult('c:/bees7/plsource/dtrSource.pl').
:- consult('c:/bees7/dic/eng_sin_cons_dic.pl').
:- consult('c:/bees7/dic/eng_sin_usage_dic.pl').

The bilingual translator stores all the output results of the base-word translation in the file named 'est_out.pl'. To identify the corresponding Sinhala base word the bilingual translator uses the following three rules.

eng_to_sin_word_all(H, S, Type, EW, SW) :-
    eng_cons_word(H, S, SubID),
    subject_form_avlable(SubID),
    esw(_, H, S, Type, EW, SW).
eng_to_sin_word_all(H, S, Type, EW, SW) :-
    esu(EW, H, S, Type, SW, _).
eng_to_sin_word_all(H, S, Type, EW, SW) :-
    esw(_, H, S, Type, EW, SW).

The Prolog predicate eng_to_sin_word_all/5 is used to find the appropriate Sinhala base word by searching the three dictionaries. The following result shows the bilingual translator output for the English sentence "A good boy and his friend read the books everyday".

estrwords(1001, 3000001, 3000000, dt).
estrwords(1002, 3000004, 3000004, aj).
estrwords(1003, 1000001, 1000001, na).
estrwords(1004, 3000027, 3000027, cn).
estrwords(1005, 4000004, 4000004, na).
estrwords(1006, 1000011, 4000011, na).
estrwords(1007, 5000008, 5000008, vb).
estrwords(1008, 3000003, 3000000, dt).
estrwords(1009, 1000004, 1000004, na).
estrwords(1010, 3000016, 3000016, av).

6.3.4 Sinhala Morphological Generator
The Sinhala morphological generator is the key module of the system and has been implemented using SWI-Prolog. The Sinhala morphological generator uses the Sinhala dictionary and the result of the bilingual translator. The following code shows how the Sinhala morphological generator consults the Sinhala dictionary.

:- consult('convertor.pl').

The Sinhala morphological generator generates the appropriate Sinhala words for the given grammar information. It generates Sinhala nouns, verbs, adjectives, adverbs and prepositions.
To generate Sinhala nouns the SMG uses the get_sin_noun/8 Prolog predicate. The predicate get_sin_noun/8 uses the Sinhala base word id, person, number, sex, live code, DI-code and case to generate a suitable Sinhala noun.

snoun([1000001], td, sg, ma, li, dr, v1, 'පිරිමි ළමයා').

To generate the Sinhala noun, it uses Sinhala grammar rules. The following code shows how the Sinhala morphological generator generates the Sinhala noun.

Case 1: The Sinhala noun can be obtained directly from the Sinhala dictionary (no generation is needed).

get_sin_noun(WID, P, N, S, L, DI, VB, NW) :-
    sn([WID], P, N, S, L, DI, VB, NW).

Case 2: The Sinhala noun is generated through the noun generator. To generate a noun, the generator uses the word information from the Sinhala dictionary, the word generation rules from the rule dictionary and the case rules from the rule dictionary. The following code shows how it generates a noun.

get_sin_noun(WID, PE, sg, S, L, dr, v1, OUT) :-
    sn([WID], PE, S, L, NID, C1, _, _, WD),
    get_sin_noun_baseform(WD, L, NID, BASE),
    validate_sds(BASE, L, NID, NW),
    ensure_loaded('c:\\bees7\\dic\\sin_rules_dic.pl'),
    noun_vib_postfix(C1, v1, AV, RV),
    atom_chars(NW, WDL), atom_chars(AV, ABL),
    atom_chars(RV, RBL), append(WDL1, RBL, WDL),
    append(WDL1, ABL, NWDL), concat_atom(NWDL, OUT).

The noun generation is done through the following five steps:
1. Get the noun information from the Sinhala dictionary
   sn([WID], PE, S, L, NID, C1, _, _, WD)
2. Generate the base form of the noun
   get_sin_noun_baseform(WD, L, NID, BASE)
3. Generate the required word form
   validate_sds(BASE, L, NID, NW)
4. Get the suitable case form for the generated noun
   noun_vib_postfix(C1, v1, AV, RV)
5. Generate the final Sinhala noun
All the rules used for noun generation are implementations of the language basics (Sinhala noun Gana). In addition to Sinhala nouns, the Sinhala verb generator is used to generate Sinhala verbs. Sinhala verbs may be regular or irregular. The irregular verbs are identified directly from the Sinhala dictionary. The following code shows how Prolog identifies these words from the Sinhala dictionary.

get_sin_final_verb(Skey, Type, P, N, Tence, SW) :-
    sfv([Skey], P, N, Type, Tence, SW),
    write(SW).

As with Sinhala nouns, Sinhala regular verbs are generated through a set of Sinhala rules. The following code shows one of the rules used to generate Sinhala regular verbs.

get_sin_final_verb(Skey, ps, P, N, fu, SW) :-
    sfv([Skey], _, _, _, _, _, APR, _, WD),
    verb_posfix(APR, ps, P, N, fu, ADD, REM),
    atom_chars(WD, WDL), atom_chars(ADD, ABL),
    atom_chars(REM, RBL), append(WDL1, RBL, WDL),
    append(WDL1, ABL, NWDL), concat_atom(NWDL, SW),
    write(SW).

The above code shows how Prolog generates the Sinhala verb. As a first step, the Sinhala verb and its conjugation form are identified through the Sinhala dictionary. After that, the conjugation rules are identified from the Sinhala rule dictionary. Finally, using all this information, the final Sinhala word is generated.

6.3.5 Sinhala Sentence Composer
The Sinhala sentence composer is used to generate a grammatically correct Sinhala sentence. To generate a suitable sentence, the composer and the Sinhala word generator work together. The composer uses all the previously gathered information for the sentence generation, including the Sinhala morphological generation, the English sentence analysis, the English morphological analysis and the English to Sinhala translation. The Sinhala sentence composer works in the following stages:
1. Generate the subject and the object separately
2. Use the structure of the original sentence and re-generate the corresponding Sinhala sentence
To generate the subject, the object and the verb, the generator uses different mechanisms. The Prolog predicates 'translateSubject', 'translate_simple_sentence_verb' and 'translateComplement' are used to translate the English subject, object and verb into Sinhala. The following sample code shows how the English subject is translated into Sinhala.

translateSubject :-
    load_english_sentence_subject(Sub),
    clearsinsubject,
    clear_sinsubject_word,
    create_sample_code_out(F),
    close(F),
    set_di_code_default,
    set_sin_sub_pncode_default,
    set_case_code_default,
    set_previous_word_default,
    set_sin_complex_sub_pncode_default,
    loadSubwordByword(Sub),
    appendSinhalaSubject,
    appendSinhalaSubjectWD.

6.3.6 Transliteration Module
The transliteration module has been implemented using SWI-Prolog as a finite state transducer (FST). The Prolog file dtrsource.pl is the source file of the transliteration module. Using FSTs, the author has developed two modules for Sinhala transliteration. The Prolog predicate named eng_to_sin_dtr/2 is used for the transliteration. The following code shows the rule of the eng_to_sin_dtr/2 predicate.

eng_to_sin_dtr(In, Out) :-
    convert_word_list(In, ANL),
    printList(ANL, Out).

As the first step, the transliteration module converts the given set of words into a list. After that, it transliterates the input word by word using the FST. In addition, the module uses a character encoding system for the FST. The following sample code shows some of the rules in the FST that represent the Sinhala vowel letters.

initial(1).
final(99).
% ***************************************************************
% Finite State Automata for Sinhala Vowels
% ***************************************************************
arc(50, 62, a, []).
arc(62, 70, e, []).
arc(70, 99, e, [e]).
arc(62, 99, a, [c]).
arc(62, 99, e, [d]).
arc(62, 99, i, [p]).
arc(62, 99, u, [s]).
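The arc/4 facts above define the transitions of the transducer. A minimal sketch of how such arcs might be traversed is given below; the predicate transduce/2 is illustrative only and is not the actual traversal code of the transliteration module, which also performs the character-encoding step.

% transduce(+InputChars, -OutputChars): follow arc/4 transitions from the
% initial state to a final state, collecting the output symbols (illustrative only).
transduce(Input, Output) :-
    initial(S0),
    walk(S0, Input, Output).

walk(S, [], []) :-
    final(S).
walk(S, [C|Cs], Output) :-
    arc(S, Next, C, Out),
    walk(Next, Cs, Rest),
    append(Out, Rest, Output).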


6.3.7 Intermediate Editor
The intermediate editor has been implemented using Java. The intermediate editor is used to improve the translation through human support. The following class header is used to implement the intermediate editor.

public class IEETool extends JFrame implements Runnable {

In addition to the above, the intermediate editor uses two XML files, namely "reldata.xml" and "trasdata.xml", to store the relations and the translated data. Figure 6.1 shows the user interface of the intermediate editor with sample data.

Figure 6.1: The Intermediate Editor


6.3.8 Lexical Resources
The English to Sinhala machine translation system uses four dictionaries, namely the English dictionary, the Sinhala dictionary, the English-Sinhala bilingual dictionary and the concept dictionary. The implementation of each dictionary is given below.

6.3.8.1 English Dictionary
The English dictionary has been implemented through five Prolog data files, namely eng_irr_noun.pl, eng_reg_noun.pl, eng_irr_verb.pl, eng_reg_verb.pl and eng_irr_word.pl. The eng_irr_noun.pl file contains the English irregular noun information. To represent the irregular noun information the author used the Prolog predicate eiw/7. The predicate eiw/7 represents the word ID, word type, person, number, sex, case and the English word. As an example, the following Prolog fact shows the lexical information for the English word 'I'.

eiw(4000001, na, fs, sg, co, sb, 'i').

Table 6.1 shows the codes which are used to implement the grammar notation in the English dictionary.

Table 6.1: Grammatical notations for the English Dictionary

Criteria        Code   Meaning
Person          fs     1st person
                sc     2nd person
                td     3rd person
Number          sg     Singular
                pl     Plural
Sex             ma     Masculine gender
                fe     Feminine gender
                co     Common gender
                no     Neuter gender
Case            sb     Nominative case
                ob     Objective case
                po     Possessive case
                rf     Reflexive (pronoun)*
Verb type       if     Infinitive
                pa     Past
                pp     Past Participle
                rp     Present Participle
                sp     Simple present
Determination   dr     Direct
                id     Indirect
Adjectives      p      Passive
                c      Comparative
                s      Superlative

The following code shows sample data for the English irregular words.

eiw(4000001, na, fs, sg, co, sb, 'i').
eiw(4000001, na, fs, sg, co, ob, 'me').
eiw(4000001, na, fs, sg, co, po, 'my').
eiw(4000001, na, fs, sg, co, po, 'mine').
eiw(4000001, na, fs, sg, co, rf, 'myself').

The English regular nouns are stored in a Prolog file named "eng_reg_noun.pl" using erw/4 Prolog facts. The predicate erw/4 represents the word ID, word type, gender and the English word. The following two samples are regular nouns stored in eng_reg_noun.pl.

erw(1000001, na, ma, 'boy').
erw(1000002, na, fe, 'girl').

The English morphological analyzer reads these Prolog facts and uses them to analyze English words. The English irregular verbs are saved in a file named eng_irr_verb.pl. This file contains the English irregular verbs as Prolog facts of the form eiw/4, representing the word id, word type, tense form of the verb and the English irregular verb.

eiw(5000001, vb, if, 'eat').
eiw(5000001, vb, pt, 'ate').
eiw(5000001, vb, pp, 'eaten').

The English regular verbs are stored in a Prolog file named 'eng_reg_verb.pl'. This file contains English regular verbs as erw/3 Prolog facts. The following code shows how Prolog represents the English regular verbs.

erw(2000001, vb, 'play').

The erw/3 Prolog predicate uses the word id, word type and the word itself for storing the regular verb information. The English morphological analyzer uses this information to analyze English regular verbs. In addition to the above, all the other parts of speech, such as adjectives, adverbs, prepositions, conjunctions and interjections, are stored in the Prolog file named 'eng_irr_word.pl'. The Prolog predicate eiw/4 is used to store all these words. The following code shows how each word is stored in the eiw/4 format. A special notation is used to identify each word type (na-noun, vb-verb, dt-determiner, aj-adjective, av-adverb, pp-preposition, cn-conjunction and uv for auxiliary verbs).

eiw(3000001, dt, id, 'a').
eiw(3000004, aj, p, 'good').
eiw(3000014, av, p, 'badly').
eiw(3000026, pp, v5, 'to').
eiw(3000027, cn, 0, 'and').
eiw(3000029, vb, uv, 'will').

By using the online update module, this English dictionary can be updated automatically. This is the main purpose of separating the English dictionary into several files.

6.3.8.2 Sinhala dictionary
The Sinhala dictionary is used to store all the Sinhala words, the grammar information and the rules which are used to generate Sinhala words. The Sinhala dictionary comprises the Prolog files sin_reg_nouns.pl, sin_irr_nouns.pl, sin_reg_verb.pl, sin_irr_verb.pl, sin_irr_words.pl, sin_case_rules.pl and sin_rule_dic.pl.
The file sin_reg_nouns.pl contains the Sinhala regular noun information. The Prolog predicate sn/9 is used to store all the information of a regular noun. The following sn/9 fact shows the information for 'පිරිමි ළමයා': it gives the word id, person, sex, live code, and the conjugation rules for the singular direct, singular indirect and plural forms, together with the case rule. The relevant rules are stored in the sin_rule_dic.pl file and the sin_case_rule.pl file.

sn([1000001], td, ma, li, s900004, s910000, s910000, s910000, 'පිරිමිළමයා').

The Sinhala irregular nouns are stored in the Prolog file named 'sin_irr_nouns.pl' using the sn/8 Prolog predicate. The sn/8 predicate gives the word id, person, number, sex, live code, direct/indirect form, case and the Sinhala word. The Sinhala dictionary uses Sinhala Unicode to store all the Sinhala words. The following code shows samples of the Sinhala irregular words.

sn([4000001], fs, sg, co, li, dr, v1, 'මම').
sn([4000001], fs, sg, co, li, dr, v2, 'මා').
sn([4000001], fs, sg, co, li, dr, v3, 'මාවිසින්').

The Sinhala noun has nine cases and these cases are represented by the codes v1 to v9. The Sinhala regular verbs are stored in the Prolog file named 'sin_reg_verb.pl' using the Prolog predicate sfv/9. It represents the word id and the verb-form rules for the active and passive voice forms and the other verb (mood) forms.

sfv([5000001], s910001, s910002, s910001, s910001, s910001, s910001, s910001, 'කනවා').

The Sinhala irregular verbs are stored in the Prolog file named sin_irr_verb.pl using the Prolog predicate sfv/6. The sfv/6 predicate represents the word id, person, number, voice, tense and the Sinhala verb. The following code shows samples of the Sinhala irregular verbs.

sfv([8000002], fs, sg, at, pr, 'සිටිමි').
sfv([8000002], fs, pr, at, pr, 'සිටිමු').

All the other Sinhala words, namely Sinhala adjectives, adverbs and prepositions, are stored in a Prolog file named 'sin_irr_word.pl' using the Prolog predicate siw/4. The siw/4 predicate represents the Sinhala word id, type, property and the Sinhala word. The following sample code shows such Sinhala words in the dictionary.

siw([3000034], aj, p, 'අලුත්').
siw([3000015], av, p, 'ෙහමින්').
siw([3000033], pp, v3, 'මගින්').

Several rules are needed to generate a Sinhala noun. These rules are stored in 'sin_rule_dic.pl' and are used to generate the appropriate Sinhala noun form from its base form. The following sample rules are used to generate forms of the Sinhala word 'කපුටා'; they represent the implementation of the Sinhala kaputu ganaya (කපුටු ගණය). More than 100 such rules have been implemented in sin_rule_dic.pl to generate the appropriate Sinhala noun forms.

noun_posfix(s935001, li, bas, 'ටු', 'ටා').
noun_posfix(s935001, li, sds, 'ටා', 'ටු').
noun_posfix(s935001, li, sdo, 'ටා', 'ටු').
noun_posfix(s935001, li, sis, 'ෙටක්', 'ටු').
noun_posfix(s935001, li, sio, 'ෙටකු', 'ටු').
noun_posfix(s935001, li, pds, 'ෙටෝ', 'ටු').
noun_posfix(s935001, li, pdo, 'ටන්', 'ටු').

The noun_posfix/5 predicate is the rule format for the Sinhala noun and it represents the rule id, live code, noun form type, and the parts to add and remove. These rules are the implementation of the Sinhala noun paradigm (conjugation table). In addition to the above, the case rules are used to generate the complete Sinhala noun with the case effect. The case rules are stored in a Prolog file named sin_case_rule.pl. The following code shows sample case rules.

noun_vib_postfix(s910001, v1, '', '').
noun_vib_postfix(s910001, v2, '', '').
noun_vib_postfix(s910001, v3, ' විසින්', '').
noun_vib_postfix(s910001, v4, 'ෙයන්', '').
noun_vib_postfix(s910001, v5, 'ට', '').
noun_vib_postfix(s910001, v6, 'ෙයන්', '').
noun_vib_postfix(s910001, v7, 'ෙයන්', '').
noun_vib_postfix(s910001, v8, ' ෙකෙරහි', '').
noun_vib_postfix(s910001, v9, '', '').

The Prolog predicate noun_vib_postfix/4 gives the rule id, the case, the part to add and the part to remove from the word. The Sinhala morphological generator uses all of these rules to generate grammatically correct Sinhala terms. The sin_rule_dic.pl file also stores the rules which are used to generate Sinhala verbs. The Prolog predicate verb_posfix/7 is used to store the rule id, voice, person, number, tense, and the parts to add and remove from the Sinhala verb. The following sample code shows sample rules for Sinhala verb generation.

verb_posfix(s910001, at, fs, sg, pr, 'මි', 'නවා').
verb_posfix(s910001, at, fs, pr, pr, 'මු', 'නවා').
verb_posfix(s910001, at, sc, sg, pr, 'හි', 'නවා').
verb_posfix(s910001, at, sc, pr, pr, 'හු', 'නවා').
verb_posfix(s910001, at, td, sg, pr, 'යි', 'නවා').
verb_posfix(s910001, at, td, pr, pr, 'ති', 'නවා').
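As an illustration of how such a suffix rule could be applied to a dictionary form, the following sketch strips the 'remove' part from the end of the verb and appends the 'add' part, in the same atom-manipulation style as get_sin_final_verb/6 above. The predicate apply_verb_rule/4 is illustrative only.

% apply_verb_rule(+DictForm, +Add, +Remove, -Inflected): illustrative only.
apply_verb_rule(DictForm, Add, Remove, Inflected) :-
    atom_concat(Stem, Remove, DictForm),   % strip the dictionary ending, e.g. 'නවා'
    atom_concat(Stem, Add, Inflected).     % append the new ending, e.g. 'මි'

% ?- apply_verb_rule('කනවා', 'මි', 'නවා', V).   gives V = 'කමි'.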

6.3.8.3 English-Sinhala Bilingual dictionary
The English to Sinhala bilingual dictionary is used to identify the appropriate Sinhala base word for a given English word. The following code shows the syntax used for storing information in the English-Sinhala bilingual dictionary.

esw(6000006, 1000001, 1000001, na, 'boy', 'පිරිමිළමයා').
esw(6000006, 1000002, 1000002, na, 'girl', 'ගැහැණුළමයා').

The esw/6 Prolog predicate is used to store the appropriate Sinhala base word for a given English base word. The esw/6 predicate gives an id, the English word id, the Sinhala word id, the word type, the English word and the Sinhala word. Using this predicate, all the Sinhala and English words are linked through the English-Sinhala bilingual dictionary.
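As a usage illustration, the Sinhala base word for a given English base word can be retrieved with a simple query over esw/6. The goal below is illustrative only.

% Look up the Sinhala base word and word type for an English base word (illustrative only).
sinhala_base_word(English, Sinhala, Type) :-
    esw(_Id, _EngId, _SinId, Type, English, Sinhala).

% ?- sinhala_base_word(boy, S, T).   gives S = 'පිරිමිළමයා', T = na.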

6.3.8.4 Concept Dictionary
The concept dictionary is used to store context information and relevant semantic information for each word. All the context information is stored in two Prolog data files, namely eng_sin_cons_dic.pl and eng_sin_usage_dic.pl. The eng_sin_cons_dic.pl file contains the context information that is used in the intermediate editor. The Prolog predicate eng_cons_word/3 is used to store these context details. The following sample code shows how data are stored in the concept dictionary.

eng_cons_word(e1000000, s1000000, e1000000).

The eng_sin_usage_dic.pl file is used to store the terms most commonly used on the web. This dictionary is automatically updated by the online update module. As with eng_cons_word/3, the eng_usage_word/3 Prolog predicate is used to store this usage information. In addition to all the above Sinhala resources, a Sinhala corpus is used as a supporting resource to find available word forms. The Sinhala corpus entries are stored using the Prolog predicate sc/1 in a Prolog file named 'sinhalacop.pl'.

sc('අද').
sc('අෙප්').
sc('ජාතික').
sc('ක්රීඩාව').
sc('ඒ').

The present corpus contains 18,613,180 words and these resources were collected from the UCSC Sinhala corpus (LTRL). The Sinhala word generator uses these resources to identify suitable Sinhala word forms directly.
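A word form proposed by the generator can be checked against the corpus with a simple membership test over sc/1. The predicate below is an illustrative sketch, not the actual validation code of BEES.

% Choose the first candidate word form attested in the corpus; fall back to the
% first candidate if none is attested (illustrative only).
best_attested_form(Candidates, Form) :-
    member(Form, Candidates),
    sc(Form), !.
best_attested_form([Form|_], Form).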


6.4 Supporting modules
Three supporting modules have been developed to update the lexical resources, namely the online updater, the Sinhala word generator and the online search module. The implementation details of each module are given below.

6.4.1 Online Updater
The online updater module is used to update the lexical resources. This module can update the English dictionary, the Sinhala dictionary and the English-Sinhala bilingual dictionary automatically. It gets support from the Sinhala word generator and the online search module to carry out the update task. As the first step, the online updater loads all the dictionaries using the following predicate.

consult_eng_dic :-
    consult('c:\\bees3.2\\dic\\eng_reg_nouns.pl'),
    consult('c:\\bees3.2\\dic\\eng_reg_verbs.pl'),
    consult('c:\\bees3.2\\dic\\eng_irr_nouns.pl'),
    consult('c:\\bees3.2\\dic\\eng_irr_verbs.pl'),
    consult('c:\\bees3.2\\dic\\eng_irr_words.pl').

Then the updater uses the online search module to get the grammar information from a set of online resources. For example, the online search module uses the Madhura online dictionary, the Cambridge dictionary, the Sensagent online dictionary and the Yahoo search engine to get the relevant English grammar information. The online updater gets the relevant word information, such as the word type (regular noun, irregular noun, regular verb, irregular verb, adjective, etc.), and then updates each entry. The following sample code is used to update an English regular noun.

update_eng_reg_noun(Word, ID) :-
    write('try to update regular noun'),
    consult('c:\\bees3.2\\dic\\eng_reg_nouns.pl'),
    (   erw(ID, na, _, Word)
    ->  write('English regular noun avilable ('), write(ID), write(')'), nl
    ;   consult('c:\\bees3.2\\updateinfo.pl'),
        (   new_noun(Word, re, Word, _)
        ->  get_new_eng_reg_noun_key(ID),
            open('c:\\bees3.2\\dic\\eng_reg_nouns.pl', append, File),
            write(File, 'erw('), write(File, ID),
            write(File, ', na, no, \''),
            write(File, Word),
            write(File, '\').'), nl(File),
            close(File)
        ;   update_eng_irr_noun(Word, ID)
        )
    ).

The Prolog predicate get_new_eng_reg_noun_key/1 is used to get a new key value for the regular noun. In addition to the above, the following code shows how the module uses a Java program to get online information.

search_cmb_dic(Word, Out) :-
    use_module(library(jpl)),
    write('Call : http://dictionary.cambridge.org ..... '),
    jpl_new('SearchCambDic', [], F),
    jpl_call(F, searchDic, [Word], Out),
    write(Out), nl.

6.4.2 Sinhala Word Generator
The Sinhala word generator is implemented to generate the appropriate Sinhala word forms. It can generate all the word forms for a given noun or verb; these word forms need to be validated against the required rules. The following sample code is used to generate the base form of a given noun.

validate_baseform(WD, P, NP, BASE) :-
    ensure_loaded('c:\\bees3.2\\dic\\sin_rules_dic.pl'),
    noun_posfix(NP, P, bas, AB, RB),
    atom_chars(WD, WDL), atom_chars(AB, ABL),
    atom_chars(RB, RBL), append(WDL1, RBL, WDL),
    append(WDL1, ABL, NWDL), concat_atom(NWDL, BASE),
    write(BASE), nl.

6.4.3 Online Search module
The online search module gets relevant information from web resources. The following sample Java method is used to search for a Sinhala word using the Yahoo search engine.

public static String searchWeb(String word) {
    String outstr = "f";
    try {
        //System.setProperty("http.proxyHost", "10.32.193.254");
        //System.setProperty("http.proxyPort", "3128");
        System.out.println("Connecting to http://search.yahoo.com/");
        FileOutputStream fout = new FileOutputStream("tmp\\yahoo_search.html");
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fout, "ISO-8859-1"));
        String uu = "http://search.yahoo.com/search?p=" + word;
        String resultString = new String(uu.getBytes("UTF-8"));
        String str = sendGetRequest(resultString, "");
        int index1 = -1;
        index1 = str.indexOf("We did not find results");
        if (index1 >= 10) {
            outstr = "n";
        }
        out.write(str);
        out.write(word);
        out.close();
    } catch (Exception e) {
        System.out.println("Connection Error ........" + e);
        outstr = "e";
    }
    System.out.println("Result : " + outstr);
    return outstr;
}

6.5 Summary
This chapter reported the implementation of all the modules and dictionaries. To implement the modules, the author used Java and Prolog technologies. The next chapter discusses how BEES works in four environments, namely as a desktop application, an online translator, a web page translator and a selected text translator.


Chapter 7
BEES IN ACTION
7.1 Introduction
The previous chapter described the implementation of all the modules and dictionaries. BEES has been deployed through several online and standalone applications; this chapter describes these applications. The English to Sinhala machine translation system has been implemented as four applications, namely:
1. BEES as an online translator
2. BEES as a webpage translator
3. BEES as a selected text translator
4. BEES as a desktop application

7.2 BEES as an Online Translator
BEES has been developed as an online translator. This development is primarily based on the use of Prolog Server Pages [23]. The architecture of the web-based BEES (English to Sinhala machine translation system) is shown in Figure 7.1.

Figure 7.1: Web based architecture for the BEES

The web-based system contains four modules, namely the web client, the Apache web server [17], the PSP (Prolog Server Pages) module and the Prolog-based core translation system. Note that the Prolog-based core translation system is a rule-based machine translation system developed using all the functional modules of BEES.

The web browser is the user interface of the system. The Apache web server handles all the web-based transactions of the system. PSP provides facilities to run the Prolog-based system through the web. The Prolog-based system is the core of the machine translation system. Through the PSP scripts, the core system reads the input English sentence that comes from the web client. After the translation, the core machine translation system returns the output Sinhala sentence to the web client. Figure 7.2 shows the user interface of the online BEES [72].
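The PSP scripts themselves are not reproduced here. As a rough illustration of how a Prolog translation core can be exposed over HTTP, the sketch below uses SWI-Prolog's standard HTTP libraries instead of PSP and Apache; the predicate translate_sentence/2 stands in for the BEES translation pipeline and is assumed, not shown.

:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(http/http_parameters)).

:- http_handler(root(translate), handle_translate, []).

server(Port) :-
    http_server(http_dispatch, [port(Port)]).

% Read the English sentence from the 'text' parameter, call the (assumed)
% translation core and return the Sinhala sentence as plain UTF-8 text.
handle_translate(Request) :-
    http_parameters(Request, [text(English, [])]),
    translate_sentence(English, Sinhala),      % assumed stand-in for the BEES core
    format('Content-type: text/plain; charset=UTF-8~n~n'),
    format('~w~n', [Sinhala]).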

Figure 7.2: User interface of the Online BEES


7.3 BEES as a Web Page Translator
BEES has been extended as a web page translator, which can be used to translate a given web page [66]. This section describes how the system translates a given English web page into Sinhala. Figure 7.3 shows the user interface of the web page translator.

Figure 7.3: A web page translator

This system translates a given English web page into Sinhala and shows the output of the translation in a web browser. Figure 7.4 shows the translated output of a sample web page. The process of the translation is given below. Assume that the system reads a simple HTML document. As a first step, the HTML parser [66] analyzes the document and identifies the tags and the text. Consider a simple HTML document part whose title and body text are:

The Rabbit
The Rabbit is a small and herbivorous animal. It lives in the jungle. Rabbit has long and powerful legs.

This HTML source contains several HTML tags and text. "The Rabbit" is a piece of text identified by the HTML parser. The parser sends this text to the translation module, which reads it and tries to translate it. In the sentence analysis stage, the English parser rejects the input text, because it is not a sentence. Therefore, the system tries to identify it as a noun phrase. The English parser recognizes the input text "The Rabbit" as a noun phrase. Then the translation module uses the English to Sinhala word translator, the Sinhala morphological generator and the Sinhala parser, and generates the appropriate Sinhala translation as "ydjd". The following shows how the translation module works for a complete sentence. Assume that the translation module reads the sentence "The Rabbit is a small and herbivorous animal" as input text. Then the English morphological analyzer reads the input sentence and returns the following.

eng_detm([e1000002], dr, 'the').
eng_noun([e1000077], td, sg, ma, sb, 'rabbit').
eng_verb([e1000057], if, 'is').
eng_detm([e1000001], id, 'a').
eng_adjv([e1000074], p, 'small').
eng_conj([e1000020], 0, 'and').
eng_adjv([e1000076], p, 'herbivorous').
eng_noun([e1000059], td, sg, co, sb, 'animal').

eng_detm/3, eng_noun/6, eng_verb/3, eng_adjv/3 and eng_conj/3 are the Prolog predicates that represent the English words. The English parser then receives this information, analyzes the English sentence and returns the following predicates.

eng_sentence_type(simple, if).
eng_sen_verb([e1000057]).
eng_sen_complement([e1000001, e1000074, …]).
eng_sen_subject([e1000002, e1000077]).
eng_sen_ekeys([e1000002, e1000077, …]).

The English parser identifies the subject, verb and complement of the sentence. It stores this information using Prolog predicates such as eng_sen_verb/1, eng_sen_complement/1 and eng_sen_subject/1. After successful syntax analysis, the word translator translates each input root word into the corresponding Sinhala root word. The word translator returns the following predicates.

estrwords(1001, e1000002, s1000000, dt).
estrwords(1002, e1000077, s1000078, na).
estrwords(1003, e1000057, s1000059, vb).
estrwords(1004, e1000001, s1000000, dt).
estrwords(1005, e1000074, s1000076, aj).
estrwords(1006, e1000020, s1000018, cn).
estrwords(1007, e1000076, s1000077, aj).
estrwords(1008, e1000059, s1000060, na).

The estrwords/4 Prolog predicates represent the bilingual information for each English root word. Using all this information, the Sinhala morphological generator generates the suitable Sinhala words for the corresponding English words.

snoun([s1000078], td, sg, ma, li, dr, v1, 'ydjd').
sin_fverb([s1000059], td, sg, pr, 'h').
sin_adjv([s1000076], 'l=vd').
sin_conj([s1000018], 'iy').
sin_adjv([s1000077], 'Ydl NlaIl').
snoun([s1000060], td, sg, co, li, id, v1, 'isjqmdfjla').

Using all this information, the Sinhala parser generates the appropriate Sinhala sentence as "ydjd l=vd iy Ydl NlaIl isjqmdfjla h'". After the successful translation, the HTML parser reads this translated text and composes a corresponding web page. Using this interface the user can see the original English web page and the translated Sinhala web page separately. Figure 7.4 shows the output web interface of the web page translator.


Figure 7.4: BEES as a web page translator

7.4 BEES as a Selected Sentence Translator
As an improved version of the online BEES, the author has developed BEES as a selected sentence translator. The selected sentence translator is a client application that runs on the client machine, while the translation process runs through the online translator [61]. This client tool has been implemented as a VB application and the online connection is created through a Winsock client. Figure 7.5 shows the user interface of the selected sentence translator. Using this tool the user can translate a sentence just by selecting it. This application is very useful for readers who want to translate a sentence while it is being read. Figure 7.6 shows a desktop screen illustrating how this tool gives the translation for the selected sentence "we gave a new book to your friend".

Figure 7.5: Selected sentence translator

Figure 7.6: Desktop screen for selected sentence translation


7.5 BEES as a Desktop Application
BEES has also been designed as a desktop application. This section describes how the system works for a given input sentence. The translation system uses seven modules to process the translation. To start the translation, the system reads the input sentence from the GUI and starts the translation process. As the first step, the English morphological analyzer reads the input English sentence word by word and provides the morphological information for each word. Then the English parser analyzes the input English sentence by reading the above morphological information together with the input sentence. Consequently, the English to Sinhala base word translator translates the English base words into the appropriate Sinhala base words. This process is rather complex and uses two supporting dictionaries, namely the English-Sinhala bilingual dictionary and the concept dictionary. As the first step, the English to Sinhala base word translator uses the English-Sinhala bilingual dictionary and reads the available Sinhala base words for the given English base word. If there are multiple candidate words in the bilingual dictionary, the system looks up the relevant information in the concept dictionary to identify the most suitable Sinhala base word; the concept dictionary stores concept information for each Sinhala word. Otherwise, the English to Sinhala base word translator gives the most commonly used Sinhala base word for the given English base word. After successful base word translation, the Sinhala parser (sentence composer) generates the appropriate Sinhala sentence with the support of the Sinhala morphological generator. The Sinhala morphological generator generates the appropriate Sinhala words from the translated Sinhala base words and the given grammar information. The Sinhala parser uses these generated Sinhala words to compose a grammatically correct Sinhala sentence. Figure 7.7 shows the user interface of BEES. The translation system works in two modes, namely the user mode and the expert mode. If the system runs in expert mode, it assumes that the user is an expert in both languages. Therefore, the intermediate editor automatically provides facilities to change the sentence through intermediate editing, as needed for semantic handling. In addition, the system also updates the lexical resources automatically.

Figure 7.7: User interface of the BEES

When the translation system runs in user mode, the Intermediate Editor appears only when the user asks to change the sentence. The following sample data shows how a translation is processed. Assume that the system reads "The good boy and his old mother are reading books" as the input sentence. The English Morphological Analyzer then returns the following output.

% Auto generated output
% **********************
eng_input_sen_list(['the', 'good', 'boy', 'and', 'his', 'old', 'mother', 'are', 'reading', 'books', []]).

eng_detm([3000003], dr, 'the').
eng_adjv([3000004], p, 'good').
eng_noun([1000001], td, sg, ma, sb, 'boy').
eng_noun([1000001], td, sg, ma, ob, 'boy').
eng_conj([3000027], 0, 'and').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_adjv([3000035], p, 'old').
eng_adjv([3000062], p, 'mother').
eng_noun([1000025], td, sg, no, sb, 'mother').
eng_noun([1000025], td, sg, no, ob, 'mother').
eng_verb([5000026], if, 'are').
eng_verb([3000030], uv, 'are').
eng_verb([5000008], rp, 'reading').
eng_noun([1000004], td, pr, no, sb, 'books').
eng_noun([1000004], td, pr, no, ob, 'books').

After the syntax analysis, the English Parser returns the following:

eng_sen_verb([3000030, 5000008]).
eng_sen_complement([1000004]).
eng_sen_subject([3000003, 3000004, 1000001, 3000027, 4000004, 3000027, 4000004, 3000035, 1000025]).
eng_sen_predicate([3000030, 5000008, 1000004]).
eng_sen_type(declarative).
eng_sen_ekeys([3000003, 3000004, 1000001, 3000035, 1000025, 3000030, 5000008, 1000004]).
eng_sen_tence(presentcontinus).
eng_sen_result(sucess).

Using all of this information, the English to Sinhala Base Word Translator returns the suitable Sinhala terms. The following listing shows the result of the English to Sinhala base word translation.

estrwords(1001, 3000003, 3000000, dt).
estrwords(1002, 3000004, 3000004, aj).
estrwords(1003, 1000001, 1000001, na).
estrwords(1004, 3000027, 3000027, cn).
estrwords(1005, 4000004, 4000004, na).
estrwords(1006, 3000035, 3000035, aj).
estrwords(1007, 1000025, 1000045, na).
estrwords(1008, 3000030, 3000030, uv).
estrwords(1009, 5000008, 5000008, vb).
estrwords(1010, 1000004, 1000004, na).

The Sinhala Morphological Generator then generates the suitable Sinhala words with full grammatical information. The output of the Sinhala morphological generation is as follows.

sin_adjv([3000004], '').
snoun([1000001], td, sg, ma, li, dr, v1, ' ').
sin_conj([3000027], '').
snoun([4000004], td, sg, ma, li, dr, v7, '').
sin_adjv([3000035], '').
snoun([1000045], td, sg, no, nl, dr, v1, '').
sin_sub_info([3000004, 1000001, 3000027, 4000004, 3000035, 1000045]).
sin_sub_word([, , , , , , []]).
sin_fverb([5000008], td, pr, pr, ' ').
sin_veb_info([5000008]).
sin_veb_word([ , []]).
snoun([1000004], td, pr, no, nl, dr, v2, '').
sin_cmp_info([1000004]).
sin_cmp_word([, []]).

Finally, the Sinhala Parser generates the corresponding Sinhala sentence "හොද පිරිමි ළමයා සහ ඔහුගේ වයසක මව පොත් කියවමින් සිටිති".


7.6 Summary
This chapter described how BEES works in four environments: as an online application, as a web page translator, as a selected sentence translator and as a desktop application. The next chapter reports how the system was evaluated to determine the accuracy of the English to Sinhala machine translation.


Chapter 8 EVALUATION
8.1 Introduction
The approach and the implementation stages were discussed in the preceding chapters. This chapter describes the evaluation of the approach, based on the hypothesis formulated to test whether BEES is able to translate English text into Sinhala. The chapter also reviews existing evaluation methodologies for machine translation and presents our approach to evaluating the English to Sinhala machine translation system.

8.2 Evaluation of MT Systems
Evaluation of machine translation systems has received significant attention in the past few years. In general, a machine translation system can be evaluated in several ways, such as comparison with human translations or comparison among multiple machine translation systems. Several methods are used to evaluate machine translation systems, and they can be categorized into two groups, namely automated evaluation and human-supported evaluation [98]. A number of standard evaluation metrics are available for automated machine translation evaluation, such as BLEU [123], NIST [111] and METEOR [21]. These metrics do not use human support in the evaluation process; they are much faster, easier and cheaper than human evaluation [2]. Most of these techniques are based on n-gram metrics [90]. BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another [123]. It is one of the most commonly used evaluation metrics for statistical machine translation systems; however, it does not provide sentence-level scores [169].
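For reference, the standard definition of BLEU given by Papineni and others [123] combines modified n-gram precisions p_n (usually up to N = 4) with a brevity penalty BP:

BLEU = BP * exp( sum_{n=1..N} w_n * log p_n ),    where BP = 1 if c > r, otherwise exp(1 - r/c)

Here w_n are the n-gram weights (typically 1/N), c is the length of the candidate translation and r is the length of the reference translation.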

METEOR is another evaluation metric that automatically evaluates the output of machine translation engines by comparing it to one or more reference translations [21]. It has been designed to explicitly address the weaknesses of the BLEU metric. On the other hand, round-trip translation [139] is a traditional approach to evaluating machine translation systems: a word, phrase or text is translated into another language and the result is translated again at least once, without reference to the original text, until it ends up back in the language it started in [162]. Note that many researchers agree that these automated evaluation techniques are more suitable for closely related language pairs such as Sinhala-Tamil [157] and English-German. However, BLEU-type automated evaluation techniques are not suitable for structurally different language pairs such as English-Hindi [12]. In addition, Goyal and others [52] have noted that languages such as Hindi need more evaluation criteria than a single-question evaluation ("Is the translation good?" Yes/No). They mention that answers to several questions are needed to complete the evaluation, such as whether gender and number are properly translated, whether the tense of the translated sentence is correct, and whether the voice of the sentence (active or passive) is properly translated. Further, the Sinhala language is closely related to Hindi, and both languages share similar linguistic properties. Therefore, the evaluation methodology of BEES is based on the above factors. Traditionally, the evaluation of machine translation systems has been performed with human support. This is complex and time-consuming; however, the results of human evaluation are more reliable than those of automatic evaluation. Therefore, many machine translation system developers have used black-box and white-box testing techniques to evaluate their systems with human support. Among others, Goyal and Lehal [52] proposed a human-supported approach to evaluate their Hindi to Punjabi machine translation system. To evaluate it, they selected more than 100,000 sentences from newspaper articles, official language quest and blogs, and used 50 people.

Scoring was done based on the degree of intelligibility and comprehensibility. A four-point scale was used: the highest point was assigned to a perfect translation and the lowest point to an unintelligible sentence. Error analysis is another important factor in the evaluation of machine translation systems. Errors are analyzed through the Word Error Rate (WER) and the Sentence Error Rate (SER). Word error rate is a common metric of the performance of a speech recognition or machine translation system. Word error rate and sentence error rate can then be computed as (standard definitions):

WER = (S + D + I) / N

where S, D and I are the numbers of substituted, deleted and inserted words with respect to the reference translation and N is the number of words in the reference, and

SER = (number of sentences containing at least one error) / (total number of sentences).

Considering the above facts, the author has developed an evaluation methodology for the English to Sinhala machine translation system.

8.3 BEES Evaluation
The English to Sinhala machine translation system has been evaluated through the following three stages:
1. A white-box testing approach was conducted, and each module in the machine translation system was tested through the developed testing tools (module testing).
2. The system performance was evaluated and the error rates were calculated through the evaluation test bed (performance testing).
3. An intelligibility and accuracy test was conducted with human support (accuracy testing) [61].


8.4 Stage 1: Module Testing
The English to Sinhala machine translation system contains six modules that directly support the translation, namely the English Morphological Analyzer, the English Parser, the English to Sinhala bilingual base-word translator, the Sinhala Morphological Generator, the Sinhala sentence generator and the transliteration module. The author has designed and developed test tools for each module and tested each of them. These tools have been developed as online systems that are available on the BEES web site [72].

8.4.1 English Morphological Analyzer
The English Morphological Analyzer analyzes English words and gives the morphological information for each word. To test it, the author implemented an online version, which gives morphological information for the given English word(s). Using this online English morphological analyzer, each type of word was tested through the created test plan; the complete evaluation test plan is attached in appendix A. The English Morphological Analyzer has been successfully tested using more than 50 test cases. Table 8.1 shows a sample test plan for English regular nouns (the complete test plan is attached at the end of the thesis), and figure 8.1 shows the user interface and the output of the English morphological analyzer.

Table 8.1: Sample test plan for English Morphological analyzer

No | Test case                            | Grammar              | Morphological structure | Base word | Example
1  | Morphological rules for English Noun | Singular noun        | Base word               | boy       | boy
2  |                                      | Plural noun          | Base + s                | boy       | Boys
3  |                                      | Plural noun          | Base + es               | class     | Classes
4  |                                      | Plural noun          | Base - y + ies          | baby      | Babies
5  |                                      | Plural noun          | Base - f + ves          | knife     | Knives
6  |                                      | Singular possessive  | Base + 's               | Home      | Home's
7  |                                      | Plural possessive    | Plural + '              | boy       | Boys'
8  |                                      | Singular noun        | Verb base + er          | play      | player
9  |                                      | Plural noun          | Verb base + ers         | play      | players
10 |                                      | Singular noun        | Verb base + ment        | Pay       | payment
11 |                                      | Plural noun          | Verb base + ments       | Pay       | payments

8.4.2 English Parser
The English Parser analyzes the English sentence by using the output of the English Morphological Analyzer. An online test tool has been developed to test all the functionality of the English Parser, and the parser has been tested through the created test plan. The parser is able to handle simple as well as complex sentences of the declarative, interrogative and imperative types, and it returns the syntactic information of the given English sentence. The author has successfully tested more than 500 sentence patterns through the developed test tool. Some selected test cases are shown in table 8.2.

Figure 8.1: English Morphological analyzer with test results

Table 8.2: Sample test plan for English parser

No | Pattern                    | Example
1  | Simple present             | A boy reads a book
2  | Present continuous         | I am writing a new book
3  | Present perfect            | Good boys have read the books
4  | Present perfect continuous | I have been writing a book
1  | Simple past                | I gave a book
2  | Past continuous            | I was giving a book
3  | Past perfect               | I had written a book
4  | Past perfect continuous    | The boy had been giving a book
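The patterns above are recognised by the parser from the morphological facts shown in section 7.5. As a flavour of how such patterns can be expressed in Prolog, the following definite clause grammar accepts the simple-present pattern of test case 1; it is a toy grammar for illustration only and is not the grammar used by BEES.

% Toy DCG for the pattern "a boy reads a book".
sentence(s(NP, VP))        --> noun_phrase(NP), verb_phrase(VP).
noun_phrase(np(Det, Noun)) --> determiner(Det), noun(Noun).
verb_phrase(vp(Verb, NP))  --> verb(Verb), noun_phrase(NP).
determiner(det(a)) --> [a].
noun(n(boy))       --> [boy].
noun(n(book))      --> [book].
verb(v(reads))     --> [reads].

% Example query:
% ?- phrase(sentence(Tree), [a, boy, reads, a, book]).
% Tree = s(np(det(a), n(boy)), vp(v(reads), np(det(a), n(book)))).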

8.4.3 English to Sinhala Base Word Translator
The English to Sinhala Base Word Translator provides a suitable Sinhala base word for the given English base word. The translator uses the following rules to generate the appropriate Sinhala base word (a sketch of this selection logic is given after the list):
• Find the suitable Sinhala base word from the bilingual dictionary with the full grammatical mapping (if two or more words are available in the bilingual dictionary, the system uses the concept dictionary to find the most suitable Sinhala base word).
• If the grammatical mapping is not satisfied, the system uses the Intermediate Editor.
• If there is no corresponding Sinhala word for the given English base word in the bilingual dictionary, the system uses the corresponding Sinhala transliteration.
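A minimal sketch of this selection logic is given below. The predicates bilingual/2, concept_match/2 and transliterate/2 are hypothetical stand-ins for the bilingual dictionary, the concept dictionary and the transliteration module; the real module also passes grammatical information and can hand control to the Intermediate Editor.

% Hedged sketch of Sinhala base-word selection (hypothetical predicates).
select_base_word(_Context, EngBase, SinBase) :-
    findall(S, bilingual(EngBase, S), [SinBase]), !.       % exactly one dictionary entry
select_base_word(Context, EngBase, SinBase) :-
    findall(S, bilingual(EngBase, S), Candidates),
    Candidates = [_, _|_],                                 % several candidates:
    member(SinBase, Candidates),
    concept_match(Context, SinBase), !.                    % disambiguate via the concept dictionary
select_base_word(_Context, EngBase, SinBase) :-
    \+ bilingual(EngBase, _),                              % no entry at all:
    transliterate(EngBase, SinBase).                       % fall back to transliteration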


To evaluate the English to Sinhala Base Word Translator, the author implemented a test tool to test the functionality of the bilingual translator. The English to Sinhala bilingual base-word translator has then been tested through the created test plan.

8.4.4 Sinhala Morphological Generator
The Sinhala Morphological Generator is the key module of the English to Sinhala translation system. It generates the required word form for a given Sinhala base word. Using this module, a testing tool has been created that generates all the forms of a given Sinhala base word. The Sinhala language contains a large number of conjugation forms for nouns and verbs; our Sinhala Morphological Generator handles 85 grammar rules for Sinhala nouns and 36 grammar rules for Sinhala verbs. A sample conjugation table for Sinhala nouns is attached in appendix B. All these rules are implemented using fundamentals of Sinhala grammar such as Prakurthi, Nama and Kriyagana [41] [88]. Table 8.3 shows a sample conjugation table for the Sinhala noun class "Ethganaya" (ඇත් ගණය).

Table 8.3: Sample Sinhala Morphological rules (ඇත් ගණය)

ගණය       | ලිංගය | ප්‍රකෘතිය | Example
ඇත් ගණය | පු     | ඇත්      | ඇතා
ඇත් ගණය | පු     | කොක්    | කොකා
ඇත් ගණය | පු     | ගොන්    | ගොනා
ඇත් ගණය | පු     | නිකම්    | නිකමා
ඇත් ගණය | පු     | කිඹුල්    | කිඹුලා
ඇත් ගණය | පු     | මිනිස්    | මිනිසා
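The pattern visible in Table 8.3 (ඇත් → ඇතා, ගොන් → ගොනා, and so on) replaces the final hal mark of the base with the vowel sign ා. One such rule might be written as below; the predicate name is an assumption, and the actual generator covers all 85 noun rules with their case, number and definiteness variants.

% Hedged sketch: one conjugated form of an "Eth gana" masculine noun.
% e.g. eth_gana_form('ඇත්', F) gives F = 'ඇතා'.
eth_gana_form(Base, Form) :-
    sub_atom(Base, 0, _, 1, Stem),     % drop the final character (the hal mark ්)
    atom_concat(Stem, 'ා', Form).      % append the vowel sign ා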

To test the Sinhala morphological generator, the author has implemented a "Sinhala word conjugator", which gives all the Sinhala word forms for a given Sinhala word. Figure 8.2 shows how the Sinhala word conjugator runs in the SWI-Prolog [143] interface. The complete set of rules used to implement the Sinhala word generation is attached at the end of the thesis.

Figure 8.2: Sinhala word conjugator

8.4.5 Sinhala Sentence Composer
The structures of Sinhala and English sentences differ from each other; therefore, an English sentence cannot always be mapped directly onto a Sinhala sentence, especially for the passive voice and the perfect forms. The Sinhala sentence composer composes a grammatically correct Sinhala sentence from the given Sinhala subject, object, verb phrase and tense pattern. Each of the corresponding English sentence patterns has been tested through the test plan.
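Since Sinhala is a subject-object-verb language, composing the surface sentence from already-generated word forms largely amounts to ordering the phrases and concatenating the words. The sketch below assumes the phrases arrive as lists of fully conjugated words; it is illustrative only and ignores the passive-voice and perfect-form adjustments mentioned above.

% Hedged sketch of SOV sentence composition from conjugated word lists.
compose_sentence(SubjectWords, ObjectWords, VerbWords, Sentence) :-
    append(SubjectWords, ObjectWords, Front),
    append(Front, VerbWords, AllWords),
    atomic_list_concat(AllWords, ' ', Sentence).   % join the words with single spaces

% Example:
% ?- compose_sentence(['මම'], ['පොත්'], ['ලියමි'], S).
% S = 'මම පොත් ලියමි'.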


8.4.6 Transliteration Module
The transliteration module is used to transliterate English text into Sinhala. To test all the functionality of the transliteration module, an online tool has been implemented, and the module has been tested by using this transliteration tool.

8.5 Stage 2: Performance Testing
After evaluating each module in the English to Sinhala machine translation system, an evaluation test bed has been implemented as an experimental setup [68]. The evaluation test bed contains a limited number of words (100 nouns, 50 verbs, 50 adjectives, 50 adverbs, determiners, and some auxiliary verbs for the tenses). Using the evaluation test bed, the performance of the translation system, the Word Error Rate and the Sentence Error Rate of the system have been calculated. Figure 8.3 shows the user interface of the evaluation test bed. Using the test bed, anyone can make a sentence from the available words. After a Sinhala sentence has been generated, the test bed shows the evaluation form, which contains the following questions to evaluate the translation:
• Subject-verb agreement (correct/incorrect)
• Tense of the sentence (correct/incorrect)
• Word conjugation (all correct/some correct/all incorrect)
• Word order in the sentence (correct/incorrect)
• Meaning of the translated sentence: 0 - Error, 1 - Meaningless, 2 - Basically OK, 3 - Perfect

This evaluation form is used to evaluate the English to Sinhala machine translation system. Figure 8.3 shows the online evaluation test bed and figure 8.4 shows the user interface of the online evaluation form.

Figure 8.3: User interface of the evaluation test bed


Figure 8.4: Online evaluation form

8.6 Stage 3: Accuracy Testing
The previous two stages were used to check each module of the translation system and to calculate the performance of the system. To evaluate the accuracy and the intelligibility of the translation system, the following three steps were followed.
1. 200 sample sentences were collected and grouped into 20 sets (10 sentences per group).
2. Each sentence was translated using BEES.


3. Each set of sentences was given to a human translator, and each sentence was scored according to the following criteria (the same as in the evaluation form of the evaluation test bed):
• Subject-verb agreement (correct/incorrect)
• Tense of the sentence (correct/incorrect)
• Word conjugation (all correct/some correct/all incorrect)
• Word order in the sentence (correct/incorrect)
• Meaning of the translated sentence: 0 - Error, 1 - Meaningless, 2 - Basically OK, 3 - Perfect

The accuracy and the performance of the system have been calculated from all the above results.

8.7 Results of the Experiments
To obtain the results, 200 sample sentences were used. The following list shows some sample sentences and the Sinhala translation of each. A sample evaluation form and sample evaluators' comments are attached in appendices C and D.
1. I write books - මම පොත් ලියමි
2. I am writing a new book - මම අලුත් පොතක් ලියමින් සිටිමි
3. I have written a new book - මම අලුත් පොතක් ලියා ඇත්තෙමි
4. We have written new books - අපි අලුත් පොත් ලියා ඇත්තෙමු
5. A good boy and his mother have been reading new books - දක්ෂ පිරිමි
