English to Sanskrit Translator and Synthesizer (ETSTS)

International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 2, Issue 12, December 2012)

Sarita G. Rathod¹, Shanta Sondur²

¹,²Information Technology Department, VESIT, Mumbai University, Maharashtra, India

Abstract— Developing a machine translation system for an ancient language such as Sanskrit is a fascinating and challenging task. The proposed algorithm uses the traditional dictionary- and rule-based approach to translate a source English sentence into a target Sanskrit sentence. The system has two modules: a text-to-text translator module and a text-to-speech synthesizer module. Because Sanskrit is morphologically rich, the system uses morphological markings to identify the subject, object, verb, preposition, adjective and adverb in a sentence. This paper presents an approach for translating well-structured English sentences into well-structured Sanskrit sentences.

Keywords— Bilingual dictionary, Formant Synthesizer, Natural language processing, Machine translation, Parser, Rule based dictionary approach, Synthesizer, Translator.

I. INTRODUCTION

Machine translation has been defined as the process that uses computer software to translate text from one natural language to another. This definition involves accounting for the grammatical structure of each language and using rules and grammars to transfer the grammatical structure of the source language (SL) into the target language (TL). Machine translation into Sanskrit is never an easy task because of the structural vastness of its grammar, but that grammar is well organized and is considered less ambiguous than that of other natural languages.

The module presented here belongs to the machine translation domain of natural language processing. This area of artificial intelligence is useful in providing people with a machine that understands the diverse languages spoken by common people, and it gives the user of a computer system an interface with which he or she feels more comfortable.

In the proposed methodology, translation begins by decoding the meaning of the source text in its entirety: the translator must interpret and analyze the text, a process that requires deep knowledge of the grammar, semantics, syntax, idioms, etc., of the source and target languages. The translator needs the same in-depth knowledge to re-encode the meaning in the target language. For synthesis, the generated Sanskrit text is converted into a waveform and played as voice output.

We use the dictionary- and rule-based approach to machine translation for the English-to-Sanskrit translator, and the formant synthesis method for converting text into speech. Comparing the grammars of the two languages, English sentences normally follow subject-verb-object order, while Sanskrit has free word order. For example, an English sentence (ES) and its equivalent Sanskrit sentences (SS) are given below.

ES: He read book (SVO)
SS: Saha pustakam pathati (SOV) OR Pustakam saha pathati (OSV) OR Pathati pustakam saha (VOS)

Thus the Sanskrit sentence can be written in SOV, OSV or VOS order, among others.

Speech is the primary means of communication between people. Speech synthesis is the automatic generation of speech waveforms. The synthesis procedure consists of two main phases. The first is text analysis, where the input text is transcribed into a phonetic or some other linguistic representation; the second is the generation of speech waveforms, where the acoustic output is produced from this phonetic and prosodic information. These two phases are usually called high-level and low-level synthesis.
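The free word order shown in the ES/SS example can be made concrete with a short sketch. It assumes, as the text states, that reordering the three role-tagged words of the example keeps the sentence grammatical; the function name `word_orders` and the simplified transliteration are illustrative choices, not part of the ETSTS system.

```python
from itertools import permutations

# Role-tagged words from the example sentence "Saha pustakam pathati".
# The S/O/V roles come from the example; the transliteration is simplified.
WORDS = {"S": "saha", "O": "pustakam", "V": "pathati"}

def word_orders(words):
    """Map each order label (e.g. 'SOV') to the sentence written in that order."""
    return {
        "".join(order): " ".join(words[role] for role in order)
        for order in permutations(words)
    }

orders = word_orders(WORDS)
for label, sentence in sorted(orders.items()):
    print(label, "->", sentence)
```

Three words give six possible orders, of which the example lists three (SOV, OSV and VOS). A fixed-order language such as English would accept only the SVO entry.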

II. APPROACHES USED

Three main approaches are generally used for translation:
1. Rule based
2. Statistical based
3. Example based

The rule-based method of machine translation is further divided into three types:
1. Transfer-based machine translation: A type of machine translation related to the idea of an interlingua, and currently one of the most widely used. It requires an intermediate representation that captures the "meaning" of the original sentence in order to generate the correct translation.
2. Dictionary-based machine translation: The words are translated as they are, using a dictionary.
3. Interlingual machine translation: The text to be translated is transformed into an interlingua, i.e. a representation that is independent of both the source and the target language. The target text is then generated from the interlingua.

Rule-based translation consists of:
1. Analyzing the input sentence of the source language syntactically and/or semantically.
2. Generating the output sentence of the target language from the internal structure.

Each process is controlled by the dictionary and the rules. The strength of the rule-based method is that the information can be obtained through introspection and analysis. Its weakness is that the accuracy of the entire process is the product of the accuracies of its sub-stages.
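The weakness noted above, that overall accuracy is the product of the sub-stage accuracies, can be illustrated numerically. This is a minimal sketch: the stage names and accuracy values are hypothetical, chosen only for illustration, and the multiplication assumes that stage errors are independent.

```python
import math

def pipeline_accuracy(stage_accuracies):
    """Overall accuracy of a pipeline in which every stage must succeed.

    Assumes stage errors are independent, so the accuracies multiply.
    """
    return math.prod(stage_accuracies)

# Hypothetical per-stage accuracies (illustrative values, not measurements).
stages = {
    "tokenization": 0.99,
    "parsing": 0.95,
    "dictionary lookup": 0.90,
    "generation": 0.95,
}
overall = pipeline_accuracy(stages.values())
print(f"overall accuracy: {overall:.3f}")
```

With these assumed values the overall figure is about 0.80, lower than the accuracy of any single stage, which is why errors in early stages matter so much in a rule-based pipeline.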

III. ENGLISH AND SANSKRIT GRAMMAR

English is a well-known and well-structured language; Sanskrit is an ancient language, regarded as the mother of many Indian languages. English sentences follow the subject + verb + object format, while Sanskrit is a free-word-order language. This free ordering does not lead to ambiguity: every sentence obtained by reordering the words of the original sentence retains a grammatical meaning. An example has already been given in Section I.

A. Alphabet
The alphabet in which Sanskrit is written is called Devanagari. The English alphabet has twenty-six characters, while the Sanskrit alphabet has forty-two characters, or varnas. English has five vowels (a, e, i, o, u) and twenty-one consonants, while Sanskrit has nine vowels, or swaras (a, aa, i, ii, u, uu, re, ree and le), and thirty-three consonants, or vyanjanas.

B. Noun
According to Paninian grammar, the declensions or inflections of nouns, substantives and adjectives are derived using well-defined principles and rules. The crude form of a noun (any declinable word) not yet inflected is technically called a pratipadika.

C. Gender
A noun takes one of three genders: masculine, feminine or neuter. Sanskrit has three numbers (singular, dual and plural), whereas English has two (singular and plural).

D. Pronoun
According to Paninian grammar and the investigations of M. R. Kale, Sanskrit has 35 pronouns. These pronouns are classified into nine classes: personal, demonstrative, relative, interrogative, reflexive, indefinite, correlative, reciprocal and possessive. Each pronoun has different inflectional forms.

E. Adverb
Adverbs are either primitive or derived from nouns, pronouns or numerals.

TABLE I
COMPARATIVE STUDY OF ENGLISH AND SANSKRIT ON DIFFERENT BASIS

Basis | English | Sanskrit
Alphabet | 26 characters | 42 characters
Number of vowels | Five | Nine
Number of consonants | Twenty-one | Thirty-three
Number | Two: singular and plural | Three: singular, dual and plural
Sentence order | SVO (subject-verb-object) | Free order
Tense | Three: present, past and future | Six: present, aorist, imperfect, perfect, first future and second future
Mood | Five: indicative, imperative, interrogative, conditional, subjunctive | Four: imperative, potential, benedictive and conditional

F. Verb
There are two kinds of verbs in Sanskrit: primitive and derivative. There are six tenses (Kaalaa) and four moods (Arthaa). The tenses are present, aorist, imperfect, perfect, first future and second future. The moods are imperative, potential, benedictive and conditional. The ten tenses and moods are technically called the ten Lakaras in Sanskrit grammar.

G. Voice
There are three voices: the active voice, the passive voice and the impersonal construction. Each verb in Sanskrit, whether primitive or derivative, may be conjugated in the ten tenses and moods. Transitive verbs are conjugated in the active and passive voices, and intransitive verbs in the active and the impersonal form. In each tense and mood there are three numbers (singular, dual and plural), with three persons in each.
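The conjugation grid described for Sanskrit verbs (ten lakaras, each with three persons and three numbers) can be enumerated as a quick check of its size. The sketch below only counts grid positions for one verb in one voice; it does not generate any actual verb forms, and the person labels and the function name `conjugation_slots` are assumptions made for illustration.

```python
from itertools import product

# Labels taken from the text: six tenses and four moods form the ten lakaras.
TENSES = ["present", "aorist", "imperfect", "perfect", "first future", "second future"]
MOODS = ["imperative", "potential", "benedictive", "conditional"]
NUMBERS = ["singular", "dual", "plural"]
PERSONS = ["first", "second", "third"]  # assumed labels for the three persons

def conjugation_slots():
    """List every (lakara, person, number) slot for one verb in one voice."""
    lakaras = TENSES + MOODS
    return list(product(lakaras, PERSONS, NUMBERS))

slots = conjugation_slots()
print(len(slots))
```

Ten lakaras times three persons times three numbers gives 90 slots per voice, which gives a sense of how much morphological information the lexicon has to carry for each verb.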


IV. OVERALL ARCHITECTURE OF ETSTS

Careful analysis of this module and its possible solutions leads to the following design of the system. The main idea behind dictionary-based machine translation is that an input sentence can be converted into an output sentence by carrying out the simplest possible parse, replacing each source word with its target-language equivalent as specified in a bilingual dictionary, and then rearranging the word order using the grammar rules of the target language. The overall system is shown diagrammatically in Fig. 1, and the following steps are involved in the translation and synthesis performed by the system.

Figure 1. Overall system design of Translator & Synthesizer

A. Input process
This process is the first small step of the translation process. It takes a sentence as input in a text box provided in the GUI of the translation software.

B. Token generation process
This module splits the given sentence into chunks of strings delimited by spaces. These strings may be simple words or compound words coalesced by the rules of English grammar. By applying the rules of English grammar, the module assigns an appropriate category (noun, verb, noun phrase, etc.) to each word and generates a parse tree using the grammar rules of the source language.

C. Vichheda module process
The Vichheda module uses the set of transducers to identify words and forms words through the word generator. The word generator in turn uses the sandhi rules module wherever necessary. The string remaining after the basic word is generated is sent back to the Vichheda module.

D. Morphological analysis process
The lexicon, or database, plays a very important role in morphological analysis, which searches through the lexicon to gather information. This process takes the tokens as input and gathers grammatical information on each token.

E. Translator process
This module performs the actual translation. The input to this module is the parse. It also interacts with the parser/generator module to get the parse of each word. It then generates appropriate target-language equivalents for the morphological details of each word and ultimately presents the sentence in the correct order.

F. Parser/generator and mapping
This process checks whether the input sentence is grammatically correct. The information gathered by the processes above helps in analyzing the grammatical aspects of the sentence, and the assessment is made on the basis of the rules. Mapping is done purely on the basis of the information passed from the morphological module. The parser/generator module contains a set of transducers built for individual Sanskrit words and transforms strings into partial words, which are used by the Vichheda module and the dictionary-based approach. It also gives the parse of the words, which the sentence former uses to produce a structurally correct output sentence.

G. Wave generation
Once the target text is obtained, it is converted into a waveform and played through an output device, i.e. speakers. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity.

V. ALGORITHM

The algorithm developed for the software is as follows.

A. Algorithm for Text to Text Translator Module
Step 1: Take a sentence in English as input.
Step 2: Split the sentence on white space to generate the tokens.
Step 3: Store the tokens in a data structure such as a vector so that they can be processed in a loop.
Step 4: Generate the parse tree using the grammar rules of the source language.
Step 5: After the basic word is generated, send the remaining words back to the Vichheda module.
Step 6: Analyze the sentence on the basis of subject, verb and object.
Step 7: Map the tokens to the grammar of the target language and generate the output Sanskrit sentence.
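The translator steps can be sketched for the paper's three-word example. This is a deliberately reduced illustration, not the ETSTS implementation: the dictionary `BILINGUAL_DICT`, the helper names, and the assumption that the input is a bare subject-verb-object sentence are all hypothetical, and the parsing, Vichheda and morphological stages are collapsed into a positional role assignment.

```python
# Hypothetical three-word bilingual dictionary built from the paper's example;
# a real system would read entries and morphology from a lexicon database.
BILINGUAL_DICT = {"he": "saha", "read": "pathati", "book": "pustakam"}

def tokenize(sentence):
    """Step 2: split the sentence on white space."""
    return sentence.split()

def tag_svo(tokens):
    """Steps 4-6, greatly simplified: assume a bare three-word S-V-O sentence."""
    if len(tokens) != 3:
        raise ValueError("this sketch only handles three-word SVO sentences")
    subject, verb, obj = tokens
    return {"S": subject, "V": verb, "O": obj}

def translate(sentence):
    """Step 7: look up each word and emit the roles in SOV order."""
    roles = tag_svo(tokenize(sentence))
    try:
        target = {role: BILINGUAL_DICT[word.lower()] for role, word in roles.items()}
    except KeyError as missing:
        # Report words that have no dictionary entry instead of guessing.
        raise KeyError(f"word not found in dictionary: {missing}") from None
    return " ".join(target[role] for role in ("S", "O", "V"))

print(translate("He read book"))
```

SOV is used for the output because it is the first order given in the paper's example; since Sanskrit word order is free, the final `join` could emit any of the other orders just as well.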

B. Algorithm for Text to Speech Synthesizer Module
Step 1: The output of the first module, the translator, is the input to the second module, the synthesizer.
Step 2: Perform text and linguistic analysis.
Step 3: Apply the phonetic alphabet to the analyzed text.
Step 4: Generate the waveform and produce the output as speech.

VI. EXAMPLE

Consider the example:
ES: He read book
SS: Saha pustakam pathati

1: The token generator separates each word from the sentence according to English grammar, taking into account the space between two words.

He      read    book
Token1  Token2  Token3

2: A syntax tree represents the syntactic structure of a sentence according to a general grammar. In the tree, the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the grammar. A program that produces such an ordered tree is called a parser.

Figure 2. Tree generated in source language (leaf nodes: He, read, book; interior nodes labeled with non-terminals such as <NP>, <VP> and <noun>)

3: After the parse tree is created, each word has its proper tag; the meaning of each word is then looked up in the English dictionary and the Sanskrit dictionary. If the meaning of a word is not available in the dictionary, the system gives an error message.

4: Following the grammar rules of Sanskrit, the system generates the Sanskrit sentence and, once the proper output sentence is obtained, forwards it to the synthesizer module.

Target sentence: Saha pustakam pathati
He -> Saha (Subject)
Book -> Pustakam (Object)
Read -> Pathati (Verb)

Figure 3. Tree generated in target language (leaf nodes: Saha, pustakam, pathati)

5: After the target sentence is obtained, synthesis techniques convert the sentence into a waveform and produce the output as voice.

VII. RESULT

The objectives described above for the ETSTS system may be implemented using different approaches; here we use only the dictionary-based approach. Separate lexicons for English and Sanskrit may be maintained in a database, with morphological details stored in the form of logic in the programming language used. A bilingual dictionary also has to be maintained in this case. Rules are formed for tokenization and parsing, both of which are implemented in Java. The expected output screen after translation, i.e. after the English sentence is converted into a Sanskrit sentence, is shown in Fig. 4. The expected output screen after synthesis, i.e. after the Sanskrit sentence is converted into a waveform and played through an output device such as speakers, is shown in Fig. 5.
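The paper names formant synthesis for the waveform-generation step of the synthesizer module but gives no details. As a rough illustration of the general technique, the sketch below passes an impulse train through a cascade of two-pole resonators, one per formant, in the style of classic cascade formant synthesizers. The sample rate, pitch, formant frequencies and bandwidths are assumed values for an "a"-like vowel, and nothing here is taken from the ETSTS implementation; text analysis and prosody are omitted entirely.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed value for this sketch

def resonator(signal, freq_hz, bandwidth_hz, rate=SAMPLE_RATE):
    """Two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    c = -math.exp(-2.0 * math.pi * bandwidth_hz / rate)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz / rate) * math.cos(2.0 * math.pi * freq_hz / rate)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def impulse_train(duration_s, pitch_hz, rate=SAMPLE_RATE):
    """Crude voiced source: one unit impulse per pitch period."""
    period = int(rate / pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(round(duration_s * rate))]

def synthesize_vowel(formants, duration_s=0.3, pitch_hz=120.0):
    """Pass the source through one resonator per (frequency, bandwidth) formant."""
    signal = impulse_train(duration_s, pitch_hz)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]  # normalize to the range [-1, 1]

# Rough, illustrative formant values for an "a"-like vowel (assumed, not from the paper).
A_LIKE = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(A_LIKE)
print(len(samples))
```

The resulting list of floats could be scaled to 16-bit integers and written to a WAV file for playback. A full text-to-speech system would drive the formant values over time from the phonetic transcription produced in Steps 2 and 3 of the synthesizer algorithm.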

Figure 4. Expected output screen after translator

Figure 5. Expected output screen after Synthesizer

VIII. CONCLUSION

The main aim of the present paper is the translation of sentences from English to Sanskrit. This may be implemented using different approaches; here we use only the dictionary-based approach to translate simple English texts into corresponding appropriate sentences in Sanskrit, which involves tokenizing the input and applying grammar rules to create a parse tree. Designing a computer-based translator is possible using natural language processing; the design follows the rule-based approach to translation.

This work will be very useful for sharing worldwide knowledge with Indians. The traditional approach to translation is purely dictionary based and translates a sentence by word-to-word comparison. The dictionary-based approach within the rule-based method of MT is feasible, but it requires dictionaries of both languages along with morphological databases of both languages. Speech synthesis has developed steadily over the last decades and has been incorporated into several new applications. For most applications, the intelligibility and comprehensibility of synthetic speech have reached an acceptable level. Formant TTS voices are typically not as natural-sounding as concatenative TTS voices, and we use a formant synthesizer for this stage, but both provide capabilities that pre-recorded audio cannot provide.

REFERENCES

[1] R.M.K. Sinha, A. Jain, "AnglaHindi: English to Hindi Machine-Aided Translation System", International Journal of Computer application, Vol. 3, No. 3, Oct 2008.
[2] Sandeep Warhade and Prakash R. Devale, "Design of Phrase-Based Decoder for English-to-Sanskrit Translation", Journal of Global Research in Computer Science, Vol 3 (1), January 2012, 35-38.
[3] Mishara Vimal, Mishara RB, "Study of Example Based English to Sanskrit Machine Translation", Department of Computer Engineering, Institute of Technology, Banaras Hindu University, Varanasi, India.
[4] Khaled Shaalan, "Rule-based Approach in Arabic Natural Language Processing", International Journal on Information and Communication Technologies, Vol. 3, No. 3, June 2010.
[5] R. Carlson, B. Granström, S. Hunnicut, "A multi-language Text-To-Speech module", ICASSP 82, Paris, vol. 3, pp. 1604-1607.
[6] R. Belrhali, V. Auberge, L.J. Boe, "From lexicon to rules: towards a descriptive method of French text-to-phonetics transcription", Proc. ICSLP 92, Alberta, pp. 1183-1186.
[7] Daniel Jurafsky and James Martin, "Speech and Language Processing- an introduction to Natural Language Processing", reprint 2000.
[8] M. R. Kale, "A Higher Sanskrit Grammar", Delhi, M.Banarassidas Publisher, 1961.
[9] Aksar Bharati, Vineet Chaitanya, Rajeev Sangal, "Natural Language Processing", Prentice Hall of India, New Delhi, June 1994.
[10] Paul Taylor, "Text to Speech Synthesizer", Cambridge University Press, 2009.
[11] A. Varadaraj, "Sandhi Viveka".
[12] Daniel I.A. Cohen, "Introduction to Computer Theory", second edition.
[13] http://en.wikipedia.org/wiki/Speech_synthesis
[14] http://americanhistory.si.edu/archives/speechsynthesis/ss_home.htm