Nepali-English parallel sentences fragmentation

Kedrova, Galina
Lomonosov MSU, Moscow, Russia
[email protected]

Potemkin, Sergey
Lomonosov MSU, Moscow, Russia
[email protected]

Abstract

We present a new approach to the fragmentation of the sentences of a source text and its translation. Using the intervals between words, rather than the words themselves, as coordinates in the bilingual space enables matching of multi-word units. An implementation of the Viterbi algorithm enables the automatic creation of a dictionary of fragments for use in Example-Based Machine Translation (EBMT).

Keywords: bilingual space, fragment, matching, dynamic programming.

1. Introduction

Proper fragmentation of parallel bilingual texts is an essential, first-priority problem within the broader task of example-based machine translation. Parallel texts can be fragmented at various levels of granularity, from paragraph matching down to word-by-word matching. This paper deals with the fragmentation of parallel sentences selected beforehand; the extraction of parallel sentences is assumed to be already solved in general. Such fragmentation is used, in particular, in commercial systems such as translation memories. The obvious limitation of such systems is that they are practically applicable only to stereotyped texts, such as statutory documents or contracts. On the other hand, similar fragments occur in parallel texts very often, hence the need for the proper allocation of parallel fragments in two sentences, one of which is the "correct translation" of the other. (A correct translation is understood as one performed by a qualified translator and perhaps repeatedly verified, such as translations of the Bible (NAUMOVA, 2005).)

This paper presents a new approach to the fragmentation of sentences, based on the lexical and structural mapping of the fragments of the original and translated sentences. In contrast to known methods, we use the intervals between words as matching points instead of the words themselves. This makes it possible to compare a word compound of the source sentence with a word or a phrase of the translated sentence. Dynamic programming is then used to search for the best fragmentation of each sentence. Selecting weighting factors for each matched interval enhances the quality of fragmentation. The algorithm and experimental results for parallel texts without morphological or syntactic markup are presented.

2. Bilingual space

The parallel corpus can be mapped onto a two-dimensional space (MELAMED, 1999), one axis of which represents the words of the source text and the other the words of the target text, or translation. The distance from the zero point to some token is the number of preceding tokens from the beginning of the text. The size of the token depends on the level of granularity: paragraph, sentence, or word. The order of sentences in both texts usually coincides, so the mapping from the source to the target text at this level is monotonic.

The situation is rather different when we consider the mapping of individual sentences. For sufficiently distant languages, the word orders in the source and in the target sentence usually do not coincide.

3. Matching matrix

The words of two sentences, the source and the target, are placed in the rows and columns of a rectangular table (matrix), which assesses the closeness of these sentences (KEDROVA, POTEMKIN, 2005). The sentences for the example below are chosen from (URDUNEPALIENGLISHPARALLELCORPUS) ling_resources/UrduNepaliEnglishParallelCorpus.htm

The source sentence is "The top money funds are currently yielding well over 9 %." and the target one is: ठूला मुद्रा कोष हरूले अहिले ९ % भन्दा बढी नै प्रतिफल प्राप्त गरिरहेका छन्

[Figure 1: matching matrix. Source words along axis X; target words along axis Y, each annotated with its dictionary glosses (e.g. ठूला: great / big / large); cells holding 1 mark word pairs found in the bilingual dictionary.]

Fig. 1: Matching matrix for two sentences
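The construction of such a matching matrix can be sketched in Python as follows. This is a minimal illustration rather than our program: the function name, the toy dictionary entries, and the shortened target word list are assumptions made for the example, and a cell is set to 1 only when the word pair is listed in the dictionary.

```python
from typing import Dict, List, Set


def build_matching_matrix(
    source: List[str],
    target: List[str],
    dictionary: Dict[str, Set[str]],
) -> List[List[int]]:
    """Return a 0/1 matrix: rows follow target words (axis Y),
    columns follow source words (axis X)."""
    return [
        [1 if s.lower() in dictionary.get(t, set()) else 0 for s in source]
        for t in target
    ]


# Toy dictionary (hypothetical entries, for illustration only).
TOY_DICTIONARY: Dict[str, Set[str]] = {
    "अहिले": {"now", "currently", "today"},
    "९": {"9"},
    "भन्दा": {"over", "most", "more"},
    "बढी": {"over", "more", "greater"},
}

source_words = "The top money funds are currently yielding well over 9 %".split()
target_words = ["अहिले", "९", "%", "भन्दा", "बढी"]  # only part of the target sentence
matrix = build_matching_matrix(source_words, target_words, TOY_DICTIONARY)
```

With this toy dictionary the column of the source word over contains two 1s (rows भन्दा and बढी), so a single source word can be linked to more than one target word.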

In Fig. 1 the words of the source sentence are placed along axis X, and the words of the target sentence along axis Y. Each word is placed according to its position counted from the beginning of the sentence. For example, money is the third word of the source text, so it is located in column 3. A cell of the matrix at the intersection of a source word Ws and a target word Wt is filled with 1 if and only if the pair (Ws, Wt) is listed in the bilingual dictionary; otherwise it is 0. For example, cell (2, 2) indicates the pair of words {अहिले, currently}. In the general case, the value of the cell belongs to the interval [0, 1], depending on the "similarity measure" or "semantic distance" between the source and the target word, as we will discuss later.

4. Separators as the coordinates

In contrast to (MELAMED, 1999), we use the separators (blanks) between the words, not the words themselves, as the matching points in the bilingual space. With this approach, the mapping of a source word onto a target word is a segment with coordinates {(x1, y1); (x2, y2)}, where x1, x2 are the beginning and the end of the target word and y1, y2 are the beginning and the end of the source word, or vice versa. In the case of word-by-word matching, x2 = x1 + 1 and y2 = y1 + 1. Now consider the opportunity to match not only single words but also equivalents of the types (word, phrase), (phrase, word) and (phrase, phrase). In our example such an equivalent might be (भन्दा बढी, well over).
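A minimal Python sketch of this representation is given below. The names Segment, word_segment and phrase_segment are hypothetical, and the sketch assumes that x counts separators in the source sentence and y separators in the target sentence (the opposite assignment works equally well), with separator 0 standing before the first word.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """A mapping segment whose end points are separators, not words."""
    x1: int  # separator before the first source word of the segment
    y1: int  # separator before the first target word of the segment
    x2: int  # separator after the last source word of the segment
    y2: int  # separator after the last target word of the segment
    weight: float = 1.0


def word_segment(i: int, j: int, weight: float = 1.0) -> Segment:
    """Word-to-word mapping: x2 = x1 + 1 and y2 = y1 + 1."""
    return Segment(i, j, i + 1, j + 1, weight)


def phrase_segment(i: int, n_src: int, j: int, n_tgt: int,
                   weight: float = 1.0) -> Segment:
    """Mapping of n_src source words starting at word i onto
    n_tgt target words starting at word j (0-based word indices)."""
    return Segment(i, j, i + n_src, j + n_tgt, weight)


source = "The top money funds are currently yielding well over 9 %".split()
target_tail = ["९", "%", "भन्दा", "बढी"]  # only part of the target sentence

# (well over, भन्दा बढी) as a single phrase-to-phrase segment
well_over = phrase_segment(source.index("well"), 2,
                           target_tail.index("भन्दा"), 2)
```

Because a segment is defined by its end points only, a one-word and a two-word unit are handled in exactly the same way.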

So, the first generalization of the previously used paradigm is that we match sentence segments instead of single words and immediately get the opportunity to align phrases, not only words. The matrix is now transformed into a set of segments that define the mapping of words and phrases. A conflict arises when some segments overlap horizontally or vertically (i.e., the mapping is not one-to-one). For example, the source word over matches two words of the target text, भन्दा and बढी. The tokens most often involved in such conflicts are function words and punctuation marks. Resolution of conflicts, i.e., the exclusion of all words in the conflict except one, is a necessary part of proper fragmentation.

5. Segment weight
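The two weight notions used in this section, the proximity of a word pair and the weight of a chain of adjacent segments, can be sketched roughly as follows. The sketch assumes that each word is given as a sparse vector (a mapping from dictionary entries to numbers); how these vectors are built is not shown. The multiplicative bonus in chain_weight is only one possible way to make a chain weigh more than the sum of its parts, and its value is a hypothetical parameter.

```python
import math
from typing import Dict, Sequence


def proximity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Normalized scalar product of two sparse word vectors.
    The result lies in [0, 1] when all components are non-negative."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def chain_weight(weights: Sequence[float], bonus: float = 0.5) -> float:
    """Weight of a continuous chain of word-to-word segments.
    For two or more segments the result exceeds the plain sum
    (assumed form: sum * (1 + bonus))."""
    total = sum(weights)
    if len(weights) < 2:
        return total
    return total * (1.0 + bonus)
```

For instance, chain_weight([0.4, 0.6]) exceeds 1.0, the plain sum of the two weights, whereas a single segment keeps its own weight.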

As we have stated, the measure of (semantic) proximity between a pair of words is a normalized value lying between 0 and 1. For words one of which is the most statistically probable translation of the other, such as (९, 9) in our example, this measure is high, and for rare equivalents, such as (ठूला, top), it is low. We use the lexical database (LDB) with a superimposed semantic metric (KEDROVA, POTEMKIN, 2004; POTEMKIN, 2004) to evaluate the semantic proximity between (Russian and English) words. The essence of our method for determining the measure of proximity between two words is the calculation of the normalized scalar product of the two words, represented as a pair of vectors in the space of the bilingual dictionary. The same method could be used for other language pairs, say Nepali-English.

Within the paradigm in which the coordinates are separators (blanks), we replace the notion of proximity measure with the segment weight. For a segment mapping a word onto a word, the weight is the semantic proximity of the two words. For segments that make up a continuous chain, one should take into account the cumulative effect of the merger. Indeed, if two adjacent words of the source text are mapped onto two adjacent words of the target text, the confidence of such a mapping is greater than if the same words were found separately, and therefore the weight of such an interval is greater than the sum of the weights of its components. We assign even more confidence to a segment that maps a source phrase onto a target phrase listed in the bilingual dictionary.

6.


Having assigned weights to all mapping segments, one can perform fragmentation, that is, the mapping of segments of the original sentence onto segments of the target sentence that lie between the already-mapped segments. This is called interpolation between the mapped segments. Among all possible fragmentations it is necessary to choose the best one according to some criterion, such as: a) maximizing the weight of the segments; b) minimizing the total length of the newly found segments; c) maximizing the number of segments, etc.

The words of two sentences can be matched in different sequences. The same sentence can be translated in direct and in reverse word order, and both translations are correct. In a more general case, some groups of words are translated in direct order, others in reverse, and the groups themselves are matched in arbitrary order. The number of variants of fragmentation can be estimated as O(n!), where n is the number of words of the source or target sentence. However, if we consider only the monotonic mapping (i.e., the word order of the source and the target sentences is the same), the task falls into the class of problems solved by the dynamic programming method. Indeed, the set of segments of a fragmentation can be considered as a path from point (0, 0) to point (m, n), where m and n are the lengths of the source and target sentences, respectively. The path of greatest weight then corresponds to the best fragmentation.

We should keep in mind the basic structure of sentences in the two languages. Nepali is a language with Subject-Object-Verb (SOV) basic structure, while English has Subject-Verb-Object (SVO) basic structure. The favorable feature for us is that the Subject-Object order coincides in both languages. So we can eliminate the main verb from both sentences (are in our example) and try to match the Nepali Subject with the English Subject and the Nepali Object with the English Object.

7. Dynamic programming algorithm
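The search for the heaviest monotonic path can be sketched in Python as shown below. This is a chain formulation over segments rather than a cell-by-cell table, offered as an illustration and not as our program. It assumes segments given as tuples (x1, y1, x2, y2, weight) with separator coordinates, x1 < x2, y1 < y2 and positive weights; the function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int, int, float]  # (x1, y1, x2, y2, weight)
Span = Tuple[Tuple[int, int], Tuple[int, int]]  # ((x1, x2), (y1, y2))


def best_monotone_chain(segments: List[Segment]) -> List[Segment]:
    """Heaviest chain of segments in which every segment starts at or
    after the end of the previous one on both axes."""
    order = sorted(segments, key=lambda s: (s[2], s[3]))
    best: List[float] = []  # best[k]: weight of the best chain ending at order[k]
    prev: List[int] = []    # prev[k]: predecessor of order[k] in that chain, or -1
    for k, (x1, y1, _x2, _y2, w) in enumerate(order):
        best_w, best_p = 0.0, -1
        for p in range(k):
            if order[p][2] <= x1 and order[p][3] <= y1 and best[p] > best_w:
                best_w, best_p = best[p], p
        best.append(best_w + w)
        prev.append(best_p)
    if not order:
        return []
    k = max(range(len(order)), key=best.__getitem__)
    chain: List[Segment] = []
    while k != -1:  # walk back along the stored predecessors
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return chain


def fragments(chain: List[Segment], m: int, n: int) -> List[Span]:
    """Matched segments plus the interpolated gaps between them,
    covering the path from (0, 0) to (m, n)."""
    out: List[Span] = []
    x, y = 0, 0
    for x1, y1, x2, y2, _w in chain:
        if (x1, y1) != (x, y):
            out.append(((x, x1), (y, y1)))  # interpolated fragment
        out.append(((x1, x2), (y1, y2)))
        x, y = x2, y2
    if (x, y) != (m, n):
        out.append(((x, m), (y, n)))
    return out
```

Two conflicting segments can never both enter the chain, because neither of them ends before the other begins on both axes; this is how conflicts are resolved in the sketch.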



    && w[] - array containing weights of mapping segments
    wp[0,0] = 0
    for i = 0 to m
      for j = 0 to n
        if w[i,j] > 0
          wpmax = 0
          for i1 = 1 to i
            for j1 = 1 to j
              if wp[i1,j1] > 0 & wpmax [...]
    [...]
    while i > 0 & j > 0
      && Coordinates of interpolating fragments = {(ppx[i,j], ppy[i,j]); (i, j)}
      i = ppx[i,j]
      j = ppy[i,j]
    endwhile

Upon completion of this algorithm the critical path is constructed; it is shown as a chain of arrows in Fig. 1. The algorithm eliminates conflicts, because only one of the conflicting words is ever chosen. Note that we allow violations of one-to-one mapping, i.e., segments parallel to axis X or Y are permissible. Such a segment means that the source sentence contains a word without a corresponding word in the target sentence, or vice versa. The fragment The does not match any fragment of the target sentence. This fragment, although it has a match नै, is not a part of the critical path and is omitted in translation. The latter case should be processed during the fragment merging described below.

8. Partial inversion

As a rule, the source sentence and its translation, even with coinciding word order, contain a number of inverted fragments. Consider the fragment well over 9 % of our example and its translation ९ % भन्दा बढी. The parts of this fragment pair appear in inverse order. Such a partial inversion should be included in the critical path, but the above algorithm does not allow this. The algorithm was therefore improved to handle such partially inverted fragments before the general fragmentation algorithm is executed. The number of inverted fragments in some sentences is large, so we set an upper limit on the length of the fragment, say four words, and a lower limit on the ratio of their lengths, say 0.5.
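The two limits can be sketched in Python as follows. The function names and the tuple layout (x1, y1, x2, y2) in separator coordinates are assumptions for this illustration, and the defaults simply repeat the example values of four words and 0.5.

```python
from typing import Tuple

Seg = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in separator coordinates


def is_inverted(a: Seg, b: Seg) -> bool:
    """True if a precedes b in the source but follows it in the target."""
    return a[2] <= b[0] and b[3] <= a[1]


def is_admissible_inversion(src_len: int, tgt_len: int,
                            max_len: int = 4,
                            min_ratio: float = 0.5) -> bool:
    """Apply the upper limit on fragment length and the lower limit
    on the ratio of the two fragment lengths."""
    if src_len <= 0 or tgt_len <= 0:
        return False
    if max(src_len, tgt_len) > max_len:
        return False
    return min(src_len, tgt_len) / max(src_len, tgt_len) >= min_ratio


# "well over" -> भन्दा बढी and "9 %" -> ९ % (hypothetical coordinates)
well_over = (7, 7, 9, 9)
nine_percent = (9, 5, 11, 7)
```

Here is_inverted(well_over, nine_percent) holds, and the pair of four-word fragments (well over 9 %, ९ % भन्दा बढी) passes both limits.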

Both of these parameters are specified in the program.

9. Fragments merger

The critical path splits the original sentence "The top money funds currently yielding well over 9 %" into the following fragments:

a) ठूला : The top
b) मुद्रा कोष : money funds
c) अहिले : currently
d) ९ % भन्दा बढी : well over 9 %
e) नै प्रतिफल प्राप्त गरिरहेका

Fragment e) is not matched to any source fragment. If we merge fragments d) and e), the result will be more meaningful, that is:

९ % भन्दा बढी नै प्रतिफल प्राप्त गरिरहेका : yielding well over 9 %

While deciding whether to merge the fragments we adhere to two criteria:

• the lengths of the source and the target fragments should not differ greatly; and

• the ratio of the weight of the segments on the critical path to the total weight of the mapping segments found inside the fragments should not be too small.
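The two criteria can be sketched in Python as a single check. The function name and both threshold values are hypothetical, and treating a fragment without any mapping segments as passing the weight criterion is an assumption of the sketch.

```python
def fragment_is_acceptable(src_len: int, tgt_len: int,
                           path_weight: float, total_weight: float,
                           min_len_ratio: float = 0.5,
                           min_weight_ratio: float = 0.5) -> bool:
    """Return False when a fragment violates one of the two criteria
    and should therefore be merged with an adjacent fragment."""
    longest = max(src_len, tgt_len)
    if longest == 0:
        return False
    # Criterion 1: the lengths should not differ greatly.
    if min(src_len, tgt_len) / longest < min_len_ratio:
        return False
    # Criterion 2: the critical path should carry enough of the weight
    # of all mapping segments found inside the fragment.
    if total_weight > 0 and path_weight / total_weight < min_weight_ratio:
        return False
    return True
```

With these illustrative thresholds, fragment e) (no source words, four target words) fails the length criterion, while the merged pair of five source and eight target words passes it.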

We deliberately use vague definitions because the values of these thresholds should be selected experimentally. These values are parameters of the program.

10. Conclusion

The article suggests a strategy for the fragmentation of parallel sentences. Compared with previous works, we propose the delimiters (spaces) between adjacent words, not the words themselves, as the coordinates in the bilingual space. This allowed us to match the boundaries of the fragments and to use word-to-phrase or phrase-to-phrase alignment. The weight of the segments was defined taking their semantics into consideration. Partial inversion was handled effectively by incorporating inverted fragments in the critical path search. The resulting fragmentation is assessed by structural and semantic criteria. If one of them is violated, we merge adjacent fragments (in the ultimate case all the fragments merge to form the original pair of sentences). Our experiments show that the method makes it possible to match unknown fragments of two sentences. The development of this approach will allow compiling a dictionary of fragments for use in a system of Example-Based Machine Translation (BROWN, 1996).

References

BROWN, R. D., 1996. Example-Based Machine Translation in the Pangloss System. Proceedings of the Sixteenth International Conference on Computational Linguistics (COLING-96), vol. 1, p. 169-174.

KEDROVA, G.E., POTEMKIN, S.B., 2004. Semantic division homonyms using a bilingual dictionary and thesaurus. II International Congress of Russian researchers "Russian language: historical destiny and Modernity", Reports.

KEDROVA, G.E., POTEMKIN, S.B., 2005. Automatic evaluation of the quality of machine translation based on the semantic metrics. Bulletin of Lugansk Shevchenko NPGU, 95(15), p. 35-41.

NAUMOVA, I.O., 2005. Cognate English and Russian phraseological units as historical traces of European intercultural communication. Bulletin of Lugansk Shevchenko NPGU, 95(15), p. 41-47.

MELAMED, I.D., 1999. Bitext Maps and Alignment via Pattern Recognition. Computational Linguistics, 25(1), p. 107-130.

POTEMKIN, S.B., 2004. Lexical database with superimposed semantic metric. II International Congress of Russian researchers "Russian language: historical destiny and Modernity", Reports 2004 M

URDUNEPALIENGLISHPARALLELCORPUS ces/UrduNepaliEnglishParallelCorpus.htm