Ulrich Apel National Institute of Informatics Hitotsubashi Chiyoda-ku Tokyo Japan

Electronic Data for the Description of Japanese Kanji — The Analyses of Brush Strokes, Stroke Groups and their Position and the Building of Path Data ...
Author: Ami Blankenship
2 downloads 5 Views 602KB Size
Electronic Data for the Description of Japanese Kanji — The Analyses of Brush Strokes, Stroke Groups and their Position and the Building of Path Data to Display and Search Kanji Ulrich Apel National Institute of Informatics Hitotsubashi 2-1-2 Chiyoda-ku Tokyo 101-8430 Japan [email protected]

Julien Quint National Institute of Informatics Hitotsubashi 2-1-2 Chiyoda-ku Tokyo 101-8430 Japan [email protected]

Abstract Learning to read and to write Japanese is a difficult venture for Japanese and foreigners alike. While Japanese study these culture techniques in a period of many years at school, foreigners who mostly try to learn them beside job or other classes would rely on good teaching material and easy means to look up Japanese characters. Most kanji dictionaries assume that their users already have a lot of information on the character—its stroke count, its radical or pronunciation. This project tries to put together data for easier look up of Japanese characters and better teaching material.

1

Introduction: Kanji as Main Obstacle for Learning Japanese

The modern Japanese writing system is considered as one of the most complex in the world. To describe it Japanese often call it kanji kanamajiri bun ( ). That means, Japanese uses normally the originally Chinese characters—kanji—and kana, i.e. hiragana and katakana, which are cursive or shortened versions of former kanji and which now represent only the sounds of syllables. Hiragana are used for example for grammatical particles and function words. Katakana are used mainly for non-Chinese

loan words and names. Each set of kana consists of 48 different characters. They can be combined with diacritical marks. It is possible to write Japanese totally in kana, but because of the big number of homophones in Japanese this may lead to misunderstandings. While lecture, Japanese professors write kanji on the black board to clarify what they are talking about and Japanese TV makes extensively use of subtitles. Nowadays Latin characters and Roman and Arabian numerals are used along with kanji and kana. In Japanese schools about 1000 kanji are taught with their correct stroke order at pri). mary school (kyôiku kanji, When pupil leave compulsory school, they should know about 2000 kanji (jôyô kanji, ). It is said that educated Japanese know about 3000 kanji. Introduction of electronic writing systems has had two contradictory effects on the usage of kanji. The ability to write kanji by hand has worsened, because now handwriting is only one way to write. On the other hand side, it has become very easy to write even very difficult and exotic kanji if one can key in their pronunciation on the keyboard. Today mainly passive use of kanji is needed. Even for Japanese kanji are difficult. While a longer stay outside of Japan the kanji ability of Japanese decreases. If Japanese are asked to write difficult kanji they are

–1–

not sure about the correct stroke order. This uncertainty is recognisable in the script. For foreigners learning Japanese it is very difficult to learn, remember, read and write kanji. The subject is of course difficult, but to make things worse, teaching material on kanji in western languages has often too little information how kanji are actually written. This concerns for example stroke direction, possible variations of character forms or their components.

2

History of Kanji and Their Main Building Principles

There are kanji dictionaries with tens of thousands of kanji. The number of kanji which where in actual use in Japan didn't exceed 5,000 or 6,000. The kanji set of older computer fonts contains about 6,400 kanji. Their are supplementary character sets like JIS X 0122 or JIS X 0213, which contain additionally several thousand kanji. Chinese characters or kanji in Japanese were developed in the second millennium before Christ. Kanji were introduced in Japan in the fourth century AD. Together with the kanji many Chinese loan words came to Japan with their original pronunciation adapted to the Japanese phonetical system. Sometimes the same kanji were introduced several times in different compounds from different eras or different regions in China with different pronunciations. Further, in Japan kanji are very often used with their meaning only, and the pronunciation of the corresponding Japanese word is assigned to them. That means that the same kanji may be pronounced differently according to its usage. Kanji are often called ideographs, meaning that the characters refer to an idea rather than their pronunciation. Actually that is not the case for a the great number of kanji. A typology of Chinese characters from the Later

Han Dynasty distinguishes six categories of ). The first four consider kanji (rikusho, the way the kanji are built. Pictographs (shokei, ) are pictures of the thing they are indicating ( “sun”, “moon”, “tree”, “bird” etc.). Diagrammatic characters (shiji, ) represent their meaning symbolically ( “one”, “two”, “above”, “below”). Characters with combined meanings (kaii, ) combine simple characters to indicate a meaning, for example “sun” ( ) and “moon” ( ) combined mean together “bright” ( ). Phonetic characters (keisei, ) have one element that suggests the general area of meaning and the other indicates the sound; for example the character (ume) contains the kanji for “tree” to indicate the meaning and the kanji (“every”) for the sound mai. This is the biggest group of kanji and seems to be still productive in China and Taiwan for characters used in names. The two not yet explained categories deal with extensions of the original meanings (tenchu, ; for example the meaning of “bad“ was extended to mean also “hate“) and characters that are used with a borrowed meaning (kasha, ; for example the character depicting a scorpion is used for the meaning of “10,000”). There are also a few kanji which are invented in Japan, often called kokuji ( ) for example: (tôge, mountain pass), (sakaki, Cleyera japonica), (tsuji, “cross road”, (hata, hatake, “plowed field”). Such characters normally have no Sino-Japanese reading. Developments after WW II lead to a separation of kanji forms in Japan, China, Taiwan and Korea. Japan and mainland China invented shortened forms of kanji (often different forms for different kanji), whereas Taiwan and Korea still use the traditional kanji forms (in Korea use of kanji of course is quite limited). For example in Japan and China (gaku, manabu) became , and

–2–

(ten) became ; in Japan and (ben) are now replaced by . In Japan the Ministry for Education (Monbushô) selected a number of around 2000 kanji for official and general public use. 1958 it published instructions on the stroke order and kanji forms, that should be taught at school for about 900 kanji. Traditionally there was a certain freedom in stroke order and kanji forms, but the Ministry of Education selected one “official” form. Although the Monbushô tried to generalise and use rules for stroke order they still seem quite arbitrary. Calligrapher also stress, that the proposals of the Monbushô don’t use the mainly used character forms or stroke orders (Emori 2003: 8–16).

Most lexica give no information on the stroke order and if they do, only for a small number of characters. Some dictionaries with stroke order don’t use the kanji forms and stroke order considered as the standard by the Monbushô. Normal kanji lexica have little information on variations of the components of kanji. This might be one further reason for the bad handwriting of many western learners of Japanese. Ordinary paper lexica are of little help if one only recognises parts of kanji. One has to recognise the whole kanji to be able to look it up.

4 3

Kanji Strokes, Stroke Groups and Path Project

Problem Description

Although nowadays kanji are very often written on computer or typewriter (wapuro), it is still important to be able to write them by hand. In Japan too, for example letters written by hand are more personal than printed ones, and people with a nice handwriting are considered as smart and well educated. For learners of Japanese writing kanji by hand is still one of the best ways to memorise them. To recognise Japanese written by hand or inscriptions and computer fonts which use characters that were written by hand as models, it is important to be able to recognise the original strokes because in handwriting strokes are often joined up and hardly to recognise as individual strokes. In small fonts sizes and on the computer screen strokes are often hard to distinguish. As kanji lexicons use the stroke count as one criterion to index its contents, it becomes more difficult, to find a certain kanji if one isn’t able to count the strokes.

The project tries to put together data that might lead to better teaching material, easier and if needed more complex ways to look up kanji. It will further allow a number of new ways to display informations on kanji. For this project, kanji are considered as graphic and are analysed in different ways. This new data is combined with existing data. As a very abstract and basic way to analyse characters one could use a graphetic approach which would lead to the recognition of graphemes. Graphemes are the smallest meaning distinguishing units. In the case of kanji this can be for example stroke length and or and ), angle of the (as in stroke (as in , , and if one includes also kana ) or stroke direction (as in and ) or the ending of stroke (as in and ). A more concrete analyses which also takes the act of writing into account would use the brush or pen strokes as basic units of kanji. That is what we have done in this project.

–3–

The stroke form of kanji is predetermined by the tool used in former times, the brush. There are no long strokes from right to left or from the bottom up, what would be against the direction of the hairs of the brush when one writes with the right hand (it doesn’t only to seem more difficult to write kanji with the left hand, left handed persons have even trouble using graphical character recognition for PDAs or computers). The analyses of stroke uses 25 basic forms of strokes. The analyses considers stroke direction, bending of the strokes, their endings (blunt or with a short bend) and so on. Strokes can be grouped together not only to build full kanji but also to smaller units. Kanji dictionary give radicals, but there are other stroke groups too, which frequently occur in kanji. Even in Japanese, there seems to be no technical term for them. A Japanese graphic program for building non-standard kanji ( , gaiji) for example calls them just parts ( , pâtsu). I will call them grapheme elements. Analyses of the strokes and grapheme elements is combined. If one can recognise grapheme elements within a kanji one can use existing data to get the description of its component strokes. Further, one can look for stroke combinations to find grapheme elements. The analyses of the grapheme elements uses mostly existing kanji or given radicals. This might be extended. To display the collected information about a kanji and its components, path data is needed. This data is put together with a graphic program (Adobe Illustrator). The stroke order is identical with the input of the paths. The database for grapheme elements can be used to look up kanji’s components and the basic elements can be added together with copy and paste. The main work is then to put everything at the right place. For later review and to have more flexible data, num-

bers for the stroke order are put beside the strokes.

5

Possible Applications of the Data

The data allows new approaches for search of kanji: • Search for stroke forms • Input on the numbers block of the keyboard in matrix like stile • Search for strokes in the correct stroke order • Search for grapheme elements • Search for stroke forms, radicals and grapheme elements according to there position The data allows new ways to display kanji: • Kanji can be build up according to their stroke order, what might be used in dictionaries or teaching material. (see “Figure 1: Building up a kanji by its strokes according to their stroke order” on page 5) • Practising sheet to write kanji can be generated automatically (see “Figure 2: Example for a practising sheet with arrows for stroke direction and numbers for the stroke order” on page 5) • Animation can be achieved automatically. • Grapheme groups can be highlighted (see “Figure 3: Example for highlighting different grapheme elements with different colours” on page 6) etc. It is easy to put up data about variations in stroke order or kanji form, because most data already exists. The data can be used for better graphical character recognition. It could also be used for software that recognises input faults and explains the user how to corrects them.

–4–

Figure 1:

Building up a kanji by its strokes according to their stroke order

Figure 2:

Example for a practising sheet with arrows for stroke direction and numbers for the stroke order

1 2

5

6

4 7

8 9

12

10 11 13 3

–5–

Figure 3:

Example for highlighting different grapheme elements with different colours

6

Two Example Applications

6.1

Automatic Animation of Kanji Strokes

The path data created with Illustrator was exported into Scalable Vector Graphics format. SVG is an XML and provides a clear description of the graphical data that is very well suited for the task at hand. An introduction to SVG is well beyond the scope of this paper, but we can outline the most relevant features. The graphical description of a kanji consists mostly of an ordered list of strokes. In SVG, we represent a stroke by a path element; for instance, the first stroke of the kanji “ ” is: The d attribute of the path element contains the path data in a compact form. This data is a list of drawing commands that an SVG renderer will execute to draw the path. The path data for every stroke will consist of a sequence of Bézier curves, which are parametric curves defined by four control points. Several paths can be grouped together under a group element, which allows to associated groups of paths (i.e., lists of strokes) with every graphemic elements of a kanji. It

is then possible to deal directly with graphemic elements in the graphic representation of the kanji, in order to highlight such elements (as in figure 3) or link them to other SVG files (e.g. clicking on the left component of “ ” would link to the kanji “ ”.) The SVG data available so far is static. Our goal here is to present it in a dynamic fashion, showing strokes one by one, in the order and the direction in which they should be drawn. SVG provides many facilities for the animation of elements and attributes; namely, we will add an animated child element to every path in the static SVG file to create its animated counterpart. The animate element controls at what time the path is drawn, and what shape it should take. There are two things to consider here. First, the strokes must be shown in the correct order. This order is the one of the static SVG file: paths will be drawn in the order that they appear. Then, they must be drawn in the right direction; this is also given by the definition of the path. A stroke going from left to right will have its starting point at the leftmost position and will end at the rightmost position. The really tricky part is to draw the path progressively; in effect, there is nothing in SVG to do this automatically. A solution is to divide every path in several smaller one, and draw each segment one after the other, giving the impression of an invisible pen actually drawing the kanji. Our division strategy is to segment every curve in a path into a fixed number of elements. That number of element is set to a power of two because splitting Bézier curves in two is very easy to do. Longer strokes will consist of more curves than shorter strokes, and will take longer to draw; the distribution of the control points along the curves makes the animation look quite natural. In the end, animation is controlled by two parameters: the number of segments into which a curve is split and the time between

–6–

the drawing of two strokes. Modifying those values will make the drawing slower or faster, and more or less smooth. The first stroke of our example kanji will now look like shown below. The animation will start at time 0 and last for 0.45 seconds and will iterate over the values given by the values attribute. The d attribute in the path parent element will take these successive values over time.

6.2

Searching by Strokes

6.2.1 A Simplistic Model As outlined previously, there is a variety of search applications that are possible thanks to the kanji data. Most importantly, we describe a way to look up a kanji dictionary by using the stroke and stroke group information. In its simplest form, a kanji can be represented by a list of strokes, in the order in which they must be written. There are in the current database 26 different kinds of strokes, numbered from 1 to 26. Our dictionary is just a list of pairs of kanji and stroke list such as: (

known pattern matching techniques to apply the search request to the more than 6,000 kanji currently in the database: even on a modest computer by today’s standard the results are instantaneous. Although very simplistic, this method yields very interesting results. If one wants to lookup a kanji in a paper dictionary by counting the strokes and/or trying to recognize the radicals, there is often some trial and error. This lookup method doesn't make the problem go away, but its speed makes the trial and error process much faster. A typical request will consist of the first strokes of the character (i.e. a prefix on the stroke list), and the result set will usually show quickly if the user is on the right track by displaying kanji that appear as components of the desired one. If she is on the right track, she can try a new request with more strokes; if not, she can try to identify what went wrong and modify the request accordingly. For example, while looking for , we can start by inputting the first strokes, which are 2, 3, 2, and 2 (2 is a long horizontal bar, 3 is a long vertical bar.) However, we may not be sure about the correct order. Trying different combinations will either give very few results (e.g. 2, 2, 3, 2 or 2, 2, 2, 3)or many results that look very different from the target kanji (3, 2, 2, 2). We can quickly settle on 2, 3, 2, 2 and build up from here. Example. Searching for 2, 3, 2, 2 gives many results, starting with: [2 3 2 2] [2 3 2 2 1] [2 3 2 2 21] [2 3 2 2 2 3] [2 3 2 2 2 3] [2 3 2 2 3 2] [2 3 2 2 8 1]

, (2, 3, 2, 2, 4, 17, 2, 2, 4, 15, 11, 2, 2, 8))

Searching for a kanji consists of preparing a request string which is a list of strokes. If the list of strokes given as a request is contained in the list of strokes for a kanji, then this kanji matches the request. Multiple results can be sorted by increasing length (that is, increasing complexity). We can use well-



–7–

Compare to the first few results for 3, 2, 2, 2: [3 2 2 2 2 10 4 5] [3 2 2 2 3 2 2 2 2] [3 2 2 2 2 14 1 4 4 4] … Note that we display the list of strokes for all the results to help the user memorize the different types of strokes. Even though the result set for 2, 3, 2, 2 has a lot of noise, it is much better than 3, 2, 2, 2 which is clearly not the right direction. We can add a few options to the search, such constraining the matching from the beginning or the end of the list, or both, so that only kanji that match the request exactly are returned. In the example above, we have actually restricted the search from the start. Another option is to allow the user to type kanji in the request. Since complex kanji are composed of simpler ones, we can often recognize some of them. Instead of forcing the user to try to find the correct stroke combination for this component, the corresponding kanji can be input directly. In the simplistic model, a kanji is replaced by its list of strokes to create the request. Example: Searching for , corresponds to the request 3, 11, 2, 2, 4, 17, 2, 2 and yields the results: [3 11 2 2 4 17 2 2] [3 11 2 2 4 17 2 2] [3 11 2 2 4 17 2 2 3 11 3 3 2]

no “holes” in the query (strokes must be consecutive); and strokes must be exact. The first two problems can be solved by adding an option to match the strokes in any order; this increases the recall to the detriment of the precision. However, if the request increases in length, the result set narrows down. The third one can be solved in a similar way, by allowing several candidates for a given stroke (e.g. 1|2|3 would match strokes 1, 2 or 3.) All options can be accommodated by using regular expressions for specifying the request to match (although a much more limited language is available to the user). The result set is sorted by edit distance between the request and the result string. If the simplistic model is efficient and easy to use, it seems that its recall is lacking. For instance, in the example above (searching for ), the of second result is clearly wrong. This happens because we haven’t taken into account the boundaries between components of the kanji, which would effectively prohibit the second kanji from being returned (it has the same list of strokes, but boundaries differ.) Another remaining issue is the fact that some kanji look different in their typeset form and in their handwritten form, so some strokes do not match (this goes back to problem number 3 at the beginning of the section). Currently, the dictionary lists a few different acceptable shapes for some kanji (one being preferred), but this is far from enough. 6.2.3 Interface Considerations

6.2.2 Evaluation and Improvements Although no formal evaluation was conducted so far, the first approach clearly shows limitations. It is still important to specify the strokes in the correct order (as illustrated by the first example); there can be

The last problem is the way that the request must be input. Typing list of stroke numbers is difficult and error-prone, as the used must have a list handy, and may need to leaf through it to find the correct stroke among 26 possibilities, or try to guess by running successive requests. This can be very annoying,

–8–

and is a big obstacle for a first-time user of the system. In a web interface, which is a perfect setup for this kind of project in the framework of Papillon, this could be improved by a graphical user interface, where strokes are selected from a grid. Hyperlinks provide more information about the usage of each stroke, as well as examples where the stroke is used. But more interesting input strategies can be imagined, from “composing” the stroke on a numeric keypad (e.g. going from 4 to 6 to make a horizontal stroke, 7 to 9 to 3 to make a “corner” stroke, etc.) to building a primitive hand-writing recognition system which would need only to recognize the stroke shape (as written with a stylus on a PDA or a table, or a mouse).

7

Prospects of the Project

The data uses now only Japanese characters, but the approach can be used as well for the characters of a Chinese font set, for traditional characters and for the shortened ones, The data can be extended to the kanji of JIS level III and IV or other character sets.

Bibliography Hadumod Bußmann. 1990. Lexikon der Sprachwissenschaft. Kröner, Stuttgart. Emori Kenji. 2003. Kai gyô sô—hitsujun jitai jiten. Sanseido Tokyo Eduardo Fazzioli. 1987. Gemalte Wörter. 214 chinesische Schriftzeichen – Vom Bild zum Begriff. Lübbe, Bergisch Gladbach. John Ferraiolo, Fujisawa Jun and Dean Jackson, editors. 2003. Scalable Vector Graphics (SVG) 1.1 Specification. http:// www.w3.org/TR/SVG11/ Adrian Frutiger. 1978. Der Mensch und seine Zeichen: Schriften, Symbole, Signete, Signale. Fourier, Wiesbaden. Wolfgang Hadamitzky. 1995. Langenscheidts Handbuch der japanischen Schrift – Kanji und Kana 1. Handbuch. Langenscheidt, Berlin et al. Kôdansha Encyclopedia of Japan. 1998. Kôdansha, Tokyo Berthold Schmidt. 1995. Einführung in die Schrift und Aussprache des Japanischen. Buske, Hamburg.

Conclusion Neither the analyses of kanji into strokes or grapheme elements nor animated kanji for the stroke order or the highlighting of grapheme elements are totally new. The problem until now was that for every one of these different approaches one had to build new data more or less from scratch. Our data seems to be very adaptable and can be used for very different application. Probably many of which we haven’t thought ourselves until now.

–9–

Suggest Documents