Text Conversion Tool: Non Unicode to Unicode Text

Jatan Patel, Pushkar Gahlaut
Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar
[email protected]
[email protected]

On-campus Mentor: Prof. Sanjay Chaudhary
[email protected]

Abstract – Though technology has evolved immensely over the years, development in information interchange and content searching for Indian regional languages has remained unsatisfactory. One reason is the font-dependent glyph encoding used when typing in these languages. Another is that research and development for regional languages has never followed a single standard: developers created their own fonts, and with so many incompatible fonts came a corresponding variety of encoding and rendering problems. This lack of standardization is a setback for future development, and the result is chaos in the electronic form of regional-language content. The need of the hour, therefore, is to establish and maintain a standard. Solutions such as ISCII (a coding scheme for representing the various writing systems of India) and many new fonts have been introduced, but none reduced the chaos effectively. Unicode is the ideal solution, given its unique advantages and its global acceptance across all kinds of electronic devices. A Unicode conversion tool is the key to better results in Desktop Publishing (DTP) and, ultimately, to developing digital libraries, magazines, newspapers, newsletters, and similar publications that support searchable text, graphical representation, and information interchange for effective publication of e-content in Indian regional languages.

Keywords – Indian Regional Language, Unicode, Desktop Publishing, Text Conversion, Java, HashMap, Software Development

I. INTRODUCTION

The development of digital catalogues and libraries in Indian regional languages requires effective electronic publishing. E-publishing covers the generation, maintenance, and diffusion of electronic data using computers and information networks, and is defined as "publishing in which all aspects of preparation are carried out electronically". It aims to make peer-reviewed literature available worldwide, completely free and with unrestricted access to users. Desktop publishing software such as Design Studio, Scribus, Adobe InDesign, Adobe PageMaker, Microsoft Publisher, Ventura, QuarkXPress, and CorelDRAW has played a major role in accomplishing the goals of effective e-publishing. Desktop publishing draws on many disciplines, including traditional principles of graphic design, the graphic arts, web design, and typography. Efficient desktop publishing combines typesetting (fonts and the layout of content), page layout, and finally printing the document. Desktop publishing matters for Indian regional languages because it enables quicker communication and efficient e-publishing of documents online, on screen, or in print. DTP software lets users rearrange text and graphics, change fonts, and preview the final design before printing it on paper. The primary requirement for any DTP software to produce proficient publishing is a document built on a highly reliable and widely supported platform.

A huge amount of typed content in localized electronic publications of different regional languages goes unused because of technical obstacles: numerous independent font versions, and compatibility problems between e-content typed two decades ago and constantly advancing technology. Gujarati alone, for example, has a variety of independently prepared MS-DOS based fonts such as Krishna, Avantika, Gopika, Ghanshyam, Saral, and many more. It is almost impossible to efficiently interchange information generated in one font of a language with another font of the same language, owing to the lack of standardization. These languages have many literary documents, including e-books, articles, and other works, typed in MS-DOS based fonts. Such useful documents are under-utilized and unavailable to a large part of the population because the fonts are incompatible with modern operating systems. It therefore became necessary to expose this e-content through reliable, global technologies while keeping future advancement in mind. One way is to convert this e-content from its old fonts into Unicode fonts of the same language, because Unicode meets our purpose. That motivated the development of the 'Non-Unicode to Unicode Conversion Tool'.

II. CHALLENGES

Many software packages have been offered for specific Indian regional languages, but they could not achieve the goals of searchable e-content or information interchange (or both) over computer networks such as the Internet, because they required the same software packages and fonts to be installed on every system. The major difficulties faced by developers are as follows.

Before 1991, type for printing presses was cast in lead; in the electronic era, mainly MS-DOS based fonts were used. Publishers of Indian regional languages came up with many printing fonts, such as 'NATARAJ' for Hindi and 'KRISHNA' for Gujarati.

A. Encoding

ASCII (American Standard Code for Information Interchange) is an industry standard assigning numbers to letters and other characters. As an 8-bit code it provides 256 slots. The ASCII encoding table is divided into three sections:

i. Non-printable characters, 0 to 31.
ii. Lower ASCII, 32 to 127: the 7-bit character table of the American system.
iii. Higher ASCII, 128 to 255: the programmable portion, whose characters depend on the program or operating system; foreign letters are also placed in this section.

Extended ASCII uses 8 bits instead of 7, adding 128 additional characters.
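To make the three sections concrete, here is a minimal Java sketch (illustrative only, not part of the tool described in this paper) that classifies a code value into the sections listed above:

    // Sketch: classify an 8-bit code value into the three ASCII table
    // sections described in this paper.
    public class AsciiSection {
        static String classify(int code) {
            if (code < 0 || code > 255) throw new IllegalArgumentException("not an 8-bit code");
            if (code <= 31)  return "non-printable (0-31)";
            if (code <= 127) return "lower ASCII (32-127)";
            return "higher ASCII / extended (128-255)";
        }

        public static void main(String[] args) {
            System.out.println(classify('A'));  // 65  -> lower ASCII (32-127)
            System.out.println(classify(9));    // tab -> non-printable (0-31)
            System.out.println(classify(200));  // program- or OS-defined section
        }
    }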

Each publication prepared its own fonts for a particular language and typed its content in those fonts. So if your device has one font for a language, you cannot necessarily read e-content typed in other fonts of the same language, because the character sets of those fonts differ even though the language is the same. Standardization of fonts therefore became essential for these languages. Moreover, many articles typed in MS-DOS based fonts render incorrectly on modern operating systems such as Windows XP, Windows 7, and Windows 8. Features like information interchange and searchable e-content are thus impossible without font conversion.

Developers at the Bureau of Indian Standards programmed the extended 8-bit ASCII range to add characters for Indian regional languages, resulting in ISCII (Indian Script Code for Information Interchange), intended to standardize Indian scripts. ISCII has 7-bit and 8-bit character sets. In the 8-bit environment, the lower 128 characters are the same as those defined in IS 10315:1982 (ISO 646 IRV), while the 7-bit coded character set is the same as the ASCII character set. Characters of the ancient Brahmi-derived Indian scripts are defined in the top 128 characters of ISCII. Although ISCII converters were developed, the effort did not achieve the expected success. Because ISCII-based applications involve text processing, many difficulties arise at execution time from the difference between the internal text representation and the display of rendered text. Technical solutions have been recommended for this issue, but the results have not been satisfactory. In addition, ISCII did not get support from other linguistic developers.

B. Font

The aim of font development is to attain smoothness in printing and rendering. In simple terms, a set of characters containing letters, numbers, punctuation marks, glyphs, and special symbols of the same design is called a typeface, while a specific style, pitch, and size of a typeface is known as a font. In 1991, Apple and Microsoft jointly introduced an outline font technology called 'TrueType' (TTF), building TrueType support into all Windows and Macintosh operating systems; users of these operating systems can create documents using TrueType fonts. Commonly used font file extensions include .OTF, .TTF, .FNT (Windows Font File), and .FON (Generic Font File).

Figure 1: Font table of Krishna & Natraj – I

III. UNICODE

The Unicode® Consortium developed an industry-standard character set encoding named 'Unicode', with the aim of providing a Universal Character Set (UCS) that supports all characters from the world's scripts and symbols, whether in common use today or in the past, with scope to add characters in the future. The UCS can support over 1,000,000 characters, and presently around 96,000 characters are encoded, representing a large number of scripts. The benefits of a single UCS and the implementation of Unicode have made it a dominant and omnipresent standard.

A. Goals

Unicode is designed to achieve the following goals:

- Creation of a universal standard covering all writing systems around the globe.
- Efficient encoding that supports simple processing of text.
- Uniform encoding width: each character is encoded in 16 bits.
- Unambiguous encoding, such that a given 16-bit value represents the same character irrespective of where it occurs in the data. [5]

B. Unicode Encodings

Unicode has three encoding forms:

i. UTF-8

UTF-8 uses byte sequences of 1 to 4 bytes to represent the entire UCS. The number of bytes required depends on the range in which the code point lies. It was essential that Unicode characters which are also part of ASCII be encoded exactly as in the ASCII character set, using code units from 0x00 to 0x7F.

ii. UTF-16

UTF-16 uses either one or two 16-bit code units to represent the UCS. As with UTF-8, the number of code units required depends on the range in which the code point lies.

iii. UTF-32

Every code point in UTF-32 is encoded as a single 32-bit integer equal to the scalar value of the code point.

Figure 2: Font table of Krishna & Natraj – II
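To make these variable widths concrete, the following Java sketch (illustrative only, not part of the paper's tool) prints the encoded size of an ASCII letter and of a Gujarati letter in each of the three forms:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Sketch: show how many bytes the same characters occupy in each UTF form.
    public class UtfWidths {
        public static void main(String[] args) {
            String[] samples = { "A", "\u0A97" };        // 'A' and Gujarati letter GA
            Charset utf32 = Charset.forName("UTF-32BE"); // not in StandardCharsets
            for (String s : samples) {
                System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes  UTF-32: %d bytes%n",
                        (int) s.charAt(0),
                        s.getBytes(StandardCharsets.UTF_8).length,    // 1 for 'A', 3 for U+0A97
                        s.getBytes(StandardCharsets.UTF_16BE).length, // 2 for both (BMP)
                        s.getBytes(utf32).length);                    // always 4
            }
        }
    }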

In Table I, BOM denotes that the byte order is determined by a byte order mark if one is present at the beginning of the data stream; otherwise it is big-endian. The most widely used encoding on the web is UTF-8, while Java and Windows use UTF-16. Linux and many UNIX systems use UTF-8 and UTF-32. All the encodings have algorithmic conversions for interchanging information among them; the conversions are fast and lose no data. As a result, whichever Unicode encoding is used for internal storage or text processing, the conversions make it easy to support data input or output in multiple encoding forms flexibly [4].

TABLE I
PROPERTIES OF UTFs

Unicode Encoding Type          UTF-8    UTF-16   UTF-32
Minimum code point             0000     0000     0000
Maximum code point             10FFFF   10FFFF   10FFFF
Size of code unit (bits)       8        16       32
Byte order mark                N/A      BOM      BOM
Lowest bytes per character     1        2        4
Maximum bytes per character    4        4        4
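The claim above that conversions between the UTFs are algorithmic and lossless can be checked directly; this Java sketch (again illustrative, assuming nothing beyond the standard JDK charsets) round-trips a Gujarati string through each form:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Sketch: round-trip a string through UTF-8, UTF-16 and UTF-32; no data is lost.
    public class RoundTrip {
        public static void main(String[] args) {
            String original = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0"; // "ગુજરાતી"
            Charset[] forms = { StandardCharsets.UTF_8, StandardCharsets.UTF_16BE,
                                Charset.forName("UTF-32BE") };
            for (Charset cs : forms) {
                byte[] encoded = original.getBytes(cs);   // String -> bytes in this form
                String decoded = new String(encoded, cs); // bytes -> String again
                System.out.println(cs + ": " + decoded.equals(original)); // true every time
            }
        }
    }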

C. Unicode Fonts

A Unicode font is a font that conforms to the Unicode standard. As in any other font, characters are mapped to specific numbers, but following the Unicode code assignments; Unicode fonts also carry rules for displaying characters. For example, 'Shruti' is a well-known Unicode font for the Gujarati language, developed by Microsoft. All OpenType and AAT (Apple Advanced Typography) fonts are Unicode fonts. TrueType fonts may follow the Unicode standard to some extent but are not capable of properly rendering Indian scripts using Unicode, so it is advisable to avoid TTF for Indian regional languages.

When a document is prepared using a Unicode font, the text can be displayed in many Unicode fonts. In network operations such as sending or uploading a document developed in a Unicode font, readers can read it without needing the exact font in which the document was developed, so long as any other Unicode font capable of displaying those characters is present.

D. Why Unicode?

Microsoft, HP, IBM, Sun, Unisys operating systems, Oracle, Progress databases, and many others support Unicode, so it provides platform and vendor independence for application development. Thanks to the UCS, e-content in any language, including the Indian regional languages, can be exchanged worldwide. Unicode offers a single way to process text, which reduces development and support costs and allows a single version of code; separate releases for regional markets can be eliminated because one version of the product can be used globally. The Unicode standard ensures interoperability and portability by prescribing conformant behaviour, so applications process text consistently and conformance is verifiable. The latest version of Unicode includes 12 'Indic script' character sets: Devanagari, Bengali, Kannada, Tamil, Limbu, Malayalam, Oriya, Gujarati, Sinhala, Syloti Nagri, Gurmukhi, and Telugu [1]. As far as efficient digitization of Indian regional languages is concerned, e-content in Hindi, Gujarati, Tamil, and other languages can be communicated swiftly in Unicode and read directly in the same language without the headache of installing any fonts at the recipient's end. Irrespective of the operating system or application used for text processing, documents developed in Unicode are directly available for further computation. Unicode's inherent capabilities allow global searching, online forums and databases, alphabetical sorting, and data filtering in these languages.

IV. TEXT CONVERSION TOOL

The enduring motivation behind the development of such a conversion tool is the preservation of the ancient literature of the state. Gujarati is a language rich in old texts and literature in the form of articles, books, newspapers, publications, and more.

The biggest problem in this field is that hard copies of these literary works have deteriorated over time, and the documents used for printing in earlier eras were typed in various DOS-supported fonts. These fonts are not supported by Windows, and their encoding schemes differ from present standards. The situation, then, is that computer science technologies have evolved over time, but little attention has been paid to developing tools for converting regional-language text, which indirectly does injustice to the literature and could lead to its loss. The need of the hour is a tool efficient enough to convert these old DOS fonts in ANSI encoding into text conforming to the Unicode standard, making old literary works available to today's world. There has been work in this field, but people have not adopted the Unicode standard, in some cases for lack of knowledge and in others for inability to adapt to new technology. This conversion tool should pave the way for more research and further development.

V. DESIGN

The design aspects of converting the Gujarati DOS-based font 'KRISHNA' into the Gujarati Unicode font 'Shruti' are as follows:

A. Logic

Our objective was to translate an old Gujarati text file written in the Krishna font (a non-Unicode Gujarati font) and encoded in ANSI into UTF-8, with the output rendered in the Shruti font (a Gujarati Unicode font). This was achieved with the following sequence of steps (a code sketch follows the list):

- The input file is in ANSI encoding, so it is first decoded in that format, using the "ISO8859-1" encoding supported by Java.
- Two character mappings are stored in two different HashMaps: one stores the Krishna (ANSI) to ASCII mapping and the other stores the ASCII to Shruti (UTF-8) mapping.
- The input file is read character by character; the ASCII code of each read character is obtained from the first HashMap, where the mapping is stored as (key, value) pairs.
- The next step is to find the character (in Shruti) in the second HashMap corresponding to the ASCII value obtained above.
- This character is then written to the output file, which is generated in UTF-8.
- The most important task is the reconstruction of words from characters. Some of the reconstruction is achieved implicitly by the Unicode rendering rules; the rest is done explicitly with programming logic.
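A minimal Java sketch of this two-map lookup follows. The mapping entries shown are hypothetical placeholders, since the actual Krishna-to-ASCII and ASCII-to-Shruti tables are not published in this paper:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the two-HashMap lookup described above. The entries below are
    // hypothetical placeholders; the real tables cover every Krishna glyph code.
    public class TwoMapLookup {
        public static void main(String[] args) {
            Map<Character, Integer> krishnaToAscii = new HashMap<>();
            Map<Integer, String> asciiToShruti = new HashMap<>();

            // Hypothetical sample entry: Krishna byte 'U' -> code 85 -> Gujarati KA.
            krishnaToAscii.put('U', 85);
            asciiToShruti.put(85, "\u0A95"); // ક

            StringBuilder out = new StringBuilder();
            for (char c : "U".toCharArray()) {            // read input character by character
                Integer ascii = krishnaToAscii.get(c);     // step 1: Krishna -> ASCII code
                if (ascii != null && asciiToShruti.containsKey(ascii)) {
                    out.append(asciiToShruti.get(ascii));  // step 2: ASCII code -> Shruti text
                }
            }
            System.out.println(out);                       // ક
        }
    }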

B. Pseudo Code

1. Create the input/output file streams.
2. Create the HashMap objects.
3. Specify the datatypes of the keys and values that they map: first HashMap<Character, Integer>, second HashMap<Integer, String>.
4. Generate the HashMaps.
5. Read the input file character by character.
6. Find the ASCII code of the read character from the Non-Unicode to ASCII HashMap.
7. If the mapping is found, find the character in the ASCII to Unicode HashMap corresponding to that ASCII code, and write it to the output.
8. The input file may also contain characters that do not exist in the mapping but instead appear in the file as an ASCII code enclosed within "< >". For such characters, extract the enclosed ASCII code, map it directly to its value in the ASCII to Unicode HashMap, and write the output.
9. Implicit reconstruction of the output takes place according to the Unicode rendering rules.
10. For explicit reconstruction (a sketch of rule (i) follows this list):
    (i) if the present character is 'િ', swap the position of the present character with the next character;
    (ii) if the present character is ' ' and the next character is ' ', replace both characters with ' '.
    Writing the logic for explicit reconstruction requires analyzing the output carefully, so that the characters creating problems can be studied and coded for accordingly.
11. Close the input/output file streams.

Note: The HashMaps are generated only once, at the time a new script (mapping) is added, and are then stored for future use, thereby reducing runtime.

Fig. 3 - Flowchart
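Step 10(i) can be sketched in Java as follows (an illustrative reading of the rule, not the authors' actual code). The swap is needed presumably because in the legacy glyph order the vowel sign િ (U+0ABF) precedes its consonant, while Unicode logical order stores the consonant first:

    // Sketch of explicit reconstruction step 10(i): each U+0ABF
    // (GUJARATI VOWEL SIGN I) is swapped with the character that follows it.
    public class Reconstruct {
        static String swapVowelSignI(String s) {
            char[] a = s.toCharArray();
            for (int i = 0; i + 1 < a.length; i++) {
                if (a[i] == '\u0ABF') {          // 'િ' found before its consonant
                    char tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                    i++;                          // skip past the swapped pair
                }
            }
            return new String(a);
        }

        public static void main(String[] args) {
            // "િક" (sign i + KA, legacy order) becomes "કિ" (KA + sign i, Unicode order)
            System.out.println(swapVowelSignI("\u0ABF\u0A95")); // કિ
        }
    }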

C. Complexity

The project required extensive information gathering about the different encoding formats, such as ANSI, ASCII, ISCII, and UTF-8, for reading the old DOS-supported font and converting it to the Unicode standard, as well as about the equivalent encoding names supported by Java. Another important part of the project is the reconstruction of words from the sequence of characters obtained after the Unicode font mapping.

VI. IMPLEMENTATION

A. Programming Language

The software is programmed in Java, following OOP principles and utilizing some of the extensive Java libraries, their classes and functions. The tool used for development was JCreator PRO.

B. Packages/Libraries

Some of the packages, interfaces, classes, and methods used in the code are as follows:

i. java.io.* - Contains FileInputStream, used to create input streams for the character-set mapping and for the input file (the file to convert); FileOutputStream, used to create the output stream to the output file; BufferedReader, which wraps the FileInputStream objects to enforce the "ISO8859-1" decoding; and BufferedWriter, which wraps the FileOutputStream objects to enforce the "UTF-8" encoding [3].
ii. java.util.HashMap - HashMap objects are used to store the character-set mappings.

C. Data Structure

HashMap is the class implementing the Map interface in the java.util package. It is the data structure used to store the character mappings as (key, value) pairs via the put(k, v) method and to retrieve the mapped values via the get(k) method. The containsKey(k) method checks whether a key exists in the map and prevents the code from throwing a NullPointerException. The code generates and stores two HashMaps per font: one for the Non-Unicode to ASCII mapping and the other for the ASCII to Unicode mapping.

D. Challenges

The most challenging part was the reconstruction of words from the character sequence. Once the mapping is done and every character has been translated to the Unicode font, the characters must be combined. The characters in the sequence comprise a variety of independent vowel letters, dependent vowel signs, live consonants, dead consonants, and half-consonant forms, which must be concatenated with appropriate logic so that the text renders tidily on screen. Coding efficiently, so that the software makes proper use of the classes and methods available in the Java libraries and consumes minimal processor time, was a further challenge. The software has been prepared keeping in mind that the HashMaps are created and stored only once for a particular font mapping. Running the program for the first time asks the user to provide the mapping; from then on, the program works with the stored mapping. A font mapping need only be added once in the code, after which output can be produced in the desired font.
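A sketch of the stream plumbing from B above follows, with hypothetical file names. Note that in the standard java.io API the charset is supplied through an InputStreamReader or OutputStreamWriter wrapped around the file streams:

    import java.io.*;

    // Sketch of the I/O plumbing from Section VI-B: decode the legacy file as
    // ISO8859-1, write the converted text as UTF-8. File names are hypothetical.
    public class Streams {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(new FileInputStream("krishna_input.txt"), "ISO8859-1"));
                 BufferedWriter out = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream("shruti_output.txt"), "UTF-8"))) {
                int ch;
                while ((ch = in.read()) != -1) {  // read character by character
                    // ... map (char) ch through the two HashMaps here ...
                    out.write(ch);                 // placeholder: write the mapped text instead
                }
            }
        }
    }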

VII. CONCLUSION & FUTURE WORKS

The best practice for composing digital texts in Indian regional languages is to follow the Unicode standard and to use Unicode fonts. This ensures that documents are easily accessible to readers, and following a standard will also promote future work in this field. Font technology will continue to evolve as new display engines and fonts are developed, and Unicode is the key to revealing the old literature of these languages. If we convert texts according to the Unicode standard, readers will easily be able to take advantage of evolving font technology that will not only improve the display of documents but also allow them to be processed smoothly for desktop publishing and other effective electronic publishing, resulting in e-books, newspapers, magazines, and newsletters on any electronic device.

The future scope of this project is to extend the tool by adding a Gujarati lexical analyzer for spell checking and grammar correction to improve the quality of the text. Further, once these old literary works are translated to Unicode, they can be converted to the .epub format so that they can be read as e-books. Searching for a word could then also be made possible in the EPUB document by importing the Gujarati font files and the lexical analyzer, for an even better reading and formatting experience.

ACKNOWLEDGMENT

We are grateful to our on-campus mentor, Prof. Sanjay Chaudhary, for his motivation, for creating a resourceful environment, and for his assistance, invaluable guidance, and creative inputs that kept improving the project throughout the BTP. We would also like to thank Mr. Kartik Mistry and Mr. Apurva Ashar for their technical support and guidance, and for giving their valuable time at various phases of the project.

REFERENCES

[1] Sean Pue, E-Journal of the South Asia Language Resource Center, The University of Chicago. http://salpat.uchicago.edu/index.php/salpat/article/view/33/29
[2] Article on 'Desktop Publishing in the 21st Century'. http://desktoppub.about.com/od/gethelp/a/DesktopPublishing.htm
[3] The API specification for version 6 of the Java™ Platform, Standard Edition. http://docs.oracle.com/javase/6/docs/api/
[4] Questions related to UTFs and encoding forms, Unicode FAQ. http://www.unicode.org/faq/utf_bom.html
[5] E-book 'Understanding Unicode™ - I' and e-book 'Guidelines for Writing System Support: Technical Details: Encodings and Unicode: Part 2' by Sean Pue. http://scripts.sil.org/IWS-Chapter04a and http://scripts.sil.org/WSI_Guidelines_Sec_6_2
[6] Standards for the language technology industry, Technology Development for Indian Languages. http://tdil.mit.gov.in/FAQs.aspx