A Spell Checker for Esperanto

Project Report for Stage 1 (December 2008 to February 2009)
by Marek Blahuš

This report describes the development of the project “A Spell Checker for Esperanto” during its first stage, i.e. in Months 1 to 3 (December 2008 to February 2009). The state described is that of March 1, 2009, the date of Checkpoint 1 as specified by the project’s plan posted on its website.

“A Spell Checker for Esperanto” is a project in the field of Natural Language Processing that aims to design and implement a spell-checking dictionary for the Esperanto language using the Hunspell spell checker. In its current phase it is financed by the Students' Research and Development Projects scholarship of the Faculty of Informatics at Masaryk University in Brno, Czech Republic.

The goals of the first stage were to enhance the spell checker’s functionality and to perform preliminary research on the possibilities of its integration into the OpenOffice.org office application suite. The outcome is presented as an input-output description of the new behavior of the spell checker: a general description of each enhancement, along with examples of words, word groups or sentences that the previous version did not handle correctly but that the new version does. The report also summarizes the preliminary research on the process that must be followed to make the spell checker an official part of OpenOffice.org. Last but not least, it gives a brief account of several public presentations of the project given since the last report (i.e. since the defense of the bachelor’s thesis).

Functionality Enhancements

First, recent development in Hunspell has been followed. Since version 1.2.2, which was current when the bachelor’s thesis was written, new versions have appeared, the most recent being Hunspell 1.2.8, released on November 1, 2008. This development has brought several new features, including some relevant to the project.

The most important change is “extended compound word checking for better COMPOUNDRULE related suggestions”, introduced in Hunspell 1.2.7. It was apparently this change that fixed the biggest shortcoming of the proof-of-concept implementation of the spell checker: the inability to provide suggestions for compound words. Since most of the recognized words are implemented as compounds, this shortcoming had rendered the proof-of-concept virtually incapable of producing suggestions. With the new version of Hunspell, suggestions work even for compounds, although a significant delay has been observed (apparently caused by the high number of rules that may produce a valid suggestion). This delay is an annoyance that needs attention in further development, particularly in relation to the prospective reworking of the implementation of the morphology.

The change log for Hunspell 1.2.8 also promises better treatment of hyphens, so that hyphens may be analyzed “by spell checker instead of by word breaking code of OpenOffice.org”. This has, however, not yet been implemented in OpenOffice.org itself, since its new milestone, OpenOffice.org 3.0 (the bachelor’s thesis was developed with version 2.4), was released before Hunspell 1.2.8.

Most of the work on enhancing the dictionary’s capabilities has been done in the field of affix support. Although the bachelor’s thesis tried to implement all officially recognized Esperanto affixes (as listed in the grammar reference manual PMEG), there are a number of affixes and pseudo-affixes that are found in use but have not (so far) been recognized as official by the Academy of Esperanto. Many of these affixes, however, are quite common in texts on modern technologies and recent world issues, or in scientific contexts. PMEG gives a list of these affixes, along with examples. All of these unofficial affixes (28 prefixes and 32 suffixes) and “international” (SI) affixes (20 prefixes) have now been implemented in the spell-checking dictionary. Because of their controversial nature, though, in many cases only those combinations with stems that were found in the corpus have been allowed, instead of extending the affix’s applicability to the whole of an existing semantic class.

These newly supported affixes are:

Affix | Example | Translation | Description
AFRO- | Afroamerikano | Afro-American | shortcut for AFRIK (stem for “Africa”)
ANTI- | antifaŝisto | anti-fascist | “against”, “enemy”, “in opposition”
ARĤI-/ARKI- | arĥiepiskopo | archbishop | “first”, “main”, “top”, “most important”
AŬDIO- | aŭdiokasedo | audiocassette | electronic recording / reproduction of sound
AŬTO- | aŭtopiloto | autopilot | automatic
BIO- | biofiziko | biophysics | life, living being
EKO- | ekosistemo | ecosystem | ecological
EŬRO- | eŭrobarometro | eurobarometer | shortcut for EŬROP (stem for “Europe”)
HIPER- | hiperteksto | hypertext | “above”, “too”, “over” (scientific)
INFRA- | infraruĝa | infrared | “below” (technical)
KO- | kosekanto | cosecant | “complement”, “together” (mathematical)
KVER- | kverfluto | C flute | “transversal” (technical)
MAKRO- | makroekonomio | macroeconomics | “large scale”, “gigantic”
META- | metafilozofio | metaphilosophy | activities or works relating to itself
MIKRO- | mikroorganismo | microorganism | “small scale”, “tiny”
MIKRO- | mikrometro | micrometer | 10^-6
MINI- | minijupo | mini-skirt | “very small”, “very short”, “very little”
MONO- | monosilaba | monosyllabic | “one”
PRE- | preamplifilo | preamplifier | “before”
PROTO- | prototipo | prototype | “main”, “very first”, “primitive”, “original”
PSEŬDO- | pseŭdoscienco | pseudoscience | “false”, “wrong”, “secret”
RETRO- | retroagi | retroact | “in opposite / uncommon direction”
SAN- | Sanheleno | Saint Helena | shortcut for SANKT (stem for “Saint”)
SEMI- | semicirklo | semicircle | alternative for DUON (stem for “half”)
STIF- | stiffrato | half-brother | alternative for DUON or VIC (relationships)
TELE- | teleobjektivo | telephoto lens | “long distance” (technical)
TERMO- | termobotelo | thermos | “warm”
ULTRA- | ultrasono | ultrasound | “extreme”, “extraordinary”
VIDEO- | videokasedo | videocassette | electronic recording / reproduction of images
-AB | trinkabo | water spouter | fixed installation
-AC | rozacoj | Rosaceae | family of plants
-AL | rozaloj | Rosales | family of plants
-AL | nazalo | nasal bone | bone
-AL | varmala | thermic | related, belonging to
-ARI | duaria | binary | alternative for UM for number systems
-ATOR | kalkulatoro | calculator | alternative for IL (apparatus)
-E | rozeo | Rosa | family of plants
-E | rozea | pink | color
-ED | cervedoj | Cervidae | family of animals
-EN | lutrenoj | Lutrinae | family of animals
-ENZ | solvenzo | solvent | substance which serves for an action
-ESK | japaneska | Japanesque | “similar but not real”, “in a way”, “in a style”
-I | Slovakio | Slovakia | country name
-IĈ | boviĉo | bull | male animal, man, male
-IF | varmifi | heat up | “to cause something”
-IK | gimnastiko | gymnastics | subject field, science
-ILION | duiliono | 10^12 | 10^(n*6)
-ILIARD | duiliardo | 10^15 | 10^(n*6+3)
-ISTAN | Afganistano | Afghanistan | country
-IT | nefrito | nephritis | inflammation
-IV | produktiva | productive | able, capable
-IZ | zinkizi | galvanize | cover, provide
-IZ | pasteŭrizi | pasteurize | use a method named after a person
-NOMIAL | dunomialo | binomial | mathematical expression with n parts
-OFON | esperantofona | Esperantophone | language speaker
-OID | sufiksoido | suffixoid | “of a similar form”
-OID | asfodeloido | Asphodelae | family of plants or animals
-OL | duolo | dual | rhythmic subdivisions (music)
-OLOG | birdologo | ornithologist | professional, specialist
-OLOGI | birdologio | ornithology | subject field, science
-OMETR | altometro | altimeter | measuring instrument
-OTEK | filmoteko | filmotheque | collection
-OZ | virusozo | virosis | illness, fault, disorder
-OZ | sabloza | sandy | full, containing, rich in
-T | sesto | sixth | musical interval
-TET | kvarteto | quartet | group of n members or piece for such group
-UK | bovuko | ox | castrated male animal
JOTA- | jotabajto | yottabyte | 10^24
ZETA- | zetabajto | zettabyte | 10^21
EKSA- | eksabajto | exabyte | 10^18
PETA- | petabajto | petabyte | 10^15
TERA- | terabajto | terabyte | 10^12
GIGA- | gigabajto | gigabyte | 10^9
MEGA- | megabajto | megabyte | 10^6
KILO- | kilobajto | kilobyte | 10^3
HEKTO- | hektolitro | hectoliter | 10^2
DEKA- | dekagramo | decagram | 10^1
DECI- | decimetro | decimeter | 10^-1
CENTI- | centimetro | centimeter | 10^-2
MILI- | milimetro | millimeter | 10^-3
NANO- | nanometro | nanometer | 10^-9
PIKO- | pikometro | picometer | 10^-12
FEMTO- | femtometro | femtometer | 10^-15
ATO- | atometro | attometer | 10^-18
ZEPTO- | zeptometro | zeptometer | 10^-21
JOKTO- | joktometro | yoctometer | 10^-24
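The numeric rows of the affix list can be checked with a little arithmetic. The following is only an illustrative sketch in Python, not part of the dictionary files: the function names `ilion` and `iliard` and the `SI_EXPONENT` mapping are hypothetical helpers introduced here, and the multiplier n is assumed to be the numeral in front of the suffix (du = 2).

```python
from fractions import Fraction

def ilion(n: int) -> int:
    """Value of an -ILION numeral: 10^(n*6), as stated in the affix list."""
    return 10 ** (n * 6)

def iliard(n: int) -> int:
    """Value of an -ILIARD numeral: 10^(n*6+3), as stated in the affix list."""
    return 10 ** (n * 6 + 3)

# Exponents of a few SI prefixes, copied from the affix list
# (hypothetical helper table, not taken from the dictionary files).
SI_EXPONENT = {
    "JOTA": 24, "GIGA": 9, "KILO": 3, "HEKTO": 2, "DEKA": 1,
    "DECI": -1, "MILI": -3, "MIKRO": -6, "NANO": -9, "JOKTO": -24,
}

def si_factor(prefix: str) -> Fraction:
    """Exact multiplier for an SI prefix, e.g. MILI -> 1/1000."""
    return Fraction(10) ** SI_EXPONENT[prefix]

if __name__ == "__main__":
    print(ilion(2) == 10 ** 12)    # duiliono
    print(iliard(2) == 10 ** 15)   # duiliardo
    print(si_factor("MIKRO"))      # exact fraction for 10^-6
```

With n = 2 the two formulas give the values listed for duiliono and duiliardo.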

Moreover, the two official suffixes that produce familiar forms of personal names, ĉj and nj, are now handled properly, which was not the case in the bachelor’s thesis (where they were mentioned among the known problems and omissions). This has been accomplished by devising a novel algorithm within the dictionary files, which produces all conceivable combinations of 1 to 5 letters of the Esperanto alphabet and appends a ĉj or nj suffix, thus forming a familiar form of a proper noun that may still be subject to limited morphological alteration. The length limit is inspired by the description of these suffixes in PMEG. Only those combinations of letters that may form syllables in Esperanto are permitted (since PMEG avoids discussing this topic, inspiration has instead been taken from the essay “Silabo kaj sib” by Marc Bavant, a member of the Academy of Esperanto). Finally, it must be noted that permitting such variability in allowed word forms is not a drawback for the overall functioning of the dictionary, because the affixes ĉj and nj are highly specific and not easily confused with other Esperanto stems or affixes, and particularly because a capital letter must always stand at the beginning of a personal name.

Examples of newly recognized familiar forms of personal names include:
• Peĉjo from Petro
• Manjo from Maria
• Anjo from Ana
• Enjo from Ema
• Alekĉjo from Aleksandro

On the other hand, words like the following are not recognized as valid:
• *Kpstĉjo (prohibited letter sequence)
• *Aleksandĉjo (too long letter sequence)

As far as capitalization is concerned, the list of stems beginning with a capital letter, originally taken from the PIV dictionary, has been revised manually to allow a lower-case letter where this is common usage or where current practice permits it as an unwritten rule (e.g. names of peoples derived from the names of their countries).

Examples of valid words that are now recognized due to changes in capitalization:
• suno (“sun”) as well as Suno (“Sun”)
• luno (“moon”) as well as Luno (“Moon”)
• esperanto as well as Esperanto (although it already used to pass as esper+ant+o)
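The rule for ĉj and nj forms described earlier (a capitalized sequence of 1 to 5 Esperanto letters followed by one of the two suffixes) can be sketched as a small Python check. This is only an illustration under stated assumptions, not the algorithm used in the dictionary files: the function name `is_familiar_name` is hypothetical, the endings -o and -on stand in for the "limited morphological alteration", and the syllable constraint is reduced to a crude stand-in (the letter sequence must contain at least one vowel) rather than the actual rules taken from Bavant’s essay.

```python
ESPERANTO_LETTERS = set("abcĉdefgĝhĥijĵklmnoprsŝtuŭvz")
VOWELS = set("aeiou")
SUFFIXES = ("ĉj", "nj")
ENDINGS = ("on", "o")  # ASSUMPTION: only noun ending -o and accusative -on

def is_familiar_name(word: str) -> bool:
    """Rough check for a ĉj/nj familiar form of a personal name."""
    # Strip the grammatical ending.
    for ending in ENDINGS:
        if word.endswith(ending):
            body = word[: -len(ending)]
            break
    else:
        return False
    # The ending must be preceded by ĉj or nj.
    if not body.endswith(SUFFIXES):
        return False
    stem = body[:-2]
    # Length limit of 1 to 5 letters, as described in the report.
    if not 1 <= len(stem) <= 5:
        return False
    # A personal name must start with a capital letter.
    if not stem[0].isupper():
        return False
    lowered = stem.lower()
    if any(ch not in ESPERANTO_LETTERS for ch in lowered):
        return False
    # ASSUMPTION: crude stand-in for the syllable rules.
    return any(ch in VOWELS for ch in lowered)

if __name__ == "__main__":
    for w in ["Peĉjo", "Manjo", "Anjo", "Enjo", "Alekĉjo",
              "Kpstĉjo", "Aleksandĉjo"]:
        print(w, is_familiar_name(w))
```

Under these assumptions the five accepted examples from the report pass, while *Kpstĉjo (no vowel) and *Aleksandĉjo (eight letters before the suffix) are rejected. A real implementation would need the full syllable rules, which this sketch does not attempt to reproduce.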

OpenOffice.org Integration

Making the Esperanto dictionary an integral part of OpenOffice.org (or, more precisely, of its Esperanto localization) has two aspects that need to be considered. The first is the technical details of the integration itself; the other is proper licensing of the dictionary, so that such integration is permitted. Eventually, after communication with the package’s developers, the prepared files must be submitted to become a part of the distribution.

OpenOffice.org 3.0, which appeared after the bachelor’s thesis had been published, changed the way spell-checking dictionaries are linked to the main program. Instead of using a special dictionary wizard, dictionaries are now available via the extensions repository. This means that a dictionary extension will need to be created to wrap the prepared dictionary so that it can be used in this new version of OpenOffice.org.

As for the licensing policy, research has shown that what apparently makes it impossible to ship the Esperanto spell-checking dictionary in one package with the office suite is the fact that the existing spell-checking dictionary for Esperanto (by Sergio Pokrovskij) is distributed under the terms of the GNU General Public License (GPL), while OpenOffice.org is distributed under the terms of the GNU Lesser General Public License (LGPL). The LGPL permits software to be linked to non-(L)GPLed software, which may not be done with software licensed merely under the GPL. Moreover, should the spell checker also be included in the distribution of Mozilla Firefox, licensing it under the Mozilla Public License (MPL) would come in handy too. In conclusion, it seems best to copy the practice followed by Mozilla in the case of Firefox and license the prepared spell-checking dictionary under three licenses at the same time (GPL, LGPL and MPL), so that anyone is free to choose the license they want to follow.

Public Presentations

Since the last written report on the project (June 2008, when it was defended as a bachelor’s thesis), several public presentations of the project’s goals and outcomes have been given, both to the expert community and to the general public. Apart from popularizing the project and preparing a user base for its result, these presentations were followed by discussions, which provided additional input and suggestions of features to be implemented.

The following is the list of these public presentations:
• 2008-07-25 Rotterdam (NL): 93rd World Congress of Esperanto, Meeting of Computer Linguists
• 2008-09-27 Stockholm (SE): E@I Grammar Checker (Lingvohelpilo) Developers Meeting
• 2008-11-22 Berlin (DE): Gesellschaft für Interlinguistik (Society for Interlinguistics), 18. Tagung
• 2009-02-06 Antwerpen (BE): La Verda Stelo, local Esperanto speakers group
• 2009-02-09 Ottignies-Louvain-la-Neuve (BE): Kvinfolio, local Esperanto speakers group