Reading the Legends of Roman Republican Coins


ALBERT KAVELAR, SEBASTIAN ZAMBANINI and MARTIN KAMPEL, Vienna University of Technology

Coin classification is one of the main aspects of numismatics. The introduction of an automated image-based coin classification system could assist numismatists in their everyday work and allow hobby numismatists to gain additional information on their coin collections by uploading images to a respective website. For Roman Republican coins, the inscription is one of the most significant features, and its recognition is an essential component of a successful image-based coin recognition system. This paper presents a novel way of recognizing ancient Roman Republican coin legends. Traditional OCR strategies were designed for printed or handwritten texts and rely on binarization in the course of their recognition process. Since coin legends are simply embossed onto a piece of metal, they are of the same color as the background; binarization therefore becomes error-prone and prohibits the use of standard OCR. The proposed method is instead based on state-of-the-art scene text recognition methods, which are rooted in object recognition. SIFT descriptors are computed for a dense grid of keypoints and are tested using SVMs trained for each letter of the respective alphabet. Each descriptor receives a score for every letter, and the use of pictorial structures makes it possible to detect the optimal configuration of each lexicon word within an image; the word causing the lowest costs is recognized. Character and word recognition capabilities of the proposed method are evaluated individually; character recognition is benchmarked on three datasets and word recognition on two. Depending on the SIFT configuration, lexicon and dataset used, the word recognition rates range from 29% to 67%.

Categories and Subject Descriptors: J.5 [Computer Applications]: Arts and Humanities—Fine arts; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Object recognition; I.5.4 [Pattern Recognition]: Applications—Computer vision

General Terms: Algorithms, Experimentation

Additional Key Words and Phrases: Ancient coins, coin legend recognition, local image descriptors, scene text recognition, OCR

ACM Reference Format: Albert Kavelar, Sebastian Zambanini and Martin Kampel. 2013. Reading the Legends of Roman Republican Coins. ACM J. Comput. Cult. Herit. 1, 4, Article 2 (June 2013), 20 pages. DOI: http://dx.doi.org/10.1145/0000000.0000000

The presented work is part of the ILAC project, which is funded by the Austrian Science Fund (FWF): TRP140-N23-2010. The authors would like to thank their partners Dr. Klaus Vondrovec and Mag. Kathrin Siegel at the Department of Coins and Medals of the Kunsthistorisches Museum Wien (Vienna Museum of Fine Arts) for their great support and valuable input.

1. INTRODUCTION

Numismatics is the scientific discipline studying all forms of money and other payment media as well as their historical aspects. Besides researching the distribution of coin hoard finds or the identification of coins, their classification plays an important role in numismatics. As stated by Kampel and Zaharieva [2008], numismatists refer to the association of a specific coin with a reference number from catalogs such as the Crawford catalog [Crawford 1974] as classification. This is a complex and cumbersome task requiring years of experience in the field of numismatics [Kampel and Zaharieva 2008]. Apart from material analysis, classification primarily relies on the inspection of images and textual inscriptions on the coin. Hence, digital photographs of both sides of a coin, referred to as obverse (front) and reverse (back), also enable classification. Therefore, computer vision methods can assist numismatists in their work and expedite coin classification. Fully automated classification systems have already been developed for modern coins, e.g., Dagobert [Nölle et al. 2003] or COIN-O-MATIC [van der Maaten and Poon 2006]. However, to date there is no technically mature automatic classification system for ancient coins, despite some research efforts in this direction [Zaharieva et al. 2007; Arandjelović 2010; Kampel and Zaharieva 2008]. Due to the way in which they were manufactured, ancient coins substantially differ from their modern counterparts and hence pose different challenges to computer vision. As opposed to modern coins, ancient coins were cast or struck from handcrafted dies, as illustrated in Fig. 1(a). That is, a die served as a blueprint from which a certain number of coins could be struck before it was worn down to a degree where the engraved imagery could no longer be transferred to the flan. Consequently, different dies had to be used for producing coins of the same class, and coins originating from the same die were of different quality depending on how degraded the die already was [Kampel and Zaharieva 2008]. The described manufacturing process also results in coins having individual minting marks, different shapes, off-centered or cropped imagery and misaligned obverse and reverse images. Hence, each coin is unique and the intra-class variability is high, which impedes the use of methods for modern coins [Nölle et al. 2003; van der Maaten and Poon 2006], as they assume a uniform background and identical appearance for coins of the same class. Fig. 1(b) depicts the high intra-class variability of ancient Roman Republican coins.

1.1 Definition of Terms

By coin we refer to a specific physical item [Arandjelović 2010]. The raw piece of metal from which the coin is struck is called the flan. The coin's front face is denominated obverse, while its back is called reverse. The piece of metal on which the inverse image to be transferred onto the flan is engraved is called the die. It can thus be considered the blueprint or stencil for a certain coin type; there are separate dies for obverse and reverse. The textual inscription on the coin is referred to as the legend. In addition to the legend and the main symbol (on the obverse of ancient Roman Republican coins, the head of a Roman deity, such as Roma, Venus or Minerva, serves as the main symbol in the majority of coin types [Crawford 1974]), many coins show mint marks, i.e., small symbols or letters used to identify the mint. The contraction of two or more letters into a single glyph is referred to as a ligature, as depicted in Fig. 1(g). Fig. 1(a) and Fig. 1(c) further illustrate these terms.

1.2 Objective

In order to assign an ancient coin to a certain class, numismatists have to identify and discern its characteristics, highlighted in Fig. 1(c). One of the most significant features used for classification is the textual inscription or legend which, for the Roman Republican coins minted between 280 BC and 27 BC considered in this work, reads the name of the depicted Roman deity or mythological figure (obverse) or the respective moneyer (reverse) [Crawford 1974; Arandjelović 2010]. We focus on Roman Republican coinage because it is well documented and researched numismatically by Crawford [1974], whose work gives a comprehensive overview of the different coin types of this period. It lists 550 distinct reference numbers which allow the classification of a certain coin according to its characteristics, such as minting marks, the name of the issuer given in the legend or the imagery depicted. Moreover, this research is carried out in collaboration with the Kunsthistorisches Museum Wien (Vienna Museum of Fine Arts), whose Coin Cabinet owns one of the largest collections of Roman Republican coins worldwide.

Fig. 1. (a) Illustration of the coin manufacturing process. (b) Three Roman Republican coins of the same class. (c) The different characteristics of an ancient coin. (d) Arbitrary legend orientations. (e) Three different examples of the letter 'B'. (f) The letter 'A' illuminated from three different directions. (g) Different ligatures found on ancient coins (with permission of [KHM 2013]).

The presented work is part of the ILAC project, in the course of which the entire collection was digitized using a DSLR camera. Consequently, a large image database is available and was used for the presented research. The automatic recognition of coin legends facilitates a coin's successful classification, and a legend recognition system thus contributes substantially to a fully-fledged coin recognition system. The results of researching such an automatic image-based legend recognition system for ancient Roman Republican coins are the topic of this article. Even though the proposed method was also tested on another dataset, the development of a general text recognition algorithm is not of particular interest in this research.

1.3 Research Challenges

As explained by Arandjelović [2010], due to the manufacturing process, wear from use and abrasions from exposure to environmental influences like chemicals in the soil, ancient coins show a high intra-class variability and a low inter-class variability. This hinders successful classification and also applies to legend letters, as illustrated in Fig. 1(e). The occurrence of various font types and sizes poses additional challenges. In contrast to text printed on a sheet of paper, which can be considered flat, coins have a three-dimensional metallic surface structure causing shadows and specular highlights on the coin when illuminated. These phenomena cause the same letter to appear differently when illuminated from various directions (Fig. 1(f)).

Since conventional Optical Character Recognition (OCR) methods rely on binarization [Plamondon and Srihari 2000; Vinciarelli 2002], they are not appropriate for ancient coins, because the color of the legend is the same as that of the coin background. That is, thresholding leads to segmentation errors and thus to misclassified characters and words. Furthermore, a horizontal or at least straight alignment of words is assumed in most OCR methods; neither can be guaranteed for coin legends (ancient Roman coins may also show a curved legend aligned with the coin border; see Fig. 1(d) for various arbitrarily aligned coin legends). For these reasons, we decided to employ binarization-free object detection methods, such as local feature-based approaches, for character recognition. To facilitate the recognition process, a dictionary comprising the known finite set of legend words is incorporated. In contrast to standard printed text, e.g., in English, the succession of certain characters in inscriptions on Roman Republican coins is represented by ligatures, i.e., one or more letters are condensed into a single glyph, as shown in Fig. 1(g). Given that ligatures cannot be recognized by standard OCR software out of the box, using object recognition-based techniques becomes even more reasonable, as these allow treating ligatures as separate character classes if required. The remainder of this article is structured as follows: In Section 2, state-of-the-art methods of the scientific fields relevant for developing a coin legend recognition system are reviewed. Section 3 explains the proposed methodology for recognizing coin legends in detail. In Section 4, experiments carried out on a test set of 180 coin images and on a standard benchmark image dataset for character and text recognition methods are presented. Finally, Section 5 concludes this paper and points out the next steps to be taken in our research.

2. RELATED WORK

The implementation of a legend recognition system is a complex topic drawing on manifold areas of computer vision and machine learning. Besides the state of the art in coin recognition, Optical Character Recognition (OCR), local image descriptors and text detection, the rather new field of scene text recognition (STR) will be reviewed.

2.1 Coin Recognition and Classification

Image-based coin recognition started with Fukumi et al. [1991], who applied computer vision-based methods to the field of numismatics. Their rotationally invariant pattern recognition system is based on a multilayered neural network and can discern 500 won and 500 yen coins. Davidsson [1996] applied decision trees to coin classification. His approach allows rejecting unknown specimens rather than misclassifying them, as Fukumi's method does. Bremananth et al. [2005] suggested coin recognition based on the depicted numeral; the problem faced is thus related to coin legend recognition. After template-based localization of the numerals in an edge image, Gabor filter responses are computed for the respective sub-image and are classified using a backpropagation network. Several classification systems for modern coins which are capable of distinguishing between hundreds of coin classes were successfully developed. The Dagobert coin recognition system presented by Nölle et al. [2003] is capable of discerning more than 600 types of modern coins. Their method relies on binarized edge information which is tested against all possible master edge images stored in a database. COIN-O-MATIC, proposed by van der Maaten and Poon [2006], also relies on edge information. Edge angle-distance distributions are formed and classified using a nearest neighbor (NN) approach. The method introduced by Reisert et al. [2006] first segments the coin from the background by applying the Hough transform [Ballard 1981]; the resulting image region containing the coin is normalized and transformed to polar coordinates. An angular image is then computed based on the image gradient orientations. The similarity between two different coins is computed by counting the number of pixels for which the two respective angles coincide.

This similarity measure is fed to an NN classifier that finds the best matching coin within a given coin image database. As one of the first, Kampel and Zaharieva [2008] presented an end-to-end coin identification workflow for ancient coins. They evaluate the performance of different local image feature descriptors for coin classification and recognition, including SIFT [Lowe 1999], GLOH [Mikolajczyk and Schmid 2005], Shape Context [Belongie et al. 2002] and SURF [Bay et al. 2008]. Arandjelović [2010] presents an automatic coin attribution system based on a novel type of feature called the locally-biased directional histogram. For each interest point found by the Difference-of-Gaussian (DoG) detector [Lowe 1999], a set of weighted and directed histograms is computed. These features aim to capture geometric relationships between interest points and are inspired by the concept of correlatons, first introduced by Savarese et al. [2006]. Huber-Mörk et al. [2011] propose a method which extends the approach introduced by Kampel and Zaharieva [2008] by a preselection step based on the coin's contour. In this step, equally spaced rays are cast from the coin's center of gravity and intersected with its contour. The distances along the rays between these intersection points and a hypothetical perfect circle fitted to the coin area are measured and form a descriptor which can be computed and matched quickly, allowing for fast pruning of large coin databases when attempting to identify a specific coin from an image. Arandjelović [2012] proposed a way of reading ancient coin legends. His work concentrates on Roman Imperial denarii showing legends running around the coin edge. After transforming the image from Cartesian to polar coordinates, the legend is oriented horizontally, which allows the use of HOG-related features in the letter appearance model. Next, the likelihood of particular legends is computed in a manner similar to the word recognition of Wang et al. [2011], where individual likelihoods for each character of a set of legend words are combined into an overall likelihood using dynamic programming.
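To make the geometry of this legend unwrapping concrete, the following minimal sketch (our illustration, not Arandjelović's implementation; the file name, the centering assumption and the rim-band width are ours) converts a coin image from Cartesian to polar coordinates so that a legend running along the border becomes a horizontal line of text:

```python
import cv2

img = cv2.imread("denarius.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input
h, w = img.shape
center = (w / 2.0, h / 2.0)      # assumes a roughly centered coin
max_radius = min(h, w) / 2.0     # assumes the coin fills the image

# In the polar image, the x-axis is the radius and the y-axis the angle
# (sampled here at 360 steps), so the circular legend is straightened out.
polar = cv2.warpPolar(img, (int(max_radius), 360), center, max_radius,
                      cv2.WARP_POLAR_LINEAR)

# Rotate so the angle runs horizontally; row 0 then corresponds to the rim,
# where the legend of a denarius typically sits.
unwrapped = cv2.rotate(polar, cv2.ROTATE_90_COUNTERCLOCKWISE)
legend_band = unwrapped[: int(0.25 * max_radius), :]  # outer band of the coin
```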

2.2 Optical Character Recognition

OCR has been a well-researched and prominent topic in machine vision since the early 1960s [Ejiri 2007; Suen et al. 1980; Plamondon and Srihari 2000; Vinciarelli 2002; Wang and Belongie 2010]. However, the very first approaches towards OCR date back to the early 20th century, when the extension of telegraphy and the creation of reading devices for the visually impaired were the driving forces behind research. In 1913, long before the invention of the personal computer, Edmund Fournier d'Albe created the Optophone, a device capable of detecting black print on paper via photo sensors and translating it into audible signals, allowing the blind to read. Another early approach towards OCR was made by the Austrian inventor Gustav Tauschek, who was granted a U.S. patent for his Reading Machine in 1935 [Tauschek 1935]. Tauschek's apparatus compares a printed digit placed in front of a lens to a set of stencils via photo-electric cells capturing the light reflected from the paper. This kind of stencil matching was also used in the first computer vision-based OCR systems, where images of scanned characters were compared to a set of sample images stored in a database with basic template matching algorithms. As explained by Plamondon and Srihari [2000], OCR systems can generally be divided into off-line and on-line OCR methods, depending on the input data available. Off-line OCR means that the recognition algorithm is applied to an image of a machine-printed or handwritten sheet of paper digitized with a flatbed scanner or a digital camera, i.e., the data for the OCR algorithm is acquired after the completion of the printing or writing process. On-line OCR operates on an ordered sequence of two-dimensional point coordinates sampled during input with a stylus on touch-sensitive devices such as PDAs, smartphones or graphics tablets. That is, the recognition process is performed while the symbols are written [Suen et al. 1980]. Hence, the input coordinates are available as a function of time; therefore the writing direction and the order in which the strokes were written are known. This additional information leads to better recognition rates for on-line OCR as opposed to off-line OCR [Plamondon and Srihari 2000].

Since legend recognition operates on coin images, it is related to off-line OCR. A typical off-line OCR system comprises the following steps: (1) preprocessing, (2) normalization, (3) localization and segmentation of lines, words and characters, and (4) recognition [Plamondon and Srihari 2000; Vinciarelli 2002]. Due to noise and skew introduced in the digitization process, character recognition cannot be directly applied to an input image; hence, preprocessing steps, such as noise reduction or background removal, are required. Background removal is performed via thresholding [Otsu 1979; Sauvola and Pietikäinen 2000], which works well for text written on a homogeneous background, such as a sheet of paper, which has a bimodal image histogram. However, since coin legend letters are minted into the flan and are thus of the same color as the background, thresholding cannot be applied to separate the legend from the rest of the coin. What renders the legend visible are the edges – highlights and shades resulting from the relief structure of the coin surface [Arandjelović 2010] – not ink, paint or a differently colored alloy. Normalization attempts to detect and compensate for skew in the case of machine-printed text, or slant and slope in the case of handwritten documents [Vinciarelli 2002]. Word and character segmentation finds maximal connected pixel regions representing the fundamental elements used in the recognition step [Vinciarelli 2002]. Depending on the type of text (handwritten or machine-printed), these elements are assumed to be characters or word fragments; word segmentation presumes larger gaps between words than between characters [Plamondon and Srihari 2000]. Finally, recognition computes features for the primitives found during segmentation, based on which they are assigned to certain classes, e.g., different characters of an alphabet. A more detailed overview of off-line OCR can be found in [Plamondon and Srihari 2000].
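As a minimal illustration of this background-removal step (a sketch under an assumed file name, not part of the proposed pipeline), global Otsu thresholding separates ink from paper when the histogram is bimodal; applied to a coin image, the same operation only separates the coin from the backdrop and leaves the legend unsegmented:

```python
import cv2

page = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)  # assumed scanned page
page = cv2.GaussianBlur(page, (3, 3), 0)             # preprocessing: noise reduction

# Otsu's method picks the threshold that best separates the two histogram
# modes (ink vs. paper); the chosen value is returned alongside the result.
thresh_val, binary = cv2.threshold(page, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```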

2.3 Local Image Descriptors

Image-based legend recognition of ancient Roman coins is not only related to OCR in particular but also to object recognition in general – after all, finding letters within an image is an object recognition task and has therefore already been addressed with object recognition methods based on local image descriptors [Diem and Sablatnig 2009; Wang and Belongie 2010]. Based on the image property exploited, these can be classified as gradient-, shape- or texture-based local descriptors [de Campos et al. 2009].

Gradient-based Descriptors. These descriptors make use of the gradient information contained in the proximity of an interest point. The Scale-Invariant Feature Transform (SIFT) was introduced by Lowe [2004] and has been used in a wide range of applications, including character recognition in degraded documents [Diem and Sablatnig 2009] and wide baseline stereo matching [Zhang and Wei 2010]. Scale- and orientation-invariant keypoints are found by looking at local extrema of the difference-of-Gaussian function. The gradient information within a square neighborhood around a keypoint is used to form a gradient orientation histogram representing this image patch [Lowe 2004]. Ke and Sukthankar introduced PCA-SIFT, which applies a Principal Component Analysis (PCA) [Jolliffe 1986] to the horizontal and vertical gradients extracted in a 41 × 41 neighborhood around the keypoint instead of applying Gaussian weights to the gradient magnitudes. Alhwarin et al. [2010] added four pairwise independent angles to the standard SIFT descriptor, which, in combination with clustering, allows for much faster feature matching. Mikolajczyk and Schmid [2005] introduced the gradient location and orientation histogram (GLOH), which uses a log-polar location grid instead of SIFT's rectangular grid. The histograms of oriented gradients (HOG) descriptor proposed by Dalal and Triggs [2005] also captures local gradient information in a rectangular pixel neighborhood, but does not compute features for selected keypoints; instead, it slides a window across all possible image locations. It has already been successfully employed for character recognition [Wang et al. 2011]. However, basic HOG lacks rotation invariance.
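Since the method proposed in this paper computes SIFT descriptors densely at a fixed scale (Section 3), a brief sketch may be helpful. The OpenCV-based realization below is our illustrative assumption; the grid step of 2, the 36-pixel descriptor support and the fixed 0° orientation of the fixSift variant follow the values used later in the paper:

```python
import cv2

sift = cv2.SIFT_create()
img = cv2.imread("coin.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image

step, support = 2, 36  # sample every second pixel; 36 x 36 descriptor support
# angle=0.0 fixes the descriptor orientation (fixSift); using the detected
# dominant orientation instead yields the rotationally invariant variant.
grid = [cv2.KeyPoint(float(x), float(y), float(support), 0.0)
        for y in range(0, img.shape[0], step)
        for x in range(0, img.shape[1], step)]
keypoints, descriptors = sift.compute(img, grid)  # one 128-D vector per point
```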

Shape-based Descriptors. This category of descriptors is based on edge information. Shape contexts, introduced by Belongie et al. [2000], are similar to SIFT but employ edge information instead of gradient information to form histograms. Shape contexts are 2D histograms capturing local edge point distributions across a shape's border in a log-polar grid centered on a certain reference point [Zhu et al. 2011]. Berg et al. [2005] introduced the geometric blur descriptor, which also forms histograms with respect to points sampled along an object's contour. The descriptor comprises a set of sparse points sampled around the reference contour point. However, instead of simply forming edge pixel histograms, responses of multiple oriented Gauss filters, whose σ increases with the distance to the reference point, are accumulated in the descriptor. The DAISY descriptor introduced by Tola et al. [2010] can be regarded as a derivative of the geometric blur and SIFT descriptors. Its layout and construction follow the ideas of Berg et al. [2005]; however, the edge filtering step was replaced by simple convolutions. Thus, the descriptor can be computed much faster and allows for dense matching.

Texture-based Descriptors. These descriptors are based on gray-level distributions around a particular pixel and are commonly used in texture classification applications [Lazebnik et al. 2005] as well as for object recognition [Johnson and Hebert 1999]. One way of capturing the texture information of an image patch is to apply a Gabor transform [Gabor 1946] with filter kernels of different frequency and orientation, called Gabor wavelets. The response of a wavelet is large when the texture of the filtered image reflects the kernel and small otherwise. The SURF descriptor proposed by Bay et al. [2008] is reminiscent of Gabor wavelets and is based on responses of box filters which approximate second-order Gaussian derivatives. The use of integral images [Viola and Jones 2002] allows for rapid descriptor computation. Local binary patterns were originally introduced by Ojala et al. [1994] and capture gray-level distributions in the 3 × 3 neighborhood around interest points. The eight neighbors are thresholded with the gray-level value of the center pixel, assigned 1 if their values are greater than or equal to that of the center and 0 otherwise, and concatenated into an 8-bit vector describing the image patch.
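The basic 3 × 3 local binary pattern is simple enough to state in a few lines of NumPy; the sketch below is our minimal illustration of the operator described by Ojala et al. [1994]:

```python
import numpy as np

def lbp_3x3(img: np.ndarray) -> np.ndarray:
    """Compute one 8-bit LBP code per interior pixel of a gray-level image."""
    img = img.astype(np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # the eight neighbors, enumerated clockwise from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy: h - 1 + dy, 1 + dx: w - 1 + dx]
        # bit is 1 where the neighbor is greater than or equal to the center
        codes |= (neighbor >= center).astype(np.uint8) << bit
    return codes
```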

2.4 Text Localization

As the results achieved with standard OCR software drop drastically when it is applied to images of natural scenes [Epshtein et al. 2010], a preceding step which extracts image regions containing textual information is beneficial in such a scenario. In general, text localization techniques can be split into two categories: (1) texture-based and (2) region-based methods [Epshtein et al. 2010; Gllavata et al. 2004]. Methods of the first category scan the image at various resolutions and categorize image regions based on texture properties computed in a local pixel neighborhood. Depending on the method, the properties used vary and include, among others, edge density, the distribution of wavelet coefficients or intensity values [Epshtein et al. 2010]. The second category, region-based text detection, groups pixels sharing certain properties such as similar intensity or color values. The pixel groups found form connected components, whose texture properties are then inspected to single out regions representing characters [Epshtein et al. 2010]. According to Epshtein et al. [2010], the integral advantage of region-based methods is the ability to detect text of arbitrary scale and orientation in a single scan, thereby reducing computational costs.

2.5 Scene Text Recognition

The field of scene text recognition (STR) is closely related to image-based coin legend recognition. It aims at reading text in photographs or videos of natural scenes, i.e., images acquired under entirely unconstrained, so-called real-life conditions, disallowing any a priori assumptions regarding text properties such as position, orientation or color. Thus, an ideal scene text recognizer would be capable of spotting words in arbitrary images spanning the entire range of difficulty levels illustrated in Fig. 2, as defined by Wang and Belongie [2010].

Fig. 2. An overview of the difficulty levels of various text recognition tasks (the scene text images were taken from the ICDAR 2003 dataset, the CAPTCHA images were taken from Mori and Malik's project website [Mori and Malik 2003], coin legend images used with permission of [KHM 2013]).

The lower end of the spectrum is marked by scanned documents of machine-printed texts, as this problem is well researched. At the upper end are images that were artificially degraded and distorted with the goal of hindering OCR engines from correctly reading the depicted words. Such images, called visual CAPTCHAs, are used on websites to keep software bots from accessing Internet resources or, e.g., participating in on-line votes [Wang and Belongie 2010]. Mori and Malik [2003] were among the first to attempt to break visual CAPTCHAs. They present two algorithms tailored to two different image types. Their first approach aims at breaking CAPTCHAs showing one single word and employs a modified version of shape contexts [Belongie et al. 2002] for finding character candidate locations, which are combined to word candidates by acyclic directed graphs; nonsense words are discarded using a lexicon of invalid trigrams. The word having the highest overall probability is detected. The second algorithm is designed for images showing several pairs of overlapping words – a more challenging task, which is addressed with a holistic approach, i.e., they do not try to recognize individual characters but clusters of multiple characters. Bigram-based pruning helps sort out words of the lexicon representing the CAPTCHA vocabulary. The final word probabilities are computed as in their first algorithm. Chandavale et al. [2009] present another method for breaking CAPTCHAs. They propose elaborate methods for clutter removal and define a number of individual features for segmented characters. de Campos et al. [2009] were among the first to tackle the problem of character recognition in natural scene images, generalizing the approaches towards breaking CAPTCHAs to arbitrary images. STR is a multi-step process depending on initial text localization within the image. However, de Campos et al. only address the character recognition aspect of the problem and work with images of pre-segmented characters. They evaluate various classifiers and use local features in a bag-of-visual-words approach to describe and recognize characters. The method of Wang and Belongie [2010] works on images of entire words rather than on single-character images. They use HOG features in combination with an NN classifier for character recognition. After character segmentation and size normalization, HOG features are computed and tested against training sample images. The resulting score is then fed to the NN classifier to determine the closest distance to each character class. This results in lists of character scores for each image location and class. Non-maximum suppression (NMS) yields a set of character candidate locations per class, which are combined to meaningful lexicon words using Pictorial Structures [Felzenszwalb and Huttenlocher 2005].
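The NMS step can be sketched as a greedy procedure over scored candidate locations; the following is our illustrative version (function name and radius parameter are assumptions, not the authors' code):

```python
import numpy as np

def nms(locations: np.ndarray, scores: np.ndarray, radius: float) -> list:
    """Greedy NMS: keep the strongest candidates, suppress weaker neighbors."""
    order = np.argsort(-scores)      # strongest (highest-scoring) first
    kept = []
    for idx in order:
        p = locations[idx]
        # keep the candidate only if no stronger one survives nearby
        if all(np.linalg.norm(p - locations[k]) > radius for k in kept):
            kept.append(idx)
    return kept                      # indices of the surviving candidates
```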



Fig. 3. (a) Training of Support Vector Machines. (b) Legend Recognition Pipeline.

Wang et al. [2011] extended this approach and present the first end-to-end STR method. The NN classifier was replaced by Random Ferns [Bosch et al. 2007], since they offer multi-class support and perform efficiently even for many classes. An input image is processed by first computing a descriptor for a sliding window of different scales at every possible location. The classification of these descriptors yields probability values for each class at the according window location and scale, which are combined to meaningful words following the methodology presented in [Wang and Belongie 2010]. Their experiments with text detection methods and standard OCR software show that an initial localization of text regions does not significantly increase word recognition rates and can thus be omitted for reduced complexity [Wang et al. 2011].

3. METHODOLOGY

In this section, the proposed approach is explained in detail. It was shown above that traditional OCR methods are not suitable for coin legend recognition. Thus, our approach is rooted in object recognition and machine learning, similar to state-of-the-art scene text recognition systems such as the one introduced by Wang et al. [2011]. In contrast to traditional OCR systems, our method does not rely on binarization, which is error-prone when applied to images of ancient coins. Moreover, the presented approach does not depend on text localization, which requires large text blocks of a consistent prevailing orientation to be present in the image. The general architecture of the presented method is illustrated in Fig. 3. It is separated into an initial training step, shown in Fig. 3(a), and a recognition step, depicted in Fig. 3(b). When approaching legend recognition with object recognition techniques, a classifier is needed, since an exact matching of letters is not possible due to the large intra-class variance mentioned before. Our method employs an off-line training mechanism, that is, the target function of the classifier is not modified after the training phase. This strategy contrasts with on-line training, where the model is updated with each newly classified sample.

Fig. 4. (a) Left column: real examples of the letter 'I'; right column: structures found in the imagery resembling the letter 'I'. (b) The SIFT descriptors for two images of the letter 'A' are oriented differently due to the light's angle of incidence. (c) Mass-spring model for the word 'CREPVSI'. (d) Placing the PS model on the detected keypoints. (e) Derivation of SIFT descriptor orientations for straight and curved legend words.

3.1 Training

Support Vector Machines (SVMs) have been successfully used in previous character recognition applications [Diem and Sablatnig 2009] and are generally considered a versatile and powerful classifier. Therefore, the proposed algorithm relies on SVMs for character classification. The legends of the considered coins use the Latin alphabet and only show capital letters; certain letters do not occur in the legend words of the finite vocabulary and consequently are not considered. Additionally, the letter 'I' is ignored because many structures found in the coin imagery resemble its appearance (see Fig. 4(a)). This narrows the alphabet down to 18 letters. Thus, 18 different SVMs need to be trained, one for each letter or character class. The kernel and its respective parameters were determined empirically (see Section 4); our experiments show that RBF kernels yield the highest classification accuracy. For the training of an SVM, which is illustrated in Fig. 3(a), 50 samples of the respective positive class and 10 randomly chosen samples of every negative class are used, which leads to a total of 220 images (50 positive and 170 negative samples) for the training of each SVM. As the legend orientation is initially unknown, it is imperative to employ rotationally invariant SIFT features for character recognition. However, the presented experiments show that the omission of rotational invariance leads to higher character recognition rates. This is due to the reduced degrees of freedom and the fact that the use of fixed orientations makes the descriptor invariant to changes in the illumination direction. The relief structure of the coin surface causes edges that are perpendicular to the light source direction to be more salient. Consequently, the appearance of a legend letter may change significantly when illuminated from another direction; hence, the computed orientation of the SIFT descriptor changes as well, as illustrated in Fig. 4(b). Therefore, two sets of SVM databases are trained – one with rotationally invariant and one with fixed-oriented SIFT descriptors; the latter is used to validate word hypotheses detected using rotationally invariant descriptors.
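A minimal training sketch, under our own assumptions, may clarify this setup: one binary SVM per letter, trained on 50 positive descriptors and 10 descriptors from each of the 17 remaining classes. The 18-letter alphabet, the scikit-learn realization and the per-letter .npy files are illustrative assumptions; the RBF parameters correspond to the fixSift180 values reported for the Coins dataset in Table I:

```python
import numpy as np
from sklearn.svm import SVC

ALPHABET = list("ABCDEFGHKLMNOPRSTV")  # assumed 18-letter legend alphabet

def descriptors_for(letter: str) -> np.ndarray:
    # hypothetical storage layout: one array of SIFT descriptors per letter
    return np.load(f"descriptors_{letter}.npy")

svms = {}
for letter in ALPHABET:
    pos = descriptors_for(letter)[:50]                   # 50 positive samples
    neg = np.vstack([descriptors_for(o)[:10]             # 10 per negative class
                     for o in ALPHABET if o != letter])  # -> 170 negatives
    X = np.vstack([pos, neg])
    y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    svms[letter] = SVC(kernel="rbf", C=2.0**5, gamma=2.0**3).fit(X, y)

# At recognition time, svms[letter].decision_function(...) yields the signed
# distance to the separating hyperplane, which is used as the character score.
```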

3.2 Legend Recognition

After all SVMs have been trained and stored accordingly, coin images can be tested for certain legend words. The legend recognition process is depicted in Fig. 3(b) and follows the STR methodology proposed by Wang et al. [2011]. The recognition algorithm takes two input parameters: (1) an image showing the obverse or reverse of a coin and (2) an input lexicon comprising the words which the algorithm looks for in the input image. Hence, only words that are present in the lexicon can be found. As the coin images come in various sizes, they are downsampled to a standardized size of 384 × 384 pixels. This ensures that legend letters are approximately of the same size in all images. Thus, SIFT descriptors only need to be computed at a single scale, as opposed to Wang's approach, where letters of arbitrary size may occur. After the image has been downscaled, it is passed on to the Keypoint Localization step, where homogeneous image regions are filtered out, since they cannot contain relevant legend information. Since entropy is a measure of the information content of an image region [Sonka et al. 2008], an entropy filter is applied to the image and regions of low entropy are eliminated via a simple threshold computed with Otsu's method [Otsu 1979]. The entropy filter used returns an image of the same size as the input gray-level image. Each output pixel contains the entropy [Sonka et al. 2008] of a circular neighborhood having a diameter of 19 pixels; this neighborhood size was chosen experimentally. According to Sonka et al. [2008], the entropy H_e for an image with G gray levels is defined as

H_e = -\sum_{k=0}^{G-1} P(k)\,\log_2 P(k), \qquad (1)

where P(k) is the probability of gray level k. The entire process is illustrated in Fig. 5. Since simple thresholding would return the entire coin area (as shown in Fig. 5(c)), a subsequent threshold is applied only to the coin region to finally attain the region of interest (ROI). From the remaining ROI, every second pixel is sampled in both directions following a checkerboard pattern. These two steps drastically reduce the number of keypoints for which SIFT features have to be computed and thus significantly speed up the recognition process: For the test set of 180 coin images used, the application of the two thresholds reduces the initial 121,104 (384 × 384 pixels) keypoints to an average of 64,713 keypoints, which equals 43.9% of the initial number. Sampling only every second keypoint within this region reduces the number even further, to an average of only 32,358, which represents 21.9% of the original value. Thus, 78.1% fewer SIFT descriptors need to be computed and classified, which significantly accelerates the recognition process. The keypoints found are passed on to the Keypoint Classification step, where a rotationally invariant SIFT descriptor of a fixed scale is computed for each of them. The scale of the SIFT descriptors can be set to a fixed value since the font size of the legends is consistent for all coins considered in this work; a descriptor size of 36 × 36 pixels was determined empirically. These descriptors are then tested with all 18 SVMs trained for rotationally invariant descriptors. Thus, each keypoint gets assigned 18 different scores, one for each character class, telling how likely it is to find the respective letter at this location. In the current implementation, the score is simply the output of the decision function of the SVM, that is, the distance between the feature and the SVM's separating hyperplane. As shown in Section 4.1, this approach works sufficiently well. However, transforming the output of the decision function into an actual class probability, as proposed by Platt [2000], will be implemented in the future and is expected to improve the character and thus also the legend recognition results. The computed character scores serve, along with the downsampled input image and the lexicon, as input for the Word Detection step. In order to combine the character scores into words, the pictorial structures (PS) approach is adopted.
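The keypoint localization step can be summarized in a few lines; the sketch below is our illustrative reconstruction using scikit-image (the file name and library choice are assumptions; the 19-pixel-diameter neighborhood, the two Otsu thresholds and the checkerboard subsampling follow the description above):

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.filters.rank import entropy
from skimage.io import imread
from skimage.morphology import disk

img = imread("coin.png", as_gray=True)          # hypothetical input image
img8 = (img * 255).astype(np.uint8)

ent = entropy(img8, disk(9))                    # 19-pixel-diameter neighborhood
coin = ent > threshold_otsu(ent)                # 1st threshold: coin vs. backdrop
roi = coin & (ent > threshold_otsu(ent[coin]))  # 2nd threshold inside the coin

# checkerboard sampling: keep every second pixel in both directions
ys, xs = np.nonzero(roi)
keep = (ys + xs) % 2 == 0
keypoints = list(zip(xs[keep], ys[keep]))       # (x, y) keypoint locations
```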

Fig. 5. Detection of ROIs using an entropy filter. (a) The sample image used for illustrating the proposed legend recognition pipeline. (b) Low-energy regions within the image; manually marked in blue. (c) The ROI detected initially after thresholding the entropy-filtered image (highlighted in red). (d) The ROI detected when thresholding the entropy-filter result masked with the initially detected ROI. (e) The detected ROI marked in red in the sample image (with permission of [KHM 2013]).

PS were originally introduced by Fischler and Elschlager [1973] and later rediscovered for object recognition by Felzenszwalb and Huttenlocher [2005]. The general idea of PS is to describe an object by a configuration of its parts. The individual parts have to meet a specific topological relationship, i.e., they have to be placed in a certain way with respect to their arrangement and relative distances in order to be recognized as the object. Consequently, the overall probability for an object depends on the probabilities of the individual parts found and on how closely their spatial configuration meets the object description. In the context of legend recognition, words represent the objects and their letters are considered the parts. Thus, when an image is inspected for a certain word, the word model is placed at every possible location within the image and the placement costs for this location are computed. If the costs found for the cheapest location lie below a certain threshold, the word is considered to be found at this place. Fig. 4(c) illustrates a PS model for the word 'CREPVSI' in its ideal and deformed configurations, and Fig. 4(d) shows how the matching algorithm is applied to the detected keypoints. Formally, the word detection problem can be considered a weighted-graph optimization problem [Wang and Belongie 2010], where K = {k_1, ..., k_m} represents the set of the m keypoint or candidate character locations found in the keypoint detection step. The subset L = {l_1, ..., l_n | l_i ∈ K} ⊆ K contains the n locations of a specific n-character word configuration in the image. The multiset of characters for a word is given by C = {c_1, ..., c_n | c_i ∈ A}, where A = {C_1, ..., C_18} is the alphabet used. Then, G(C, E) is the undirected graph representing the configuration for this word, where C is the list of the n characters whose locations are given by l_i ∈ L, and E = {e_1, ..., e_{n-1}} is the list of the n − 1 edges e_j(l_j, l_{j+1}) connecting these locations, corresponding to the aforementioned conceptual springs [Wang and Belongie 2010] of a PS. The overall placement costs comprise two parts: (1) The matching costs [Wang and Belongie 2010], which are equal to the score s_{c_i}(l_i) that the SIFT descriptor at the location l_i received from the respective SVM_{c_i}.

The smaller the output of the SVM, the closer the match; hence, the matching costs describe how closely the image patch around the location l_i matches the character c_i. (2) The deformation costs [Wang and Belongie 2010] or distance costs depend on the edge lengths d_j = ||e_j|| = d(l_j, l_{j+1}) between two adjacent character locations of G. This term reflects how heavily the placement of the characters for the given word configuration deviates from the ideal word configuration given by the PS model, or – to stick with the aforementioned mass-spring model metaphor – how much tension is applied to the springs. If a distance does not meet the one described in the PS model, penalty costs proportional to the deviation are added to the overall costs. As stated by Wang et al. [2011], the ideal configuration for a certain word is the one causing the lowest costs and is found by minimizing the term

L^* = \min_{\forall i,\, l_i \in L} \left( \lambda \sum_{i=1}^{n} s_i(l_i, c_i) + (1 - \lambda) \sum_{i=1}^{n-1} d(l_i, l_{i+1}) \right). \qquad (2)

Reformulating Eq. 2 into the following recursive function allows for an optimization via dynamic programming (DP):

D(l_i) = \lambda s(l_i, c_i) + (1 - \lambda) \min_{l_{i+1} \in L} \left[ d(l_i, l_{i+1}) + D(l_{i+1}) \right], \qquad (3)

where λ is a trade-off parameter used for balancing matching and deformation costs. The recursive formulation allows optimizing Eq. 2 by first precomputing the matching costs for all l_n and then iteratively moving backwards through the word, combining all l_{n-1} within the proximity of l_n. This process is repeated recursively until the first letter is reached. The DP algorithm does not take into account that, due to the rotationally invariant nature of the SIFT descriptors employed, the character orientations within the word hypotheses found are not necessarily consistent. The characters should be oriented in such a way that all their up-vectors (Fig. 6(a)) match the layout of the word hypothesis (see Fig. 4(e)). Fig. 6(b) illustrates two words found within an image: the word 'ROMA' is found in the imagery and has randomly aligned letters, while the word 'CREPVSI' is found correctly and shows consistent letter alignment. In order to reject words with inconsistent letter alignment, a subsequent re-scoring step is introduced, which serves to reject hypotheses having inconsistent character orientations and to increase the confidence in the remaining hypotheses. This is achieved by a recomputation of the word scores, and thus the respective character scores, for the 50 word hypotheses which yielded the lowest scores in the DP algorithm. Within a specific word hypothesis, the relative positions of the candidate character locations are known. This makes it possible to derive the orientations the individual SIFT descriptors should have according to the word hypothesis (Fig. 4(e)). Consequently, the respective character scores can now be recomputed using the SVMs trained for fixed-oriented SIFT descriptors. That is, the SIFT descriptors for the locations l_i are recomputed using the orientation derived from the word hypothesis layout instead of the orientation derived from the gradients of the underlying image patch. The derivation of the SIFT descriptor orientations is illustrated in Fig. 4(e). These descriptors are then tested using the SVM_{c_i} trained for the character c_i, which is expected at the location l_i according to the respective word hypothesis. The resulting character scores are then employed to recompute the respective word score. As the character recognition experiments have shown (see Section 4.1), a higher accuracy is achieved when fixed-oriented SIFT descriptors are used. Consequently, the confidence in a word hypothesis based on character scores computed with SVMs trained for fixed-oriented SIFT descriptors can be considered higher. The lowest word score achieved in this step also represents the final score for a certain lexicon word. Following this approach, the scores for all words contained in a given lexicon are calculated and the word causing the lowest costs D(l_i) is considered the legend of the depicted coin; that is, the current implementation only allows for the detection of a single word. Words whose final score lies above a specific threshold θ are rejected.
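The backward recursion of Eq. 3 is compact enough to sketch directly. The implementation below is our illustration, not the authors' code; in particular, the deformation cost penalizes the deviation of adjacent letters from an assumed ideal spacing (ideal_dist), in line with the mass-spring description above:

```python
import numpy as np

def word_cost(scores: np.ndarray, locations: np.ndarray,
              lam: float = 0.5, ideal_dist: float = 20.0) -> float:
    """DP solution of Eq. 3. scores[i, k]: matching cost of word character i
    at candidate location k; locations[k]: (x, y) of candidate location k."""
    n, m = scores.shape
    D = np.empty((n, m))
    D[n - 1] = lam * scores[n - 1]          # last letter: matching cost only
    for i in range(n - 2, -1, -1):          # move backwards through the word
        for k in range(m):
            # deformation cost: deviation from the ideal letter spacing
            d = np.abs(np.linalg.norm(locations - locations[k], axis=1)
                       - ideal_dist)
            D[i, k] = lam * scores[i, k] + np.min((1 - lam) * d + D[i + 1])
    return D[0].min()                       # cost of the cheapest placement

# The lexicon word with the lowest word_cost(...) is reported as the legend.
```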


Fig. 6. (a) The up-vector is defined to be perpendicular to the direction in which the letter is read. (b) Illustration of two word hypotheses found: The word ’ROMA’ has misaligned letters, i.e., the up-vectors of its individual letters do not match the orientation of the word hypothesis; the letters of the word ’CREPVSI’ are aligned in accordance with the word hypothesis (with permission of [KHM 2013]).

4. EXPERIMENTS

This section presents a detailed evaluation of the presented legend recognition pipeline. In order to determine the optimal parameters for the SVMs to be used for legend recognition later on, character recognition is first benchmarked in an isolated step. Furthermore, the character recognition experiments allow for a comparison of various SIFT descriptor configurations.

4.1 Character Recognition Experiments

Character recognition experiments were carried out on three different datasets. The coin legend letters (Coins) dataset comprises 900 manually segmented legend letter images (50 images per letter) with a standardized size of 100 × 100 pixels. The synthetic letters (Synth) dataset consists of synthetically generated images mimicking the appearance of coin legend letters, created using a standard vector graphics editor; again, 50 images of 100 × 100 pixels are used for each of the 18 character classes. The third dataset is the ICDAR 2003 (Icdar) dataset, which has been frequently used to evaluate character, word or text recognition systems [Epshtein et al. 2010; Wang et al. 2011]. The images used from the Icdar dataset range from 6 × 15 to 444 × 457 pixels in size and – where possible – 50 images per class were used. Fig. 7(a) shows some sample images of each dataset. In addition, two different options of the SIFT descriptor were combined, giving a total of four configurations. (1) The use of either rotationally invariant or fixed-oriented SIFT descriptors. In the former case, the orientation of the SIFT descriptor is computed from the gradients of the image patch surrounding the respective keypoint, as described by Lowe [2004] (dynaSift). In the latter case, the SIFT descriptor is computed for a fixed orientation of 0° (fixSift). This option helps to evaluate whether the omission of rotational invariance has a positive effect on the character classification rate. (2) The use of either the entire angular spectrum (sift360) or only half the spectrum (sift180). In the latter case, gradients in the range (180°, ..., 360°] are rotated back by 180° to make the descriptor invariant to diametrically opposed edge directions. This is a reasonable extension considering the relief structure of legend letters, where the edge direction depends on the light's angle of incidence. Fig. 7(b)-(e) illustrate the various SIFT descriptor configurations tested. Finally, three different SVM kernels (linear, polynomial and radial basis function (RBF)) and their respective parameters were varied in the experiments. The optimal configuration was found via 5-fold cross-validation on the training set.

Table I. Character recognition results and optimal SVM and SIFT descriptor configurations found via cross-validation.

Coins dataset:
Setting              Accuracy  Precision  Recall   Kernel  C      σ    d
dynaSift180 (noBG)   44.44%    47.62%     44.44%   RBF     2^7    2^3  —
dynaSift360 (noBG)   45.56%    74.55%     45.05%   poly    2^-11  —    2
fixSift180 (noBG)    75.56%    83.95%     75.56%   RBF     2^5    2^3  —
fixSift360 (noBG)    64.44%    75.32%     64.44%   RBF     2^3    2^3  —
dynaSift180 (w/ BG)  45.56%    59.42%     45.05%   poly    2^-11  —    2
dynaSift360 (w/ BG)  45.56%    74.55%     45.05%   poly    2^-11  —    2
fixSift180 (w/ BG)   72.22%    77.38%     71.43%   RBF     2^4    2^3  —
fixSift360 (w/ BG)   68.89%    82.67%     68.13%   RBF     2^5    2^4  —

Synth dataset:
dynaSift180          63.89%    67.65%     63.89%   RBF     2^9    2^3  —
dynaSift360          62.78%    64.94%     62.78%   RBF     2^5    2^3  —
fixSift180           83.89%    88.30%     83.89%   RBF     2^5    2^3  —
fixSift360           78.89%    88.75%     78.89%   RBF     2^7    2^3  —

Icdar dataset:
dynaSift180          53.03%    55.93%     55.76%   poly    2^-11  —    2
dynaSift360          56.77%    63.14%     59.70%   RBF     2^7    2^3  —
fixSift180           72.05%    74.18%     75.76%   RBF     2^3    2^3  —
fixSift360           72.33%    78.93%     76.06%   poly    2^-11  —    2
Table I lists the accuracy, precision and recall achieved on the test sets and the optimal SVM and SIFT descriptor configurations found via cross-validation. The following parameter ranges were used in the grid search: Linear kernels do not have specific kernel parameters; therefore, only the box constraint C was varied over the values C = {2^-27, 2^-23, ..., 2^9}. Besides the box constraint, which can take on the same values as for linear kernels, the polynomial degree was varied from 2 to 5 for polynomial kernels. In the evaluation of RBF kernels, the box constraint is limited to the values C = {2^-3, 2^-1, ..., 2^9} and the kernel parameter γ can take on the values γ = {2^-15, 2^-13, ..., 2^3, 2^4}. As listed in Table I, fixSift generally performs better than dynaSift because it guarantees that the descriptors for all characters of a class are oriented in the same way, that is, the intra-class variability is minimized. Moreover, for fixSift, the sift180 configuration performs better, since the fixed orientation assures that letters illuminated from opposite directions result in similar SIFT descriptors when mapped to the half angular spectrum. In the case of dynaSift, this cannot be guaranteed, because for certain letters (such as 'A' or 'V') the dominant gradient direction follows the orientation of the letter's legs rather than the light source direction, as illustrated in Fig. 7(c). Table I shows that the best classification accuracy of 83.89% is achieved for the Synth dataset with the fixSift180 configuration using RBF kernels. The lowest classification accuracy is attained for the Coins dataset in combination with the dynaSift180 configuration using RBF kernels. This results from the fact that the Coins dataset is the most challenging of the three sets benchmarked: in many cases the contours are ragged, scratches and dirt reduce the legibility, and shadows and highlights depend on the light's angle of incidence and introduce additional intra-class variations. The images of the Synth dataset, however, provide crisp contours and are free from noise. The recognition for the Icdar dataset works better than for the Coins dataset because it only contains images of printed letters; thus, the orientation of the light source does not influence the SIFT descriptor. Moreover, the images show a strong contrast between background and text color, and the letters have solid contours, which leads to lower variations of the computed SIFT descriptors.
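A sketch of this grid search using scikit-learn is given below (our illustration; the placeholder training data stands in for the labeled SIFT descriptors of one character class, and the parameter grids reproduce the ranges quoted above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train = np.random.rand(220, 128)           # placeholder: 220 SIFT descriptors
y_train = np.r_[np.ones(50), np.zeros(170)]  # 50 positive, 170 negative labels

param_grid = [
    {"kernel": ["linear"], "C": [2.0**e for e in range(-27, 10, 4)]},
    {"kernel": ["poly"], "C": [2.0**e for e in range(-27, 10, 4)],
     "degree": [2, 3, 4, 5]},
    {"kernel": ["rbf"], "C": [2.0**e for e in range(-3, 10, 2)],
     "gamma": [2.0**e for e in range(-15, 4, 2)] + [2.0**4]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)
print(search.best_params_)                   # optimal kernel and parameters
```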


Fig. 7. (a) Samples of the character recognition datasets. (b) SIFT descriptors with a fixed horizontal orientation. (c) SIFT descriptors oriented according to the dominant gradient direction of the respective letter. (d) Full vs. half angular spectrum. (e) Construction of the half angular spectrum. (f) Samples of the legend and word recognition datasets (with permission of [KHM 2013]).

4.2 Legend Recognition Experiments

This section presents the results of the legend and word recognition experiments. For these experiments, the SVM parameters that proved best for character classification were used. The word recognition algorithm was evaluated on coin images (Coins) and on a subset of images of the ICDAR 2003 Robust Reading Dataset (Icdar). Fig. 7(f) shows samples of the two datasets. The coin images are more challenging than the ICDAR images because the word orientation is unknown and the contours of the individual letters are often weak. As can be seen in Fig. 7(f), the ICDAR images show only a single cropped word aligned horizontally, which the word recognition pipeline exploits by using only horizontally aligned SIFT descriptors and by letting the dynamic programming (DP) consider only word configurations running from left to right. The coin image test set comprises 180 different coins with arbitrary legend orientations and is evaluated with lexica of various sizes n, where n ∈ {5, 10, 20, 30}. For the presented experiments, an individual lexicon was constructed for each image by adding n − 1 randomly selected words from a master lexicon, which contains all words present in the ground truth of the test set, to the image's ground truth. The master lexicon for the Coins dataset comprises 35 different words, of which the longest have 8 letters (such as 'NORBANVS') and the shortest 3 letters (like 'PVB'). For the Icdar dataset, the master lexicon contains 77 words, the longest having 9 letters (e.g., 'TRANSPORT') and the shortest being 3 letters long (such as 'THE'). In a real-world legend recognition scenario for Roman Republican coins, all words to be found in the respective legends would be present in the master lexicon, which would then be used for every input image. The current implementation of the word recognition algorithm detects only one word within an image: the one causing the lowest placement costs. The Icdar test set contains 95 images showing words that can be formed from the 18-letter alphabet and is evaluated with the same four lexicon sizes.
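The per-image lexicon construction described above amounts to mixing the ground-truth word with randomly drawn distractors. A minimal sketch follows; the function is illustrative, and the real master lexica hold 35 (Coins) and 77 (Icdar) words.

import random

def build_lexicon(ground_truth_word, master_lexicon, n):
    # Image lexicon = ground truth + (n - 1) random distractor words.
    distractors = [w for w in master_lexicon if w != ground_truth_word]
    return [ground_truth_word] + random.sample(distractors, n - 1)

# e.g., build_lexicon("NORBANVS", master_lexicon, n=5)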


Table II. Word Recognition Results.

Dataset                      n = 5    n = 10   n = 20   n = 30   n = 50
Coins, no rescoring          31.7%    21.7%    17.8%    13.9%    —
Coins, after rescoring       53.3%    42.8%    34.4%    28.9%    —
Icdar                        67.4%    58.9%    51.6%    48.4%    —
Wang et al. (ICDAR 2003)     —        —        —        —        72.0%
Wang et al. (Street View)    —        —        —        —        56.0%

Table II lists the results of the word recognition experiments achieved by our method and compares them to the results of Wang et al. [2011]. As expected, the recognition rates drop as the lexicon size increases, because the chance of detecting the right word by accident decreases. Regardless of the lexicon size, the proposed method works better on the Icdar dataset. This is because the words of the Icdar dataset are all aligned horizontally and no rotationally invariant SIFT descriptors are involved in the recognition process; as demonstrated in the character recognition experiments, fixed SIFT descriptors yield higher recognition rates, and consequently the entire word recognition works better. The images of the Coins dataset are initially scanned for characters using rotationally invariant SIFT descriptors to detect word hypotheses. The top 50 hypotheses are then reevaluated with fixed SIFT descriptors. This step helps reduce false positives and significantly increases the classification accuracy (from 13.9% to 28.9% for 30 lexicon words). Fig. 8(a) shows examples of misclassified (left column) and correctly recognized legends (right column). In Fig. 8(b), false and true positives for the Icdar dataset are presented. However, the proposed legend recognition system does not yet achieve the recognition rates of dedicated scene text recognition methods, such as the one by Wang et al. [2011]. This is because their method is explicitly designed for flat textures, as encountered in the Icdar dataset, while ours respects the challenging conditions of the coin's relief structure and hence is not tailored to flat characters.
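The two-stage procedure for the Coins dataset can be summarized as below; scan and rescore are hypothetical stand-ins for the rotation-invariant scanning and the fixed-orientation re-evaluation, which are not reproduced here.

def recognize_word(image, lexicon, scan, rescore, k=50):
    # Stage 1: word hypotheses as (word, cost) pairs from rotation-invariant
    # (dyna) SIFT descriptors. Stage 2: re-evaluate the k cheapest hypotheses
    # with fixed-orientation (fix) SIFT descriptors.
    hypotheses = scan(image, lexicon)
    top_k = sorted(hypotheses, key=lambda h: h[1])[:k]
    rescored = [(word, rescore(image, word)) for word, _ in top_k]
    return min(rescored, key=lambda h: h[1])[0]   # cheapest word wins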

5. CONCLUSION

This work presents a novel technique for legend recognition on ancient Roman Republican coins. Unlike conventional OCR methods, the proposed method does not rely on binarization: coin legends have the same color as the rest of the coin, so the text cannot be separated from the background by simple thresholding operations. Our method is therefore based on recent advances in the field of scene text recognition and employs SIFT descriptors to describe the image patches around densely sampled keypoints. The SIFT descriptors are then tested against a set of SVMs, each of which is trained for one letter of the alphabet used, so that every keypoint receives a vector holding the scores for each letter. Based on the keypoint locations and character scores, a set of lexicon words is fit to the image. The adoption of pictorial structures allows an overall cost to be assigned to each possible word configuration. For each word, the configuration causing the lowest cost is chosen, and the lexicon word whose optimal configuration is the cheapest is detected in the image. The presented system respects the special challenges imposed by the relief structure of the coin's metallic surface and adapts the SIFT descriptors to account for shadows and highlights that depend on the light's angle of incidence by accumulating gradients pointing in opposite directions in the same orientation histogram bin. Despite being tailored towards coin legend recognition, the proposed system can easily be adapted to detect text in other image types, such as the images provided by the ICDAR 2003 dataset. In this paper, the same alphabet was used throughout, but the design of our system allows working with an arbitrary alphabet. However, changing the alphabet requires teaching the letter appearances to the SVMs via respective training data. Introducing new words to the lexicon is simple as long as the new word can be spelled with the alphabet the SVMs are configured for; it suffices to insert the word into the lexicon passed to the detection algorithm.
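Because the letters of a legend form a chain, the minimum-cost placement of a lexicon word can be found with dynamic programming over the keypoints. The sketch below shows this chain-model computation in simplified form; the paper's exact unary (SVM score) and pairwise (geometric) cost terms are not reproduced, so char_scores and pair_cost stand in for them.

import numpy as np

def word_cost(word, keypoints, char_scores, pair_cost):
    # char_scores[i][c]: unary cost of letter c at keypoint i;
    # pair_cost(p, q): cost of placing consecutive letters at positions p, q.
    n = len(keypoints)
    cost = np.array([char_scores[i][word[0]] for i in range(n)])
    for letter in word[1:]:
        prev = cost
        cost = np.array([
            char_scores[i][letter]
            + min(prev[j] + pair_cost(keypoints[j], keypoints[i])
                  for j in range(n))
            for i in range(n)
        ])
    return float(cost.min())

# The detected word is the lexicon entry with the lowest optimal cost:
# best = min(lexicon, key=lambda w: word_cost(w, kps, scores, pair_cost))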


The presented work is the first to provide a legend recognition method for ancient Roman Republican coins capable of detecting legend words at arbitrary orientations; both straight and curved words can be detected. Our system can thus be integrated as a preselection step into a fully automated image-based coin recognition system to narrow down the set of possibly matching classes. The proposed system is therefore not only interesting for computer vision in general and scene text recognition in particular, but also for numismatics: since the legend is highly discriminative, a legend recognition component can significantly improve the results of a fully automated coin classification system. Such a system can speed up the classification process and reduce the workload of numismatists dealing with coin finds of laymen who want their finds checked for potential value by professionals. However, the current implementation requires the text to have a certain size relative to the coin image size, since the SIFT descriptors are computed at only one fixed scale. This works well for Roman Republican coins, as their legends are all of similar size, but an exact matching of the descriptor scale to a letter's extent is expected to increase the recognition performance and would allow the application of our method to other types of ancient coins. Another limitation of the presented system is that only one word is detected per image. Many coin legends comprise multiple words and have words in common; that is, the recognition of one word might not provide sufficient information to tell certain classes apart. Thus, a future implementation must be capable of detecting multiple words and rejecting word hypotheses with low scores. Moreover, the evaluation of other local features that provide even better invariance to changes in illumination and shadowing will be the topic of further research.

Fig. 8. (a) Results for the Coins dataset. Left column: misclassified coin legends (with permission of [KHM 2013]). Right column: correctly recognized legends. (b) Results for the Icdar dataset. Left column: misclassified words. Right column: correctly recognized words.



REFERENCES
Faraj Alhwarin, Danijela Ristić-Durrant, and Axel Gräser. 2010. VF-SIFT: very fast SIFT feature matching. In Proceedings of the 32nd DAGM Conference on Pattern Recognition. 222–231.
Ognjen D. Arandjelović. 2010. Automatic attribution of ancient Roman imperial coins. In Proceedings of the Conference on Computer Vision and Pattern Recognition. 1728–1734.
Ognjen D. Arandjelović. 2012. Reading ancient coins: automatically identifying denarii using obverse legend seeded retrieval. In Proceedings of the 12th European Conference on Computer Vision - Volume Part IV. 317–330.
Dana H. Ballard. 1981. Generalizing the Hough Transform to Detect Arbitrary Shapes. Pattern Recognition 13 (1981), 111–122.
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2008. SURF: Speeded-Up Robust Features. Computer Vision and Image Understanding 110, 3 (2008), 346–359.
Serge Belongie, Jitendra Malik, and Jan Puzicha. 2000. Shape Context: A new descriptor for shape matching and object recognition. In Neural Information Processing Systems. 831–837.
Serge Belongie, Jitendra Malik, and Jan Puzicha. 2002. Shape Matching and Object Recognition Using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 4 (April 2002), 509–522.
Alexander C. Berg, Tamara L. Berg, and Jitendra Malik. 2005. Shape matching and object recognition using low distortion correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1. 26–33.
Anna Bosch, Andrew Zisserman, and Xavier Muñoz. 2007. Image Classification using Random Forests and Ferns. In International Conference on Computer Vision. 1–8.
R. Bremananth, B. Balaji, M. Sankari, and A. Chitra. 2005. A New Approach to Coin Recognition Using Neural Pattern Analysis. In Proceedings of IEEE INDICON 2005. 366–370.
Anjali A. Chandavale, Ashok M. Sapkal, and Rajesh M. Jalnekar. 2009. Algorithm to Break Visual CAPTCHA. In 2nd International Conference on Emerging Trends in Engineering and Technology. 258–262.
Michael H. Crawford. 1974. Roman Republican Coinage. Cambridge University Press.
Navneet Dalal and Bill Triggs. 2005. Histograms of Oriented Gradients for Human Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition. 886–893.
Paul Davidsson. 1996. Coin Classification Using A Novel Technique For Learning Characteristic Decision Trees By Controlling The Degree Of Generalization. In Ninth International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems. Gordon and Breach Science Publishers, 403–412.
Teófilo E. de Campos, Bodla Rakesh Babu, and Manik Varma. 2009. Character Recognition in Natural Images. In Proceedings of the International Conference on Computer Vision Theory and Applications, Vol. 18. 133–140.
Markus Diem and Robert Sablatnig. 2009. Recognition of Degraded Handwritten Characters Using Local Features. In International Conference on Document Analysis and Recognition. 221–225.
Masakazu Ejiri. 2007. Machine Vision in Early Days: Japan's Pioneering Contributions. In Computer Vision – ACCV 2007, Yasushi Yagi, SingBing Kang, InSo Kweon, and Hongbin Zha (Eds.). Lecture Notes in Computer Science, Vol. 4843. Springer Berlin Heidelberg, 35–53.
Boris Epshtein, Eyal Ofek, and Yonatan Wexler. 2010. Detecting Text in Natural Scenes with Stroke Width Transform. IEEE Conference on Computer Vision and Pattern Recognition 1 (2010), 2963–2970.
Pedro F. Felzenszwalb and Daniel P. Huttenlocher. 2005. Pictorial Structures for Object Recognition. International Journal of Computer Vision 61 (2005), 55–79.
Martin A. Fischler and Robert A. Elschlager. 1973. The Representation and Matching of Pictorial Structures. IEEE Trans. Comput. 22, 1 (January 1973), 67–92.
Minoru Fukumi, Sigeru Omatu, Fumiaki Takeda, and Toshihisa Kosaka. 1991. Rotation-invariant Neural Pattern Recognition System with Application to Coin Recognition. In Proceedings of the International Joint Conference on Neural Networks, Vol. 2. 1027–1032.
Dennis Gabor. 1946. Theory of communication. Part 1: The analysis of information. Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering 93, 26 (1946), 429–441.
Julinda Gllavata, Ralph Ewerth, and Bernd Freisleben. 2004. Text Detection in Images Based on Unsupervised Classification of High-Frequency Wavelet Coefficients. In Proceedings of the 17th International Conference on Pattern Recognition, Vol. 1. 425–428.
Reinhold Huber-Mörk, Sebastian Zambanini, Maia Zaharieva, and Martin Kampel. 2011. Identification of ancient coins based on fusion of shape and local features. Machine Vision and Applications 22 (2011), 983–994.
Andrew E. Johnson and Martial Hebert. 1999. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 21, 5 (1999), 433–449.
Ian T. Jolliffe. 1986. Principal Component Analysis. Springer Verlag.
Martin Kampel and Maia Zaharieva. 2008. Recognizing Ancient Coins Based on Local Features. In Advances in Visual Computing, Vol. 5358. Springer, 11–22.
KHM. 2013. Kunsthistorisches Museum Wien. Burgring 5, 1010 Vienna, Austria. http://www.khm.at/en/
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. 2005. A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005), 1265–1278.
David G. Lowe. 1999. Object Recognition from Local Scale-Invariant Features. In Proceedings of the International Conference on Computer Vision, Vol. 2. 1150–1157.
David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
Krystian Mikolajczyk and Cordelia Schmid. 2005. A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 10 (2005), 1615–1630.
Greg Mori and Jitendra Malik. 2003. Recognizing objects in adversarial clutter: breaking a visual CAPTCHA. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'03). IEEE Computer Society, 134–141.
Michael Nölle, Harald Penz, Michael Rubik, Konrad Mayer, Igor Holländer, and Reinhard Granec. 2003. Dagobert - A new Coin Recognition and Sorting System. In Proceedings of the 7th International Conference on Digital Image Computing - Techniques and Applications. 329–338.
Timo Ojala, Matti Pietikäinen, and David Harwood. 1994. Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In Proceedings of the 12th International Conference on Pattern Recognition.
Nobuyuki Otsu. 1979. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man and Cybernetics 9 (1979), 62–66.
Réjean Plamondon and Sargur N. Srihari. 2000. On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1 (January 2000), 63–84.
J. Platt. 2000. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers. 61–74.
Marco Reisert, Olaf Ronneberger, and Hans Burkhardt. 2006. An efficient gradient based registration technique for coin recognition. In Proceedings of the Muscle CIS Coin Competition Workshop. 19–31.
Jaakko Sauvola and Matti Pietikäinen. 2000. Adaptive Document Image Binarization. Pattern Recognition 33 (2000), 225–236.
Silvio Savarese, John Winn, and Antonio Criminisi. 2006. Discriminative Object Class Models of Appearance and Shape by Correlatons. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Vol. 2. 2033–2040.
Milan Sonka, Vaclav Hlavac, and Roger Boyle. 2008. Image Processing, Analysis, and Machine Vision, 3rd Edition. Cengage Learning.
Ching Y. Suen, Marc Berthod, and Shunji Mori. 1980. Automatic recognition of handprinted characters - The state of the art. In Proceedings of the IEEE, Vol. 68. 469–487.
Gustav Tauschek. 1935. Reading Machine. US Patent 2,026,329. (31 December 1935).
Engin Tola, Vincent Lepetit, and Pascal Fua. 2010. Daisy: An efficient dense descriptor applied to wide baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 5 (2010), 815–830.
Laurens J. van der Maaten and P. Poon. 2006. Coin-o-matic: A fast system for reliable coin classification. In Proceedings of the Muscle CIS Coin Competition Workshop. 7–18.
Alessandro Vinciarelli. 2002. A Survey on Off-Line Cursive Word Recognition. Pattern Recognition 35, 7 (2002), 1433–1446.
Paul Viola and Michael Jones. 2002. Robust real-time object detection. International Journal of Computer Vision 2 (2002), 137–154.
Kai Wang, Boris Babenko, and Serge Belongie. 2011. End-to-End Scene Text Recognition. In International Conference on Computer Vision. 1457–1464.
Kai Wang and Serge Belongie. 2010. Word Spotting in the Wild. In European Conference on Computer Vision. 591–604.
Maia Zaharieva, Martin Kampel, and Sebastian Zambanini. 2007. Image Based Recognition of Ancient Coins. In Computer Analysis of Images and Patterns. 547–554.
Yong Zhang and Kai-Bin Wei. 2010. Research on wide baseline stereo matching based on PCA-SIFT. In 3rd International Conference on Advanced Computer Theory and Engineering, Vol. 5. 137–140.
Chao Zhu, Charles-Edmond Bichot, and Liming Chen. 2011. Visual object recognition using DAISY descriptor. In International Conference on Multimedia and Expo. 1–6.

Received May 2013; revised July 2013; accepted October 2013
