Generalised Object Recognition

Rory C Flemmer, Huub H C Bakker
School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand
[email protected]

Abstract—Object recognition from machine vision is a complex task that has, to date, no formal method of solution. The use of brightness contours instead of edges, and of the corresponding contour profile diagram, or fingerprint, provides mathematically inexpensive comparisons that can be performed efficiently in a database. More than that, the method provides a formalism which, we claim, is appropriate for all machine vision. This, the first of two papers, outlines the general problem and the current solution, and provides preliminary results. The second paper [1] deals in greater depth with data mining a library of known objects for matches and with the final confirmation or rejection of the object identity.

Keywords—machine vision; object recognition; image contours; fingerprints

I. INTRODUCTION

In 1983 Rosenfeld [2] summarised the field of image analysis over the previous 25 years and provided a synopsis of the then-current paradigm. He viewed the first step in image analysis as segmentation into physically significant areas, resulting from edge-finding and pixel analysis. Thereafter post-processing, perhaps in the hands of a later generation, would construct object geometries and recognise objects. Since then, a voluminous and clever literature has developed on edge-finding, segmentation and object recognition, but progress on Rosenfeld's scheme has been slight and, another 25 years on (despite a more than two-thousand-fold increase in computer power, computed from Moore's Law), the image analysis community would not claim to have succeeded in the central ambition of image analysis: the identification of the general objects present in a general image and the measurement of their orientations and positions.

It has been known for a long time [3] that the eye's saccadic movements are conducted on the basis of visual perceptual hypotheses. It is also known that much of the retina must be out of focus, so the eye cannot operate on a complete, sharp image such as we have in artificial vision. It is also clear, loosely speaking, that the brain asks the eye/visual cortex to look at details of the visual field; it does not wait for a report complete with segmentation and object recognition. That task is too hard – in fact it is impossible, because the retina/visual cortex cannot recognise objects which might be known elsewhere in the cortex. Coupled with the possibility that we are asking the wrong question of artificial vision goes the notion that it is not generally possible to find complete and unambiguous edges for objects in real scenes [4,5]. Yet the eye/brain manages very well. Perhaps if we followed the strategy resulting from half a billion years of biological refinement (since trilobites) we might have better success. Broadly, this requires that we produce a guess at what object might be in the picture and then perform tests that confirm or deny the hypothesis. Once we know what we are looking at, the problem is orders of magnitude easier. Further, the hardware operation in looking should be active and intelligent, as is the human eye/cortex. We offer a method whereby this can be accomplished. Specifically, we can recognise and report the position and orientation of any known rigid (fixed-shape) object in an image. This can be done acceptably quickly on a PC, based on an object library comprising hundreds of thousands of objects stored on the PC. Preliminary work on these ideas was reported at ICARA 2005 [6].

II. PIXEL RESOLUTION

Standard CCD cameras offer a retina with about 640 x 480 pixels, though of course megapixel cameras are available. It seems appropriate to set our benchmark relative to the human eye. A performance of 20/20 on the Snellen eye test implies that an object (the letter E, say) can be distinguished when it subtends five minutes of arc. In order to recognise it, a pixel matrix of about 10 x 10 pixels covering the letter would seem to be required. This implies that an equivalent pixel would subtend about 30 seconds of arc (Polyak [7] estimates 24 seconds in the fovea). In fact, it is probably less, because most retinal neurons function within an interconnected net. The human field of gaze subtends at least 90 degrees. This implies a "camera" of over 100 million pixels – or about 380 standard cameras. The human eye achieves this performance by using saccadic movements, moving a "telescope" successively over small areas of the image and integrating the findings. The process of scanning the whole area of the gaze in detail can take minutes. We are not bound to follow the same procedure, because our camera pixels do not become saturated and do not need to be refreshed, as biological pixels do. However, it seems necessary to provide a significant level of zoom for our lens. In this way, we can gaze at the big picture and zoom in to an appropriate resolution in order to recognise objects. Our experience indicates that it is convenient to learn objects at a resolution which provides about 100 000 pixels to describe the object. This resolution permits the observation of considerable internal detail. Unlike the eye, our artificial optical system can zoom to any reasonable level we wish and can in fact have considerably more acute vision, because we are not limited by the diameter of the eye. There is therefore no need to be parsimonious with pixel resolution.
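As a quick check of the pixel budget estimated above, the following minimal sketch (assuming the 30-arcsecond equivalent pixel and 90-degree field quoted in the text) reproduces the figures:

```python
# Back-of-envelope check of the pixel budget estimated above.
ARCSEC_PER_PIXEL = 30      # equivalent pixel subtense from the Snellen argument
FIELD_DEGREES = 90         # human field of gaze
CAMERA_PIXELS = 640 * 480  # one standard CCD camera

pixels_per_side = FIELD_DEGREES * 3600 / ARCSEC_PER_PIXEL  # 10,800
total_pixels = pixels_per_side ** 2                        # ~1.17e8, i.e. >100 million
print(f"{total_pixels:.3g} pixels, about {total_pixels / CAMERA_PIXELS:.0f} standard cameras")
```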

III. CONTOURS

For the initial application, consider a 640 x 480 monochrome image. Imagine the 640 and 480 pixels to lie along the X and Y axes respectively, and that these axes are horizontal. Consider the Z axis to be vertical and to represent the image grey level at each (X, Y), and consider the surface so formed to be smooth. The image can then be viewed as a topography. Define the contour of a specified grey level in exact analogy with cartographic contours. The co-ordinates of points on the contour will not generally be integral in X and Y. It is a relatively simple matter to construct a set of non-integer data duples (X, Y) for each contour such that the distance along the contour between duples is exactly one pixel. For a particular image, construct such contours at intervals of 10 grey levels from 10 to 250 (grey levels in standard-format cameras range from 0 to 255). Figure 1 shows an image and figure 2 shows the same image, greyed out and with contours superimposed.
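The paper does not give the contour-tracing code itself; the sketch below shows one way to realise the construction just described, using scikit-image's marching-squares routine as a stand-in and resampling each contour to exactly one pixel of arc length between points. The file name mug.png is hypothetical.

```python
# Extract iso-grey-level contours every 10 grey levels and resample each
# contour so that consecutive points are exactly one pixel apart along it.
import numpy as np
from skimage import io, measure

def resample_unit_spacing(contour):
    """Resample an (N, 2) contour to one-pixel spacing along its arc length."""
    steps = np.linalg.norm(np.diff(contour, axis=0), axis=1)
    s = np.concatenate(([0.0], np.cumsum(steps)))      # arc length at each vertex
    s_new = np.arange(0.0, s[-1], 1.0)                 # one-pixel sample points
    return np.column_stack([np.interp(s_new, s, contour[:, k]) for k in (0, 1)])

image = io.imread("mug.png", as_gray=True) * 255       # grey levels 0..255
contours = {
    level: [resample_unit_spacing(c) for c in measure.find_contours(image, level)]
    for level in range(10, 251, 10)                    # levels 10, 20, ..., 250
}
```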

Figure 1. Original image of the mug.

IV. FINGERPRINTS

Represent each contour by its second derivative. That is, plot the amount by which the contour rotates per pixel of distance along its trajectory versus the distance along the contour. Such a plot is shown in figure 3, and this curve we term a fingerprint. The plot represents the specific contour in figure 2 which starts at the lower left vertical edge of the silhouette and proceeds to outline the mug in a clockwise direction. Because this particular contour was closed, some of the first points were wrapped around in order to provide 125% of a cycle. This was done so that, if the contour started partway through a feature, the whole of that feature would be accessible at the end of the contour.

Figure 3. Fingerprint of a contour on the outer edge of the mug.

Examination of figure 3 shows that, as the contour progresses up the left edge of the mug, it is straight until it reaches the top left corner. The corresponding initial portion of the fingerprint remains very close to the axis, reflecting the fact that there is very little change in orientation from pixel to pixel along this section of the contour. When the contour reaches the top of the mug, there is a positive spike in the fingerprint, representing the change of direction as the contour moves from the vertical side to the horizontal top surface. The subsequent form of the fingerprint in figure 3 can, with a little consideration, be mapped back to the image. It may further be noted that the fingerprint of figure 3 is completely specific and contains the detailed information of the contour in a form which is independent of translation or rotation of the image contour; rotation of the contour merely translates the fingerprint horizontally. Further consideration shows that the area of a lobe (defined between approaches to the X axis) represents the amount of rotation of the contour as it traverses the lobe. In the case of the first lobe of the fingerprint, the area would be about π/2 radians. The second and third lobes would have similar areas, although the third lobe would be negative. The fourth lobe could be expected to have an area of approximately π radians, corresponding to the rotation all the way around the mug handle. This representation of the contour data has the further delightful property that, as the mug recedes from the camera and becomes smaller in the image, the lobe areas do not change: halving the size of the contour doubles the rotation per pixel but halves the number of pixels in each lobe, so the lobe area, being the total rotation across the feature, is unchanged.
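A minimal sketch of the fingerprint computation, assuming the unit-spaced contours of section III (the 25% wrap for closed contours follows the text):

```python
# Fingerprint: rotation (radians) per pixel of travel along a unit-spaced contour.
import numpy as np

def fingerprint(contour, closed=True):
    pts = np.vstack([contour, contour[: len(contour) // 4]]) if closed else contour
    d = np.diff(pts, axis=0)
    headings = np.arctan2(d[:, 0], d[:, 1])   # direction of each unit step
    turning = np.diff(headings)
    # Wrap to [-pi, pi) so a heading crossing the +/-pi cut does not spike.
    return (turning + np.pi) % (2 * np.pi) - np.pi
```

The area of a lobe is then simply the sum of the fingerprint samples across it, which is the total rotation described above.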

Figure 2. Image with contours superimposed.

V. FEATURES

In order to codify the information in all the contours of an image (there may be hundreds of them), it is convenient to define features of the fingerprint. We define three significant types of feature. The first is the lobe, defined between approaches to zero; the first spike in figure 3 is a lobe. The second type of feature is the line, defined as an extended portion of the fingerprint which does not depart significantly from zero. The third type of feature is the arc: a portion of the fingerprint which remains approximately equidistant from the X axis, representing a constant radius of curvature of the contour. A particular line feature is most conveniently defined with reference to the contour on the image. From this we can compute the angle of the line and the co-ordinates of some point through which it passes. Clearly these data are not invariant under rotation and translation, but this limitation is handled at a later stage. It is desirable to classify lobes in a manner which is conducive to later homeomorphic manipulation. Each feature can be placed at an (X, Y) position on the image by choosing the co-ordinates of the point on the contour which is, in some sense, midway along the feature. The angle of each feature can be recorded as the corresponding angle of the contour at these co-ordinates. Each lobe has an associated area, A. In addition we define skewness, S, and kurtosis, K. These are defined in the Appendix to have senses analogous to those used in statistics. Skewness measures whether the lobe is symmetrical about its centre and quantifies the extent to which it is skewed left or right of centre. Kurtosis measures the extent to which the lobe bulges in the centre rather than at the edges (like a saddle). A completely flat lobe (i.e. a circular arc) would have K = 1; a symmetrical lobe would have S = 0. Compound features can be derived from the relationships between these basic features. For instance, a contiguous sequence of features—line, π/2 lobe, line, π/2 lobe, line—clearly defines three sides of a rectangle. Analysing the image for a fixed set of these compound features, say 128 of them, provides a simple test for the presence or absence of a large number of characteristic features, with a concise representation: a 128-bit vector.
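A sketch of the lobe and line segmentation (arcs omitted for brevity); the tolerance and minimum line length here are illustrative assumptions, not the paper's tuned values:

```python
# Split a fingerprint into line features (runs near zero) and lobes
# (excursions between approaches to zero), recording each lobe's area A.
import numpy as np

def segment_features(fp, flat_tol=0.02, min_line_len=10):
    features, i = [], 0
    while i < len(fp):
        j = i
        if abs(fp[i]) < flat_tol:          # near zero: candidate line feature
            while j < len(fp) and abs(fp[j]) < flat_tol:
                j += 1
            if j - i >= min_line_len:
                features.append(("line", i, j, 0.0))
        else:                              # an excursion of one sign: a lobe
            sign = np.sign(fp[i])
            while j < len(fp) and abs(fp[j]) >= flat_tol and np.sign(fp[j]) == sign:
                j += 1
            features.append(("lobe", i, j, float(fp[i:j].sum())))  # area = rotation
        i = j
    return features
```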

VI. REMOVING REDUNDANCIES

For a particular image, there may be several hundred contours and of the order of thousands of features. This is the strength of the method: it captures all this information. It is convenient to retain only those features which are multiply redundant, i.e. which are captured in contours at several grey levels. It is our experimental observation that all important features are, in fact, captured redundantly but that, depending on lighting and shadowing, a feature may be defined over only a small, changing range of grey levels. Given that each lobe is associated with co-ordinates in the image and with an orientation in the image, it is a simple matter to cull from the data set those features which are not represented to some specified level of redundancy (say 5). It is also not very difficult to determine whether two lines are approximately collinear. Analysis of a general image typically produces a distillate of 50-100 features.

VII. FEATURE WEIGHT

Some features, defined by Area, Skewness and Kurtosis (ASK), are rarer than others, and therefore more significant. It is desirable to associate a weight with each feature so that we can assess the significance of its occurrence. Imagine that we accumulated the features of very many images and plotted the ASK co-ordinates of each feature in 3-space. We could then find the centre of gravity of all the points and determine the distance of each feature from this COG. We assume, and have in fact confirmed, that the spatial distribution is not anomalous. If we divided the greatest measured distance into 20 bins, we could place a population in each bin. We could then normalise these populations and determine the probability that any feature falls into a particular bin. Further, if we select, randomly, any N features from all our archival features, we would like the sum of their weights to be approximately constant, dependent only on the number N. That is, we would like the average of total weight divided by the number of features to be approximately constant when random features are selected. For a particular image, of course, the features present are not random and, if they are significant, will have a higher than normal average weight. If the probability for the ith bin is Pi, then giving this bin a weight W = 1/Pi will accomplish this condition. The consequence of this manipulation is that, when we find features in an image, we can assess their value as identifiers of an object without undue regard to how many features we have found.
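A sketch of this weighting scheme under the stated assumptions (20 bins over the greatest COG distance, W = 1/Pi):

```python
# Weight archival features by the rarity of their (A, S, K) values:
# bin by distance from the centre of gravity and set W = 1/P for each bin.
import numpy as np

def build_weights(ask, n_bins=20):
    """ask: (N, 3) array of Area, Skewness, Kurtosis over many images."""
    cog = ask.mean(axis=0)
    dist = np.linalg.norm(ask - cog, axis=1)
    edges = np.linspace(0.0, dist.max(), n_bins + 1)
    counts, _ = np.histogram(dist, bins=edges)
    prob = counts / counts.sum()                  # normalised bin populations
    weights = np.divide(1.0, prob, out=np.zeros_like(prob), where=prob > 0)
    return cog, edges, weights

def feature_weight(feature_ask, cog, edges, weights):
    b = np.searchsorted(edges, np.linalg.norm(feature_ask - cog), side="right") - 1
    return weights[np.clip(b, 0, len(weights) - 1)]
```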

VIII. OBJECT REPRESENTATION

The aim of this work is to recognise any known object, even partially occluded, at any angle and at any size, in a given image. Recognition means that the object is identified with an object whose descriptors are stored in the computer. Because views of an object from different angles may be entirely unrelated, it is desirable to store data for each object "in the round". Researchers often choose to study geometric shapes in order to ease this problem and to limit the information required to define the object. We consider general, probably irregular, objects. If we imagine an object placed at the centre of a regular icosahedron (20 faces), and imagine that the object's image is sequentially captured by a camera aligned with the normal through each face, then we will have 20 images, each rotated 36 degrees relative to each of its neighbours. Imagine that each such image is processed to produce a set of features as defined above – perhaps 100 features each – and that these 20 sets are stored as the complete description of the object. Since each feature can be specified completely in 11 bytes, an object can be stored in about 20 kB. Bryson [8] estimates that an educated person's vocabulary includes perhaps 50,000 words, of which about one third are nouns. To store 17,000 objects in the round would therefore take about a third of a gigabyte of memory. This remarkably small number assumes that we have handled the problem of universals and adjectives [see below] and that we can recognise short mugs and tall mugs as the same object. We will discuss these concepts in due course. A further assumption, experimentally verified, is that when an object is presented at 18 degrees to a known view, we can still recognise it. Thus, when it lies between three known views, we can interpolate to estimate its orientation.

IX. DATA MINING

Assume that our system has been taught several thousand objects, each specified by 20 views, each view comprising perhaps 100 features (100 features is a high estimate). These are what we term our gold, or universal, images. We are then presented with a new image, termed a brass image, and we distil from this image something between 30 and 300 features, termed brass features. Our first task, in line with our aim of following the example of the biosphere, is to form a guess as to which gold objects might be present in the brass image. Our second task is to establish, very quickly, whether a particular candidate is in fact present as we go down the list of possibilities. We can put all our objects into a database under the control of an engine such as MySQL. We can then interrogate the engine to determine a list of candidates with appropriate ASK values and/or the presence of appropriate derived features. It will be necessary, as a full-blown system is developed, to determine the best window of tolerance for A, S and K, as well as the most discriminating choice of derived features. Because the discriminatory process which follows this initial harvest is very good, the only reason to limit the size of the initial harvest is to limit processing time. This issue must be weighed when we determine how often the correct identification follows from the top contenders on the list. This first trawl through all the objects of the world is very quick, taking only a few milliseconds. This impressive performance follows from the advances in data-mining technology deployed in MySQL, coupled with the economy of our object specification and the high speed of modern PCs.
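The following sketch shows the flavour of the harvest query. The schema, table name and tolerance windows are illustrative assumptions, and SQLite stands in for MySQL so that the fragment is self-contained:

```python
# Harvest candidate gold views whose stored features share ASK values
# with the brass features, ranked by the number of matching features.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE gold_features (
    object_id INTEGER, view_id INTEGER,
    area REAL, skewness REAL, kurtosis REAL)""")

def harvest(conn, brass_features, tol_a=0.1, tol_s=0.05, tol_k=0.05):
    hits = Counter()
    for a, s, k in brass_features:        # one indexed range query per feature
        for row in conn.execute(
            """SELECT object_id, view_id FROM gold_features
               WHERE area     BETWEEN ? AND ?
                 AND skewness BETWEEN ? AND ?
                 AND kurtosis BETWEEN ? AND ?""",
            (a - tol_a, a + tol_a, s - tol_s, s + tol_s, k - tol_k, k + tol_k)):
            hits[row] += 1
    return hits.most_common()             # most promising (object, view) first
```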

Having specified a candidate object and a candidate view of that object, our next task is to say whether the features of the gold image are actually present among the features distilled from the brass image. Not only must some proportion of the features be present, but they must be in the correct relation to each other. This can be achieved by pairing up matching brass and gold features and then executing a series of geometry-based tests to eliminate mismatched pairs. If, at the end of the tests, sufficient feature pairs remain (or sufficient aggregate weight of features), we can consider the object to be found. The first requirement is that the paired features have appropriate angles relative to one another. The object as seen in the brass image will generally have some rotation, φ, relative to the gold representation, and a dilatation, µ. Two separate methods can be used to obtain these from the feature pairs. The first uses clustering of the difference, δ, between feature orientations in the pairs (that is, we find the mean δ over all the pairs and discard feature pairs with outlying δs), together with the ratio, µ, of gold-image distances between features to brass-image distances between features. The second method enumerates the combinations of possible gold/brass feature pairs to find the combination which returns the largest number of valid pairs. This system of checks can be applied quickly to those features which are present and, because the features are so very specific, the method is robust even when relatively few features are accessible. It is possible to construct a measure of the fit based on the sum of the weights of those features that have been positively identified. It is further possible, on the basis of very extensive experimental evidence, to construct a cumulative plot of the probability that the identification is correct versus this measure of the fit. Once this probability relation is known, the test can be applied at any level of the cascaded discrimination process and, if an acceptable level of probability is achieved, the process can be terminated early (either successfully or unsuccessfully).
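A sketch of the first verification method under the assumptions above (the angle tolerance is illustrative): cluster the orientation differences δ to estimate the rotation φ, then take µ from inter-feature distance ratios among the surviving pairs.

```python
# Estimate rotation phi and dilatation mu from matched gold/brass feature
# pairs, discarding pairs whose delta lies far from the mean (the cull
# described in the text).
import numpy as np

def verify_pairs(gold_xy, gold_ang, brass_xy, brass_ang, delta_tol=0.2):
    delta = (brass_ang - gold_ang + np.pi) % (2 * np.pi) - np.pi
    keep = np.abs(delta - delta.mean()) < delta_tol   # cull outlying deltas
    g, b = gold_xy[keep], brass_xy[keep]
    iu = np.triu_indices(len(g), k=1)                 # all pairs of survivors
    dg = np.linalg.norm(g[:, None] - g[None, :], axis=-1)[iu]
    db = np.linalg.norm(b[:, None] - b[None, :], axis=-1)[iu]
    mu = np.median(dg / db)        # gold-to-brass distance ratio, as in the text
    phi = delta[keep].mean()       # consensus rotation
    return keep, phi, mu
```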

X. EXPERIMENTAL

The scheme was deployed, coded in Microsoft C#, using a standard monochrome camera and frame grabber. One hundred images were captured under varying conditions of lighting, two of which were deemed to be the gold views of the two objects of interest (see figures 4 and 5). Most of the images did not contain the gold objects. Some contained the objects at varying distances from the camera, some presented them at graded angles from the orientation of the gold object, and in some the gold object was partially occluded.

Figure 4. Gold images - mug and tape.

Figure 5. Selected test images.

Two series of tests were carried out. In the first, the parameters defining a successful match were tuned to be conservative—no false positives were allowed. In the second, this restriction was eased to allow, at most, one false positive.

XI. RESULTS

A. Object Recognition

The results of the object recognition experiments are summarised in the two following tables.

TABLE I. OBJECT RECOGNITION - CONSERVATIVE.

Gold object        Mug   Mug (occluded)   Tape   Tape (occluded)
Images              11        23           30         11
Recognised          11        13           21          4
False positives      0         0            0          0
False negatives      0        10            9          7

TABLE II. OBJECT RECOGNITION - AGGRESSIVE.

Gold object        Mug   Mug (occluded)   Tape   Tape (occluded)
Images              11        23           30         11
Recognised          11        15           24          5
False positives      1         1            -          -
False negatives      0         8            6          6

XII. DISCUSSION

Results for the conservative recognition tests show that all of the unobstructed mug instances were identified, along with more than half of the occluded instances. Notably, one of the successful matches (shown bottom left in figure 5) was of the mug split in two by a pen. The tape was harder to identify, with only two thirds of the unobstructed instances found. The tape had fewer features at the end of the processing phase, which made the subsequent recognition phase less robust. Results for the aggressive recognition tests show a significant increase in recognition, particularly for the tape, but with the consequent appearance of false positives.

These are significant results for the first deployment of the technique. In hindsight, it is clear that, without compound features, the technique was much weakened. For instance, if we had used pairs of lines at a specified angle, then the tape would clearly have jumped out. Further, consider the profile of a cat's head, with the dome of the cranium and the two distinctive ears. This is instantly recognisable to the eye. If it were defined by a compound feature which captured this information, that feature would be blind to almost anything other than a cat's head. As we develop the technique, we will explore compound features. We must also consider the acquisition of the data of the 20 icosahedral views from very much less visual data; the eye accomplishes this by using perceived symmetries. A second large area of interest follows from a necessary consideration of the question of universals – we must recognise that a tall mug and a short mug are the same object, but with an adjective distinguishing them. We believe that a family of universals is a set of objects which can be transformed back to the exemplar by means of a simple geometrical transformation.

XIII. CONCLUSIONS

A process was described which uses the second derivative of image contours, or 'fingerprints', as the starting point for generalised image recognition. The process was tested on 100 images using two different objects, a mug and a measuring tape. Unobstructed instances were recognised, with no false positives, in 100% of cases for the mug and 70% for the tape, even at differing angles of presentation. A significant proportion of partially occluded instances was also successfully recognised.

APPENDIX

Recalling that the X axis in figure 3 has units of pixels, let L be the length of the lobe in pixels. Let M be the mean of the greatest and least pixel values of the lobe and let C be the centre of gravity of the lobe (pixels). For a symmetric lobe, C = M. Then we can define skewness,

S = (C - M)/L                                  (1)

This may be positive or negative and has no dimensions. For the computation of kurtosis, K, let the lobe have values Z(i) at each pixel value i, and set Zmax equal to the largest Z(i) in the lobe. The lobe starts at istart and ends at iend. Define z(i) = Z(i)/Zmax in order to normalise z. Compute a distance W such that W is the smaller of C - istart and iend - C. Essentially, this defines the maximum interval of i, symmetrical about C, that is contained within the pixel values of the contour. Set n = 2W + 1. Then compute a dimensionless second moment of area about the centroid, summed over all i from C - W to C + W:

M1 = Σ z(i) ((C - i)/n)²

Let zave be the average value of z(i) over the interval and compute

M2 = Σ zave ((C - i)/n)²

Then K = M1/M2. The significance of this ratio, aside from non-dimensionalising the kurtosis, is that, if z(i) were constant, K would have a value of unity. If the lobe were highest in the centre, then K would be less than one.
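A sketch of these definitions, assuming the lobe is supplied as its fingerprint samples Z(i) (a positive lobe; negate Z first for a negative one) together with the pixel index of its first sample:

```python
# Skewness S and kurtosis K of a lobe, following the Appendix definitions.
import numpy as np

def lobe_S_K(Z, i_start):
    i = i_start + np.arange(len(Z))          # pixel index of each sample
    i_end = int(i[-1])
    L = len(Z)                               # lobe length in pixels
    M = (i_start + i_end) / 2.0              # mean of first and last pixel values
    C = float(np.sum(i * Z) / np.sum(Z))     # centre of gravity of the lobe
    S = (C - M) / L                          # skewness: signed, dimensionless

    z = Z / Z.max()                          # normalise: z(i) = Z(i)/Zmax
    W = int(min(C - i_start, i_end - C))     # symmetric half-interval about C
    n = 2 * W + 1
    idx = np.arange(round(C) - W, round(C) + W + 1)
    zi = np.interp(idx, i, z)                # z(i) on the symmetric interval
    M1 = np.sum(zi * ((C - idx) / n) ** 2)   # second moment of the lobe about C
    M2 = np.sum(zi.mean() * ((C - idx) / n) ** 2)  # same moment for a flat lobe
    return S, M1 / M2                        # K = 1 for a completely flat lobe
```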

REFERENCES

[1] H.H.C. Bakker and R.C. Flemmer, "Data Mining for Generalised Object Recognition", submitted to ICARA 2009.
[2] A. Rosenfeld, Readings in Computer Vision, Morgan Kaufmann, Los Altos (1983).
[3] R.L. Gregory, Eye and Brain: The Psychology of Seeing, Weidenfeld and Nicolson, London (1990).
[4] T. Sanocki, K.W. Bowyer, M.D. Heath and S. Sarkar, "Are edges sufficient for object recognition?", Journal of Experimental Psychology: Human Perception and Performance 24, pp 340-349 (1998).
[5] J.H. Elder, "Are edges incomplete?", International Journal of Computer Vision 34, pp 97-122 (1999).
[6] R.C. Flemmer and H.H.C. Bakker, "Sensing Objects for Artificial Intelligence", ICARA 2005, pp 687-690 (November 2005).
[7] S.L. Polyak, The Retina, C.U.P., Chicago (1941).
[8] B. Bryson, The Mother Tongue: English and How It Got That Way, William Morrow and Co., New York (1990).