Multi-Class SVM Learning using Adaptive Code

Multi-Class SVM Learning using Adaptive Code FELIX MANDOUX Master’s Degree Project Stockholm, Sweden 2004 TRITA-NA-E04087 Numerisk analys och dat...
3 downloads 0 Views 626KB Size
Multi-Class SVM Learning using Adaptive Code

FELIX MANDOUX

Master’s Degree Project Stockholm, Sweden 2004

TRITA-NA-E04087

Numerisk analys och datalogi KTH 100 44 Stockholm

Department of Numerical Analysis and Computer Science Royal Institute of Technology SE-100 44 Stockholm, Sweden

Multi-Class SVM Learning using Adaptive Code

FELIX MANDOUX

TRITA-NA-E04087

Master’s Thesis in Computer Science (20 credits) at the School of Computer Science and Engineering, Royal Institute of Technology year 2004 Supervisor at Nada was Barbara Caputo Examiner was Jan-Olof Eklundh

Abstract Classification of objects in computer vision is done mostly without any knowledge about multiple class memberships. In this master thesis several learning algorithms based on Support Vector Machines and similar approaches regarding multiple class memberships are explored. Recognition performance and robustness of the algorithms are tested with small quantities of training objects, making all learning difficult for standard and new learning algorithms. New algorithms are real multiple class membership algorithms based on a innovative class separation, but also some one-vs-rest SVM based algorithms. Most approaches improve slightly the performance compared to classical SVM, but no important changes can be found.

SVM-inlärning av multipla klasser med adaptiv kodning Examensarbete

Contents 1 Introduction 1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1 1 2 3

2 Classification using Support Vector Machines 2.1 The Classification Problem . . . . . . . . . . . 2.2 Mapping Functions and Kernel Machines . . . . 2.3 Hyperplane Classifiers . . . . . . . . . . . . . . 2.4 Two-class Support Vector Machines . . . . . . . 2.4.1 Maximal Margin Classifier . . . . . . . . 2.4.2 Soft Margin Classifier . . . . . . . . . . 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

4 4 5 5 6 7 8 8

3 Extensions of SVM to Multi-class Problems 3.1 Extension of the Learning Process . . . . . . 3.1.1 One-vs-rest SVM . . . . . . . . . . . . 3.1.2 Adaptive Code Algorithm . . . . . . . 3.2 Extension of the Decision Process . . . . . . . 3.2.1 Multi-cue over two Levels . . . . . . . 3.2.2 Hierarchical SVM . . . . . . . . . . . . 3.2.3 DAS-Decision Tree or Splitting Classes 3.3 Conclusion . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

10 10 10 11 14 15 16 17 17

4 Experiment Setup 4.1 The Image Database . . . . . . . . . . . . . . . . . . 4.2 Database Splitting . . . . . . . . . . . . . . . . . . . 4.3 Features and Feature Selection . . . . . . . . . . . . 4.3.1 Color Histograms . . . . . . . . . . . . . . . . 4.3.2 Multidimensional Receptive Field Histograms 4.4 Computation and Kernel Selection . . . . . . . . . . 4.5 Algorithm related Parameters . . . . . . . . . . . . . 4.5.1 One-vs-rest SVM and Hierachical SVM . . . 4.5.2 Adaptive Code Algorithm . . . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

19 19 19 20 21 21 22 23 23 23

. . . . . . . .

. . . . . . . .

. . . . . . . .

4.6

4.5.3 Multi Cue and DAS-DT . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5 Results and Discussion 5.1 One-vs-rest SVM . . 5.2 Adaptive Code . . . 5.3 Hierachical SVM . . 5.4 Multi Cue SVMs . . 5.5 DAS-DT . . . . . . . 5.6 Conclusion . . . . . .

25 25

. . . . . .

26 26 27 27 28 29 30

6 Conclusion and Future Work 6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

32 33

References

34

A All A.1 A.2 A.3 A.4 A.5

37 37 38 39 39 41

. . . . . .

experimental Results One-vs-rest SVM . . . Adaptive Code . . . . Multi-cue SVM . . . . Hierachical SVM . . . DAS-DT . . . . . . . .

B Hardware and Software

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . .

47

Acknowledgements I am grateful to many people for help, active and passive support during the writing of my master thesis. Beginning with those who directly contributed to this work, I would like to thank Barbara Caputo, Michal Perdoch, Gyorgy Dorko, Mario Fritz and Marie-Elena Nilsback, the computer vision group of NADA at KTH and the members of CogVis for their help during the experiments and writing. Without the good introduction in computer vision I got in my internship at Trixell in Moirans (France) and at INSA in Lyon (France) I may perhaps never got interested in this field, so I would like to thank all professors, friends and colleagues for their support. Finally I want to thank especially my parents Sigrid and Gérard Mandoux and my grand parents for their financial and personal support during all my studies and especially during the academic exchange.

Chapter 1

Introduction Computer aided decision processes have been gaining importance in recent years. Today industrial processes are computer guided, identification processes in airports are computer driven and vacuum cleaners can find their own way in apartments. These are only few technological examples where computers have to make decisions about data collected in their environment. Humans performance in visual perception shows how much information visual data provides. It is therefore seducing to use visual cues for machine conducted recognition processes. The richness of visual data often provides enough information for humans to recognize objects, but it makes it also difficult to find out the essential information necessary for recognition. One challenge in computer vision is to extract the necessary information out of the visual data and make decisions on its basis. Recognition is one of the most wanted tasks of today’s computer vision and robotics research. The human brain uses all kind of knowledge to recognize objects: knowledge about how likely they can be found in a certain environment, knowledge about different view angles, occlusion and other large variations [Ullman, 1996], knowledge about class memberships and knowledge about non classical usage of objects are only some examples of influences. Classical machine learning algorithms like Support Vector Machines [Cristianini, 2000, Vapnik, 1998] do not include knowledge about connected object classes. Multi-class problems can be learned by kernel machines [Allwein et al., 2000], only visual data is used to solve this task. The aim of this master thesis is to find out if and how knowledge about multiple class memberships of objects can help to recognize them. The term multiple class memberships designes the membership of an object to more than one object classes.

1.1

Related Work

From the very beginning in research about artificial intelligence, learning was considered an essential component of computer vision systems. The founder of computer science, Alan Turing, proposed the idea of learning machines in 1950 in his book [Turing, 1950]. Early probabilistic learning models were introduced with 1

Rosenblatt’s perceptron [Rosenblatt, 1959] in 1959, a linear learning machine and Support Vector Machines by Vapnik in [Vapnik et al., 1992] and more detailed in [Vapnik, 1998]. In the last years researchers have also developed new theories about multiple visual cues to improve the robustness of SVM recognition systems [Caputo et al., 2002, Caputo et al., 2003, Brautigam et al., 1998]. Multiple class problems were discussed recently by Schölkopf in [Schölkopf, 2000]. Rätsch proposed a new recursive, self optimizing algorithm in [Rätsch et al., 2002]. Today’s panorama of Support Vector Machines is well summarized by N. Cristianini [Cristianini, 2000] and more completely by A. Schölkopf, J. Smola in [Schölkopf and Smola, 2002].

1.2

Contribution of this Thesis

Objects can be members of many classes simultaneously: an apple is member of the classes “apples”, “fruits” and many others. This concept, using multiple class memberships 1 , has never been investigated in object categorization so far. This thesis evaluates some techniques, extending kernel machine based learning algorithms, for multiple class membership usage. The contributions of this thesis are: - Empirical evaluation of adaptive code for object categorization. The Adaptive Code algorithm was presented by Rätsch in [Rätsch et al., 2002] combined with an Embedding Optimization step. This recursive algorithm seems to be powerful but slow. Here only the (faster) Adaptive Code step is empirically evaluated by introducing different types of prior knowledge in the algorithm. This prior knowledge can also be based on multiple class memberships; their performance is evaluated here. - Empirical evaluation of alternatives for multiple class membership categorization. Several alterative techniques issued from other learning problems are adapted and tested in this new context. The techniques are described by Schölkopf in [Schölkopf, 2000]. A simple multiple cue algorithm and the more complex DAS-DT algorithm are adapted to the new context and a hierachical Support Vector Machine algorithm developed and evaluated. Evaluation of the performance of multiple class membership based algorithms is important because this domain of learning is completely unexplored. This thesis tries to understand multiple class memberships and to use them to improve recognition. Finally it tries to answer the question if the usage of multiple class memberships in this way is really useful. 1

Multiple class membership and multi-class recognition sound similarly but are two completely different terms. The first one describes the usage of multiple class memberships of one object and the second one the ability of the algorithm to recognize more than one class.

2

1.3

Outline

This report is composed of six chapters. After this introductory chapter the reader will learn basics about machine learning techniques and Support Vector Machines (Chapter 2). More specific details about the extension of single class problems to multi class problems and the Adapting Code is written in Chapter 3. Then the experiments are explained in detail in Chapter 4, their results presented and explained in Chapter 5. Interpretation and ideas for future work can be found in Chapter 6. Each chapter begins with a short introduction and an outline helping the interested reader to find information fast and without having to read everything in detail. A summary at the end of each chapter underlines the important points of the previous chapter and connects them to the next one. Many Figures and Tables illustrate complicated processes and ideas. Normally the legend provides enough information for the competent reader to understand them.

3

Chapter 2

Classification using Support Vector Machines This chapter gives basic knowledge necessary to understand the main ideas of this thesis. Section 2.1 presents the ideas of classification, computer-based decisions and machine-learning. Then a short introduction to kernel machines, providing also some mathematical basics, is given in Section 2.2. Together with the hyperplane classifiers presented in Section 2.3 the extension to support-vector machines is done in Section 2.4. Readers with background in machine-learning and SVM may skip this chapter. 1 In this project SVM classification and recognition algorithms are used because of their good generalization performance (see Section 2.2). Compared to other approaches SVM are quite young but very well known and well documented.

2.1

The Classification Problem

The usage of computers has developed much in last years. Today not only simple execution tasks are requested but also decisions on incoming data. These tasks can be very complex, like biometric identification tasks via humans’ iris or fingerprint. A human identification problem cannot be solved by classical hard-structured computer programs, because the developer cannot precisely specify the way how to get a correct identification from a given input. A powerful approach to this kind of problems is through learning (i.e. [Turing, 1950]): a learning-algorithm constructs a decision function based on a part of the training data and some hypothesis. By applying the decision function on some new data, the computer can classify it as known or unknown. One task in computer vision is to identify objects present as image-data. A learning algorithm has to respond on the question “What is this?”. The user expects 1

This chapter is a very general introduction to classification problems and support vector machines. Many aspects of support vector machines like the dual representation of problems are not mentioned here but fundamental to optimize support vector machines. Very good articles [Schölkopf, 2000] and books [Cristianini, 2000, Schölkopf and Smola, 2002] provide more detailed and complete knowledge about kernel machines.

4

an answer like “a cup, a car, a horse”. All of these possible answers refer to a group of objects here called classes. The decision process of putting an object on an input-image into a class is called classification. The difference between identification and classification can be explained easily with a little example: When the user is looking for “his apple” he needs identification, otherwise when he searches “an apple” he needs classification.

2.2

Mapping Functions and Kernel Machines

Kernel machines can be used for classification of every kind of data. On some given empirical data (x1 , y1 ), · · · , (xm , ym ) ∈ χ × {±1} a generalization to unseen data has to be done. The patterns xi are taken from the domain χ, a non-empty set of data. The targets or labels yi = {±1} define the membership of the pattern. When x ∈ χ represents a test image containing an apple, then the goal is to predict the corresponding label y ∈ {±1} such that (x, y) is in some sense similar to the training samples. Making decisions in the original data space is very difficult; in most cases no trivial separation between data of two different targets can be found. To solve this problem the data is mapped with a non-linear mapping function Φ to another high dimensional feature space F. Φ : χ → F, x 7→ Φ(x). In a good feature space a separation function should be linear and general enough to provide acceptable results also for unseen data. To guarantee a good generalization, some feature selection has to be done (i.e. [Cristianini, 2000] Ch.4). Non relevant features with low variance can be excluded from the feature space, other weighted to gain importance. Using too many features can cause an overfitting decision function, focused too much on the training data and unable to generalize to new data. The inverse problem – underfitting – occurs when too less features are selected (see for the moment [Bartlett et al., 1999]). Data mapped in the feature space can now be compared by using a similarity measure. In the simplest case, similarity can be computed by a canonical dotproduct (Φ(x) · Φ(x0 )). An interesting point regarding kernel machines is that the dot product in the feature space can be selected for every problem. The function defining the dot-product or similarity measure is called kernel K(~x, x~0 ): K(~x, x~0 ) = hΦ(~x), Φ(x~0 )i

2.3

Hyperplane Classifiers

Hyperplane classifiers separate sets of data with different labels in a dot-product space like the feature space F introduced in the last Section. The primary idea 5

of hyperplane classifier algorithms is to find the optimal hyperplane in order to separate two differently labeled sets. Considering ~x ∈ RN a point in a hyperspace of dimension N so a class of hyperplanes can be defined as in [Vapnik et al., 1963, Vapnik and Chervonenkis, 1964]: (w ~ · ~x) + b = 0, where w ~ ∈ RN represents the weight vector, normal to the hyperplane and b ∈ R the normal (i. e. shortest) distance of the hyperplane from the origin. A hyperplane of this set is called optimal when it provides the maximum margin of separation between two classes, there is one unique solution for a given problem. The maximization of the margin can be written in the following way: ½ ¾ N max min{k~x − ~xi k : ~x ∈ R , (w ~ · ~x) + b = 0, i = 1, · · · , m} . | {z } w,b ~ margin

The optimal hyperplane can be computed by solving the following constrained optimization problem: ~ 2 minimize τ (w) ~ = 21 kwk subject to yi · ((w ~ · ~xi ) + b) ≥ 1,

i = 1, · · · , m.

To estimate the origin class of an object, the decision function ³ ´ f (~x) = sgn (w ~ · ~x) + b can be used. The result is a label {±1} giving the classification target of the input data. Theoretical arguments support the good generalization performance of the optimal hyperplane (i.e. [Bartlett et al., 1999]). In addition it is attractive from a computational point of view because standard quadratic program solvers can be used.

2.4

Two-class Support Vector Machines

The combination of kernel-based feature space and hyperplane-classifier builds the so-called Support Vector Machines [Vapnik et al., 1992]. Basically they can be used to solve two-class learning problems. Some improvements compared to linear classifiers2 are robustness in cause of generalization and a relatively low computational cost even for large learning problems [Bartlett et al., 1999]. All SVM algorithms present some common ideas and structures. The principle tasks are the selection of a suitable kernel to represent the data in the feature space and the computation of separating hyperplanes in order to minimize the empirical risk of miss-classification. 2

The mapping function of linear classifiers has to be a linear transformation. In order to separe object classes the data has to be linearly separable in image space. This is not the case in most real world applications.

6

w ~ · ~x + b = −1 w ~ · ~x + b = 0 w ~ · ~x + b = +1

y2 = +1 y1 = −1 w ~ b

Figure 2.1. Illustration of the simple classification problem circles vs squares. The

circles labelled as y2 = +1 and the squares as y1 = −1 and separated by the hyperplane given by (w ~ · ~x) + b = 0. The dashed lines show the margin.

2.4.1

Maximal Margin Classifier

The maximal margin classifier was the first SVM-classifier introduced. It is not usable for most real-world learning applications, because data can be processed only in linearly separable feature space. The ideas of the classifier are simple and form the basis for powerful extensions like soft margin classifiers developed in the next section. The idea of the maximal margin classifier is to separate two classes in feature space by computing the maximal margin hyperplane. This is done by minimizing a quadratic function under linear inequality constraints. When the functional margin is set to 1, the canonical hyperplanes can be expressed in this way: hw ~ · ~x+ i + b = +1, hw ~ · ~x− i + b = −1. Given a linearly separable training sample S = ((~x1 , y1 ), . . . , (~xi , yi )), the hyperplane (w, ~ b) that solves the optimization problem minimize

hw ~ · wi ~ ³ ´ subject to yi hw ~ · ~xi i + b > 1, w,b ~

i = 1, . . . , l,

7

represents the maximal margin hyperplane with the functional margin 1 and the geometrical margin γ = 1/kwk. ~ Data samples ~xi lying on the margin are called support vectors. The target function can be entirely described by them. A geometrical example of a maximal margin classifier is given in Figure 2.2(a).

2.4.2

Soft Margin Classifier

The assumptions made about the training set in the last section do not allow to use maximal margin classifiers for real-world applications. A linearly separable training set has to be noiseless. In other words, no training error has to be done when maximal margin classifiers are used. Soft margin classifiers are based on the same idea as maximal margin classifiers but their robustness is much higher. Errors in the training are allowed and compensated, overfitting partly avoided. The margin can be violated for individual data-samples, so the constraints of the optimization problem introduced in the last section are extended with the slack variables ξ [Cortes et al., 1995, Vapnik, 1995, Schölkopf and Smola, 2000]: P minimise hw ~ · wi ~ + C li=1 ξi2 , ξ,w,b ~ ³ ´ (2.1) subject to yi hw ~ · ~xi i + b > 1 − ξi , i = 1, . . . , l,

The parameter C has to be defined for each problem and is generally varied through a wide range of values. For a better performance, C is often chosen on a tiny validation set of data and then used on the test set. The slack variables ξ i are individual for each sample xi , so individual errors on the training due to noise are also formulated by this model. This so-called C-SVM algorithm is used in many real-world applications. Extensions and details about the weighting parameter C are given in [Cristianini, 2000]. For a better understanding the Figure 2.2(b) shows an example of soft margin classification.

2.5

Conclusion

This short introduction on support vector machines gives the necessary knowledge to understand the chapter about extension to multi-class problems. Many terms related to SVM and computer learning such as feature selection, kernel, overfitting or training set defined in this chapter will be used without explanations in the rest of this thesis. The main ideas to remember of this chapter are that kernel machines are a powerful utility to solve computer learning problems independent of the complexity of the data; and kernel machines in this configuration are only able to solve binary (two-class) problems.

8

xi

ξi ξk

xj

ξj xk

(a) Maximal margin hyperplane classifier.

(b) Soft margin hyperplane.

Figure 2.2. Support vectors are filled in grey. ξi,j,k denote the slack-variables of three classification vectors xi,j,k

9

Chapter 3

Extensions of SVM to Multi-class Problems The first chapter introduced kernel machines and binary SVM classifiers. In order to use SVMs for real-world classification tasks they have to be extended to multipleclass problems. This extension can be done at different levels: - during the learning process [Rätsch et al., 2002] (feature space and hyperplane computation). - during the decision process [Allwein et al., 2000] (based on classical results). Both ideas are described in this chapter and tested/evaluated in the Chapters 4 and 5. This chapter provides all necessary information to understand the experimental results. In Section 3.1 the learning process is modified for multi-class usage and in Section 3.1.2 the decision process. To reproduce the experiments, the Chapter 4 – “Experiment setup” – provides some complementary information about the used implementations and data.

3.1

Extension of the Learning Process

The learning process of SVM consists mainly of finding a separating hyperplane in the feature space. Two possible extensions to multi-class problems are given in the following subsections.

3.1.1

One-vs-rest SVM

A natural way to solve a N -class learning problem is to split it into N binary learning problems. Binary problems can be solved with classical SVM [Allwein et al., 2000]. The basic idea is to formulate the problem differently: instead of learning “class 1 against class 2 against class 3 ...”, the problem can be written “class 1 against the rest, class 2 against the rest, ...”. Finally N binary learning problems “class n against the rest” are equivalent. The reduction to binary problems can be interpreted 10

geometrically as searching N separating hyperplanes. Figure 3.1 shows a simple representation of this idea. Other reductions to binary problems are possible, classes

¤ ¤ ¤ ¤ ¤¤ ¤ ¤ ¤

4 vs. the rest ¤ vs. the rest 4 44 4 4 4 44 4 4 4

♦ ♦ ♦ ♦ ♦ ♦ ♦♦ ♦

♦ vs. the rest Figure 3.1. A three-class problem reduced to three binary learning problems. The three separating hyperplanes are computed separately. It is not really clear how to classify an object which is placed in the middle of the central triangle in the featurespace.

can for example be learned by pairs. Allwein et al. [Allwein et al., 2000] gives a nice overview about ideas of multi-class reduction to binary problems. Classifying an object in a multi-class problem is not as easy as in a true binary problem. The decision function has to be much more complex, a simple distance measure is not possible. To solve this problem the Hamming-distance or a loss-based distance, taking in consideration all hyperplanes, can be used as decision function (cf. [Allwein et al., 2000]). The “one-vs-rest” multi-class SVM has been used with success in [Caputo et al., 2003] and many other applications. It has been the reference algorithm to evaluate new approaches by experiments for this project.

3.1.2

Adaptive Code Algorithm

The previous hyperplane separation conducts to some ambiguous cases when an object mapped in feature space is placed between the hyperplanes. A more natural way than in Figure 3.1, to separate three classes in feature space, is given in Figure 3.2. To realise this new separation, the hyperplane first has to take the multiple-class problem into consideration [Rätsch et al., 2002]. Coding the multi-class to a multi-binary Problem Multi-class problems can be solved by decomposing a polychotomy into N dichotomies and then be solved with a two-class classifier. More details can be found in 11

¤ ¤ ¤ ¤ ¤¤ ¤ ¤ ¤

4 44 4 4 4 44 4 4 4

♦ ♦ ♦ ♦ ♦ ♦ ♦♦ ♦

Figure 3.2. Cross-hyperplane separation of three classes in feature space.

[Allwein et al., 2000]. The principle idea is to assign a binary code word of length N , denoted here t(c) ∈ {0, 1}N , t(c) ∈ {−1, 1}N or even t(c) ∈ {−1, 0, 1}N for each class c. The result is the code matrix T :     t(1) t11 t12 . . . t1N  t(2)   t21 t22 . . . t2N      T = . = . .. ..   ..   .. . .  t(N )

|

tN 1 tN 2 . . . tN N {z N classes

}

Now each column defines a separation of the classes in two subsets {−1, 1}, 0 valued elements are simply ignored. Each column is fed into a separate classifier for learning and recognition. The result is another codeword tL which can be compared with the existing N code words by using Hamming or other distance measures. Large Margin Classifier For dichotomies, a soft margin classifier similar to the one described in Section 2.4.2 can be defined. It can be understood as mapping f : χ → R, with the property yn f (~xn ) ≥ ρ with (~xn , yn ) ∈ {training set} and ρ some positive constant [Vapnik and Chervonenkis, 1964] giving the margin. The function f is also called the embedding. To avoid overfitting some slack variables ξi are also introduced [Cortes et al., 1995, Vapnik, 1995, Schölkopf and Smola, 2000]. maximize

M λ X ξm + Ω{f } (the margin) M m=1

subject to

ym f (~xm ) ≥ 1 − ξm 12

ξm ≥ 0, n = 1, . . . , M. The term Ω{f } is a regularization term and λ a regularization constant. M gives the number of samples in the training set. For ym = 1 and ym = −1 the condition ym f (~xm ) ≥ 1 − ξm can be rewritten as a difference between the distance of f (~x) to the targets 1 and −1: 1 1 (f (~xm ) − (−1))2 − (f (~xm ) − 1)2 ≥ 1 − ξm . 4 4 This margin maximization alone is nothing new, but the notation can be extended to polychotomial problems. Integrating the code word t(c) ∈ RN for the classes c = 1, . . . , C as the target vector and introducing a distance measure d, the margin ρ(f, ~x, y) can be defined as ρ(f, ~x, y) := min d(t(c), f (~x)) − d(t(y), f (~x)). c6=y

This is in fact the minimal relative difference in distance between f , the correct target t(y) and any other target t(c) (i.e. [Crammer et al., 1963]). The new optimization problem can now be written in the following way: λ PM minimize M m=1 ξm + Ω{f } subject to d(t(c), f (~xm )) − d(t(ym ), f (~xm )) ≥ 1 − ξm c 6= ym , ξm ≥ 0.

(3.1)

This optimization problem is a multi-class classifier using the distance measure function d and large soft margins. In [Rätsch et al., 2002] it is shown that only d(t, f ) = kt − f k22 and related function as distance measure will lead to a convex constraint on f . The constrains of the optimization problem (3.1) can now be expressed in the following way: d(t(c), f (~x)) − d(t(y), f (~x)) = kt(c)k2 − kt(y)k2 + 2t(c)T f (~x) − 2t(y)T f (~x). By fixing the code t or fixing the embedding f these constraints get linear. In this special case standard optimization packages can be used to compute the minimization problem. Learning Code and Embedding The aim of this algorithm is to find a good code and a suitable embedding function. Fixing the code or the embedding results in solving a linear constrained problem, but varying t and f makes it computationally impossible to find the global minimum. The idea in [Rätsch et al., 2002] is to optimize the embedding f for fixed code t, then optimize the code t for fixed embedding f and doing this recursively until 13

P 2 converging to a local minimum. Using C c=1 kt(c)k2 as regularization term, Equation 3.1 becomes PC λ PM 2 minimize M m=1 ξm + c=1 kt(c)k2 t(c),ξm ≥0

subject to (t(ym ) − t(c))T f (~xm ) ≥ 1 − ξm ∀m = 1, . . . , M ; c = 1, . . . , C; and c 6= ym ,

(3.2)

a convex quadratic program. The choice of the initial code t is important. In this thesis many experiments have been done on the starting point. More details about the initialization is given in the last chapter. Fixing the code (i.e. [Bennett et al., 2000, Nash et al., 1996]), the optimization problem (3.1) becomes minimize

M α∈RJ + ,ξm ∈R

subject to

λ M

PM

m=1 ξm

+

PJ

j=1 αj

(t(c) − t(ym ))T fα (~xm ) ≥ 1 − ξm ∀m = 1, . . . , M ; c = 1, . . . , C; and c 6= ym ,

(3.3)

© ª P where fα = Jj=1 αj hj and H := hj : χ → RC |j = 1, . . . , J a class of basic functions. More details about the minimization problem (3.3) and its implementation can be found in [Rätsch et al., 2002]. Summary and Limitations The algorithm proposed by Rätsch, Smola and Mika in [Rätsch et al., 2002] given by the quadratic minimization problems (3.2) and (3.3), can be used with standard optimization software. It is unfortunately impossible to predict after how many iterations of optimizing code and embedding, t and f converge. Therefore this thesis proposes to limit the algorithm to a one-step code computation. Limiting to adaptive Code Fixing the embedding, only the minimization problem given by (3.2) has to be computed. The code will converge to a local minimum, so that the initialization of the code gets very important. Many experiments in this thesis have been done to see if and how the initial code influences the final code. In other words: how does the starting point influences to local minimum found?

3.2

Extension of the Decision Process

Combining several one-vs-rest SVM presented in Section 3.1.1, may help to solve the multi-class decision process. The new idea in this thesis is to combine the cues of different levels of class definitions. 14

3.2.1

Multi-cue over two Levels

The classical one-vs-rest SVM learning uses one specific feature of the object image. This multi-cue approach consists in learning for two different features, testing on both separately and combining the resulting distance measures. Figure 3.3 shows a block diagram of the algorithm: SVM learning

Features A

SVM Distance Measure Learned Data

Code A

Φ

Features A

Training Images

Test Images

Φ max Ψ Features B

Ψ Features B

Learned Data

Code B

Figure 3.3. Block diagram of classical multi-cue integration using SVM.

Dual Representation and the Discriminative Accumulation Scheme The minimization problem given in (2.1) can be also written in the so-called dual way (i.e. [Cristianini, 2000]). K(~x, ~z) is the kernel function, α ~ are the Lagrangian multipliers and δij is the Kronecker signal, 1 for i = j and 0 otherwise. P P maximise W (~ α) = li=1 αi − 21 li,j=1 yi yj αi αj (K(~xi , ~xj ) + C1 δij ), Pl (3.4) subject to i=1 yi αi = 0, αi ≥ 0, i = 1, . . . , l. In this representation W (αc ) is the margin of the sample from the cst -class-vs-rest hyperplane. For every sample of the test image set C margins can be computed and grouped in vector form as W1 = ( w11 w12 . . . w1C ). Stopping the algorithm now, the maximum value of these margins corresponds to the best class. Given the margins of a second cue W2 = ( w21 w22 . . . w2C ), the margins of J cues can be accumulated by simply summing them up: D=

J X

a j Wj ,

j=1

where aj is a weight factor which permits to give some cues more importance than others. The maximal accumulated margin gives therefore the new best class: arg max D(c) , c = 1, . . . , C. c

15

Selected Class

This method was used with success in [Caputo et al., 2003] for SVM recognition using multiple feature types as cues. Accumulation over two Levels Objects often do not take part of only one category. Super categories are often present: dogs and cats are animals, so the super category is animals. New in this thesis is to combine cues coming from two different categories. In this little example the cues from “dogs” are combined with the cues obtained from “animals” and so on (Figure 3.4). This technique tries to take in account knowledge about superior dogs

apples

cats

cars

bananas

···

animals dogs

cars cats ···

one-vs.-rest SVM

one-vs.-rest SVM

cue-accumulation maximal margin best class

Figure 3.4. Block diagram of multi-cue integration with multiple levels.

classes. The accumulation of two distance matrices of the structure showed in Figure 3.5 is quite simple: The distance of “apple 1” to the hyperplane “apples” is added to the distance of “apple 1” to the hyperplane “fruits” and so on. For every sample a vector containing the cumulative distances between the sample, the hyperplanes and the superclass hyperplanes is created. On this vector the maximum margin defines the winning class.

3.2.2

Hierarchical SVM

Hierarchical SVM uses the possibility to represent the class memberships of objects in a tree (Figure 3.6). For every node of the tree one SVM-one-vs-rest problem is computed. Depending on the detail the user wishes, the computation can be stopped at a high or a low level of decomposition. The problem of this kind of hierarchical decision is that a misclassification on low level decomposition will persist in the high level decomposition. The error rates are therefore cumulative. 16

classes

pears cars dogs cups apples tomatos cows horses

hyperplanes

fruits

cars

animals

cups

classes

apple h pear i

fruit k

tomato j car k

car j

cow l dog m

animal i

horse n cup o

cup l

Figure 3.5. Structure of two vote matrices from different levels.

world

animals

cows

dogs

cars

horses

fruits

apples

pears

cups

tomatos

Figure 3.6. Hierarchical tree of the class memberships.

3.2.3

DAS-Decision Tree or Splitting Classes

The discriminative accumulation scheme (DAS) can be used in an extended decision tree [Caputo et al., 2003]. For each splitting, the weight factor a i for two cues is optimized, the error rates for one-vs-rest with each class computed and the best class selected. In the next step this procedure is repeated with one class less, and so on. Each cue combination results in a different decision tree (see for example Figure 3.7). This method was explored in [Caputo et al., 2003] and gives much higher recognition rates than simple cue integration. Here two different learning problems corresponding each to another class splitting are used for the cue integration, similar to the DAS method described above.

3.3

Conclusion

In this chapter several extensions of SVM to multi class problems were discussed. Globally two different developments can be distinguished: multi class hyperplane 17

a1

apples

a2

cars

a3

horses

pears

...

Figure 3.7. DAS decision tree with 4 classes.

computation and cue integration on the learning results. Some approaches promise good recognition performance but the computational cost is simply too high. The more computationally light weight algorithms may leak in recognition performance and robustness. The next chapter tries to point out difficulties by experiments on an image database.

18

Chapter 4

Experiment Setup Evaluating and concluding about learning algorithms is a difficult task. The experiments have to be designed to get comparable results to previous ones and be general enough to repeat them with different input data. All decisions and tasks of the experiments are described in this chapter. In the Sections 4.1 and 4.2 the image database and its usage is presented. Section 4.3 describes which features are used and how they can be calculated. Finally the kernels and and some general computational aspects are discussed in Section 4.4. Explanations about some algorithm specific parameters can be found in Section 4.5.

4.1

The Image Database

Human perception is based on 3D-vision and understanding. In order to get a similarly performing learning algorithm, objects have to be recognized from different points of view. For these experiments non-occluded objects without a disturbing background have to be recognized. Further experiments with backgrounds (noise) can be done at a later stage. All experiments in this chapter are done on the CogVis image database, containing 80 objects of 8 classes. 41 views per object from different view angles are available. The CogVis image database is commonly used for testing of object recognition algorithms, because it provides many images and a quite complex set of objects. Object size and scale are not respected in this database: the object fills independant of its original size nearly the whole image. Example images of the CogVis image database are given in Figure 4.1. The 8 object classes are “apples”, “toy cars”, “toy cows”, “cups”, “toy dogs”, “toy horses”, “pears” and “tomatos”. The CogVis image database is available on the internet [CogVis, 2001].

4.2

Database Splitting

In order to make learning experiments, training and test data are necessary. Often some parameters of the kernel or some weighted mean functions have to be optimized. Optimization on the whole test set is too expensive from a computational point of 19

Figure 4.1. Some examples of the CogVis database.

view. To avoid these calculations a validation set was added to the training and test set. All three sets are randomly generated 5 times, and all experiments repeated with these 5 sets. The results reported here are the arithmetic mean of them. An image cannot be used twice, in a training, validation or test set. Table 4.1 gives the splitting rules used for the set-generation. Objects Views Images

Training set 5 4 20

Validation set 2 8 16

Test set 3 16 48

Table 4.1. Splitting rules for the database.

4.3

Features and Feature Selection

Learning directly on the image data does not make sense here. The image data has to be mapped in a feature space by a mapping function (cf. 2.2). This task is the so-called feature extraction. Two kinds of features can be extracted: Local Features Local features are properties of certain regions of interest in the image. These regions have first to be detected. For photographic real world images still no robust feature detector is available. One possibility to detect regions of interests for local features is computing the entropy of a scaling and sliding window at each point of the image. Discontinuities in the entropy function point to an interesting feature. Once a region 20

of interest is found, local properties like statistical moments or local histograms can be computed. For the experiments in this thesis no local features are used. Global Features Global features are properties concerning the whole image. Their advantage is that they can be computed with ease, because no previous selection has to be done. Problematic is that global features are often not precise enough to get information about the object in the image. For real world applications with complex objects, occlusion and backgrounds (noise), global features may be not appropriate. Common used features are based on statistical properties of the objects. One possible statistical representation of images is the histogram. Here two types of histograms are used:

4.3.1

Color Histograms

Color histograms contain of course the color information of the object. The CogVis image database pictures are taken on a blue background, so the blue channel is not suitable for object recognition purposes. The color histograms used here are 2D color histograms of the red against the green channel. Each experiment was made with 8 and 16-bin histograms: once the graylevel interval [0; 255] was split in 8 and once in 16 equal sub-intervals. Figure 4.2 gives an example of the color histogram computation.

4

x 10 2

10000

1 0 1

5000 0

2

3

4 5 green channel 6

7

8

2

6 4 red channel

8

5 green channel 10

15 10 15

5

red channel

Figure 4.2. Computation of a RG color histogram: original image (top left), red (top right), green (bottom left) and blue channel (bottom right), 8 bin histogram and 16 bin histogram.

4.3.2

Multidimensional Receptive Field Histograms

Graylevel histograms contain the texture information, but linear 1D histograms are very difficult to separate in feature space. B. Schiele and J. Crowley propose in [Schiele, 2000] another type of graylevel histograms, more suitable for recognition. 21

The so-called multidimensional receptive field histograms (MFH) are gray level histograms of the image convolved with Gaussian derivative functions using different standard deviations σ. For each filter operation the histogram gains one dimension. In Figure 4.3 some MFH histograms are computed for the previous sample image. For these experiments the images have been convolved with a Laplacian of Gaussi-

x 10

4

x 10

4

4

4 2

2 0

0

2 4 6 8

2

4

6

8

5

15 10

10 15

5

Figure 4.3. Computation of a MFH : original image (top left), image convolved with the second derivative of a gaussian with σ = 1 (top right) and σ = 2(bottom right), 8 bin histogram and 16 bin histogram.

ans for σ1 = 1 and σ2 = 2. Given Gσ (x, y) = e− as follows: ¶ µ 2 1 x σ − Gσ (x, y) Gxx (x, y) = σ4 σ2 ¶ µ 2 y 1 σ − Gyy (x, y) = Gσ (x, y) σ4 σ2 Lap(x, y) = Gσxx (x, y) + Gσyy (x, y)

4.4

x2 +y 2 2σ 2

, the Laplacian can be written

Computation and Kernel Selection

As stated in Section 2.2, the kernel is a distance measure in the feature space. For SVM purposes some common kernels are listed in Table 4.2. For all experiments in this thesis a χ2 -kernel introduced by Caputo and Dorko in [Caputo et al., 2002] was used. It is a well known kernel and it is not too expensive regarding computational cost, also because only the parameter γ has to be adjusted. This has to be done for each training set. This optimization is done by learning on the training set and predicting on the validation set, using the values γ = 0.01, 0.1, 1, 10, 100. The one providing the best recognition rate is selected and used for the prediction on the test set. A polynomial or Gaussian kernel might provide better results than χ 2 , but 3 parameters have to be adjusted. The computational cost is much higher and the aim of the thesis is not to evaluate kernel performances. The γ-selection could also 22

linear polynomial radial basis function sigmoid chi-square Gaussian

function K(x, y) x·y (γ · xy + b)n 2 e−γ·kx−yk tanh(γ · xy + b) 2 2 e−γχ χ2 = (x−y) x+y a a b e−γkx −y k

parameters none γ,b,n γ γ,b γ γ, a, b

Table 4.2. Common kernel functions for SVM.

be optimized by fitting a quadratic function on the recognition rates and calculating the γmax corresponding to the empirical maximum recognition rate, but for global evaluation of learning performances this is not necessary.

4.5

Algorithm related Parameters

4.5.1

One-vs-rest SVM and Hierachical SVM

The one-vs-rest SVM is done with the opensource library LIBSVM [Libsvm, 2001]. It is freely available on [Libsvm, 2001]. To use it with one-vs-rest SVM the source code had to be modified1 . This modified version will soon be available on-line. As hyperplane calculation method “C-SVM” can be used, its only parameter is C and fixed to C = 100 for all experiments.

4.5.2

Adaptive Code Algorithm

The AP algorithm is implemented in MATLAB and uses the commercial solver LOQO [Vanderbrei, 2000] to solve the minimization problems2 . Evaluating the influence of the initial code is one of the important tasks of this thesis. Several more or less trivial initial codes can be computed. Here the codes are given as 8×8-matrices. The class order is “apples, cars, cows, cups, dogs, horses, pears and tomatos”, the same for the matrix rows. In this configuration the code matrix can only be used to classify 8 images. For each supplementary image a row has to be added to the code matrix. Example: a classification task for 100 images and 8 categories needs a 100 row long and 8 column large code matrix. 1

At this moment the author would like to thank Mario Fritz and Marie-Elena Nilsback for the code modifications and the support. 2 The author want to thank Robert J. Vanderbrei from Princeton University for his generosity.

23

Identity Code The identity code is a trivial entry point. Every class is near to itself, no higher knowledge is introduced at this point:   1 0 0 0 0 0 0 0  0 1 0 0 0 0 0 0     0 0 1 0 0 0 0 0     0 0 0 1 0 0 0 0    TI =    0 0 0 0 1 0 0 0   0 0 0 0 0 1 0 0     0 0 0 0 0 0 1 0  0 0 0 0 0 0 0 1

Random Code A random matrix as code will show if it also leads to a suitable result if arbitrary knowledge is entered in the algorithm. TR = random(8, 8) N (0,1)

Prior Knowledge Here some knowledge about superclasses can be introduced. Prior knowledge means that information like “a dog is an animal” is given as starting point. In this example all dog, horse and cow images (rows in the code matrix) get marked “1” in the animals’ columns an “0” other where, because we do not know anything about other class memberships. Two different schemes can be imagined: an un-weighted and a weighted prior knowledge code:   1 0 0 0 0 0 1 1  0 1 0 0 0 0 0 0     0 0 1 0 1 1 0 0     0 0 0 1 0 0 0 0    TP rKn1 =   0 0 1 0 1 1 0 0    0 0 1 0 1 1 0 0     1 0 0 0 0 0 1 1  1 0 0 0 0 0 1 1   1 0 0 0 0 0 0.5 0.5  0 1 0 0 0 0 0 0      0 0 1 0 0.5 0.5 0 0    0 0 0 1 0 0 0 0    TP rKn2 =  0    0 0 0.5 0 1 0.5 0  0 0 0.5 0 0.5 1 0 0     0.5 0 0 0 0 0 1 0.5  0.5 0 0 0 0 0 0.5 1 24

Code obtained by other learning Algorithms Less precise but fast learning algorithms can also produce an initial code for the AC algorithm. To evaluate this entry point the “Spin Glass-Markov Random Fields” (SG-MRF) learning algorithm presented in [Caputo, 2003] can be used. The resulting code matrix TSG−M RF should provide precise knowledge about the visual relationship between the classes.

4.5.3

Multi Cue and DAS-DT

For multi cue result the results of the one-vs-rest experiments are combined. The distance matrices Di have to be multiplied with a weight factor a and summed: D = D1 + a · D 2 The weight factor a can be determined on the validation set: the value of a giving the highest recognition rate on the validation set is used for testing. Possible values for a are {0, 0.01, 0.1, 1, 10, 100, ∞}. Some optimisation could be done by using some polynomial regression to maximise the recognition rate, or by simply approaching the maximum value with varying intervals. For DAS-DT the same selection method for a is used.

4.6

Conclusion

In order to provide comparable results, algorithms and parameters were chosen carefully in this project. This chapter gave all information necessary to repeat the experiments of this thesis. In the next chapter the numerical results and some interpretations are presented.

25

Chapter 5

Results and Discussion This chapter finally summarizes the experimental results, discusses the algorithms’ performances and gives possible explanations for disappointing outcomes. First, Sections 5.1 – 5.5 describe individual results; then all results are compared in the conclusion of this chapter. Only some representative numerical ones are shown, while the whole batch of results is available in the Appendix. The experiments were done with histogram features of two resolutions: 8 bins and 16 bins. Here only the 16 bin histograms are shown. They provide a slighlty better result than the 8 bin ones.

5.1

One-vs-rest SVM

The training, validation and test sets were designed to be “difficult” to learn. In fact it is quite easy to get good learning results by using the whole image database, but an improvement on the recognition rate was wanted. For this reason, only few images and objects are used for learning. This explains also the poor recognition rate (Table 5.1) given by the one-vs-rest classifier. Observing the standard deviation, the Feature Type Color RG

Resolution 16 bins

Recognition rate in % Validation Test 65.31 75.94 ±3.77

±6.84

Gray 2Lap

16 bins

75.21

75.16

±5.85

±8.57

Table 5.1. Recognition rates of the one-vs-rest classifier. These are mean values of 5 independent experiments. (The standard deviations are given by the small numbers.)

difference between validation and test attracts attention. This phenomenon can be explained by the γ selection: it is selected on the validation set, so it can be assumed 26

to be optimal for the validation. The test set is independent from the validation set, so the selected value for γ may not be the best for testing.

5.2

Adaptive Code

The most interesting result of the AC-algorithm is the influence of the inital code matrix. Comparing the results row by row in Table 5.2, the enormous difference of the recognition rate attracts attention. The best result can be obtained with an identity matrix and with the resulting matrix of the SG-MRF algorithm. Interesting seems the robustness of the algorithm: the standard deviation of the results is lower than in the previous experiment. Respect to the good results using the identity and

Initial code Identity Random Un-weighted prior knowledge Weighted prior knowledge SG-MRF

Color RG 16 Recognition rate in % Validation Test 72.31 72.13

Gray 2Lap 16 Recognition rate in % Validation Test 80.98 78.68

±4.28

±6.32

±5.33

±2.35

23.96

19.42

23.87

23.08

±3.80

±2.92

±0.85

±2.19

31.01

28.21

29.57

28.86

±1.43

±3.28

±2.54

±3.00

67.38

62.76

70.49

68.72

±3.75

±2.08

±1.67

±2.87

71.98

66.20

79.85

74.49

±4.83

±6.72

±4.69

±4.21

Table 5.2. Recognition rates of the adaptive code algorithm using different initial code matrices (table rows).

the SG-MRF code matrices, the overall performance of the AP-algorithm is a little bit better than the one of the classical one-vs-all SVM algorithm. Including knowledge about high level class memberships using code matrices like the un-weighted and weighted prior knowledge provides unfortunately a worse recognition rate than one-vs-all SVM. With this algorithm it does not seem possible to increase recognition performance by using high level class memberships.

5.3

Hierachical SVM

Hierachical SVM experiments show how “difficult” the differenciation between several classes is. Making the difference between animals is less successful than between fruits or the main classes (1st level). Table 5.3 summarizes the results of the hierachical SVM trees given in the Appendix A.4. The overall performance is not worse 27

than classical one-vs-rest SVM. But no remarkable improvement by using knowledge about high level class memberships can be seen either. Not shown in the Recognition 1st level fruits animals total mean

rates in % 90.52 92.26 42.49 73.79

Recognition 1st level fruits animals total mean

(a) Color features with 16 bins.

rates in % 92.24 89.67 51.23 76.27

(b) Graylevel features with 16 bins.

Table 5.3. Recognition rates of the hierachical SVM algorithm. The last row shows the final recognition rates on

tables 5.3(a) and 5.3(b) are the standard deviations. There is no standarized way to calculate them from the different SVM trees. The high risk in this algorithm is that a missclassification occurs in the first SVM level. Error rates of the hierachical SVM are cumulative, so that every node in the tree has to be very good to get a suitable result. In real life experiments this can never be guaranteed, so summarizing about the performance of hierachical SVM it can be said that its robustness is simply too low to be used in real world applications.

5.4

Multi Cue SVMs

Multi cue SVM was used with success in combination with one-vs-rest SVM using simultaneously graylevel and color features [Caputo et al., 2003]. Here the cues come from different categorization levels. The results of these experiments summarized in Table 5.4 (the integral set of results is listed in Appendix A) are nearly identical to those of the one-vs-rest experiments showed in Section 5.1. Combining the cues was done by optimizing a weight factor a; these results show that neither the recognition rate or the standard deviation (so the robustness) is affected regarding the one-vsall SVM alone. Given the higher computational cost of multi-cue-over-level SVM Feature Type Color RG

Resolution 16 bins

Recognition rate in % Validation Test 65.31 75.21 ±3.77

±6.08

Gray 2Lap

16 bins

75.78

75.47

±5.52

±8.36

Table 5.4. Recognition rates of the multi-cue algorithm.

28

compared to classical one-vs-all SVM (≈ factor 2.5), the interest of this algorithm is low.

5.5

DAS-DT

The DAS-DT analysis is an extended version of the multi-cue algorithm. For these experiments1 a decision tree for each kernel (γ-value) on the validation set is computed. The kernel giving the best recognition rate is then used for the test with the test set. The DAS-DT algorithm decides about the splitting and choses the weight factor a = {0, 0.01, 0.1, 1, 10, 100} giving the highest recognition rate at each step. Here only the results of the 16 bin features are shown. The ones of the 8 bin features behave in a similar way, but are worse than these. Simply looking at the numerical results given in Table 5.5 only a very thin improvement about ≈ 2% can be recognized. The standard deviations are similar to those of the multi-cue algorithm. This can be explained by looking at the trees Feature Type Color RG Gray 2Lap

Resolution 16 bins 16 bins

Recognition rate in % Validation Test 67.81 77.86 ±3.24

±6.07

77.97

76.30

±6.71

±8.02

Table 5.5. Recognition rates of DAS-DT, some selected features.

in Figure 5.1 and in Appendix A.5. The algorithm selected mainly weight factors near a = 0. This means that it uses mainly the 8 class vote matrix. The fact that a = 0 for the decision between the animals and between the fruits is totally normal because no information about theese distances are given in the 4 class vote matrix. But unfortunatly no consistency can be seen about the usage of the 4 class vote matrix in high level decisions like cups vs. apples and tomatos. The same remark can be done for the tree form: it changes from test set to test set, no simple rule can be seen (exept the fact that animals are often split at the end due to of their similarity).

1 The author would like to thank once more Maria-Elena Nilsback for her help by modifying the source code and running these experiments.

29

Total Rec. Rate

78.65%

a = 10

99.48% 89.44%

a = 10

96.61%

a=0

94.27%

a = 0.1

88.80%

a=0

81.51%

a=0 a=1 apple

car

cow

dog

horse

pear

cup

78.56%

tomato

Figure 5.1. Example DAS decision tree giving the result using color 16 bin histogram features on the test set 5. The kernel had been selected on the validation set with γ = 0.1. The numbers on the right side of the tree show the between class recognition rate of each splitting step. The last one corresponds also to the overall recognition rate. In the decision tree are printed the weight factors a. When a = 0 only the 8 class vote matrix is used, for a = 1 equal importance is given to the 8 and 4 class vote matrices. Higher values mean that the 4 class vote matrix gets more importance. The classes at the bottom are colored to show their relationship: light grey for animals and dark grey for fruits.

5.6 Conclusion

The algorithms cannot be compared by looking only at the final recognition rates; robustness is an important property which cannot be neglected. Comparing the main results summarized in Table 5.6, it can be seen that only the DAS-DT algorithm increases the overall performance over standard multi-class SVM with comparable robustness (standard deviation). The gain of DAS-DT is unfortunately not very high: the recognition rates rise only by around 2%. Compared to the high computational cost of DAS-DT (about 3 times that of SVM), the algorithm does not seem to pay off for this kind of multi-class experiment. None of the tested algorithms give really bad results, but compared to the computationally less expensive multi-class SVM no real advantage emerges. Another interesting observation is that the multi-cue and DAS-DT algorithms seem to have more difficulties with the gray-level features: the standard deviation rises much more for these features than for the color features.


Algorithm      Color RG                       Gray 2 Laplacians
               8 bins         16 bins         8 bins         16 bins
SVM            62.60 ±6.03    75.94 ±6.84     74.63 ±8.18    75.16 ±8.57
AC Identity    65.94 ±5.71    72.13 ±6.32     76.18 ±5.33    78.68 ±2.35
AC SG-MRF      64.97 ±7.87    66.20 ±6.72     71.56 ±7.31    74.49 ±4.21
HSVM           67.89 (N/A)    73.79 (N/A)     74.09 (N/A)    76.27 (N/A)
Multi-Cue      62.81 ±6.18    75.21 ±6.08     74.79 ±8.11    75.47 ±8.36
DAS-DT         64.48 ±6.18    77.86 ±6.07     74.90 ±7.00    76.30 ±8.02

Table 5.6. Recognition rates of all tested algorithms. The values are the mean recognition rates on the 5 test sets, followed by the standard deviation (N/A where not available). (SVM: multi-class one-vs-rest Support Vector Machines, AC Identity: adaptive code algorithm using the initial identity code, AC SG-MRF: adaptive code algorithm using the initial SG-MRF code, HSVM: hierarchical SVM)


Chapter 6

Conclusion and Future Work

The immense power of the human brain in categorizing objects and understanding their relationships seems to be one of the keys to powerful object recognition. Using multiple class memberships in object categorization and recognition is a new subject in computer vision research, and few important results have been presented so far. The main difficulty is to find a suitable way to combine and weight this multiple class knowledge.

This thesis has proposed several algorithms that use knowledge about super classes and multiple class memberships for object classification and recognition. All presented and tested algorithms are based on Support Vector Machines and use standard mathematical techniques. The algorithms can be separated into two groups, using multiple class knowledge either in the learning or in the decision process. The newest approach, the so-called adaptive code, looked very promising at the beginning, but the gain compared to classical Support Vector Machines is too low given the higher computational cost. The extension of the decision process was done in three different ways. The simpler schemes, multi-cue decision over two levels and hierarchical Support Vector Machines, are less promising because of their low recognition rate and robustness relative to their higher computational cost. The most promising algorithm, the Discriminative Accumulation Scheme combined with a decision tree (DAS-DT), does not increase the recognition rate much, but it is the only algorithm providing a higher rate than classical Support Vector Machines in all tests. The selection of the weight factor in the decision tree shows that a higher level of class membership sometimes really does increase recognition. It is also interesting that the robustness did not decrease.

Given the very difficult task of learning these image sets, it can be summarized that the generalization of all these algorithms is quite good. The overall performance is certainly not worse than classical multi-class Support Vector Machines, but unfortunately no real improvement could be shown.


6.1 Future Work

Even though the success of the algorithms evaluated in this thesis is limited, some important results are provided. Basic work in a new research field has been done, and there is still much to do. Several improvements to the algorithms should give a better idea of the performance or robustness gains obtainable from multiple class memberships. Based on these algorithms, some questions need to be answered:

- How do the evaluated algorithms behave when learning features other than histograms are used? Here only histogram features were used for learning; scale, shape and local features were ignored. Even mixed combinations, such as histograms or shape for high-level classes and local features for detailed classes, could be interesting to evaluate.

- Can the learning data be selected in a better way? Here all data was used to categorize an object. Weighting and data selection were only done by the DAS-DT and multi-cue algorithms, and this weighting is rather coarse. An algorithm estimating the importance or relevance of parts of the data for a more detailed data selection could perhaps improve performance and show which information is really provided by multiple class memberships.

- How does the recursive Adaptive Code and Embedding algorithm converge? Rätsch proposed the basics of the Adaptive Code algorithm in [Rätsch et al., 2002]. Originally this algorithm was recursive and adapted both the code and the embedding function. Being computationally too expensive, this idea had been abandoned. The main problem is that the algorithm may converge after a few recursions or after several hundred. Once this convergence behavior is better understood, the initial values can be set in a better way and the computation time limited to reasonable values.

Here multiple class memberships have been evaluated for discriminative learning methods, more specifically for kernel machines. Other algorithms use different representations of the relevant information which are perhaps more suitable for multiple class memberships. An interesting direction would be to evaluate probabilistic learning methods with the same kind of multiple class memberships as in this thesis, extending existing algorithms such as SG-MRF (see [Caputo, 2003]) or AdaBoost.


References

[Allwein et al., 2000] E. Allwein, R. Schapire and Y. Singer: Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, AT&T Corp., 2000.

[Bartlett et al., 1999] P. Bartlett, J. Shawe-Taylor: Generalization Performance of Support Vector Machines and other Pattern Classifiers. In B. Schölkopf, C.J.C. Burges and A.J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 43-54. MIT Press, 1999.

[Bennett et al., 2000] K.P. Bennett, A. Demiriz, J. Shawe-Taylor: A column generation algorithm for boosting. In P. Langley, editor, Proc. 17th ICML, pages 65-72, San Francisco, 2000. Morgan Kaufmann.

[Brautigam et al., 1998] C. Brautigam, J.-O. Eklundh, H. Christensen: A model free approach of integrating multiple cues. ECCV 1998.

[Caputo, 2003] B. Caputo: A new Kernel Method for Object Recognition: Spin Glass-Markov Random Fields, Technische Fakultät der Universität Erlangen-Nürnberg, 2003.

[Caputo et al., 2002] B. Caputo and G. Dorko: How to combine Color and Shape Information for Object Recognition: Kernels do the Trick, NIPS 2002.

[Caputo et al., 2003] B. Caputo and M.-E. Nilsback: Cue Integration through Discriminative Accumulation, Royal Institute of Technology Stockholm, 2003.

[CogVis, 2001] CogVis database, 2001. Images available at http://www.vision.ethz.ch/projects/categorization/eth80-db.html.

[Cortes et al., 1995] C. Cortes, V. Vapnik: Support Vector Networks, Machine Learning, 20:273-297, 1995.

[Crammer et al., 1963] K. Crammer, Y. Singer: On the Learnability and Design of Output Codes for Multiclass problems. In N. Cesa-Bianchi and S. Goldberg, editors, Proc. COLT, pages 35-46, San Francisco, 2000.

[Cristianini, 2000] N. Cristianini and J. Shawe-Taylor: An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press, 2000.

[Fergus et al., 2003] R. Fergus, P. Perona and A. Zisserman: Object Class Recognition by Unsupervised Scale-Invariant Learning, University of Oxford and California Institute of Technology, 2003.

[Libsvm, 2001] C. Chang, C. Lin: LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

[Nash et al., 1996] S. Nash, A. Sofer: Linear and Nonlinear Programming, McGraw-Hill, New York, 1996.

[Rätsch et al., 2002] G. Rätsch, A.J. Smola and S. Mika: Adapting Codes and Embeddings for Polychotomies, RSISE, CSL, Machine Learning Group, Canberra, Australia and Fraunhofer FIRST, Berlin, Germany, 2002.

[Rosenblatt, 1959] F. Rosenblatt: The perceptron: a probabilistic model for information storage and organization in the brain, Psychological Review, 1959.

[Schiele, 2000] B. Schiele and J. Crowley: Recognition without Correspondence using Multidimensional Receptive Field Histograms, International Journal of Computer Vision, 36(1):31-52, 2000.

[Schölkopf, 2000] B. Schölkopf: Statistical Learning and Kernel Methods, Microsoft Research, 2000.

[Schölkopf and Smola, 2002] B. Schölkopf and A. Smola: Learning with Kernels, MIT Press, Cambridge, MA, 2002.

[Schölkopf and Smola, 2000] B. Schölkopf, A. Smola, R.C. Williamson and P.L. Bartlett: New Support Vector Algorithms, Neural Computation, 12:1083-1121, 2000.

[Turing, 1950] A.M. Turing: Computing Machinery and Intelligence, Mind, 1950.

[Ullman, 1996] S. Ullman: High-level Vision: Object Recognition and Visual Cognition, MIT Press, Cambridge, MA, 1996.

[Vanderbrei, 2000] R.J. Vanderbrei: LOQO: A commercial solver for quadratic programs, Princeton University, 2000. Software evaluation and licenses available at http://www.orfe.princeton.edu/~loqo/.

[Vapnik et al., 1963] V. Vapnik, A. Lerner: Pattern Recognition using generalized Portrait Method, Automation and Remote Control, 24, 1963.

[Vapnik and Chervonenkis, 1964] V. Vapnik, A. Chervonenkis: A Note on one Class of Perceptrons, Automation and Remote Control, 25, 1964.

[Vapnik et al., 1992] B.E. Boser, I.M. Guyon, V. Vapnik: A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, ACM Press, 1992.

[Vapnik, 1995] V. Vapnik: The Nature of Statistical Learning Theory, Springer, New York, 1995.

[Vapnik, 1998] V. Vapnik: Statistical Learning Theory, Wiley and Sons, New York, 1998.

Appendix A

All Experimental Results

All experiments were done with 5 different training, validation and test sets. The recognition rates reported here are the means over these 5 runs; the corresponding standard deviation is given next to each mean.
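As a reminder of how these summary numbers are obtained, the short numpy sketch below (not code from the thesis) computes a mean and sample standard deviation over the 5 per-test-set recognition rates; using the DAS-DT total rates of Figures A.1 to A.5 as input matches the 77.86 ±6.07 entry of Table A.1:

    import numpy as np

    # Per-test-set recognition rates in %, here the DAS-DT results for the
    # color 16 bin features (Figures A.1 to A.5).
    rates = np.array([82.29, 83.59, 68.23, 76.56, 78.65])

    mean = rates.mean()
    std = rates.std(ddof=1)            # sample standard deviation over the 5 splits
    print(f"{mean:.2f} ±{std:.2f}")    # prints 77.86 ±6.07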

A.1 One-vs-rest SVM

Feature Type   Resolution   Recognition rate in %
                            Validation       Test
Color RG       8 bins       61.36 ±2.10      62.60 ±6.03
Color RG       16 bins      65.31 ±3.77      75.94 ±6.84
Gray 2Lap      8 bins       71.88 ±3.87      74.63 ±8.18
Gray 2Lap      16 bins      75.21 ±5.85      75.16 ±8.57

A.2 Adaptive Code

Identity vs Random

Feature Type   Resolution   Recognition rate in %
                            Identity code                    Random code
                            Validation      Test             Validation      Test
Color RG       8 bins       69.33 ±2.85     65.94 ±5.71      23.48 ±2.03     23.2 ±1.54
Color RG       16 bins      72.31 ±4.28     72.13 ±6.32      23.96 ±3.80     19.42 ±2.92
Gray 2Lap      8 bins       79.42 ±4.16     76.18 ±5.33      24.51 ±0.39     24.21 ±0.53
Gray 2Lap      16 bins      80.98 ±5.33     78.68 ±2.35      23.87 ±0.85     23.08 ±2.19

Prior Knowledge

Feature Type   Resolution   Recognition rate in %
                            Prior Knowledge 1                Prior Knowledge 2
                            Validation      Test             Validation      Test
Color RG       8 bins       26.40 ±1.09     24.51 ±2.48      50.43 ±11.01    46.92 ±10.64
Color RG       16 bins      31.01 ±1.43     28.21 ±3.28      67.38 ±3.75     62.76 ±2.08
Gray 2Lap      8 bins       28.57 ±2.94     28.90 ±3.21      64.51 ±3.94     62.57 ±3.41
Gray 2Lap      16 bins      29.57 ±2.54     28.86 ±3.00      70.49 ±1.67     68.72 ±2.87

SG-MRF

Feature Type   Resolution   Recognition rate in %
                            SG-MRF code
                            Validation      Test
Color RG       8 bins       66.46 ±5.84     64.97 ±7.87
Color RG       16 bins      71.98 ±4.83     66.20 ±6.72
Gray 2Lap      8 bins       78.45 ±2.91     71.56 ±7.31
Gray 2Lap      16 bins      79.85 ±4.69     74.49 ±4.21

A.3 Multi-cue SVM

Feature Type   Resolution   Recognition rate in %
                            Validation       Test
Color RG       8 bins       61.88 ±1.4       62.81 ±6.18
Color RG       16 bins      65.31 ±3.77      75.21 ±6.08
Gray 2Lap      8 bins       72.03 ±3.99      74.79 ±8.11
Gray 2Lap      16 bins      75.78 ±5.52      75.47 ±8.36

A.4 Hierarchical SVM

Color RG 8 bins
Recognition rates in %: 1st level 81.87, fruits 89.00, animals 37.17, total mean 67.89
[Two-level hierarchical SVM tree diagram with per-class recognition rates omitted.]

Color RG 16 bins
Recognition rates in %: 1st level 90.52, fruits 92.26, animals 42.49, total mean 73.79
[Two-level hierarchical SVM tree diagram with per-class recognition rates omitted.]

Gray 2Lap 8 bins
Recognition rates in %: 1st level 81.87, fruits 83.50, animals 49.92, total mean 74.09
[Two-level hierarchical SVM tree diagram with per-class recognition rates omitted.]

Gray 2Lap 16 bins
Recognition rates in %: 1st level 92.24, fruits 89.67, animals 51.23, total mean 76.27
[Two-level hierarchical SVM tree diagram with per-class recognition rates omitted.]

A.5 DAS-DT

Only some selected trees are listed here, for the 16 bin resolution features (the 8 bin trees are worse but comparable). The kernel selection (γ-selection) was done beforehand on the validation set. Table A.1 summarizes all results of the DAS-DT experiments.

Feature Type   Resolution   Recognition rate in %
                            Validation       Test
Color RG       8 bins       62.81 ±1.52      64.48 ±6.18
Color RG       16 bins      67.81 ±3.24      77.86 ±6.07
Gray 2Lap      8 bins       73.59 ±4.36      74.90 ±7.00
Gray 2Lap      16 bins      77.97 ±6.71      76.30 ±8.02

Table A.1. Recognition rates of DAS-DT.

On the right side of the decision trees the recognition rate between the classes at each step is written. The weight factors a are printed in the graphics: when a = 0 only the 8 class vote matrix has been used, and higher values of a give more influence to the 4 class vote matrix. For example, a = 1 means that both vote matrices had equal importance for that splitting.

[Decision tree diagram omitted; total recognition rate 82.29%.]
Figure A.1. Color 16 bins, Test set 1, γ = 0.1

[Decision tree diagram omitted; total recognition rate 83.59%.]
Figure A.2. Color 16 bins, Test set 2, γ = 0.1

[Decision tree diagram omitted; total recognition rate 68.23%.]
Figure A.3. Color 16 bins, Test set 3, γ = 1

[Decision tree diagram omitted; total recognition rate 76.56%.]
Figure A.4. Color 16 bins, Test set 4, γ = 0.1

[Decision tree diagram omitted; total recognition rate 78.65%.]
Figure A.5. Color 16 bins, Test set 5, γ = 0.1

[Decision tree diagram omitted; total recognition rate 72.40%.]
Figure A.6. Gray 16 bins, Test set 1, γ = 1

[Decision tree diagram omitted; total recognition rate 90.36%.]
Figure A.7. Gray 16 bins, Test set 2, γ = 10

[Decision tree diagram omitted; total recognition rate 75.00%.]
Figure A.8. Gray 16 bins, Test set 3, γ = 10

[Decision tree diagram omitted; total recognition rate 70.57%.]
Figure A.9. Gray 16 bins, Test set 4, γ = 10

[Decision tree diagram omitted; total recognition rate 73.18%.]
Figure A.10. Gray 16 bins, Test set 5, γ = 10

Appendix B

Hardware and Software

For the experiments three types of computers have been used:

- Intel Pentium IV, 768 MB RAM, Linux 2.4.x and 2.6.x
- Sun Blade 100, UltraSparc IIIe, 500 MHz, 256 MB RAM, Solaris 8
- Sun 8 processor machine, UltraSparc II, 400 MHz, 4096 MB RAM, SunOS 5.8

The algorithms were executed / compiled with the following software:

- Mathworks Matlab 6.5 R13 for Linux and Solaris
- LOQO [Vanderbrei, 2000] solver for quadratic programs
- LIBSVM [Libsvm, 2001] library for Support Vector Machines
- SG-MRF library by Gyorgy Dorko
- GCC 3.3 GNU C compiler for Linux and Solaris

