
MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY

and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES

A.I. Memo No. 1697

September, 2000

C.B.C.L. Paper No. 192

Feature Selection for Face Detection

Thomas Serre, Bernd Heisele, Sayan Mukherjee, Tomaso Poggio

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. The pathname for this publication is: ai-publications/1500-1999/1697

Abstract

We present a new method to select features for a face detection system using Support Vector Machines (SVMs). In the first step we reduce the dimensionality of the input space by projecting the data into a subset of eigenvectors. The dimension of the subset is determined by a classification criterion based on minimizing a bound on the expected error probability of an SVM. In the second step we select features from the SVM feature space by removing those that have low contributions to the decision function of the SVM.

Copyright © Massachusetts Institute of Technology, 2000

This report describes research done within the Center for Biological and Computational Learning in the Department of Brain and Cognitive Sciences and in the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology. This research is sponsored by a grant from Office of Naval Research Contract No. N00014-93-13085, Office of Naval Research Contract No. N00014-95-1-0600, National Science Foundation Contract No. IIS-9800032, and National Science Foundation Contract No. DMS-9872936. Additional support is provided by: AT&T, Central Research Institute of Electric Power Industry, Eastman Kodak Company, DaimlerChrysler, Digital Equipment Corporation, Honda R&D Co., Ltd., NEC Fund, Nippon Telegraph & Telephone, and Siemens Corporate Research, Inc.


1 Introduction

The trainable system for detecting frontal and near-frontal views of faces in gray images presented in [Heisele et al. 2000] gave good results in terms of detection rates. The system used gray values of 19×19 images as inputs to a second-degree polynomial kernel SVM. This choice of kernel led to more than 40,000 features in the feature space¹. Searching an image for faces at different scales took several minutes on a PC. Many real-world applications require significantly faster algorithms. One way to speed up the system is to reduce the number of features. We present a new method to reduce the dimensions of both input and feature space without decreasing the classification rate. The problem of choosing the subset of input features which minimizes the expected error probability of the SVM is an integer programming problem, known to be NP-complete. To simplify the problem, we first rank the features and then select their number by minimizing a bound on the expected error probability of the classifier.

The outline of the paper is as follows: generating training and test data is described in Chapter 2. In Chapter 3 we give a brief overview of SVM theory. In Chapter 4 we rank features in the input space according to a classification criterion. We then determine the appropriate number of ranked features in Chapter 5. In Chapter 6 we remove features from the feature space that have small contributions to the decision function of the classifier. In Chapter 7 we apply feature selection to a real-world application.

¹ In the following, we use input space $\mathbb{R}^n$ for the representation space of the image data and feature space $\mathbb{R}^p$ ($p > n$) for the non-linearly transformed input space.

2 Description of the Input Data

2.1 Input features

In this section we describe the pre-processing steps applied to the gray images in order to extract the input features to our classifier. To decrease the variations caused by changes of illumination we used three pre-processing steps proposed in [Sung 96] (sketched in the code example below). A mask was first applied to eliminate pixels close to the boundary of the 19×19 images, reducing the number of pixels from 361 to 283. To account for cast shadows we subtracted a best-fit intensity plane from the images. Then we performed histogram equalization to remove variations in the image brightness and contrast. Finally the 283 gray values were re-scaled to a range between 0 and 1. We also computed the gray value gradients from the histogram-equalized images using 3×3 x- and y-Sobel filters. Again the results were re-scaled to be in a range between 0 and 1. These gradient features were combined with the gray value features to form a second set of 572 features². Additionally we applied Principal Component Analysis (PCA) to the whole training set and projected the data points into the eigenvector space. To summarize, we considered four different sets of input features:

- 283 gray features
- 572 gray/gradient features
- 283 PCA gray features
- 572 PCA gray/gradient features

² As reported in [Heisele et al. 2000], detection results with gradient alone were worse than those for gray values. That is why we combined gradient and gray features.
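The following is a minimal sketch of this pre-processing chain (mask, best-fit intensity plane, histogram equalization, rescaling, Sobel gradients, PCA), assuming NumPy, scikit-image and scikit-learn. Function names, the use of the gradient magnitude, and the exact cropping are our own illustration and do not reproduce the original implementation, which combines the 283 gray values and the gradient values into 572 features.

```python
import numpy as np
from skimage.exposure import equalize_hist
from skimage.filters import sobel_h, sobel_v
from sklearn.decomposition import PCA

def gray_features(patch, mask):
    """patch: (19, 19) array of gray values; mask: (19, 19) boolean array
    keeping the 283 pixels away from the boundary."""
    ys, xs = np.nonzero(mask)
    # Best-fit intensity plane, subtracted to compensate for cast shadows.
    A = np.c_[xs, ys, np.ones(len(xs))]
    coeffs, *_ = np.linalg.lstsq(A, patch[mask].astype(float), rcond=None)
    corrected = patch[mask] - A @ coeffs
    # Histogram equalization, then rescaling to [0, 1].
    g = equalize_hist(corrected)
    return (g - g.min()) / (g.max() - g.min() + 1e-12)

def gradient_features(patch):
    """Gradient magnitude from 3x3 x- and y-Sobel filters, rescaled to [0, 1]."""
    gx, gy = sobel_v(patch.astype(float)), sobel_h(patch.astype(float))
    mag = np.hypot(gx, gy)
    return ((mag - mag.min()) / (mag.max() - mag.min() + 1e-12)).ravel()

# PCA over the whole training set, projecting the data into the eigenvector
# space (X_train is an (n_samples, n_features) matrix of the features above):
# pca = PCA().fit(X_train); X_pca = pca.transform(X_train)
```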

2.2 Training and test sets

In our experiments we used one training and two test sets. The positive training set contained 2,429 19×19 faces. The negative training set contained 4,548 randomly selected non-face patterns. In the first part of this paper, we used a small test set in order to perform a large number of tests. The test set was extracted from the CMU test set 1³. We extracted all 479 faces and 23,570 non-face patterns. The non-face patterns were selected by a linear SVM classifier as the non-face patterns most similar to faces. The final evaluation of our system was performed on the entire CMU test set 1, containing 118 images. Processing all images at different scales resulted in about 57,000,000 analyzed 19×19 windows.

³ The test set is a subset of the CMU test set 1 [Rowley et al. 97] which consists of 130 images and 507 faces. We excluded 12 images containing line-drawn faces and non-frontal faces.

3 Support Vector Machine

Support Vector Machines [Vapnik 98] perform pattern recognition for two-class problems by finding the decision surface which minimizes the structural risk of the classifier. This is equivalent to determining the separating hyperplane that has maximum distance to the closest points of the training set. These closest points are called Support Vectors (SVs). Figure 1 (a) shows a 2-dimensional problem for linearly separable data. The gray area indicates all possible hyperplanes which separate the two classes. The optimal hyperplane in Figure 1 (b) maximizes the distance to the SVs.

Figure 1: a) The gray area shows all possible hyperplanes which separate the two classes. b) The optimal hyperplane maximizes the distance to the closest points. These points (1, 2 and 3) are called Support Vectors (SVs). The distance M between the hyperplane and the SVs is called the margin.

If the data are not linearly separable in the input space, a non-linear transformation $\Phi(\cdot)$ maps the data points $\mathbf{x}$ of the input space $\mathbb{R}^n$ into a high dimensional space, called feature space $\mathbb{R}^p$ ($p > n$). The mapping $\Phi(\cdot)$ is represented in the SVM classifier by a kernel function $K(\cdot,\cdot)$ which defines an inner product in $\mathbb{R}^p$. The decision function of the SVM is thus:

$$f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b = \sum_i \alpha_i^0 y_i K(\mathbf{x}_i, \mathbf{x}) + b \qquad (1)$$

where $y_i$ is the class label $\{-1, 1\}$ of the training samples. Again the optimal hyperplane is the one with the maximal distance (in feature space $\mathbb{R}^p$) to the closest points $\Phi(\mathbf{x}_i)$ of the training data. Determining that hyperplane leads to maximizing the following functional with respect to $\alpha$:

$$W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \qquad (2)$$

under the constraints $\sum_{i=1}^{\ell} \alpha_i y_i = 0$ and $C \geq \alpha_i \geq 0$, $i = 1, \dots, \ell$. The solution of this maximization problem is denoted $\alpha^0 = (\alpha_1^0, \dots, \alpha_k^0, \dots, \alpha_\ell^0)$.

An upper bound on the expected error probability $EP_{err}$ of an SVM classifier is given by:

$$EP_{err} \leq \frac{1}{\ell} E\left[ R^2\, W(\alpha^0) \right] \qquad (3)$$

where $R$ is the radius of the smallest sphere including all points $\Phi(\mathbf{x}_1), \dots, \Phi(\mathbf{x}_\ell)$ of the training vectors $\mathbf{x}_1, \dots, \mathbf{x}_\ell$. In the following, we will use this bound on the expectation of the leave-one-out error to rank and select features.
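As a small computational sketch (not the memo's code), the quantity $R^2\, W(\alpha^0)/\ell$ can be estimated with scikit-learn as follows; the function name and the centroid-based estimate of $R$ are our own simplifications of the exact quadratic problem.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import polynomial_kernel

def loo_bound(X, y, C=1.0, degree=2):
    """Train a polynomial-kernel SVM and evaluate (1/l) * R^2 * W(alpha^0),
    the bound of Equation (3), with R estimated from the centroid of the
    mapped training points rather than the smallest enclosing sphere."""
    clf = SVC(C=C, kernel="poly", degree=degree, gamma=1.0, coef0=1.0).fit(X, y)
    alpha_y = clf.dual_coef_.ravel()                 # alpha_i * y_i for the SVs
    K_sv = polynomial_kernel(clf.support_vectors_, degree=degree,
                             gamma=1.0, coef0=1.0)
    W = np.abs(alpha_y).sum() - 0.5 * alpha_y @ K_sv @ alpha_y
    # Rough radius estimate: squared distances to the feature-space centroid.
    K = polynomial_kernel(X, degree=degree, gamma=1.0, coef0=1.0)
    R2 = np.max(np.diag(K) - 2.0 * K.mean(axis=1) + K.mean())
    return R2 * W / len(y)
```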

4 Ranking Features in the Input Space

4.1 Description of the method

In [Weston et al. 2000] a gradient descent method is proposed to rank the input features by minimizing the bound on the expectation of the leave-one-out error of the classifier. We implemented an earlier approximation of this approach. The main idea is to re-scale the n-dimensional input space by an $n \times n$ diagonal matrix $\sigma$ such that the margin M in Equation (3) is maximized. However, one can trivially increase the margin by simply multiplying all input vectors by a scalar. For this reason the following constraint is added: $\|\sigma\|_F = N$, where N is some constant. This constraint approximately enforces the radius R around the data to remain constant while the margin is maximized. The new mapping function can be written as $\Phi_\sigma(\mathbf{x}) = \Phi(\sigma \cdot \mathbf{x})$ and the kernel function is $K_\sigma(\mathbf{x}, \mathbf{y}) = K(\sigma \cdot \mathbf{x}, \sigma \cdot \mathbf{y}) = \Phi_\sigma(\mathbf{x}) \cdot \Phi_\sigma(\mathbf{y})$. The decision function given in Equation (1) becomes:

$$f(\mathbf{x}, \sigma) = \mathbf{w} \cdot \Phi_\sigma(\mathbf{x}) + b = \sum_i \alpha_i^0 y_i K_\sigma(\mathbf{x}_i, \mathbf{x}) + b \qquad (4)$$

The maximization problem of Equation (2) is now given by:

$$W(\alpha, \sigma) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K_\sigma(\mathbf{x}_i, \mathbf{x}_j) \qquad (5)$$

subject to $\sum_{i=1}^{\ell} \alpha_i y_i = 0$, $C \geq \alpha_i \geq 0$, $\|\sigma\|_F = N$, and $\sigma_i \geq 0$. To solve this problem we stepped along the gradient of Equation (5) with respect to $\sigma$ and $\alpha$ until we reached a local maximum. One iteration consisted of two steps: first we held $\sigma$ constant and trained the SVM to calculate the solution $\alpha^0$ of the maximization problem given in Equation (2). In a second step, we kept $\alpha$ constant and performed the gradient descent on W with respect to $\sigma$ subject to the constraint on the norm of $\sigma$, which is an approximation to minimizing the bound on $EP_{err}$ according to Equation (3) for a fixed R. In our experiments we performed one iteration and then ranked the features by decreasing elements $\sigma_i$ of $\sigma$; a minimal sketch of one iteration is given below.
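A rough sketch of one such iteration for the kernel $(1 + \mathbf{x} \cdot \mathbf{y})^2$, assuming scikit-learn. The step size, the sign convention of the descent step on W, the clipping of $\sigma$ to non-negative values, and the projection back onto the norm constraint are our own choices, not taken from the memo.

```python
import numpy as np
from sklearn.svm import SVC

def rank_by_scaling(X, y, C=1.0, step=0.1):
    """One iteration: train the SVM for fixed sigma, then take a descent step
    on W(alpha, sigma) with respect to sigma under ||sigma||_F = N, and rank
    the input features by decreasing sigma."""
    n = X.shape[1]
    sigma = np.ones(n)
    N = np.linalg.norm(sigma)

    # Step 1: hold sigma fixed and solve the dual problem of Equation (2).
    clf = SVC(C=C, kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X * sigma, y)
    sv = X[clf.support_]
    beta = clf.dual_coef_.ravel()                   # alpha_i * y_i

    # Step 2: hold alpha fixed; gradient of W w.r.t. sigma for
    # K_sigma(x, y) = (1 + sum_k sigma_k^2 x_k y_k)^2.
    S = (sv * sigma) @ (sv * sigma).T
    grad = np.empty(n)
    for k in range(n):
        dK = 4.0 * sigma[k] * (1.0 + S) * np.outer(sv[:, k], sv[:, k])
        grad[k] = -0.5 * beta @ dK @ beta
    sigma = np.maximum(sigma - step * grad, 1e-12)  # descent step, sigma_i >= 0
    sigma *= N / np.linalg.norm(sigma)              # back onto ||sigma||_F = N

    return np.argsort(-sigma), sigma                # feature indices, best first
```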

4.2 Experiments on different input spaces

We first evaluated the ranking methods on the gray and PCA gray features. The tests were performed on the small test set for 60, 80 and 100 ranked features with a second-degree polynomial SVM. In Figure 2 we show the 100 best gray features; bright gray values indicate high ranking. The Receiver Operator Characteristic (ROC) curves of second-degree polynomial SVMs are shown in Figure 3. For 100 features there is no difference between gray and PCA gray features. However, the PCA gray features gave clearly better results for 60 and 80 selected features. For this reason we focused in the following experiments on PCA features only. An interesting observation was that the ranking of the PCA features obtained by the gradient descent method described above was similar to the ranking by decreasing eigenvalues. To compare PCA gray/gradient with PCA gray features, we performed tests with 50 features on the entire CMU test set 1. Surprisingly, the results for gray values alone were better than those for the combination of gray and gradient values. A possible explanation could be that the gradient value features are noisier than the gray ones.

Figure 2: a) First 100 gray features according to ranking by gradient descent. Bright intensities indicate high ranking. b) Reference 19×19 face.


Figure 3: Comparison of the two input spaces for a) 60 features, b) 80 features, and c) 100 features.

Figure 4: Comparison of the ROC curves for PCA gray features and PCA gray/gradient features.

5 Selecting Features in the Input Space

5.1 Description of the method

In Chapter 4 we ranked the features according to their scaling factors $\sigma_i$. Now the problem is to determine a subset of the ranked features $(x_1, x_2, \dots, x_n) \in \mathbb{R}^n$. This problem can be formulated as finding the optimal subset of ranked features $(x_1, x_2, \dots, x_{n'})$ among the $n$ possible subsets, where $n' < n$ is the number of selected features. As a measure of the classification performance of an SVM for a given subset of ranked features we used again the bound on the expected error probability:

$$EP_{err} \leq \frac{1}{\ell} E\left[ R^2\, W(\alpha^0) \right] \qquad (6)$$

To simplify the computation of our algorithm and to avoid solving a quadratic optimization problem in order to compute the radius R, we approximated⁴ $R^2$ by $2p$, where $p$ is the dimension of the feature space $\mathbb{R}^p$. For a second-degree polynomial kernel of type $(1 + \mathbf{x} \cdot \mathbf{y})^2$ we get:

$$EP_{err} \leq \frac{1}{\ell} E\left[ 2p\, W(\alpha^0) \right] = \frac{1}{\ell} E\left[ n'(n'+3)\, W(\alpha^0) \right] \qquad (7)$$

where $n'$ is the number of selected features⁵. The bound on the expectation of the leave-one-out error is shown in Figure 5. We had no training error for more than 22 selected features. The margin continuously increases with increasing numbers of features. The bound on the expected error shows a plateau between 30 and 60 features, then it significantly increases.

⁴ We previously normalized all the data in $\mathbb{R}^n$ to be in a range between 0 and 1. As a result the data points lay within a p-dimensional cube of side length $\sqrt{2}$ in $\mathbb{R}^p$, so the radius of the smallest sphere including all the points is upper bounded by $\sqrt{2p}$.

⁵ As we used a second-degree polynomial SVM, the dimension of the feature space is $p = n'(n'+3)/2$.
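A minimal sketch of how the bound of Equation (7) can be evaluated for a given number of ranked features, assuming scikit-learn and the $(1 + \mathbf{x} \cdot \mathbf{y})^2$ kernel; the helper name and the sweep shown in the comment are our own illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import polynomial_kernel

def selection_bound(X_ranked, y, n_sel, C=1.0):
    """Bound of Equation (7), n'(n'+3) * W(alpha^0) / l, for the first n_sel
    ranked input features (columns of X_ranked are assumed already ranked)."""
    X = X_ranked[:, :n_sel]
    clf = SVC(C=C, kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X, y)
    alpha_y = clf.dual_coef_.ravel()                 # alpha_i * y_i for the SVs
    K = polynomial_kernel(clf.support_vectors_, degree=2, gamma=1.0, coef0=1.0)
    W = np.abs(alpha_y).sum() - 0.5 * alpha_y @ K @ alpha_y
    return n_sel * (n_sel + 3) * W / len(y)

# Typical use: sweep n' over the ranked features and look for the plateau,
# e.g. bounds = [selection_bound(X_ranked, y, k) for k in range(10, 101, 10)].
```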

Figure 5: Bound on the expected error for different numbers of selected features⁶.

⁶ Note that we did not normalize the bound by the number of training samples $\ell$.

5.2 Experiments

To evaluate our method, we tested the system on the large CMU test set 1 consisting of 479 faces and about 57,000,000 non-face patterns. In Figure 6, we compare the ROC curves obtained for different numbers of selected features. The results show that using more than 60 features did not improve the performance of the system.

Figure 6: ROC curves for different numbers of features.

6 Feature Reduction in the Feature Space

In the previous Chapter we described how to reduce the number of features in the input space. Now we consider the problem of reducing the number of features in the feature space. We used the method proposed in [Heisele et al. 2000] based on the contribution of the features to the decision function f(x) of the SVM:

$$f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b = \sum_i \alpha_i^0 y_i K(\mathbf{x}_i, \mathbf{x}) + b \qquad (8)$$

where $\mathbf{w} = (w_1, \dots, w_p)$. For a second-degree polynomial kernel with $K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^2$, the feature space $\mathbb{R}^p$ with dimension $p = \frac{n(n+3)}{2}$ is given by:

$$\Phi(\mathbf{x}) = \left( \sqrt{2}x_1, \sqrt{2}x_2, \dots, \sqrt{2}x_n,\; x_1^2, x_2^2, \dots, x_n^2,\; \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \dots, \sqrt{2}x_{n-1}x_n \right)$$

The contribution of a feature $x_k$ to the decision function in Equation (8) depends on $w_k$. A straightforward way to order the features is by decreasing $|w_k|$. Alternatively, one can weight $w_k$ by the Support Vectors to account for different distributions of the features in the training data. The features were then ordered by decreasing $|w_k \sum_i y_i x_{i,k}|$, where $x_{i,k}$ denotes the k-th component of Support Vector i in feature space $\mathbb{R}^p$. For the two methods we first trained an SVM with a second-degree polynomial kernel with an input space of 60 features, which corresponds to 1891 features in the feature space. We then calculated $|f(\mathbf{x}_i) - f_S(\mathbf{x}_i)|$ for all Support Vectors, where $f_S(\mathbf{x})$ is the decision function using the S first features according to their ranking. The results in Figure 7 show that ranking by the weighted features of $\mathbf{w}$ leads to faster convergence of the error.
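A sketch of the two rankings and of the truncated decision function $f_S$, assuming scikit-learn and labels in $\{-1, +1\}$. The constant component of the feature map is dropped, since its weight $\sum_i \alpha_i^0 y_i$ vanishes because of the equality constraint; the weighted criterion follows the reconstruction given above, and all names are our own.

```python
import numpy as np
from sklearn.svm import SVC

def poly2_map(X):
    """Explicit map for the kernel (1 + x.y)^2, constant component omitted:
    (sqrt(2)x_1, ..., sqrt(2)x_n, x_1^2, ..., x_n^2, sqrt(2)x_i x_j for i < j)."""
    n = X.shape[1]
    i, j = np.triu_indices(n, k=1)
    return np.hstack([np.sqrt(2) * X, X ** 2, np.sqrt(2) * X[:, i] * X[:, j]])

def rank_feature_space(X, y, C=1.0):
    """Rank the feature-space components by |w_k| and by the SV-weighted
    criterion, and return a truncated decision function for measuring
    |f(x_i) - f_S(x_i)| on the Support Vectors as in Figure 7."""
    clf = SVC(C=C, kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X, y)
    phi_sv = poly2_map(X[clf.support_])
    alpha_y = clf.dual_coef_.ravel()                  # alpha_i * y_i
    w = alpha_y @ phi_sv                              # normal of the hyperplane
    by_w = np.argsort(-np.abs(w))
    by_weighted = np.argsort(-np.abs(w * (y[clf.support_] @ phi_sv)))

    def f_truncated(Xq, order, S):
        """Decision value using only the S top-ranked feature-space components."""
        keep = order[:S]
        return poly2_map(Xq)[:, keep] @ w[keep] + clf.intercept_[0]

    return by_w, by_weighted, f_truncated
```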

Figure 7: Classifying Support Vectors with a reduced number of features. The x-axis shows the number of features, the y-axis is the mean absolute difference between the output of the SVM using all features and the same SVM using the S first features only. The features were ranked according to the features and the weighted features of the normal vector of the separating hyperplane.

Figure 8 shows the ROC curves for 500 and 1000 features. As a reference we added the ROC curve for a second-degree SVM trained on the original 283 gray features. This corresponds to a feature space of dimensionality $(283+3) \cdot 283 / 2 = 40{,}469$. By combining both methods of feature reduction we could reduce the dimensionality by a factor of about 40 without loss in classification performance.

7 Application

7.1 Architecture of the system

We applied feature selection to a real-world application where the goal was to determine the orientation (right side up or upside down) of face images in real time. To solve this problem we applied frontal face detection to the original and the rotated images (180°). The images in which at least one face was detected with high confidence were considered to be right side up.

Figure 8: ROC curves for different dimensions of the feature space.

We used a subset of the Kodak Database consisting of 283 images of size 512×768. The resolution of the faces varied approximately between 20×20 and 200×200. The average number of faces per image was 2. Even after applying the two feature selection methods described in this paper, the computational complexity of a second-degree polynomial SVM classifier was still too high for a real-time system. That is why we implemented a two-layer system where the first layer consists of a fast linear SVM that removes large parts of the background. The second layer consists of a more accurate polynomial SVM that performs the final face detection. Our system is illustrated in Figure 9. (B) and (C) show the responses of the linear classifier for the original and the rotated images. Bright values indicate the presence of faces. Thresholding these images leads to binary images (A) and (D) where the locations of potential faces are drawn in black. At these locations we search for faces using the second-degree polynomial SVM of the second layer; a sketch of this two-layer cascade is given below.
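A schematic sketch of the two-layer cascade, with our own names, threshold handling and decision rule (the memo only states that an image is considered right side up if at least one face is detected with high confidence):

```python
import numpy as np

def best_score(windows, linear_clf, poly_clf, pca, keep_idx, thresh):
    """First layer: a fast linear SVM on the pre-processed gray windows removes
    the background. Second layer: the second-degree polynomial SVM, restricted
    to the selected PCA features, scores the surviving windows."""
    candidates = np.where(linear_clf.decision_function(windows) > thresh)[0]
    if candidates.size == 0:
        return -np.inf
    feats = pca.transform(windows[candidates])[:, keep_idx]
    return poly_clf.decision_function(feats).max()

def right_side_up(windows_orig, windows_rot, *cascade):
    """Compare the best face score of the upright windows with that of the
    windows extracted from the 180-degree rotated image."""
    return best_score(windows_orig, *cascade) >= best_score(windows_rot, *cascade)
```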

7.2 Experiments

In the first experiment we applied a second-degree SVM classifier trained on 60 PCA features to the Kodak database. All 283 images were right side up. The results are shown in Figure 10 and compared to the ROC curve for the CMU test set. The fact that the ROC curve for the Kodak database is worse than the ROC curve for the CMU test set 1 can be explained by the large number of rotated faces, faces of babies, and children with masked faces (see Figure 11).

Figure 9: Architecture of the real-time system determining the orientation of a face.

Figure 10: ROC curve for the Kodak database.


Figure 11: Images from the Kodak database.


In a second experiment we considered the two-layer system. We chose the threshold for the linear SVM from previous results on the CMU test set. For this threshold we correctly classified 99.8% of the faces and 99.9% of the non-face patterns. In the worst case, the average number of multiplications for the whole system is about 300 per pixel and per scale⁷. Searching for a face directly with a second-degree polynomial SVM using gray values would have led to 81,000 operations. As a result, we sped up the system by a factor of about 270.

⁷ The number of operations for the first level is equal to 283 (the dimension of the input space). For the second level we assume that the percentage of pixels that pass the first level is equal to 0.001. For projecting the data into the eigenvector space we have to perform 60 × 283 multiplications. Finally we have to project the input features into the feature space and calculate the dot product of the 1000 selected features with the normal vector of the separating hyperplane. Overall this results in 0.001 × (60 × 283 + 2 × 1000) ≈ 19 multiplications per shifted window.
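For reference, the arithmetic behind these numbers follows the footnote above; reading the 81,000 figure as roughly twice the 40,469-dimensional feature space (feature map plus dot product) is our assumption:

```python
# Operation counts per analyzed window, per pixel and scale.
first_layer = 283                              # linear SVM on the 283 gray values
second_layer = 0.001 * (60 * 283 + 2 * 1000)   # PCA projection + 1000-feature dot product
per_window = first_layer + second_layer        # ~302 multiplications
direct = 81_000                                # full second-degree SVM on gray values
print(per_window, direct / per_window)         # ~302, speed-up factor of roughly 270
```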

8 Conclusion

We presented a method to select features for a face detection system using Support Vector Machines (SVMs). By ranking and then selecting PCA gray features according to an SVM classification criterion we could remove about 80% of the input features. In a second step we further reduced the dimensionality by removing features with low contributions to the decision function of the SVM. Overall we kept less than 2% of the original features without loss in classification performance. We demonstrated the efficiency of our method by developing a real-time system that is able to determine the orientation of faces.

References

[Heisele et al. 2000] B. Heisele, T. Poggio, M. Pontil. Face Detection in Still Gray Images. A.I. Memo 1687, Center for Biological and Computational Learning, MIT, Cambridge, MA, 2000.

[Rowley et al. 97] H. A. Rowley, S. Baluja, T. Kanade. Rotation Invariant Neural Network-Based Face Detection. Computer Science Technical Report CMU-CS-97-201, CMU, Pittsburgh, 1997.

[Sung 96] K.-K. Sung. Learning and Example Selection for Object and Pattern Recognition. Ph.D. thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.

[Vapnik 98] V. Vapnik. Statistical Learning Theory. New York: John Wiley and Sons, 1998.



[Weston et al. 2000] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, V. Vapnik. Feature Selection for SVMs. Submitted to Advances in Neural Information Processing Systems 13, 2000.
