Face Detection with the Modified Census Transform

Bernhard Fröba

Andreas Ernst

Department of Applied Electronics Fraunhofer Institute for Integrated Circuits Am Wolfsmantel 33, D-91058 Erlangen, Germany

Abstract

Illumination variation is a big problem in object recognition which usually requires a costly compensation prior to classification. It would be desirable to have an image-to-image transform which uncovers only the structure of an object for an efficient matching. In this context the contribution of our work is twofold. First, we introduce illumination-invariant Local Structure Features for object detection. For an efficient computation we propose a Modified Census Transform which enhances the original work of Zabih and Woodfill [10]. We show some shortcomings of the original transform and how to overcome them with the modified version. Secondly, we introduce an efficient four-stage classifier for rapid detection. Each single stage classifier is a linear classifier which consists of a set of feature lookup-tables. We show that the first stage, which evaluates only 20 features, filters out more than 99% of all background positions. Thus the classifier structure is much simpler than previously described multi-stage approaches, while having similar capabilities. The combination of illumination-invariant features together with a simple classifier leads to a real-time system on standard computers (60 msec, image size: 288 × 384, 2 GHz Pentium). Detection results are presented on two commonly used databases in this field, namely the MIT+CMU set of 130 images and the BioID set of 1526 images. We achieve detection rates of more than 90% with a very low false positive rate of 10^−7 %. We also provide a demo program that can be found on the internet at http://www.iis.fraunhofer.de/bv/biometrie/download/.

1 Introduction

Pose and illumination variations are the biggest challenges in object detection. While the first can be approached by a proper modeling of the target, there exist several strategies to cope with the illumination problem. Often compensation with a simple lighting model followed by histogram equalization is performed, as first proposed by Sung [7] in the context of face detection. Many other detection approaches adopt this strategy with minor modifications, see

Refs. [4, 5, 9]. Schneiderman [6] and Viola [8] apply a simpler normalization to zero mean and unit variance on the analysis window. Often illumination compensation demands more computational power than the classification of an image patch itself. For example, Yang [9] presented the simple linear SNoW classifier for detection which operates directly on pixel intensities. It requires only the accumulation of the outcome of 400 lookup operations to classify an image patch of size 20 × 20, while histogram equalization alone on the same patch takes more effort. Therefore we advocate the use of inherently illumination-invariant image features for object detection which convey only structural object information. We propose the new feature set of Local Structure Features, computed from a 3 × 3 pixel neighborhood using a modified version of the Census Transform [10]. We use this feature set with a four-stage classifier cascade. Each stage classifier is a linear classifier which consists of a set of lookup-tables of feature weights. Detection is carried out by scanning all possible analysis windows of size 22 × 22 with the classifier. In order to find faces of various sizes the image is repeatedly downscaled with a scaling factor of 1.25. This is done until the scaled image is smaller than the sub-window size. For typical images of size 384 × 288 pixels and a sub-window size of 22 × 22 we obtain 10 downscaled images which constitute an image pyramid. To further speed up processing the pyramid is scanned using a grid search which we introduced in our former work [2].
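To make the multi-resolution scan concrete, the following small Python sketch (not the authors' code; image size, window size and scaling factor are taken from the text) counts how many pyramid levels result from repeated downscaling. The exact count depends on how the rescaled sizes are rounded.

```python
# Minimal sketch (not the authors' code): count the pyramid levels obtained by
# repeatedly downscaling an image by the given factor until it is smaller than
# the analysis window. The exact count depends on the rounding used.
def pyramid_levels(width, height, window=22, factor=1.25):
    levels = 0
    while width >= window and height >= window:
        levels += 1
        width = int(width / factor)
        height = int(height / factor)
    return levels

print(pyramid_levels(384, 288))  # number of levels for the image size quoted above
```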

2 Feature Generation

2.1 Local Structure Features

The features used in this work are defined as structure kernels of size 3 × 3 which summarize the local spatial image structure. Within the kernel, structure information is coded as binary information {0, 1}, and the resulting binary patterns can represent oriented edges, line segments, junctions, ridges, saddle points, etc. Fig. 1 shows some examples of these structure kernels. On a local 3 × 3 lattice there exist


2^9 = 512 such kernels. Actually only 2^9 − 1 of the 2^9 possible kernels are reasonable, because the kernel with all elements 0 and the one with all elements 1 convey the same information (all pixels are equal); one of the two is therefore redundant and is excluded. At each location only the best matching kernel is used for description. The whole procedure can be thought of as a non-linear filtering where the output image is assigned the best fitting kernel index at each location. In the next section we show how the index of the best kernel can be obtained by a modified version of the census transform.

Figure 1: A randomly chosen subset of 25 out of the 2^9 − 1 possible Local Structure Kernels in a 3 × 3 neighborhood.

2.2 The Census Transform

The Census Transform (CT) is a non-parametric local transform which was first proposed by Zabih and Woodfill [10]. It is defined as an ordered set of comparisons of pixel intensities in a local neighborhood, representing which pixels have a lower intensity than the center. In general the size of the local neighborhood is not restricted, but in this work we always assume a 3 × 3 surrounding as motivated in the last section. Let N(x) define a local spatial neighborhood of the pixel at x so that x ∉ N(x). The CT then generates a bit string representing which pixels in N(x) have an intensity lower than I(x). With the assumption that pixel intensities are always zero or positive, the formal definition of the process is as follows: let a comparison function ζ(I(x), I(x')) be 1 if I(x) < I(x') and 0 otherwise, and let ⊗ denote the concatenation operation. The census transform at x is then defined as

C(x) = ⊗_{y ∈ N(x)} ζ(I(x), I(y)).   (1)

Note that C(x) is not an intensity or similarity coefficient as with linear transforms. Moreover, all bits of C(x) have the same significance level. Thus C(x) may be interpreted as the index of a structure kernel defined on N(x) with the center set to zero. In this interpretation the pixels in the kernel represent the outcome of the single census comparisons ζ(.) at the corresponding locations in the neighborhood. This interpretation links the census transform to the local structure kernels introduced in the last section. The Census Transform in its original form does not always result in the best describing kernel, as shown below, because not all kernels are computable.

2.3 The Modified Census Transform

Using the original formulation of the census transform given in Equ. (1) we can only compute a subset of 2^8 = 256 of all the 511 structure kernels defined on a 3 × 3 neighborhood. This is because the value of the center pixel of the kernel is fixed to 0 by the choice of the comparison function ζ(.) as described above. In order to obtain all the structure kernels we switch the basis of the comparison, so that we also obtain a result for the center pixel. Let N'(x) be a local spatial neighborhood of the pixel at x so that N'(x) = N(x) ∪ {x}. The intensity mean on this neighborhood is denoted by Ī(x). With this we now reformulate Equ. (1) and write the modified census transform as

Γ(x) = ⊗_{y ∈ N'(x)} ζ(Ī(x), I(y)).   (2)

With this transform we are able to determine all of the 511 structure kernels defined on a 3 × 3 neighborhood. To illustrate the modified transform, consider an image region with the pixel intensities given in the first column below. The structure kernels assigned by the original transform and the modified transform are displayed in columns 2 and 3; the transforms are taken at the center pixel of the image patch.


I(x)        C(x)        Γ(x)
1 5 1       0 0 0       0 1 0
1 5 1       0 0 0       0 1 0
1 5 1       0 0 0       0 1 0

From this example we can see that the original census transform will not capture the local image structure correctly in some cases while the modified transform assigns the right kernel.
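As an illustration, the following Python sketch (a minimal reference implementation, not the authors' optimized code) computes the modified census transform of a grayscale image and reproduces the kernel assigned to the example patch above.

```python
import numpy as np

def modified_census_transform(img):
    """Minimal sketch of the Modified Census Transform (Equ. 2) on a gray image.

    For every interior pixel the 3x3 neighborhood (center included) is compared
    against the neighborhood mean; the nine comparison bits are concatenated
    into a kernel index in [0, 510].
    """
    img = img.astype(np.float64)
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.int32)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            bits = (patch > patch.mean()).astype(np.int32).ravel()  # zeta(mean, I(y))
            index = 0
            for b in bits:                                          # concatenate the 9 bits
                index = (index << 1) | int(b)
            out[y, x] = index
    return out

# The example patch discussed above: the modified transform recovers the
# vertical-line kernel 010/010/010, i.e. index 0b010010010 = 146.
patch = np.array([[1, 5, 1], [1, 5, 1], [1, 5, 1]], dtype=np.float64)
print(modified_census_transform(np.pad(patch, 1, mode='edge'))[2, 2])  # -> 146
```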



2.4 Illumination Invariance of the Structure Kernels

Ideally, image features for detection and recognition systems should only depend on the structure of the object of interest. In a simple model of image formation the image intensity I(x) is regarded as the product of the object reflectance R(x) and the illuminance L(x) at each point x = (x, y). Additionally the camera influence can be modelled by a gain factor g and a bias term b, which are assumed to be constant over the image plane. Thus a simple image formation model is

I(x) = g L(x) R(x) + b.   (3)

Robust detection systems should only be based on the object structure, which is conveyed by its reflectance properties. But without any knowledge or assumptions about the illumination field gL(x), determining the object's reflectance field R(x) is an ill-posed problem. A popular assumption on L(x) is that it varies only smoothly in x. The proposed local structure features used in this work are also based on the assumption that gL(x) is spatially smooth. With this assumption we consider the lighting parameters to be constant in a small 3 × 3 neighborhood, L(x) = L. The application of a constant illumination and gain then defines a linear and thus monotonic transformation of the reflectance R(x). A monotonic transform preserves the order of its arguments and thus does not change the intensity order in the neighborhood. This means that the Census Transform, which relies on the local intensity order, is unaffected. The same applies to the modified Census Transform, since the local mean undergoes the same affine mapping as the pixel intensities. In Fig. 2 an illustrative example is shown. The Census Transform is visualized as an index image where the kernel index determines the pixel intensity.
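The invariance argument can be checked numerically; the toy Python snippet below (with an arbitrary example patch and made-up gain and bias values, not data from the paper) verifies that an affine intensity change leaves the modified-census kernel index unchanged.

```python
import numpy as np

# Toy check of the invariance argument: a constant gain g > 0 and bias b do not
# change which pixels lie above the local mean, so the kernel index is unchanged.
# mct_index is a small hypothetical helper for a single 3x3 patch.
def mct_index(patch):
    bits = (patch > patch.mean()).astype(int).ravel()
    return int("".join(map(str, bits)), 2)

patch = np.array([[10, 80, 20], [15, 90, 25], [12, 85, 18]], dtype=float)
g, b = 0.3, 40.0                                   # simulated illumination gain and camera bias
assert mct_index(patch) == mct_index(g * patch + b)
print(mct_index(patch))
```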

Figure 2: Example of the illumination invariance of the Census Transform (CT). Although the illumination varies considerably the CT is almost the same. The CT is visualized by regarding the kernel indices as gray values.

3 Training and Classification

3.1 Sequence of Classifiers

The face detector analyzes image patches W of size 22 × 22 pixels. In the current setup for frontal faces each window has to be classified either as face or background. For a fast detection system this decision should be computable in an efficient manner. The whole classification procedure consists of a sequence of tests performed on the analysis window. The window may be rejected as background after each stage. In the current system we use four such stages, as displayed in Fig. 3. The first stage has the lowest complexity (the results of only 20 lookup-table operations have to be accumulated) but is able to reject more than 99% of all windows as background while retaining almost all of the face locations. A similar approach was recently described by Viola [8], but there some thirty stages were necessary and especially the early stages are less powerful than in this method. Let Hj(Γ) be the classifier of the jth stage, which classifies the current analysis window, represented by the modified census features Γ, by

Hj(Γ) = Σ_{x ∈ W'} hx(Γ(x)),   (4)

where x denotes the location within the analysis window and W' ⊆ W is the set of pixel-locations with an associated pixel-classifier hx. The pixel-classifier hx, also called elementary classifier, consists of a lookup table of length 511, which is the number of possible kernel indices of the Modified Census Transform. The lookup table holds a weight for each kernel index. The response of an elementary classifier is the weight addressed by the kernel index. As the features are integer indices of the active structure kernel they lie in the range [0..510] and are directly fed into the lookup table. Fig. 4 shows one stage classifier together with the visualization of a pixel-classifier lookup table. The white dots mark the elements of W'. For training the stage classifiers we use two different algorithms. The first three stages are trained with an AdaBoost approach as described in the next section. The last stage is trained using the Winnow update rule, which is detailed in Sect. 3.3.
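The following Python fragment sketches how a stage classifier of Equ. (4) can be evaluated; the positions, lookup tables and feature window are random placeholders, not trained values.

```python
import numpy as np

# Sketch of one stage classifier H_j (Equ. 4): sum of lookup-table weights,
# one table per selected pixel position. All values below are illustrative.
rng = np.random.default_rng(0)

window = rng.integers(0, 511, size=(22, 22))          # MCT kernel indices of one analysis window
positions = [(3, 4), (10, 18), (15, 7)]               # W': positions with an elementary classifier
tables = {p: rng.random(511) for p in positions}      # lookup table of 511 weights per position

score = sum(tables[p][window[p]] for p in positions)  # H_j(Gamma) = sum_x h_x(Gamma(x))
print(score)
```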

Figure 3: The detector has four classifiers of increasing complexity. The white dots mark positions of elementary classifiers in the analysis window. Only features from these positions are used for classification. Each stage has the ability to reject the current analysis window as background or pass it on to the next stage, depending on the outcome of a thresholding operation.

3.2 Training of a Stage Classifier

The training of a stage classifier is done using a version of the boosting algorithm described in [1]. In boosting a number of weak classifiers are combined to form a final strong classifier. The goodness of a weak classifier is measured by its error εt on the training set. A stage classifier consists of a set of lookup tables {hx ; x ∈ W'} for the positions x = (x, y) chosen by the algorithm. Each lookup table holds a weight for each kernel index γ, with 0 ≤ γ ≤ 510. Table 1 gives the pseudo code for the AdaBoost following the notation in [1]. One so-called weak classifier wx for a single pixel position is generated in every boosting round as described in Tab. 2. For the construction of a weak classifier we first count the kernel index statistics at each position with respect to the boosting weights Dt(i) of the training data. The resulting weighted histograms gt0 and gt1 determine whether a single feature should be associated with the face or non-face class. If it is more likely to show up in the face class the weak classifier wx is assigned 0 at position γ, else 1. Finally, in every boosting loop the single-feature weak classifier at position x with the lowest boosting error εt is chosen, with regard to the maximal number of feature positions allowed. The maximal number of classifier locations is limited on each stage. In the current version we used a maximum of 20 locations for the first stage, 40 for the second and all 484 (the window has 22 · 22 = 484 locations) for the third stage. It is noticeable that not all of the possible locations are used in the third stage (see Fig. 3) even after 1500 boosting cycles. The final pixel classifier hx is the weighted sum of all weak classifiers wt at location x. The final stage classifier Hj is the sum of all pixel classifiers hx, see Equ. (4). The decision rule for a valid face at stage j is

Hj(Γ) ≤ Tj,   (5)



where Tj is the score threshold at stage j. The score threshold Tj is tuned so that it maximizes the detection rate on a test database different from the training set.
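Putting the stages together, a cascade evaluation might look like the sketch below (illustrative only; real stage classifiers and thresholds come from training): a window is passed on only while every stage score satisfies the thresholding test of Equ. (5).

```python
# Sketch of the four-stage rejection cascade. A window survives only while every
# stage score satisfies H_j(window) <= T_j (Equ. 5); otherwise it is rejected.
def classify_window(window, stages, thresholds):
    for h_j, t_j in zip(stages, thresholds):
        if h_j(window) > t_j:          # stage test failed -> background
            return False
    return True                        # passed all stages -> face

# toy usage with dummy stage functions and thresholds
stages = [lambda w: 0.1, lambda w: 0.2, lambda w: 0.3, lambda w: 0.4]
print(classify_window(None, stages, thresholds=[1.0, 1.0, 1.0, 1.0]))  # True
```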

3.3 Training of the last Stage

In the fourth stage we use a different training algorithm. It produces the same kind of lookup-tables, but it uses all of the given pixel locations. The decision function for this classifier is

H(Γ) = Hb(Γ) − Hf(Γ) = Σ_{x ∈ W} hbx(Γ(x)) − Σ_{x ∈ W} hfx(Γ(x)),   (6)

where hfx are the weights for the face class and hbx those of the background class at location x. The pixel-classifier in this case consists of the difference of the two class-specific weight tables. As the summation is performed in the same domain, the pixel-weights can be summed up at training time, hx = hbx − hfx. Two separate weight-tables are only necessary for training, as we shall see. The final decision is made by applying a threshold to H(Γ), see Equ. (5), which is determined to achieve a given error or detection rate on a database different from the training set.
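A sketch of the last-stage decision of Equ. (6), with the two class-specific tables collapsed into a single table per position as described above; all weights, positions and the threshold are illustrative placeholders.

```python
import numpy as np

# Sketch of the last-stage decision (Equ. 6): the face and background weight
# tables are collapsed into one table per position, h_x = h^b_x - h^f_x, and the
# window is accepted if the summed score stays below a threshold (cf. Equ. 5).
rng = np.random.default_rng(1)
W = [(y, x) for y in range(22) for x in range(22)]      # all 484 positions are used
h_face = {p: rng.random(511) for p in W}                # placeholder trained tables
h_bg = {p: rng.random(511) for p in W}
h = {p: h_bg[p] - h_face[p] for p in W}                 # merged at training time

window = rng.integers(0, 511, size=(22, 22))            # MCT indices of one window
score = sum(h[p][window[p]] for p in W)
is_face = score <= 0.0                                   # hypothetical threshold T
print(score, is_face)
```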

Figure 4: Illustration of the lookup-table of an elementary classifier hx(γ); the example shows the table h(18,16)(γ) plotted over the kernel index γ ∈ [0, 510].


• Given (Γ1, c1), . . . , (Γm, cm) where ci = 0 for Γi ∈ F and ci = 1 for Γi ∈ B, where F and B are the classes of faces and non-faces.

• Initialize D1(i) = 1/(2l) for ci = 0 and D1(i) = 1/(2n) for ci = 1, where l and n are the numbers of faces and non-faces.

• For t = 1, . . . , T:

  – Generate the weak classifier wt at position xt with error εt and distribution Dt as shown in Table 2.

  – Choose αt = (1/2) ln((1 − εt) / εt).

  – Update the distribution:
    Dt+1(i) = (Dt(i) / Zt) · e^(−αt) if wt(Γi(x)) = ci,
    Dt+1(i) = (Dt(i) / Zt) · e^(+αt) if wt(Γi(x)) ≠ ci,
    where Zt is chosen so that Dt+1 is a distribution.

• The resulting elementary classifier of a single feature position x is obtained as a combination of the appropriate weak classifiers:
  hx(γ) = Σ_{t=1..T} αt wt(γ) I(x = xt),
  where I() is the indicator function that takes 1 if the argument is true and 0 otherwise.

• The final strong classifier is based on the final face model:
  H(Γ) = Σ_x hx(Γ(x)).

Table 1: Boosting the sparse local structure net.

• Generate tables of local weighted kernel indices from faces and non-faces:
  gt0(x, γ) = Σ_i Dt(i) I(Γi(x) = γ) I(ci = 0),
  gt1(x, γ) = Σ_i Dt(i) I(Γi(x) = γ) I(ci = 1),
  where I() is the indicator function that takes 1 if the argument is true and 0 otherwise.

• Calculate the error εt for each look-up table (position x):
  εt(x) = Σ_γ min(gt0(x, γ), gt1(x, γ)).

• Select the best position xt of loop t:
  xt = { x | εt(x) = min_x' εt(x') } if |W't| < n,
  xt = { x | εt(x) = min_{x' ∈ W't} εt(x') } else,
  where n is the maximal number of positions allowed and W't is the set of locations already chosen up to loop t, thus W't+1 = {xt} ∪ W't and W'1 = {}.

• Create the look-up table of the weak classifier of loop t at position xt:
  wt(γ) = 0 if gt0(xt, γ) > gt1(xt, γ), and wt(γ) = 1 else.

Table 2: Training of a weak structure feature classifier.
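The core step of Table 2, building the weighted kernel-index histograms and deriving the weak-classifier lookup table for one position, can be sketched in Python as follows (random placeholder data; not the authors' implementation).

```python
import numpy as np

# Sketch of Table 2 for one fixed position x: build the weighted kernel-index
# histograms g^0, g^1, derive the weak-classifier table and its boosting error.
rng = np.random.default_rng(2)
m = 1000
features = rng.integers(0, 511, size=m)      # Gamma_i(x) at one position, placeholder data
labels = rng.integers(0, 2, size=m)          # c_i: 0 = face, 1 = non-face
D = np.full(m, 1.0 / m)                      # boosting weights D_t(i)

g0 = np.zeros(511)
g1 = np.zeros(511)
np.add.at(g0, features[labels == 0], D[labels == 0])   # weighted face histogram
np.add.at(g1, features[labels == 1], D[labels == 1])   # weighted non-face histogram

w = (g0 <= g1).astype(int)                   # w_t(gamma): 0 where the index favours the face class
error = np.minimum(g0, g1).sum()             # epsilon_t(x)
print(error)
```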

The pixel classifier hx has the same structure as the boosted pixel classifiers (see Fig. 4) described in the last section. The main difference is the fundamentally different training procedure, which leads to a different error distribution. The two sets of weight-tables {hfx} and {hbx} are trained using an iterative procedure. Initially all weights are set to zero. If a certain weight is addressed for the first time during training it is set to a start value of 1. The adaptation of the weights is mistake-driven: only when the current training pattern is misclassified are the weights addressed by that pattern changed, and the change applies immediately (online update policy). The weight update is done with the Winnow Update Rule [9], which is a multiplicative update rule. There are three training parameters, namely a threshold TΘ, a promotion parameter α > 1 and a demotion parameter 0 < β < 1. For training we fixed the threshold to TΘ = 128 in all of our experiments. As in Table 1, let (Γ1, c1), . . . , (Γm, cm) be given, where ci = 0 for Γi ∈ F and ci = 1 for Γi ∈ B, F being the class of faces and B that of non-faces. If Hf(Γ) ≤ TΘ, which means that the pattern is rejected from the face class although its true label is ci = 0, then the involved weights are increased,

∀x ∈ W, hfx(Γi(x)) ← α hfx(Γi(x)).   (7)

If ci = 1 and Hf(Γ) > TΘ, i.e. a non-face pattern is accepted by the face model, all involved weights are decreased,

∀x ∈ W, hfx(Γi(x)) ← β hfx(Γi(x)).   (8)

The same applies for the non-face class. The training procedure is repeated iteratively on a training set until no more weight changes occur. The update parameters have been set to α = 1.01 and β = 0.99 for this work.
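For illustration, a single Winnow update step for the face weight table might be sketched as below (placeholder data; the threshold and the promotion and demotion parameters are the values given above, and the tables are simply initialised to 1 here).

```python
import numpy as np

# Sketch of one multiplicative Winnow update for the face table: promote when a
# face pattern scores below the threshold, demote when a non-face pattern scores
# above it. All data are illustrative placeholders.
rng = np.random.default_rng(3)
T_theta, alpha, beta = 128.0, 1.01, 0.99

positions = [(y, x) for y in range(22) for x in range(22)]
h_face = {p: np.ones(511) for p in positions}             # weights simply initialised to 1 here

def score_face(window):
    return sum(h_face[p][window[p]] for p in positions)   # H_f(Gamma)

window, label = rng.integers(0, 511, size=(22, 22)), 0    # label 0 = face
if label == 0 and score_face(window) <= T_theta:          # face rejected -> promote (Equ. 7)
    for p in positions:
        h_face[p][window[p]] *= alpha
elif label == 1 and score_face(window) > T_theta:         # non-face accepted -> demote (Equ. 8)
    for p in positions:
        h_face[p][window[p]] *= beta
```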



4 Experiments

4.1 Training the Detector

For training of the detector we use a set of about 6000 manually cropped upright face images together with their mirrored versions. The non-face class is initially represented by 2000 non-face images which were randomly collected. Each single classifier is then trained using a bootstrapping approach similar to that described in [5] to increase the number of images in the non-face set. The bootstrapping is carried out 5 times on a set of 4000 images containing no faces.
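The bootstrapping loop can be summarized by the following sketch; train_stage and detect are hypothetical helpers standing in for the stage training and the detector run, and are not part of the paper.

```python
# Sketch of the bootstrap loop used to grow the non-face training set: after each
# training round the current detector is run over face-free images and its false
# detections are added as new negative examples.
# train_stage, detect and the image collections are hypothetical placeholders.
def bootstrap(faces, nonfaces, background_images, rounds=5):
    classifier = train_stage(faces, nonfaces)
    for _ in range(rounds):
        false_alarms = [w for img in background_images
                        for w in detect(classifier, img)]   # every hit on a face-free image is a mistake
        nonfaces = nonfaces + false_alarms
        classifier = train_stage(faces, nonfaces)
    return classifier
```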

4.2 Detection Results

The detector is evaluated on two upright face databases commonly used in the field, namely the CMU+MIT database [5], which has 130 images showing 483 upright faces, and the BioID database [3], which has 1521 images showing 1522 frontal faces. The detection performance on both sets is shown in Fig. 5. With our multi-resolution setting and analysis window size the BioID database yields approximately 335 · 10^6 and the MIT+CMU set 80 · 10^6 analysis windows to classify. If we sum up the maximal number of false detects from Fig. 5 on both sets and relate it to the total number of windows classified we obtain a false positive rate of 1.3 · 10^−7 %.
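As a rough arithmetic check of the figure quoted above, the ratio of false accepts to classified windows can be computed directly; the summed false-accept count is an assumption read off Fig. 5, not a number stated in the text.

```python
# Rough check: ratio of false accepts to the total number of analysis windows.
windows = 335e6 + 80e6           # BioID + MIT+CMU analysis windows (from the text)
false_accepts = 55               # assumed sum of the maximal false accepts in Fig. 5
print(false_accepts / windows)   # ~1.3e-07
```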

Figure 5: Detection results on the CMU+MIT (130 images) and the BioID (1526 images) databases, plotted as detection rate over the number of false accepts.

5 Summary and Discussion

In this work we propose a real-time detection system for frontal faces in gray images. The detection is based on Local Structure Features which are computed with the Modified Census Transform, both introduced here. For classification we propose a four-stage classifier cascade consisting of simple linear classifiers. Each classifier consists of a set of lookup-tables of feature weights, ranging from 20 in the first stage to all 484 in the last. Though evaluating only 20 features of an analysis window, the first stage is able to correctly reject over 99% of all background locations. Together with the coarse-to-fine grid search introduced in our former work this leads to an efficient real-time detector with high detection rates and very few false positives. The system is able to analyze a video stream of spatial resolution 384 × 288 at frame rate (0.06 sec/frame) on a Pentium 2 GHz computer. This static detection algorithm achieves detection rates of more than 90% on three databases which are widely used in the detection community. While achieving comparable results on the CMU sets, we reach the best published results on the BioID database, see [3] for comparison.

References

[1] Yoav Freund and Robert E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, number 14, pages 771–780, September 1999.
[2] Bernhard Fröba and Christian Küblbeck. Robust face detection at video frame rate based on edge orientation features. In International Conference on Automatic Face and Gesture Recognition (FG '02), pages 342–347, Washington D.C., May 2002.
[3] Klaus J. Kirchberg, Oliver Jesorsky, and Robert W. Frischholz. Genetic optimisation for Hausdorff-distance based face localisation. In Intl. Workshop on Biometric Authentication 2002 (ECCV '2002), pages 103–111, Copenhagen, Denmark, June 2002.
[4] Edgar E. Osuna, Robert Freund, and Federico Girosi. Support vector machines: Training and application. Technical Report 1602, Massachusetts Institute of Technology, March 1997.
[5] Henry A. Rowley. Neural Network-Based Face Detection. PhD thesis, Carnegie Mellon University, Pittsburgh, 1999.
[6] Henry Schneiderman and Takeo Kanade. Probabilistic modeling of local appearance and spatial relationship for object recognition. In International Conference on Computer Vision and Pattern Recognition, 1998.
[7] Kah Kay Sung. Learning and Example Selection for Object and Pattern Detection. PhD thesis, Massachusetts Institute of Technology, January 1996.
[8] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, 2001.
[9] Ming-Hsuan Yang, Dan Roth, and Narendra Ahuja. A SNoW-based face detector. In Advances in Neural Information Processing Systems 12 (NIPS 12), pages 855–861. MIT Press, 2000.
[10] Ramin Zabih and John Woodfill. A non-parametric approach to visual correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996.




