Live facial feature extraction

Science in China Series F: Information Sciences © 2008



ZHAO JieYu

Research Institute of Computer Science and Technology, Ningbo University, Ningbo 315211, China (email: [email protected])

Precise facial feature extraction is essential to high-level face recognition and expression analysis. This paper presents a novel method for real-time geometric facial feature extraction from live video. The input image is viewed as a weighted graph, and the pixels corresponding to the edges of the facial components (the mouth, eyes, brows, and nose) are segmented by means of random walks on that graph. The graph has an 8-connected lattice structure, and the weight associated with each edge reflects the likelihood that a random walker will cross it. The random walks simulate an anisotropic diffusion process that filters out noise while preserving the facial expression pixels. Seeds for the segmentation are obtained from a color and motion detector. The segmented facial pixels are represented with linked lists in their original geometric form and grouped into parts corresponding to the facial components. For the convenience of implementing high-level vision, the geometric description of the facial component pixels is further decomposed into shape and registration information, where shape is defined as the geometric information that is invariant under registration transformations such as translation, rotation, and isotropic scaling. Statistical shape analysis with the Procrustes shape distance measure is carried out to capture global facial features, and a Bayesian approach is used to incorporate high-level prior knowledge of face structure. Experimental results show that the proposed method extracts precise geometric facial features from live video in real time, and that the extraction is robust against illumination changes, scale variation, head rotations, and hand interference.

Keywords: live facial feature extraction, random walks, anisotropic diffusion process, statistical shape analysis

1 Introduction

Face processing has been an active research topic for decades[1―4]. During the past few years it has received significant attention due to the availability of feasible technologies and the large number of commercial and law-enforcement applications.

Received March 14, 2007; accepted July 10, 2007
doi: 10.1007/s11432-008-0049-6
Supported by the National Natural Science Foundation of China (Grant No. 60672071), the Ministry of Science and Technology (Grant No. 2005CCA04400) and the Ministry of Education (Grant No. NCET-05-0534)


As one of the most important aspects of face processing, facial feature extraction is fundamental to the success of vision-based human-computer interaction. It plays a key role in many applications, such as facial expression analysis, face recognition, biometric authentication, animation, and teleconferencing. Small errors or loss of information in facial feature extraction can easily lead to wrong results in tasks such as identity verification and facial expression recognition. Accurate and rapid facial feature extraction remains a challenging task[2,5―7] because of variability in scale and orientation, illumination changes, and the limited quality of video images.

In this paper, we address the problem of precise facial feature extraction from live video. Unlike most commonly used face processing approaches, which directly adopt abstract features or track a few simple landmark points on the face, we represent the facial components of the mouth, eyes, brows, and nose precisely with pixels and therefore keep all detailed geometric information of the components. To meet the demands of online applications of vision-based human-computer interaction, the facial feature pixel segmentation is carried out in real time. Our work is limited to front-view facial images with moderate head rotations. For further implementation of high-level vision tasks, the geometric description of facial component pixels is decomposed into shape and registration information through statistical shape analysis. In this way, we capture features that are invariant under transformations such as translation, rotation, and isotropic scaling. A Bayesian approach is used to incorporate high-level prior knowledge of facial structure and to enhance the robustness of the system.

Image processing methods based on graph theory have become popular in recent years[8―16]. These graph-based algorithms exhibit fast computation and precise results. Images are treated as weighted graphs of vertices and edges: vertices reflect the states of image pixels, and weighted edges represent the relationships between pixels. Graph cut[14] and random walk[11,13,15] techniques provide powerful tools for interactive image segmentation where some prelabeled points (seeds) are given. The graph cut approach turns the segmentation problem into finding a minimum-weight cut separating the differently labeled points. Segmentation with random walks assigns each pixel according to the probability that a random walker starting from it first reaches each seed point; this is equivalent to solving the Dirichlet problem with boundary conditions at the locations of the seed points[9]. However, it would be infeasible for a video image segmentation algorithm to find optimal solutions to the corresponding optimization problem in real time. In our approach, we use random walks on the weighted graph to approximately implement an anisotropic diffusion process[13] with a limited number of steps. It simultaneously filters out noise and smoothes the non-facial pixels while preserving the facial feature pixels.

This paper is organized as follows. Section 2 explains the basics of graph representation and random walks on a weighted graph. Section 3 presents the random-walk process for facial feature pixel segmentation. Section 4 describes the statistical shape analysis method. The experimental setup and results are given in section 5.

2 Graph representation and random walk

A natural representation for the image is based on a weighted graph, where the nodes are image pixels and the edge weights encode the connecting information between each pair of nodes. In this paper, we use a graph with an 8-connected lattice structure. The weight associated with an edge is high where it reflects strong facial features and low where it reflects weak ones.

We now give a formal description of the graph model and of random walks on a weighted graph. A graph G = (V, E) consists of a countable set of vertices V and a set of edges E. An edge e ∈ E is an unordered pair e = ⟨x, y⟩ = ⟨y, x⟩ with x, y ∈ V. If x and y are connected by an edge, this neighbor relation is denoted x ~ y. A weighted graph is G_w = (G, w), where G is some graph and w is a function w: E(G) → ℝ₊. A random walk on a graph is a process that takes values in V and, at every step, looks at the neighboring vertices and their edge weights and chooses one of them with a probability equal to the ratio between that edge's weight and the sum of all of them.

For e_i ∈ E, let w_i = w(e_i) be the weight associated with edge e_i. For v ∈ V, let N(v) = {e ∈ E(G): e = ⟨y, v⟩ for some y ∈ V} be the set of edges adjacent to v, and let $W(v) = \sum_{e_i \in N(v)} w_i$ be the total weight around v. Then a random walk X on G_w = (G, w) is a process that takes values in V with transition probabilities

$$P(X_{n+1} = v \mid X_n) = I_{\{v \sim X_n\}} \, \frac{w(\langle v, X_n \rangle)}{W(X_n)}, \tag{1}$$

where $I_{\{v \sim X_n\}}$ is the indicator function of the event that v and X_n are connected by an edge.
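To make the transition rule in eq. (1) concrete, the following minimal Python sketch builds the 8-connected neighborhood of a pixel and normalizes the edge weights into transition probabilities. The Gaussian intensity weight used here is only a placeholder (the weighting function actually used for facial features is eq. (9) in section 5), and the function and parameter names are illustrative, not from the paper.

```python
import numpy as np

# 8-connected lattice: offsets to the eight neighbors of a pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def edge_weight(img, p, q, beta=1.0):
    """Placeholder Gaussian weight between neighboring pixels p and q."""
    diff = float(img[p]) - float(img[q])
    return np.exp(-beta * diff * diff)

def transition_probs(img, p, beta=1.0):
    """Eq. (1): P(X_{n+1} = v | X_n = p) = w(<v, p>) / W(p) for neighbors v of p."""
    h, w = img.shape
    neighbors, weights = [], []
    for dy, dx in OFFSETS:
        q = (p[0] + dy, p[1] + dx)
        if 0 <= q[0] < h and 0 <= q[1] < w:
            neighbors.append(q)
            weights.append(edge_weight(img, p, q, beta))
    W = sum(weights)  # total weight W(p) around p
    return neighbors, [wi / W for wi in weights]
```

A random walk step then simply samples one neighbor according to these probabilities.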

3 Facial feature pixel segmentation

Random walks on a weighted graph can be used effectively for image filtering and segmentation. Such a system is usually implemented with the anisotropic diffusion equation introduced by Perona and Malik[13] to filter the image, or built on the Laplace equation to perform image segmentation tasks[11]. Given some prelabeled pixels (seeds), the algorithm assigns each remaining pixel the label of the seed that a random walker starting from that pixel would be most likely to reach first. The goal of most anisotropic diffusion algorithms is to smooth the image within homogeneous regions but not across boundaries, while the goal of image segmentation algorithms is to label the homogeneous regions. In this paper, we extract facial feature pixels from live video, whose frames are typically noisy, so we must address both goals simultaneously: filtering out the noise while preserving the facial expression pixels.

Let F(x, y, t) denote a real-valued function representing the current image; the diffusion process can be expressed as

$$\frac{\partial F(x,y,t)}{\partial t} = c \left[ \frac{\partial^2 F(x,y,t)}{\partial x^2} + \frac{\partial^2 F(x,y,t)}{\partial y^2} \right], \tag{2}$$

where x, y are the image coordinates, t denotes time, and c is the conduction coefficient. To preserve the image structure (the facial edges) while smoothing, the conduction coefficient c is redefined to be space-varying. A common choice of the conduction coefficient is



$$c(x, y, t) = \exp\left( -\frac{\|\nabla F(x,y,t)\|^2}{2\lambda^2} \right). \tag{3}$$

Thus, the conduction coefficient has a large value in a homogeneous region to encourage smoothing and a small value at facial edges to preserve facial features.
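As an illustration only, one explicit time step of the diffusion in eqs. (2) and (3), with the space-varying coefficient moved inside the divergence as in the Perona-Malik formulation, could be sketched as follows; the step size `dt` and the edge scale `lam` are illustrative choices, not values from the paper.

```python
import numpy as np

def diffusion_step(F, lam=10.0, dt=0.2):
    """One explicit update of dF/dt = div(c * grad F), with c from eq. (3)."""
    F = F.astype(float)
    gy, gx = np.gradient(F)                        # image gradient
    c = np.exp(-(gx**2 + gy**2) / (2.0 * lam**2))  # eq. (3): small at edges
    # Divergence of the conduction-weighted gradient field.
    div = np.gradient(c * gy, axis=0) + np.gradient(c * gx, axis=1)
    return F + dt * div
```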

The diffusion process can be simulated with a group of random walks spread out on the weighted graph defined on the image. Let G_w = (G, w) be a weighted graph defined on the image with an 8-connected lattice structure, and let $w_{i,j}$ be the weight associated with the edge connecting vertices i and j. Assuming the random walk on the graph is self-avoiding, i.e., it does not pass through the same point twice, the one-step transition probability from X_0 to X_1 is

$$P[1, X_0, X_1] = \frac{\exp\{-\beta w_{0,1}\}}{\sum_{v_k, v_{k+1} \in \{v_0 \to v_1\}} \exp\{-\beta w_{k,k+1}\}}, \tag{4}$$

where v_0 and v_1 are two specific neighboring points and {v_0 → v_1} is the set of points on all trajectories leading from v_0 to v_1. A smooth operator S is defined as

$$S(v_0) = \sum_{\{v_0 \to v_n\}} P[n, v_0 \to v_n] \cdot F(v_n),$$

where the sum is taken over all possible trajectories starting from v_0 and ending at v_n. The transition probability from X_0 to X_n is given by

$$P[n, X_0, X_n] = \frac{\exp\{-\beta (w_{0,1} + w_{1,2} + \cdots + w_{n-1,n})\}}{\sum_{\{v_0 \to v_n\}} \exp\{-\beta \sum_j w_{j,j+1}\}}. \tag{5}$$

Thus the smooth operator S can be rewritten as

$$S(v_0) = \sum_{\{v_0 \to v_n\}} F(v_n) \cdot \frac{\exp\{-\beta (w_{0,1} + w_{1,2} + \cdots + w_{n-1,n})\}}{\sum_{\{v_0 \to v_n\}} \exp\{-\beta \sum_j w_{j,j+1}\}}. \tag{6}$$

Here β ≥ 0 is a control parameter: the operator S reduces to a moving average when β = 0, and as β → ∞ it assigns to the point v_0 the value F(v_k), where

$$k = \arg\min_{\{v_0 \to v_k\}} \Big\{ \sum_j w_{j,j+1} \Big\}.$$

Note that the exponent calculation in eqs. (4)―(6) differs from the transition probability defined in eq. (1). This is deliberately designed to accelerate the computation, because the random walks take place only in the local area around positive seed points, and it is unnecessary to compute the exponential function for all the weights in the graph. In this way the random walkers eliminate noise consisting of only a few pixels (fewer than n) while keeping the mutually connected feature pixels intact. The smooth operator works like a morphological filter, but it does not require a structural element to be defined in advance.
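The following is a minimal Monte Carlo sketch of the smooth operator S in eq. (6): it samples a few short self-avoiding walks from v_0, weights each trajectory by exp(−β Σ w), and averages F at the endpoints. The walk count, step count, and the representation of F and the weights are all illustrative assumptions, not specifics from the paper.

```python
import math
import random

def neighbors8(p):
    """8-connected neighbors of pixel p = (y, x)."""
    y, x = p
    return [(y + dy, x + dx)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]

def smooth_operator(F, weight, v0, n_steps=5, n_walks=50, beta=1.0):
    """Monte Carlo estimate of S(v0); F maps pixels to values, weight(p, q) gives edge weights."""
    num = den = 0.0
    for _ in range(n_walks):
        path, cost, v = [v0], 0.0, v0
        for _ in range(n_steps):
            # Self-avoiding: never revisit a pixel on this trajectory.
            nbrs = [q for q in neighbors8(v) if q in F and q not in path]
            if not nbrs:
                break
            q = random.choice(nbrs)
            cost += weight(v, q)
            path.append(q)
            v = q
        w = math.exp(-beta * cost)   # trajectory weight, as in eq. (5)
        num += w * F[v]
        den += w
    return num / den if den else F[v0]
```

As β → ∞ the low-cost trajectories dominate the average, reproducing the arg-min behavior noted above.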



4 Statistical shape analysis

Shape plays a key role in both human and machine vision. Here, we define shape as the geometric information that is invariant under transformations such as translation, rotation, and isotropic scaling. Analysis of the shapes of complex visual objects is a very challenging task; traditional image analysis based on pixel statistics encounters great difficulty and has achieved only limited success in this area.

Statistical shape analysis[17―19] provides a powerful tool for high-level computer vision given raw images of pixels. By carrying out statistical shape analysis, we are able to (a) capture features that are invariant under the transformations; (b) combine local features into more complex global features; and (c) learn shape models with human priors. The global invariant features obtained from statistical shape analysis are essential to object recognition.

In statistical shape analysis, the geometric description of a visual object consists of two parts: the registration information and the shape information. The registration information is usually taken as the Euclidean similarity transformations (translation, rotation, and isotropic scale); the shape information is invariant under these registration transformations. Either part may be important or trivial depending on the task: for person-independent facial expression recognition, the shape information is of primary interest, while for person-dependent face recognition both shape and registration information may be important.

In this section, we focus on statistical shape analysis of facial features based on the segmented facial pixels (two-dimensional point sets). We first introduce the statistical shape analysis framework. Let X be the matrix of Cartesian coordinates of the points representing an object. There are three steps to obtain the shape of the object:

1. Remove location by centering: X_c = CX.

2. Remove size by rescaling: Z = X_c / S(X_c) = CX / S(CX), where S(X) is the size of X.

3. Obtain the shape [X] by identifying all rotated versions as an equivalence class: [X] = {ZΓ : Γ ∈ SO(m)}, where Γ is a rotation, m is the dimension, and SO(m) is the set of all m×m rotation matrices.

Another important issue in statistical shape analysis is the measurement of shape distance. In this paper, we adopt the Procrustes/Riemannian metric[17]:

$$\rho(X_1, X_2) = 2 \arcsin\big( d_P(X_1, X_2) / 2 \big), \qquad (0 \le \rho \le \pi/2),$$

where

$$d_P(X_1, X_2) = \inf_{\Gamma \in SO(m)} \| Z_2 - Z_1 \Gamma \|.$$

For two-dimensional point sets, if landmarks are available for two objects z = (z_1, z_2, ..., z_k) and u = (u_1, u_2, ..., u_k), the Procrustes shape distance is

$$\rho(z, u) = \arccos \frac{\big| \sum_j (z_j - \bar z)^* (u_j - \bar u) \big|}{\big( \sum_j |z_j - \bar z|^2 \sum_j |u_j - \bar u|^2 \big)^{1/2}}, \tag{7}$$

where z* is the complex conjugate of z^T.
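A compact sketch of eq. (7) for two 2-D landmark sets represented as complex vectors (one landmark per complex number); both inputs are assumed to have the same length and non-zero centred norms.

```python
import numpy as np

def procrustes_distance(z, u):
    """Procrustes shape distance of eq. (7) between complex landmark vectors."""
    z = np.asarray(z, dtype=complex)
    u = np.asarray(u, dtype=complex)
    z = z - z.mean()                    # remove location
    u = u - u.mean()
    corr = abs(np.vdot(z, u))           # |sum_j conj(z_j - zbar) (u_j - ubar)|
    norm = np.sqrt((np.abs(z)**2).sum() * (np.abs(u)**2).sum())
    return np.arccos(np.clip(corr / norm, 0.0, 1.0))
```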



The detailed process of calculating the Procrustes shape distance for two-dimensional point sets is as follows:

1. Compute the centroid of each shape:

$$(\bar x, \bar y) = \Big( \frac{1}{n} \sum_{j=1}^{n} x_j, \; \frac{1}{n} \sum_{j=1}^{n} y_j \Big).$$

2. Rescale each shape to have equal size, using the Frobenius norm as the shape size metric:

$$S(X) = \sqrt{ \sum_{j=1}^{n} \big[ (x_j - \bar x)^2 + (y_j - \bar y)^2 \big] }.$$

3. Align the two shapes with respect to position at their centroids, then align orientation using a singular value decomposition (SVD): (a) arrange the size- and position-aligned z and u as n×k matrices (in the planar case k = 2); (b) carry out the SVD, UDV^T, of z^T u in order to maximize the correlation between the two sets of landmarks; (c) align with respect to orientation by rotation, where the rotation matrix that optimally superimposes z upon u is

$$VU^{\mathrm T} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}.$$

Modeling shape variation is essential for extracting further global features. A popular method to describe shape variation is to examine principal components from the least-squares matching of geometrical objects. Given n objects of k landmarks in m real dimensions, i.e., k×m matrices T_1, ..., T_n with $T_i \in \mathbb{R}^{k \times m}$, Procrustes matching involves least-squares matching to give the fitted configurations $\hat T_j$:

$$\hat\mu = \arg\inf_{\mu : S(\mu) = 1} \; \inf_{r_j > 0, \, \Gamma_j \in SO(m), \, b_j} \sum_j \big\| \mu - r_j T_j \Gamma_j - b_j 1_k \big\|^2, \tag{8}$$

where the fitted configurations are $\hat T_j = \hat r_j T_j \hat\Gamma_j + \hat b_j 1_k$. If the variations are small, the shapes lie approximately in a linear space (the Procrustes tangent space)[20]. After matching, the mean is $\hat\mu = \frac{1}{n} \sum \hat T_i$ and the estimated covariance matrix is

$$\hat\Sigma = \frac{1}{n} \sum_i V(\hat T_i - \hat\mu) \{ V(\hat T_i - \hat\mu) \}^{\mathrm T},$$

where V(T) = vec(T). The variability can then be examined through the principal components of the Procrustes-matched configurations, i.e., through the eigendecomposition of $\hat\Sigma$[20].

For a task such as facial expression recognition, one may wish to use prior knowledge about the structure of the object (the facial structure). In this case, it is appropriate to adopt a Bayesian learning approach with deformable templates[21]. Let S represent the deformable template of the parameterized object with prior probability distribution π(S), and let I be the observed image; the shape model is represented by the likelihood L(I | S), which can be obtained through the statistical shape analysis and learning process. By Bayes' rule, the posterior density π(S | I) of the deformable template S given the observed image I is π(S | I) ∝ L(I | S)π(S). There is a wide variety of choices for deformable template representations, such as geometrical parameter templates, landmark distributions, and continuous outline templates[21].
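As a sketch of the variability analysis described above (assuming the configurations have already been Procrustes matched, e.g., by solving eq. (8)), the principal components can be examined as follows; the input format is an illustrative assumption.

```python
import numpy as np

def shape_pca(T_hat):
    """T_hat: list of Procrustes-matched k x m arrays; returns mean and PCA of residuals."""
    X = np.stack([T.ravel() for T in T_hat])   # rows are vec(T_i)
    mu = X.mean(axis=0)                        # vectorized Procrustes mean
    R = X - mu
    cov = R.T @ R / len(T_hat)                 # estimated covariance Sigma-hat
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]            # largest shape variation first
    return mu, evals[order], evecs[:, order]
```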



5 Experimental setup and results

This section describes the implementation of the facial feature extraction and analysis system based on random walks and statistical shape analysis, along with the experiments carried out to validate the proposed method.

A common step in graph-based algorithms for image analysis is to define an appropriate function that maps changes in image pixels to the edge weights of the graph, so that the image structure is captured by the weighted graph. A widely used choice is the Gaussian weighting function[11]

$$w_{ij} = \exp\big( -\beta (I_i - I_j)^2 \big),$$

where I_i and I_j are the image intensities at pixels i and j, and β is a control parameter. Obviously, this simple function is not able to capture the facial features. To handle the color and temporal information that reflect the facial features, we define the weighting function as follows:

$$w_{ij} = \exp\left( -\frac{\beta \big( |H_i - H_j| + |H_i - H_0| + |H_j - H_0| \big)}{1 + \max_{k \in N_i} |I_i - I_k| + \max_{k \in N_j} |I_j - I_k| + \Delta I_i + \Delta I_j} \right), \tag{9}$$

where H_i and H_j are the hues at pixels i and j, H_0 is the hue of skin color, N_i and N_j are the sets of neighboring points of pixels i and j, and ΔI_i and ΔI_j are the temporal intensity changes at pixels i and j.

The random walk used here is slightly different from that used for interactive image segmentation[11]. Owing to limited computational resources, the random walk takes place only around the seed points, with a limited number of steps. Seed points are chosen from the pixels connected by exceptionally large weights (positive seeds) and those connected by exceptionally small weights (negative seeds). The random walks filter out the noise and weaken the non-facial pixels while preserving the facial feature pixels. To accelerate the process, the exponential in eq. (9) is not evaluated over the whole graph; it is computed only when a random walk takes place, as in eqs. (5) and (6).
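A minimal sketch of the weight in eq. (9) for a single edge (i, j): H, I, and dI are assumed to be float-valued hue, intensity, and temporal-difference images, and h0 the skin hue. All naming conventions here are illustrative assumptions, not code from the paper.

```python
import numpy as np

def neighbors8(p, shape):
    """In-bounds 8-connected neighbors of pixel p = (y, x)."""
    y, x = p
    return [(y + dy, x + dx)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)
            and 0 <= y + dy < shape[0] and 0 <= x + dx < shape[1]]

def facial_weight(H, I, dI, i, j, h0, beta=1.0):
    """Edge weight of eq. (9) between neighboring pixels i and j."""
    hue = abs(H[i] - H[j]) + abs(H[i] - h0) + abs(H[j] - h0)
    denom = (1.0
             + max(abs(I[i] - I[k]) for k in neighbors8(i, I.shape))
             + max(abs(I[j] - I[k]) for k in neighbors8(j, I.shape))
             + dI[i] + dI[j])
    return np.exp(-beta * hue / denom)
```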



The segmentation results for facial feature pixels under various conditions are shown in Figures 1―5. Figure 1 shows the results of geometric facial feature extraction from live video with head rotations. The facial components are marked in different colors (mouth in green, nose in yellow, eyes in red and pink, brows in dark red and dark pink); the feature pixels of the face contour and small fragments are left unchanged (in blue). Figure 2 shows all seed points and extended points in different colors: seed points are marked in red, and the blue points are the facial feature pixels obtained after the random walks. Note that there are some red points on top of the hair, which are noise caused by the web camera; the random walks have not preserved any other candidate pixels around these points.

All the facial feature pixels are stored in linked lists that keep the color and the coordinates of each pixel. Since the facial feature pixels are partly connected with each other in the graph, it is straightforward to divide them into groups corresponding to the facial components of the mouth, eyes, brows, and nose, according to the centroids of the connected parts and the number of pixels. The prior knowledge of facial structure (the relative positions of the facial components) is used to guide the segmentation.

The feature extraction is robust against illumination changes, head rotations, scale variation, and hand interference. Figure 3 shows the results under an illumination change (a) and with a light source on the right side (b). Figure 4 shows the results under an isotropic scale change (a) and with hand disturbance (b). It can be seen that the feature extraction works well even in the presence of other skin-colored regions. The process runs in real time (>30 fps) at 320×240 resolution, and the speed drops to 25 fps at 640×480 on a ThinkPad T60p laptop.

For the convenience of implementing high-level vision, such as facial expression recognition, it is necessary to carry out statistical shape analysis to extract shape information that is invariant under translation, rotation, and isotropic scaling. Figure 5 illustrates the main steps of the Procrustes shape distance calculation: two facial feature patterns with different scale and orientation are rescaled to the same size and aligned by an appropriate rotation, following the calculation process given in section 4. The Procrustes shape distance for a single facial component, such as the mouth, an eye, or a brow, can be calculated in the same way.

Figure 1 Geometric facial features are automatically extracted from live video in real time with head rotations. Facial components of the mouth, eyes, brows, and nose are precisely represented with pixels in different colors.

Figure 2 Facial feature seed points (in red) and extended points (in blue) after random walks.



Figure 3 Segmentation results with an illumination change (a) and with a light source on the right side (b).

Figure 4 Facial feature segmentation results with an isotropic scale change by moving back from the camera (a) and results with hand disturbance (b).

Figure 5 Main steps of the Procrustes shape distance calculation for two facial feature patterns.

6 Conclusions

A novel approach for precise facial feature extraction from live video has been proposed in this paper. The model is built on random walks and statistical shape analysis. Experimental results have shown that the facial feature extraction is robust against illumination changes, head rotations, and scale variation. The facial feature pixel segmentation is carried out in real time, and no extra tracking is needed.


Future work includes carrying out reliable facial expression analysis and developing applications in human-computer interaction.

References

1 Li S Z, Jain A K, eds. Handbook of Face Recognition. Berlin: Springer-Verlag, 2004
2 Zhao W, Chellappa R, eds. Face Processing: Advanced Modeling and Methods. Amsterdam: Elsevier, 2006
3 Yang M H, Kriegman D J, Ahuja N. Detecting faces in images: a survey. IEEE Trans Pattern Anal Mach Intell, 2002, 24(1): 34―58
4 Lin D H, Tang X O. Recognize high resolution faces: from macrocosm to microcosm. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 2006, 2: 1355―1362
5 Mumford D, Shah J. Boundary detection by minimizing functionals. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 1985), 1985. 22―26
6 Tu Z W, Zhu S C. Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans Pattern Anal Mach Intell, 2002, 24(5): 657―673
7 Zhu Z W, Ji Q. Robust real-time face pose and facial expression recovery. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 2006, 1: 681―688
8 Barbu A, Zhu S C. Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Trans Pattern Anal Mach Intell, 2005, 27(8): 1239―1253
9 Biggs N. Algebraic potential theory on graphs. Bull London Math Soc, 1997, 29: 641―682
10 Boykov Y, Veksler O, Zabih R. Fast approximate energy minimization via graph cuts. IEEE Trans Pattern Anal Mach Intell, 2001, 23(11): 1222―1239
11 Grady L. Random walks for image segmentation. IEEE Trans Pattern Anal Mach Intell, 2006, 28(11): 1768―1783
12 Kolmogorov V, Zabih R. What energy functions can be minimized via graph cuts? IEEE Trans Pattern Anal Mach Intell, 2004, 26(2): 147―159
13 Perona P, Malik J. Scale-space and edge detection using anisotropic diffusion. IEEE Trans Pattern Anal Mach Intell, 1990, 12(7): 629―639
14 Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell, 2000, 22(8): 888―905
15 Meila M, Shi J. Learning segmentation by random walks. In: Neural Information Processing Systems Conference (NIPS 2000), 2000. 873―879
16 Cheng B, Wang Y, Zheng N, et al. MRF model and FRAME model-based unsupervised image segmentation. Sci China Ser F-Inf Sci, 2004, 47(6): 697―705
17 Dryden I L, Mardia K V. Statistical Shape Analysis. New York: John Wiley & Sons, 1998
18 Olver P J, Tannenbaum A, eds. Mathematical Methods in Computer Vision. Berlin: Springer, 2003
19 Srivastava A, Joshi S H, Mio W, et al. Statistical shape analysis: clustering, learning, and testing. IEEE Trans Pattern Anal Mach Intell, 2005, 27(4): 590―602
20 Kent J T, Mardia K V. Shape, Procrustes tangent projections and bilateral symmetry. Biometrika, 2001, 88: 469―485
21 Grenander U. General Pattern Theory. Oxford: Clarendon Press, 1994


