Hand Gesture Recognition and Interaction with 3D stereo Camera


Jianming Guo (U4692950)
Supervised by Dr. Hongdong Li
COMP 8740 Project Report
November 2011

Department of Computer Science, Australian National University


Abstract

This report focuses on model based hand articulation tracking; hand detection, location, segmentation, and recognition are also discussed. Data collected from a Microsoft Kinect is used. A 3D hierarchical hand model is developed as the hand hypothesis in the tracking algorithm. The hand gesture is first extracted as a static hand posture, and Particle Swarm Optimisation is then used to find the hand's 3D articulation by comparing observations against the hand hypothesis. The range of the PSO population is provided by a Support Vector Machine based classifier. This report also organises and discusses several recent studies in this area, from 2009 to 2011. It is an intermediate step in the research of hand gesture recognition and interaction.

Keywords: hand gesture recognition, hand articulation tracking, depth image


Contents

Abstract
1. Introduction
2. Overview
   2.1 Human hand study
   2.2 General methods
       2.2.1 Hand detection and location
       2.2.2 Hand segmentation
       2.2.3 Hand articulation tracking
       2.2.4 Hand recognition
   2.3 Methods on different sensors
3. Methodology
   3.1 Lab skin colour chrominance model
   3.2 Depth image from Kinect
   3.3 3D hand modelling
   3.4 Evaluation function
   3.5 Search optimisation
       3.5.1 2D posture classifier
       3.5.2 Particle swarm optimisation
4. Experiments
   4.1 2D posture
   4.2 Hand articulation tracking
5. Discussion and conclusion
6. Future work
7. Bibliography

1. Introduction

Hand gesture recognition and interaction is a challenging problem. A hand gesture can be static, dynamic, or both, as in sign languages (Starner & Pentland, 1995); it can be a pose, a finger movement, or a palm action combined with arm movement, and it can even be an articulated structure from a non-human source. Hand gesture recognition aims to make the computer understand the meaning of a hand gesture, including its spatial, path, symbolic, and affective information (Mitra & Acharya, 2007); hand gesture interaction goes further, communicating with the computer interactively.

Hand gesture recognition and interaction is not only a theoretical computing problem but also an engineering problem. There are roughly three groups of applications in this area: (i) surveillance, monitoring the motion of hand gestures, for example motion capture for animation; (ii) control, driving a user interface with hand gestures, for example in some video games; (iii) analysis, studying hand gestures for further understanding, for example in medical research (Moeslund, Hilton & Kruger, 2006).

Three types of sensors observe hand gestures: (i) mount based sensors, for instance the wired glove (Zimmerman et al., 1987), the mouse, the Tango device (Kry, 2005), and the Wii Remote; (ii) touch based sensors (Wilson, 2009; Wilson et al., 2008), such as the touch screens of mobile phones; (iii) vision based sensors, such as the normal camera, the depth-aware camera, and the stereo camera. Mount based sensors can capture the spatial and path information of a hand gesture, but obtaining complete and accurate information is usually expensive, and they are constrained in some special cases, for example remote control. Touch based sensors do not work well either: contact is necessary, and such sensors have difficulty observing 3D spatial information. Vision based gesture recognition is becoming more and more attractive with the rapid development of vision sensors and powerful processors, since it lets people interact without an attached medium device; for example, the Kinect (http://en.wikipedia.org/wiki/Kinect) frees the player from the traditional game controller. It is therefore advantageous to use vision based methods on hand gestures with vision based sensors.

Hand gestures differ from other biometric forms, such as facial gestures or full body gestures: the hand is smaller, more flexible, and more heavily occluded, so many challenges remain in vision based hand gesture recognition. The requirements of hand gesture recognition and interaction in a human computer interface are several. First of all, the hand gesture should be recognised accurately and the interaction should respond correctly. Second, the solution should run reliably and respond in real time; the tracking, recognition, and interaction algorithms should have good speed performance. Third, the solution should be robust


to the many defects of vision based sensors, including image noise, lighting conditions, and motion blur. Finally, the solution should not impose too many constraints: for example, it should work at various resolutions, which amounts to allowing some freedom in the distance between the hand and the sensor, and it is better not to use markers.

Solving the vision based hand gesture recognition and interaction problem involves multiple complicated steps. First, the hand position must be detected and located. The hand is then isolated from background noise by hand segmentation. Next, the hand is tracked and its representation estimated with certain features, yielding a hand feature description. A data description of the extracted features is then used for recognition, through classification or other methods. Finally, an interaction implementation is proposed. The methods used at each step can vary widely; although many solutions have been proposed, the scheme keeps improving, especially in recent years, as increasingly powerful hardware allows a growing amount of computation and as new sensors are invented.

This report focuses on model based hand articulation tracking; hand detection, location, segmentation, and recognition are also discussed. Data collected from a Microsoft Kinect is used. A 3D hierarchical hand model is developed as the hand hypothesis in the tracking algorithm. The hand gesture is first extracted as a static hand posture, and Particle Swarm Optimisation is then used to find the hand's 3D articulation by comparing observations against the hand hypothesis; the range of the initial PSO population is provided by a Support Vector Machine based classifier. This report also organises and discusses several recent studies in this area, from 2009 to 2011. It is an intermediate step in the research of hand gesture recognition and interaction.

2. Overview

2.1 Human hand study

Non-human hand articulation is not considered in this report. Studying the hand itself is a necessary first step throughout this research on hand gesture recognition and interaction. The human hand has been analysed on the basis of anatomy and mathematics for computer graphics description (Albrecht, Haber & Seidel, 2003): the skin, skeleton, muscles, mass-spring system, and joint hierarchy of the hand were modelled in virtual reality for the purpose of rendering a realistic hand with a computer. Kry (2005) also provided a detailed overview of hand related research in computer graphics, robotics, neuroscience, and anatomy, with emphasis on hand modelling in computer graphics and robotics. The human hand study in this report emphasises the computational model of human hand articulation in 3D.

2.2 General methods

Many academic surveys cover vision based human motion and gesture tracking, recognition, and analysis (Moeslund, Hilton & Kruger, 2006; Mitra & Acharya, 2007). Problems and algorithms have been categorised into initialisation, tracking, pose estimation, and recognition (Moeslund, Hilton & Kruger, 2006). This report focuses on the hand, and methods are categorised specifically as follows.

2.2.1 Hand detection and location

There are many methods for detecting the hand in an image alone. A Haar-like feature method combined with Fourier transforms (Kolsch & Turk, 2004), borrowed from a face detection method (Viola & Jones, 2001), has been applied to the hand. However, this method needs a huge database of different hand postures, and it has difficulty detecting new hand postures not included in the database. Nevertheless, Haar-like feature detection is a feasible detection algorithm if we treat the hand location problem as a hand tracking problem: once the hand posture is first detected, the initial hand configuration is passed to the hand tracking algorithm, which usually needs a good initial value to track well.

Hand location, another way of understanding hand tracking, is to find the general hand position in an image; it can be solved by per-frame hand detection. The global hand position can be obtained from full body tracking at above 90% accuracy (Shotton et al., 2011), but this requires the user to stand a certain distance away from the sensor so that the whole body is visible, so the hand resolution becomes even lower and the articulation information harder to extract. Shotton et al. (2011) proposed per-frame full body pose detection for the Kinect with a 2000TB pre-trained dataset; their per-frame initialisation and recovery on the Kinect depth map is designed to accelerate many tracking algorithms (Bregler & Malik, 1998) and removes the need for a calibration gesture (a fixed pose that provides the initial tracking parameters for a tracking algorithm; the OpenNI full body and hand tracking algorithms, for example, need a fixed calibration pose, as described in the OpenNI manual).

Skin colour detection

Skin colour detection is a colour based method used to detect and locate human body parts. It is a simpler approach that considers only the colour feature, and it is not robust to very dark or very bright conditions on cheap digital sensors with narrow exposure latitude. Bergh and Gool (2010) provided a hybrid method that combines a Gaussian Mixture Model with histogram based probability estimation. Others provided a Bayesian skin colour model (Jones & Rehg, 2001) based on a huge amount of training data; since huge training datasets are usually hard to obtain, a very simple method (Mahmoud, 2008) instead used the YCbCr colour space to locate skin colour within a certain range. To maximise skin colour accuracy with a small training set, a Lab colour space skin chromaticity model is better (Cai, Goshtasby & Yu, 1998; Ennehar, Brahim & Hicham, 2010) because of its smaller distribution range.

The face can be an obstacle to hand detection and location. One way to avoid this problem is to exclude the face area with a Haar-like face detector (Bergh & Gool, 2010), but this does not work well when the hand overlays the face; in such cases the solution should not rely on the colour feature alone.

2.2.2 Hand segmentation

Hand segmentation isolates the hand in the image, removing background noise; it can happen either before or after the hand is detected.

When segmentation follows detection, skin colour detection achieves hand segmentation naturally against a simple background. With the complex background of a working environment, however, background removal (for example thresholding, edge detection, or contour methods) is necessary. Segmentation becomes simpler with stereo and depth sensors, which allow a depth image to be extracted; in the special case where the hand overlays the face, the hand can easily be segmented according to the general depth of the detected location.

When segmentation precedes detection, segmentation of the depth map can provide clues for hand detection: K-means clustering and the Expectation Maximisation technique were developed for hand segmentation (Ghobadi et al., 2007). Shotton et al. (2011) tested their depth map clustering on hands, but did not succeed in segmenting the hand articulation information because of the limited resolution (their CVPR 2011 presentation showed a slide on point clustering on the hand that was not successful; see http://techtalks.tv/talks/54235/).

Hand segmentation takes advantage of depth information in both circumstances. Although depth information is of lower quality than colour information on current depth sensors, morphological operations can be used, and the final quality can be enhanced by combining depth segmentation with skin colour detection (Oikonomidis, Kyriazis & Argyros, 2011).

2.2.3 Hand articulation tracking

Vision based hand articulation tracking approaches fall into two general directions: model based approaches (Wu, Lin & Huang, 2001; Chua, Guan & Ho, 2002; Oikonomidis, Kyriazis & Argyros, 2011) and appearance based approaches (Heap & Hogg, 1996; Rosales et al., 2001; Athitsos & Sclaroff, 2003; Romero, Kjellström & Kragic, 2009; Shotton et al., 2011). From an engineering point of view, approaches can instead be categorised by sensor: single or multi camera approaches, depth information approaches, robotics approaches, multi sensor approaches, and so on. Marker based (Wang & Popović, 2009) and markerless methods represent two further research trends; markers usually simplify the feature extraction procedure, but most research considers only the markerless situation, since markers constrain the user experience.

Model based

Hand model. Albrecht, Haber and Seidel's (2003) hand model was very complex: too many triangles and too much hierarchy information had to be calculated to render a hand, so it presented the hand accurately but not efficiently. The hand model can be simplified to a bone rotation model, with rotation constraints representing the hand gestures. Kuch and Huang (1995) presented a 26 degrees of freedom (DOF) hand model: 3 DOFs for the global hand orientation and 23 for the finger parameters, with the global position given by the positions of the ring and little fingers as internal to the hand. Similar models have been modified into various forms. The DOFs were reduced to 12 without significantly degrading hand tracking performance (Chua, Guan & Ho, 2002), but too many constraints remained to allow occlusion on the hand, and colour markers were used, which is impractical. Physical constraints (Wu & Huang, 1999) have also been applied to geometric hand models. A more realistic human hand model (Bray et al., 2004) used the skinning technique from computer graphics, but it was designed specifically for one user and is difficult to adapt to different hands. A simple primitive based model (Oikonomidis, Kyriazis & Argyros, 2011) inherited the same 26 DOF structure; it is not as accurate as a specifically designed one (Bray et al., 2004), but it is faster to render, simpler to implement, and easier to fit to different hands.

There are also other forms of hand model. A special deformable model (Heap & Hogg, 1996) was used without built-in articulation information, but it had defects at occluded vertices. Such hand models can be seen as statistical hand models; other statistical examples include the point distribution model (PDM) (Wu & Huang, 1999) and eigen dynamics analysis (EDA) (Zhou, 2005). Statistical hand models are mainly used in appearance based approaches; they are difficult to control in high dimensional spaces. Among other geometric hand models, a 2D cardboard hand model (Wu, Lin & Huang, 2001) was used, but a 2D hand model lacks expressiveness in projective space: one of the three natural dimensions of information is ignored, leading to inaccurate feature matching. In terms of geometric appearance, models can also be created in different volumetric, skeletal, or other 2D/3D forms, such as NURBS, polygon meshes, or even lines.

A newly developed hand model (Wilson, 2009; Wilson et al., 2008) introduced physically simulated proxy particles as the medium for interacting with virtual objects. The current status of this method (see http://research.microsoft.com/apps/video/default.aspx?id=154571) allows 3D reconstruction of the hand surface, to which physics attributes are assigned for simulation with virtual objects. This hand model avoids representing the complete hand articulation and instead uses simplified proxies to simulate finger movement. By taking physics into account, Wilson's solution makes hand gesture interaction behave more naturally. The disadvantages of this method are obvious, however. First, it was designed for imaging: the interaction looks natural but is not accurate, with much interpolation used to make it look right. Second, the method provides only partial hand information; the hand articulation remains unknown. It cannot yet be used for surgery, robotics, or other applications that require high accuracy.

Since current vision based approaches are very computation intensive, it is reasonable to simplify the 3D hand model, representing the hand as freely as possible with fewer parameters and less data. As computing power allows more complex hand models, more parameters and more detail give better results (Bray et al., 2004). Geometric hand models are preferred in this report: the DOF hand model makes physical rotation constraints convenient and is the most common hand model in model based approaches. This report presents a hierarchical hand model with programmable primitives and 27 DOFs, similar to Oikonomidis, Kyriazis and Argyros' (2011) approach.

Tracking algorithms. In model based approaches, the hand state is found by matching the hand model hypothesis against the observed features. Tracking is usually treated as a search problem in a multi-dimensional space (Kuch & Huang, 1995; Wu & Huang, 1999; Chua, Guan & Ho, 2002; Lin, Wu & Huang, 2004; Bray et al., 2004; Hamer et al., 2009; Oikonomidis, Kyriazis & Argyros, 2011). The hand motion consists of many parameters (for example, the 27 DOFs discussed above), and such a high dimensional search space is difficult to observe or maintain; the computational complexity of model based approaches is very high. The proposed tracking algorithms follow three general strategies. First, formulate constraints to reduce the dimensionality of the problem (Kuch & Huang, 1995; Wu & Huang, 1999; Chua, Guan & Ho, 2002). Second, design an appropriate search algorithm that achieves an acceptable result with limited computation resources (Lin, Wu & Huang, 2004; Bray et al., 2004; Oikonomidis, Kyriazis & Argyros, 2011). Third, design an evaluation function that accurately and efficiently measures the discrepancy between the hand hypothesis and the features observed from the actual hand (Lin, Wu & Huang, 2004; Oikonomidis, Kyriazis & Argyros, 2011). Many constraints are implemented in the hand model, as discussed in the hand model section above; an efficient and fast algorithm is needed to associate the hand representation with those constraints.

In terms of reducing the dimensionality of the problem, constraints do not come only from the explicit hand articulation. More general analyses consider the appearance of 3D objects in spaces of different dimension: the manifold relating high dimensional data to its projections in a lower dimensional space has been introduced into vision research (Basri, Roth & Jacobs, 1998; Brand, 1999), and such analyses can be applied in reverse to optimise a tracking algorithm. Motion dynamics have been modelled with a transition probability table for tracking (Brand, 1999). However, it is difficult to approximate the manifold accurately, since the result depends heavily on a suitable clustering algorithm, which leads to inaccurate tracking.

A general hand tracking algorithm structure was first demonstrated successfully on a pre-recorded sequence, without considering performance (Kuch & Huang, 1995). The general purpose of the algorithm did not change in later research: search for the parameters with the global minimum error value of the objective function. Over years of development this method kept improving. A stochastic Nelder-Mead simplex search within a particle filtering framework was able to track hand motion against complex backgrounds (Lin, Wu & Huang, 2004). A rapid stochastic gradient descent approach with a specifically designed hand model and cost function also proved efficient and accurate (Bray et al., 2004). These two stochastic search methods are similar in that the randomness of the search helps avoid local minima. Particle Swarm Optimisation was designed with parallel computation in mind for the hand tracking problem (Oikonomidis, Kyriazis & Argyros, 2011); this is the method selected in this report. Other optimisation methods, such as gradient descent, Gauss-Newton, quasi-Newton, and stochastic meta-descent, are widely used to reach the global minimum in tracking algorithms.

Feature observation. For efficiency, feature observation on hand hypotheses has mostly used 2D rendered information. Edge and silhouette features (Wu, Lin & Huang, 2001; Lin, Wu & Huang, 2004) can be processed quickly and are usually enough to distinguish different hypotheses from the observed real hand gesture; similar features have been used with depth sensors (Oikonomidis, Kyriazis & Argyros, 2011). Shading and texture features are considered good clues for solving the self-occlusion problem in hand tracking (Gorce, Paragios & Fleet, 2008): in their method the lighting condition is estimated and the hand texture is learned from the image sequence, and these conventional features fill in the discontinuous parts of the cost function, providing a better objective function gradient. A complex cost function was designed that samples the distance between the 3D surfaces of the hand hypothesis and the actual hand, which requires a hand hypothesis with accurate surface detail (Bray et al., 2004). This method can compute the error value accurately with 3D features included; however, a detailed hand hypothesis is impractical if the detailed hand surface cannot be learned in real time. Such a cost function could become usable with a high resolution depth sensor able to reconstruct the hand mesh easily from depth information. The 2D cardboard model has the advantage of presenting the hypothesis likelihood quickly (Wu, Lin & Huang, 2001) without 3D rendering; however, 3D rendering is no longer a drawback with the help of GPU processors. Tzevanidis et al. (2010) proposed a GPU-powered architecture for parallel multi-frame rendering and data computation, and this GPU based approach was further developed and applied (Oikonomidis, Kyriazis & Argyros, 2011).

Appearance based

Appearance based approaches (Heap & Hogg, 1996; Rosales et al., 2001; Athitsos & Sclaroff, 2003; Romero, Kjellström & Kragic, 2009; Shotton et al., 2011) usually learn a nonlinear mapping from features extracted directly from images or other input data to the hand configuration. They avoid the direct search problem, which is generally quicker if the mapping can be learned, though they may still use a hand model to help with the cost function (Heap & Hogg, 1996; Athitsos & Sclaroff, 2003). A newly developed full body pose tracking system (Shotton et al., 2011), inspired by earlier work (Plagemann et al., 2010), separates the single tracking problem into three steps: detection and feature extraction are transferred to depth map clustering and labelling, and the final step maps the labelled pixels to the body articulation configuration using random decision forests (Breiman, 2001; Shotton et al., 2011). It is quite possible to implement this on the hand. A random decision forest consists of random decision trees; it is an effective multi-class classifier that can be implemented with parallel computing on the GPU. However, this system is difficult to build and depends heavily on a huge training dataset (Shotton et al., 2011). Appearance based approaches usually fail to extract an accurate estimate of the hand articulation; they are more reliable for the recognition problem than for tracking. If a direct search tracking algorithm is used, the search can still be optimised with appearance based methods (Shotton et al., 2011).

2.2.4 Hand recognition

Hand articulation tracking naturally provides recognition of spatial and path information. In terms of recognition proper, however, merely tracking or capturing the hand articulation does not resolve the meaning of a hand gesture: a mapping between the extracted features and their meanings has to be built, and this is where appearance based methods have the advantage. Moeslund, Hilton and Kruger (2006) provided a survey emphasising statistical machine learning tools, without covering the image processing techniques needed for hand gestures. Hands in the image are first transformed into a gesture description. The gesture description may come from normalised correlation (Freeman & Weissman, 1995), Haar-like features and Fourier transforms (Kolsch & Turk, 2004), moment invariants and shape features (Hu, 1968; Starner & Pentland, 1995), or feature vectors computed from multiple images such as colour maps, depth maps, or segmented binary maps (Malassiotis, Aifanti & Strintzis, 2002; Bergh & Gool, 2010). The recognition problem then becomes a classification problem, and many statistical or bio-inspired algorithms can be used. The Hidden Markov Model is a good tool for recognising the meaning of a sequence of postures: a real-time HMM network designed for recognising American Sign Language achieved above 99% accuracy (Starner & Pentland, 1995). There is much recent research on hand or body part recognition with depth sensors or similar 3D data (Malassiotis, Aifanti & Strintzis, 2002; Bergh & Gool, 2010; Shotton et al., 2011), using various combinations of features and classifiers, but most of it cannot achieve both recognition and articulation tracking of the hand at the same time, treating them as separate research topics. A mature hand gesture recognition and interaction solution needs both articulation tracking and appearance based recognition.

2.3 Methods on different sensors

Traditional single camera sensors have drawbacks under various environmental conditions, for example exposure latitude and lens distortion. Many other vision based sensors have been used for hand gestures.

A stereo camera has two or more lenses and image sensors, allowing it to simulate human binocular vision (stereo photography). A depth image can be calculated from the disparity of a stereo camera (Klaus, Sormann & Karner, 2006), and some hand recognition methods (Lee & Hong, 2010; Li et al., 2011) used the depth image from a stereo camera. However, it is difficult to obtain a good quality depth map from a stereo camera, which makes it inappropriate for extracting good features.

Multi camera methods involve camera calibration. A multi camera system has been used to solve the occlusion problem in hand tracking (Kato & Xu, 2006): the additional information from multiple sensors corrects the features extracted from a monocular sensor. The result tends to become more accurate as more sensors are used, but this brings a computation burden, and the calibration usually has to be done by the user, since the camera positions are uncertain, which is impractical. A GPU based parallel computing framework (Tzevanidis et al., 2010; Oikonomidis, Kyriazis & Argyros, 2011) proposed a solution with multi camera calibration.

Single depth sensors such as time-of-flight (TOF) cameras have been used in much research (Ghobadi et al., 2007; Bergh & Gool, 2010; Plagemann et al., 2010), but such sensors are very expensive and have low resolution. Combining them with RGB sensors yields the newly invented RGB-D sensors, for example Microsoft's Kinect and PrimeSense's PS1080. The depth image from this kind of sensor is generally of better quality than from previous sensors, and the sensor is very cheap compared with a TOF camera. The core full body tracking algorithm of the Kinect was proposed by Shotton et al. (2011), as mentioned above.

3. Methodology

In terms of interaction, the hand articulation is essential. Recognition of the 2D posture is performed before the 3D hand articulation tracking, to provide a good approximation of the hand configuration; both recognition and hand articulation tracking are proposed in this report. Several steps lead to the final result.

The depth image and colour image from the Kinect are used in this report. A hand motion sequence is captured through depth segmentation after the hand position has been tracked (hand detection and position tracking are provided by an algorithm in the OpenNI module, which returns the predicted global hand position X, Y and depth value; that algorithm is outside the scope of this report). The hand is then segmented into a small, clean depth image. A skin colour chrominance model is first built to detect skin colour for an SVM based 2D posture classifier. The classifier is trained offline and can recognise open hand, fist, and V-shape hand postures, using Hu moments as the geometric descriptor; it predicts 2D postures on the captured sequence. A hand hypothesis is given by a vector of 27 parameters, and a hierarchical hand model with these 27 parameters is rendered as a depth image similar to the Kinect's output. Once the 2D posture is recognised, the range of the 27 parameters is narrowed according to the recognised posture to shrink the search space. Particle swarm optimisation then searches directly for the best 27 parameters under a designed evaluation function. Finally, the hand articulation is given by the best 27 parameters found by the particle swarm optimisation. Details follow in the remainder of this report.

3.1 Lab skin colour chrominance model

To construct a skin chrominance model, hand skin samples are first cropped and every pixel of each sample is converted from the RGB colour space to the CIE Lab colour space, giving the chrominance values (L, a, b) of every pixel. The next step is to build an accumulator array, incrementing the bin (L_i, a_i, b_i) at every occurrence of that chrominance (L_i, a_i, b_i) in the sample.

Figure: an example of building a similar 2D (a, b) chrominance model

The model is then convolved with a 3D Gaussian filter to blur these points, and normalised to the range [0, 1]. To use the skin chrominance model, the pixels of a test image are tested for skin colour before the result is output. For every pixel p_i:

• determine the chrominance values (L_i, a_i, b_i) of the test image I(p_i);
• look up the skin likelihood of (L_i, a_i, b_i) in the skin chrominance model;
• assign this likelihood to I_skin(p_i).

Because the skin colour of the hand is often different from that of the face, it is necessary to sample skin colour on the hand; about 100 samples from 35 images were collected for this report. The results show that the palm colour is usually lighter than the face colour for people with generally darker skin. The nail colour should also be included in the colour model.
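As an illustration of this procedure, the following is a minimal C++/OpenCV sketch of the 2D (a, b) variant of the model shown in the figure above. The 15×15 blur kernel, the 8-bit Lab bins, and the function names are illustrative assumptions, not the report's actual implementation.

```cpp
// Sketch: build a 2D (a, b) skin chrominance model from cropped skin
// samples, then compute a per-pixel skin likelihood map for a test image.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat buildSkinModel(const std::vector<cv::Mat>& skinSamples) {
    cv::Mat model = cv::Mat::zeros(256, 256, CV_32F);    // accumulator over (a, b)
    for (const cv::Mat& bgr : skinSamples) {
        cv::Mat lab;
        cv::cvtColor(bgr, lab, cv::COLOR_BGR2Lab);       // 8-bit Lab
        for (int y = 0; y < lab.rows; ++y)
            for (int x = 0; x < lab.cols; ++x) {
                cv::Vec3b px = lab.at<cv::Vec3b>(y, x);
                model.at<float>(px[1], px[2]) += 1.0f;   // increment bin (a, b)
            }
    }
    cv::GaussianBlur(model, model, cv::Size(15, 15), 0);    // blur the accumulator
    cv::normalize(model, model, 0.0, 1.0, cv::NORM_MINMAX); // normalise to [0, 1]
    return model;
}

cv::Mat skinLikelihood(const cv::Mat& bgr, const cv::Mat& model) {
    cv::Mat lab, skin(bgr.size(), CV_32F);
    cv::cvtColor(bgr, lab, cv::COLOR_BGR2Lab);
    for (int y = 0; y < lab.rows; ++y)
        for (int x = 0; x < lab.cols; ++x) {
            cv::Vec3b px = lab.at<cv::Vec3b>(y, x);
            skin.at<float>(y, x) = model.at<float>(px[1], px[2]); // table lookup
        }
    return skin;  // I_skin: per-pixel skin likelihood
}
```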

3.2 Depth image from Kinect

The Microsoft Kinect is manufactured as an accessory for the Microsoft Xbox 360. It combines a standard RGB camera, a depth camera consisting of a class 1 laser projector and a depth sensor, and microphones. This sensor system is relatively cheap compared with other depth cameras such as TOF cameras, which makes it highly accessible for research purposes. Various drivers and SDKs allow the system to be used outside the Xbox 360. The exact engineering parameters of the Kinect are not published, but its working principle can be understood through experiments. The depth camera on the Kinect is a stereo pair of laser projector and sensor with a baseline of approximately 7.5 cm. The per-pixel depth information is acquired by observing how a fixed projected pattern of light and dark speckles is deformed by the working environment. The Kinect performs the depth computations itself, and the drivers stream the raw data from the system.

In an ideal 3D stereo camera, the depth is given by

z = b·f / d

where z is the depth (in metres), b is the horizontal baseline between the two cameras, f is the focal length of the cameras (in pixels), and d is the disparity, d = x_L − x_R, where x_L and x_R are the positions of the same spatial point on the x axis of the two image planes.

Figure: depth calculation with a 3D stereo camera

On the Kinect, the depth is acquired differently from a normal 3D stereo camera (see http://www.ros.org/wiki/kinect_calibration/technical):

z_kinect = b·f / ((1/8)·(d_off − k_d))

where d_off is about 1091.5 (see http://mathnathan.com/2011/02/03/depthvsdistance/), b is about 7.5 cm, k_d is the built-in disparity map, the factor 1/8 reflects that the disparity map is in 1/8-pixel units, and f is about 580 pixels.

The raw depth image needs to be calibrated to the RGB camera so that pixel pairs match between the colour image and the depth image; such calibration is built into the OpenNI module used in this report. Pixel values in the depth image are 16-bit integers with values ranging within 2^11; the depth value of an unobserved pixel is set to an invalid value, either 0 or 2^11, according to the depth image distribution. Knowing the depth value at the global hand position, depth segmentation is simply done by acquiring the pixels within a certain depth range around the detected position. The hand depth image is sliced around the global position at a fixed size, so that the global position of the hand in each sliced hand depth image does not change. A short sketch of this conversion and slicing follows the figure below.

Figure: hand segmentation on a depth image
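The C++ sketch below restates the Kinect depth formula above and the slicing step. The window size, depth tolerance, and function names are assumptions for illustration; the tracked hand position and its depth are assumed to come from the OpenNI tracker.

```cpp
// Sketch: convert a Kinect raw disparity value to metric depth, and slice a
// fixed-size hand window out of the 16-bit depth image around the tracked
// hand position.
#include <opencv2/opencv.hpp>

float rawDisparityToDepth(float kd) {        // z = b*f / ((1/8) * (d_off - kd))
    const float b    = 0.075f;               // baseline, metres (about 7.5 cm)
    const float f    = 580.0f;               // focal length, pixels
    const float dOff = 1091.5f;              // disparity offset
    return (b * f) / (0.125f * (dOff - kd)); // depth in metres
}

cv::Mat sliceHand(const cv::Mat& depth16,    // CV_16U depth image
                  cv::Point handPos, unsigned short handDepth,
                  int winSize = 128, unsigned short range = 100) {
    cv::Rect win(handPos.x - winSize / 2, handPos.y - winSize / 2,
                 winSize, winSize);
    win &= cv::Rect(0, 0, depth16.cols, depth16.rows);  // clip to the image
    cv::Mat hand = depth16(win).clone();
    for (int y = 0; y < hand.rows; ++y)
        for (int x = 0; x < hand.cols; ++x) {
            unsigned short& d = hand.at<unsigned short>(y, x);
            if (d < handDepth - range || d > handDepth + range)
                d = 0;                       // mark as invalid / background
        }
    return hand;                             // small, clean hand depth image
}
```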

3.3 3D hand modelling

Since no usable DOF hand model was available for research purposes, one had to be programmed to work with the model based approach. The hand model in this report is programmed in modern OpenGL with C++, avoiding deprecated OpenGL code, so that hand hypotheses can be rendered as fast as possible using GPU power. Inspired by a proposed 3D hand model (Oikonomidis, Kyriazis & Argyros, 2011), the entire hand is a hierarchical model consisting of multiple cylinder and sphere primitives. It has 27 parameters: the first 3 are the global position, the next 4 the global orientation as a quaternion, and the remaining 20 the finger orientations. To adapt to different people's hands, a global scale factor is applied when the model is first established.

Pixel values in the Kinect depth image represent the distance from the object to the sensor in real world coordinates. The evaluation function does not require a strict match between the real hand image and the hypothesis, so the virtual camera programmed in OpenGL only needs the same FOV as the Kinect sensor. Since both values can be normalised, the real hand image and the hypothesis will match as long as the rendered depth value remains linear. A fragment shader is programmed to simulate the Kinect depth image; it outputs a reversed depth value from 1.0 (near) to 0.0 (far) in the OpenGL window coordinate space (a sketch of such a shader follows the figure below).

Figure: the proposed 3D hand model rendered as a colour image and a depth image
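A minimal fragment shader in the spirit of this description might look as follows (GLSL 3.30, embedded as a C++ string); the exact shader in the report's source may differ. Note that gl_FragCoord.z is not linear in view-space depth under a perspective projection, so a strictly linear ramp would instead pass the view-space z through from the vertex shader.

```cpp
// Sketch: fragment shader that writes a reversed window-space depth value,
// 1.0 at the near plane and 0.0 at the far plane, so the rendered hypothesis
// can be compared against the normalised Kinect depth image.
const char* depthFragmentShaderSrc = R"(
#version 330 core
out vec4 fragColour;
void main() {
    // gl_FragCoord.z is 0.0 at the near plane and 1.0 at the far plane
    // (default glDepthRange); invert it so that near = 1.0 and far = 0.0.
    float d = 1.0 - gl_FragCoord.z;
    fragColour = vec4(d, d, d, 1.0);
}
)";
```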

3.4 Evaluation function

The evaluation function is used in the particle swarm optimisation; the best hand hypothesis h is the one for which E(h, O) is smallest. The meanings of the symbols are given in the table below.

E(h, O) = D(O, h) + λ_k · kc(h)

D(O, h) = Σ min(|o_d − r_d|, d_M) / (Σ(o_s ∨ r_m) + ε) + λ(1 − 2Σ(o_s ∧ r_m) / (Σ(o_s ∧ r_m) + Σ(o_s ∨ r_m)))

kc(h) = Σ_{p ∈ Q} { −φ(p) if φ(p) < 0;  0 if φ(p) ≥ 0 }

Symbol | Meaning | Type / value
E(h, O) | the discrepancy between O and h |
O | the observation model, O = (o_s, o_d) |
o_d | the depth image of the hand segmentation | depth image
o_s | the binary image of the hand segmentation | binary image
h | the hand hypothesis |
D(O, h) | the average discrepancy |
r_d | the depth map rendered from the hand hypothesis | depth image
r_m | binary match map: if o_d matches r_d with a difference less than d_m, r_m is 1; otherwise, or for a missing observation, r_m is 0 | binary image
d_m | a predetermined value, the matching threshold set according to experiment | 1
d_M | clamps the maximum depth difference: a difference counts as no more than this value, otherwise the depth does not match; this bounds the search space and avoids large variations under the optimisation strategy | 1000
kc(h) | penalises impossible hand gestures; only finger interpenetration is considered |
Q | the three pairs of adjacent fingers that may interpenetrate: index & middle, middle & ring, ring & little |
φ(p) | the first finger's z-axis orientation minus the second's; interpenetration may occur where this value is less than 0 |
ε | added to the denominator to avoid division by zero |
λ_k | the weight of the penalty function | 10
A ∧ B | A AND B, an image operation |
A ∨ B | A OR B, an image operation |

This function follows a proposed approach (Oikonomidis, Kyriazis & Argyros, 2011), here without the skin detection binary image. D(O, h) in this report is the same as in the original approach: the depth difference between the observation O and the hypothesis h. The numerator of the first term of D(O, h) sums the depth difference over every pixel that has a depth value in both the hypothesis and the observation, while the denominator sums the binary pixel values over the union of hypothesis and observation; the division averages the depth difference over every pixel within the union of the two binary images. The second term of D(O, h) is the discrepancy between the binary images of the hypothesis and the observation. Since the hand is detected here without using the skin colour feature, and the depth image and its binary threshold have the same silhouette, this discrepancy is unnecessary: it is enough to calculate D(O, h) from the depth image alone.
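A rough C++ sketch of this computation is given below, with the binary maps derived from the depth maps as argued above. The λ weight, the handling of missing observations, and the fingerBaseZ accessor are illustrative assumptions rather than the report's exact code.

```cpp
// Sketch: depth-only evaluation E(h, O) = D(O, h) + lambda_k * kc(h).
// od and rd are normalised CV_32F depth maps where 0 means "no data";
// the binary maps o_s and r_m are taken as od > 0 and rd > 0.
#include <opencv2/opencv.hpp>
#include <cmath>
#include <algorithm>

double evaluate(const cv::Mat& od, const cv::Mat& rd,
                const double fingerBaseZ[5],   // {thumb, index, middle, ring, little}
                double dM = 1000.0, double lambda = 1.0,
                double lambdaK = 10.0, double eps = 1e-6) {
    double sumDiff = 0.0, sumAnd = 0.0, sumOr = 0.0;
    for (int y = 0; y < od.rows; ++y)
        for (int x = 0; x < od.cols; ++x) {
            bool o = od.at<float>(y, x) > 0.0f;
            bool r = rd.at<float>(y, x) > 0.0f;
            if (o && r) {
                sumAnd += 1.0;
                sumDiff += std::min<double>(
                    std::abs(od.at<float>(y, x) - rd.at<float>(y, x)), dM);
            } else if (o || r) {
                sumDiff += dM;                 // missing observation: clamped penalty
            }
            if (o || r) sumOr += 1.0;
        }
    double D = sumDiff / (sumOr + eps)
             + lambda * (1.0 - 2.0 * sumAnd / (sumAnd + sumOr + eps));
    double kc = 0.0;                           // adjacent-finger interpenetration
    for (int i = 1; i < 4; ++i) {              // index-middle, middle-ring, ring-little
        double phi = fingerBaseZ[i] - fingerBaseZ[i + 1];
        if (phi < 0.0) kc += -phi;             // penalise only negative phi(p)
    }
    return D + lambdaK * kc;
}
```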

3.5 Search optimisation

3.5.1 2D posture classifier

This report uses Hu's moments as the geometric descriptor representing the hand shape. Image processing techniques are used to obtain the final binary image of the hand from which the Hu moments are calculated; the pipeline is illustrated below.


Figure: the image processing pipeline: (a) original image; (b) after colour balance; (c) detected skin probabilities; (d) detected skin binary; (e) largest connected area; (f) mask for background removal; (g) RGB image with small connected regions removed, from (d); (h) grey-scaled region of interest with small connected parts removed; (i) Otsu thresholding; (j) threshold result combined with the detected skin (e); (k) final clean binary
Because the angle and distance from the camera to the object may vary each time an image is captured, there is geometric distortion between a test image and the reference image (Hu, 1968). A recognition method should therefore be able to recognise visual shapes and characters independently of the position, size, and orientation of the object. Hu moments are invariant to such geometric distortion, and they can easily be used to describe a binary image.

Hu's seven moment invariants are:

φ₁ = η₂₀ + η₀₂
φ₂ = (η₂₀ − η₀₂)² + 4η₁₁²
φ₃ = (η₃₀ − 3η₁₂)² + (3η₂₁ − η₀₃)²
φ₄ = (η₃₀ + η₁₂)² + (η₂₁ + η₀₃)²
φ₅ = (η₃₀ − 3η₁₂)(η₃₀ + η₁₂)[(η₃₀ + η₁₂)² − 3(η₂₁ + η₀₃)²] + (3η₂₁ − η₀₃)(η₂₁ + η₀₃)[3(η₃₀ + η₁₂)² − (η₂₁ + η₀₃)²]
φ₆ = (η₂₀ − η₀₂)[(η₃₀ + η₁₂)² − (η₂₁ + η₀₃)²] + 4η₁₁(η₃₀ + η₁₂)(η₂₁ + η₀₃)
φ₇ = (3η₂₁ − η₀₃)(η₃₀ + η₁₂)[(η₃₀ + η₁₂)² − 3(η₂₁ + η₀₃)²] − (η₃₀ − 3η₁₂)(η₂₁ + η₀₃)[3(η₃₀ + η₁₂)² − (η₂₁ + η₀₃)²]

where φ is the invariant moment value and η_pq is the normalised central image moment of order (p, q).

Scaling must be performed so that attributes in greater numeric ranges do not dominate those in smaller numeric ranges, since the SVM is sensitive to numeric values; scaling also avoids numerical difficulties during calculation. In this report each attribute is linearly scaled to the range [-1, 1].

Figure: a portion of the Hu moment training data after normalisation
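In OpenCV the seven invariants can be obtained directly, as in the short sketch below. The scaling helper and the per-attribute bounds are assumptions; in practice the bounds would be taken from the training set.

```cpp
// Sketch: compute the seven Hu moments of the final binary hand image and
// linearly scale each attribute to [-1, 1] before passing it to the SVM.
#include <opencv2/opencv.hpp>

void huDescriptor(const cv::Mat& binaryHand, double hu[7]) {
    cv::Moments m = cv::moments(binaryHand, /*binaryImage=*/true);
    cv::HuMoments(m, hu);                       // the seven invariants
}

double scaleAttribute(double v, double lower, double upper) {
    return -1.0 + 2.0 * (v - lower) / (upper - lower);  // linear map to [-1, 1]
}
```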

The Support Vector Machine (SVM) is a useful machine learning technique for data classification. It is robust for two-class classification; training and prediction are very fast compared with other methods, and it is easy to understand and to apply to many types of classification problem. Multi-class classification is easily achieved with multiple SVM trainers and predictors. The principle of the SVM, as shown in the figure below, is to find a separating hyperplane with maximal margin in a higher dimensional space; the maximal margin is what gives the SVM its advantage.


Figure: the separating hyperplane with maximal margin

Given a training set of instance-label pairs (x_i, y_i), i = 1, …, l, with x_i ∈ Rⁿ and y_i ∈ {1, −1}, the support vector machine used in this report solves the following optimisation problem (Boser et al., 1992; Cortes & Vapnik, 1995):

min_{w, b, ξ}  (1/2)·wᵀw + C·Σ_{i=1..l} ξ_i

subject to

y_i(wᵀφ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0.

This report selected the RBF kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²), γ > 0, because of the nonlinearity of the Hu moment values. C and γ are acquired from a grid search based on 3-fold cross validation.
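As a sketch of one pairwise classifier, the current OpenCV ml module can run this grid search directly; this API postdates the report, so treat it as an illustration of the procedure rather than the original code.

```cpp
// Sketch: train one pairwise C-SVC classifier with an RBF kernel, selecting
// C and gamma by a k-fold cross-validated grid search.
#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>

cv::Ptr<cv::ml::SVM> trainPairwise(const cv::Mat& samples,  // N x 7, CV_32F
                                   const cv::Mat& labels) { // N x 1, CV_32S, +1/-1
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::RBF);   // K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    cv::Ptr<cv::ml::TrainData> data =
        cv::ml::TrainData::create(samples, cv::ml::ROW_SAMPLE, labels);
    svm->trainAuto(data, /*kFold=*/3);  // grid search with 3-fold cross validation
    return svm;
}
```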

3.5.2 Particle swarm optimisation

Particle Swarm Optimisation (Kennedy & Eberhart, 1995; Kennedy, Eberhart & Shi, 2001) is a bio-inspired evolutionary algorithm that accelerates a search problem through the collective intelligence of a population. The algorithm has many advantages: (i) it is insensitive to the scaling of the design variables; (ii) it is simple to implement; (iii) it is easy to parallelise for concurrent processing; (iv) it has few algorithm parameters; (v) it is efficient on global search problems. A constriction factor variant of PSO (Clerc & Kennedy, 2002) is used in this report, with the standard update equations

v_{k+1}^i = χ·[v_k^i + c₁r₁(p_k^i − x_k^i) + c₂r₂(p_k^g − x_k^i)]
x_{k+1}^i = x_k^i + v_{k+1}^i
χ = 2 / |2 − ψ − √(ψ² − 4ψ)|,  ψ = c₁ + c₂  (ψ > 4)

Symbol | Meaning
x_k^i | particle position
v_k^i | particle velocity
p_k^i | best individual particle position
p_k^g | best swarm position
r₁, r₂ | random numbers between 0 and 1
c₁, c₂ | cognitive and social parameters; c₁ = 2.8, c₂ = 1.3
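The sketch below shows this constriction factor search in C++ over the 27-dimensional hand parameter vector. The objective callback, the fixed random seed, and the clamping of positions to the range bounds are illustrative assumptions (C++17 for std::clamp).

```cpp
// Sketch: constriction-factor PSO (Clerc & Kennedy, 2002) over the hand
// parameters; evaluate() is the objective above, lower/upper the per-posture
// parameter ranges described next.
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>

struct Particle { std::vector<double> x, v, pBest; double bestFit; };

std::vector<double> psoSearch(int popSize, int generations,
                              const std::vector<double>& lower,
                              const std::vector<double>& upper,
                              double (*evaluate)(const std::vector<double>&)) {
    const double c1 = 2.8, c2 = 1.3, psi = c1 + c2;   // psi = 4.1 > 4
    const double chi = 2.0 / std::abs(2.0 - psi - std::sqrt(psi * psi - 4.0 * psi));
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    const size_t dim = lower.size();

    std::vector<Particle> swarm(popSize);
    std::vector<double> gBest;
    double gBestFit = 1e300;
    for (Particle& p : swarm) {                       // random initialisation
        p.x.resize(dim); p.v.assign(dim, 0.0);
        for (size_t d = 0; d < dim; ++d)
            p.x[d] = lower[d] + u01(rng) * (upper[d] - lower[d]);
        p.pBest = p.x; p.bestFit = evaluate(p.x);
        if (p.bestFit < gBestFit) { gBestFit = p.bestFit; gBest = p.x; }
    }
    for (int g = 0; g < generations; ++g)
        for (Particle& p : swarm) {
            for (size_t d = 0; d < dim; ++d) {        // constriction update
                p.v[d] = chi * (p.v[d]
                       + c1 * u01(rng) * (p.pBest[d] - p.x[d])
                       + c2 * u01(rng) * (gBest[d]  - p.x[d]));
                p.x[d] = std::clamp(p.x[d] + p.v[d], lower[d], upper[d]);
            }
            double fit = evaluate(p.x);
            if (fit < p.bestFit) { p.bestFit = fit; p.pBest = p.x; }
            if (fit < gBestFit)  { gBestFit  = fit; gBest  = p.x; }
        }
    return gBest;   // best 27-parameter hand hypothesis found
}
```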

The range of the particle positions is controlled by the posture classifier. The range parameters are estimated by observing the 3D hand model.

Parameter | Default range (lower, upper) | Open hand range (lower, upper)
1 | (-1.0000, 1.0000) | (0.0000, 0.0000)
2 | (-1.0000, 1.0000) | (-1.0000, -1.0000)
3 | (2.0000, 4.0000) | (3.0000, 3.0000)
4 | (-1.0000, 1.0000) | (0.0000, 0.0000)
5 | (-1.0000, 1.0000) | (0.0000, 0.0000)
6 | (-1.0000, 1.0000) | (0.0000, 0.0000)
7 | (-1.0000, 1.0000) | (0.0000, 0.0000)
8 | (-1.3960, 1.0470) | (-1.3960, -0.2000)
9 | (-0.3490, 1.5700) | (-0.1047, 0.4710)
10 | (-0.0870, 1.7450) | (-0.0261, 0.5235)
11 | (-0.0870, 1.5700) | (-0.0261, 0.4710)
12 | (-0.5230, 0.3490) | (-0.5230, 0.0000)
13 | (-0.0870, 1.5700) | (-0.0435, 0.7850)
14 | (-0.0870, 1.7450) | (-0.0435, 0.8725)
15 | (-0.0870, 1.3090) | (-0.0435, 0.6545)
16 | (-0.3490, 0.3490) | (-0.1745, 0.1745)
17 | (-0.0870, 1.5700) | (-0.0435, 0.7850)
18 | (-0.0870, 1.7450) | (-0.0435, 0.8725)
19 | (-0.0870, 1.3090) | (-0.0435, 0.6545)
20 | (-0.3490, 0.3490) | (0.0000, 0.3490)
21 | (-0.0870, 1.5700) | (-0.0435, 0.7850)
22 | (-0.0870, 1.7450) | (-0.0435, 0.8725)
23 | (-0.0870, 1.3090) | (-0.0435, 0.6545)
24 | (-0.5230, 0.3490) | (0.2000, 0.5230)
25 | (-0.0870, 1.5700) | (-0.0435, 0.7850)
26 | (-0.0870, 1.7450) | (-0.0435, 0.8725)
27 | (-0.0870, 1.3090) | (-0.0435, 0.6545)

Table: comparison of the default range and the open hand range. Orientation data is in radians; position data is in the OpenGL clipping space (more data is in the source code). The global position and orientation are fixed in this report, so the default range shows the actual search space. Parameters 1-3 are the global position and 4-7 the global orientation; 8-11 are the thumb orientations, and 12-15, 16-19, 20-23, and 24-27 belong to the index, middle, ring, and little fingers. Parameters 8, 12, 16, 20, and 24 are the z-axis orientations at the base of each finger; the remaining orientation parameters are on the x axis.

As the table shows, the search space is heavily reduced by the posture classifier. Overlap must be considered when estimating the ranges: the hand articulation tracking must remain continuous when a new posture is detected, so the ranges of two consecutive postures should overlap slightly.

4. Experiments

4.1 2D posture

Thousands of hand photos of different people were taken to build the dataset for this report. The photos were taken with a CCD camera rather than the Kinect, because the Kinect has a lower resolution. The photos were divided evenly into training and test data so that the data could be packed into the SVM; the table below shows the distribution of the data, where the test data is one third of the amount for each posture. Open hand is the largest set because it varies more across different people's hands than the other two. Once the parameters had been set on the test data, the entire dataset was used as training data to train the final three classifiers.

Type | Amount
Fist | 960
Open hand | 1498
V shape | 740
Total | 3198

Table: distribution of the data

The “one against one” method is used to train the three pairs of postures, and a voting method is used for prediction: every unknown posture is passed to each classifier, and the result is the class with the most votes (a sketch of this voting step follows the figure below). Three classifiers need to be trained, and three pairs of C and γ were selected based on 3-fold cross validation. After experiments, the best C for each pair stayed at or below 2⁴ = 16; larger values led to overfitting. For fist against open hand, C = 16 and γ = 16, and the best accuracy obtained was 99.797%. For fist against V shape, C = 0.5 and γ = 2, and the accuracy reached 100% on the test set. For V shape against open hand, C = 2^1.75 ≈ 3.36359 and γ = 2^2.5 ≈ 5.65685, with a best rate of 99.8213%. After applying the voting prediction to the three postures, a final accuracy of 99.78% was achieved on the entire dataset. This 2D classifier is therefore feasible for providing the initial approximate hand configurations for the particle swarm optimisation.


Figure: “one against one” voting prediction
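A sketch of the voting step, reusing the hypothetical trainPairwise classifiers from section 3.5.1, might look like this; the class indices and pair layout are assumptions.

```cpp
// Sketch: "one against one" voting over the three pairwise SVM classifiers.
// Classes: 0 = fist, 1 = open hand, 2 = V shape; pairs[k] names the two
// classes separated by classifier k (+1 -> first, -1 -> second).
#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>
#include <array>
#include <algorithm>
#include <utility>

int predictPosture(const std::array<cv::Ptr<cv::ml::SVM>, 3>& classifiers,
                   const std::array<std::pair<int, int>, 3>& pairs,
                   const cv::Mat& descriptor) {   // 1 x 7 scaled Hu moments
    int votes[3] = {0, 0, 0};
    for (int k = 0; k < 3; ++k) {
        float response = classifiers[k]->predict(descriptor);  // +1 or -1
        votes[response > 0 ? pairs[k].first : pairs[k].second] += 1;
    }
    return int(std::max_element(votes, votes + 3) - votes);    // most votes wins
}
```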

4.2 Hand articulation tracking

Several experiments were run to evaluate the proposed methods, based on short depth sequences of hand motion; the output is the depth sequence rendered from the hypothesis. The PSO process was observed during each experiment. These experiments were programmed as per-frame articulation detection.

The rendered hand hypotheses appear solid in the captured pictures below, but they actually contain depth values, which can be observed by applying a very small gamma value; the depth image shown with a small gamma applied is in fact the first rendered depth image of sequence 1.

Sequence 1:

Figure: captured hand motion sequence, from fist to open hand

Figure: rendered hand hypothesis

Total frame number: 30
PSO population size: 10
PSO generations: 15

Figure: evaluation graph (evaluated fitness per frame, frames 1 to 30)

Sequence 2:

Figure: captured hand motion sequence, from V shape to fist

Figure: rendered hand hypothesis

Total frame number: 9
PSO population size: 20
PSO generations: 25

Figure: evaluation graph (evaluated fitness per frame, frames 1 to 9)

Sequence 3:

Figure: captured hand motion sequence, from open hand to fist

Figure: rendered hand hypothesis

Total frame number: 12
PSO population size: 20
PSO generations: 15

Figure: evaluation graph (evaluated fitness per frame, frames 1 to 12)

From the results above, the hypothesis generally follows the correct hand gesture. In the 6th image of sequence 1, the posture was recognised incorrectly, leading to a sharp ridge in the evaluation graph. Notice that the evaluation values in these experiments are not sensitive to small features, so the proposed method is currently not good at accurate tracking. One reason is the natural discrepancy between the hand and the proposed 3D hand model; to evaluate its influence, this report used the depth image rendered from the proposed 3D hand model itself as the input, eliminating that discrepancy.


Sequence 4:

Figure: rendered depth image used as input

Figure: the red channel is the input and the blue channel is the evaluated hand hypothesis, so the overlapping part appears purple

Total frame number: 30
PSO population size: 10
PSO generations: 15

Figure: evaluation graph (evaluated fitness per frame, frames 1 to 30)

Sequence 4 gives a better result than sequence 1 with exactly the same PSO parameters. The scale factor and the defects of the Kinect depth image are eliminated by using the rendered 3D hand model as input, and the evaluation values in all frames of sequence 4 are lower than in sequence 1. This shows that the evaluation function is able to measure the difference between the hypothesis and the input, and that the evaluation value is also influenced by the quality of the depth image. However, the hypothesis results are not stable, because the population in the PSO is initialised randomly within the range at every frame; this can be solved by seeding the next frame's initial population with the best overall position of the last frame, as in the sketch below. Another disadvantage of these experiments is that the PSO implementation is not optimised, so the current implementation cannot run in real time. With intensive GPU computing, a proposed method using only PSO achieved 15 Hz on a powerful PC (Oikonomidis, Kyriazis & Argyros, 2011).
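A sketch of that temporal seeding, reusing the Particle type from the PSO sketch in section 3.5.2, is shown below; reseeding half the swarm with Gaussian jitter is one plausible way to implement the idea, not the report's code.

```cpp
// Sketch: seed part of the next frame's swarm at the previous frame's best
// hypothesis, with small Gaussian jitter, instead of a fully random restart.
#include <vector>
#include <random>

void seedFromLastFrame(std::vector<Particle>& swarm,       // see the PSO sketch
                       const std::vector<double>& lastBest,
                       std::mt19937& rng, double sigma = 0.05) {
    std::normal_distribution<double> jitter(0.0, sigma);
    for (size_t i = 0; i < swarm.size() / 2; ++i)           // reseed half the swarm
        for (size_t d = 0; d < lastBest.size(); ++d)
            swarm[i].x[d] = lastBest[d] + jitter(rng);      // perturbed best
    // the seeded particles' fitness is re-evaluated in the normal PSO loop
}
```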

5. Discussion and conclusion

As an intermediate step towards the hand gesture recognition and interaction problem, this report provided a literature review of recent research. The review concludes that combining model based and appearance based methods is a promising research direction. The report proposed a model based articulation tracking method, optimised by a 2D posture classifier, using data from a single Kinect sensor. The steps: (i) a 2D posture classifier is trained; (ii) a 3D hierarchical hand model with 27 DOFs is built; (iii) Particle Swarm Optimisation is run with an evaluation function that observes depth and silhouette discrepancy. This yields sufficient 2D recognition and rough articulation tracking for hand gesture recognition and interaction. Extensive programming was done with OpenCV, OpenGL, and OpenNI in C++ for this report.

6. Future work

Many potential enhancements remain. First, the 2D classifier recognises only three postures; more postures could be trained for better optimisation, or dynamic posture sequences could be recognised with an HMM, which would greatly improve the accuracy of the initial population for the PSO. Second, the proposed 3D hand model has few parameters for fitting different people's hands; a more detailed, programmable model using the skinning technique may work in the future. Third, the PSO can be parallelised extensively, allowing a larger population and more generations so that the search converges quickly to a small difference between the hypothesis and the hand segmentation. In terms of applications, the proposed methods are feasible for a vision based human computer interface with hand gestures that does not require accurate articulation tracking; for example, a medical image browser controlled by hand gestures could be used during surgery.

7. Bibliography

Albrecht, I., Haber, J. & Seidel, H. (2003). Construction and animation of anatomically based human hand models. Eurographics Symposium on Computer Animation, page 109. Eurographics Association.

Basri, R., Roth, D. & Jacobs, D. (1998). Clustering appearances of 3D objects. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pages 414-420.

Bergh, M. van den & Van Gool, L. (2010). Combining RGB and ToF cameras for real-time 3D hand gesture interaction. Applications of Computer Vision (WACV 2011).

Boser, B., Guyon, I. & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152. ACM Press.

Brand, M. (1999). Shadow puppetry. In Proc. of IEEE Int'l Conf. on Computer Vision, pages 1237-1244.

Bray, M. et al. (2004). 3D hand tracking by rapid stochastic gradient descent using a skinning model. 1st European Conference on Visual Media Production.

Bregler, C. & Malik, J. (1998). Tracking people with twists and exponential maps. In Proc. CVPR.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5-32.

Cai, J., Goshtasby, A. & Yu, C. (1998). Detecting human faces in color images. Wright State University.

Chua, C., Guan, H. & Ho, Y. (2002). Model-based 3D hand posture estimation from a single 2D image. Image and Vision Computing, vol. 20, pages 191-202. Elsevier Science.

Clerc, M. & Kennedy, J. (2002). The particle swarm: explosion, stability, and convergence in a multidimensional complex space. IEEE Transactions on Evolutionary Computation, 6(1):58-73.

Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20:273-297.

Derpanis, K. (2004). A review of vision-based hand gestures.

Ennehar, B., Brahim, O. & Hicham, T. (2010). An appropriate color space to improve human skin detection. INFOCOMP Journal of Computer Science, 9(4):1-42, December 2010.

Freeman, W. & Weissman, C. (1995). Television control by hand gestures. In Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, June 1995, pages 179-183.

Ghobadi, S. et al. (2007). Hand segmentation using 2D/3D images. Proceedings of Image and Vision Computing New Zealand 2007, pages 64-69, Hamilton, New Zealand.

Hamer, H. et al. (2009). Tracking a hand manipulating an object. In ICCV 2009.

Heap, T. & Hogg, D. (1996). Towards 3D hand tracking using a deformable model. Face and Gesture Recognition.

Hu, M. (1968). Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, pages 179-187.

Jones, M. & Rehg, J. (1999). Statistical color models with application to skin detection. In Proc. of CVPR '99, volume 1, pages 274-280.

Kato, M. & Xu, G. (2006). Occlusion-free hand motion tracking by multiple cameras and particle filtering with prediction. International Journal of Computer Science and Network, 6(10), October 2006.

Kennedy, J. & Eberhart, R. (1995). Particle swarm optimization. In International Conference on Neural Networks, vol. 4, pages 1942-1948. IEEE.

Kennedy, J., Eberhart, R. & Shi, Y. (2001). Swarm Intelligence. Morgan Kaufmann.

Klaus, A., Sormann, M. & Karner, K. (2006). Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. 18th International Conference on Pattern Recognition.

Kolsch, M. & Turk, M. (2004). Robust hand detection. In International Conference on Automatic Face and Gesture Recognition, Seoul, Korea.

Kry, P. (2005). Interaction Capture and Synthesis of Human Hands. PhD thesis, The University of British Columbia.

Lee, D. & Hong, K. (2010). Game interface using hand gesture recognition. 5th International Conference on Computer Science and Convergence Information Technology.

Li, X. et al. (2011). Hand gesture recognition by stereo camera using the thinning method. International Conference on Multimedia Technology.

Mahmoud, T. (2008). A new fast skin color detection technique. World Academy of Science, Engineering and Technology, vol. 43.

Mitra, S. & Acharya, T. (2007). Gesture recognition: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(3), May 2007.

Moeslund, T., Hilton, A. & Kruger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, vol. 104, pages 90-126.

Oikonomidis, I., Kyriazis, N. & Argyros, A. (2011). Efficient model-based 3D tracking of hand articulations using Kinect. BMVC 2011.

Plagemann, C. et al. (2010). Real-time identification and localization of body parts from depth images. 2010 IEEE International Conference on Robotics and Automation.

Romero, J., Kjellström, H. & Kragic, D. (2009). Monocular real-time 3D articulated hand pose estimation. International Conference on Humanoid Robots, pages 87-92, December 2009.

Rosales, R. et al. (2001). 3D hand pose reconstruction using specialized mappings. ICCV, pages 378-385.

Shotton, J. et al. (2011). Real-time human pose recognition in parts from single depth images. In Proc. CVPR 2011.

Starner, T. & Pentland, A. (1995). Real-time American Sign Language recognition from video using hidden Markov models. MIT Media Lab, Cambridge, MA, Tech. Rep. TR-375.

Tzevanidis, K. et al. (2010). From multiple views to textured 3D meshes: a GPU-powered approach. ECCV 2010 Workshop on Computer Vision on GPUs, Greece.

Viola, P. & Jones, M. (2002). Robust real-time object detection. International Journal of Computer Vision.

Wang, R. & Popović, J. (2009). Real-time hand-tracking with a color glove. ACM Transactions on Graphics (SIGGRAPH 2009), 28(3), August 2009.

Wu, Y. & Huang, T.S. (1999). Human hand modeling, analysis and animation in the context of HCI. IEEE Int'l Conf. on Image Processing, Kobe, Japan.

Wu, Y., Lin, J. & Huang, T. (2001). Capturing natural hand articulation. IEEE International Conference on Computer Vision, Vancouver, Canada.

Zhou, H. (2005). Visual Tracking and Recognition of the Human Hand. PhD thesis.

Zimmerman, T. et al. (1987). A hand gesture interface device. CHI '87: Proceedings of the SIGCHI/GI Conference on Human Factors in Computing Systems and Graphics Interface. ACM, New York, USA.
