Gesture-based computer mouse using Kinect sensor

CogInfoCom 2014 • 5th IEEE International Conference on Cognitive Infocommunications • November 5-7, 2014, Vietri sul Mare, Italy

Szilvia Szeghalmy, Marianna Zichar, Attila Fazekas

Department of Computer Graphics and Image Processing, University of Debrecen, H-4010 Debrecen Pf. 12

Abstract: This paper introduces a vision-based gesture mouse system that is largely independent of lighting conditions, because it uses only depth data for hand sign recognition. A Kinect sensor was used to develop the system, but other depth-sensing cameras are adequate as well if their resolution is similar to or better than that of the Kinect sensor. Our aim was to find a comfortable, user-friendly solution that can be used for a long time without fatigue. The system was implemented in C++, and two types of tests were performed. We investigated how fast the user can position the cursor and click on objects, and we also examined which controls of graphical user interfaces (GUIs) are easy to use and which are difficult to use with our gesture mouse. Our system is precise enough to operate most elements of a traditional GUI efficiently, such as buttons, icons and scrollbars. The accuracy achieved by advanced users is only slightly below the accuracy they achieve with a traditional mouse.



In Human-Computer Interaction (HCI), the mouse is still one of the most commonly used input devices. Its great benefit is that it lets users control all kinds of applications that have a GUI. However, this device cannot be used in several situations in public areas and/or by people with disabilities. Some gesture-based systems take over the control of the mouse pointer and mouse events to solve this problem. This technique is rather popular among head- and eye-mouse systems [2], [18], but it has also appeared in some touch-free medical applications [4], [20] and, of course, in the computer gaming world [8], [13].

According to the definition of a new research field called cognitive infocommunications [22], HCI applications may be classified as CogInfoCom applications. For the current research this statement holds particularly well, because it has links with both infocommunications and the cognitive sciences [24]. More precisely, our gesture-based computer mouse belongs to inter-cognitive communication.

Our aim was to develop a vision-based, application-independent, cheap mouse control system. User convenience was an important aspect, as was a small number of gestures to memorize. Our solution also gives the user some freedom in how the gestures are presented.

II.

Hand and fingertip detection and gesture recognition methods have been studied for several decades; methods designed for depth sensors are now in focus. Depth data are frequently used only for extracting the hand pixels. If the hand is properly segmented, the gestures can be recognised by the shape of the hand contour and other geometric features. Ren et al. proposed a new contour matching method in [16] using the series of relative distances between the contour points and the hand centre. They achieved 86-100% accuracy on their own challenging 10-gesture dataset. Klompmaker et al. developed an interaction framework for touch detection and object interaction [5]. The fingertips are detected as vertices of the polygon approximating the hand contour. Yeo et al. present a similar method, but they compute more features and give several criteria to classify a polygon vertex as a fingertip [21]. The authors of [17] used a convex shape decomposition method combined with skeleton extraction to detect fingertips and recognise the gesture. The accuracy of their method is between 94% and 97% on Ren's dataset. Detection of half-closed and closed fingertips requires other approaches, such as geodesic maxima based fingertip detection [6] or 3D model fitting [7], [12], but their computational costs are large.

III.

Our vision-based mouse control system relies only on depth data, thus lighting conditions hardly influence it. The depth data are provided by a depth sensor, namely the Kinect [11].

A. Arrangement of device for sitting position

The best arrangement is when the sensor is slightly below the screen. The sensor should be parallel to the monitor, but it may also be tilted slightly upwards towards the user. The user sits in front of the screen in a position where he can hold his hand inside the sensor's field of view (Figure 1).

978-1-4799-7280-7/14/$31.00 ©2014 IEEE 419

Fig. 1. Arrangement of devices where K denotes a depth sensor [19].

B. Controlling

The user moves the cursor in a joystick-like way. If the hand or finger is in a vertical position, slightly leaning forward, the cursor does not move; otherwise the cursor keeps moving in the direction in which the hand is pointing. If the user tips his hand to the left or right, the cursor goes to the left or right. If he tilts it downward or upward, the cursor goes downward or upward as well. Each movement lasts until the hand starts to move in another direction.

The user can trigger the following events:

• Start and Stop: Gesture-based control starts with a hand wave (Figure 2a) and stops when the user opens all fingers (Figure 2e).

• Move cursor: The user can move the cursor in the way described earlier. The Move gesture (Figure 2b) is a good base position because the click gesture is easy to form from it. However, users can also control the cursor with the index finger or with an open palm if the fingers are held together.

• Single click event with left button: Initially the hand is in the move posture, then the user extends his thumb and closes it again (Figure 2c).

• Double click event: Two single click gestures performed in sequence within about a second cause a double click.

• Single click event with right button: Initially the hand is in the move posture, then the user opens the index and ring fingers and closes them again (Figure 2d).

• Hold button down: Initially the hand is in the move posture, then the user presents the left button or the right button sign and holds his hand in this posture. After some frames, the cursor can move as well.

Fig. 2. Start (a), Move (b), Left button (c), Right button (d), Stop (e).

C. Constraints

Let us consider a coordinate system where the X- and Y-axes are parallel to the X- and Y-axes of the screen, and the Z-axis measures depth. It is assumed that the absolute value of the angle between the user's hand and the Y-axis is less than 90°, although this constraint is used only by the cursor moving steps. The other constraints come from the features of the sensor and the arrangement: the user has to sit far enough from the device, the rotation of the hand around the X- and Y-axes should be less than 30° and 20°, respectively, and the hand must be a foreground object. These restrictions correspond to those reported by other authors. Methods tolerate rotation of the hand around the Z-axis far better, because the gestures remain parallel to the image plane [15].

IV.

The main steps of our system are presented in Figure 3. In the following sections we describe these steps in detail.

Fig. 3. Main steps of our system

A. Depth data and preprocessing

The Kinect sensor measures the distance of objects from the sensor with its infrared projector and infrared camera. We use the OpenNI2 SDK [13] and the NiTE Middleware Libraries [14] to extract the depth image in 640 × 480 resolution and to track the user's hands. We assigned a predefined gesture (wave) to an instance of the NiTE HandTracker class. Hand tracking starts when the given gesture is recognised. If the tracking works and finds at least one hand, we retrieve the center point of the hand nearest to the sensor, the depth image and the corresponding real-world coordinates (Kinect space in Microsoft's terminology). The depth data are usually quite noisy, thus we apply a 3×3 median filter to the depth image.

In this paper $P = (p_x, p_y)$ denotes a pixel, $p_z$ denotes the depth value of $P$, and $c(P) = (c(P)_x, c(P)_y, c(P)_z)$ denotes the world coordinates corresponding to $P$.

Many solutions consider a big (or the biggest) connected foreground component as the hand [6], [10], [17], [21], but in the depth image the hand points often belong to multiple N8-connected segments because of self-occlusion of the hand and fingers. In our experience, the small parts around the biggest segment are almost always fingertips. Thus, they play an important role in hand sign recognition.

B. Hand points detection

In our previous work [19], we defined the hand points by using the part of the sphere around the known hand center. Since the hand is in the foreground during the control period, the method works well in most cases. Now, we complement the algorithm with a filter step to make it more robust.

1) Determining the hand candidate points: First, we create a mask image (Figure 4b) by the following formula, where 1 denotes the hand candidate points, 2 denotes the arm candidate points and 0 identifies the background.

$$H_c(P) = \begin{cases} 1, & \text{if } p_z - p^h_z < 2r \text{ and } \|c(P), c(P^h)\| < r, \\ 2, & \text{if } p_z - p^h_z < 2r \text{ and } r \le \|c(P), c(P^h)\| < r_2, \\ 0, & \text{otherwise,} \end{cases}$$

where $P^h$ is the center point of the hand and $\|\cdot,\cdot\|$ denotes the Euclidean distance. The values $r$ and $r_2$ are given radii ($r < r_2$). We set $r$ to 14 cm and $r_2$ to 17 cm based on a study of the physical extent of the hand, namely the hand length and the great span (the distance between the extended thumb and little finger) [1].

2) Removing non-hand part objects: Disturbing objects are usually the other (non-control) hand of the user and/or other objects on the table. When the hand is close to such objects, they can become hand and also arm candidate points. A whole object almost never becomes a hand candidate, because it usually lies on the table while the hand is in the air. Therefore, we delete all components that contain arm candidate points and do not contain the hand center point (Figure 4c). Our algorithm consists of the following steps.

1) Create the binary version of $H_c$, where arm and hand candidate pixels form the foreground and zero-valued pixels form the background.

2) Label the N8-connected components ($C$) of this binary image. Let $S_c$ denote the set of the "clear" hand components, $S_c = \{ s \mid s \in C \text{ and } P \in s \Rightarrow H_c(P) = 1 \}$, and let $S_m$ denote the other foreground components, $S_m = \{ s \mid s \in C \text{ and } s \notin S_c \text{ and } P \in s \Rightarrow H_c(P) \neq 0 \}$.

3) Find $s_h \in S_c \cup S_m$ containing the center of the hand.

4) Use the following formula to determine the hand mask:

$$H(P) = \begin{cases} H_c(P), & \text{if } P \in s_h \text{ or } \exists s \in S_c : P \in s, \\ 0, & \text{otherwise.} \end{cases}$$
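To make the two masks concrete, the following is a minimal sketch of the candidate mask $H_c$ and the final hand mask $H$ defined above. It is written in Python for brevity and is not the original C++ implementation. The function names, the list-of-lists image layout and the use of millimetres for world coordinates are assumptions made for illustration only.

```python
import math

R_HAND = 140.0  # r  = 14 cm; millimetres are an assumed unit
R_ARM = 170.0   # r2 = 17 cm


def candidate_mask(depth, world, center_px, r=R_HAND, r2=R_ARM):
    """Return H_c: 1 = hand candidate, 2 = arm candidate, 0 = background.

    depth[y][x] is the depth value p_z, world[y][x] is c(P) as (X, Y, Z),
    center_px is the (x, y) pixel of the hand centre P^h.
    """
    cx, cy = center_px
    center_depth = depth[cy][cx]
    center_world = world[cy][cx]
    mask = [[0] * len(row) for row in depth]
    for y, row in enumerate(depth):
        for x, pz in enumerate(row):
            if pz - center_depth >= 2 * r:
                continue  # too far behind the hand centre
            dist = math.dist(world[y][x], center_world)
            if dist < r:
                mask[y][x] = 1
            elif dist < r2:
                mask[y][x] = 2
    return mask


def components8(mask):
    """Yield the 8-connected components of the non-zero pixels as sets of (x, y)."""
    h, w = len(mask), len(mask[0])
    seen = set()
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0 or (x, y) in seen:
                continue
            comp, stack = set(), [(x, y)]
            seen.add((x, y))
            while stack:
                px, py = stack.pop()
                comp.add((px, py))
                for nx in (px - 1, px, px + 1):
                    for ny in (py - 1, py, py + 1):
                        if (0 <= nx < w and 0 <= ny < h
                                and mask[ny][nx] != 0 and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            stack.append((nx, ny))
            yield comp


def hand_mask(mask, center_px):
    """Return H: keep s_h (component with the hand centre) and every clear component."""
    out = [[0] * len(row) for row in mask]
    for comp in components8(mask):
        is_clear = all(mask[y][x] == 1 for x, y in comp)  # member of S_c
        if is_clear or center_px in comp:
            for x, y in comp:
                out[y][x] = mask[y][x]
    return out
```

In this sketch a component that contains arm candidate pixels survives only if it also contains the hand centre, which mirrors the deletion rule stated in the text.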

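The 3×3 median filtering applied to the noisy depth image in the preprocessing step can be illustrated in the same way. This is a plain Python sketch under stated assumptions: the text does not say how border pixels are treated, so here they use whichever neighbours exist, and `median_low` is chosen so that the result is always one of the measured depth values.

```python
from statistics import median_low


def median_filter_3x3(depth):
    """Apply a 3x3 median filter to a depth image given as a list of rows."""
    h, w = len(depth), len(depth[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the neighbourhood, clipped at the image border.
            window = [depth[ny][nx]
                      for ny in range(max(0, y - 1), min(h, y + 2))
                      for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median_low(window)
    return out
```

A single outlier pixel surrounded by consistent depth values is removed by this filter, while larger structures such as a finger are kept.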


Fig. 4. Depth image of a user with a mug in the foreground (a), hand point candidate mask of (a) (b), the final mask (c). (The figures contain only the relevant parts of the image and masks.)

C. Fingertip detection

1) Preparation of the hand mask: If the hand mask ($H$) is made up of two or more components, in this step we connect the separated ones to the hand blob ($s_h$). Figure 4c suggests connecting the two components at their closest points, but these may also belong to the contour of other fingers. To avoid this error, we connect the segments $s$ and $s_h$ at the points $P$ and $Q$ defined by the following formula:

$$(P, Q) = \arg\min_{P \in \gamma(s_h),\, Q \in s} \|P, Q\|,$$

where $\|\cdot,\cdot\|$ is the Euclidean distance and $\gamma(s_h)$ denotes the corners of the polygon approximating the contour of $s_h$. The approximation is performed by the Douglas-Peucker method [3] with $\varepsilon = 10$. Finally, we draw a wide line between $P$ and $Q$ on the hand mask (Figure 5).

Fig. 5. Connection of separated hand segments: original mask (a), contour approximation (b), connected mask (c).

2) Fingertips detection of extended fingers: To detect fingertip candidates, a well-known shape-based method [9] is applied with a slight modification. First, the hand contour ($s_h$) is approximated by a polygon in a coarser way ($\varepsilon = 15$), so the resulting polygon contains only very few vertices: fingertips, valleys between fingers, and a few other extreme points. Let us describe the polygon as a point sequence $\gamma(s_h) = (P_0, P_1, \ldots, P_n, P_{n+1})$, where $P_1, \ldots, P_n$ are the polygon corners in clockwise order, $P_0 = P_n$ and $P_{n+1} = P_1$. Then the method selects the extreme points by

$$F_e = \{ P_i \mid P_i \in \gamma(s_h) \cap \Gamma(s_h) \},$$

where $\Gamma(s_h)$ is the set of the convex hull corners of the hand component (Figure 6a). The next step is to classify the elements of $F_e$ into the finger candidate (Figure 6b) and non-finger classes based on the distance between the point and the hand center, and on the "angle" of the fingertip. The original method calculates the midpoint of two points located at a given distance before and after the candidate point along the contour. If the distance between the candidate and the midpoint is larger than a given limit, the candidate is a fingertip. We want to be more permissive, because we need to detect as a "fingertip" not only a single finger but also fingers held close together, as in our Move sign (Figure 2b). Thus, we check the angle of the polygon at this corner. More precisely, the set of candidates is defined by

$$F_c = \{ P_i \in F_e \mid \theta(P_i) < 90^\circ \text{ and } \|c(P_i), c(P^h)\| < 50 \},$$

where $\theta(P_i)$ is the angle between $\overrightarrow{P_i P_{i+1}}$ and $\overrightarrow{P_i P_{i-1}}$.

Fingertip detection methods often give false detections around the wrist. The authors of [6] define an ellipse around the wrist points and penalize a candidate if the path between the hand centre and the point contains ellipse points. Li [9] proposed detecting the bounding box of the hand and removing the candidates that fall in the lower region. Our hand segmentation method reliably detects part of the forearm (except when the hand totally hides the forearm region), thus we can easily eliminate the points located too close to the wrist region. The final set of fingertips (Figure 6c) is defined by

$$F = \{ P_i \in F_c \mid \forall P : H_c(P) = 2 \Rightarrow \|c(P_i), c(P)\| > 50 \}.$$
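The corner-angle test and the wrist filter described above can be sketched as follows. This is an illustrative Python sketch, not the original implementation. The function names are hypothetical, the convex hull corners are assumed to be computed elsewhere, and the additional distance-to-centre condition of $F_c$ is left out. For brevity the sketch uses one coordinate space for both steps, whereas the text measures distances in world coordinates.

```python
import math


def corner_angle(prev_pt, pt, next_pt):
    """Angle in degrees at pt between the vectors pt->next_pt and pt->prev_pt."""
    ax, ay = next_pt[0] - pt[0], next_pt[1] - pt[1]
    bx, by = prev_pt[0] - pt[0], prev_pt[1] - pt[1]
    cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def sharp_hull_corners(polygon, hull, max_angle=90.0):
    """Polygon corners that are also hull corners and are sharper than max_angle."""
    hull_set = set(hull)
    n = len(polygon)
    return [p for i, p in enumerate(polygon)
            if p in hull_set
            and corner_angle(polygon[i - 1], p, polygon[(i + 1) % n]) < max_angle]


def drop_wrist_candidates(candidates, arm_points, min_dist=50.0):
    """Keep only candidates farther than min_dist from every arm candidate point."""
    return [c for c in candidates
            if all(math.dist(c, a) > min_dist for a in arm_points)]
```

For a polygon with one narrow spike, only the tip of the spike passes the angle test, and a candidate lying next to arm points is discarded by the second function.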

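The Douglas-Peucker approximation [3] used for the contour polygons can be sketched in a few lines. The version below is a textbook-style recursive variant for an open polyline, written in Python. It measures the distance to the chord segment, and it is only an assumption that the original system behaves the same way on closed contours. The parameter `epsilon` plays the role of $\varepsilon$ in the text.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [first, last]
    # Split at the farthest point and simplify both halves.
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right
```

A larger `epsilon` removes more vertices, which matches the use of a coarser value for fingertip detection than for connecting the hand segments.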
Fig. 6. Common points of the approximating polygon (gray line) and the convex hull (a), fingertip candidates (b), detected fingertips (grey circle) and removed candidates (black rectangle) (c).

D. Thumb and index finger recognition

If we have found a fingertip in the previous step, we need to recognise the thumb and the index finger. First, we compute the orientation of the hand mask ($H = 1$ or $H = 2$) with PCA. Then we rotate the fingertip points ($F$) clockwise around the hand centre by the angle between the eigenvector belonging to the largest eigenvalue and the vertical axis. Then the fingertips are sorted in increasing order of their x coordinate. Let $F'_1, \ldots, F'_n$ denote the rotated, sorted fingertips. Labelling steps:

1) Set the label of $F'_1$ (the leftmost fingertip) to thumb if the angle of the line between the hand center and $F'_1$ from the Y-axis is larger than 60°.

2) Otherwise, set the label of $F'_1$ to index if the angle of the line between the hand center and $F'_1$ from the Y-axis is larger than -20°.

3) Otherwise, if $F'_2$ exists, set it to index if the angle of the line between the hand center and $F'_2$ from the Y-axis is larger than -20°.

E. Control sign recognition

A simple rule-based solution can recognise the sign, since the necessary data, namely the number of fingertips and their labels, are already available.

• Move sign: only one fingertip is detected and this is the index finger.

• Left button sign: two fingertips are detected, one is the index and the other is the thumb, and the distance between these fingers is larger than 6 cm.

• Right button sign: two fingertips are detected; one is the index, the other is not the thumb. The distance between the fingers is between 2.5 cm and 7 cm.

• Stop sign: the number of fingertips equals 5.

• Start sign: this gesture is pre-built in NiTE.

F. The 3D orientation of hand and fingers

To realize the joystick-like movement of the cursor we need the 3D orientation, but the previous steps computed only the 2D one. If the fingers are almost parallel to the plane of the sensor, the orientation can be calculated easily. Otherwise, it is easy to get a wrong result. The orientation of the full hand can be determined more precisely, but moving the whole hand requires more effort from the user. We solved this issue by applying Weighted Principal Component Analysis (WPCA). The orientation of the whole hand and forearm point cloud is computed with WPCA only if the index finger was found in the previous steps. The weights of the points are determined by their distance from the index finger: the weight of the index finger is five, and it is reduced by one for every three centimetres until the weight becomes one.

G. Events

In a real-time control system, even one misclassification out of 100 can drive the user mad. A state machine can radically reduce this problem by ignoring most of the unexpected signs. For the sake of clarity, the figure does not show the stop sign and the unexpected events. The stop sign sets the machine state to Stopped immediately. A sequence of unexpected signs takes the machine to the Wait state from every other state except Stopped. Initially, the system is in the Stopped state, waiting for the start signal to go to the Alarm state. From this state, the system can be woken up easily with the move sign. These two states help us to avoid involuntary movements. The Stopped state means that the user wants to switch the cursor off for the medium or long term, while the Alarm state is used when the user needs only a momentary pause. Basically, the move sign makes the cursor move, but it is also possible to move the cursor while the mouse button is down. In the second case, the button down event does not start the cursor movement right away, in order to prevent the cursor from moving while clicking. The waiting cycle in the Left button up event state ensures that the cursor does not move if one left click is followed by another within about a second. Since the click gesture is slower than traditional mouse clicking, we increase the double-click time of the operating system during the Alarm state. Here, we reduce the cursor speed as well. Naturally, we restore the original values when the user stops the system.

V.
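The filtering effect of such a state machine can be illustrated with a small sketch. The following Python code is a simplified reading of the description and not the machine used in the system: the state names Move, LeftDown and RightDown, the threshold of three consecutive unexpected signs, and the rule that the move sign leaves the Wait state are all assumptions, and timing (double click, delayed movement) is omitted.

```python
STOPPED, ALARM, MOVE, LEFT_DOWN, RIGHT_DOWN, WAIT = (
    "Stopped", "Alarm", "Move", "LeftDown", "RightDown", "Wait")

# Expected (state, sign) pairs; every other pair counts as an unexpected sign.
TRANSITIONS = {
    (ALARM, "move"): MOVE,
    (WAIT, "move"): MOVE,  # assumed way out of the Wait state
    (MOVE, "move"): MOVE,
    (MOVE, "left"): LEFT_DOWN,
    (MOVE, "right"): RIGHT_DOWN,
    (LEFT_DOWN, "left"): LEFT_DOWN,
    (LEFT_DOWN, "move"): MOVE,
    (RIGHT_DOWN, "right"): RIGHT_DOWN,
    (RIGHT_DOWN, "move"): MOVE,
}


class GestureStateMachine:
    def __init__(self, max_unexpected=3):
        self.state = STOPPED
        self.max_unexpected = max_unexpected  # hypothetical tolerance
        self._unexpected = 0

    def feed(self, sign):
        """Consume one recognised sign and return the new state."""
        if sign == "stop":
            # The stop sign always wins, as stated in the text.
            self.state, self._unexpected = STOPPED, 0
        elif self.state == STOPPED:
            if sign == "start":
                self.state = ALARM
        elif (self.state, sign) in TRANSITIONS:
            self.state = TRANSITIONS[(self.state, sign)]
            self._unexpected = 0
        else:
            # Isolated misclassifications are ignored; a run of them pauses control.
            self._unexpected += 1
            if self._unexpected >= self.max_unexpected:
                self.state, self._unexpected = WAIT, 0
        return self.state
```

With this structure a single misrecognised frame leaves the state unchanged, so it cannot trigger a click or stop the cursor on its own.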

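The rule-based control sign recognition can be written down almost directly from its list of rules. The Python sketch below assumes that each fingertip comes with a label ("thumb", "index" or anything else) and a 3D position in centimetres. The function name, the data layout and the inclusive bounds of the right button rule are illustrative assumptions. The start sign is not covered because it is recognised by NiTE.

```python
import math


def classify_sign(fingertips):
    """Classify a list of (label, (x, y, z)) fingertips, positions in centimetres."""
    labels = [label for label, _ in fingertips]
    if len(fingertips) == 5:
        return "stop"
    if len(fingertips) == 1 and labels[0] == "index":
        return "move"
    if len(fingertips) == 2 and "index" in labels:
        gap = math.dist(fingertips[0][1], fingertips[1][1])
        if "thumb" in labels:
            return "left" if gap > 6.0 else "unknown"
        if 2.5 <= gap <= 7.0:
            return "right"
    return "unknown"
```

Anything that matches none of the rules is reported as unknown, which is the kind of input a state machine can then treat as an unexpected sign.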

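The distance-based weighting of the WPCA step can be illustrated as follows. This Python sketch assumes a stepwise reading of the rule (one unit less per full three centimetres) and estimates the main axis of the weighted covariance matrix by power iteration, which is a swapped-in technique chosen to avoid external libraries. The text does not state how the eigenvectors are computed in the original system.

```python
import math


def wpca_weight(distance_cm):
    """Weight of a point lying distance_cm from the index fingertip (5 down to 1)."""
    return max(1, 5 - int(distance_cm // 3))


def weighted_principal_axis(points, weights, iterations=100):
    """Unit vector along the main axis of a weighted 3D point cloud."""
    total = sum(weights)
    mean = [sum(w * p[k] for p, w in zip(points, weights)) / total
            for k in range(3)]
    cov = [[sum(w * (p[i] - mean[i]) * (p[j] - mean[j])
                for p, w in zip(points, weights)) / total
            for j in range(3)] for i in range(3)]
    v = [1.0, 1.0, 1.0]  # arbitrary start vector
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:
            break  # degenerate cloud or unlucky start vector
        v = [c / norm for c in v]
    return v
```

Points near the index finger pull the estimated axis towards the direction of the finger, while the rest of the hand and forearm still stabilise the estimate.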
The most common evaluation measures for pointing devices are speed and accuracy. Because this section reports only the first experimental results, we do not consider other types of accuracy measures [23].

A. Speed test

We have developed a very simple application to measure how fast the user can position the cursor and click on an object. It is a simple panel on which a circle appears. The tester has to click on the circle as fast as he can. After each click, the size and the position of the circle change randomly (even if the user clicks in the wrong place).

B. Graphical User Interface test

First several buttons (requiring a single click with the left mouse button), then several icons (requiring a double click) appear on the screen, and the user has to click on the yellow one. The icons and the buttons appear in descending order of size (128×128, 64×64, 100×30, 32×32, 24×24, 16×16, 12×12). Then the marked item or value of different types of controls (list box, list box with scroll bar, combo box, radio button, check box and track bar) has to be selected.

C. The Scenario

The positions of the screen and the Kinect were fixed (Figure 1). The users were asked to sit comfortably and could keep their elbows on the armrest during the test. Some figures and short texts helped the users learn the functions of the gesture mouse. Below the description there were some controls, each dedicated to a particular function. Users could try some of the elements, so the learning process was short (1-2 minutes) but interactive. After that, the users did the GUI and speed tests with the gesture-based mouse and with the traditional computer mouse as well. (The order of the control devices was always varied.) Finally, they were asked to rate the comfort of the controls and to share their experience. Users could solve some exercises in different ways. Take a scrollable list box as an example: the tester could click the scroll bar several times to scroll down the list, or he could drag the thumb of the scroll bar down. Nine volunteers performed the test; two of them were "advanced users" who had used the system on a number of occasions.

D. Results


The control access time of first-time testers was 6.4 times slower, and that of the advanced users only 2.4 times slower, with gestures than with the traditional mouse. The track bar was the least popular control, although all the false selections belonged to the scrollable list box.

Table I presents the results of the advanced users' speed tests. Figure 7 presents the outcome of the GUI tests.

TABLE I. Column headers: radius; gesture-based mouse, speed (px/ms); computer mouse, speed (px/ms).

Fig. 7. The results of the GUI test: median of the elapsed time between the appearance of the controls and the successful click.

We have developed a vision-based gesture mouse system. The proposed method uses only depth data, thus the system works even at night. We successfully combined a well-known shape-based method and a simple extrema-based solution to detect the fingertips needed for control. Based on the reliable detection of the hand points and the wrist-forearm region, we can filter out false fingertip candidates very effectively. In the on-line system some false detections always occur, but the state machine ensures the robustness of our system. We have performed a test to examine how easy it is to use common GUI elements with our system and how quickly users can click on given objects on the screen. Every tester was able to handle even the smallest elements (buttons of 12×12 pixels), but most first-time testers found only the larger ones, of at least 32×32 pixels, easy to use. Users said that controlling by gestures feels a bit weird initially, but it becomes easy once you get the hang of it. Some practice resulted in much faster control than before. The advanced users reached slightly lower accuracy with the gesture-based system than with the traditional mouse (97.60% and 98.56% on the speed test).

ACKNOWLEDGEMENT

The publication was supported by the TÁMOP-4.2.2.C-11/1/KONV-2012-0001 project. The project has been supported by the European Union, co-financed by the European Social Fund.

REFERENCES

[1] A. K. Agnihotri, B. Purwar, N. Jeebun, S. Agnihotri, Determination Of Sex By Hand Dimensions, The Internet Journal of Forensic Science, 1(2), (2006)
[2] M. Betke, J. Gips, and P. Fleming, The camera mouse: visual tracking of body features to provide computer access for people with severe disabilities, IEEE Transactions on Neural Systems and Rehabilitation Engineering, 10(1), pp. 1–10. (2002)
[3] D. Douglas and T. Peucker, Algorithms for the reduction of the number of points required to represent a digitized line or its caricature, The Canadian Cartographer, 10(2), pp. 112–122. (1973)
[4] C. Graetzel, T. Fong, S. Grange, C. Baur, A non-contact mouse for surgeon-computer interaction, Technology and Health Care, 12(3), pp. 245–257. (2004)
[5] F. Klompmaker, K. Nebe, A. Fast, dSensingNI: a framework for advanced tangible interaction using a depth camera, In Proceedings of the Sixth International Conference on Tangible, Embedded and Embodied Interaction, ACM, pp. 217–224. (2012)
[6] P. Krejov, R. Bowden, Multi-touchless: Real-time fingertip detection and tracking using geodesic maxima, In Proceeding of 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), IEEE, pp. 1–7. (2013)
[7] N. Kyriazis, I. Oikonomidis, and A. Argyros, A GPU-powered computational framework for efficient 3D model-based vision, ICS-FORTH, TR420, (2011)
[8] P. D. Le, V. H. Nguyen, Remote Mouse Control Using Fingertip Tracking Technique, In AETA 2013, Recent Advances in Electrical Engineering and Related Sciences, Springer Berlin Heidelberg, pp. 467–476. (2014)
[9] Y. Li, Hand gesture recognition using Kinect, In Proceeding 3rd International Conference on Software Engineering and Service Science (ICSESS), IEEE, pp. 196–199. (2012)
[10] H. Liang, J. Yuan, D. Thalmann, 3D fingertip and palm tracking in depth image sequences, In Proceedings of the 20th ACM international conference on Multimedia, pp. 785–788. (2012)
[11] Microsoft Corporation (2012), MS Developer Network - Kinect Sensor, URL, Accessed 26 August 2013
[12] I. Oikonomidis, N. Kyriazis, and A. Argyros, Efficient model-based 3D tracking of hand articulations using Kinect, In British Machine Vision Conference, (2011)
[13] OpenNI (2013), OpenNI 2.0 Software Development Kit, URL, Accessed 25 July 2013
[14] PrimeSense™ (2013), Natural Interaction Middleware libraries version 2.2, URL, Accessed 25 July 2013
[15] A. R. Sarkar, G. Sanyal, S. Majumder, Hand Gesture Recognition Systems: A Survey, International Journal of Computer Applications, 71(15), (2013)
[16] Z. Ren, J. Yuan, Z. Zhang, Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera, In Proceedings of the 19th ACM international conference on Multimedia, ACM, pp. 1093–1096. (2011)
[17] S. Qin, X. Zhu, Y. Yang, Y. Jiang, Real-time hand gesture recognition from depth images using convex shape decomposition method, Journal of Signal Processing Systems, 74(1), pp. 47–58. (2014)
[18] K. Sung-Phil, J. D. Simeral, L. R. Hochberg, J. P. Donoghue, G. M. Friehs, M. J. Black, Point-and-Click Cursor Control With an Intracortical Neural Interface System by Humans With Tetraplegia, IEEE Transactions on Neural Systems and Rehabilitation Engineering, 19(2), pp. 193–203. (2011)
[19] Sz. Szeghalmy, M. Zichar, A. Fazekas, Comfortable mouse control using 3D depth sensor, In IEEE 4th International Conference on Cognitive Infocommunications, pp. 219–222. (2013)
[20] Wachs J. P., K. Mathias, S. Helman, E. Yael, Vision-based hand-gesture applications, Commun. ACM, 54(2), pp. 60–71. (2011)
[21] H. S. Yeo, B. G. Lee, H. Lim, Hand tracking and gesture recognition system for human-computer interaction using low-cost hardware, Multimedia Tools and Applications, pp. 1–29. (2013)
[22] P. Baranyi, A. Csapo, Definition and Synergies of Cognitive Infocommunications, Acta Polytechnica Hungarica, vol. 9, pp. 67–83. (2012)
[23] I. Scott MacKenzie, Tatu Kauppinen, Miika Silfverberg, Accuracy measures for evaluating computer pointing devices, Proceedings of the SIGCHI conference on Human factors in computing systems, ACM, pp. 9–16. (2001)
[24] G. Sallai, The Cradle of Cognitive Infocommunications, Acta Polytechnica Hungarica, vol. 9, no. 1, pp. 171–181. (2012)
