Kinect: natural or involuntary interface?

Stefano Bussolon
Università degli Studi di Trento

[email protected]

ABSTRACT
Released in November 2010 as a game controller for the Xbox 360 video game platform, the Microsoft Kinect sensor has been a huge success: 10 million units were sold in the first 4 months after its release. Shortly after, a number of hackers and developers were able to produce an open source version of the drivers. Since then, a number of "Kinect hacks" have been published on the internet. The capabilities of the Kinect sensor, thanks to its hardware characteristics and to the middleware that has meanwhile been published under an open source license, are really impressive, and seem to finally realize the promise of a natural, kinetic user interaction. The current work analyses the possibilities offered by the device, its limits as a substitute for traditional input devices, and the fields where it can be used to implement a ubiquitous, incidental, unobtrusive interaction. The last section of the paper briefly describes a couple of sketches, realized to test the possibility of capturing the Kinect input using just open source software and libraries.

Categories and Subject Descriptors H.5.2 [Information Systems]: Information Interfaces and Presentation—User Interfaces

General Terms NUI, Kinect

Keywords ACM proceedings, Kinect, NUI, ubiquitous computing

1. INTRODUCTION
On the first of April of this year, Google announced a new, revolutionary interface for their mail web application: Gmail Motion. With it, users could send and receive emails without using the mouse and the keyboard, relying instead on "your computer's built-in webcam and Google's patented spatial tracking technology to detect your movements and translate them into meaningful characters and commands. Movements are designed to be simple and intuitive for people of all skill levels."1 It was, of course, an April Fools' joke. Nonetheless, a few hours later, some researchers at the University of Southern California Institute for Creative Technologies (ICT) were able to make Gmail Motion work. Thanks to the Kinect and some open source libraries (OpenNI, NITE and FAAST) they were able to realize the gestures imagined by the Gmail Motion hoax: opening an email as if it were an envelope, replying by throwing a thumb back, licking the stamp to send the response. Most of their code is open source, so if you own a Kinect you can install the software and use Gmail through gestures.
The hack realized by the researchers at ICT is a clever demonstration of the impressive capabilities of the Kinect hardware. To get a sense of how such a technology is changing the field, it is useful to recall the state of the art of just a few years ago. A decade ago, [8], in a survey of the visual analysis of human movement, concluded that "[a]lthough one appreciates from this survey the large amount of work that already has been done in this area, many issues are still open, e.g., regarding image segmentation, use of models, tracking versus initialization, multiple persons, occlusion, and computational cost." The scenario did not change very much in the following ten years, if [17], in their "Machine Recognition of Human Activities: A Survey", concluded that "[s]ynergistic research efforts in various scientific disciplines, computer vision, artificial intelligence, neuroscience, linguistics, etc., have brought us closer to this goal than at any other point in history. However, several more technical and intellectual challenges need to be tackled before we get there. The advances made so far need to be consolidated, in terms of their robustness to real-world conditions and real-time performance."
The body tracking mechanism of the Kinect is still not perfect. The user has to assume a particular posture to allow the system to correctly recognize her body parts before tracking them, and this could be an important limitation in some of the possible applications I will suggest in the following sections. The system, however, is already able to recognize a human body even without completely recognizing the body parts, and can automatically track the positions of several people at once.

1 http://mail.google.com/mail/help/motion.html

2. HARDWARE AND MIDDLEWARE
The Kinect's capabilities rest on both hardware and software components. The Kinect sensor is a horizontal bar connected to a small base with a motorized pivot. The device features an RGB camera, a depth sensor and a multi-array microphone, which together provide full-body 3D motion capture, facial recognition and voice recognition capabilities.2 The Kinect uses a software technology developed by PrimeSense, which interprets 3D scene information from a continuously projected infrared structured light. OpenNI, a middleware layer, transforms the RGB and depth images into a number of structured outputs:
• Full body analysis middleware (typically joints, orientation, center of mass)
• Hand point analysis middleware
• Gesture detection middleware, which identifies predefined gestures (for example, a waving hand) and alerts the application
• Scene analyzer middleware, able to separate the foreground from the background, calculate the 3D coordinates, and identify individual figures in the scene.
OpenNI (Open Natural Interaction) is a multi-language, cross-platform framework that defines APIs for writing applications utilizing the so-called Natural Interaction. The open, layered structure of OpenNI allows the framework to be decoupled from a particular sensor: Asus, in collaboration with PrimeSense, will soon offer a motion sensing device quite similar to the Microsoft Kinect. The design approach of OpenNI, therefore, is to enable developers to build software and applications on the middleware that should work with different sensors from different vendors.
The PrimeSense software technology is not the only one that has dramatically improved body, face and gesture recognition. Zdenek Kalal, a student at the University of Surrey, UK, developed during his Ph.D. thesis TLD (Track, Learn and Detect), an award-winning, self-improving real-time vision algorithm for tracking unknown objects in video streams [10].
An impressive number of "Kinect hacks" have flourished in the few months since the release of the sensor and of the open source development kit. Web blogs3,4 post a new hack almost every day. Robots that emulate the movements of a person, hip hop dance commercial videos, a "virtual fitting room" in Moscow, 3D object manipulation, video game proofs of concept and magic tricks are the topics of some of the latest posts on those blogs. The results are impressive, if we compare them with the state of the art as described just three years ago. Nowadays, if we fancy it, we can send an e-mail using the gestures suggested by Google in their hoax.
Such improvements have fueled the enthusiasm of the supporters of natural user interfaces. The OpenNI documentation defines the term Natural Interaction (NI) as a concept where human-device interaction is based on human senses, mostly focused on hearing and vision. The document asserts that human-device NI paradigms render external peripherals such as remote controls, keypads or a mouse obsolete. NI is based on speech recognition, hand gestures, and body motion tracking. Steve Ballmer, CEO at Microsoft, affirmed that "we will look back on 2010 as the year we expanded beyond the mouse and keyboard and started incorporating more natural forms of interaction such as touch, speech, gestures, handwriting, and vision – what computer scientists call the NUI or natural user interface." [1]5 The enthusiasm expressed by Ballmer is understandable: Kinect sensor sales reached 10 million units in the first four months after its release.
Others tend to be even more optimistic. Some NUI evangelists [3] believe that the NUI will substitute the GUI (graphical user interface, or WIMP, "window, icon, menu, pointing device") just like the GUI substituted the CLI (command line interface). The advent of touchscreen smartphones and tablets, which use forms of interaction absent in the classical WIMP interfaces, like multitouch, partially confirms the advent of a post-WIMP generation of devices. The idea of some post-WIMP kind of interaction has been in the air for a long time. It is interesting to note, however, that it is always difficult to foresee the future of technology. In his "Noncommand User Interfaces" article, [12] bet on systems like eyetracking and software agents, but none of them became the next user interface. Should we therefore expect the decline of the GUI as we know it, eclipsed by the advent of the natural user interfaces? I don't think so, for a number of reasons.

2 http://en.wikipedia.org/wiki/Kinect
3 http://kinecthacks.net/
4 http://kinect.dashhacks.com/

3. LIMITS OF THE NUI
3.1 NUIs are not so natural
The first reason is pointed out by [14]: "Most gestures are neither natural nor easy to learn or remember". Let's think about the Gmail Motion hoax: "to open a message, make a motion as if you were opening an envelope. To reply, simply point backward with your thumb". For a moment, let's forget that it was a joke: someone implemented it, and we can therefore reason about it. When we interact with Gmail, we open messages; messages are the modern substitute for old fashioned paper mail, and therefore the most "natural" gesture seems to be to mimic the opening of an envelope. When interacting with a computer, however, we open a lot of stuff: documents, files, and so on. Probably, the most natural movement to mimic the opening of a document is like opening a book; to open a directory we should act as if we were pulling a drawer, and so on. Should the next generation of interaction designers invent a different gesture for each kind of virtual object we have to open? Or should they converge on a convention, and decide that the same gesture will open everything?
An interface based on gestures has to be founded on most of the same principles valid for traditional GUIs. Gestures have to be internally and externally consistent, they have to be visible, they have to provide feedback, they have to forgive users their errors. And, despite the "natural" brand, they have to be based on conventions. The lack of consistency of these interfaces has been cited by [13] as their main usability problem. This, again, means that gesture-based interfaces are subject to mostly the same usability principles valid for graphical user interfaces, and that the interaction design of this kind of interface is still immature. We should expect, however, that a number of interaction standards, and libraries, will evolve in the next few months, able to overcome the consistency issues that emerged in Nielsen's analysis.
In the same article, [13] noticed that, while testing some videogames, some system alerts went unnoticed by the users because they were so deeply involved in playing their games. Nielsen correctly noticed that this is both an issue (a warning goes unnoticed) and a demonstration of the level of interest and immersion a user can reach while playing a game through the Kinect: the users are sometimes too engaged in their action to notice something that falls outside their focus. This is a great point for gesture interaction; it is not surprising that the sensor has been designed and released mainly to interact with a gaming system like the Xbox. It should be relatively safe to predict that gesture interfaces are changing the rules in the field of videogames, and will probably be able to seamlessly substitute the GUI in the fields of multimedia consumption and teleconferencing (note: teleconferencing seems a strategic field in the plans of Microsoft: the recent acquisition of Skype enables the company to become the major player in the field, thanks to the interaction of the Kinect audio and video sensors, the Xbox, and the Skype software and network).

5 http://www.huffingtonpost.com/steve-ballmer/ces-2010a-transforming-t b 416598.html

3.2 The NUI against the GUI?
Despite its power and the impressive possibilities demonstrated by the Kinect sensor and its open middleware software, shown by the innumerable Kinect hacks published on the net, it is very improbable that it will be able to make the mouse and the keyboard obsolete. The most frequent uses of the personal computer (and, most recently, of the tablets) involve office software (word processor, spreadsheet), e-mail, and browsers. The iPad and the tablets are changing the way people consume textual and multimedia contents, mainly because they grant higher mobility and a comparable interaction. When the user has to input some text, however, she needs a keyboard. Touchscreens allow the device to simulate it through the screen; the hardware keyboard is no longer necessary, but its simulation is still needed. The virtual keyboard on a smartphone is very useful for sending a text, posting a tweet or writing a brief reply to an email. But when you have to write a report or an essay, the old fashioned hardware keyboard is still the best option.
Can we really imagine substituting the keyboard and the mouse with gestural interfaces? Until artificial speech recognition reaches an acceptable error rate (humans' error rate is around 2-4%, while software speech recognition seems unable to go under a threshold of around 20%), it is difficult to imagine it. Yes, the movement of a hand in a gesture environment can simulate a pointer on the screen, and it can be used to click and to surf. It is possible to use the pointer to interact with a virtual keyboard, but its efficiency cannot be compared with that of the hardware keyboard. We can therefore imagine a scenario where the gestural user interface will dominate the contexts of gaming, multimedia consumption and teleconferencing; touchscreen devices are already dominating those uses where mobility is more important than input efficiency; but when the user has to do a lot of text input (word processing, text editing), and when she has to perform complex and precise actions, the keyboard and the mouse still seem to be by far the most efficient solution.

4. NONCOMMAND, INCIDENTAL, UNOBTRUSIVE INTERACTION

Technologies like body tracking and artificial visual recognition are not here to substitute old fashioned but still highly efficient input systems like the keyboard and the mouse; they are probably here to widen the opportunities for interaction between a human agent and an interactive system. The concept of ubiquitous computing is as old as the WIMP interface - and, interestingly, both were initially developed in the same place, the Xerox PARC. The idea (visionary for the time it was proposed, at the beginning of the nineties) was that the computer would be ubiquitous, scattered around the environment (the house, the office, the city) [19]. The interaction with a ubiquitous device has to be different from the interaction implicit in the computer. The computer assumes that the full attention of the user is focused on the task it is performing, and the graphical user interface is largely based on this assumption. The ubiquitous paradigm cannot make the same assumption, and therefore has to develop different styles of interaction. In an Ubicomp environment, interaction has to be natural, context aware, and calm, that is, unobtrusive. Most importantly, while classical computer interaction is based on the commands of the user, the ubiquitous scenario assumes that most interactions cannot be seen as commands: it is the concept of noncommand user interfaces exposed by [12]. Such interaction is mostly incidental. [6] defines as incidental the kind of interaction where actions performed for some other purpose, or unconscious signs, are interpreted in order to influence, improve or facilitate the actors' future interaction or day-to-day life. In my opinion, the Kinetic User Interface - as proposed by [15] - could constitute the most effective, low-cost way to implement those kinds of interactions that have to be incidental and unobtrusive.

5. FIELDS OF APPLICATION
The fields of application of this kind of interaction are very numerous, and I will review just some of them in the following subsections.

5.1 Marketing: grocery shopping paths
Studying consumers' in-store behavior is an important topic for academic researchers and industry practitioners alike. Researchers are particularly interested in better understanding the factors that drive the dynamics of a consumer's shopping trip. Understanding the movements of consumers may lead to important managerial implications regarding the design of retail space and product placement [9]. In the past, the study of consumers' behaviour was conducted in laboratory settings, in virtual environments [4], or using qualitative analysis. Only recently has the path taken by individual shoppers in an actual grocery store been tracked, using RFID (radio frequency identification) tags located on their shopping carts [11]. RFID is a relatively new data collection technology, and it does have certain caveats. For example, shoppers who do not use shopping carts are not tracked. Thus, the measure of shopper density is not exact; however, it is assumed to be a reasonable proxy for the actual density. Integrating the RFID system with body tracking technology would probably improve the quality of the data collected; for instance, it would be easier to measure the time a consumer spends deciding which product to choose.

5.2 Museum Visitors' Behavior Patterns
The behaviour of museum visitors has been studied for decades. [18] proposed 4 clusters of visitors, based on ethnographic observations of the behaviour of a number of visitors in several museums: the ANT visitor tends to follow a specific path and spends a lot of time observing almost all the exhibits; the FISH visitor most of the time moves around the centre of the room and usually avoids looking at exhibits' details; the BUTTERFLY visitor does not follow a specific path but rather is guided by the physical orientation of the exhibits and stops frequently to look for more information; finally, the GRASSHOPPER visitor seems to have a specific preference for some preselected exhibits and spends a lot of time observing them while tending to ignore the others. Recently, ethnographic observation has been replaced by the use of handheld devices, again equipped with RFID technology [20]. The technology has here the same potential and issues described in the previous paragraph. In this case, however, the major limitation is that only a minority of visitors use the handheld device during their visit. The integrative use of RFID systems with the completely unobtrusive body tracking capability of the Kinect sensor would therefore improve the quantity and the quality of the data researchers are able to collect and analyze.

5.3 Intelligent queue management

[2]6, in his article for the New York Times, describes how the crew at Walt Disney World tries to manage the queues of visitors at the park's attractions: "Located under Cinderella Castle, the new center uses video cameras, computer programs, digital park maps and other whiz-bang tools to spot gridlock before it forms and deploy countermeasures in real time". Intelligent queue management is an issue for many organizations. Tesco, the UK's largest grocery chain, in 2007 won an award for its efforts to minimize in-store queue waiting time [16]. The chain uses SMARTLANE queue-busting camera technology, developed by IRISYS, to measure and predict customers' arrival at checkouts. Minimizing the waiting time is considered a key factor in boosting Tesco's profits. A body tracking system like the Kinect could be programmed into a cost-effective device to measure customers' arrivals and to analyse and manage queues.
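As a purely illustrative sketch of the idea (the area boundaries, the time window and the userSeenAt() callback are invented placeholders, not part of any existing product), a rough queue-length estimate could be obtained by counting the distinct tracked persons recently seen inside the checkout area, given the per-user positions provided by a body tracking middleware:

import java.util.HashMap;

// for each tracked user id, the last time (in ms) the user was seen
// inside the hypothetical checkout area
HashMap<Integer, Integer> lastSeenInQueue = new HashMap<Integer, Integer>();

// to be called whenever the tracking middleware reports a user position;
// x and z are assumed to be floor coordinates in metres
void userSeenAt(int userId, float x, float z) {
  // invented boundaries of the queue area in front of the checkouts
  if (x > 1.0 && x < 3.0 && z > 0.5 && z < 4.0) {
    lastSeenInQueue.put(userId, millis());
  }
}

// rough queue length: users seen in the area during the last 5 seconds
int queueLength() {
  int count = 0;
  for (int t : lastSeenInQueue.values()) {
    if (millis() - t < 5000) count++;
  }
  return count;
}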

5.4 Health Smart Homes

One of the promises of ubiquitous computing is to make our homes smarter: intelligent systems able to sense, anticipate and respond to activities in the house and therefore improve the living quality of their inhabitants [5]. [5], however, noticed that "expanding system capabilities can easily overstep some invisible boundary, making families feel at the mercy of, instead of in control of that technology". This observation further confirms the idea that the fact that something is possible does not mean that it is desirable; we have to design ubiquitous systems that really help people, and avoid making them feel out of control in their own home. One of the contexts where movement and body tracking could already play an important role is the monitoring of elderly people at home, to detect a loss of autonomy as early as possible [7]. Elderly people often live alone and have to be autonomous; telemonitoring systems, able to detect a distress situation (a fall, for instance) or a significant change in the habits or behavior of the person, constitute an important solution for the monitoring of elderly people who live alone. Some research groups have been working on Health Smart Homes equipped with various sensors. Typically, those houses are equipped with motion and environmental sensors, video cameras (for fall detection and activity recognition), and RFID tags. Equipping each room in the house with a Kinect sensor would probably increase the quality of telemonitoring. The automatic and timely detection of falls - probably the most important problem for elderly people - for example, would be much easier when using a body tracking system like the one provided by the Kinect sensor.
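To give a flavour of why tracked body coordinates make this easier, the toy heuristic below flags a fall when the tracked head drops below a fixed fraction of the person's standing height. The joint used, the threshold and the single-threshold approach itself are invented simplifications for the example, not a real fall-detection algorithm:

// naive illustration only: headHeight is assumed to be the height of the
// tracked head above the floor, in metres, as provided by the middleware
float standingHeight = 0;

boolean fallDetected(float headHeight) {
  // keep a running estimate of the person's standing height
  standingHeight = max(standingHeight, headHeight);
  // invented threshold: below 30% of the standing height counts as a fall
  return standingHeight > 0 && headHeight < 0.3 * standingHeight;
}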

6 http://www.nytimes.com/2010/12/28/business/media/28disney.htm

Figure 1: The virtual keyboard

5.5 Conclusions
In this section, I gave a short survey of the fields where a system able to track the body movements of one or more individuals could appreciably improve the state of the art of ubiquitous, incidental and unobtrusive interactions. Those applications range from research to marketing to assistive technologies. Another area, not mentioned here, where the Kinect sensor could improve the efficacy of a system is the field of surveillance.

5.5.1 Privacy
In this essay I did not take into account the issue of privacy, which is a serious problem but is outside the focus of this work. Technically speaking, however, frameworks like OpenNI, if used appropriately, could increase the privacy of the people under observation. Let's take, as an example, the telemonitoring system. If the goal of the system is to track the movements of the person, the designer could decide that the output of the system will not be the camera image, but just the position of the body tracked by the middleware. The system could switch to the camera image only if something unusual has been signalled by tracking the position of the body, for instance to verify whether the user has fallen.

6. USING THE KINECT WITH PROCESSING
To test the capabilities of the Kinect sensor and the OpenNI middleware, I decided to create a couple of very simple applications. The programs are just proofs of concept, realized with the main goal of testing how difficult it is to write software able to interact with the sensor. Both prototypes have been realized on a Dell Latitude E5500 laptop running a Linux operating system (Ubuntu 11.04, Natty). The choice of Linux as OS slightly limited my possibilities, as some libraries are, to date, developed only for the Microsoft Windows operating system. Nonetheless, the availability of the most important libraries for the Linux platform allows a developer to use the Kinect sensor as the input for interacting with the software.

Figure 2: Tracking head and hands movements

I used libfreenect and OSCeleton to collect the input of the sensor and transform it into skeleton data, and the Processing programming language. Libfreenect7 is the core library for accessing the Microsoft Kinect USB camera; it is developed by the OpenKinect community. OSCeleton is a program that takes Kinect skeleton data from the OpenNI framework and emits the coordinates of the skeleton's joints as OSC messages. To realize the two Processing sketches, I mainly modified the examples provided with the OSCeleton library. The sketches read the output provided by OSCeleton (using the OscP5 Processing library), and use it to calculate the body coordinates of the user.
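As an illustration of how little code is needed on the receiving side, the sketch below is a minimal reconstruction, not the code of the two prototypes. It assumes OSCeleton's default behaviour: OSC messages delivered on local UDP port 7110, a /joint message carrying the joint name, the user id and the x, y, z coordinates, with x and y normalized between 0 and 1. All of these assumptions should be checked against the documentation of the OSCeleton version in use.

import oscP5.*;
import netP5.*;
import java.util.concurrent.ConcurrentHashMap;

OscP5 osc;
// last known position of each tracked joint, keyed by "userId:jointName";
// a concurrent map is used because oscP5 delivers messages on its own thread
ConcurrentHashMap<String, PVector> joints = new ConcurrentHashMap<String, PVector>();

void setup() {
  size(640, 480);
  // OSCeleton is assumed to send its messages on local UDP port 7110
  osc = new OscP5(this, 7110);
}

void oscEvent(OscMessage msg) {
  // assumed layout: /joint <s: joint name> <i: user id> <f: x> <f: y> <f: z>
  if (msg.checkAddrPattern("/joint")) {
    String joint = msg.get(0).stringValue();
    int user = msg.get(1).intValue();
    float x = msg.get(2).floatValue();
    float y = msg.get(3).floatValue();
    float z = msg.get(4).floatValue();
    joints.put(user + ":" + joint, new PVector(x, y, z));
  }
}

void draw() {
  background(0);
  fill(255);
  // draw every joint seen so far, scaling the (assumed) normalized
  // coordinates to the window size
  for (PVector p : joints.values()) {
    ellipse(p.x * width, p.y * height, 10, 10);
  }
}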

6.1 The first sketch
When discussing the ability of the Kinect to substitute old interfaces, I stated that the most difficult input device to replace is the keyboard: for long texts, its efficiency is far greater than that of virtual keyboards. Difficult, however, does not mean impossible: the Kinect can be used as a pointer, and it is possible to use it, through a virtual keyboard, to input some text. It was to test this hypothesis that I decided to implement a Kinect virtual keyboard. Using the right hand, the user is able to highlight a letter, and by moving the hand forward she can click on it. The action of clicking on a letter, however, increased the risk of accidentally moving the virtual cursor onto a different letter, increasing the error rate. I therefore decided to use a two-hands interface: the right hand is used to select the letter, and a forward movement of the left hand is used to click. This little trick decreased the error rate. The pointer, however, is still fairly unstable (the issue should be on the software side of my sketch), and therefore it is not yet usable as an input device.
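The two-hands logic can be sketched roughly as follows. This fragment is a reconstruction for illustration only, not the code of the prototype: the threshold, the key grid and the helpers keyIndexAt() and typeKey() are invented placeholders, and updateKeyboard() would be called from draw() with the current hand positions obtained from the joint stream.

int hoveredKey = -1;
float leftRestZ = 0.0;          // baseline depth of the resting left hand
float CLICK_DELTA = 0.15;       // invented threshold for the forward push

void updateKeyboard(PVector rightHand, PVector leftHand) {
  // the right hand only moves the highlight over the key grid...
  hoveredKey = keyIndexAt(rightHand.x * width, rightHand.y * height);
  // ...while a forward push of the left hand confirms the selection,
  // so the pointing hand does not drift during the "click"
  // (the sign of the z axis depends on the sensor/OSCeleton settings)
  if (hoveredKey >= 0 && leftRestZ - leftHand.z > CLICK_DELTA) {
    typeKey(hoveredKey);
    leftRestZ = leftHand.z;     // re-arm the click
  }
  // slowly follow the resting depth of the left hand
  leftRestZ = lerp(leftRestZ, leftHand.z, 0.01);
}

int keyIndexAt(float x, float y) {
  // hypothetical mapping from a screen position to a key of a 10x3 grid
  int col = constrain(int(x / (width / 10.0)), 0, 9);
  int row = constrain(int(y / (height / 3.0)), 0, 2);
  return row * 10 + col;
}

void typeKey(int k) {
  // placeholder: the prototype displayed the typed letter instead
  println("key " + k);
}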

6.2 The second sketch

7 http://www.openkinect.org

The second sketch uses the skeleton coordinates in a different manner. It tracks the positions of the head and of the hands, and saves them in three separate vectors of 20 measurements each. It then draws those positions using three different colors. As depicted in figure 2, the result is a representation of the movement of the head and the hands of the user.
I want to stress that the two sketches have been developed just to test the environment. The good news is that, using a few open source libraries (libfreenect, OSCeleton, OscP5) and hacking the examples provided for the Processing language, it is fairly easy to track the skeleton and the body parts of a user and use them as input for any kind of interaction.
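The drawing logic of the second sketch can be reconstructed, for illustration only, along these lines; head, leftHand and rightHand are assumed to be updated by an oscEvent() handler like the one shown earlier, and are initialized here with dummy values only so that the fragment compiles on its own.

int N = 20;                                 // measurements stored per joint
PVector[] headTrail  = new PVector[N];
PVector[] leftTrail  = new PVector[N];
PVector[] rightTrail = new PVector[N];
int idx = 0;

// in the real sketch these would be updated in oscEvent()
PVector head = new PVector(0.5, 0.2);
PVector leftHand = new PVector(0.3, 0.5);
PVector rightHand = new PVector(0.7, 0.5);

void setup() {
  size(640, 480);
}

void draw() {
  background(0);
  // store the newest sample of each tracked joint in a circular buffer
  headTrail[idx]  = new PVector(head.x, head.y);
  leftTrail[idx]  = new PVector(leftHand.x, leftHand.y);
  rightTrail[idx] = new PVector(rightHand.x, rightHand.y);
  idx = (idx + 1) % N;
  // draw each trail with its own color
  drawTrail(headTrail,  color(255, 0, 0));
  drawTrail(leftTrail,  color(0, 255, 0));
  drawTrail(rightTrail, color(0, 0, 255));
}

void drawTrail(PVector[] trail, int c) {
  noStroke();
  fill(c);
  for (PVector p : trail) {
    // coordinates are assumed normalized, hence the scaling
    if (p != null) ellipse(p.x * width, p.y * height, 8, 8);
  }
}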

6.3 Conclusions
With a very sophisticated sensor that costs less than 130 euros and a few open source libraries, it is fairly simple to recognise and track the coordinates of the skeleton's joints of a user. Such a technology, as exemplified by the huge number of hacks published on the internet, opens a fairly new range of opportunities for the design of smart interactions.
In this work I tried to highlight the advantages and the limits of the technology. My opinion is that the so called Natural (or Kinetic) user interfaces will not substitute the mouse and the keyboard, which still offer unmatched efficiency when dealing with textual input and with complex and precise interactions. This technology, however, will probably revolutionize those fields of human-computer interaction where the mouse and the keyboard have never been very successful: the interaction with videogames, the consumption of multimedia content, and probably videoconferencing. Furthermore, such a cheap and open technology can boost the development of ubiquitous computing devices, where the interaction does not need to be conscious and deliberate. The kinetic user interface can improve the interaction with noncommand, incidental, unobtrusive systems. The fields of application are many, and I have listed just a few of them.
If I have to conclude with a prediction, I believe that, in the next years, we will see a number of different devices, with different sensors and input and output systems, focused on specialized tasks. The mouse and the keyboard are here to stay, but they will be just two of the many ways users will interact (often unconsciously) with the ubiquitous devices that are already around us.

7. REFERENCES
[1] S. Ballmer. CES 2010: A transforming trend – the natural user interface. www.huffingtonpost.com, 2010.
[2] B. Barnes. Disney tackles major theme park problem: Lines. The New York Times, Dec. 2010.
[3] J. Blake. The natural user interface revolution. Deconstructing the NUI blog, 2009.
[4] L. Chittaro and L. Ieronutti. A visual tool for tracing users' behavior in virtual environments. In Proceedings of the Working Conference on Advanced Visual Interfaces, pages 40–47. ACM, 2004.
[5] S. Davidoff, M. Lee, C. Yiu, J. Zimmerman, and A. Dey. Principles of smart home control. UbiComp 2006: Ubiquitous Computing, pages 19–34, 2006.
[6] A. Dix. Beyond intention - pushing boundaries with incidental interaction. Technical report, Lancaster University, 2002.
[7] A. Fleury, M. Vacher, and N. Noury. SVM-based multimodal classification of activities of daily living in health smart homes: Sensors, algorithms, and first experimental results. IEEE Transactions on Information Technology in Biomedicine, 14(2):274–283, 2010.
[8] D. Gavrila. The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1):82–98, 1999.
[9] S. Hui, E. Bradlow, and P. Fader. Testing behavioral hypotheses using an integrated model of grocery store shopping path and purchase behavior. Journal of Consumer Research, 36(3):478–493, 2009.
[10] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In 2010 International Conference on Pattern Recognition, pages 2756–2759. IEEE, 2010.
[11] J. Larson, E. Bradlow, and P. Fader. An exploratory look at supermarket shopping paths. International Journal of Research in Marketing, 22(4):395–414, 2005.
[12] J. Nielsen. Noncommand user interfaces. useit.com, 1993.
[13] J. Nielsen. Kinect gestural UI: First impressions. useit.com, 2010.
[14] D. Norman. The way I see it: Natural user interfaces are not natural. Interactions, 17(3):6–10, 2010.
[15] V. Pallotta. Kinetic user interfaces for unobtrusive interaction with mobile and ubiquitous systems. Technical report, 2008.
[16] InfraRed Integrated Systems. Tesco 'one-in-front' campaign wins prestigious Retail Week award using Irisys queue busting camera technology. Press release, Mar. 2007.
[17] P. Turaga, R. Chellappa, V. Subrahmanian, and O. Udrea. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473–1488, 2008.
[18] E. Veron and M. Levasseur. Ethnographie de l'exposition. Bibliothèque publique d'information, Centre Georges Pompidou, Paris, 1983.
[19] M. Weiser. The computer for the 21st century. Scientific American, 265(3):94–104, 1991.
[20] M. Zancanaro, T. Kuflik, Z. Boger, D. Goren-Bar, and D. Goldwasser. Analyzing museum visitors' behavior patterns. User Modeling 2007, pages 238–246, 2009.