New Video Applications on Mobile Communication Devices


Olli Silvén(a), Jari Hannuksela(a), Miguel Bordallo-López(a), Markus Turtinen(b), Matti Niskanen(b), Jani Boutellier(a), Markku Vehviläinen(c) and Marius Tico(c)

(a) Machine Vision Group, University of Oulu, Oulu, Finland
(b) Visidon Ltd, Oulu, Finland
(c) Nokia Research Center, Tampere, Finland

ABSTRACT

The video applications on mobile communication devices have usually been designed for content creation, access, and playback. For instance, many recent mobile devices replicate the functionalities of portable video cameras, video recorders, and digital TV receivers. These are all demanding uses, but nothing new from the consumer point of view. However, many of the current devices have two built-in cameras, one for capturing high resolution images and the other for lower, typically VGA (640x480 pixels) resolution video telephony. We employ video to enable new applications and describe four actual solutions implemented on mobile communication devices. The first one is a real-time motion based user interface that can be used for browsing large images or documents such as maps on small screens. The motion information is extracted from the image sequence captured by the camera. The second solution is a real-time panorama builder, while the third one assembles document panoramas, both from individual video frames. The fourth solution is a real-time face and eye detector. It provides another type of foundation for motion based user interfaces, as knowledge of the presence and motion of human faces in the view of the camera can be a powerful application enabler.

Keywords: User interaction, camera, motion analysis, panorama construction, face detection

1. INTRODUCTION

Modern mobile communication devices are becoming attractive platforms for multimedia applications as their display and imaging capabilities are improving together with the computational resources. Many of the devices have two built-in cameras, one for high resolution still and video imaging, and the other for obtaining lower, e.g. VGA (640x480 pixels) resolution frames. Table 1 points out the versatility of the user interfaces of handheld devices in comparison to laptop computers.1 To illustrate the current typical designs, Fig. 1 below shows two modern cellular phone designs, both with two cameras. The flip phone has a high resolution camera on the cover, on the same side as a display. However, it is intended to be operated with the lid open, exposing a higher resolution display and an additional camera facing the user. The monoblock design is similar to digital still cameras, with the high resolution display and cameras on opposite sides. It is obvious that with the display side cameras the designers have aimed at hand held video telephony, while at the same time satisfying the needs for occasional photography, video capture, and playback.

The usability of mobile communication devices in portable imaging applications is on par with laptop computers despite the order of magnitude disparity between the computing power budgets. The sizes and semi-dedicated user interfaces of the hand-held devices are significant benefits over the general purpose personal computer technology based platforms, despite the apparent versatility of the latter. On the other hand, even the most recent mobile communication devices have not used their multimedia and computing resources in a novel manner, but are merely replicating the functionalities already provided by other portable devices, such as digital still and video cameras. Also the popularity of laptop PCs as portable DVD players, and as a means to access multimedia content via public WiFi networks, has clearly influenced the hand held application designs. Consequently, most of the hand held devices rely on keypad-and-pointer user interfaces, while their applications rely on content provided via the Internet or broadcast services such as DVB-H, to supplement locally stored music, movies, and maps. Although the users can create content and stream it to the network for redistribution, and make video calls, these uses are not very common.

Corresponding author: Olli Silvén, E-mail: [email protected].fi, Telephone: +358 8 553 2788


Table 1. Characteristics of typical laptop computers and recent hand held mobile devices.

Characteristic                    Laptop computer         Hand-held device           Typical ratio
Still image resolutions           up to 1 Mpixel          up to 352x288 - 1944x2582  0.33x
Number of displays                1                       2                          0.5x
Number of cameras                 0-1                     1-2                        0.5x
Video resolution                  720x576/25Hz            640x480/30Hz               1x
Display size (inches)             12-15                   2-4                        5x (area 20x)
Processor clock (GHz)             1-3                     0.3-0.5                    10x
Display resolution (pixels)       1024x768 - 1600x1200    176x208 - 800x352          15x
Processor DRAM (MB)               256-2044                64-256                     16x

(Figure 1 callouts: 2 Mpix camera; CIF+ resolution (384x320) camera; VGA resolution (640x480) camera; 3.2 Mpix camera; 2.4" TFT 320x240 pixel 262k color display; 2.2" TFT 320x240 pixel 16M color display; 1.36" TFT 128x160 pixel 262k color display; LED illuminator / flash; keypad with three buttons; typical mobile keypad.)

Figure 1. Two mobile communications devices with two cameras (Nokia 6290 and Nokia N73).

As more and more applications are being crammed into hand held devices, their limited keypads and small displays are becoming overloaded, potentially confusing the user who needs to learn to use each individual application. Based on the personal experience of most people, increasing the number of buttons, as with remote control units, is not the best solution from the usability point of view. The full keyboard, touchpad or mouse, and higher resolution displays of laptop PCs appear to give them clear benefits as platforms for multiple simultaneous applications. However, the size of hand-held devices is an under-exploited asset, as are their multiple cameras. Properly combined, these characteristics can be used for novel user interfaces and applications that are ideal for hand-helds, but may not make much sense with laptop computers.

In this paper, we show how image sequences captured by the cameras of mobile communication devices can be used for new, self-intuitive applications and user interface concepts. The key ideas rest on the utilization of the hand held nature of the equipment and the user being in the field of view of a camera. Four actual implementations are described, all running on multimedia capable cellular phone platforms. The first solution is a real-time motion based user interface that can be used for browsing large images or documents such as maps on small screens.


The motion information of the device itself, the face, or the hand of the user is extracted from the image sequence. The second solution is a real-time panorama builder, while the third one assembles document panoramas; both operate on individual video frames based on the motion information. The fourth solution is a real-time face and eye detector that can be used with auto-focusing and red eye reduction techniques, essentially providing a basis for user interfaces that are aware of the presence of human faces and the direction of the gaze. When combined, face or limb and motion information can be powerful application enablers and may change the expectations on how hand-held devices are supposed to be aware of the user and react to his/her actions.

In the following, we first describe the typical platform characteristics of the mobile communication devices, and then proceed to the video based user interface and application solutions. The limitations and the potential of the realizations are analysed against the current state-of-the-art. Finally, the desirable future platform developments are considered from the camera based user interface and application point of view.

2. MOBILE COMMUNICATIONS DEVICE PLATFORMS

A typical top level hardware organization of a mobile communications device with multimedia capability is shown in Fig. 2. Two main interconnects are used for transfers between system units that are partitioned to avoid bottlenecks. Interconnect 2 provides high transfer bandwidths between the cameras, memories, and the video and image processing unit, which in turn has high speed local buses between its internal subsystems. Interconnect 1 is the system bus that interfaces the essential units with the master application processor.

Figure 2. Organization of a portable multimedia device.

We notice that the application processor has rapid access to the data produced by the cameras, so the design is not a simple replacement for a camcorder. Instead, potential for software based image and video applications has been engineered into the system architecture. Transfer resources and energy can be conserved if the camera images need not be shown. In contrast, the camcorder mode is rather transfer intensive as it requires real-time encoding and display functions. Video calls are even more complicated as they require the simultaneous operation of a video encoder, display, and decoder together with both uplink and downlink streaming via the baseband unit.

The power budgets of the mobile devices are designed and optimized on the basis of the worst case uses. Video applications in their various forms are among the most demanding ones, as the users tend to demand long, at least 3-4 hour, active use from small devices without connecting to the mains for recharging the battery. As the capacity of the batteries depends on the discharge current in a non-linear manner, relatively small cuts in power consumption can significantly extend the battery life.2 Consequently, the manufacturers are tempted to employ hardware accelerators for video due to their power efficiency. For the builders of alternative video based applications this can be a blessing, along with the availability of graphics processors.

Table 2 presents power breakdowns of three devices in video playback mode to illustrate the impacts of design philosophies; a rough battery-life check based on these figures is sketched after the table. The PDA device can be characterized as a scaled down personal computer. The early 3G mobile phone is mostly a communications device with an application processor, while the future device adds hardware based video and a GPU to achieve the 3 h battery life that is often considered critical in entertainment use. In both cellular devices, the modem and the RF consume a major part of the power budget.


Although it is tempting to expect improvements, future air interfaces are unlikely to use less power, as increasing data rates and multiple radio protocols add to their complexity. Also the miniaturization of RF components may actually decrease their power efficiency.

Table 2. Power consumption breakdown examples of pocket sized devices (power in mW).

System component                               3G phone in video         PDA device in            Expected future
                                               streaming mode (ref. 3)   MPEG-4 playback (ref. 4)  mobile devices
Application processor and memories             600                       833                      100
Display, audio, keyboard and backlights (UI)   1000                      2441                     400
Misc. memories                                 200                       754                      100
RF and cellular modem                          1200                      N/A                      1200
Total (mW)                                     3000                      4028                     1800
Battery capacity (mAh) / usage time            1000 mAh / 1 h            N/A                      1500 mAh / 3 h
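As a rough sanity check on Table 2, the stated battery capacities and total power draws can be converted into run times. The sketch below assumes a nominal 3.7 V lithium-ion cell voltage, which is our assumption and not stated in the paper:

```python
# Rough battery-life check for the Table 2 devices.
# Assumption (ours, not from the paper): a nominal 3.7 V lithium-ion cell.
NOMINAL_VOLTAGE_V = 3.7

devices = {
    "3G phone, video streaming":     {"power_mw": 3000, "capacity_mah": 1000},
    "expected future mobile device": {"power_mw": 1800, "capacity_mah": 1500},
}

for name, d in devices.items():
    energy_wh = d["capacity_mah"] / 1000.0 * NOMINAL_VOLTAGE_V   # battery energy in Wh
    hours = energy_wh / (d["power_mw"] / 1000.0)                  # ideal (linear) run time
    print(f"{name}: ~{hours:.1f} h ideal run time")

# Prints roughly 1.2 h and 3.1 h, in line with the 1 h and 3 h usage times in Table 2;
# the quoted times are slightly shorter because capacity drops at high discharge currents.
```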

So far, a key to improved power efficiency has been augmenting software solutions with hardware or Application Specific Instruction-set Processor (ASIP) support for computing intensive tasks. On the other hand, enabling the applications to access the services of these subsystems requires system software architectures and interfaces that hide the actual nature of the implementations, and the differences between the platforms of product generations. These benefits are not achieved without overheads, but the overheads can be optimized for the typical expected uses such as streaming video playback. However, any novel uses of the system resources are likely to encounter the full interface costs.

Table 3 presents the estimated power costs of using the camera for user interface purposes in a mobile device. Three options are considered: in the first, the only computing resource is the application processor, which needs to run at full speed; in the second, system hardware resources such as the GPU and parts of the video codec are re-used to conserve power; the third option is a conventional keypad and display user interface. Obviously, with proper system design the cost of a camera based user interface is reasonable.

Table 3. Estimated power requirements for different types of user interfaces.

Component                              Software based      Hardware for        Conventional
                                       camera UI [mW]      camera UI [mW]      UI [mW]
Application processor and memories     600                 200                 100
Display, audio, keypad, backlights     400                 400                 400
Camera (VGA)                           50                  50                  0
Misc. memories                         100                 100                 100
Total                                  1150                750                 650

The usability of a user interface critically rests on its latency. This is most obvious with computer games, in which many players perceive joystick action-to-display delays exceeding about 100-150 milliseconds as disturbing,5 but the same applies to key press-to-sound or key press-to-display delays. If we employ a camera as a user interface component, its integration time adds to the latency, as does the image analysis computing. If we sample the scene at a 15 frames/s rate, our base latency is 67 ms. Assuming that the integration time is 33 ms, the information in the pixels read from the camera is on average 17 ms old. Consequently, keeping the total of the computing and display/audio latencies below 100-150 ms represents a challenge.
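The arithmetic behind these figures can be restated compactly. The sketch below derives the image analysis time left under a 150 ms end-to-end target, using the 20 ms display latency assumed in Table 4 below:

```python
# Latency budget for a camera based UI, using the values discussed in the text.
def analysis_budget_ms(frame_rate_hz, integration_ms=33.0,
                       display_ms=20.0, target_ms=150.0):
    """Return the time left for image analysis under an end-to-end latency target."""
    base_ms = 1000.0 / frame_rate_hz      # one frame interval before data is available
    pixel_age_ms = integration_ms / 2.0   # pixels are, on average, half an exposure old
    return target_ms - (base_ms + pixel_age_ms + display_ms)

for rate in (15, 30):
    print(rate, "fps ->", round(analysis_budget_ms(rate)), "ms for image analysis")

# 15 fps leaves ~46 ms and 30 fps ~80 ms, matching Table 4: the lower frame rate
# leaves less time for analysis, but the higher rate costs more camera power.
```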


Table 4 summarizes the latency budget for two frame rates, 15 and 30 frames/s, that are typical of mobile devices. The integration time of the camera is assumed to be 33 ms at both rates. If the computing is done in a pipeline that contains more than a single processor, the time available for image analysis may effectively be longer than the base latency. Interestingly, at the lower frame rate less time is available for analysis, while on the other hand the camera operated at the higher rate demands more power.

Table 4. Latency budgets for camera based user interfaces at two frame rates.

Component                              15 frames/s    30 frames/s
Base latency (ms)                      67             33
50% camera integration time (ms)       17             17
Display latency (ms)                   20             20
Image analysis max (ms)                46             80
Total max (ms)                         150            150

An obstacle to camera based user interfaces is the turn-on time, which is not only dependent on the power-up delay of the camera, but is mostly caused by software. The current multimedia frameworks intended for use on mobile platforms have substantial latencies when the resources are reconfigured for the applications. For instance, Rintaluoma et al.6 found that the Symbian MMF (MultiMedia Framework) consumed approximately 60000 processor cycles for accessing a device driver from the application layer. OpenMAX is claimed to be a lighter weight interface for use in streaming multimedia hardware and software components; however, it has so far not been intended for user interface purposes. We may find some models on how to proceed from OpenGL ES, which has been designed with an eye on game applications: it is a highly optimized graphics system designed for accelerators used in embedded and mobile devices.

In addition to the user interfaces, the applications described next provide insight into the computing needs and characteristics of camera based user interfaces. If cameras become standard user interface components in mobile devices, energy efficiency requires that the bulk of the computing is carried out using hardware acceleration. These resources could be an outgrowth of the current graphics or codec solutions, or both.
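To relate the measured framework overhead to the latency budget above, the following back-of-the-envelope sketch converts the reported 60000-cycle driver access into time; the clock rate and the number of framework crossings per frame are our illustrative assumptions, not figures from the cited measurement:

```python
# Back-of-the-envelope cost of multimedia framework layering in a camera UI.
# The 60000-cycle figure is from Rintaluoma et al.; the clock rate and the number
# of framework crossings per frame are illustrative assumptions, not measurements.
CYCLES_PER_DRIVER_ACCESS = 60_000
ASSUMED_CLOCK_HZ = 330e6
ASSUMED_CALLS_PER_FRAME = 5

per_access_ms = CYCLES_PER_DRIVER_ACCESS / ASSUMED_CLOCK_HZ * 1000.0
per_frame_ms = per_access_ms * ASSUMED_CALLS_PER_FRAME
print(f"~{per_access_ms:.2f} ms per access, ~{per_frame_ms:.2f} ms per frame")

# Under these assumptions the steady-state overhead stays below 1 ms per frame,
# small next to the 46-80 ms analysis budget of Table 4; the reconfiguration and
# turn-on latencies discussed above remain the dominant cost.
```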

3. MOTION BASED USER INTERFACE

The motion based user interface enables a new, flexible way of interacting with mobile phones. With this interface, the user can operate the phone through a series of hand movements whilst holding the device. During these movements the motion is extracted from the image sequence captured by the camera. As an application example, the solution has been implemented on Nokia Nseries mobile phones, allowing the user to browse large image documents on small screens as shown in Fig. 3. In the application, only a small part of the high resolution image is visible at a time (see Fig. 3 b) and the measured motion information is used as control input (see Fig. 3 a). For instance, lateral movement upwards scrolls the focus towards the upper part of the display, back and forth motion is interpreted as zooming in and out, and the rotation component is used to change the orientation of the display. In practice, the user can also tilt the device in order to navigate over the display, which is a natural and convenient way of controlling the device. Compared to the use of hardware accelerometers alone,7 a camera based approach allows a more convenient way of controlling zooming and adapts better to cases where the user is moving (walking etc.). A typical use case is illustrated in Fig. 4: the user browses a large image on the small screen of the mobile device by moving the device in his hand.

We estimate the ego-motion of the device while the user operates the phone by determining the parametric model that approximates the dominant global motion between two images in the sequence captured by the camera.8 Our approach utilises feature based motion analysis where a sparse set of blocks is first selected from one image and then their displacements are determined. In order to improve the accuracy of the motion information, the uncertainty of these features is also analysed.
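A minimal sketch of how such an estimated global motion could be mapped to the browsing commands described above is given below; the parameter names, gains, and thresholds are purely illustrative and not those of the actual implementation:

```python
# Illustrative mapping from estimated inter-frame global motion to browsing commands.
# The gains and field names are made up for this sketch; real UI tuning differs.
from dataclasses import dataclass

@dataclass
class GlobalMotion:
    dx: float        # horizontal translation (pixels/frame)
    dy: float        # vertical translation (pixels/frame)
    scale: float     # frame-to-frame scale change (>1 means moving towards the target)
    rotation: float  # in-plane rotation (radians/frame)

def update_view(view, motion, pan_gain=2.0, zoom_gain=1.0):
    """Scroll, zoom and rotate the visible window according to the device motion."""
    view["x"] -= pan_gain * motion.dx           # lateral motion scrolls the view
    view["y"] -= pan_gain * motion.dy
    view["zoom"] *= motion.scale ** zoom_gain   # back-and-forth motion zooms in/out
    view["angle"] += motion.rotation            # rotation re-orients the display
    return view

view = {"x": 0.0, "y": 0.0, "zoom": 1.0, "angle": 0.0}
view = update_view(view, GlobalMotion(dx=3.0, dy=-1.5, scale=1.02, rotation=0.0))
```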


Figure 3. The motion based user interface estimates the motion of the device relative to the user, also enabling zooming functionality (a). It can be used, for example, to browse large image documents on the screen (b).

Figure 4. A use example of the motion based user interface: zooming in and scrolling commands given by moving the device.

The main steps of the motion estimation are presented in Fig. 5; for details, please see the paper by Hannuksela et al.8 The blocks in the top left image are the selected image regions to be used, while the lines in the top right image illustrate the block displacement estimates, d, and the ellipses show the related uncertainties. The bottom left image shows the trusted features that are used for parametric model fitting; here the ellipses illustrate the weight that a particular displacement estimate has in the fitting. By combining the feature selection with uncertainty information, we obtain a very robust motion estimate for the sequence. This information can be directly used to estimate the motion of the device in the user's hand.

We have implemented our method using only fixed-point arithmetic due to the lack of a floating-point unit in most current devices. The use of integer operations in the inner loops guarantees high performance. The solution can also take advantage of the hardware acceleration used for other video processing applications. Acceleration hardware is designed to support the block-based and pixel-level processing tasks that are not efficiently handled by the CPU architecture. Typically such hardware contains highly optimised motion estimation instructions for blocks from 16x16 down to 4x4 pixels, which are also the usual block sizes in our method.
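To give a concrete picture of the pipeline in Fig. 5, the sketch below implements a heavily simplified variant: exhaustive SAD block matching for a sparse set of blocks, followed by a translation-only estimate weighted by a crude per-block reliability score. The published method8 uses a richer parametric model and a proper uncertainty analysis, so this should be read as an illustration only:

```python
import numpy as np

def match_block(prev, curr, y, x, bs=16, search=8):
    """Find the displacement of one block with an exhaustive SAD search."""
    ref = prev[y:y + bs, x:x + bs].astype(np.int32)
    best, second, best_d = None, None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = curr[y + dy:y + dy + bs, x + dx:x + dx + bs].astype(np.int32)
            sad = int(np.abs(ref - cand).sum())
            if best is None or sad < best:
                best, second, best_d = sad, best, (dy, dx)
            elif second is None or sad < second:
                second = sad
    # Crude reliability score: how much better the best match is than the runner-up.
    weight = float(second - best) / (best + 1.0) if second is not None else 0.0
    return best_d, weight

def global_translation(prev, curr, bs=16, search=8, step=48):
    """Weighted average of sparse block displacements as a global motion estimate."""
    h, w = prev.shape
    disp, weights = [], []
    for y in range(search, h - bs - search, step):
        for x in range(search, w - bs - search, step):
            d, wgt = match_block(prev, curr, y, x, bs, search)
            disp.append(d)
            weights.append(wgt)
    weights = np.asarray(weights) + 1e-6
    return tuple(np.average(np.asarray(disp, float), axis=0, weights=weights))
```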


Figure 5. Example of global motion estimation: selected blocks, motion features, trusted features, and the estimated motion.

4. PANORAMA BUILDER

The panorama building solution analyses the video frames for motion and moving objects, quantifies the quality of each frame, and stitches up to 360 degree panoramas from the best available images.9 We have developed a method in which the device stitches images in real time, producing a result image that grows as frames are acquired.10 The advantage of on-line panorama building is that the memory requirements are smaller than when all the frames are saved for later processing. The immediate feedback and the possibility of reviewing the panorama images are also very useful. Three examples are shown in Fig. 6.

Figure 6. The efficient panorama builder stitches high quality images even if there are moving objects in the scene.

The panorama capturing procedure is illustrated in Fig. 7. In order to get a final panorama image, the user points the camera at the desired starting point of the mosaic and then turns it around, up to 360 degrees, while a sequence of images is captured. Each image is individually processed to estimate the shift and rotation.


The blurriness of each picture is measured and moving objects are detected. Based on the quality of each individual frame, a selection process takes place: the idea is to consider only good quality frames for creating the best possible output. The selection process is shown in Fig. 8. Each frame is either accepted or discarded. For every selected frame, if a moving object is present and it fits the sub-image, the image is blended by drawing a seam that lies outside the boundaries of the object. If only a partial object is present, the part of the frame without the object is the one that is blended.

Figure 7. During panorama capturing, the user first points the device in the desired direction and then turns around in order to create a panorama of the view.

Image registration relies on the method of Vandewalle et al.11 that offers shift and rotation estimation robust against blur. Only a fixed square template on the central part of each frame, where the image quality is better, is used. This square is downscaled by a factor of two and filtered to allow faster processing, and the results of the registration estimation are then interpolated.

The amount of motion blur in a frame is computed with summed derivatives.12 The method estimates the image's sharpness by summing together the derivatives of each row and each column of the overlapping part. The blur calculation produces a single number that expresses the amount of high-frequency detail in the image. The value is meaningful when it is used to compare images: if a certain image Ia scores higher than image Ib, it means that Ia has more high-frequency detail than Ib (assuming that both images depict approximately the same scene). Usually this means that Ia is sharper than Ib, but on some occasions differences in the image content distort the result.

To perform motion detection, the difference between the current frame and the previous frame is computed. The result is a two-dimensional matrix that covers the overlapping area of the two frames. This matrix is low-pass filtered to remove noise and thresholded against a fixed value to produce a binary motion map. If the binary image contains a sufficient number of pixels classified as motion, the dimensions of the assumed moving object are determined statistically: first, the center point of the object is approximated by computing the average coordinates of all moving pixels; second, the standard deviation of the coordinates is used to approximate the dimensions of the object.

Frame selection is performed using the scores of the blur measurement and the motion detection. Among the set of images that overlap the previously blended frame, only the best frame is selected, while the others are discarded. Frame blending uses the feathering method,13 where a linear function gradually merges one frame into the next by changing the frames' weights.
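The per-frame measurements described above are simple enough to sketch directly. The snippet below shows a summed-derivative sharpness score, a thresholded frame-difference motion map with a statistical object estimate, and linear feathering of two overlapping strips; the threshold value, kernel size, and blend width are illustrative only:

```python
import numpy as np

def sharpness_score(gray):
    """Summed absolute row/column derivatives: higher means more high-frequency detail."""
    g = gray.astype(np.float32)
    return float(np.abs(np.diff(g, axis=0)).sum() + np.abs(np.diff(g, axis=1)).sum())

def motion_map(prev, curr, kernel=5, threshold=25.0):
    """Binary map of moving pixels from a low-pass filtered frame difference."""
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    k = np.ones(kernel, np.float32) / kernel
    # Separable box filtering along rows and columns removes pixel noise.
    smoothed = np.apply_along_axis(lambda r: np.convolve(r, k, "same"), 1, diff)
    smoothed = np.apply_along_axis(lambda c: np.convolve(c, k, "same"), 0, smoothed)
    mask = smoothed > threshold
    if not mask.any():
        return mask, None, None
    ys, xs = np.nonzero(mask)
    center = (ys.mean(), xs.mean())           # approximate moving-object center
    size = (2.0 * ys.std(), 2.0 * xs.std())   # approximate extent from coordinate spread
    return mask, center, size

def feather_blend(left, right, overlap):
    """Linearly blend the overlapping columns of two equally tall grayscale strips."""
    w = np.linspace(1.0, 0.0, overlap, dtype=np.float32)
    blended = left[:, -overlap:] * w + right[:, :overlap] * (1.0 - w)
    return np.hstack([left[:, :-overlap].astype(np.float32), blended,
                      right[:, overlap:].astype(np.float32)])
```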


Figure 8. Automatic frame selection based on blur and motion estimation ensures that only the best quality frames are considered in the panorama construction.

The application has been implemented using only fixed-point arithmetic to achieve good performance on most devices. The solution can also take advantage of the GPU if one is available. The mean processing rate on a Nokia N95 (ARM11 processor) is about 8 frames per second (125 ms/frame) with 320x240 frames. Other frame resolutions (from 160x120 to 640x480) can also be chosen. The size of the final image is limited only by the memory.

5. DOCUMENT PANORAMA BUILDER

A document panorama builder is essentially a camera based scanner, as shown in Fig. 9. Instead of using devices such as flatbed scanners, the users can capture high quality images with their mobile phones. Mobile cameras enable portable and non-contact image capture of all kinds of documents. Although they cannot replace flatbed scanners, they are more suitable for several scanning tasks in less constrained situations. We have developed a method14 in which the device interactively guides the user to move it over, for example, a newspaper page in such a manner that a high quality image can be assembled from individual video frames. During online scanning, motion determined from low-resolution image sequences is used to control the interaction process. As a result, good high-resolution images of the document page can be captured for stitching. The images, together with coarse alignment information, are then used to construct a mosaic automatically using a feature based alignment method.

In the first stage, partial images of the document are captured with the help of user interaction. The basic idea is to apply online camera motion estimation on the mobile phone to assist the user in the image scanning process. The user starts the scanning by taking an initial image of some part of the document (see Fig. 10 a). Then, the user is asked to move the device to the next location. The scanning direction is not restricted; one possible way is to use a zig-zag style scanning path as shown in Fig. 10 b. The camera motion is estimated during the movement and the user is informed when a suitable overlap between images is achieved (for example 25%). When the device motion is small enough, a new image is taken. The movement should then be stopped, because otherwise the images are blurred.


Figure 9. A mobile device can be used as a camera based document scanner.

In order to measure device motion, the same principle as in Sec. 3 is utilized. The frame-to-frame estimates are then used for computing cumulative displacement estimates. The requirement here is that the error in this estimate does not grow too high, so that sufficient overlap between stored images is guaranteed.
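A minimal sketch of this capture-triggering logic is shown below: frame-to-frame displacements are accumulated, and a new high-resolution image is requested once the estimated overlap with the previously stored image drops to the target (about 25% in the text) while the instantaneous motion is small. The overlap formula and the thresholds are illustrative assumptions, not those of the actual implementation:

```python
# Illustrative capture trigger for interactive document scanning.
# Frame-to-frame motion estimates (dx, dy) come from the low-resolution viewfinder.
def should_capture(cum_dx, cum_dy, frame_w, frame_h,
                   inst_speed, target_overlap=0.25, max_speed=2.0):
    """Trigger a new still capture when overlap is low enough and motion is slow."""
    overlap_x = max(0.0, 1.0 - abs(cum_dx) / frame_w)
    overlap_y = max(0.0, 1.0 - abs(cum_dy) / frame_h)
    overlap = overlap_x * overlap_y          # overlap fraction for a pure translation
    return overlap <= target_overlap and inst_speed <= max_speed

cum = [0.0, 0.0]   # cumulative displacement since the last stored image

def on_viewfinder_motion(dx, dy, frame_w=640, frame_h=480):
    """Accumulate displacement; reset after a capture is triggered."""
    cum[0] += dx
    cum[1] += dy
    speed = (dx * dx + dy * dy) ** 0.5
    if should_capture(cum[0], cum[1], frame_w, frame_h, speed):
        cum[0] = cum[1] = 0.0
        return "capture"                     # ask the camera for a high-resolution frame
    return "slow down" if speed > max(2.0, 0.0) else "move"
```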

Figure 10. An example of building a large document page image. An interactive user interface (with prompts such as "start", "move", and "slow down") helps to acquire good quality initial images (a). One possible scan style is zig-zag scanning (b). The final mosaic obtained is shown in (c).

After online image capturing, the partial images of the document page are stitched together. The automatic mosaicing is based on a robust estimator (RANSAC15) combined with a feature point detector (SIFT16). Graph based global alignment and bundle adjustment steps are also performed in order to minimize image registration errors and to further improve quality. Finally, the warped images are blended into the mosaic using simple Gaussian weighting. Fig. 10 c illustrates the mosaic (1397x1099 pixels) constructed for an A4 document page from eight VGA images. When high resolution images are used instead of raw VGA frames as input, the device needs to alternate between the high and low resolution camera modes, which incurs a relatively large latency and is a platform limitation.
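The offline stitching stage can be approximated with standard library components. The sketch below uses OpenCV's SIFT detector and RANSAC homography estimation for a single image pair; the graph based global alignment, bundle adjustment, and Gaussian-weighted blending of the actual system are omitted, and the simple overwrite at the end stands in for proper blending:

```python
import cv2
import numpy as np

def pairwise_homography(img_a, img_b, ratio=0.75):
    """Estimate the homography mapping img_b onto img_a with SIFT + RANSAC."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # Lowe's ratio test keeps only distinctive matches.
    matches = [m for m, n in matcher.knnMatch(des_b, des_a, k=2)
               if m.distance < ratio * n.distance]
    src = np.float32([kp_b[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

def stitch_pair(img_a, img_b, canvas_size):
    """Warp img_b into img_a's frame; a real system blends rather than overwrites."""
    H = pairwise_homography(img_a, img_b)
    canvas = cv2.warpPerspective(img_b, H, canvas_size)  # canvas_size is (width, height)
    canvas[:img_a.shape[0], :img_a.shape[1]] = img_a
    return canvas
```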

6. FACE AND EYE DETECTOR

Faces and eyes are important feature sources for communication devices. The camera directed towards the user (as in the models shown in Fig. 1) is usually intended for video call purposes, and its field of view is optimized for the user's face region.


This provides advantages for various HCI solutions. Faces are also important in imaging applications: users are typically interested in searching for people in their images, and good quality face regions are among the main concerns in consumer imaging. There are already commercial auto focus and auto white balance solutions that utilize detected faces during image capture; smile shutters and similar new features based on face detection are other examples. The combined detection of the user's face and eyes is useful especially in user interface solutions where the relative position of the user can be obtained via the camera. For example, knowledge of the presence and motion of a human face in the view of the camera can be a powerful application enabler.

We have built a very fast object detection and tracking method based on efficient gray scale invariant texture features and boosting.17, 18 Our method searches for faces or eyes in images or image sequences, and returns the coordinates of the detected objects. This information can be directly used by face based approaches such as auto focusing or color enhancement, as illustrated in Fig. 11. Another example application is shown in Fig. 12, where face detection is combined with the motion based UI: the user can give motion gesture commands to switch between the faces on the screen. The frontal camera can also observe whether the user is actually looking at the device by estimating the gaze direction with respect to the device. This requires that the face and the eyes are robustly detected in the images.
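The gray scale invariant texture features cited above are local binary patterns.17, 18 A compact reference implementation of the basic 8-neighbour, radius-1 LBP operator is shown below; the actual detector combines such features with boosting over multiple scales and window positions, which is beyond this sketch:

```python
import numpy as np

def lbp_8_1(gray):
    """Basic 8-neighbour, radius-1 local binary pattern image (no interpolation)."""
    g = gray.astype(np.int32)
    center = g[1:-1, 1:-1]
    # Neighbour offsets in a fixed order define the bit positions of the code.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= ((neighbour >= center).astype(np.int32) << bit)
    return code  # values 0..255; histograms of these codes feed a boosted classifier

def lbp_histogram(gray):
    """Normalized 256-bin LBP histogram used as a simple texture descriptor."""
    codes = lbp_8_1(gray)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)
```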

Figure 11. Real-time face and eye detection is very useful in various image enhancement applications such as autofocusing (a) and automatic red-eye removal (b).

The algorithm was implemented on the Symbian OS to detect and track a maximum of five objects (faces) in the images. The demo software processes the QVGA viewfinder images at a high frame rate, and the detection performance is very good, also in demanding illumination conditions. The minimum face size was set to 20×20 pixels. We ran simulation experiments with the RealView Development Suite using a 180 MHz ARM9 CPU, and the mean processing time over various image sequences is 68 ms/frame (about 15 fps). The face detection Symbian demo running on a Nokia N95 (ARM11) is even faster (approximately 26-30 fps). This is fast enough, for example, for real-time auto focusing or white balancing applications. Fig. 13 shows some example detection results on QVGA frames captured by a Nokia N95. The algorithm is robust in varying illumination conditions and very efficiently detects faces of different sizes and poses.

To demonstrate the detection performance, we built a new kind of automatic red-eye removal solution. The approach is frame based: the input image is processed after it has been captured by a high-resolution camera. The algorithm automatically finds and corrects the red eyes caused by the flash light. Face and eye detectors are used to verify each correction to improve the accuracy of the algorithm. We experimented with the approach on a set of 393 images containing a total of 794 red eyes. The image set contained typical home album images of varying quality, with faces and eyes of different sizes and orientations, as well as specific red-eye images created for testing purposes. Overall, the image set was very demanding from the automatic red-eye removal point of view. We compared our face and eye detector based approach with the dedicated red-eye removal solutions of HP19 (HP RedBot, http://www.redbot.net) and Volken et al.20 (http://ivrgwww.epfl.ch/software/red eye remover/RedEyeRemover.html).


Figure 12. Face detection can be utilized in various applications. Here the face detection is combined with the motion based UI and the user switches between the faces in the image by giving motion gesture commands.

Figure 13. Face detection examples on Nokia N95 view finder frames.

The results are shown in Table 5. Our approach gave better correction accuracy than the HP method with a similar false positive rate. The method of Volken et al.20 performed relatively poorly on our test image database.

Table 5. Detection results of different frame based automatic red-eye removal approaches.

Method                                        TP [%]    FP [%]
HP RedBot19                                   59        9
RedEyeRemover20                               24        71
Combination of face and eye information      70        9

Keeping in mind that our detector was in this case applied to individual frames separately, the accuracy was very good. In user interface solutions we can utilize the whole sequence of frames, and it is likely that the performance would then be even better. One concern is low-light conditions, where the image can be too noisy for automatic face detection. In UI cases, special attention needs to be paid to designing proper lighting; one possibility is to use infrared LEDs for illuminating the user's face. Energy consumption is, of course, the main limitation of such designs.


7. PLATFORM SUPPORT FOR THE NEW VIDEO APPLICATIONS

The development of the demonstration applications has helped to identify platform features that alternative video applications and camera based user interfaces would benefit from. We have also discovered latency and computing bottlenecks that can be removed through software and hardware developments.

First, it should be possible to use two cameras at the same time, or to quickly alternate between cameras as image sources. The motivation for this capability is a practical one: sunlight, lamps, or reflections may saturate one of the cameras, so the trivial automatic adaptation is to switch to the other image source, although that may be a more power consuming high resolution device. However, the current mobile devices have single camera interfaces, and alternating between cameras requires a reconfiguration that may take hundreds of milliseconds.

Second, a stand-by mode for the cameras should exist, perhaps initiated when built-in accelerometers recognize that the device is being handled, to reduce the start-up latency of vision based user interfaces. In the stand-by mode, the camera could capture images, say, at the rate of a frame per second, adjusting to the ambient illumination. The miniature VGA camera modules used in mobile devices require about 1-1.5 mW/frame/s, a cost that needs to be weighed against the gained benefits; a rough estimate of this cost is sketched at the end of this section. The cold start power-up latencies of the camera hardware modules alone are around 100 ms, and at least two images are needed to determine the first motion estimates, even if no gain correction is needed to bring the image information into the useful dynamic region. These plain hardware dependent delays amount to 150-200 ms in total, but would be only 50-100 ms from stand-by.

Third, the data formats of the camera and GPU/display units should be compatible, and for a number of image processing functions, such as interpolations and warps, it is desirable to use the GPU as a hardware accelerator. The OpenGL interface is highly efficient, but the necessary format changes result in needless copying of data, leading to reduced energy efficiency, increased computational burden, and latency.

Finally, motion estimation and face detection are potential platform level services to be offered via multimedia APIs. They play key roles in the demonstrated applications, and are likely to be employed in many others. Implementing them in the camera modules, or in the camera interfaces, would reduce the power hungry data transfers over the system interconnects. Furthermore, distributing the computational load to processing resources tightly coupled to the sensors could result in lower latency.
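To quantify the stand-by proposal mentioned above, a rough cost estimate is sketched below. It uses the 1-1.5 mW/frame/s module figure from the text together with an assumed 1000 mAh, 3.7 V battery, which is our assumption rather than a platform specification:

```python
# Rough cost of a 1 frame/s camera stand-by mode.
# Module figure from the text: ~1-1.5 mW per frame/s; battery size is our assumption.
STANDBY_MW = 1.5                 # worst case at 1 frame/s
BATTERY_MWH = 1000 * 3.7         # assumed 1000 mAh cell at a nominal 3.7 V

drain_per_hour_pct = STANDBY_MW / BATTERY_MWH * 100.0
print(f"stand-by drains ~{drain_per_hour_pct:.2f}% of the battery per hour")

# About 0.04 %/h, i.e. roughly 1 % per day, which has to be weighed against the
# 50-100 ms (instead of 150-200 ms) start-up latency gained for the camera UI.
```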

8. SUMMARY

In the cases presented above, the cameras of multimedia capable mobile devices are employed as motion and feature sensors in user interface and imaging applications. Motion information coupled with image quality information helps in implementing panorama capture solutions, and is a self-intuitive means of interacting with a small screen and a minimal keypad. Face detection from images and image sequences is an exceptionally powerful user interface component, as the user is almost invariably looking at the device. Also, the big resolution disparities between the cameras and displays have generated a need to conveniently browse through the salient information in the images, usually human faces. The general approach, extraction of motion and features from sequential video frames, has clear usability potential, and it can augment the information provided by accelerometers and touchscreens in a complementary manner. In fact, the cameras in future mobile devices may, for most of the time, be used for sensory purposes rather than for capturing images for human viewing.

Energy efficiency is a significant challenge in exploiting camera based user interface ideas, but in our judgment a solvable one. Camera sub-systems on mobile device platforms are a rather recent add-on, designed just for capturing still images and video frames, with the latter considered a performance problem. At the same time, the energy efficiency features of the platform architectures, computing resources, and displays have been optimized for video playback. From the point of view of the demonstrated panorama applications, compatible data formats for the camera and graphics systems would be a major improvement. For the motion and face based user interfaces, lower camera start-up latencies would improve the usability, but require careful balancing against energy efficiency demands.


ACKNOWLEDGMENTS

The financial support provided by TEKES, the National Technology Agency, is gratefully acknowledged. We also want to thank Mr Gordon Roberts for the language revision.

REFERENCES

[1] O. Silven and T. Rintaluoma, "Energy efficiency of video decoder implementations," in F. Fitzek and F. Reichert (eds.), Mobile Phone Programming and its Applications to Wireless Networking, pp. 421-439, Springer, 2007.
[2] D. Rakhmatov, S. Vrudhula, and D. Wallach, "A model for battery lifetime analysis for organizing applications on a pocket computer," Very Large Scale Integration (VLSI) Systems 11(6), pp. 1019-1030, 2003.
[3] Y. Neuvo, "Cellular phones as embedded systems," in Solid-State Circuits Conference, 1, pp. 32-37, 2004.
[4] H. Shim, System-Level Power Reduction Techniques for Color TFT Liquid Crystal Displays, PhD thesis, School of Computer Science and Engineering, Seoul National University, Korea, 2006.
[5] J. Dabrowski and E. Munson, "Is 100 milliseconds too fast?," in Conference on Human Factors in Computing Systems, pp. 317-318, 2001.
[6] T. Rintaluoma, O. Silven, and J. Raekallio, "Interface overheads in embedded multimedia software," in International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, 4017/2006, pp. 5-14, 2006.
[7] P. Eslambolchilar and R. Murray-Smith, "Tilt-based automatic zooming and scaling in mobile devices - a state-space implementation," in Mobile Human-Computer Interaction, MobileHCI 2004, pp. 120-131, 2004.
[8] J. Hannuksela, P. Sangi, and J. Heikkilä, "Vision-based motion estimation for interaction with mobile devices," Computer Vision and Image Understanding: Special Issue on Vision for Human-Computer Interaction 108(1-2), pp. 188-195, 2007.
[9] J. Boutellier, M. Bordallo-Lopez, O. Silvén, M. Tico, and M. Vehviläinen, "Creating panoramas on mobile phones," in Proceedings of SPIE Electronic Imaging 2007, 6498, (7), 2007.
[10] M. Bordallo-Lopez, J. Boutellier, and O. Silven, "Implementing mosaic stitching on mobile phones," in Finnish Signal Processing Symposium, 2007.
[11] P. Vandewalle, S. Süsstrunk, and M. Vetterli, "A frequency domain approach to registration of aliased images with application to super-resolution," EURASIP Journal on Applied Signal Processing (special issue on Super-resolution) 24, pp. 1-14, 2006.
[12] J. Liang, D. DeMenthon, and D. Doermann, "Camera-based document image mosaicing," in International Conference on Pattern Recognition, pp. 476-479, 2006.
[13] R. Szeliski, "Video mosaics for virtual environments," IEEE Computer Graphics & Applications, pp. 22-30, 1996.
[14] J. Hannuksela, P. Sangi, J. Heikkilä, X. Liu, and D. Doermann, "Document image mosaicing with mobile phones," in 14th International Conference on Image Analysis and Processing, pp. 575-580, 2007.
[15] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM 24, pp. 381-395, 1981.
[16] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision 60(2), pp. 91-110, 2004.
[17] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), pp. 971-987, 2002.
[18] A. Hadid, G. Zhao, T. Ahonen, and M. Pietikäinen, "Face analysis using local binary patterns," in Handbook of Texture Analysis, M. Mirmehdi, ed., Imperial College Press, 2007.
[19] H. Luo, J. Yen, and D. Tretter, "An efficient automatic redeye detection and correction algorithm," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR), 2, pp. 883-886, 2004.
[20] F. Volken, J. Terrier, and P. Vandewalle, "Automatic red-eye removal based on sclera and skin tone detection," in IS&T Third European Conference on Color in Graphics, Imaging and Vision (CGIV), 2, pp. 359-364, 2006.
