Face Extraction From Live Video
SITE Technical Report TR-2006

Adam Fourney
School of Information Technology and Engineering
University of Ottawa
Ottawa, Canada, K1N 6N5

Project supervised by: Dr. Robert Laganière


Table of Contents

Introduction
Technology
    The Open Source Computer Vision Library (OpenCV)
    Microsoft's DirectShow
    The Background Segmentation Component
Architecture
Graphical User Interface
    FACE Dialog
    Input Configuration Dialog
    Output Configuration Dialog
    Graph Manager
Face Extractor
    Face Export Rules
    Modes of Operation
    Mechanism By Which Face Images are Exported
    Possible Improvements
Face Detector
    The OpenCV Face Detector
    Measurements Used to Assess Image Quality
        Inferring About Image Quality Using Haar-Classifiers
        Gaze Direction
        Motion, Skin, and “Motion & Skin” Content
            Pixel Motion Detection
            Skin Detection
        Quality of Lighting
            Measuring the Width of a Histogram
        Image Sharpness
        Image Dimensions and Area
    Possible Improvements
Pedestrian Tracker
    Possible Improvements
Appendix A: Building From Source
References


Introduction

The first step in any biometric face identification process is recognizing, with a high degree of accuracy, the regions of input video frames that constitute human faces. There has been much research focused on this particular task, and it has resulted in some very robust solutions for detecting faces in digital images.

However, frames from live video streams typically arrive at a rate of between 15 and 30 frames per second, and each frame may contain several faces. This means that faces might be detected at a rate much higher than the original video frame rate. These faces are destined to be input into a biometric face identification software package. This software is likely complex, and certainly requires some finite amount of time to process each face. It is very possible that the high rate of input could overwhelm the software. Even if the face identification software is efficient enough to keep up with the high rate of incoming faces, much of the processing is wasteful: many of the faces will belong to individuals who have already been accounted for in previous frames.

The software described in this paper aims to alleviate the situation by detecting faces as quickly as possible, but only exporting select faces for post-processing. In fact, the software aims to export one image for every pedestrian that enters the camera's field of view. This is accomplished by associating face images to individual pedestrians. Each time a new image is associated to a pedestrian, it is compared to the best image previously associated to that individual; if the new image is an improvement, it replaces the best image. When a pedestrian leaves the camera's field of view, the pedestrian's best image is exported.

Technology

The current implementation of the project relies very heavily on three important technologies; without them, the project would not have been possible. The following sections list the technologies that were used and discuss why each was so invaluable.

The Open Source Computer Vision Library (OpenCV)

The open source computer vision library is a development library written in the C/C++ programming language. The library includes over 300 functions, ranging from basic image processing routines all the way up to state-of-the-art computer vision operations. As the OpenCV documentation describes:


    Example applications of the OpenCV library are Human-Computer Interaction (HCI); Object Identification, Segmentation and Recognition; Face Recognition; Gesture Recognition; Motion Tracking, Ego Motion, Motion Understanding; Structure From Motion (SFM); and Mobile Robotics. (“What is OpenCV”, 2006)

The importance of OpenCV to this project cannot be stressed enough. The representation of all images processed by the face extraction software is defined by a structure located in one of the OpenCV libraries, and OpenCV routines are used in almost every instance where image processing is done. Finally, without OpenCV's object detection routines, none of this project would have been possible; the object detection routines are used to detect faces in the video sequences, and the results are truly amazing.
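To give a flavour of how the library is used throughout the project, the following minimal sketch (based on the OpenCV C API of the era; the file name and processing step are placeholders only) loads a frame into the IplImage structure mentioned above and converts it to grey scale:

    #include <cv.h>
    #include <highgui.h>

    void processExampleFrame()
    {
        // Load a frame into OpenCV's IplImage structure (1 = load as colour).
        IplImage* frame = cvLoadImage("example_frame.jpg", 1);
        if (!frame) return;

        // Convert to a single-channel grey-scale image for further processing.
        IplImage* grey = cvCreateImage(cvGetSize(frame), IPL_DEPTH_8U, 1);
        cvCvtColor(frame, grey, CV_BGR2GRAY);

        // ... image processing routines would operate on 'grey' here ...

        cvReleaseImage(&grey);
        cvReleaseImage(&frame);
    }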

Microsoft's DirectShow

Microsoft's DirectShow is an application programming interface that allows developers to manipulate multimedia data in various useful ways. Microsoft describes the DirectShow API as follows:

    The Microsoft® DirectShow® application programming interface is a media-streaming architecture for the Microsoft Windows® platform. Using DirectShow, your applications can perform high-quality video and audio playback or capture. (Microsoft, 2006)

DirectShow is also occasionally known by its original codename “Quartz”, and was designed to replace Microsoft's earlier Video For Windows technology (“DirectShow”, 2006). Like Video For Windows, DirectShow provides a standardized interface for working with video input devices as well as with multimedia files. It also provides a technology called “Intelligent Connect” which makes it even easier to program for a wide range of input devices and video encodings.

DirectShow is a very complicated API, with an equally complicated history, and it is often criticized as being overly complex. Perhaps the Wikipedia article describes this situation best:

    DirectShow is infamous for its complexity and is often regarded by many people as one of Microsoft's most complex development libraries/APIs. A long-running semi-joke on the "Microsoft.public.win32.programmer.directx.video" newsgroup is "see you in 6 months" whenever someone wants to develop a new filter for DirectShow. (“DirectShow”, 2006)

Thankfully, this project did not require the development of any new DirectShow filters, and in general, the technology seemed relatively manageable.


The Background Segmentation Component

The final technology used for the project was a background segmentation component contributed by Dr. Robert Laganière. This component is a C++ class that, among other things, is able to determine which pixels of a video frame constitute the foreground. This is accomplished by building a statistical model of the scene's background and comparing each video frame to this model. This project uses background segmentation for motion detection and object tracking; both the face detector and pedestrian tracker components require the segmented image that the background segmentation component outputs.
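The component itself is not reproduced here, but the general technique can be sketched as follows: maintain a running-average background model and threshold each frame's difference from it. The sketch below is illustrative only (the learning rate and threshold are arbitrary) and is much simpler than the statistical model actually used:

    #include <cv.h>

    // Illustrative sketch of background segmentation by running average.
    // 'background' is a 32-bit float single-channel accumulator;
    // 'foregroundMask' is an 8-bit single-channel output image.
    void segmentForeground(IplImage* frame, IplImage* background, IplImage* foregroundMask)
    {
        IplImage* grey = cvCreateImage(cvGetSize(frame), IPL_DEPTH_8U, 1);
        IplImage* bg8u = cvCreateImage(cvGetSize(frame), IPL_DEPTH_8U, 1);

        cvCvtColor(frame, grey, CV_BGR2GRAY);
        cvRunningAvg(grey, background, 0.01);      // slowly adapt the background model
        cvConvert(background, bg8u);

        cvAbsDiff(grey, bg8u, foregroundMask);     // difference from the model
        cvThreshold(foregroundMask, foregroundMask, 30, 255, CV_THRESH_BINARY);

        cvReleaseImage(&bg8u);
        cvReleaseImage(&grey);
    }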

Architecture

The face extraction software is composed of six main components, as described by the following diagram. Each component is the subject of a lengthy discussion in the sections that follow.


Graphical User Interface

The graphical user interface (GUI) of the face extractor software is arguably the least important component of the entire project. For this reason, the level of detail in this section will be quite minimal. Additionally, this document focuses more on design than on usability. Thus, the following discussion will not cover the individual interface widgets, nor will it serve as a manual for anyone operating the software. Instead, it will simply discuss where certain aspects of the implementation can be found, and what functionality should be expected.

The GUI for the face export application is composed of four main classes: the main dialog, the input configuration dialog, the output configuration dialog, and the graph manager. The graph manager is the most important (and most complex) sub-component of this part of the system. In addition to the aforementioned classes, there are a few other classes that simply provide some custom controls to the GUI.

FACE Dialog

Header File: ./FaceDlg.h
C++ File: ./FaceDlg.cpp
Namespace:
C++ Class Name: CFaceDlg

The entire application was originally designed as a Microsoft Foundation Classes (MFC) dialog project. Every dialog application project – including this project – begins by displaying a single main dialog window. The face extractor project uses the FACE Dialog for exactly this purpose. From this dialog, users can:

1. Configure the input settings
2. Configure the output settings
3. Start and stop video capture
4. Configure the individual settings for DirectShow capture graph pins and filters

Input Configuration Dialog

Header File: ./ConfigInputDlg.h
C++ File: ./ConfigInputDlg.cpp
Namespace:
C++ Class Name: CConfigInputDlg

The input configuration dialog allows users to select a video input device, or a file to which video has been previously saved. The list of available input devices includes all DirectShow filters that are in the CLSID_VideoInputDeviceCategory category. These typically include web cameras, TV tuner cards, and video capture cards.
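For reference, device lists of this kind are typically built by enumerating the category with DirectShow's system device enumerator. The following sketch shows the general pattern (it is not the dialog's actual code; error handling and COM initialization are omitted):

    #include <dshow.h>

    // Enumerate devices in CLSID_VideoInputDeviceCategory (COM already initialized).
    void listVideoInputDevices()
    {
        ICreateDevEnum* devEnum = NULL;
        CoCreateInstance(CLSID_SystemDeviceEnum, NULL, CLSCTX_INPROC_SERVER,
                         IID_ICreateDevEnum, (void**)&devEnum);

        IEnumMoniker* monikers = NULL;
        if (devEnum->CreateClassEnumerator(CLSID_VideoInputDeviceCategory,
                                           &monikers, 0) == S_OK)
        {
            IMoniker* moniker = NULL;
            while (monikers->Next(1, &moniker, NULL) == S_OK)
            {
                LPOLESTR displayName = NULL;
                moniker->GetDisplayName(NULL, NULL, &displayName);  // unique per device
                // ... add displayName to the list presented to the user ...
                CoTaskMemFree(displayName);
                moniker->Release();
            }
            monikers->Release();
        }
        devEnum->Release();
    }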

Output Configuration Dialog

Header File: ./ConfigOutputDlg.h
C++ File: ./ConfigOutputDlg.cpp
Namespace:
C++ Class Name: CConfigOutputDlg

Unlike input configuration, output configuration is entirely optional. These settings allow users to save the processed video sequences to a file. They also allow users to specify a directory where the exported face images should be saved (currently, all images are saved in the JPEG format). If a user decides to save the video to a file, then the user is prompted for a valid file name. They may also select and configure a video compressor.

Graph Manager

Header File: ./GraphManager/GraphManager.h
C++ File: ./GraphManager/GraphManager.cpp
Namespace:
C++ Classes: GraphManager, FilterDescriptor

The GraphManager class is one of the largest and most complicated classes of the entire project. This class is responsible for the construction and destruction of the Microsoft DirectShow capture graphs used by the application. The graph manager makes heavy use of the Intelligent Connect technology (by constructing graphs using the CaptureGraphBuilder2 interface). Therefore, it supports many different video capture devices and multimedia file encodings. In fact, the face extractor software has been tested with various brands of web cameras and at least one brand of TV tuner card (Hauppauge WinTV). Interestingly, when using the TV tuner card, Intelligent Connect is wise enough to include all filters required to control the TV tuner.

The graph manager was inspired by the SequenceProcessor class included in Dr. Laganière's OpenCV / DirectShow tutorial. In fact, both the SequenceProcessor and the GraphManager use functions defined in the file “filters.h”, which was also included with the tutorial. There are, however, some major differences between these two classes. The first major difference is the use of the CaptureGraphBuilder2 interface, which was described above. The second difference is that the GraphManager uses display names (rather than friendly names) to identify filters internally; this allows the system to distinguish between multiple physical devices that share the same friendly name. The final difference is that the graph manager class provides support for displaying a filter's properties or an output pin's properties. For example, to change the channel on a TV tuner device, one simply displays the properties of the TV tuner filter, and then selects the appropriate channel. Different devices have different properties, and these property sheets are built into the filters themselves.
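In outline, graph construction with the CaptureGraphBuilder2 interface follows the standard DirectShow pattern shown below. This is only a sketch (error handling is omitted, and binding the capture filter from the chosen device moniker is assumed to have happened elsewhere), not the GraphManager code itself:

    #include <dshow.h>

    // Sketch: build a preview graph around an already-bound capture filter.
    void buildPreviewGraph(IBaseFilter* captureFilter)
    {
        IGraphBuilder*         graph   = NULL;
        ICaptureGraphBuilder2* builder = NULL;

        CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                         IID_IGraphBuilder, (void**)&graph);
        CoCreateInstance(CLSID_CaptureGraphBuilder2, NULL, CLSCTX_INPROC_SERVER,
                         IID_ICaptureGraphBuilder2, (void**)&builder);
        builder->SetFiltergraph(graph);

        graph->AddFilter(captureFilter, L"Capture");

        // Intelligent Connect inserts whatever decoders/renderers are needed.
        builder->RenderStream(&PIN_CATEGORY_PREVIEW, &MEDIATYPE_Video,
                              captureFilter, NULL, NULL);

        // ... hold on to 'graph' and 'builder', run the graph via IMediaControl, etc. ...
    }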

So far, the discussion has focused on how the graph manager controls video input. Not surprisingly, it also controls the video output. In all cases, video is rendered directly to the screen; however, the graph manager also allows users to save the output to disk. Video can be saved in one of many supported video encodings, and in most cases, the user can specify the level of quality and compression of the video encoder.

At the time of writing there are some outstanding issues regarding the graph manager. In particular, the graph manager has trouble constructing filter graphs when using an uncompressed video file as the source. Additionally, the ability to pause playback has not been entirely implemented; as a result, the face extractor software does not have a pause button on the main interface. Another issue that needs consideration involves the installation of filter graph event handlers: at this time there is no clean way to forward events to the controlling window. The final issue is that there are currently no provisions for seeking within video files. Despite these issues, the graph manager provides some very powerful functionality, and the above issues will likely be resolved in the near future.

Face Extractor

Header File: ./FaceExtractor.h
C++ File: ./FaceExtractor.cpp
Namespace:
C++ Classes: Face, FaceGroup, ExtractorObserver, Extractor

The face extractor component is the next highest level of abstraction below the GUI. Essentially, it is the only interface that any programmer using the system is likely to need. The face extractor receives input video frames, and exports face images according to a well defined set of rules. All image processing is accomplished by three of the face extractor's subcomponents: the face detector, the background segmentation component, and the pedestrian tracker. The face extractor merely interprets the results of its subcomponents, and uses this information to associate faces to pedestrians. This association is achieved by assigning an identifier to each instance of a face. If two face images have the same identifier, then both images are from the same individual.

The face extractor is also responsible for determining when a face image should be exported. As mentioned in the introduction, the idea of the entire project is to export only the best face captured for each pedestrian. The face extractor, however, provides slightly more flexibility. It exports a face image if any one of the following three conditions is met:

1. No previous face image has been captured for a given pedestrian.
2. The current face image is an improvement over all previously exported images.
3. The pedestrian leaves the scene, in which case the best face ever captured is re-exported.

In addition to identifiers, exports are also labeled with an event. Events describe which of the above three conditions caused the export to take place. Finally, each exported face is also given a score: a numerical value that indicates the quality of the exported image. As a consequence of the export conditions, the scores of a sequence of images exported for a single individual are monotonically increasing.

Using the output of the face extractor component, developers can devise several high-level post-processing rules. For example, developers can decide to process a face as soon as its score crosses some pre-determined threshold. An alternative rule would be to process the best available face as soon as the pedestrian leaves the scene.

Face Export Rules

The three rules described in the previous section assume that faces can be uniquely associated to pedestrians. Unfortunately, the relationship between the faces returned by the face detector and the pedestrians returned by the pedestrian tracker is not usually one-to-one. For example, two people might be walking side-by-side, and the pedestrian tracker might incorrectly assume that they are a single moving object. In this case, the face detector might associate two faces to one “pedestrian”. Worse yet, the face detector might locate one face in some frames, and two faces in other frames. Therefore, the face export rules must be slightly more complex than previously stated.

In total, five rules are used to determine when faces are exported. These rules operate on face groups rather than on individual face images. Face groups are simply unordered collections of face images that are all from the same video frame, and are all associated to the same pedestrian. At any given time, two face groups are maintained for each pedestrian: the current face group and the historical best face group. The current face group represents faces detected in the current video frame. The historical best face group represents the best faces ever associated to the pedestrian. The rules given below describe the circumstances under which the historical best face group is updated to reflect the current face group. Whenever such an update occurs, all faces in the current face group are exported. In addition to this behavior, the face extractor re-exports a pedestrian's historical best face group whenever the pedestrian tracker indicates that the individual has left the camera's field of view. The five rules are as follows (a sketch of the resulting decision logic appears after the list):

1. If a pedestrian is considered new, then the first face group associated to the pedestrian is considered the historical best.
2. If the current face group contains a single face image, and the historical best face group also contains a single image, then the historical best group is updated only if the new face is an improvement over the previous best image. The exported face image is assigned the same unique identifier as the image it is destined to replace.
3. If the historical best face group contains more than one image, and this number does not increase with the current face group, then the historical best face group is updated when the best image from the current group is better than the worst image in the historical group. This rule is necessary because it is impossible to determine the association of new faces to old faces, and therefore impossible to determine which of the many faces may have improved. For this same reason, all face images in the group are given new unique identifiers.
4. If the current face group contains more faces than the historical best face group, then the historical face group is automatically updated. Of all the rules, this one is perhaps the most obscure. The historical best face group is updated because the update causes the faces to be exported; this is important because it is impossible to determine which face in the group is new (and thus not yet exported). The current faces are all given new unique identifiers in order to ensure that none of the previously exported faces are replaced.
5. If none of the previous rules are applicable, then no action is taken.

These rules are designed so that the face extractor errs on the side of caution. Whenever a pedestrian confuses the face extractor, the software exports every face image that might be associated with the pedestrian. In order for these rules to capture all possible scenarios, the pedestrian tracker must also be programmed to err on the side of caution; if the tracker ever becomes confused about a pedestrian, it must assign the pedestrian a new identifier and treat it as a new entity.
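The update decision can be summarised in code. The sketch below uses hypothetical helper methods (size(), bestScore(), worstScore()) on the project's FaceGroup class; the real Extractor implementation may be organized differently:

    // Sketch of the historical-best update decision described by the five rules above.
    // FaceGroup is the project's class; the helper methods shown here are hypothetical.
    bool shouldUpdateHistoricalBest(const FaceGroup& current, const FaceGroup& best,
                                    bool pedestrianIsNew)
    {
        if (pedestrianIsNew)                return true;                   // rule 1
        if (current.size() > best.size())   return true;                   // rule 4
        if (current.size() == 1 && best.size() == 1)
            return current.bestScore() > best.bestScore();                 // rule 2
        if (best.size() > 1 && current.size() <= best.size())
            return current.bestScore() > best.worstScore();                // rule 3
        return false;                                                      // rule 5
    }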

Modes of Operation

The face exporter component has two modes of operation. These modes simply control the sequencing of operations that are applied to the video frames. In the current version of the software, these modes are not accessible programmatically; instead, they are specified using compiler constants. For example, to search for faces before locating pedestrians, one would compile the application with the OPERATION_MODE constant set to the value FIND_FACES_THEN_PEDESTRIANS. To locate pedestrians and then search for faces, the constant should instead be set to FIND_PEDESTRIANS_THEN_FACES. The following describes these modes in detail.

FIND_FACES_THEN_PEDESTRIANS (suggested mode of operation): This mode searches for faces before searching for (and tracking) pedestrians. The location of each face is then input into the pedestrian tracker in the form of a “hint”. Currently, the tracker uses hints to ensure that all faces are associated to a pedestrian. For example, if a face is detected and no pedestrian is nearby, then a new pedestrian record is created; this new pedestrian is described by the smallest rectangular region that contains the orphaned face. Under normal circumstances, faces are associated to whichever pedestrian yields the largest intersection with the rectangular region describing the face. This mode of operation ensures that all faces detected by the face detector have the opportunity to be processed.

FIND_PEDESTRIANS_THEN_FACES: This mode is based on the idea that faces should only be sought in regions where pedestrians are found. The idea seems sound; however, the pedestrian tracker occasionally returns regions that do not encompass the entire pedestrian. In this case, the face detector may fail to detect faces that are cut off by a pedestrian's overly small bounding rectangle. This problem can be remedied by expanding the search region around each pedestrian. However, if there are several pedestrians in the scene, then the search regions may overlap, and portions of the frame may be searched twice. Finally, this mode of operation is guaranteed to call the face detection routines once per pedestrian per frame. Without careful attention to detail, the overhead of multiple calls may be significant. All of the above issues can be resolved, but the sources of error are numerous and the benefits are not significant. For this reason, this mode of operation is not recommended. Selecting this mode will work, but the various algorithms still need to be tuned to address the above issues.
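Assuming the mode constants are ordinary preprocessor definitions (the numeric values below are illustrative), selecting a mode at compile time looks roughly like this:

    // Compile-time selection of the processing order (constant names from the text;
    // the numeric values are illustrative).
    #define FIND_FACES_THEN_PEDESTRIANS  0
    #define FIND_PEDESTRIANS_THEN_FACES  1
    #define OPERATION_MODE  FIND_FACES_THEN_PEDESTRIANS   // suggested mode

    #if (OPERATION_MODE == FIND_FACES_THEN_PEDESTRIANS)
        // detect faces first, then pass their locations to the tracker as hints
    #else
        // track pedestrians first, then search each pedestrian region for faces
    #endif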

Mechanism By Which Face Images are Exported

Up until now, the discussion has simply mentioned that face images are exported, without explaining the mechanism by which this occurs. The face extractor component implements the observer / observable design pattern. Developers interested in the output of the face extractor simply register their classes as observers. Whenever a face image is to be exported, the face extractor notifies the observers by calling their updateFace() method. The arguments provided to this method are as follows:

color
    The RGB color of the rectangle drawn around the face in the output video. This information is not required for the main functionality, but it improves the usability of the GUI (it allows users to easily associate exported images with faces outlined in the video).

previousFace
    A pointer to the face record being replaced by the newFace. If this value is NULL, then no face has previously been exported on behalf of the pedestrian (i.e., the pedestrian is new).

newFace
    A pointer to the face record being exported. If this value is NULL, then the pedestrian has left the scene; in this case, previousFace is provided as a record of the best face exported on behalf of the pedestrian.

The individual observers are responsible for determining which of the export events are important, and which ones to ignore.
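In outline, an observer might look like the following sketch. The ExtractorObserver and Face class names and the updateFace() method come from the component described above; the color parameter type, the exact method signature, and the registration call are assumptions:

    // Sketch of a face-extractor observer (signature details and registration assumed).
    class JpegExportObserver : public ExtractorObserver
    {
    public:
        virtual void updateFace(CvScalar color, Face* previousFace, Face* newFace)
        {
            if (newFace == NULL) {
                // The pedestrian left the scene; previousFace is its best face.
            } else if (previousFace == NULL) {
                // First face exported on behalf of a new pedestrian.
            } else {
                // newFace improves on previousFace; replace the saved image.
            }
        }
    };

    // Hypothetical registration with the extractor:
    //   extractor.addObserver(&jpegObserver);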

Possible Improvements

The five export rules used by the face extractor may be overly cautious; when many individuals are erroneously grouped into a single pedestrian, the rate of exported images is quite high. There are at least two ways to correct this problem:

• the pedestrian tracker can be improved so that there are fewer grouping mistakes, or
• a mechanism can be developed to help pair faces across face groups. For example, it could use a person's shirt color to help match faces belonging to the same individual.

Additionally, the face extractor should offer better support for the find pedestrians then faces mode of operation. This would allow the software to function on much higher resolution images, provided that the pedestrians do not occupy the entire viewing area. Finally, the face extractor occasionally fails to export faces when pedestrians leave the scene. It is not known if this problem is the result of a bug with the face extractor, or with the pedestrian tracker. This issue will hopefully be resolved in the near future.

Face Detector

Header File: ./FaceDetector/FaceDetect.h
C++ File: ./FaceDetector/FaceDetect.cpp
Namespace: FaceDetector
C++ Classes: Face, Detector

The face detector is the most important – and most complex – of all the project components. The face detector is responsible not only for detecting faces in image sequences, but also for assessing their quality. The OpenCV library provided the mechanism by which faces are detected, but the mechanism used to assess image quality needed to be built from the ground up. Measuring image quality was certainly the most challenging aspect of the entire project.

The OpenCV Face Detector

In OpenCV, face detection is accomplished by invoking a single library function: cvHaarDetectObjects. This function uses a technique known as Cascading Haar Classifiers in order to recognize certain objects (in this case, faces). A tutorial on the OpenCV documentation Wiki describes this technique as follows:

    First, a classifier (namely a cascade of boosted classifiers working with haar-like features) is trained with a few hundreds of sample views of a particular object (i.e., a face or a car), called positive examples, that are scaled to the same size (say, 20x20), and negative examples - arbitrary images of the same size. After a classifier is trained, it can be applied to a region of interest (of the same size as used during the training) in an input image. The classifier outputs a "1" if the region is likely to show the object (i.e., face/car), and "0" otherwise. To search for the object in the whole image one can move the search window across the image and check every location using the classifier. The classifier is designed so that it can be easily "resized" in order to be able to find the objects of interest at different sizes, which is more efficient than resizing the image itself. So, to find an object of an unknown size in the image the scan procedure should be done several times at different scales. (“Face Detection using OpenCV”, 2006)

Currently, the face detector component uses several of these classifiers to identify faces that are facing in different directions. In order to improve the runtime of the face detector, the classifiers are applied to a half-scale copy of each input frame. This means that faces smaller than 40x40 pixels will not be detected. However, the software does allow developers to decide if scaling should take place.
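For reference, a minimal detection sketch with the OpenCV C API of that era is shown below, including the half-scale copy described above. The cascade path, parameters, and surrounding code are illustrative only:

    #include <cv.h>

    // Sketch: Haar-cascade face detection on a half-scale copy of a frame.
    void detectFaces(IplImage* frame)
    {
        CvHaarClassifierCascade* cascade = (CvHaarClassifierCascade*)
            cvLoad("haarcascade_frontalface_default.xml");
        CvMemStorage* storage = cvCreateMemStorage(0);

        IplImage* half = cvCreateImage(cvSize(frame->width / 2, frame->height / 2),
                                       frame->depth, frame->nChannels);
        cvResize(frame, half);   // faces smaller than 40x40 pixels are lost at this point

        CvSeq* faces = cvHaarDetectObjects(half, cascade, storage,
                                           1.1 /* scale step */, 3 /* min. neighbors */,
                                           CV_HAAR_DO_CANNY_PRUNING);

        for (int i = 0; faces && i < faces->total; i++) {
            CvRect r = *(CvRect*)cvGetSeqElem(faces, i);
            // Coordinates are in the half-scale image; multiply by 2 to map back.
        }

        cvReleaseImage(&half);
        cvReleaseMemStorage(&storage);
        cvReleaseHaarClassifierCascade(&cascade);
    }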

Measurements Used to Assess Image Quality

In order to assess the quality of face images, a series of metrics is used to measure various aspects of the input. These measurements are then fed into a linear function that returns the final score (a schematic sketch of this combination follows the list below). The scores increase in value as the image quality improves; therefore, larger scores are better than smaller scores. The following list enumerates all of the measurements used for this purpose:

1. The particular Haar-Classifier that detected the face
2. Gaze direction
3. Motion and skin content
4. Quality of lighting
5. Sharpness of the image
6. The size of the detected face

Each of the above criteria will be discussed in detail in this section.
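Schematically, the final score is nothing more than a weighted sum of the individual measurements. The weights below are placeholders (the real values were hand-tuned, as noted under Possible Improvements):

    // Schematic of the final quality score: a weighted sum of the individual metrics.
    // All weights shown are placeholders, not the hand-tuned values actually used.
    double faceScore(double classifierScore, double gazeScore, double motionSkinScore,
                     double lightingScore, double sharpnessScore, double sizeScore)
    {
        return 1.0 * classifierScore +
               1.0 * gazeScore +
               1.0 * motionSkinScore +
               1.0 * lightingScore +
               1.0 * sharpnessScore +
               1.0 * sizeScore;
    }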

Inferring About Image Quality Using Haar-Classifiers

The face detector uses several classifiers to detect faces in the video frames. Some classifiers are more accurate than others, and some classifiers detect certain gaze directions but not others. For example, results returned from a frontal face classifier are more desirable than those returned by a profile face classifier.

Gaze Direction

In the introduction to this paper, a biometric face recognition scenario was used to introduce the concept of face extraction from live video. Face recognition packages perform best when individuals are facing towards the camera. For this reason, gaze direction is an enormously important metric in determining the quality of a face image. Since gaze direction is simply the direction in which an individual is looking, it can be estimated by locating an individual's eyes in relation to the rest of their head. For example, if an individual is looking directly forward, the midpoint between the eyes should be horizontally centered on the head. However, if the individual is facing slightly to the left or to the right, then the midpoint between the eyes will be slightly off-center.

Measurements of gaze direction are highly sensitive to error. Experimentation has shown that gaze direction is the least reliable metric, and it leads to vast inconsistencies between results: on some occasions it works wonderfully, and in other cases the method fails outright. For this reason, the gaze direction measurements are considered accurate only when they fall within a very particular range of values.

Despite the inaccuracies, eye detection and gaze direction measurements are quite worthwhile. Many of the operations needed for eye detection are also needed for other metrics, so partial results can be reused. For example, the OpenCV face detector returns square regions around people's faces. In many cases, the faces are better represented by a rectangular region, not by a square; the square regions include too much background. The first step in eye detection is locating the side edges of the face. The result is a rectangular region that cuts off much of the unwanted background. This new rectangle is then used in all other metrics. Additionally, if the eyes can be successfully located, a wealth of other information immediately becomes available. For example, locating the eyes will also reveal the vertical axis of symmetry of the face. This measurement can be used to test various hypotheses about a candidate face image.

The mechanism by which eye detection is accomplished is described in the paper “A Robust Algorithm for Eye Detection on Gray Intensity Face without Spectacles” by Kun Peng et al. The method uses the horizontal gradient of the input image to detect the vertical location of the eyes and the horizontal locations of the sides of the face. It then estimates a region where the eyes are likely to be found, and searches this region for the brightest point. The brightest point is assumed to be the region of flesh between the individual's eyes, directly above the nose. Thus, the horizontal coordinate of this point can be taken to describe the axis of symmetry of the face. This method only works when faces are oriented in the usual way (i.e., not upside down or otherwise rotated).
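The following is only a rough sketch of the gradient-and-brightest-point idea; the eye-band bounds, and the omitted use of the gradient to find the sides of the face, are illustrative rather than the project's actual code:

    #include <cv.h>

    // Rough sketch: estimate the face's vertical axis of symmetry from a grey face image.
    void locateEyeMidpoint(IplImage* greyFace)
    {
        // Horizontal gradient: strong responses mark the vertical edges at the face sides.
        IplImage* grad = cvCreateImage(cvGetSize(greyFace), IPL_DEPTH_16S, 1);
        cvSobel(greyFace, grad, 1, 0, 3);
        // ... project |grad| onto columns to locate the left/right face edges (omitted) ...

        // Search an estimated eye band for the brightest point, assumed to be the patch
        // of skin between the eyes, directly above the nose.
        CvRect eyeBand = cvRect(0, greyFace->height / 4,
                                greyFace->width, greyFace->height / 4);
        cvSetImageROI(greyFace, eyeBand);
        double minVal, maxVal;
        CvPoint minLoc, maxLoc;
        cvMinMaxLoc(greyFace, &minVal, &maxVal, &minLoc, &maxLoc);
        cvResetImageROI(greyFace);

        // maxLoc.x (plus the band's x offset) estimates the axis of symmetry.
        cvReleaseImage(&grad);
    }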

Motion, Skin, and “Motion & Skin” Content

The OpenCV face detector is not always perfect; in many cases it returns regions that are not human faces. In order to detect false positives, it is assumed that human faces are part of the foreground, will move from frame to frame, and will be represented by pixels having colors that are normally associated with human skin. Motion content is defined to be the percentage of pixels in the face region that have recently experienced motion. Similarly, skin content is defined to be the percentage of pixels in the face region that are considered likely to be human skin. Finally, “motion and skin” content is the percentage of pixels that have recently moved and are likely to represent human skin (in other words, the pixels that exhibit both qualities at the same time). In all of the above cases, the face region is not the region returned by the OpenCV face detector, but is instead the improved region provided by the gaze direction analysis.

The above discussion assumes that pixels exhibiting motion can be easily detected. It also assumes that pixels representing human skin are equally simple to identify. Thankfully, this is in fact the case. The following sub-sections describe how this is accomplished.

Pixel Motion Detection

The face detector constructs a motion history image to determine which pixels have recently experienced motion. Rather than encoding a color or a shade of gray, each pixel in a motion history image encodes the most recent time at which that pixel was considered part of the foreground. This is accomplished by numbering input video frames, and using the output of the background segmentation component to selectively update the pixels of the motion history image. For the purpose of this project, pixels that have experienced motion are exactly those that are, or were recently, considered part of the foreground. These pixels can be easily identified by a simple thresholding operation applied to the motion history image.

Skin Detection

The skin detector used by the face extraction software is based on the research of Margaret M. Fleck and David A. Forsyth as described in their paper “Naked People Skin Filter”. This filter uses texture and color information to determine which pixels likely represent human skin. The main idea is that pixels representing skin are generally tightly confined to a small region of the hue-saturation color space; in particular, skin tends to range from red to yellow in hue, and it tends to be only moderately saturated. Additionally, skin tends to be rather smooth in texture, which is well represented by areas where the variance in pixel intensity is low (although texture is not considered in the current implementation of the face detector).

Currently, the skin detector uses the Hue/Saturation/Luminosity color space, in which the hue ranges from 0° (red) to 360° (red again), and in which the saturation ranges from 0 to 1. Hues between 0° and 38° tend to be described as reddish-yellow, while values between 330° and 360° are considered reddish-blue. During informal experimentation, pixels representing skin fell within one of these two ranges. Additionally, pixels representing skin tended to be more saturated as the hue approached what might be considered yellow. These results closely agree with the results described in the aforementioned paper, although a different color space and a different pair of regions were used. The particular regions used by the face detector are described in the following table:

Region            Hue            Saturation
Reddish-Yellow    0° – 38°       0 – 1.0
Reddish-Blue      330° – 359°    0 – 0.6

The above values were heavily influenced by an article entitled “Skin Color Analysis”, authored by Jamie Sherrah and Shaogang Gong.

Once the pixels representing skin are identified (producing the skin mask image), a second filter is applied to all neighboring pixels. This second filter uses a less strict set of rules in an attempt to intelligently close gaps that might occur in the original skin mask. Interestingly, images illuminated by natural or fluorescent lighting tended to be shifted toward the blue portion of the color spectrum, while images illuminated by incandescent lighting were shifted towards reddish-orange. For this reason, the above regions are slightly larger than would be necessary if the light source could be controlled.
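In sketch form, the per-pixel test over the hue and saturation ranges from the table above might look as follows. The frame would first be converted with cvCvtColor(frame, hls, CV_BGR2HLS); note that OpenCV stores hue as degrees divided by two and saturation as 0–255 in 8-bit images, so the values are rescaled before testing (the surrounding loop and the relaxed second-pass filter are not shown):

    // Sketch of the skin test using the hue/saturation ranges from the table above.
    bool isSkinPixel(unsigned char hue, unsigned char sat)
    {
        double h = 2.0 * hue;          // OpenCV 8-bit hue is degrees / 2
        double s = sat / 255.0;        // rescale saturation to 0..1

        bool reddishYellow = (h >= 0.0   && h <= 38.0)  && (s <= 1.0);
        bool reddishBlue   = (h >= 330.0 && h <= 359.0) && (s <= 0.6);
        return reddishYellow || reddishBlue;
    }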

Quality of Lighting

When developing the face detector component, one of the most frustrating phenomena occurred when testing the application after a long night of programming; at night, the software was carefully tuned, and measurements like skin detection and face feature detection worked wonderfully. In daylight, however, the lighting was harsh, the colors were shifted, and the carefully adjusted measurements needed to be re-calibrated. The harsh lighting also greatly confused the edge detector (horizontal gradient map) used when detecting face features. For this reason, the quality of lighting is an important metric for assessing image quality. Even if the face extraction software could cope with poor lighting, it is not known how biometric face recognition software (or other post-processing) might cope with such images.

The first step in assessing the quality of lighting is converting the color input frames to gray scale intensity images. Once a gray scale image has been acquired, a histogram of the image's pixel intensities is computed. The general assumption is that the quality of lighting is directly proportional to the width of this histogram. This is not always a valid assumption; it fails when a subject is not evenly illuminated.

In order to help address the problems caused by uneven illumination, one can rely on the assumption that faces are symmetric across a central vertical axis. This axis is determined when the face detector locates the face features. If the lighting is soft and even, then the distribution of gray scale values on one side of an individual's face should be similar to the distribution of values on the other side. This comparison is done by computing the histograms of the left and right halves of the face, normalizing each of these histograms, and then computing their intersection. The final lighting score is computed by multiplying the weight of this intersection with the width of the original histogram.

Measuring the Width of a Histogram

In the above discussion, histogram width was not well defined. For the purpose of this application, a histogram's width is defined as the smallest number of consecutive histogram bins that account for 95% of the pixels in the input image. To compute this value, a greedy algorithm is used. This algorithm starts by locating the mean of the histogram, which becomes the first bin added to a region called the histogram body. Histogram bins with indices higher than the largest index in the body are said to be in the head of the histogram. Similarly, bins with lower indices are said to be in the tail.


The greedy algorithm iteratively grows the body of the histogram by claiming the lowest index from the head, or the largest index from the tail. If the head of the histogram accounts for more pixels than the tail, the body's expansion is in the direction of the head. Otherwise, the body expands in the direction of the tail. This expansion continues until the body accounts for 95% of all pixels.
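A sketch of this greedy width computation is given below, assuming the intensity histogram has already been accumulated into 256 bins (the names and the 95% fraction follow the text above; everything else is illustrative):

    // Greedy histogram "width": the smallest run of consecutive bins holding 95% of
    // the pixels, grown outward from the mean intensity bin (illustrative sketch).
    int histogramWidth(const double bins[256], double totalPixels)
    {
        // Start the body at the mean intensity bin.
        double weightedSum = 0.0;
        for (int i = 0; i < 256; i++) weightedSum += i * bins[i];
        int lo = (int)(weightedSum / totalPixels), hi = lo;

        double body = bins[lo], head = 0.0, tail = 0.0;
        for (int i = hi + 1; i < 256; i++) head += bins[i];
        for (int i = 0; i < lo; i++)       tail += bins[i];

        while (body < 0.95 * totalPixels && (lo > 0 || hi < 255)) {
            // Grow toward whichever side (head or tail) still holds more pixels.
            bool growHead = (head > tail && hi < 255) || lo == 0;
            if (growHead) { hi++; body += bins[hi]; head -= bins[hi]; }
            else          { lo--; body += bins[lo]; tail -= bins[lo]; }
        }
        return hi - lo + 1;
    }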

Image Sharpness

In addition to the quality of lighting, image sharpness is another good indicator of image quality. In this sense, the adjective “sharp” is used as the antonym of blurry. Images can be blurred for several reasons, including motion blur or a camera that is incorrectly focused. In all cases, blurred images are certainly less desirable than sharp images. The challenge is finding a viable method for measuring image sharpness.

Currently, the face extraction software attempts to measure the amount of high-frequency content contained in an image in order to judge its sharpness. In images, high-frequency content can be defined as content that encodes edges, lines, and areas where the pixel intensities change significantly over short distances. With faces, the high-frequency content tends to concentrate around the eyes, lips, and other face features. In order to find high-frequency content, the software uses the Laplacian operator as a highpass filter. Pixels that survive the highpass filter (i.e., have values greater than some pre-determined threshold) are counted, and the result is divided by the total image area. Thus, the current measure of sharpness is simply the percentage of pixels that are considered to encode edges, lines, and other high-frequency content.

For arbitrary images, this approach may not always be valid; one perfectly focused image may contain fewer edges than another complex, but blurry, image. Thankfully, the face export software should only ever compare face images to similar images acquired in previous frames. Thus, any change in the high-frequency content can generally be attributed to such things as motion blur.
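In outline, the measure amounts to the following sketch (the threshold and aperture are illustrative):

    #include <cv.h>

    // Sketch of the sharpness measure: fraction of pixels passing a Laplacian highpass.
    double sharpness(IplImage* greyFace)
    {
        IplImage* lap   = cvCreateImage(cvGetSize(greyFace), IPL_DEPTH_16S, 1);
        IplImage* edges = cvCreateImage(cvGetSize(greyFace), IPL_DEPTH_8U, 1);

        cvLaplace(greyFace, lap, 3);                           // highpass filter
        cvConvertScaleAbs(lap, edges);                         // magnitude of the response
        cvThreshold(edges, edges, 50, 255, CV_THRESH_BINARY);  // illustrative threshold

        double fraction = cvCountNonZero(edges) /
                          (double)(greyFace->width * greyFace->height);

        cvReleaseImage(&edges);
        cvReleaseImage(&lap);
        return fraction;   // proportion of pixels considered "high frequency"
    }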

Image Dimensions and Area

The final assumption about face image quality is that larger images are better than smaller images. It should be noted that the OpenCV face detector returns regions that are square; if the face regions were not square, then it would likely be the case that certain aspect ratios would be better than others.

Unfortunately, one cannot always assume that large images are best. In fact, large images often indicate a false positive returned from the OpenCV face detector. Often, only a certain range of sizes will be reasonable for a given scene. For example, a security camera responsible for observing a large lobby will expect smaller face images than a personal web camera sitting on top of a user's computer monitor. This suggests that the range of acceptable image sizes should be configurable by the end-user of the system. Unfortunately, at this time, these parameters can only be changed by modifying constants defined in the face detector source code.

Another option worth mentioning is that it might be possible for the software to learn, on its own, the range of acceptable image sizes. For example, an average face size can be determined by considering the dimensions of all face regions that have scored well in the other face quality metrics. Once this is accomplished, candidate face regions that are significantly different from this model can be discarded. At this time, this option has not been explored.

Possible Improvements

The main problem with the face detector is that an image's score is not always a reasonable indicator of quality. This is not the fault of the individual quality metrics, but is the result of the function that combines these individual values into the final score. As mentioned earlier, this function is nothing more than a simple weighted sum of the aforementioned quality metrics. The weights associated with each metric were chosen almost arbitrarily, and then hand-tuned over a series of tests until the results seemed adequate. There is almost certainly a better approach to be taken.

Pedestrian Tracker

Header File: ./PedestrianTracker/PedestrianTracker.h
C++ File: ./PedestrianTracker/PedestrianTracker.cpp
Namespace: PedestrianTracker
C++ Classes: Pedestrian, Tracker

The pedestrian tracker is an important component of the system, but could be the subject of an entire project on its own. For this reason, the tracker was kept as simple as possible, while still producing acceptable results. The current implementation processes each frame of the video sequence in four distinct phases:

1. The first phase uses the background segmentation component to identify the foreground pixels. It then locates all of the connected foreground components, and their bounding rectangles.


2. The second phase attempts to associate each of the connected components with a pedestrian detected in the previous video frame. A pedestrian is nothing more than a rectangular region of interest, and the association is achieved by determining which pedestrian's rectangle each component intersects with the greatest area (a sketch of this step follows the list). If no association is possible, then the component is considered a new pedestrian.

3. The third phase of the tracker groups the connected components based upon the pedestrian to which each is associated. A bounding rectangle is computed for each of these groups. These large bounding rectangles are then assigned to the appropriate pedestrians (replacing their bounding rectangles from the previous frame).

4. The fourth, and final, phase determines the foreground-pixel density of each of the resulting pedestrians. This density is simply the percentage of pixels within the pedestrian's current bounding rectangle that are foreground pixels. If this density is low, then it is assumed that the results are incorrect, and the tracker re-divides the pedestrian into its individual components.

The above algorithm is based upon the assumption that pedestrians in the current frame should be associated with nearby pedestrians from the previous frame. It also assumes that pedestrians may be composed of several distinct foreground components.
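The association in phase 2 amounts to picking the previous-frame pedestrian whose bounding rectangle overlaps the component with the greatest area. Schematically (the helper and container names are illustrative, not the Tracker's actual code):

    #include <cv.h>
    #include <vector>
    #include <algorithm>

    // Sketch of phase 2: return the index of the previous-frame pedestrian whose
    // rectangle overlaps the component with the greatest area, or -1 for none.
    int associateComponent(CvRect component, const std::vector<CvRect>& pedestrians)
    {
        int bestIndex = -1;
        int bestArea  = 0;
        for (size_t i = 0; i < pedestrians.size(); i++) {
            const CvRect& p = pedestrians[i];
            int w = std::min(component.x + component.width,  p.x + p.width) -
                    std::max(component.x, p.x);
            int h = std::min(component.y + component.height, p.y + p.height) -
                    std::max(component.y, p.y);
            int area = (w > 0 && h > 0) ? w * h : 0;
            if (area > bestArea) { bestArea = area; bestIndex = (int)i; }
        }
        return bestIndex;   // -1 means no overlap: the component becomes a new pedestrian
    }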

Possible Improvements

As mentioned above, the current generation of the pedestrian tracker is very simple. It can, and probably should, be replaced at a later date. At present, the tracker is slightly over-zealous about merging objects. It also has difficulty tracking fast-moving objects, but pedestrians do not usually move very fast. Finally, the tracker fails to recognize when objects merge – although this latter issue could probably be resolved easily.


Appendix A: Building From Source

Assumptions:

The following discussion assumes that OpenCV beta 5 is installed in the default location on Windows XP Service Pack 2:

    C:\Program Files\OpenCV

This implies that the Haar classifier data sets are located as follows:

    C:\Program Files\OpenCV\data\haarcascades

Finally, it assumes that OpenCV has been properly installed, and that the system path has been modified to include:

    C:\Program Files\OpenCV\bin

It also assumes that developers are using Microsoft Visual Studio 2003.

If the Assumptions Fail:

If any of the above assumptions fail, let <OPENCV_DIR> be the location where OpenCV was actually installed. The following modifications are then necessary:

Changes to "./FaceDetector/FaceDetect.h"

The following constants need to be modified to point to the appropriate Haar data sets:

    #define DEFAULT_FRONTAL_CLASSIFIER_PATH \
        "<OPENCV_DIR>\\data\\haarcascades\\haarcascade_frontalface_default.xml"
    #define DEFAULT_PROFILE_CLASSIFIER_PATH \
        "<OPENCV_DIR>\\data\\haarcascades\\haarcascade_profileface.xml"

Changes to the Visual Studio Project

"Project -> Properties -> C/C++ -> General -> Additional Include Directories" must be set to:

    "<OPENCV_DIR>\otherlibs\highgui"; "<OPENCV_DIR>\filters\ProxyTrans"; "<OPENCV_DIR>\cxcore\include"; "<OPENCV_DIR>\cvaux\include"; "<OPENCV_DIR>\cv\include"

Also, "Project -> Properties -> Linker -> General -> Additional Library Directories" must be set to:

    "<OPENCV_DIR>\lib";

Finally, just to be thorough, make sure that "Project -> Properties -> Linker -> Input -> Additional Dependencies" is set to:

    strmiids.lib quartz.lib cv.lib cxcore.lib highgui.lib

Running the Binaries on Systems without OpenCV

If one wishes to run the face extractor software on a system where OpenCV is not installed, then the following files must be included in the same directory as the binary executable FACE.exe:

    cv097.dll
    cv097d.dll
    cvaux097.dll
    cxcore097.dll
    cxcore097d.dll
    haarcascade_frontalface_default.xml
    haarcascade_profileface.xml
    highgui096d.dll
    highgui097.dll
    proxytrans.ax

Additionally, the constants DEFAULT_FRONTAL_CLASSIFIER_PATH and DEFAULT_PROFILE_CLASSIFIER_PATH defined in ./FaceDetector/FaceDetect.h must be modified to load the classifiers from the local ./ directory. Finally, the proxy transform filter must be registered. This can be achieved by executing the following command at the command shell:

    regsvr32 proxytrans.ax


References

Fleck, M. & Forsyth, D. (n.d.). Naked People Skin Filter. Berkeley-Iowa Naked People Finder. Retrieved April 17, 2006 from http://www.cs.hmc.edu/~fleck/naked-skin.html

Laganière, R. (2003). A step-by-step guide to the use of the Intel OpenCV library and the Microsoft DirectShow technology. Retrieved April 17, 2006 from http://www.site.uottawa.ca/~laganier/tutorial/opencv+directshow/

Microsoft Corporation (2006). Microsoft DirectShow 9.0. MSDN Library. Retrieved April 17, 2006 from http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directshow/htm/directshow.asp

OpenCV Community (2006). Face Detection using OpenCV. OpenCV Library Wiki. Retrieved April 17, 2006 from http://opencvlibrary.sourceforge.net/FaceDetection

OpenCV Community (2006). What is OpenCV?. OpenCV Library Wiki. Retrieved April 17, 2006 from http://opencvlibrary.sourceforge.net/

Peng, K., et al. (2005). A Robust Algorithm for Eye Detection on Gray Intensity Face without Spectacles. Journal of Computer Science and Technology. Retrieved April 17, 2006 from http://journal.info.unlp.edu.ar/Journal/journal15/papers/JCST-Oct05-3.pdf

Sherrah, J. & Gong, S. (2001). Skin Color Analysis. CVonline. Retrieved April 17, 2006 from http://homepages.inf.ed.ac.uk/cgi/rbf/CVONLINE/entries.pl?TAG288

Wikipedia contributors (2006). DirectShow. Wikipedia, The Free Encyclopedia. Retrieved April 17, 2006 from http://en.wikipedia.org/w/index.php?title=DirectShow&oldid=48688926