AN INTRODUCTION TO EMOTIONALLY RICH MAN-MACHINE INTELLIGENT SYSTEMS

Themis Balomenos(1), Amaryllis Raouzaiou(2), Kostas Karpouzis(2), Stefanos Kollias(2) and Roddy Cowie(3)

(1) ALTEC S.A., European R&D projects, 71 Grammou str, 15124 Maroussi, Greece, email: [email protected]
(2) School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou str, 15780 Zografou, Greece, email: [email protected], [email protected], [email protected]
(3) Department of Psychology, Queen’s University of Belfast, Northern Ireland, United Kingdom, email: [email protected]

ABSTRACT: In this paper we present issues regarding the development of an artificial system that registers its user’s emotional state and acts accordingly. We discuss these issues in the context of the ERMIS (Emotion-Rich Man-machine Interaction Systems) project, a useful context because the project sets out to explore what we regard as the natural approach to developing an emotion-sensitive system. The nature of the consortium allows it to draw on expertise from a substantial range of backgrounds, academic and applied. That breadth is critical, because one of the key issues in handling emotion is that no single discipline covers all the relevant ground; as a result, there is a very real risk that experts in one discipline will make fatal assumptions about other aspects of the problem without recognising that there is an issue.

KEYWORDS: emotion recognition, man-machine interaction, multimodal systems, visual, linguistic, paralinguistic analysis

INTRODUCTION

Emotion analysis is a key issue in the attempt to improve the effectiveness of the interaction between people, information appliances and information services, and to develop easier and more flexible access to information devices and services. Machines and devices that integrate speech, image and video interfaces capable of analysing their users’ emotional states can make communication more effective and avoid clumsiness in the interaction with rich and complex information sources and services. Man-Machine Interaction (MMI) systems that utilise multimodal information about their users’ current emotional state are presently at the forefront of interest of the computer vision and artificial intelligence communities. Such interfaces give less technology-aware individuals, as well as handicapped people, the opportunity to use computers more efficiently and thus overcome related fears and preconceptions. Besides this, most emotion-related facial expressions are considered to be universal, in the sense that they are recognised across different cultures. Therefore, the introduction of an “emotional dictionary” that includes descriptions and perceived meanings of facial expressions, so as to help infer the likely emotional state of a specific user, can enhance the affective nature [1] of MMI applications.

Despite the progress in related research, our intuition of what a human expression or emotion actually represents is still based on trying to mimic the way the human mind works when it recognises such an emotion. This means that even though image or video input is necessary to this task, the process cannot produce robust results without taking into account features like speech or body pose, which convey messages in a much more expressive and definite manner than wording alone, which can be misleading or ambiguous. While a lot of effort has been invested in examining these aspects of human expression individually, recent research [2] has shown that even this approach can benefit from taking multimodal information into account. Consider a situation where the user sits in front of a camera-equipped computer and responds verbally to written or spoken messages from the computer: speech analysis can indicate periods of silence on the part of the user, thus informing the visual analysis module that it can use related data from the mouth region, which is essentially ineffective while the user speaks. Conversely, the same verbal response from the user, e.g. the phrase “what do you think”, can be interpreted differently when pronunciation or facial expression are also taken into account and indicate a question, hopelessness or even irony.

The ERMIS project aims at generating a prototype system, based on robust speech analysis and enhanced by the analysis of visual attributes, that will be able to provide cues about the emotional state of the persons with which it interacts and to generate emotionally coloured speech, thus improving effectiveness and friendliness in human-computer interaction. A variety of applications and services can take advantage of the project results, including user-friendly call centres and community services, tele-education, e-health and personal assistants, next-generation mobiles and electronic commerce [3]. Specific applications include automatic phone banking, call centres, or booths, where analysis of the client’s voice at the beginning and end of the call, or of the client’s image, can indicate whether the customer is satisfied with the service; tele-education, where the vocal and visual responses of the participants can indicate the effectiveness of the process; and personal assistants, wearables, or next-generation mobile phones which can integrate and transmit both speech and visual patient information for health-care purposes. Analysis of the emotional state of the user can also greatly enhance middleware and agent technologies, especially during search and retrieval of multimedia content from large heterogeneous databases, since the system will be aware of a variety of the user’s reactions to the information presented to them and can use these reactions in an on-line relevance feedback framework. Alert systems, including in-car safety improvement, can also be assisted, by examining the drivers’ reactions and appearance and informing them, for example, about their capability to drive the car.
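As a toy illustration of the cross-modal interpretation discussed earlier in this section, the sketch below (plain Python, with cue labels that are invented for the example and do not correspond to actual analyser output) shows how the same utterance can be read differently once a prosodic or facial cue is attached to it.

# Toy illustration only: the cue labels and the mapping are invented for this
# example and are not output of the ERMIS analysers.

def interpret(utterance: str, prosodic_cue: str, facial_cue: str) -> str:
    """Return a coarse reading of an utterance given non-verbal cues."""
    if prosodic_cue == "rising_pitch" and facial_cue == "raised_eyebrows":
        return "question"            # genuine request for an opinion
    if prosodic_cue == "flat_pitch" and facial_cue == "lowered_gaze":
        return "hopelessness"        # resigned, rhetorical use of the phrase
    if prosodic_cue == "exaggerated_pitch" and facial_cue == "smirk":
        return "irony"               # sarcastic use of the phrase
    return "neutral statement"

if __name__ == "__main__":
    phrase = "what do you think"
    for cues in [("rising_pitch", "raised_eyebrows"),
                 ("flat_pitch", "lowered_gaze"),
                 ("exaggerated_pitch", "smirk")]:
        print(phrase, "->", interpret(phrase, *cues))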

THE ERMIS APPROACH

The above mentioned results indicate that emotion analysis is an area that is well placed to develop: there already exists a substantial body of prior knowledge and an increasingly clear picture of what current techniques can achieve. Based on these results and on the expertise of the Consortium partners in both the speech and image areas, the ERMIS project will conduct a systematic analysis of speech and facial input signals, separately as well as jointly; the aim will be to extract parameters and features which can then provide MMI systems with the ability to recognise the basic emotional state of their users and respond to them in a more natural and user-friendly way.

There are a number of issues that have to be investigated in the creation of such a system. The first has to do with the definition of the problem with respect to its targets. Although the archetypal emotions are the most important ones, a pragmatic concern is to find terms that cover a wider range, especially with respect to MMI applications, without becoming unmanageable. Recently, a descriptive framework suitable for use in automatic emotion recognition was assembled; it is based on a two-dimensional activation-emotion space, expressed in terms of emotional orientation and emotion strength, in which emotions are ordered by orientation. The basic advantage of this representation, for MMI applications, lies in its ability to provide a hierarchical structure of cues regarding the user’s emotional state: rough categorisation is the first target and, when required, more refined cues are derived. Research results showed that negative emotional orientation generally meant a balance in favour of withdrawal, and positive orientation a balance in favour of engaging; for example, boredom, disgust, fear and anxiety were marked by an unusually strong inclination to withdraw. Moreover, assigning a category term such as ‘worried’ can indicate, e.g. when the user is interacting with a service or information-providing centre, that the user is in need of information.

Since speech communication constitutes the best way of generating user-friendly human-computer interaction, specific attention should be paid to the integration of emotion analysis with speech recognition, creating a prototype that is able to analyse and respond to its users’ commands in real time, taking into account the cues about their emotional state. Development of facial expression analysis tools, especially in the framework of the MPEG-4 standard for audiovisual coding, constitutes an alternative means of retrieving the user’s emotional state. Facial expression analysis can be applied separately, or be used in addition to emotional speech analysis. It should be added that signs which are relevant to emotion may have alternative meanings; lowered eyebrows may signify concentration as well as anger. This makes coordinating information from different modalities a crucial aspect; joint analysis of audio-visual features, when available, will constitute a valuable tool for this purpose.

Another issue refers to the ability of the techniques to extract features from the speech and visual data with the accuracy that is required for reliable detection of the users’ emotions. The system to be generated should, on the one hand, be able to rely on prior knowledge related to the emotional analysis of speech and/or facial expressions and, on the other hand, be capable of accommodating the different expressive styles of humans.
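To make the two-dimensional activation-emotion representation introduced above more concrete, the sketch below places a few emotion words at illustrative coordinates (orientation on one axis, strength on the other; the coordinates are invented for the example, not taken from the project) and shows the hierarchical cue: a rough positive/negative categorisation first, and a nearest-category refinement when required.

import math

# Illustrative coordinates only (orientation in [-1, 1], strength in [0, 1]);
# they are not the values used in ERMIS.
EMOTION_SPACE = {
    "joy":      ( 0.8, 0.7),
    "serenity": ( 0.6, 0.2),
    "anger":    (-0.7, 0.8),
    "fear":     (-0.6, 0.7),
    "boredom":  (-0.3, 0.1),
    "sadness":  (-0.6, 0.3),
}

def rough_category(orientation: float) -> str:
    """First, coarse cue: positive vs. negative emotional orientation."""
    return "positive" if orientation >= 0 else "negative"

def refined_category(orientation: float, strength: float) -> str:
    """Refined cue: nearest labelled point in the space."""
    return min(EMOTION_SPACE,
               key=lambda name: math.dist(EMOTION_SPACE[name],
                                          (orientation, strength)))

if __name__ == "__main__":
    o, s = -0.55, 0.65            # a hypothetical estimate from the analysers
    print(rough_category(o))      # -> negative
    print(refined_category(o, s)) # -> fear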
The continuity of emotion space and the uncertainty involved in the feature estimation process favour the adoption of fuzzy approaches in the recognition procedure. Moreover, the required ability of the system to use prior knowledge, while also being capable of adapting its behaviour to its users’ characteristics, calls for intelligent hybrid, e.g. neurofuzzy, approaches; the project will adopt such approaches to achieve the best possible recognition performance.

Finally, one of the most crucial factors, which will constitute the criterion for the success of the project developments, refers to the effectiveness and the improvement of user friendliness achieved by the proposed system when compared with the current state of MMI. Effects which are currently known, such as the tendency of customers of a voice-operated information system to hyperarticulate when they cannot get through the dialogue (which is usually a bad strategy for getting better recognised), or other user reactions that might be caused by the system’s poor performance or specific responses, should be analysed and taken into account by the intelligent hybrid system. More importantly, a deep investigation of the responses of users of the ERMIS system will be required, to evaluate the reactions of humans when dealing with a system that seems capable of analysing their behaviour. Ethical issues related to this will be analysed and clarified, so that users have a clear understanding of what the system can do when they interact with it.

SYSTEM ARCHITECTURE

The general objective of the ERMIS project is to create a prototype system which will be capable of emotionally enriched interaction, by processing and analysing both verbal and non-verbal information. To do so, the system will use robust speech recognition techniques while, at the same time, extracting appropriate features from either or both of the speech and visual input signals and deducing cues about the underlying human emotional states. Moreover, it will be capable of generating emotionally coloured speech, responding to its users in a more natural and friendly way. To accomplish this goal, the prototype system described in this document will be designed. The overall system consists of the following subsystems:

1. Linguistic Analysis
2. Paralinguistic Analysis
3. Face Analysis
4. Automatic Emotion Recognition

A schematic view of the overall system architecture can be found in Figure 1. As can be seen in Figure 1, a suitable user interface will be employed for collecting the speech and video signals, i.e. the user’s voice and facial video, which will feed the analysis modules. Speech analysis will be carried out with respect to both types of information the speech signal carries: linguistic information, i.e. the qualitative targets that the speaker has attained (or approximated), conforming to the rules of language, and paralinguistic information, i.e. allowed variations in the way that qualitative linguistic targets are realised [4].

The linguistic analysis subsystem will deal with the first type of information, i.e. linguistic information. Standard speech recognition systems consist of four main modules: (a) feature extraction, converting each speech frame (each 10 milliseconds of the speech signal) into a set of cepstral coefficients; (b) acoustic phoneme modelling, which gives estimates of the probability of the features given a sequence of words; (c) language modelling, which provides an estimate of the probability of any sequence of words; and (d) a search engine, finding an optimal solution among all possible sentences.
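As an illustration of the first of these modules, the sketch below computes a short real-cepstrum feature vector for each 10-millisecond frame of a signal using plain NumPy; it is a minimal stand-in for the (typically mel-scaled) cepstral front end of a commercial recogniser, not the actual ERMIS implementation.

import numpy as np

def cepstral_features(signal: np.ndarray, sample_rate: int,
                      frame_ms: int = 10, n_coeffs: int = 13) -> np.ndarray:
    """Return one real-cepstrum feature vector per frame (rows = frames)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    window = np.hamming(frame_len)
    features = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10   # avoid log(0)
        cepstrum = np.fft.irfft(np.log(spectrum))       # real cepstrum of the frame
        features.append(cepstrum[:n_coeffs])
    return np.array(features)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                       # one second of test signal
    toy = np.sin(2 * np.pi * 220 * t)
    print(cepstral_features(toy, sr).shape)      # -> (100, 13)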
Such commercially available systems will go through certain modifications in order to be used for linguistic speech analysis [5].

Analysis of paralinguistic information, i.e. of variations in pitch and intensity having no linguistic function, and of voice quality, related to spectral properties that are not relevant to word identity, will be performed by the paralinguistic analysis module. Speech consists of words spoken in a particular way, and paralinguistic analysis is mainly concerned with the way the words are spoken [6]. The type of speech is closely related to some specific characteristics of the speech signal. In particular, the main energy source in speech is the vibration of the vocal cords; at any given time, the rate at which they vibrate determines the fundamental frequency of the acoustic signal, that is, the pitch. Pitch has been shown to be a statistical indicator of some speech types (e.g. clear and soft). The duration of speech sounds can be established from the labelling and is also indicative of speech type, particularly the duration of semivowels. The distribution of energy is another indicator of speech type; for instance, energy shifts towards vowels and away from consonants in loud speech. It is well known that the spectral distribution of energy varies with speech effort: effortful speech tends to contain relatively greater energy in low and mid spectral bands. Wavelet transforms provide a flexible method of energy decomposition; discrimination is increased by distinguishing the spectra associated with different speech sounds and exploring time variation in the energy distribution [7].

The facial analysis subsystem will perform face detection and evaluation, as well as facial feature and gesture analysis. At first, the user’s face will be detected using techniques based on detection and evaluation of either skin segments or blobs. The following step is to detect the position and shape of the mouth, the eyes, the eyelids and wrinkles, and to extract features related to them. Of particular interest are the facial animation parameters (FAPs) and facial definition parameters (FDPs) defined in the framework of the ISO MPEG-4 standard. FAPs are based on the study of minimal facial actions and are closely related to muscle actions. Automatic extraction of FDPs will form the basis for identifying FAPs and creating higher-order representations to feed the emotion analysis module [8], [9].

The emotion analysis subsystem will analyse the features extracted from either or both of the audio and visual signals and provide cues about the attitude or emotional state of the user. The feature sets generated separately from the analysis of linguistic/paralinguistic speech data and of facial images will, on the one hand, be considered separately; on the other hand, they will be combined into audio-visual feature sets which will feed the emotion recognition system in cases where both speech and image data are captured and analysed. Different emotional states are represented in the activation-evaluation space. Because of the continuity of this emotion space and the uncertainty that is involved in both speech and visual feature extraction, fuzzy logic and neurofuzzy approaches will be adopted, within hybrid schemes that can offer natural ways of coding and enriching prior knowledge related to expression/emotion analysis, while adapting it to accommodate the different specific expressive styles of humans.
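The paralinguistic analysis above mentions wavelet transforms as a flexible way of decomposing the energy of the speech signal. A minimal sketch of such a band-wise energy decomposition, assuming the PyWavelets package and a Daubechies wavelet (both choices are ours, for illustration only), is given below.

import numpy as np
import pywt  # PyWavelets

def band_energies(signal: np.ndarray, wavelet: str = "db4",
                  level: int = 5) -> list:
    """Relative energy of each wavelet band, coarsest band first."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    return list(energies / energies.sum())

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    soft = np.sin(2 * np.pi * 150 * t)                      # low-pitched test tone
    effortful = soft + 0.8 * np.sin(2 * np.pi * 2500 * t)   # extra high-band energy
    print([round(e, 3) for e in band_energies(soft)])
    print([round(e, 3) for e in band_energies(effortful)])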

[Figure 1: General architecture of the ERMIS system. The user interface passes speech to the Linguistic and Paralinguistic Analysis Subsystems and video to the Facial Analysis Subsystem; the resulting linguistic, phonetic and facial parameters feed the Emotion Recognition Subsystem, whose output is the user’s emotional state.]

The Visual Module

The visual module (Figure 2) includes four sub-processes: face detection; face tracking; extraction of the facial features (eyes, mouth, nose); and extraction of the facial animation (FAP) and definition (FDP) parameters defined by the MPEG-4 standard.

Face detection is the most general approach to the most difficult problem associated with the visual modality: finding the region in an image that corresponds to the user’s face. Its input is a frame which may or may not contain an image of a face, in any position and of any size, looking towards the camera or turned through various angles. Its output is a number of rectangles, each of which bounds a credible face. The process of examining candidate rectangles is made tractable by heuristics which quickly reject a high proportion on the grounds of inappropriate levels of variation and/or skin colour. If a rectangle survives that stage, its contents are preprocessed to extract features in a high-dimensional space; the result is then projected into a subspace, and a support vector machine classification routine is applied. It reports the extent to which the contents of the rectangle display the characteristics that would be expected if it contained a face.

The remaining processes are linked in a single system, and they are conceptually related. Face tracking is a less general approach to the problem of finding the user’s face; it is ideal when there is prior information, e.g. a recent position of the face is known. Once the general position of the face is known, it is used to find points that are likely to lie on the contour of the face. Those in turn are used to identify an area that is believed to be the image of the face. Areas likely to contain the eyes and mouth are then identified, and those areas in turn are used to identify informative points on boundaries associated with the eyes and mouth. These are a subset of the points defined in the MPEG-4 standard as significant facial points (FPs).

Painting and still photography have encouraged researchers to think that information about emotion resides in instantaneous ‘snapshots’ of the face. In reality, facial movements over time are probably a more basic source of information; this is well recognised in systems that use descriptions of action patterns as features for the recognition of expressions/emotions. We have explored those issues [10], but they are not yet fully integrated into the ERMIS framework. Like others, they pose problems related to the timescale over which information becomes available.
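The face detection stage described above relies on cheap heuristics to reject most candidate rectangles before the more expensive subspace projection and support vector machine step. A rough sketch of such heuristics is given below, written with NumPy; the thresholds and the skin-colour rule are invented for the example, and the classifier step is only indicated by a comment.

import numpy as np

# Thresholds and the skin-colour rule are illustrative only.
MIN_VARIATION = 10.0      # grey-level standard deviation below this -> too flat
MIN_SKIN_FRACTION = 0.4   # too little skin-like colour -> unlikely to be a face

def is_skin(rgb: np.ndarray) -> np.ndarray:
    """Very rough per-pixel skin test on an (H, W, 3) uint8 patch."""
    r, g, b = (rgb[..., 0].astype(int), rgb[..., 1].astype(int),
               rgb[..., 2].astype(int))
    return (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b) & (abs(r - g) > 15)

def passes_heuristics(patch: np.ndarray) -> bool:
    """Quickly reject candidate rectangles before the expensive classifier."""
    grey = patch.mean(axis=2)
    if grey.std() < MIN_VARIATION:
        return False
    if is_skin(patch).mean() < MIN_SKIN_FRACTION:
        return False
    # Surviving patches would be projected into a subspace and scored by a
    # support vector machine trained on face / non-face examples.
    return True

if __name__ == "__main__":
    flat = np.full((32, 32, 3), 128, dtype=np.uint8)   # rejected: no variation
    skin_like = np.random.default_rng(0).integers(
        low=[140, 80, 60], high=[220, 160, 130], size=(32, 32, 3)).astype(np.uint8)
    print(passes_heuristics(flat), passes_heuristics(skin_like))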

[Figure 2: The Face Analysis Subsystem. Video input is processed by the Face Detection and Evaluation and Face Tracking Modules, whose output feeds the Automatic Facial Feature Extraction and Facial Feature Tracking Modules, which produce the facial parameters.]

The Auditory Module

The auditory module carries out two distinct functions: linguistic analysis, which aims to extract the words that the speaker produces, and paralinguistic analysis, which aims to extract significant variations in the way words are produced, mainly in pitch, loudness, timing and ‘voice quality’. Both are designed to cope with the less than perfect signals that are likely to occur in real use.

The linguistic subsystem (Figure 3) first processes the speech signal to enhance it and remove noise prior to recognition. A second important source of variability is the difference between speakers, e.g. male/female, adult/child, or specific individuals. That will be handled by normalising the input speech against speaker variability; in that way, feature extraction for the current speaker can be adapted to the acoustic models, instead of the models being adapted to the input.
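One common way of normalising features against speaker variability is cepstral mean and variance normalisation, sketched below with NumPy; this is offered as a generic illustration of per-speaker feature normalisation, not as the specific adaptation scheme used in ERMIS.

import numpy as np

def cmvn(features: np.ndarray) -> np.ndarray:
    """Cepstral mean and variance normalisation over one speaker's frames.

    `features` has one row per frame (e.g. the cepstral vectors produced by
    the front end); each coefficient is shifted and scaled so that its mean
    is 0 and its standard deviation 1 across the utterance / speaker.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-10      # guard against constant coefficients
    return (features - mean) / std

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    frames = rng.normal(loc=3.0, scale=2.0, size=(200, 13))  # toy feature matrix
    norm = cmvn(frames)
    print(norm.mean(axis=0).round(3))   # ~0 for every coefficient
    print(norm.std(axis=0).round(3))    # ~1 for every coefficient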

[Figure 3: The Linguistic Analysis Subsystem. The speech signal passes through a Signal Enhancement/Adaptation Module (short-term spectral domain processing, singular value decomposition); the enhanced speech signal feeds the Speech Recognition Module, whose text output goes to a Post-Processing Module that produces the linguistic parameters.]

The output of the enhancement module will feed the feature extraction stage, which converts each speech frame into a set of cepstral coefficients. Acoustic modelling will be achieved by using models which represent individual phones by hidden Markov models (HMMs), with state-tying used to link states which are acoustically indistinguishable. Speech recognition will follow the established principles of statistical pattern recognition used for Large Vocabulary Recognition (LVR) systems. The output of this processing is a representation of the text the user speaks.

The paralinguistic subsystem (Figure 4) is concerned with more slowly changing variables – voice intensity, pitch and spectral balance, and the structure and timing of phrase-like units and pauses. This module will provide non-verbal speech analysis and feature extraction; it is concerned primarily with the information about emotion that resides in the way words are spoken. The main target of this module will be the extraction of phonetic parameters, such as pitch, pitch range, average pitch, energy, feature boundaries, intensity and sound duration, all measured across the entire utterance after endpointing.
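The sketch below, using only NumPy, illustrates the kind of utterance-level phonetic parameters listed above: frames are energy-endpointed, a pitch estimate is obtained per voiced frame by autocorrelation, and simple statistics (average pitch, pitch range, duration) are reported. The thresholds and the autocorrelation pitch tracker are generic illustrations, not the ERMIS extractors.

import numpy as np

def frame_signal(signal, sr, frame_ms=20):
    n = int(sr * frame_ms / 1000)
    return signal[:len(signal) // n * n].reshape(-1, n)

def pitch_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Crude pitch estimate (Hz) from the autocorrelation peak, or None."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag if ac[lag] > 0.3 * ac[0] else None

def phonetic_parameters(signal, sr):
    frames = frame_signal(signal, sr)
    energy = (frames ** 2).mean(axis=1)
    voiced = energy > 0.1 * energy.max()          # crude endpointing threshold
    pitches = [p for f in frames[voiced]
               if (p := pitch_autocorr(f, sr)) is not None]
    return {
        "duration_s": voiced.sum() * frames.shape[1] / sr,
        "mean_energy": float(energy[voiced].mean()),
        "average_pitch": float(np.mean(pitches)) if pitches else None,
        "pitch_range": float(np.ptp(pitches)) if pitches else None,
    }

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 180 * t) * (t < 0.6)   # 0.6 s of 'speech', then silence
    print(phonetic_parameters(tone, sr))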

[Figure 4: The Paralinguistic Analysis Subsystem. The speech signal passes through a Pause Estimation Module; the resulting speech segments feed the Phonetic Parameter Extraction Module, which outputs the phonetic parameters.]

An existing prototype (ASSESS) uses a combination of techniques to extract these parameters reliably from imperfect inputs, and generates descriptors that have been shown to correlate with descriptors of emotional state. ASSESS extracts several basic kinds of structure from the raw speech signal – an intensity contour, a coarse spectrum, estimates of the points at which the vocal cords open, and a smooth pitch contour based on those estimates. Within these structures, boundaries are identified marking units such as edits, pauses, rises and falls in the intensity and pitch contours, and frication; statistical parameters which describe these structures are then generated. The prototype system generates a comprehensive battery of descriptors; the final version will be adapted to generate those which are shown to be both robust and diagnostic in the application context. It will also use the sophisticated pause detection of the linguistic subsystem to identify pauses which mark off natural units, and will provide descriptive parameters for the preceding natural unit.
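To illustrate the kind of structural descriptors mentioned above, the sketch below segments a smoothed contour (for example a pitch or intensity contour) into rises and falls using SciPy peak picking and reports simple statistics about them; this is a generic illustration in the spirit of ASSESS, not its actual implementation.

import numpy as np
from scipy.signal import find_peaks

def rise_fall_descriptors(contour: np.ndarray) -> dict:
    """Describe rises and falls between alternating minima and maxima."""
    peaks, _ = find_peaks(contour)
    troughs, _ = find_peaks(-contour)
    turning = np.sort(np.concatenate(([0], peaks, troughs, [len(contour) - 1])))
    moves = np.diff(contour[turning])            # positive = rise, negative = fall
    rises, falls = moves[moves > 0], moves[moves < 0]
    return {
        "n_rises": int(len(rises)),
        "n_falls": int(len(falls)),
        "mean_rise": float(rises.mean()) if len(rises) else 0.0,
        "mean_fall": float(falls.mean()) if len(falls) else 0.0,
    }

if __name__ == "__main__":
    x = np.linspace(0, 4 * np.pi, 400)
    contour = 180 + 30 * np.sin(x)               # toy pitch contour in Hz
    print(rise_fall_descriptors(contour))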

Integration

The evidence associated with the different sources needs to be integrated. The core techniques used in ERMIS will be based on fuzzy set theory [11]; that is the natural approach because experiments on multimodal emotion perception in humans are consistent with a fuzzy set model [12]. Refinements have been explored to various extents. Flexibility depends on ensuring that partial results are fully accessible rather than submerged in premature decisions (in line with the well-known principle ‘don’t do what you might later have to undo’). There are also many points in the process where attention-like effects may be needed: emotionally salient words and phrases need to be pinpointed in the text stream; paralinguistic analysis needs to be triggered at appropriate times (so that it can deal with natural units); there will be times when evidence from mouth shape is uninformative because the person is speaking; and there may be times when one channel or feature is giving much more reliable evidence than others. We aim to handle these and related issues systematically, using evidence from human brain systems for emotion, attention and consciousness.

As shown in Figure 1, the previous subsystems provide output that is used as input to the Automatic Emotion Recogniser. Parameters such as text, phonetic parameters and basic facial parameters will be the input to the Emotion Recogniser (Figure 5). Because of the continuity of the emotion space and the uncertainty involved in the feature estimation process, the project will adopt neurofuzzy approaches that can offer natural ways of coding and enriching prior knowledge related to emotion analysis, while accommodating the different specific expressive styles of humans. The output of this module is the real-life emotional state of the user.
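A minimal sketch of this kind of fuzzy, attention-aware integration is given below (plain Python, with membership values, channel names and reliability weights invented for the example): each channel contributes fuzzy memberships over coarse emotion classes, mouth-shape evidence is suppressed while the user is speaking, channels are weighted by an estimated reliability, and the partial per-channel results remain available alongside the fused result rather than being discarded.

# Membership values, channel names and reliability weights are invented for
# illustration; they are not output of the actual ERMIS analysers.

def fuse(channels: dict, reliability: dict, user_is_speaking: bool) -> dict:
    """Weighted fuzzy combination of per-channel emotion memberships."""
    fused = {}
    total_weight = 0.0
    for name, memberships in channels.items():
        weight = reliability.get(name, 1.0)
        if name == "mouth_shape" and user_is_speaking:
            weight = 0.0                     # mouth evidence uninformative here
        total_weight += weight
        for label, degree in memberships.items():
            fused[label] = fused.get(label, 0.0) + weight * degree
    return {label: value / total_weight for label, value in fused.items()}

if __name__ == "__main__":
    channels = {
        "paralinguistic": {"negative/active": 0.7, "negative/passive": 0.2,
                           "positive/active": 0.1},
        "mouth_shape":    {"positive/active": 0.8, "negative/active": 0.1},
        "eyebrows":       {"negative/active": 0.6, "negative/passive": 0.3},
    }
    reliability = {"paralinguistic": 1.0, "mouth_shape": 0.8, "eyebrows": 0.6}
    # Partial, per-channel results stay accessible; the fused view is extra.
    print(fuse(channels, reliability, user_is_speaking=True))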

[Figure 5: The Emotion Recognition Subsystem. Linguistic, phonetic and facial parameters, together with training data, feed the Emotional State Detection Module, which outputs the user’s emotional state.]

TESTBEDS

ERMIS does not aim to produce a product for sale; rather, it aims to prove the concept in a number of testbeds. The most developed so far is a ‘Sensitive Artificial Listener’, to be implemented on the user’s own PC and able to learn his or her own characteristics. Our Sensitive Artificial Listener is a descendant of ELIZA, an early AI program that ‘chats’ with users (via text I/O). ELIZA had no real understanding, but it used various tricks to simulate a ‘conversation’, mainly stock responses and rephrasing of the user’s last comment. Versions still exist, presumably because they provide a quirky kind of interaction that people enjoy [13]. By analogy, our Artificial Listener will simulate a conversation using input from the user’s voice and facial expressions, and stock responses keyed to the signs of emotion that it finds. Preliminary simulations suggest that systems like this could have a market because they are fun in their own right, like ELIZA, and perhaps mildly therapeutic. They have a serious function, though: they provide a context where it is possible to study the signs of emotion that occur in spontaneous discourse without having to develop massively complex AI.
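The sketch below gives the flavour of such a listener in a few lines of Python: a stock response is keyed to a coarse sign of emotion and, ELIZA-style, the user’s last comment is rephrased back to them. The emotion labels and responses are invented for the example.

import random

# Stock responses keyed to coarse signs of emotion (illustrative only).
STOCK_RESPONSES = {
    "positive": ["You sound pleased. What made it go so well?",
                 "That seems to cheer you up. Tell me more."],
    "negative": ["You sound a little down. What is bothering you?",
                 "That seems to upset you. Why do you think that is?"],
    "neutral":  ["I see. Go on.",
                 "Tell me more about that."],
}

def respond(last_comment: str, emotion_sign: str) -> str:
    """Pick a stock response and, ELIZA-style, rephrase the user's comment."""
    stock = random.choice(STOCK_RESPONSES.get(emotion_sign,
                                              STOCK_RESPONSES["neutral"]))
    rephrase = "You said: '" + last_comment.strip().rstrip(".!?") + "'. "
    return rephrase + stock

if __name__ == "__main__":
    print(respond("Nothing I try seems to work.", "negative"))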

ACKNOWLEDGEMENTS

This work has been supported by the EU funded ERMIS project (IST-2000-29319).

REFERENCES

[1] Picard, R.W., 2000, “Affective Computing”, MIT Press, Cambridge, MA.
[2] Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J., 2001, “Emotion Recognition in Human-Computer Interaction”, IEEE Signal Processing Magazine.
[3] Batliner, A.; Fischer, K.; Huber, R.; Spilker, J.; Nöth, E., (in press), “Desperately seeking emotions or: Actors, wizards, and human beings”, to appear in Speech Communication special issue on Speech and Emotion.
[4] Douglas-Cowie, E.; Campbell, N.; Cowie, R.; Roach, P., (in press), “Emotional speech: towards a new generation of databases”, to appear in Speech Communication special issue on Speech and Emotion.
[5] Young, S.J., 1996, “Large Vocabulary Continuous Speech Recognition”, IEEE Signal Processing Magazine, 13(5), pp. 45-57.
[6] Ekman, P.; Friesen, W., 1969, “The repertoire of non-verbal behavior: categories, origins, usage and coding”, Semiotica, 1, pp. 49-98.
[7] McGilloway, S.; Cowie, R.; Douglas-Cowie, E.; Gielen, S.; Westerdijk, M.; Stroeve, S., 2000, “Automatic recognition of emotion from voice: a rough benchmark”, in R. Cowie, E. Douglas-Cowie & M. Schroeder (eds), Speech and Emotion: Proceedings of the ISCA workshop, Newcastle, Co. Down, pp. 207-212.
[8] Tekalp, A.M.; Ostermann, J., 2000, “Face and 2-D Mesh Animation in MPEG-4”, Signal Processing: Image Communication, Vol. 15, pp. 387-421.
[9] Raouzaiou, A.; Tsapatsoulis, N.; Karpouzis, K.; Kollias, S., 2002, “Parameterized facial expression synthesis based on MPEG-4”, EURASIP Journal on Applied Signal Processing, Vol. 2002, No. 10, Hindawi Publishing Corporation, pp. 1021-1038.
[10] Tsapatsoulis, N.; Raouzaiou, A.; Kollias, S.; Cowie, R.; Douglas-Cowie, E., 2002, “Emotion Recognition and Synthesis Based on MPEG-4 FAPs”, in I.S. Pandzic, R. Forchheimer (eds), MPEG-4 Facial Animation: The Standard, Implementation and Applications, Chichester: Wiley.
[11] Klir, G.; Yuan, B., 1995, “Fuzzy Sets and Fuzzy Logic: Theory and Application”, Prentice Hall, New Jersey.
[12] Massaro, D.; Cohen, M.M., 2000, “Fuzzy logical model of bimodal emotion perception: Comment on ‘The perception of emotions by ear and eye’ by de Gelder and Vroomen”, Cognition & Emotion, 14, pp. 313-320.
[13] Ward, N.; Tsukahara, W., 1999, “A responsive dialog system”, in Wilks, Y. (ed), Machine Conversations, Kluwer, pp. 169-174.