Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs

Author: Valerie Kennedy

Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs Vicente P. Minotto, Claudio R. Jung and Bowon Lee∗

Abstract—Humans can extract speech signals that they need to understand from a mixture of background noise, interfering sound sources, and reverberation for effective communication. Voice Activity Detection (VAD) and Sound Source Localization (SSL) are the key signal processing components that humans perform by processing sound signals received at both ears, sometimes with the help of visual cues by locating and observing the lip movements of the speaker. Both VAD and SSL serve as the crucial design elements for building applications involving human speech. For example, systems with microphone arrays can benefit from these for robust speech capture in video conferencing applications, or for speaker identification and speech recognition in Human Computer Interfaces (HCIs). The design and implementation of robust VAD and SSL algorithms in practical acoustic environments are still challenging problems, particularly when multiple simultaneous speakers exist in the same audiovisual scene. In this work we propose a multimodal approach that uses Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) for assessing the video and audio modalities through an RGB camera and a microphone array. By analyzing the individual speakers’ spatio-temporal activities and mouth movements, we propose a mid-fusion approach to perform both VAD and SSL for multiple active and inactive speakers. We tested the proposed algorithm in scenarios with up to three simultaneous speakers, showing an average VAD accuracy of 95.06 % with an average error of 10.9 cm when estimating the three-dimensional locations of the speakers. Index Terms—Multimodal Fusion, Voice Activity Detection, Sound Source Localization, Hidden Markov Model, Support Vector Machine, Optical-Flow, Beamforming, SRP-PHAT.

Vicente P. Minotto and Claudio R. Jung are with the Institute of Informatics, Federal University of Rio Grande do Sul, Avenida Bento Gonçalves, 9500, Porto Alegre, RS, Brazil 91501-970. E-mails: {vpminotto, crjung}. Bowon Lee (corresponding author) is with the Department of Electronic Engineering, Inha University, Incheon, South Korea. E-mail: [email protected]

I. INTRODUCTION

Thanks to the advancement of computing resources across desktop and mobile platforms, human-computer interfaces (HCIs) have become richer and more functional [1], particularly those that allow human-to-human-like interactions, such as speech. One of the main problems with a speech-based HCI, such as Automatic Speech Recognition (ASR), is that practical acoustic environments often include factors such as noise, reverberation, and competing sound sources that significantly compromise the recognition accuracy. It is important to preprocess these degraded signals to extract clean speech signals as input to ASR systems. This is of particular importance when there exist competing speakers who are not using the HCI, because oftentimes the intended user may
not be the only person speaking. Without proper knowledge of which sound sources to capture, the system may process signals from unwanted sound sources and generate inaccurate results. This is one of the major technical barriers preventing systems such as ATM machines or automated information booths from using voice input for HCI in public places with large crowds. Voice Activity Detection (VAD) and Sound Source Localization (SSL) are the most important examples of such front-end techniques. The main goal of VAD is to distinguish segments of a signal that contain speech from those that do not, such that a speech recognizer can process only the segments that contain voice information. In addition, speech enhancement algorithms [2] can benefit from the output of the VAD because accurate noise estimation is crucial and is often updated during noise only periods [3]. In SSL, the main goal is to identify the location of the active sound source, so that it is possible to enhance its speech signal using spatial filtering techniques such as beamforming with microphone arrays, which is useful for separating or suppressing signals from unwanted sound sources spatially separated from the source to capture [4]. Most existing VAD and SSL approaches only consider single speaker scenarios. For applications such as HCI, videoconferencing, or gaming, it is often desired to distinguish different users that may be speaking simultaneously, and algorithms designed for single speaker cases may not be suitable for such applications. Recently proposed techniques for simultaneous speaker VAD [5]–[9] rely solely on the acoustic modality. While audio-only-based techniques might present promising results, leveraging visual information is often beneficial when a video camera is available. In this context, other studies use the joint (multimodal) processing of both image and audio [10]–[13]. 
The main idea is that by fusing more than one data modality it is possible to exploit the correlation among them in such a way that one modality compensates for the flaws of the others, making the algorithm more robust in adverse situations, especially with competing sources. Our method performs VAD and SSL for simultaneous speaker scenarios, using the audio and video modalities. To process the visual information, we use a face tracker to identify potential sound sources (speakers) followed by optical-flow analysis of the users’ lips with Support Vector Machines (SVMs) to determine whether or not that specific user is actively speaking. For the audio analysis, we use a Hidden Markov Model (HMM) competition scheme in conjunction with beamforming to individually evaluate the spatio-temporal behavior of potential speakers that are pre-identified by the


face tracker. We propose a mid-fusion approach between the visual VAD (VVAD) and audio VAD (AVAD) to construct our final multimodal VAD (MVAD). Then we perform SSL for active speakers using information from the face tracker as well as the Steered Response Power with Phase Transform (SRP-PHAT) beamforming algorithm [4]. Our experiments show an average accuracy of 95.06% with an average error of 10.9 cm for VAD and 3D SSL, respectively, on up to three simultaneous speakers in a realistic environment with background noise and interfering sound sources.

The remainder of this paper is organized as follows. Section II summarizes some of the microphone array techniques used in our work followed by the most recent research in the fields of VAD and SSL. In Section III, the proposed approach for multimodal VAD and SSL is presented. Section IV presents the experimental evaluation of our technique, and conclusions are drawn in Section V.

II. RELATED WORK AND THEORETICAL OVERVIEW

Typical existing VAD techniques are based on voice patterns in the frequency domain, pre-determined (or estimated) levels of background noise [14], or zero crossing rate [15]. These approaches, however, tend not to perform well in the multiple-speaker scenario since simultaneous speech appears as overlapped signals in the time-frequency plane. Therefore, most approaches (as in this work) use microphone arrays due to their capability to exploit spatial characteristics of the acoustic signals through beamforming [16] or Independent Component Analysis (ICA) aided by beampattern analysis [5]. Furthermore, adding visual information is highly beneficial, since speech-related visual features are invariant to the number of simultaneous active sound sources. Another consequence of the simultaneous sources case is that both VAD and SSL eventually become the same problem.
When extending VAD from single to multiple sources, for example, we can employ the microphone array to analyze different speakers separately. In these cases, SSL is necessary not only to identify the number of active sources, but also to detect which are the active ones among all possible candidates. Reciprocally, for extending SSL from single to multiple sources, VAD must be used for validating the located sound events, so that noise sources are not detected as active speakers.

A. Related Work

Maraboina et al. [5] use frequency-domain ICA to separate the speech signals of different sound sources, and beampattern analysis to solve the permutation problem in the frequency components. Unmixed frequency bins are then separately classified using thresholding aided by K-means clustering. This approach, however, assumes that the number of sound sources is known, and it was only explored for two-speaker scenarios. Other approaches using ICA have also been proposed, such as [6], where the authors report reasonable VAD accuracy for simulated data. In [17], simultaneous VAD is performed for a robot auditory system as a preprocessing step for ASR. The authors showed that by applying sound source localization through

delay-and-sum far-field beamforming, they can separate overlapping speech signals to the point that two sources are well detected. Their results are evaluated in terms of ASR accuracy, and are obtained in a fixed environment.

Gurban and Thiran [18] proposed a supervised multimodal approach for VAD. They use the energy of the speech signal as an audio feature and the optical flow of the mouth region as a visual feature. A Gaussian Mixture Model is trained from labeled data, and the Maximum Likelihood (ML) criterion is applied to classify each data frame. Their approach, however, does not deal with simultaneous speakers and it was tested in a controlled environment. The authors also mention that scenes having background movements may degrade the algorithm's performance, since no face detection/tracking algorithm is used.

In [10], background subtraction using stereo cameras is combined with expectation maximization (EM) using a microphone array; a Bayesian network trained with a particle filter is used to estimate the direction of the sound sources.

Naqvi et al. [19] performed multimodal analysis of multiple speakers to tackle the similar problem of blind source separation. They used multiple cameras with a three-dimensional (3D) face tracker to provide a priori information to a least-squares-based beamforming algorithm with a circular microphone array to isolate different sources. They further enhanced the separated audio sources by applying binary time-frequency masking as a post-filtering process in the cepstral domain. Results are not shown in terms of VAD or SSL accuracy, because they attempted speech separation only, assuming the speakers are always active.

A different approach for joint VAD and SSL has been proposed in [16], where the steered response power with phase transform (SRP-PHAT) method is used to iteratively detect the speakers' locations. For distinguishing the sources, the SRP-PHAT's gradient is used to separate different speakers' regions in its power map.
VAD is then automatically performed when the iterative algorithm finds the last speaker, which happens when the maximum's position corresponds to the SRP-PHAT's null point. This approach, however, is based on the assumption of diffuse noise and may cause false positives for directional noise such as door slams.

Other works also leverage the proven robust SRP-PHAT algorithm [20], for instance by integrating it with clustering techniques in an attempt to separate the maxima that belong to different speakers. Do et al. [21] use Agglomerative Clustering (AC) with the Stochastic Region Contraction optimization method; they later propose Region Zeroing and Gaussian Mixture Models [8], achieving up to an 80% correct classification rate for two speakers using a large-aperture microphone array; Cai et al. also explore AC, but with spectral sub-band SRP-PHATs [22]. Alternatively, in [23] (where the SRP-PHAT is named Global Coherence Field, GCF), multiple speakers are located by a de-emphasis approach of the GCF: the dominant speaker is localized, the GCF map is modified by compensating for the effects of the first speaker, and then the position of the second speaker is detected. While this method can considerably increase the localization rate of the second speaker, its computational cost is very high, and the localization accuracy of the second speaker depends on that


of the first one.

From the above-mentioned work, we may observe that many of the existing multi-source SSL/VAD algorithms extended the SRP-PHAT in a way that speakers other than the dominant one are localized and identified. In our work, we also use the SRP-PHAT for beamforming due to its robustness. Therefore, we briefly describe SSL using the SRP-PHAT algorithm in the next section.

B. Sound Source Localization using the SRP-PHAT Algorithm

For an array of $N_{mic}$ microphones, the signal $x_m(t)$ captured at the $m$th microphone can be described using a simplified acoustic model [20],

$$x_m(t) = \alpha_m s(t - t_m^{\mathbf{q}}) + u_m(t), \tag{1}$$

where $s(t)$ is the source signal, $u_m(t)$ represents the combination of reverberation, interferences, and background noise, and $\alpha_m$ and $t_m^{\mathbf{q}}$ denote the propagation attenuation and delay of the signal $s(t)$ from a source location $\mathbf{q}$ to the $m$th microphone, respectively. Equivalently, Eq. (1) can be represented in the frequency domain as

$$X_m(\omega) = \alpha_m S(\omega) e^{-j\omega\tau_m^{\mathbf{q}}} + U_m(\omega), \tag{2}$$

where $\omega = \frac{2\pi f}{F_s}$ is the normalized frequency in radians corresponding to the frequency $f$ (in Hz) of the continuous-time signal $x_m(t)$ that is sampled with the sampling rate of $F_s$ Hz, and $\tau_m^{\mathbf{q}} = t_m^{\mathbf{q}} F_s$. We assume that the signal is sampled above the Nyquist rate.

Therefore, given a vector of Fourier transforms of observed signals, $[X_1(\omega), X_2(\omega), \cdots, X_{N_{mic}}(\omega)]$ for the normalized frequency $\omega$, SSL may be seen as the problem of finding a source location $\mathbf{q}$ that satisfies some optimality criterion such as Maximum-Likelihood [24], [25] or maximum power of the filter-and-sum beamformer, as in the SRP-PHAT method [20].

As previously mentioned, the SRP-PHAT is currently one of the state-of-the-art algorithms for SSL due to its robustness against noise and reverberation. It finds a source location by comparing, for a frame of data, the output energies of PHAT-weighted filter-and-sum beamformers of different potential sound source locations in a search region. The filter-and-sum beamformer steered at location $\mathbf{q}$ may be represented in the frequency domain as

$$Y(\omega, \mathbf{q}) = \sum_{m=1}^{N_{mic}} W_m(\omega) X_m(\omega) e^{-j\omega\tau_m^{\mathbf{q}}}, \tag{3}$$

where $W_m(\omega)$ denotes a generic weighting function applied to the $m$th microphone's signal. When this weighting function is chosen to be the PHAT, that is, $W_m(\omega) = |X_m(\omega)|^{-1}$, we may define the SRP-PHAT of a point $\mathbf{q}$ by computing the energy of the PHAT-weighted filter-and-sum of that point. Using Parseval's theorem and ignoring the constant scaling factor $\frac{1}{2\pi}$, this energy may be described as

$$P(\mathbf{q}) = \int_0^{2\pi} |Y(\omega, \mathbf{q})|^2 \, d\omega = \int_0^{2\pi} \left| \sum_{m=1}^{N_{mic}} \frac{X_m(\omega)}{|X_m(\omega)|} e^{-j\omega\tau_m^{\mathbf{q}}} \right|^2 d\omega. \tag{4}$$

Once $P(\mathbf{q})$ has been computed for all candidate positions using Eq. (4), we can estimate the sound source location as

$$\hat{\mathbf{q}} = \operatorname*{argmax}_{\mathbf{q} \in Q} P(\mathbf{q}), \tag{5}$$

where $Q$ denotes a set of points in space that represent all candidate locations. This maximization approach robustly finds the dominant sound source given a relatively short time window (e.g., 50 ms). While the SRP-PHAT may be implemented in different ways [20], the formulation in Eq. (4) has been shown to be suitable for GPU implementation [26].

In general, one drawback of using the SRP-PHAT's global maxima to localize potential speakers is that the precision of such approaches tends to drop as the number of simultaneous speech sources increases. This is a rather common problem with beamforming techniques, since one speaker's voice acts as noise to the others' [27]. Moreover, to assume that a set of largest $P(\mathbf{q})$ values represents the speakers' positions is somewhat inaccurate, given that the SRP-PHAT's power map contains many local maxima due to noise and reverberation. Therefore, for evaluating which $P(\mathbf{q})$ values truly characterize a speaker, a proper VAD technique must be employed using some kind of a priori knowledge (or assumptions) about the acoustic scenario, e.g., a known number of speakers [23], diffuse noise [16], or speakers' $P(\mathbf{q})$ values above a minimum noise level [8].

III. THE PROPOSED APPROACH

Our work approaches multiple speaker VAD and SSL as joint problems. We employ a linear microphone array and a conventional RGB camera. Our setup expects the users to be facing the capture sensors and to be within the Field of View (FOV) of the camera, as is the case in most HCI systems [1]. A schematic representation of the required setup is provided in Fig. 1(a), and our prototype room based on such setup is given in Fig. 1(b). It is important to mention that for the entire work, we adopt a Cartesian coordinate system, where the width dimension ($x$) is parallel to the array, and positive to the right; the height dimension ($y$) is positive upwards; and the depth dimension ($z$) is positive towards the camera's FOV. The same convention is used for the image's $x$ and $y$ coordinates. Figure 2 summarizes the pipeline for the proposed VAD and SSL approach, which we describe next.

[Figure 1 labels: RGB Camera; Microphone Array; Camera's Field of View]

Fig. 1. (a) Schematic representation of the proposed approach. (b) Our prototype system.


[Figure 2 blocks: Face Detection and Tracking; Mouth Region Identification; 3D Users' Position Estimation; Individual SRP-PHAT SSL; Optical Flow Visual Feature Extraction; HMM Observable Extraction; SVM-based Lips Movement Analysis with Video Modality Weighting; HMM Competition Scheme for VAD; Final SSL of Active Speakers]

Fig. 2. Schematic representation of our algorithm’s flow. The indexes at the upper-right corner of the boxes represent the order in which the individual steps are processed. The blue arrows represent the places where information between audio and video are exchanged, that is, where the multimodal mid-fusion happens.

A. Visual Analysis

We can summarize the processing related to the video modality in two steps: extracting proper visual features from different potential speakers, and then evaluating them using a classification technique. The following subsections detail these two steps.

1) Visual Feature Extraction: In order to extract a reliable visual feature for VVAD, we exploit the fact that anyone who intends to speak moves the lips. As previously mentioned in Section II-A, this cue has been used in other works [28]–[31], which compute the optical flow of a region enveloping the speakers' mouths. In this work, we chose the Lucas-Kanade (LK) [32] algorithm for the task, due to its good compromise between computational cost and accuracy.

The first step in this process is to use a face tracker/detector algorithm to identify the potential speakers in the captured image. We opted to use the face tracker in [33] given its low computational complexity and robustness to light changes and head rotation. After running the tracker for the current frame $t$, $K$ faces in the scene are detected/tracked, and we may then find the bounding rectangle of each speaker's mouth using the following anthropometric relations [34]:

$$\mathbf{x}_{lips}^{tl} = \mathbf{x}_{mid} + (-0.4r, 0.25r), \tag{6}$$

$$\mathbf{x}_{lips}^{br} = \mathbf{x}_{mid} + (0.3r, 0.65r), \tag{7}$$


where $\mathbf{x}_{lips}^{tl}$ and $\mathbf{x}_{lips}^{br}$ respectively represent the coordinates of the top-left and bottom-right corners of the lips' bounding rectangle returned by the face tracker; $\mathbf{x}_{mid}$ is the 2D location of the face's center, and $r$ is the radius of the face (distance from $\mathbf{x}_{mid}$ to any corner of the face's bounding box). All these values are expressed in terms of the image coordinates.

After the mouth region has been defined for each speaker, we then populate that region with point LK features that will be tracked between adjacent image frames using the algorithm described in [35]. More precisely, for every new image frame (at time $t$) we distribute $N_{LK}$ features inside each of the detected mouth regions of frame $t - 1$, in a regular grid manner, and track them to their corresponding new position at $t$. An illustration of this process is shown in Figure 3. Additionally, it is important to notice that for defining the points to track, a feature selection algorithm such as in [36] could be used instead. However, for our case,

it provides no extra tracking accuracy while increasing the overall computational cost of the proposed approach.
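As a rough illustration of Eqs. (6) and (7) and of the regular-grid seeding, the following minimal sketch computes the mouth rectangle from a face center and radius and then spreads points inside it. The function names, the cell-center grid layout, and the grid size are assumptions made for this example only; the text does not specify how the grid is laid out or the value of $N_{LK}$.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mouth_rectangle(x_mid: Point, r: float) -> Tuple[Point, Point]:
    """Top-left and bottom-right mouth corners, offsets exactly as in Eqs. (6)-(7)."""
    cx, cy = x_mid
    top_left = (cx - 0.4 * r, cy + 0.25 * r)
    bottom_right = (cx + 0.3 * r, cy + 0.65 * r)
    return top_left, bottom_right


def regular_grid(top_left: Point, bottom_right: Point,
                 rows: int, cols: int) -> List[Point]:
    """Place rows x cols points at the cell centers of the rectangle.

    The cell-center layout is an assumption; any regular grid would do.
    """
    (x0, y0), (x1, y1) = top_left, bottom_right
    dx = (x1 - x0) / cols
    dy = (y1 - y0) / rows
    return [(x0 + (j + 0.5) * dx, y0 + (i + 0.5) * dy)
            for i in range(rows) for j in range(cols)]


# Hypothetical face: center (320, 240) pixels, radius 80 pixels.
tl, br = mouth_rectangle((320.0, 240.0), 80.0)
points = regular_grid(tl, br, rows=4, cols=6)  # N_LK = 24, chosen arbitrarily
```

The resulting points would then be handed to an optical flow tracker such as the LK implementation cited in the text.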

Fig. 3. Example of LK features distributed as a regular grid inside the mouths’ bounding rectangle.

For the feature extraction process, we leverage the fact that the computed optical flow vectors tend to show large magnitudes during speech, as opposed to silence situations. However, we also observe that since we analyze a region larger than the actual lip area (which is necessary for not losing track of the mouths during head translations and rotations), not all optical flow vectors have large magnitudes during speech. For this reason, we extract not only a measure of energy but also the standard deviation of the magnitudes as our visual features. Denoting by $\mathbf{x}_i(t)$ the position of the $i$th LK feature at the frame at time $t$ (tracked from $t - 1$), relative to its face center, we may define the magnitude of its resulting optical flow vector as $V_i(t) = \|\mathbf{x}_i(t) - \mathbf{x}_i(t - 1)\|$, where $\| \cdot \|$ denotes the Euclidean norm. Therefore, the extracted visual features from the optical flow process of a given speaker are defined as

$$\mu_V(t) = \frac{1}{N_{LK}} \sum_{i=1}^{N_{LK}} \frac{V_i(t)}{r}, \tag{8}$$

$$\sigma_V(t) = \sqrt{\frac{1}{N_{LK} - 1} \sum_{i=1}^{N_{LK}} \left( \frac{V_i(t)}{r} - \mu_V(t) \right)^2}, \tag{9}$$




where the division by $r$ is used to normalize the features with respect to the image dimensions and the distance of the users from the camera. We may also note that the users' lateral velocity is automatically compensated by computing $\mathbf{x}_i(t)$ with respect to the face center. Therefore, $\mu_V(t)$ and $\sigma_V(t)$ respectively represent the mean and standard deviation of the normalized magnitudes $V_i(t)/r$, and are expected to be higher during speech than during silence. It is important to notice that, although we describe these measures without the $k$ index for simplicity, they are computed $K$ times, once for each speaker.

Using these features for describing lip movements has two main advantages. They represent the movements of the lips well even if the mouth region is not precisely estimated, which is the case of the selected anthropometry-based approach, and they do not require prior knowledge about the shape of the mouth, such as the contour of the lips. However, despite such advantages, these features are not robust against small pauses during speech, since they are computed between consecutive frames only, and in HCI applications (such as ASR) it is often desired that speech hiatuses are detected as part of the spoken sentences instead of as silence moments [37]. Therefore, we analyze $\mu_V(t)$ and $\sigma_V(t)$ over a longer time window of $T$ frames. We propose four new features, extracted from Eqs. (8) and (9):

$$f_1(t) = \frac{1}{T} \sum_{i=0}^{T-1} \mu_V(t - i), \tag{10}$$

$$f_2(t) = \sqrt{\frac{1}{T - 1} \sum_{i=0}^{T-1} \left( \mu_V(t - i) - f_1(t) \right)^2}, \tag{11}$$

$$f_3(t) = \frac{1}{T} \sum_{i=0}^{T-1} \sigma_V(t - i), \tag{12}$$

$$f_4(t) = \sqrt{\frac{1}{T - 1} \sum_{i=0}^{T-1} \left( \sigma_V(t - i) - f_3(t) \right)^2}. \tag{13}$$
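The per-frame statistics of Eqs. (8) and (9) and the windowed features of Eqs. (10) to (13) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are invented here, and the point lists are assumed to be already expressed relative to the face center, as the text requires.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def flow_stats(prev_pts: Sequence[Point], curr_pts: Sequence[Point],
               r: float) -> Tuple[float, float]:
    """mu_V(t) and sigma_V(t) of Eqs. (8)-(9) for one speaker.

    prev_pts / curr_pts: LK feature positions at t-1 and t, relative to
    the face center (assumed). r: face radius in pixels.
    """
    mags = [math.dist(p, c) / r for p, c in zip(prev_pts, curr_pts)]
    n = len(mags)
    mu = sum(mags) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in mags) / (n - 1))
    return mu, sigma


def windowed_features(mu_hist: Sequence[float],
                      sigma_hist: Sequence[float]) -> List[float]:
    """f1..f4 of Eqs. (10)-(13) over the last T frames (T = len of each input)."""
    def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
        t = len(values)
        m = sum(values) / t
        s = math.sqrt(sum((v - m) ** 2 for v in values) / (t - 1))
        return m, s

    f1, f2 = mean_and_std(mu_hist)
    f3, f4 = mean_and_std(sigma_hist)
    return [f1, f2, f3, f4]
```

In a running system the two histories would typically be kept in fixed-length buffers of $T$ entries per speaker, so that each new frame only replaces the oldest value.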



Averaging over the window imposes temporal coherence on the visual features, to the point that weak speech does not become a problem for a subsequent classifier. On the other hand, this approach may also introduce a detection lag at speech-to-silence and silence-to-speech transitions.

2) Video-related Probability Estimation using SVM: The next step is to extract a probability measure from our final visual features using a supervised classifier. In this work, we chose the SVM algorithm implemented in [38] because it provides the known robustness of SVM techniques, allows the use of non-linear kernels for better class separation (we used Radial Basis Functions, RBFs) [39], and also provides a posterior probability estimate (instead of binary labeling) using the approach described in [40].

For training the SVM model, we perform a grid search over the unknown parameters, running successive rounds of 5-fold cross-validation, since this method is known to help avoid overfitting problems [38]. The training data used during this procedure are extracted from our labeled multimodal database (described in Section IV). Finally, once the best set of parameters is found, the SVM model $\Phi$ is trained (through the algorithm in [41]), and a posterior speech probability $\upsilon$ for the video modality is extracted as

$$\upsilon = P(\text{speech} \mid \mathbf{f}_{vid}; \Phi), \tag{14}$$


where $\mathbf{f}_{vid} = [f_1(t), f_2(t), f_3(t), f_4(t)]$ is a vector composed of the previously described visual features, and intuitively $P(\text{silence} \mid \mathbf{f}_{vid}; \Phi) = 1 - \upsilon$.

3) Video Modality Weighting: At this point, it is important to notice that $\upsilon$ could be directly used for the final decision of a video-only VAD approach. However, its accuracy is highly dependent on the distance of the speakers from the camera: as the users move farther from the camera, the mouth region appears smaller in image coordinates, and the optical flow tends to become noisier. Participants moving at high lateral velocities may also corrupt the extracted visual features. Despite the implicit compensation for lateral movements when computing $\mathbf{x}_i(t)$, abrupt translations may blur the faces, also corrupting the optical flow estimate. For this reason, we propose a weighting factor $w_\upsilon$ for Eq. (14) that is monotonically decreasing with respect to both the distance of the user from the camera and his/her lateral speed:

$$w_\upsilon = \exp\{-z'_{vid} - v'_x\}, \tag{15}$$

where $z'_{vid} = z_{vid}/z_{vid}^{max}$ and $v'_x = v_x/v_x^{max}$ are respectively the normalized depth and lateral velocity of the user (measured in world coordinates), with $z_{vid}^{min} \leq z_{vid} \leq z_{vid}^{max}$ and $0 \leq v_x \leq v_x^{max}$. Furthermore, the $k$th participant's $z_{vid}$ is the depth component of the estimated 3D video-based position $\mathbf{q}_k^{vid}$ (such estimation process is described in Section III-B). In other words, the VVAD is expected to have maximum effect for the multimodal fusion when $z_{vid} = z_{vid}^{min}$ and $v_x = 0$, and exponentially lose its effect as the users' depths and velocities reach $z_{vid}^{max}$ and $v_x^{max}$ respectively, having no effect at all when $z_{vid} > z_{vid}^{max}$ and $v_x > v_x^{max}$.

An appropriate value for $z_{vid}^{min}$ is chosen to be the minimum distance at which two users may comfortably participate in a camera-equipped HCI system while not leaving its FOV. As for $z_{vid}^{max}$, we chose the value at which the Lucas-Kanade optical flow algorithm is not able to track the movements of the lips. Finally, for finding a reasonable value for $v_x^{max}$ we extracted the maximum velocity a user has reached in our multimodal recordings, finding $v_x^{max} = 0.25$ m/s. We also experimentally found $z_{vid}^{min} = 0.5$ m and $z_{vid}^{max} = 1.4$ m for VGA (640 × 480) video sequences.
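A small sketch of the weighting factor of Eq. (15), using the limits reported above for VGA video, may help make the normalization concrete. The function name is invented for this example. Two behaviors are assumptions rather than something the text specifies: inputs that exceed only one of the limits are clamped to the stated ranges, and the lateral velocity is treated as a speed (absolute value).

```python
import math

Z_MIN, Z_MAX = 0.5, 1.4  # metres, values reported in the text for VGA video
VX_MAX = 0.25            # metres per second, value reported in the text


def video_weight(z_vid: float, v_x: float,
                 z_min: float = Z_MIN, z_max: float = Z_MAX,
                 vx_max: float = VX_MAX) -> float:
    """Weight w_v = exp(-z'_vid - v'_x) of Eq. (15)."""
    speed = abs(v_x)  # assumption: only the magnitude of the velocity matters
    if z_vid > z_max and speed > vx_max:
        return 0.0  # the text states the VVAD has no effect in this case
    z = min(max(z_vid, z_min), z_max)  # assumption: clamp to [z_min, z_max]
    speed = min(speed, vx_max)         # assumption: clamp to [0, vx_max]
    return math.exp(-z / z_max - speed / vx_max)


w_near_still = video_weight(0.5, 0.0)   # closest, motionless user
w_far_fast = video_weight(1.4, 0.25)    # user at both limits
```

With these constants, the weight ranges from exp(-0.5/1.4) for a motionless user at the minimum depth down to exp(-2) at both limits.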

B. Audio Analysis

Computing the SRP-PHAT for identifying competing sound sources is known to be a hard task. As mentioned in Section II-A, many works have approached this using clustering techniques [8], [21], [22] or some iterative isolation criteria [16], [23]. These approaches, however, may be rather complex and may also fail under high noise conditions. We therefore propose a simple and effective process for isolating different regions around potential speakers in the SRP-PHAT's global search space $Q$, which is shown to be robust for the simultaneous speakers scenario, even under noisy and reverberant conditions.


For the $k$th participant, we define a 1D ROI $Q_k$ as a subset of the global search region $Q$. Each ROI is treated as an individual spatial region with its own bounded coordinate system, centered at the participant's 3D video-based location $\mathbf{q}_k^{vid}$, as illustrated in Figure 4. This way, given a fixed length $\ell$ for $Q_k$ (in meters), all ROIs form equally sized horizontal line segments (parallel to the microphone array) centered at each tracked face. Finally, Eqs. (4) and (5) can be calculated for the $k$th speaker using $Q_k$ instead of $Q$, which allows us to later separately analyze the SRP-PHAT's behavior of each user through our HMM approach.

It is important to notice that a 1D ROI is chosen (instead of 2D or 3D) as a consequence of our linear array configuration, since microphone arrays best discriminate locations along the direction in which most microphones are distributed [42]. Therefore, in the case of our linear array (previously depicted in Figure 1), the SRP-PHAT is more accurate along the horizontal dimension ($x$), making a 1D ROI enough for our VAD approach, at low computational cost in the search process of Eq. (5). As for the choice of a linear configuration, we rely on the fact that in multimodal multi-user HCI applications, the users tend to stand side-by-side in order to remain within the camera's FOV, emphasizing the need for better localization along $x$.

Before running the SRP-PHAT, $\mathbf{q}_k^{vid}$ must be computed so the ROIs may be properly centered around each person's 3D position. This is done by estimating $\mathbf{q}_k^{vid}$ from the 2D face-tracking results (the centering process must be repeated every frame), using an inverse projective mapping. Assuming a pinhole camera model and that the camera is aligned with the microphone array, the relation between image coordinates $\mathbf{x}_{mid} = (x_{pix}, y_{pix})$ and world coordinates $\mathbf{q}^{vid} = (x_{vid}, y_{vid}, z_{vid})$ is given by

$$x_{pix} = f_{len} \frac{x_{vid}}{z_{vid}}, \quad y_{pix} = f_{len} \frac{y_{vid}}{z_{vid}}, \tag{16}$$

where $f_{len}$ is the focal length of the camera. Therefore, given the mean radius $r_{1m}$ (in pixels) of a face placed at one meter from the camera (which can be estimated experimentally or based on the projection of the anthropometric average face radius [34]), and since the projected radius is inversely proportional to depth under the pinhole model, the $z_{vid}$ component (depth, in meters) of a detected face can be estimated through

$$z_{vid} = r_{1m}/r, \tag{17}$$

where $r$ is the radius (in pixels) of the tracked face. This way, given $z_{vid}$ and the image-related face central position $\mathbf{x}_{mid}$, it is possible to obtain the width and height world components of the $k$th speaker by isolating $x_{vid}$ and $y_{vid}$ in Eq. (16). This allows the horizontal search region $Q_k$ to be centered at the $k$th speaker's video-based world position $\mathbf{q}_k^{vid}$, so that his/her audio-based location may be computed using the SRP-PHAT as

$$\mathbf{q}_k^{aud} = \operatorname*{argmax}_{\mathbf{q} \in Q_k} P(\mathbf{q}). \tag{18}$$
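The inverse projective mapping and the construction of a 1D ROI can be sketched as below. This is an illustrative sketch under several assumptions: the function names are invented, pixel coordinates are taken relative to the principal point (image center), depth is taken as inversely proportional to the tracked radius ($z_{vid} = r_{1m}/r$, which is what the pinhole model implies for a face of fixed physical size), and the number of ROI points is computed as the floor of length over spacing plus one. The numeric values in the usage lines are hypothetical.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def face_to_world(x_pix: float, y_pix: float, r: float,
                  r_1m: float, f_len: float) -> Point3:
    """Invert Eq. (16) for a tracked face.

    x_pix, y_pix: face center in pixels, relative to the principal point (assumed).
    r: tracked face radius in pixels; r_1m: mean face radius at 1 m, in pixels.
    f_len: focal length in pixels.
    """
    z_vid = r_1m / r               # depth in metres
    x_vid = x_pix * z_vid / f_len  # isolate x_vid in Eq. (16)
    y_vid = y_pix * z_vid / f_len  # isolate y_vid in Eq. (16)
    return (x_vid, y_vid, z_vid)


def roi_points(q_vid: Point3, length: float, spacing: float) -> List[Point3]:
    """Horizontal line of candidate points centered at q_vid (the 1D ROI Q_k)."""
    # Small epsilon guards against floating-point round-off in the division.
    n = int(math.floor(length / spacing + 1e-9)) + 1
    x_start = q_vid[0] - spacing * (n - 1) / 2.0
    return [(x_start + i * spacing, q_vid[1], q_vid[2]) for i in range(n)]


# Hypothetical numbers: face 100 px right of and 50 px below/above the image
# center (sign depends on the axis convention), radius 50 px.
q_vid = face_to_world(100.0, -50.0, r=50.0, r_1m=100.0, f_len=500.0)
roi = roi_points(q_vid, length=0.4, spacing=0.1)
```

Choosing the length and spacing so that the point count is odd keeps the tracked face position exactly on the middle candidate.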

At this point, it is important to notice that neither $\mathbf{q}_k^{aud}$ nor $\mathbf{q}_k^{vid}$ is the final location computed by our SSL approach. These estimates are used by the HMMs for spatio-temporal coherence analysis to perform both the final MVAD and SSL. These topics are covered in the next subsections.
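To make the ROI-restricted search of Eq. (18) concrete, the sketch below evaluates a discrete counterpart of Eq. (4) on each candidate point of one ROI and returns the index of the peak. It is a minimal pure-Python illustration rather than the authors' implementation: the names, the speed of sound, and the use of only the positive-frequency DFT bins are assumptions. The steering exponent is written with the sign that undoes the propagation delay of Eq. (2) under the forward-DFT convention used here; sign conventions for the delay term differ between texts, so this choice should be treated as an assumption of the sketch.

```python
import cmath
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
SPEED_OF_SOUND = 343.0  # m/s, assumed


def bin_frequencies(n: int) -> List[float]:
    """Normalized frequencies (rad/sample) of DFT bins 1 .. n/2 - 1."""
    return [2.0 * math.pi * k / n for k in range(1, n // 2)]


def dft_positive(frame: Sequence[float]) -> List[complex]:
    """Naive DFT of one frame, keeping the same bins as bin_frequencies."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
            for k in range(1, n // 2)]


def srp_phat(spectra: Sequence[Sequence[complex]], omegas: Sequence[float],
             mics: Sequence[Point3], q: Point3, fs: float) -> float:
    """Discrete counterpart of Eq. (4) for one candidate point q."""
    # Propagation delays from q to each microphone, in samples.
    taus = [math.dist(q, m) / SPEED_OF_SOUND * fs for m in mics]
    power = 0.0
    for k, w in enumerate(omegas):
        y = 0j
        for x_m, tau in zip(spectra, taus):
            mag = abs(x_m[k])
            if mag > 0.0:  # PHAT weighting: keep the phase only
                y += x_m[k] / mag * cmath.exp(1j * w * tau)
        power += abs(y) ** 2
    return power


def localize_in_roi(spectra: Sequence[Sequence[complex]],
                    omegas: Sequence[float], mics: Sequence[Point3],
                    roi: Sequence[Point3], fs: float) -> Tuple[int, List[float]]:
    """Index of the SRP-PHAT peak inside the ROI, plus all powers, as in Eq. (18)."""
    powers = [srp_phat(spectra, omegas, mics, q, fs) for q in roi]
    best = max(range(len(roi)), key=powers.__getitem__)
    return best, powers
```

The returned list of powers is kept because later stages need more than the peak position, for instance the ratio between the largest and smallest values inside the ROI.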

C. Multimodal mid-fusion using HMMs Given the results of the video and audio analyses of each speaker, υ and P (qaud k ), respectively, we develop a fusion method by using an HMM competition scheme, which is inspired in [43], [44]. We extend such works to the multiple speaker scenario, also weighing the importance of υ by wυ . In summary, two HMMs that model the expected behavior of the SRP-PHAT peak for the multiple speaker scenario are defined. One HMM describes speech situations, and the other, silence situations. By extracting proper observations from the SRP-PHAT, it is possible to use a competition scheme between both models in order to evaluate the SRP-PHAT’s spatiotemporal behavior for different speakers; separate scores for the same set of observations may be computed for each HMM through approaches such as the Viterbi algorithm [45], and then compared to form a final MVAD decision. Therefore, Section III-C1 explains the general idea of our competition approach; in III-C2 and III-C3 the speech and silence HMMs are described, respectively; Section III-C4 presents our MVAD approach, and III-C5 the SSL one; finally, in III-C6 we explain how the parameter estimation of the HMMs is performed. 1) The Proposed HMMs: HMMs can be used to model dynamic systems that may change their states in time. An HMM with discreteobservables is characterized by λ = (A, B, ρ), where A = aij for 1 ≤ i, j ≤ N is the transition  matrix that contains the probabilities of state changes, B = bn (O) for 1 ≤ n ≤ N describes  the observation probability for each state, and ρ = ρi for 1 ≤ i ≤ N contains the initial probabilities of each state. Clearly, the choice of the parameters is crucial to characterize a given HMM. In [43], competing HMMs were used for single-speaker VAD by exploring the expected spatio-temporal location of the sound source when the speaker is active. 
In this work we adopt a similar approach, but we build $2K$ competing HMMs (two for each detected face) and also include the video-based VAD cue $\upsilon$. More precisely, each candidate sound source location in $Q_k$ is a state of the HMMs, so that the number $N$ of states depends on the size of the search region. We denote by $S^k = \{S_1^k, S_2^k, ..., S_N^k\}$ the $N$ states for the $k$th user, with $N$ given by

$$N = \frac{\ell}{spa} + 1, \tag{19}$$

where $spa$ is the real-world spacing between neighboring points in $Q$, and should be chosen (along with $\ell$) so that $N$ is odd, allowing $S^k$ to have a middle state. In our approach, we determine an observable that carries information about the estimated speaker's position as well as a confidence measure of that estimate. That is, recalling that the HMMs' states are the candidate positions of the SRP-PHAT, the observation extracted for user $k$ is a two-dimensional vector $O^k = (O_1^k, O_2^k)$ computed from $P(q)$.


Fig. 4. Example of ROI-based SRP-PHAT search being performed for each user in the scene. The 3D model is rendered from the scene’s information: the cylinder represents the camera; the cones represent the microphones; the gray planes form the global search region Q; the long parallelepipeds are the 1D ROIs Qk ; and the red spheres are the audio-related locations estimated through Eq. (18).

It is given by

$$O_1^k = q_k^{aud}, \tag{20}$$

$$O_2^k = \frac{P(q_k^{aud})}{\min_{q \in Q_k} P(q)}. \tag{21}$$
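As a minimal sketch of Eqs. (20) and (21), the snippet below extracts the pair $(O_1, O_2)$ from the SRP-PHAT response sampled over one speaker's 1D ROI. The function name `extract_observation` is hypothetical, $O_1$ is returned as a state index rather than a metric position, and the response is assumed to be strictly positive so that the peak-to-minimum ratio is well defined.

```python
import numpy as np


def extract_observation(P):
    """Return (O1, O2) from the SRP-PHAT response over a 1D ROI.

    P holds one response value per candidate position (assumed > 0).
    O1 is the index of the peak (Eq. 20, expressed as a state index);
    O2 is the ratio between the peak and the minimum response (Eq. 21).
    """
    P = np.asarray(P, dtype=float)
    if P.ndim != 1 or P.size == 0:
        raise ValueError("P must be a non-empty 1D array")
    if np.any(P <= 0):
        raise ValueError("this sketch assumes a strictly positive response")
    o1 = int(np.argmax(P))          # position of the largest response
    o2 = float(P[o1] / P.min())     # confidence: peak over minimum, >= 1
    return o1, o2
```

A sharp peak yields a large $O_2$, whereas a flat response yields $O_2 = 1$, which matches the lower bound discussed in the text.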


The rationale of this approach is that, in speech situations, $O_1^k$ should provide the correct location of the speaker, and $O_2^k$ tends to be large (since the maximum is expected to be considerably larger than the minimum). In silence situations, on the other hand, all values of $P(q)$ tend to be similar, so $q_k^{aud}$ represents the location of a nonexistent sound source. In this latter case $O_2^k$ will also be smaller, since the maximum and the minimum tend to be similar. In theory, the lower bound for $O_2^k$ is 1 and the upper bound is $\infty$. We have observed in different experiments (with different speakers and varying background noise) that $O_2$ stays very close to 1 during non-speech and reaches a maximum value during speech. Therefore, we experimentally find the upper bound $O_2^{max}$, and the values of $O_2$ are quantized into $L$ possible values within the range $[1, O_2^{max}]$ to obtain an HMM with a discrete range of observables. Values of $O_2^k$ larger than $O_2^{max}$ are quantized into $O_2^{max}$, and we choose $L = 8$ (higher values add no representativeness for $O_2^k$). Variable $O_1^k$ represents the position at which the SRP-PHAT peak is located in $Q_k$, and is therefore discretized into $N$ values. The next step in defining the speech and silence HMMs is to set the probabilities of $A$, $B$ and $\rho$ so that, during true speech situations, the described observables and states behave as modeled by the speech HMM, and during silence situations, as modeled by the silence HMM. The next subsections detail this matter.

2) The Speech HMM: A widely used approach for estimating the parameters $\lambda = (A, B, \rho)$ of an HMM is the Baum-Welch algorithm [45]. For our HMMs, however, such an approach is impractical. The models have a relatively high number of states ($N$) and observables ($M = NL$), which would require a large number of training samples (comprising several situations such as speakers in different positions, alternation of speech and silence, presence/absence of background noise, etc.). Instead, we propose parametric probability density functions (PDFs) for the HMM matrices based on the expected behavior of users in speech capture or HCI scenarios. The normalization of the PDFs described hereafter is omitted for readability, but note that $\rho$ and the rows of matrices $A$ and $B$ must sum to unity. Recalling that the observation $O^k$ is a two-element vector, we may define the distribution of the observation in the $n$th state $S_n^k$ using the following joint PDF (here we omit the superscript $k$ for readability and because the speech HMM $\lambda^{sp} = (A^{sp}, B^{sp}, \rho^{sp})$ is the same for any user):

$$b_n^{sp}(O) = b_n^{sp}(O_1, O_2) = b_n^{sp}(O_1 | O_2)\, b^{sp}(O_2), \tag{22}$$


where the superscript $sp$ stands for "speech", $b^{sp}(O_2)$ is the distribution of $O_2$ during speech situations (which does not depend on the state $S_n$), and $b_n^{sp}(O_1 | O_2)$ is the conditional probability of $O_1$ given $O_2$, which is strongly affected by $n$. The observation matrix is defined as $B^{sp} = \{b_n^{sp}(O)\}$. Since sharp peaks tend to occur in the SRP-PHAT during speech situations, $O_2$ is expected to be large. We exploit this for modeling the speech HMM. Therefore, $b^{sp}(O_2)$ should be a monotonically increasing function, and the following exponential function was chosen:

$$b^{sp}(O_2) = \exp\left\{ c_1 \frac{O_2}{O_2^{max}} \right\}, \tag{23}$$

where $c_1$ is an estimated auxiliary constant (see Section III-C6) that controls how quickly $b^{sp}(O_2)$ varies with $O_2$. Function $b_n^{sp}(O_1 | O_2)$ describes the conditional density of $O_1$ given $O_2$. Since each state $n$ relates to a position in the search space $Q$, the value of $O_1$ (which is the position of the largest SRP-PHAT value) should be close to $n$. Furthermore, if the confidence $O_2$ is large, the probabilities should decay abruptly around this peak; if $O_2$ is small, though, the decay around the peak should be smoother, allowing other values of $O_1$ to be encountered with higher probabilities as well (even if the SRP-PHAT matches the actual position of the user with a low $O_2$, it might be a coincidence). Inspired by [43], we used an exponential function to model this behavior:

$$b_n^{sp}(O_1 | O_2) = \exp\{ -g(O_2)\, |O_1 - n| \}, \tag{24}$$


where $g(O_2)$ controls the speed at which $b_n^{sp}(O_1 | O_2)$ decays as a function of $O_2$. This means that, as the confidence $O_2$ gets larger, the decay around $n$ should be faster, reducing the chances of neighboring values of $O_1$ occurring. Therefore, $g(O_2)$ is chosen as

$$g(O_2) = \exp\{ -c_2 O_2 - c_3 \}, \tag{25}$$

where $c_2$ and $c_3$ are also auxiliary constants that are estimated using the approach described in Section III-C6.

In order to find an adequate configuration for the state transition matrix $A^{sp} = \{a_{ij}^{sp}\}$, it is first important to observe the following. Given that the ROIs are always centered at each speaker's position, we must expect $O_1$ to move toward the central state during speech situations, even if the SRP-PHAT peaks at neighboring states with high confidence. Therefore, $A^{sp}$ is configured in such a way that $a_{ij}^{sp}$ is maximum at $j = N/2$ (for $1 \le j \le N$), and the probabilities decay as $j$ moves away from $N/2$. This way, $a_{ij}^{sp}$ is defined as

$$a_{ij}^{sp} = \exp\left\{ -\frac{|N/2 - j| + 1}{2\sigma^2} \right\}, \tag{26}$$

where $\sigma$ is a constant used for controlling the decay. As we may notice, $a_{ij}^{sp}$ depends only upon $j$, which is the state being transited to. In other words, regardless of which state the speaker is located at, the one with the highest transition probability is the middle one. For the sake of illustration, the transition matrix $A^{sp}$, the observation matrix $B^{sp}$ and the probability density function $b_8^{sp}(O_1, O_2)$ related to state $S_8^k$ ($N = 17$) are depicted in Fig. 5.

3) The Silence HMM: The silence-related HMM is characterized by $\lambda^{si} = (A^{si}, B^{si}, \rho^{si})$. As already pointed out, during silence periods the response $P(q)$ of the SRP-PHAT at each position (state) should be similar, so that observable $O_2^k$ is expected to be close to the smallest possible value, which is 1. Furthermore, $q_k^{aud}$ will correspond to random positions inside $Q_k$, owing to background noise and reverberation. Therefore, similarly to Eq. (22), the joint probability function of the observables, for state $S_n^k$, can be written as$^1$

$$b_n^{si}(O) = b_n^{si}(O_1, O_2) = b_n^{si}(O_1 | O_2)\, b^{si}(O_2), \tag{27}$$


where function $b^{si}(O_2)$ was obtained similarly to its counterpart in speech situations, except that higher probabilities should occur for smaller values of $O_2$:

$$b^{si}(O_2) = \exp\left\{ c_1 \frac{O_2^{max} - O_2 + 1}{O_2^{max}} \right\}, \tag{28}$$

where $c_1$ has the same value and role as in Eq. (23).

1 We again omit the superscript $k$ for the sake of readability. The PDFs are equal for all users.

For the conditional probability $b_n^{si}(O_1 | O_2)$, there are two important things to note. First, this distribution should not depend on the state $S_n^k$, since the position of the peak is related to noise and not to an actual sound source at the discrete position $n$. Second, all observables $O_1$ should be equally probable, for the same reason. Hence, a uniform conditional probability function is chosen, i.e., $b_n^{si}(O_1 | O_2) = 1/N$. While in speech situations the peak of the SRP-PHAT is expected to stay close in temporally adjacent observations, the same is not true for silence periods. Since all responses are usually similar, background noise plays a decisive role when retrieving the highest peak, which may be far from the one detected in the previous observation. For this reason, the proposed state transition matrix $A^{si} = \{a_{ij}^{si}\}$ for the silence-related HMM considers all transitions equally probable, i.e., $a_{ij}^{si} = 1/N$. As for the initial distribution $\rho$ for both speech and silence HMMs, we assume that all states (i.e., positions) are initially equally probable, i.e., $\rho_i = 1/N$.

4) Multimodal VAD using the HMMs: Given the speech HMM $\lambda^{sp}$, the silence HMM $\lambda^{si}$, and a sequence of observables $O_t^k = \{O^k(t - T), O^k(t - T + 1), ..., O^k(t)\}$ for speaker $k$ within a time window of size $T$, we can compute how well both HMMs describe $O_t^k$ (this is the same time window used in Section III-A1). If $O_t^k$ was generated during a speech situation, it should adhere more closely to $\lambda^{sp}$ than to $\lambda^{si}$, and the opposite should hold for silence situations. More precisely, this can be done by computing the likelihoods $P(O_t^k; \lambda^{sp})$ and $P(O_t^k; \lambda^{si})$ using the forward-backward procedure [45]. This way, an AVAD-only decision could be made such that the $k$th user at frame $t$ is considered to be active if $P(O_t^k; \lambda^{sp}) > P(O_t^k; \lambda^{si})$.
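The competition between the two models can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: all constants (`o2_max`, `c1`, `c2`, `c3`, `sigma`) are placeholder values rather than the estimated ones, `c2` is taken as negative so that $g(O_2)$ grows with $O_2$ as the text requires, the exponent of the transition weights is negative so that probabilities peak at the middle state, and $O_2$ is quantized to the nearest of $L$ evenly spaced levels in $[1, O_2^{max}]$. Function names are hypothetical, and states are indexed from 0.

```python
import numpy as np


def build_hmms(N=17, L=8, o2_max=6.0, c1=3.0, c2=-0.5, c3=1.0, sigma=2.0):
    """Build speech and silence HMMs; every default value is a placeholder."""
    levels = np.linspace(1.0, o2_max, L)      # quantized O2 values in [1, O2max]
    idx = np.arange(N)

    # Speech emissions, Eqs. (22)-(25): axes are (state n, O1, O2 level).
    g = np.exp(-c2 * levels - c3)
    dist = np.abs(idx[None, :, None] - idx[:, None, None])   # |O1 - n|
    b_sp = np.exp(-g * dist) * np.exp(c1 * levels / o2_max)
    B_sp = b_sp.reshape(N, N * L)             # symbol index = O1 * L + level
    B_sp = B_sp / B_sp.sum(axis=1, keepdims=True)

    # Speech transitions, Eq. (26): identical rows peaking at the middle state.
    w = np.exp(-(np.abs(N // 2 - idx) + 1) / (2.0 * sigma ** 2))
    A_sp = np.tile(w / w.sum(), (N, 1))

    # Silence model: O1 uniform, small O2 favored (Eq. 28), uniform transitions.
    b_si = np.exp(c1 * (o2_max - levels + 1.0) / o2_max)
    row = np.tile(b_si, N)
    B_si = np.tile(row / row.sum(), (N, 1))
    A_si = np.full((N, N), 1.0 / N)

    rho = np.full(N, 1.0 / N)                 # uniform initial distribution
    return {"levels": levels, "L": L,
            "speech": (A_sp, B_sp, rho), "silence": (A_si, B_si, rho)}


def log_likelihood(symbols, A, B, rho):
    """Scaled forward pass: log P(O; lambda) for a sequence of symbol indices."""
    alpha = rho * B[:, symbols[0]]
    log_p = 0.0
    for t in range(len(symbols)):
        if t > 0:
            alpha = (alpha @ A) * B[:, symbols[t]]
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale                 # rescale to avoid underflow
    return log_p


def is_active(observations, hmms):
    """observations: (O1 state index, raw O2) pairs inside the time window."""
    levels, L = hmms["levels"], hmms["L"]
    symbols = [o1 * L + int(np.argmin(np.abs(levels - o2)))
               for o1, o2 in observations]
    return (log_likelihood(symbols, *hmms["speech"])
            > log_likelihood(symbols, *hmms["silence"]))
```

With these placeholder constants, a window whose peak stays at the middle state with a large $O_2$ is classified as speech, while a window whose peak jumps across the ROI with $O_2$ close to 1 is classified as silence. Only the forward pass is needed to obtain the likelihood, and comparing log-likelihoods is equivalent to comparing the likelihoods themselves.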
However, as the number of simultaneous speakers increases, the height of the SRP-PHAT peaks at the actual speaker positions is lowered, since one person's voice acts as noise to the others'. As a consequence, $O_2^k$ might not always be as large as expected, so that false negatives may occur when there is speech. For this reason, we propose a mid-fusion technique that attempts to boost the values of $O_2$ based on visual cues, particularly in simultaneous speech situations, by using the confidence $\upsilon$ of the VVAD algorithm. More precisely, we propose an enhanced observable $\bar{O}^k = (O_1^k, \bar{O}_2^k)$ for the HMM competition scheme, with

$$\bar{O}_2^k = O_2^k \left( 1 + \frac{\upsilon_k w_{\upsilon_k}}{c_4} \right), \tag{29}$$

where $\upsilon_k$ and $w_{\upsilon_k}$ are computed for the $k$th user through Eqs. (14) and (15), respectively, and $c_4 > 1$ controls the contribution of the video modality to the multimodal fusion. The value of $c_4$ must be carefully chosen so that $O_2^k$ is effectively enhanced during simultaneous speech situations, but not amplified so much that it causes false detections. Our procedure for setting $c_4$ is presented in Section III-C6. Finally, according to our final MVAD approach, a given user is considered to be active if $P(\bar{O}_t^k; \lambda^{sp}) > P(\bar{O}_t^k; \lambda^{si})$. The time window $T$ must also be properly chosen. If it is too small, a speech hiatus between consecutive words may be detected as silence, which is usually not desirable for speech recognition. On the other hand, larger values of T


Fig. 5. Plots of the speech HMM matrices for N = 17 and L = 8: (a) transition matrix Asp defined in Eq. (26); (b) observation matrix Bsp defined in Eq. (22); (c) bsp8(O), a slice of Bsp for n = 8. Axis labels: Actual State (i), Next State (j), Observation, Probability.

provide better temporal consistency, but also lead to delays when detecting speech-silence or silence-speech changes. In this work we chose $T$ so that it corresponds to a window approximately 1 second long, since this proved effective at bridging speech hiatuses without introducing a long delay when the location of the speaker changes.

5) Multimodal SSL: As previously mentioned in Section II, to perform multiple-speaker VAD one implicitly needs to perform localization (either in 3D or in 2D image coordinates), so that active speakers may be differentiated from inactive ones. For this reason, although the main difficulty of a competing-sources scenario is the VAD part itself, we also implement SSL as part of our algorithm. While we do not consider this to be the main contribution of our work, we show that our spatio-temporal HMM formulation can easily be reused to locate the active speakers. As a requirement for our MVAD approach, two location estimates are initially produced for each speaker, from the audio and video modalities, $q_k^{aud}$ and $q_k^{vid}$, respectively. Either one of them could be used as a final SSL decision for the speakers. However, both estimates present inaccuracies due to practical issues. The audio location $q_k^{aud}$ is highly corrupted by noise and reverberation, especially during simultaneous speech situations. The video location $q_k^{vid}$ is affected by the depth estimation in Eq. (17), since the face radii $r$ present small variations across time (and for different users). Therefore, we propose a more robust approach by reusing the speech HMM. Recalling that our HMMs are based on the spatial locations $Q_k$ of the SRP-PHAT, it is possible to use a decoding algorithm that finds the state sequence of length $T$ that best corresponds (according to some optimality criterion) to the sequence of observables $O_t^k$ evaluated using a given model $\lambda$.
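A decoder of this kind can be sketched as follows. The snippet implements the standard Viterbi recursion in the log domain for a discrete HMM; it is an illustrative sketch rather than the implementation used in this work, and the function name `viterbi` is an assumption made for the example.

```python
import numpy as np


def viterbi(symbols, A, B, rho):
    """Most likely state path for a sequence of observation symbol indices.

    A: (N, N) transition matrix, B: (N, M) observation matrix,
    rho: (N,) initial distribution. Returns a list of T state indices.
    """
    A, B, rho = np.asarray(A), np.asarray(B), np.asarray(rho)
    with np.errstate(divide="ignore"):        # log(0) = -inf is acceptable here
        log_A, log_B, log_rho = np.log(A), np.log(B), np.log(rho)
    T, N = len(symbols), A.shape[0]
    delta = log_rho + log_B[:, symbols[0]]    # best log-score ending in each state
    back = np.zeros((T, N), dtype=int)        # best predecessor of each state
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j]: come from i, go to j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_B[:, symbols[t]]
    path = [int(np.argmax(delta))]            # best final state
    for t in range(T - 1, 0, -1):             # backtrack to the first frame
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The last element of the returned path is the decoded state at the most recent frame of the window, so for a 1D ROI it can be mapped directly to a candidate position along that dimension.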
In our case, each of the $T$ decoded states corresponds to the speaker location at each time frame, and such a decoding process can be performed using the Viterbi algorithm [45]. Therefore, by decoding the active speakers' $O_t^k$ (the same observables used for VAD) against the speech model $\lambda^{sp}$, the last state in the computed sequence represents the most recent location of the $k$th speaker. This approach introduces both spatial and temporal coherence to the SRP-PHAT's location estimates, due to the time-window analysis of the Viterbi algorithm and to the characteristics of $A^{sp}$ and $B^{sp}$. However, we must recall that our HMM approach is applied only to the width dimension of the search region, meaning that only the horizontal position ($x$) of each active speaker is obtained from our MVAD approach. For this reason, for SSL purposes only, we separately decode $\lambda^{sp}$ using observables obtained from two other 1D ROIs, one spanning the depth dimension ($z$) and the other the height ($y$). They are also centered at $q_k^{vid}$, such that the three ROIs are orthogonal to each other, allowing each 1D search region to retrieve one component of the speakers' 3D location. Therefore, denoting $x_{HMM}^k$, $y_{HMM}^k$ and $z_{HMM}^k$ as the locations found by decoding the above 1D HMMs, we define the final 3D position of the $k$th speaker as

$$\hat{q}_k = \left( x_{HMM}^k, y_{HMM}^k, z_{HMM}^k \right). \tag{30}$$

Finally, it is important to note that this SSL approach keeps the computational cost of our algorithm low, since only $3N$ states are evaluated with the Viterbi method, as opposed to $N^3$ if a 3D cuboid-like ROI were used.

6) Parameter Estimation: As previously mentioned, because matrices $A^{sp}$ and $B^{sp}$ of the speech HMM are composed of a large set of states and observables, we opted to use parametric models. However, the parameters of these PDFs must be carefully selected for them to work properly. This is the case for $c_1$ and $O_2^{max}$ in Eq. (23), $c_2$ and $c_3$ in Eq. (25), $\sigma$ in Eq. (26), and $c_4$ in Eq. (29). Based on manually labeled data from our multimodal sequences, we are able to extract the true observable occurrence count and transition count for each state of the speech HMM, thus allowing us to compute the histograms $A'^{sp}$ and $B'^{sp}$, corresponding to the transition and observation matrices, respectively.
Therefore, we may define cost functions of $A'^{sp}$, $B'^{sp}$, $A^{sp}$ and $B^{sp}$ to estimate these constants by minimization. However, there are two main practical difficulties in such a minimization problem. First, the equations that describe $A^{sp}$ and $B^{sp}$ are not linear, and no direct solution exists. Second, the computed histograms may present outliers due to errors in the manual labeling process, compromising the estimation through overfitting, especially when the training dataset has limited size. To address the first problem, a trust-region minimization approach [46] is applied, which is a robust technique for solving non-linear ill-conditioned minimization problems [47]. For the second issue, we use an M-estimator as our residual function, which is a robust statistics method for reducing the effect of outliers in parameter estimation problems [48]. Among the many existing M-estimators, we have chosen the Huber function [49], which has remained a popular choice [50]. Finally, the last parameter to be estimated is $c_4$ for the mid-fusion approach in Eq. (29). For this, we randomly select some multimodal recordings in our database, perform a linear search over possible values of $c_4$ within the range $(1, 10]$ (using a step of 0.1), and select $c_4$ as the value that maximizes the total MVAD accuracy for a 5-fold cross-validation of those recordings.

IV. EXPERIMENTAL EVALUATION

All our experiments were conducted in our prototype room, which is a computer lab with dimensions of 4.5 m × 4 m × 3 m and a reverberation time of 0.6 seconds. Our data acquisition hardware is composed of a uniform linear array of eight DPA 4060 omnidirectional microphones, placed 8 cm apart from each other, and an RGB webcam positioned in the middle, as depicted in Fig. 1(b). The 1D ROIs $Q_k$ were set to have $N = 17$ discrete locations, spaced 2 cm apart from each other, so that $\ell = 34$ cm. The audio signals were captured at $F_s = 44{,}100$ Hz, and a buffer size of 4096 samples was used to compute the SRP-PHAT at each time frame $t$.
Video capture was synchronized with audio, so that one image corresponds to one audio frame/buffer, which implies a frame rate of approximately 10 images per second. Consequently, the time window $T$ mentioned in Section III-C4 is chosen as $T = 10$.

Our dataset consists of 24 multimodal sequences, each lasting from 40 to 60 seconds$^2$. Six of the sequences present a single speaker in the scene, and they are named One1 to One6; ten contain two speakers (Two1 to Two10), and eight present three speakers (Three1 to Three8). In all recordings, the users chat freely in Portuguese, alternating between speech and silence, and in the Two and Three sequences they intentionally overlap their voices at times. Sequences with two users consist of sections of individual speech (implying silence of the other speaker), simultaneous speech, and simultaneous silence. All Two sequences are composed of such speech-silence elements, which appear in no particular order across sequences. These elements last 7 seconds on average, ranging from 4 to 8 seconds. For the sequences with three users, a similar procedure is used. They contain moments of simultaneous (total) silence, simultaneous speech for all three users, simultaneous speech for two users (all possible pairs of speakers are used), and individual speech for one user (also appearing in no particular order). The average length of each of these moments is 7 seconds, ranging from 3 to 9 seconds. Furthermore, all sequences in the dataset are composed of a succession of the activities described above, and they also contain natural noise, such as people talking in the background, air conditioning, door slams, and the fans of other computers.

2 In our dataset, we varied the camera model for each recording out of three available: a Logitech Quickcam Pro 5000, a Logitech Pro 9000 or the Kinect's color stream. More details on the setup may be found in ∼crjung/MVAD-data/mvadsimult.htm, where the multimodal sequences with ground truth data can be downloaded.

We used seven of these sequences (One6, Two6-Two10, and Three8) exclusively to train our model (SVM, HMMs and fusion-related parameters). Within the training dataset, a 5-fold cross-validation procedure was applied, and we selected the set of parameters that presented the best result considering all folds. With this fixed set of parameters, we used the remaining 17 sequences to validate the model, as shown in Tables I-IV. We also ensured that different people and camera types were used for training and testing our model, to obtain a flexible classifier that is not biased toward a specific situation.

For measuring the VAD accuracy, we manually labeled each speaker at each frame as active or inactive, and ran three experiments for each sequence: testing the precision of the video modality alone, by using Eq. (14); the audio modality, by using $P(O_t^k; \lambda^{sp})$ (without the fusion); and the combined modalities, by using $P(\bar{O}_t^k; \lambda^{sp})$. Table I shows the results, from which we observe that the MVAD outperforms the unimodal classifiers in all experiments, suggesting that our fusion technique is indeed better than the methods using audio or video alone. It is also important to note that accuracy rates for the multimodal version were over 90% for all video sequences, indicating that the trained model generalizes well to the test set.
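The only difference between the audio-only and the multimodal experiments is the enhanced observable of Eq. (29). The sketch below shows that boost together with the linear search for $c_4$ over $(1, 10]$ with a step of 0.1 described in Section III-C6. It is a minimal sketch under stated assumptions: the function names are hypothetical, the form of Eq. (29) is taken as $\bar{O}_2 = O_2 (1 + \upsilon w_\upsilon / c_4)$, and `accuracy_fn` stands in for a caller-supplied routine that returns the cross-validated MVAD accuracy for a given $c_4$.

```python
def enhance_o2(o2, upsilon, w_upsilon, c4):
    """Eq. (29): boost the audio confidence O2 with the weighted video score."""
    if c4 <= 1.0:
        raise ValueError("c4 must be greater than 1")
    return o2 * (1.0 + upsilon * w_upsilon / c4)


def select_c4(accuracy_fn):
    """Linear search over (1, 10] with step 0.1.

    accuracy_fn is a placeholder for the caller's evaluation routine; it
    receives a candidate c4 and returns the accuracy to be maximized.
    """
    candidates = [round(1.0 + 0.1 * i, 1) for i in range(1, 91)]  # 1.1 ... 10.0
    return max(candidates, key=accuracy_fn)
```

A larger $c_4$ reduces the influence of the video score, and a zero video score leaves $O_2$ unchanged, so the audio-only decision is recovered when the visual cue carries no evidence of speech.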
TABLE I
VAD ACCURACY FOR ALL RECORDED SEQUENCES, USING OUR PROPOSED ALGORITHMS.

Sequence   Unimodal   Unimodal   Multimodal
One1       94.55%     89.72%     97.55%
One2       91.12%     79.91%     96.74%
One3       82.24%     81.62%     92.83%
One4       92.52%     81.78%     96.42%
One5       87.23%     85.83%     93.79%
Two1       93.67%     86.30%     96.72%
Two2       94.03%     91.10%     98.32%
Two3       90.40%     89.93%     96.00%
Two4       92.49%     87.47%     96.60%
Two5       87.47%     81.15%     93.04%
Three1     81.58%     84.07%     93.13%
Three2     82.51%     83.14%     92.28%
Three3     82.44%     89.51%     95.76%
Three4     83.96%     84.01%     92.26%
Three5     88.21%     87.74%     94.48%






Another key point to observe is that most existing approaches that use video information for VAD only work for a close capture range [13], [18], [30], [31], which makes the scenario unrealistic for applications involving several users, e.g., console-based gaming such as Kinect+Xbox and videoconferencing. In our recorded sequences, however, speakers stand at distances between 0.9 m and 1.4 m from the camera, which is enough to accommodate up to three side-by-side participants realistically. Such large distances from the camera may considerably degrade any video-based technique. Our video weighting approach, however, is able to compensate for this effect, increasing the overall accuracy of the final MVAD. Table II shows the VAD results for our two sequences with the broadest capture range, with and without the video weighting function of Eq. (15). We may observe that proper weighting of the video modality is required so that it does not corrupt the final multimodal algorithm. This also suggests that existing video-only/multimodal techniques (such as the ones just mentioned) would likely fail on our dataset.

TABLE II
VAD ACCURACY FOR TWO DISTANT CAPTURE SEQUENCES WITH AND WITHOUT THE VIDEO WEIGHTING APPROACH.

Sequence   MVAD w/o weighting   MVAD w/ weighting
Three6     77.06%               93.23%
Three7     76.45%               91.26%

TABLE III

Sequence       Proposed   Compared   Compared   Compared
One1           97.55%     91.33%     92.26%     85.91%
One2           96.74%     91.64%     70.12%     84.98%
One3           92.83%     88.27%     75.79%     79.90%
One4           96.42%     89.26%     74.48%     81.58%
One5           93.79%     87.62%     72.35%     82.35%
Avg. (One)
Two1           96.89%     78.65%     75.82%     78.91%
Two2           97.81%     79.85%     76.75%     79.33%
Two3           95.93%     80.95%     72.94%     70.48%
Two4           96.71%     81.71%     74.20%     75.55%
Two5           94.11%     82.66%     75.36%     78.75%
Avg. (Two)
Three1         94.02%     75.29%     58.72%     63.42%
Three2         92.39%     80.74%     78.42%     78.98%
Three3         94.94%     82.48%     63.25%     74.13%
Three4         92.10%     74.03%     76.09%     75.43%
Three5         95.58%     79.74%     67.39%     69.14%
Avg. (Three)
Avg. (All)





We also compare our approach to other VAD algorithms, namely the AVAD works of [14] (named "Sohn" in the table) and of [51] (called "EE"), and the VAD module of the G729B codec [52]. Since these algorithms were designed for single-speaker scenarios, the One, Two and Three sequences are analyzed individually. Also, they work with only one microphone, so we chose a fixed microphone (the fourth) of our array for the tests. Table III presents the comparative results. As expected, these methods present higher accuracy for the One sequences (but still not better than our approach). For the Two and Three recordings, we chose the left-most speaker as the reference one, that is, the obtained results are compared to the ground truth for that person. As also expected, the competing approaches perform worse in the simultaneous-speaker cases, since the speech of other users acts as noise for the speech of the person under consideration. This demonstrates the importance of a multiple-speaker VAD algorithm that is able to properly isolate different users, as is the case of our method.

For assessing our SSL approach, a different labeling process had to be performed, since it is rather complex to manually define the precise 3D position of each speaker in each frame. One could fix the locations of each user before the recordings, but movements would then not be allowed, making the scenario unrealistic. For this reason, some form of automatic labeling had to be employed. We used an RGBD camera (namely, Microsoft's Kinect sensor) to find the actual 3D position of the speakers. While this approach may also present some imprecision, it is accurate enough to serve as a reference against which our algorithms may be compared. Another detail is that not all our multimodal sequences have SSL ground truth, given that the Kinect device was not present in all of the recordings$^3$. For this reason, we only assess the SSL accuracy for a portion of our sequences.

3 For the sequences recorded using Kinect, we also make available the acquired depth data.

To evaluate the SSL performance of our algorithm, we computed the Euclidean distance and the distances along each individual dimension between the estimated locations and the labeled locations as error measures (results are shown in Table IV). We repeated this process for the video and audio modalities alone as well as for our multimodal SSL approach. For the video modality, we used Eq. (16). For the audio modality, Eq. (18) was applied, with the difference that a cuboid-like ROI was used instead of $Q_k$ (such a 3D search was used only for performing the localization comparisons). For our multimodal HMM approach we used $\hat{q}_k$, which is estimated by Eq. (30). It can be observed that our multimodal SSL approach presents an accuracy gain over the audio and video modalities alone, both for the individual dimensions and for the 3D Euclidean distance. The multimodal approach yields an average Euclidean error of 10.9 cm when estimating the speakers' 3D position, which is about twice the average length of the human mouth, meaning that at least no speaker is mistaken for another one.

Finally, despite having an overall high VAD accuracy, our approach is prone to errors in certain situations. The audio component of our MVAD can be affected by words containing narrowband sounds (such as vowels), since the SRP-PHAT is designed to work best for broadband signals as in fricative



















TABLE IV

           Unimodal                     Unimodal                     Multimodal
Sequence   dim.   dim.   dim.   Eucl.   dim.   dim.   dim.   Eucl.   dim.   dim.   dim.   Eucl.
Two1       0.046  0.117  0.108  0.148   0.059  0.066  0.154  0.183   0.046  0.046  0.107  0.128
Two2       0.032  0.119  0.073  0.137   0.043  0.052  0.094  0.096   0.038  0.043  0.070  0.095
Two3       0.043  0.128  0.112  0.142   0.046  0.060  0.104  0.136   0.042  0.050  0.097  0.100
Two4       0.058  0.144  0.134  0.203   0.054  0.058  0.127  0.152   0.049  0.038  0.104  0.102
Two5       0.047  0.111  0.108  0.138   0.059  0.060  0.105  0.128   0.045  0.054  0.095  0.115
Two6       0.043  0.110  0.118  0.158   0.049  0.058  0.147  0.178   0.044  0.040  0.104  0.116
Three1     0.065  0.145  0.152  0.201   0.055  0.049  0.153  0.164   0.050  0.037  0.105  0.110
Three2     0.061  0.130  0.129  0.189   0.050  0.047  0.112  0.131   0.041  0.035  0.101  0.117
Three3     0.072  0.121  0.126  0.204   0.059  0.050  0.099  0.122   0.057  0.038  0.082  0.102














consonants [23]. Also, speakers with little lip activity during speech may compromise the video-based part. In particular, a combination of these two situations occurs around frame 390 in sequence Three1, causing our MVAD approach to fail. Additionally, the temporal coherence imposed by the HMMs may cause some delay when detecting speech-to-silence and silence-to-speech transitions. Nevertheless, our approach works reliably even in challenging situations, e.g., door slams while a speaker is moving, around frame 85 in sequence Two5.

V. CONCLUSIONS

We have presented a multimodal VAD and SSL algorithm for simultaneous-speaker scenarios. The proposed approach fuses the video and audio modalities through an HMM competition scheme. An SVM-based classifier is first used to extract a visual voice-activity score from the optical-flow algorithm run on the users' mouths, and the SRP-PHAT algorithm is separately computed for each speaker to extract a location estimate for each. A mid-fusion technique is proposed that combines the video-based score with the audio-based feature, which is later evaluated by the HMMs to make a final VAD decision. The final positions of the active users are then estimated by reusing the HMMs' outputs. Results show an average accuracy of 95.06% for VAD in scenarios with up to three simultaneous speakers, and our SSL approach presents an average error of 10.9 cm when localizing such speakers (using a microphone array and a color camera only). Our approach also works for relatively long capture distances (1.4 m was the longest one tested) and with a compact microphone array (56 cm linear aperture), whereas most other systems use close capture scenes and/or large-aperture arrays. Additionally, our method presents a higher accuracy than the ones reported in Section II-A, which run in controlled/simulated environments.

REFERENCES

[1] A. Jaimes and N. Sebe, “Multimodal human-computer interaction: A survey,” Computer Vision and Image Understanding, vol. 108, no. 1-2, pp. 116–134, Oct. 2007.

[2] Y. Ephraim, “Statistical-model-based speech enhancement systems,” Proceedings of the IEEE, vol. 80, no. 10, pp. 1526–1555, 1992.
[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, and Signal Process., vol. 32, no. 6, pp. 1109–1121, 1984.
[4] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications, ser. Digital Signal Processing. Springer, 2001.
[5] S. Maraboina, D. Kolossa, P. Bora, and R. Orglmeister, “Multi-speaker voice activity detection using ICA and beampattern analysis,” in Proceedings of the European Signal Processing Conference (EUSIPCO), 2006, pp. 2–6.
[6] A. Bertrand and M. Moonen, “Energy-based multi-speaker voice activity detection with an ad hoc microphone array,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, Mar. 2010, pp. 85–88.
[7] J. Lorenzo-Trueba and N. Hamada, “Noise robust voice activity detection for multiple speakers,” in Intelligent Signal Processing and Communication Systems (ISPACS), 2010 International Symposium on, Dec. 2010, pp. 1–4.
[8] H. Do and H. Silverman, “SRP-PHAT methods of locating simultaneous multiple talkers using a frame of microphone array data,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, Mar. 2010, pp. 125–128.
[9] W. Zhang and B. Rao, “A two microphone-based approach for source localization of multiple speech sources,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 18, no. 8, pp. 1913–1928, Nov. 2010.
[10] H. Asoh, F. Asano, T. Yoshimura, K. Yamamoto, Y. Motomura, N. Ichimura, I. Hara, and J. Ogata, “An application of a particle filter to Bayesian multiple sound source tracking with audio and video information fusion,” in Proc. Int. Conf. on Information Fusion (IF), 2004, pp. 805–812.
[11] T. Butko, A. Temko, C. Nadeu, and C. Canton-Ferrer, “Fusion of audio and video modalities for detection of acoustic events,” in INTERSPEECH, 2008, pp. 123–126.
[12] I. Almajai and B. Milner, “Using audio-visual features for robust voice activity detection in clean and noisy speech,” in 16th European Signal Processing Conference (EUSIPCO 2008), 2008.
[13] T. Petsatodis, A. Pnevmatikakis, and C. Boukis, “Voice activity detection using audio-visual information,” in Digital Signal Processing, 2009 16th International Conference on, Jul. 2009, pp. 1–5.
[14] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Process. Lett., vol. 6, pp. 1–3, 1999.
[15] S. Tanyer and H. Ozer, “Voice activity detection in nonstationary noise,” Speech and Audio Processing, IEEE Transactions on, vol. 8, no. 4, pp. 478–482, Jul. 2000.
[16] M. Taghizadeh, P. Garner, H. Bourlard, H. Abutalebi, and A. Asaei, “An integrated framework for multi-channel multi-source localization and voice activity detection,” in Hands-free Speech Communication and Microphone Arrays (HSCMA), 2011 Joint Workshop on, May 30–Jun. 1, 2011, pp. 92–97.
[17] S. Yamamoto, K. Nakadai, M. Nakano, H. Tsujino, J.-M. Valin, K. Komatani, T. Ogata, and H. Okuno, “Real-time robot audition system that recognizes simultaneous speech in the real world,” in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, Oct. 2006, pp. 5333–5338.
[18] M. Gurban and J. Thiran, “Multimodal speaker localization in a probabilistic framework,” in 14th European Signal Processing Conference (EUSIPCO), Florence, Italy, Sep. 2006, ser. Parallel Computing in Electrical Engineering. IEEE, 2006.
[19] S. Mohsen Naqvi, W. Wang, M. Khan, M. Barnard, and J. Chambers, “Multimodal (audiovisual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking,” Signal Processing, IET, vol. 6, no. 5, pp. 466–477, Jul. 2012.
[20] J. H. DiBiase, “A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays,” Ph.D. dissertation, Brown University, May 2000.
[21] H. Do and H. F. Silverman, “A method for locating multiple sources from a frame of a large-aperture microphone array data without tracking,” in ICASSP. IEEE, 2008, pp. 301–304.
[22] W. Cai, X. Zhao, and Z. Wu, “Localization of multiple speech sources based on sub-band steered response power,” in Electrical and Control Engineering (ICECE), 2010 International Conference on, Jun. 2010, pp. 1246–1249.
[23] A. Brutti, M. Omologo, and P. Svaizer, “Localization of multiple speakers based on a two step acoustic map analysis,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, Mar. 31–Apr. 4, 2008, pp. 4349–4352.
[24] C. Zhang, Z. Zhang, and D. Florencio, “Maximum likelihood sound source localization for multiple directional microphones,” in Acoustics, Speech and Signal Processing (ICASSP), 2007 IEEE International Conference on, vol. 1, pp. I-125–I-128, Apr. 2007.
[25] B. Lee and T.
Kalker, “A Vectorized Method for Computationally Eficient SRP-PHAT Sound Source localization,” 12th International Workshop on Acoustic Echo and Noise Control, August - September 2010. V. Peruffo Minotto, C. Rosito Jung, L. Gonzaga da Silveira, and B. Lee, “GPU-based Approaches for Real-Time Sound Source Localization using the SRP-PHAT Algorithm,” International Journal of High Performance Computing Applications, 2012. J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing, ser. Springer Topics in Signal Processing Series. Springer, 2008. S. Takeuchi, T. Hashiba, S. Tamura, and S. Hayamizu, “Voice activity detection based on fusion of audio and visual information,” in International Conference on Auditory-Visual Speech Processing, 2009. P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, “Multimodal fusion for multimedia analysis: A survey,” Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010. A. Aubrey, Y. Hicks, and J. Chambers, “Visual voice activity detection with optical flow,” Image Processing, IET, vol. 4, no. 6, pp. 463 –472, december 2010. P. Tiawongsombat, M.-H. Jeong, J.-S. Yun, B.-J. You, and S.-R. Oh, “Robust visual speakingness detection using bi-level hmm,” Pattern Recogn., vol. 45, no. 2, pp. 783–793, Feb. 2012. B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision (darpa),” in Proceedings of the 1981 DARPA Image Understanding Workshop, April 1981, pp. 121–130. J. Bins, C. Jung, L. Dihl, and A. Said, “Feature-based face tracking for videoconferencing applications,” in Multimedia, 2009. ISM ’09. 11th IEEE International Symposium on, dec. 2009, pp. 227 –234. L. G. Farkas, Anthropometry of the Head and Face, L. G. Farkas, Ed. Raven Press, 1994, vol. 6, no. 4. J.-Y. Bouguet, “Pyramidal implementation of the lucas kanade feature tracker description of the algorithm,” 2000. J. Shi and C. Tomasi, “Good features to track,” in Computer Vision and Pattern Recognition, 1994. 
Proceedings CVPR '94., 1994 IEEE Computer Society Conference on. IEEE, Jun. 1994, pp. 593–600. [Online]. Available: L. Rabiner and R. Schafer, Digital processing of speech signals, ser. Prentice-Hall signal processing series. Prentice-Hall, 1978. C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at %url cjlin/libsvm.

[39] R. E. Schapire and Y. Freund, “Boosting the margin: a new explanation for the effectiveness of voting methods,” The Annals of Statistics, vol. 26, pp. 322–330, 1998.
[40] T.-F. Wu, C.-J. Lin, and R. C. Weng, “Probability estimates for multiclass classification by pairwise coupling,” J. Mach. Learn. Res., vol. 5, pp. 975–1005, Dec. 2004.
[41] R.-E. Fan, P.-H. Chen, and C.-J. Lin, “Working set selection using second order information for training support vector machines,” J. Mach. Learn. Res., vol. 6, pp. 1889–1918, Dec. 2005.
[42] D. H. Johnson and D. E. Dudgeon, Array Signal Processing - Concepts and Techniques. Prentice, 1993.
[43] D. A. Blauth, V. P. Minotto, C. R. Jung, B. Lee, and T. Kalker, “Voice activity detection and speaker localization using audiovisual cues,” Pattern Recognition Letters, vol. 33, no. 4, pp. 373–380, 2012.
[44] V. Minotto, C. Lopes, J. Scharcanski, C. Jung, and B. Lee, “Audiovisual voice activity detection based on microphone arrays and color information,” Selected Topics in Signal Processing, IEEE Journal of, vol. 7, no. 1, pp. 147–156, 2013.
[45] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[46] R. Byrd, R. Schnabel, and G. Shultz, “Approximate solution of the trust region problem by minimization over two-dimensional subspaces,” Mathematical Programming, vol. 40, pp. 247–263, 1988.
[47] A. Conn, N. Gould, and P. Toint, Trust Region Methods, ser. MPS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics, 2000. [Online]. Available: books?id=5kNC4fqssYQC
[48] C. G. Small and J. Wang, Numerical Methods for Nonlinear Estimating Equations. Oxford University Press, 2003.
[49] P. J. Huber, “Robust estimation of a location parameter,” Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, Mar. 1964.
[50] P. Huber, Robust Statistics, ser. Wiley Series in Probability and Statistics. Wiley, 2005. [Online]. Available: br/books?id=HQp2BKN-qWoC
[51] B. Lee and D. Mukherjee, “Spectral entropy-based voice activity detector for videoconferencing systems,” in INTERSPEECH, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. ISCA, 2010, pp. 3106–3109.
[52] ITU-T, “A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70, Annex B,” 1996.
