Single-Channel Audio Sources Separation via Optimum Mask Filter

Mahdi Fallah, Meysam Asgari
Dept. of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran
e-mail: [email protected]

Abstract— In many single-channel audio separation techniques, expressing a given audio mixture in terms of its underlying signals is a challenging problem. In contrast to commonly used GMM-based approaches, this paper proposes a VQ-based single-channel audio separation method that introduces a new mask filter for DFT mixtures. Compared with previously proposed methods, including the MIXMAX mask filter, the proposed mask filter is shown to improve separation performance in terms of the SNR and MOS criteria. The experimental results agree with the given mathematical analysis.

Keywords- Single Channel Audio Separation; Optimum Mask Filter; MIXMAX

I. INTRODUCTION

Monaural audio separation has been introduced as a challenging topic in recent years, where the aim is to segregate a mixture of audio signals [1]. In such applications, the amplitude of the Short-Time Fourier Transform (STFT) is usually used as the primary feature [4], [5]. Considering x_1(n), x_2(n), and y(n) as the two audio signals and the mixed signal, respectively, the related frequency spectra can be written as:

X_1 = [x_1(1)e^{jφ_1(1)}, ..., x_1(k)e^{jφ_1(k)}, ..., x_1(N_DFT)e^{jφ_1(N_DFT)}]
X_2 = [x_2(1)e^{jφ_2(1)}, ..., x_2(k)e^{jφ_2(k)}, ..., x_2(N_DFT)e^{jφ_2(N_DFT)}]   (1)

Y = [y(1)e^{jφ(1)}, ..., y(k)e^{jφ(k)}, ..., y(N_DFT)e^{jφ(N_DFT)}]   (2)

where x_i(k) and φ_i(k), i = 1, 2, represent the magnitude and the phase of the N_DFT-point discrete Fourier transform of the i-th audio source, k is the frequency bin index, and N_DFT is the number of DFT points. The mixture vector Y(k) is related to X_1(k) and X_2(k) by:

Y(k) = x_1(k)e^{jφ_1(k)} + x_2(k)e^{jφ_2(k)}   (3)

so that the relationship between the magnitude of the mixed spectrum and the spectra of the underlying signals is:

y^2(k) = x_1^2(k) + x_2^2(k) + 2 x_1(k) x_2(k) cos θ(k),   k = 1, 2, ..., N_DFT   (4)

where θ(k) is the phase difference, θ(k) = φ_1(k) − φ_2(k). Since presenting a compact model for phase values is difficult, and in order to exclude the phase information from the separation scenario, Sayadiyan et al. [3], inspired by Nadas et al. [2], reported a proof of the mixture-maximization (MIXMAX) approximation, which states that the log spectrum of the mixed signal is nearly the element-wise maximum of the log spectra of the two underlying signals. The basic MIXMAX idea is established in the minimum mean square error (MSE) sense. The MSE estimator for the MIXMAX approach can be written as [5]:

log(ŷ(k))_MSE = E[ log(y(k)) | log(x_1(k)), log(x_2(k)) ]

and consequently we have:

log(ŷ(k))_MSE ≈ max( log(x_1(k)), log(x_2(k)) ),   ŷ(k)_MSE ≈ max( x_1(k), x_2(k) )   (5)

We show that these methods (MIXMAX) are not optimum. In the following section, we present stochastic audio source modeling. In Section III, we derive the MSE estimation for DFT mixtures. Section IV presents the searching criteria in the source codebooks, Section V presents separation via the mask filter approach, Section VI presents the experimental study, and Section VII concludes the paper.
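As a quick numerical sanity check of the MIXMAX approximation, the sketch below draws random magnitude pairs and uniform phase differences, forms the mixture magnitude via the cosine relation of Eq. (4), and compares it with the element-wise maximum of Eq. (5). All values are synthetic and illustrative; this is not the paper's experiment.

```python
import numpy as np

# MIXMAX sanity check: mixture magnitude from Eq. (4) vs. the element-wise
# max approximation of Eq. (5), on synthetic per-bin magnitudes and a
# uniform phase difference. n_bins and the magnitude ranges are assumptions.
rng = np.random.default_rng(0)

n_bins = 512
x1 = rng.uniform(0.1, 2.0, n_bins)          # magnitude spectrum of source 1
x2 = rng.uniform(0.1, 2.0, n_bins)          # magnitude spectrum of source 2
theta = rng.uniform(-np.pi, np.pi, n_bins)  # phase difference per bin

# mixture magnitude, Eq. (4): y^2 = x1^2 + x2^2 + 2 x1 x2 cos(theta)
y = np.sqrt(x1**2 + x2**2 + 2 * x1 * x2 * np.cos(theta))

mixmax = np.maximum(x1, x2)                 # Eq. (5) approximation
err_db = 20 * np.abs(np.log10(y) - np.log10(mixmax))
print("median |log error| (dB):", np.median(err_db))
```

The approximation is tightest in bins where one source dominates; where the two magnitudes are comparable, phase cancellation makes the error larger, which is exactly the gap the optimum estimator of the later sections targets.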

II. STOCHASTIC AUDIO SOURCES MODELING

A. Feature selection

To build a good stochastic model, one needs to gather a large number of data samples from each audio source. In addition, the feature type plays a key role in source modelling, in that the more compact a feature density (lower variance), the better the resulting audio source model. The STFT of each audio source is commonly used as the feature in previous works; however, it contains both amplitude and phase information. Due to the uniformity of the phase distribution, the raw STFT is not suitable for clustering purposes. As a result, by neglecting the phase information, in this paper we choose the spectrum amplitude as our feature vector.
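A minimal sketch of this feature extraction: STFT magnitude frames with the phase discarded. The window length, hop, and sampling rate follow the values reported in the experimental section (32 ms window, 16 ms shift, fs = 8 kHz); the function name, DFT size, and test tone are illustrative assumptions.

```python
import numpy as np

# Magnitude-STFT feature extraction: Hamming-windowed frames, |rfft| kept,
# phase discarded. Parameters mirror the paper's setup; the input is a
# synthetic 440 Hz tone, used only to exercise the function.
def stft_magnitude(x, fs=8000, win_ms=32, hop_ms=16, n_dft=256):
    win_len = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(win_len)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = x[start:start + win_len] * window
        frames.append(np.abs(np.fft.rfft(frame, n_dft)))  # magnitude only
    return np.array(frames)            # shape: (n_frames, n_dft // 2 + 1)

fs = 8000
t = np.arange(fs) / fs                 # 1 s of signal
feats = stft_magnitude(np.sin(2 * np.pi * 440 * t))
print(feats.shape)
```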

B. Vector quantization

After extracting features from the audio data samples, we establish a codebook for each audio source. Constructing these codebooks requires feature vectors obtained through STFT analysis. For the clustering stage, vector quantization is employed, commonly using the Linde-Buzo-Gray (LBG) algorithm.
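The LBG codebook construction can be sketched as follows, on a tiny synthetic feature set and a small codebook (the paper uses M = 4096 codewords); the splitting factor, iteration count, and data shape are illustrative assumptions.

```python
import numpy as np

# LBG codebook training sketch: start from the global centroid, double the
# codebook by perturbation splits, refine each size with Lloyd iterations.
def lbg(features, codebook_bits, n_iter=20, eps=0.01):
    codebook = features.mean(axis=0, keepdims=True)   # one initial centroid
    for _ in range(codebook_bits):
        # split every centroid into a +/- perturbed pair (codebook doubles)
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):                       # Lloyd refinement
            d = np.linalg.norm(features[:, None, :] - codebook[None], axis=2)
            assign = d.argmin(axis=1)
            for c in range(len(codebook)):
                members = features[assign == c]
                if len(members):                      # skip empty clusters
                    codebook[c] = members.mean(axis=0)
    return codebook

feats = np.abs(np.random.default_rng(1).normal(size=(500, 16)))
cb = lbg(feats, codebook_bits=3)       # 2^3 = 8 codewords
print(cb.shape)
```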

III. DERIVING MSE ESTIMATION FOR DFT MIXTURES

To evaluate the MSE estimation, we need the PDF of the random variable y(k). Assuming a uniform distribution for φ_i(k), i = 1, 2, and given x_1(k) and x_2(k), the MSE estimate for the DFT mixture is:

ŷ(k)_MSE = E[ y(k) | x_1(k), x_2(k) ]   (6)

According to [8], we can assume a uniform phase distribution for each audio source; as a result, the distribution of the phase difference θ(k) is:

f_θ(k)(θ(k)) = 1/(2π),   −π ≤ θ(k) < π   (7)

Given the distribution of θ(k), we can obtain the PDF of y(k). Note that if p_x(x) is the PDF of a random variable x and y = f(x), then y is also a random variable, and its PDF can be written as [6]:

p_y(y) = p_x(x) / |df(x)/dx|   (8)

Thus the PDF of the random variable y(k) is computed as:

p_y(k)(y(k)) = y(k) / ( 2π x_1(k) x_2(k) √( 1 − [ (y^2(k) − x_1^2(k) − x_2^2(k)) / (2 x_1(k) x_2(k)) ]^2 ) )   (9)

where

|x_1(k) − x_2(k)| ≤ y(k) ≤ x_1(k) + x_2(k)   (10)

We evaluate the MSE estimator as:

ŷ(k)_MSE = ∫_{|x_1(k) − x_2(k)|}^{x_1(k) + x_2(k)} y(k) p_y(k)(y(k)) dy(k)   (11)

Throughout this paper we assume x_1(k) ≥ x_2(k); then we have:

ŷ(k)_MSE = x_1(k) g( x_2(k)/x_1(k) )   (12)

where g(·) is a nonlinear function that can be expressed through the elliptic integral Π (EllipticPi) evaluated at a nonlinear function f(·) of x_2(k)/x_1(k). Figure 1 plots ŷ(k)_MSE/x_1(k) against x_2(k)/x_1(k). We fit ŷ(k)_MSE/x_1(k) against x_2(k)/x_1(k) with a polynomial P_N of order N (here N = 10). Figure 2 shows P_N(x_2(k)/x_1(k)) together with ŷ(k)_MSE/x_1(k) against x_2(k)/x_1(k).
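One plausible numerical reading of this derivation, sketched below: the conditional mean E[y(k) | x_1(k), x_2(k)] is computed by averaging over the uniform phase difference (equivalent to integrating y(k) against the PDF of Eq. (9), but avoiding its endpoint singularities), a degree-10 polynomial is fitted to the result, and the per-ratio exponent q of Eq. (15) is solved by bisection. The grid sizes and bisection bracket are illustrative assumptions.

```python
import numpy as np

# Numerical sketch of Eqs. (6)-(17): with r = x2/x1 <= 1, average the mixture
# magnitude over the uniform phase difference to get E[y]/x1, then solve
# (1 + r^q)^(1/q) = E[y]/x1 for q, and fit a degree-10 polynomial P_N(r).
theta = np.linspace(-np.pi, np.pi, 20000, endpoint=False)

def mean_mixture(r):
    # E[ sqrt(1 + r^2 + 2 r cos(theta)) ] with theta ~ U(-pi, pi)
    return np.sqrt(1 + r**2 + 2 * r * np.cos(theta)).mean()

def solve_q(r, lo=1.0, hi=8.0):
    g = mean_mixture(r)
    for _ in range(60):           # bisection; (1+r^q)^(1/q) decreases in q
        mid = 0.5 * (lo + hi)
        if (1 + r**mid) ** (1 / mid) > g:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

ratios = np.linspace(0.05, 1.0, 40)
qs = np.array([solve_q(r) for r in ratios])
pn = np.polyfit(ratios, qs, 10)   # coefficients of P_N, order N = 10
print("q at r=1:", qs[-1], "P_N(1):", np.polyval(pn, 1.0))
```

At r = 1 the conditional mean is 4/π, so the recovered exponent there is q = ln 2 / ln(4/π) ≈ 2.87, which the fitted polynomial reproduces.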

We use P_N(x_2(k)/x_1(k)) for the mask filter later.

IV. SEARCHING CRITERIA IN SOURCES CODEBOOK

By searching for the indices in the source codebooks that minimize the estimation error, the separation result for the MIXMAX estimator is:

(i_opt, j_opt) = argmin_{i,j} Σ_{k=1}^{N_DFT} [ Y(k) − max( x_1^i(k), x_2^j(k) ) ]^2   (13)

and for the optimum DFT-mixture estimator:

(i_opt, j_opt) = argmin_{i,j} Σ_{k=1}^{N_DFT} [ Y(k) − x_1^i(k) ( 1 + ( x_2^j(k)/x_1^i(k) )^q )^{1/q} ]^2   (14)

V. SEPARATION VIA MASK FILTER APPROACH

We assume the MSE estimate for DFT mixtures has the form:

ŷ(k)_MSE = ( x_1(k)^q + x_2(k)^q )^{1/q}   (15)

then:

ŷ(k)_MSE = x_1(k) ( 1 + ( x_2(k)/x_1(k) )^q )^{1/q}   (16)

We compute q from Eq. (15) as:

q = P_N( x_2(k)/x_1(k) )   (17)

Figure 3 plots P_N(x_2(k)/x_1(k)) against x_2(k)/x_1(k).

Now we design the optimum mask filter for the optimum MSE estimate:

X̂_1(k) = MASK_X1(k) y(k) e^{jφ_1(k)}   (18)

We assume:

X̂_1(k) = [ x_1(k)^q / ( x_1(k)^q + x_2(k)^q ) ] ( x_1(k) e^{jφ_1(k)} + x_2(k) e^{jφ_2(k)} )   (19)

and rewrite this as the target signal plus an error term:

X̂_1(k) = x_1(k) e^{jφ_1(k)} + { [ x_1(k)^q / ( x_1(k)^q + x_2(k)^q ) ] ( x_1(k) e^{jφ_1(k)} + x_2(k) e^{jφ_2(k)} ) − x_1(k) e^{jφ_1(k)} }   (20)

We define the error E, introducing an exponent α (0 ≤ α ≤ 1) to be optimized:

E = ( x_1^2(k) / ( x_1(k)^q + x_2(k)^q )^α − x_1(k) ) e^{jφ_1(k)} + ( x_1(k) x_2(k) / ( x_1(k)^q + x_2(k)^q )^α ) e^{jφ_2(k)}   (21)

The magnitude of the random variable E is estimated as:

|E|_est = | x_1^2(k) / ( x_1(k)^q + x_2(k)^q )^α − x_1(k) | + x_1(k) x_2(k) / ( x_1(k)^q + x_2(k)^q )^α   (22)

We must optimize α in (22):

α_opt = argmin_{0 ≤ α ≤ 1} |E|_est   (23)

and, driving |E|_est toward zero, the optimization yields:

α_opt = x_1(k)^{q−1} ( x_1(k)^q + x_2(k)^q )^{(1/q) − 1}   (24)

Thus the optimum mask filters are obtained as:

MASK_X1(k) = x_1(k)^q / ( x_1(k)^q + x_2(k)^q ),   MASK_X2(k) = x_2(k)^q / ( x_1(k)^q + x_2(k)^q )   (25)

By comparison, the MIXMAX mask filter is [3]:

MASK_X1(k) = 1 if x_1(k) ≥ x_2(k), 0 if x_2(k) > x_1(k)
MASK_X2(k) = 0 if x_1(k) ≥ x_2(k), 1 if x_2(k) > x_1(k)   (26)
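The two mask filters can be compared directly on per-bin magnitudes. The sketch below uses a fixed q = 2 for simplicity (in the paper q comes from the fitted polynomial P_N) and synthetic magnitude values.

```python
import numpy as np

# Optimum mask, Eq. (25), vs. binary MIXMAX mask, Eq. (26), on per-bin
# codebook magnitudes x1, x2. q is fixed at 2 here; the values are synthetic.
def optimum_mask(x1, x2, q=2.0):
    return x1**q / (x1**q + x2**q)            # Eq. (25), MASK_X1

def mixmax_mask(x1, x2):
    return (x1 >= x2).astype(float)           # Eq. (26), MASK_X1

x1 = np.array([1.0, 0.5, 0.9, 2.0])
x2 = np.array([0.2, 0.5, 1.1, 0.1])

m_opt = optimum_mask(x1, x2)
m_mix = mixmax_mask(x1, x2)
print(m_opt)      # soft values in (0, 1)
print(m_mix)      # hard 0/1 decisions

# the two optimum masks always sum to one across the two sources (Eq. 25)
print(optimum_mask(x1, x2) + optimum_mask(x2, x1))
```

The soft mask degrades gracefully in bins where the sources are comparable, while the MIXMAX mask commits all of the bin's energy to one source.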

VI. EXPERIMENTAL RESULTS

To evaluate different algorithms against our proposed method, we used 100,000 training vectors from 80 sentences uttered by two male/female speakers, as well as music signals including piano segments extracted from audio waves. The analysis window is set to 32 ms with a frame shift of 16 ms at a sampling frequency of fs = 8 kHz. The codebook size is 12 bits, i.e., M = 4096 clusters. To evaluate separation performance, the Signal-to-Noise Ratio (SNR) is used as a measure of how much of the interfering signal is removed from the mixture.

The SNR criterion is calculated in the DFT amplitude domain rather than the time domain, due to the latter's dependency on phase. The SNR criterion is defined as:

SNR_i = 10 log( Σ_{k=1}^{N_DFT} X_i(k)^2 / Σ_{k=1}^{N_DFT} ( X_i(k) − X̂_i(k) )^2 )   (27)

where X_i(k) and X̂_i(k) are the original and the reconstructed DFT amplitudes, respectively. Taking the average over all frames, we have:

ASNR = AVERAGE(SNR_i)   (28)
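The SNR and ASNR criteria can be sketched as follows; the frame magnitudes below are synthetic and only illustrate the computation.

```python
import numpy as np

# Per-frame SNR in the DFT amplitude domain, Eq. (27), averaged over frames
# to give ASNR, Eq. (28). X and X_hat hold frames of original and
# reconstructed magnitude spectra; values here are synthetic.
def frame_snr_db(x, x_hat):
    return 10 * np.log10(np.sum(x**2) / np.sum((x - x_hat)**2))

def asnr(X, X_hat):
    return np.mean([frame_snr_db(x, xh) for x, xh in zip(X, X_hat)])

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.5, size=(20, 129))        # 20 frames of magnitudes
X_hat = X + rng.normal(0, 0.05, size=X.shape)    # small reconstruction error
print("ASNR (dB):", asnr(X, X_hat))
```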

Table 1, Table 2, and Table 3 summarize the Average SNR (ASNR) results for male/female, male/music, and female/music separation using the different methods, including the MIXMAX filter mask and the optimum filter mask. As can be seen, the ASNR of the proposed filter mask is improved.

TABLE 1: ASNR (dB) FOR SEPARATED MALE AND FEMALE SIGNALS USING MIXMAX AND OPTIMUM FILTER MASK.

Signal    MIXMAX filter mask    Optimum filter mask
male      8.6                   11.7
female    8.5                   11.4

Figure 4. Whole segregation procedure.

TABLE 2: ASNR (dB) FOR SEPARATED MALE AND MUSIC SIGNALS USING MIXMAX AND OPTIMUM FILTER MASK.

Signal    MIXMAX filter mask    Optimum filter mask
male      15.0                  17.3
music     12.4                  14.7

TABLE 3: ASNR (dB) FOR SEPARATED FEMALE AND MUSIC SIGNALS USING MIXMAX AND OPTIMUM FILTER MASK.

Signal    MIXMAX filter mask    Optimum filter mask
female    15.1                  17.4
music     12.7                  15.2

Fig. 4 depicts the whole segregation procedure using the optimum mask filter, and Fig. 5 depicts the male/female mixture as well as the separated underlying signals in the time domain. For subjective evaluation, we conducted a Mean Opinion Score (MOS) test on the output separated signals. Table 4 shows the MOS results obtained in our experiments. As can be seen, our proposed method outperforms previous methods, including the MIXMAX mask filter, by about 1 MOS point.

Figure 5. Time-domain waveforms of (a) male, (b) female, (c) male/female mixture, (d) separated male, and (e) separated female signals.

TABLE 4: MOS RESULTS FOR SYNTHESIZED OUTPUT SIGNALS.

Method                 Category (Speech/Music)    MOS
Optimum mask filter    Adult male + music         3.3
Optimum mask filter    Adult female + music       3.2
MIXMAX mask filter     Adult male + music         2.3
MIXMAX mask filter     Adult female + music       2.4

CONCLUSION

We have presented a novel mask filter for monaural audio separation, in contrast to the commonly used mixture-maximization (MIXMAX) mask filter approximation that has been extensively applied in single-channel audio separation algorithms. It was demonstrated that the proposed approach outperforms the MIXMAX mask filter in terms of both SNR and MOS results.

REFERENCES

[1] S. Roweis, "One microphone source separation", in Proc. NIPS, 2000.
[2] A. Nadas, D. Nahamoo, and M. A. Picheny, "Speech recognition using noise-adaptive prototypes", IEEE Trans. Acoust. Speech Signal Process., vol. 37, pp. 1495-1503, 1989.
[3] M. H. Radfar, A. H. Banihashemi, R. M. Dansereau, and A. Sayadiyan, "A non-linear minimum mean square error estimator for the mixture-maximization approximation", Electron. Lett., vol. 42, no. 12, pp. 75-76, 2006.
[4] M. J. Reyes-Gomez, D. Ellis, and N. Jojic, "Multiband audio modeling for single channel acoustic source separation", in Proc. ICASSP'04, pp. 641-644, May 2004.
[5] A. M. Reddy and B. Raj, "A minimum mean squared error estimator for single channel speaker separation", in Proc. INTERSPEECH 2004, pp. 2445-2448, Oct. 2004.
[6] A. Papoulis, "Probability, Random Variables, and Stochastic Processes", McGraw-Hill, 1991.
[7] M. Casey and A. Westner, "Separation of mixed audio sources by independent subspace analysis", in Proc. ICMC, 2000.
[8] H. Pobloth and W. B. Kleijn, "Squared error as a measure of perceived phase distortion", J. Acoust. Soc. Am., vol. 114, no. 2, pp. 1081-1094, Aug. 2003.
[9] A. Gersho and R. M. Gray, "Vector Quantization and Signal Compression", Kluwer Academic, Norwell, MA, 1992.
[10] D. Ellis and R. Weiss, "Model-based monaural source separation using a vector-quantized phase-vocoder representation", in Proc. ICASSP-06, vol. V, pp. 957-960, May 2006.
