An Improved Method for Estimating the a Priori Probability of Speech Absence for Enhancement of Speech

Aboubakar Nasser Samatin Njikam¹ and Huan Zhao¹

¹ School of Information Science and Engineering, Hunan University, Changsha, Hunan, China

Abstract - The efficiency of many noise suppression filters relies on accurate estimation of their parameters, such as the power spectral density (PSD) of the noise and the a priori speech absence probability (SAP). This work addresses the problem of estimating the a priori SAP. The proposed method relies on comparing the conditional probabilities of the noisy speech magnitude under the hypotheses that speech is absent or present, and on dynamically computing the parameters of the a priori SAP using two factors: a smoothing-update factor and a factor related to the kth spectral component. The smoothing-update factor, which is based on a decision made in each frequency band on whether speech is present or absent, is computed by recursively averaging past spectral values of the a priori SAP. The factor related to the kth spectral component is computed using the a priori signal-to-noise ratio (SNR). The efficiency of the proposed algorithm over competing ones, in terms of both background noise suppression and speech distortion, is assessed using an objective measure, the average segmental signal-to-noise ratio (segSNR), together with a study of speech spectrograms.

Keywords: Signal and speech processing, Enhancement techniques, Speech Absence Probability

1 Introduction

The optimally modified log-spectral amplitude estimator (OM-LSA) [1] was proposed to minimize the mean-square error of the log-spectra for speech signals under signal presence uncertainty. The algorithm is very efficient at suppressing noise and improving speech quality, provided that its required parameters, such as the a priori signal-to-noise ratio (SNR) and the a priori speech absence probability (SAP), are correctly estimated. This paper focuses on the estimation of the a priori speech absence probability. The a priori SAP is the probability that speech is not present in a given frame and frequency bin of the input signal [2].

Several algorithms for estimating and updating the a priori SAP have been proposed. In [3], two methods for estimating the a priori SAP were proposed. The first method, which is based on comparing the conditional probabilities of the noisy speech magnitude under the hypotheses that speech is present or absent, can be considered a hard-decision approach. The second method proposed in [3] is known as a soft-decision approach. Unlike the hard-decision method, which classifies the input signal as speech present or speech absent, the soft-decision method is designed to represent directly the probability that the input signal comes from a speech-absence state. These two methods were incorporated into an MMSE estimator and were shown to yield about 2-3 dB of SNR improvement over the estimator that used a fixed a priori SAP [4, 5, 6]. In [6], Malah et al. proposed a different method based on the a posteriori SNR, in which the decision on speech absence is made by comparing the estimated a posteriori SNR against a threshold. Its main drawback is that misclassification of voice activity causes unwanted artifacts [7, 8, 9]. Cohen [1, 10] presented a technique for computing the a priori SAP that is essentially based on three factors. In his work, a soft decision made on each frame as to whether speech is present, together with local and global factors computed from the a priori SNR, are used as parameters. This technique was shown to perform better than the methods cited above in terms of the objective segmental SNR measure. However, as in [6], undesired artifacts are introduced by misclassification of the speech activity.

In this paper an effective estimator for the a priori SAP is proposed. After a brief description of the OM-LSA estimator, a detailed description of the proposed estimator is presented. The technique is evaluated using an objective measure, the average segmental signal-to-noise ratio (segSNR), together with a study of speech spectrograms.

2 Description of the Optimally Modified Log-Spectral Amplitude

We assume a clean signal x corrupted by an uncorrelated additive background noise signal n. In the short-time Fourier transform (STFT) domain,

$$Y_k(l) = X_k(l) + N_k(l)$$

where $l$ stands for the frame index and $k$ for the frequency band index. Under the following two hypotheses:

$$H_k^0(l):\ \text{speech absent}, \qquad H_k^1(l):\ \text{speech present}$$

and based on a complex Gaussian assumption [5, 7], the conditional probabilities of the noisy speech magnitude are computed as below:

$$p(Y_k(l) \mid H_k^0(l)) = \frac{1}{\pi \lambda_k^n(l)} \exp\left\{-\frac{|Y_k(l)|^2}{\lambda_k^n(l)}\right\}$$

$$p(Y_k(l) \mid H_k^1(l)) = \frac{1}{\pi\left(\lambda_k^x(l) + \lambda_k^n(l)\right)} \exp\left\{-\frac{|Y_k(l)|^2}{\lambda_k^x(l) + \lambda_k^n(l)}\right\} \tag{1}$$

where $\lambda_k^x(l) = E\left[|X_k(l)|^2 \mid H_k^1(l)\right]$ and $\lambda_k^n(l) = E\left[|N_k(l)|^2\right]$ indicate the variance of the clean speech and the variance of the noise, respectively.

Therefore, the probability that speech is present, denoted by $p_k(l) = p(H_k^1(l) \mid Y_k(l))$, is derived from Bayes' formula [5]:

$$p_k(l) = \left\{1 + \frac{P(H_k^0(l))}{1 - P(H_k^0(l))}\,\bigl(1 + \xi_k(l)\bigr)\exp\bigl(-v_k(l)\bigr)\right\}^{-1} \tag{2}$$

in which $\xi_k(l) = \lambda_k^x(l)/\lambda_k^n(l)$ and $\gamma_k(l) = |Y_k(l)|^2/\lambda_k^n(l)$ represent the a priori SNR and the a posteriori SNR, respectively, and $v_k(l) = \gamma_k(l)\,\xi_k(l)/(1 + \xi_k(l))$.

Let $L = |X|$ be the spectral speech amplitude. Its optimal estimate $L'$ [7], under the assumption of statistically independent spectral components [6], is given by

$$L'_k(l) = \exp\{E[\log L_k(l) \mid Y_k(l)]\} = G_k(l)\,|Y_k(l)| \tag{3}$$

Assuming a Gaussian statistical model, we get

$$E[\log L_k(l) \mid Y_k(l)] = E[\log L_k(l) \mid Y_k(l), H_k^1(l)]\,p_k(l) + E[\log L_k(l) \mid Y_k(l), H_k^0(l)]\,\bigl(1 - p_k(l)\bigr) \tag{4}$$

Hence the following assumption can be made. In the absence of speech, a minimum gain $G_{\min}$, determined by a subjective criterion, is used:

$$\exp\{E[\log L_k(l) \mid Y_k(l), H_k^0(l)]\} = G_{\min}\,|Y_k(l)| \tag{5}$$

In the presence of speech, the conditional gain function given by

$$\exp\{E[\log L_k(l) \mid Y_k(l), H_k^1(l)]\} = G_k^{H_1}(l)\,|Y_k(l)| \tag{6}$$

is derived in [6] to be

$$G_k^{H_1}(l) = \frac{\xi_k(l)}{1 + \xi_k(l)} \exp\left(\frac{1}{2}\int_{v_k(l)}^{\infty} \frac{e^{-t}}{t}\,dt\right) \tag{7}$$

Therefore, substituting (5) and (6) into (4), and using (3), the gain function for the OM-LSA estimator is obtained as

$$G_k(l) = \left\{G_k^{H_1}(l)\right\}^{p_k(l)}\, G_{\min}^{\,1 - p_k(l)} \tag{8}$$

In the above scheme, the OM-LSA is modified by considering the uncertainty of speech presence in real environments, which requires the computation of the speech absence probability (SAP). In the next section, we present an improved method for tracking the a priori SAP.
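To illustrate how the speech presence probability (2), the conditional gain (7) and the OM-LSA gain (8) fit together, the following is a minimal Python sketch rather than the authors' implementation. It assumes NumPy and SciPy are available. The function names, the default `g_min = 0.1` and the clipping bounds are hypothetical choices made for this example, and `q` stands for the a priori SAP P(H_k^0(l)). The integral in (7) is the exponential integral E1, which SciPy exposes as `scipy.special.exp1`.

```python
import numpy as np
from scipy.special import exp1  # exp1(v) = integral from v to infinity of exp(-t)/t dt


def speech_presence_probability(xi, gamma, q):
    """Conditional speech presence probability p_k(l) of Eq. (2).

    xi: a priori SNR, gamma: a posteriori SNR, q: a priori SAP P(H0).
    """
    q = np.clip(q, 1e-6, 1.0 - 1e-6)  # assumption: keep q / (1 - q) finite
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * np.exp(-v))


def omlsa_gain(xi, gamma, q, g_min=0.1):
    """OM-LSA gain of Eq. (8); g_min is an illustrative value, not from the paper."""
    # assumption: floor v so that exp1(0) = inf is never evaluated
    v = np.maximum(gamma * xi / (1.0 + xi), 1e-10)
    g_h1 = xi / (1.0 + xi) * np.exp(0.5 * exp1(v))  # Eq. (7)
    p = speech_presence_probability(xi, gamma, q)
    return g_h1 ** p * g_min ** (1.0 - p)           # Eq. (8)


# Two example bins: a strong speech component and a noise-dominated one.
xi = np.array([10.0, 0.01])
gamma = np.array([11.0, 1.0])
q = np.array([0.5, 0.9])
print(omlsa_gain(xi, gamma, q))
```

In this example the first bin keeps a gain close to ξ/(1+ξ), while the second bin is pulled toward `g_min`, which is the behaviour (8) is designed to produce when the speech presence probability is low.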

3 The Proposed A Priori SAP Estimation Method

Two factors, $\alpha_k(l)$ and $c_k$, are involved in the computation of the a priori SAP estimator, where $\alpha_k(l)$ represents a dynamic smoothing update factor which describes the speech absence probability of the current frame, and $c_k$ is a factor related to the kth spectral component.

3.1 Computation of a Dynamic Smoothing Update Factor

The process of computing the dynamic smoothing update factor is described below. Assuming the following two hypotheses

$$H_k^0(l):\ \text{speech absent}, \qquad H_k^1(l):\ \text{speech present}$$

and based on a Gaussian statistical model assumption [5, 7], the conditional probabilities of the observed signal are as follows:

$$p(Y_k \mid H_k^0) = \frac{2Y_k}{\lambda_k^n} \exp\left(-\frac{Y_k^2}{\lambda_k^n}\right) \tag{9}$$

$$p(Y_k \mid H_k^1) = \frac{2Y_k}{\lambda_k^n} \exp\left(-\frac{Y_k^2 + X_k^2}{\lambda_k^n}\right) I_0\left(\frac{2X_kY_k}{\lambda_k^n}\right) \tag{10}$$

where $I_0(\cdot)$ represents the zero-order modified Bessel function.

Using the conditional probabilities from (9) and (10), a binary decision in frequency band k is given by

$$b_k = \begin{cases} 0 \ (\text{speech presence}), & \text{if } p(Y_k \mid H_k^1) > p(Y_k \mid H_k^0) \\ 1 \ (\text{speech absence}), & \text{otherwise} \end{cases} \tag{11}$$

However, since $X_k$ is unknown, the approximation $\xi_k = X_k^2/\lambda_k^n$ is used instead, together with $\gamma_k = Y_k^2/\lambda_k^n$. Hence,

$$p(Y_k \mid H_k^1) = \frac{2Y_k}{\lambda_k^n} \exp\left(-\gamma_k - \xi_k\right) I_0\left(2\sqrt{\gamma_k \xi_k}\right) \tag{12}$$

Therefore, the preceding condition (11) can be simplified and expressed in terms of $\xi_k$ and $\gamma_k$ as follows:

$$b_k = \begin{cases} 0 \ (\text{speech presence}), & \text{if } \exp(-\xi_k)\, I_0\left(2\sqrt{\gamma_k \xi_k}\right) > 1 \\ 1 \ (\text{speech absence}), & \text{otherwise} \end{cases} \tag{13}$$

From the above scheme, the dynamic smoothing update factor $\alpha_k(l)$ for frame $l$ is derived in (14) as the sine of an expression in the previous estimate $q_k(l-1)$, the binary decision $b_k$ and a constant. The sine function is used in order to track the a priori SAP faster and more accurately. In the conventional method, the value of $\alpha_k(l)$ is fixed, so the tracking speed of the a priori SAP is constant.

3.2 Computation of the kth Spectral Component Factor

The kth spectral component factor $c_k$ is computed in (15) as a sine function of the a priori SNR $\xi_k$ and a constant. The sine function is used here to prevent clipping of speech onsets or weak components, while the a priori SNR is very helpful for eliminating musical noise.

3.3 Update of the a Priori SAP Estimate

After computing the dynamic smoothing update factor and the kth spectral component factor, the a priori SAP estimate, denoted as $q_k(l)$, is updated in (16) from the previous estimate $q_k(l-1)$ using $\alpha_k(l)$ and $c_k$.

The overall algorithm for estimating the a priori SAP can therefore be summarized as follows:

1. Using the conditional probabilities from equations (9) and (10), make a binary decision for each frequency bin according to (13).

2. Compute the dynamic smoothing update factor using (14).

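3. Compute the kth spectral component factor using (15).

4. Update the a priori SAP estimate using (16).

Table 1: segSNR scores (dB) for various estimators at input SNRs of 0 dB to 15 dB

| Noise  | Method          | 0 dB  | 5 dB  | 10 dB | 15 dB |
|--------|-----------------|-------|-------|-------|-------|
| White  | Hard            | 0.956 | 3.242 | 5.709 | 8.043 |
| White  | Soft            | 0.863 | 3.160 | 5.635 | 7.998 |
| White  | Malah et al.    | 0.867 | 3.188 | 5.595 | 7.996 |
| White  | Cohen           | 1.065 | 3.399 | 5.909 | 8.162 |
| White  | Proposed Method | 1.234 | 3.465 | 5.913 | 8.204 |
| Street | Hard            | -0.73 | 1.255 | 4.110 | 6.373 |
| Street | Soft            | -0.74 | 1.319 | 4.131 | 6.403 |
| Street | Malah et al.    | -0.82 | 1.141 | 4.062 | 6.372 |
| Street | Cohen           | -0.62 | 1.242 | 4.102 | 6.424 |
| Street | Proposed Method | -0.59 | 1.420 | 4.231 | 6.453 |
| Babble | Hard            | -1.43 | 0.812 | 3.591 | 6.153 |
| Babble | Soft            | -1.43 | 0.781 | 3.637 | 6.181 |
| Babble | Malah et al.    | -1.59 | 0.643 | 3.534 | 6.054 |
| Babble | Cohen           | -1.40 | 0.839 | 3.686 | 6.163 |
| Babble | Proposed Method | -1.37 | 0.853 | 3.689 | 6.195 |
| Car    | Hard            | -0.12 | 2.127 | 4.661 | 7.291 |
| Car    | Soft            | -0.19 | 2.066 | 4.586 | 7.266 |
| Car    | Malah et al.    | -0.35 | 2.013 | 4.560 | 7.245 |
| Car    | Cohen           | 0.014 | 2.280 | 4.715 | 7.400 |
| Car    | Proposed Method | 0.158 | 2.368 | 4.845 | 7.453 |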
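The binary decision of (13), which drives the first step of the procedure in Section 3.3, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, NumPy and SciPy are assumed to be available, and the test is evaluated in the log domain with SciPy's exponentially scaled Bessel function `i0e` so that large a posteriori SNRs do not overflow. The second function is a generic first-order recursion with a fixed factor `a`, included only as a stand-in for the conventional fixed-factor scheme mentioned in Section 3.1. It is not the update rule of (14)-(16).

```python
import numpy as np
from scipy.special import i0e  # i0e(x) = exp(-|x|) * I0(x)


def binary_decision(xi, gamma):
    """Decision b_k of Eq. (13): 0 = speech presence, 1 = speech absence.

    xi: a priori SNR, gamma: a posteriori SNR (scalars or arrays).
    """
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    arg = 2.0 * np.sqrt(gamma * xi)
    # log of exp(-xi) * I0(arg), using I0(arg) = exp(arg) * i0e(arg)
    log_ratio = arg - xi + np.log(i0e(arg))
    return np.where(log_ratio > 0.0, 0, 1)


def fixed_factor_update(q_prev, b, a=0.1):
    """Hypothetical fixed-factor recursion q = (1 - a) * q_prev + a * b.

    Stand-in for a conventional scheme with constant tracking speed;
    the value a = 0.1 is an assumption made for this example.
    """
    return (1.0 - a) * np.asarray(q_prev, dtype=float) + a * np.asarray(b, dtype=float)


# Two example bins with the same a priori SNR but different a posteriori SNR.
xi = np.array([1.0, 1.0])
gamma = np.array([10.0, 0.1])
b = binary_decision(xi, gamma)
q = fixed_factor_update(np.array([0.5, 0.5]), b)
print(b, q)
```

With these inputs the first bin is classified as speech present and the second as speech absent, so the stand-in recursion lowers the SAP estimate of the first bin and raises that of the second. With a fixed `a`, both move at the same rate, which is the limitation the dynamic factor of Section 3.1 is meant to address.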

4 Performance Evaluation

For evaluation purposes, we select state-of-the-art approaches for comparison, including the hard-decision [3] and soft-decision [3] methods, Malah et al. [6] and Cohen [1, 10]. Each a priori SAP method is integrated into the OM-LSA estimator [1]. Speech data are segmented into 20-ms frames using a Hamming window with 75% overlap. A total of 15 speech utterances corrupted by babble, car, street and white noise at SNR levels ranging from 0 dB to 15 dB are taken from the NOIZEUS database [13], which contains IEEE sentences corrupted by real-world noise from the AURORA database [14]. The segmental SNR is chosen as the objective measure. It is known as the best measure in terms of background noise distortion and takes residual noise into account [15]. Higher segSNR values indicate that the enhanced speech is more similar to the clean speech. In addition, a visual study of speech spectrograms complements the evaluation.

From the segSNR scores of the various a priori SAP estimators reported in Table 1, the first conclusion we can draw is that the proposed estimator achieves the highest segSNR values of all the estimators for every type of noise and at every input SNR. In other words, the proposed method is more effective at suppressing residual noise and minimizing speech distortion than the conventional estimators. The advantage of the proposed estimator is more significant for car noise, and at low input SNR levels for all types of noise. This is partly due to the reliable decisions the proposed estimator makes when classifying frames as speech present or speech absent, and partly due to its capacity to track the SAP effectively in each frequency band even at low input SNR. The estimator proposed by Cohen also achieves good results compared with the other ones, which leads us to choose it for our study of speech spectrograms.

Figure 1 shows the enhanced speech obtained using the Cohen estimator (panel c) and the proposed estimator (panel d). Spectrograms of the clean speech and of the noisy speech (corrupted by street noise at 5 dB input SNR) are given in panel (a) and panel (b), respectively. Visual inspection of the spectrograms shows that, while the proposed estimator and the Cohen estimator perform about equally in terms of background noise suppression, the proposed algorithm performs better in minimizing the speech distortions introduced during the enhancement process (see arrows). This confirms the results obtained in Table 1.
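For readers who wish to reproduce scores of the kind reported in Table 1, the following Python sketch shows one common way to compute an average segmental SNR. It is an assumption-laden illustration, not the evaluation code used for this paper: the function name, the 8 kHz sampling rate, and the per-frame clamping range of -10 dB to 35 dB are choices made for this example. Only the 20-ms frame length and 75% overlap are taken from the text above.

```python
import numpy as np


def segmental_snr(clean, enhanced, fs=8000, frame_ms=20, overlap=0.75,
                  min_db=-10.0, max_db=35.0):
    """Average segmental SNR in dB between a clean and an enhanced signal.

    fs, min_db and max_db are assumptions made for this sketch.
    """
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n = min(len(clean), len(enhanced))
    frame_len = int(fs * frame_ms / 1000)
    hop = max(1, int(frame_len * (1.0 - overlap)))
    if n < frame_len:
        raise ValueError("signals are shorter than one frame")

    eps = np.finfo(float).eps  # guards against division by zero and log(0)
    scores = []
    for start in range(0, n - frame_len + 1, hop):
        x = clean[start:start + frame_len]
        err = x - enhanced[start:start + frame_len]
        snr_db = 10.0 * np.log10((np.sum(x ** 2) + eps) / (np.sum(err ** 2) + eps))
        # clamp each frame so silent or perfectly matched frames do not dominate
        scores.append(np.clip(snr_db, min_db, max_db))
    return float(np.mean(scores))


# Example: a tone whose "enhanced" version is attenuated by 10 percent.
t = np.arange(8000) / 8000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
print(segmental_snr(clean, 0.9 * clean))
```

Because the residual in this example is exactly one tenth of the clean signal in every frame, each frame contributes 20 dB and the average is 20 dB as well. Real enhanced speech gives much lower values, as the scores in Table 1 show.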

5 Conclusions

This work proposed an improved method for estimating the a priori speech absence probability for speech enhancement by dynamically computing its parameters. The proposed estimator is based on comparing the conditional probabilities of the noisy speech magnitude under the hypotheses that speech is absent or present, and on dynamically computing the parameters of the a priori SAP using two factors: a smoothing-update factor and a factor related to the kth spectral component. The smoothing-update factor, which is based on a decision made in each frequency band on whether speech is present or absent, is computed by recursively averaging past spectral values of the a priori SAP. The kth spectral component factor is computed using the a priori signal-to-noise ratio (SNR). In this paper, we showed that the performance of the optimally modified log-spectral amplitude estimator (OM-LSA) can be significantly improved by using the proposed estimator. The proposed method can also be integrated directly into any filter that requires such an estimate.

Figure 1: Speech spectrograms, with Time (s) on the horizontal axis and Frequency (Hz) on the vertical axis: (a) clean speech, (b) speech corrupted by street noise at 5 dB, (c) enhanced speech, Cohen estimator, (d) enhanced speech, proposed estimator. Arrows indicate speech components that are preserved during enhancement by the proposed method but distorted during enhancement with the Cohen estimator.

6 Acknowledgment

This work was supported by the National Science Foundation of China (Grant No. 61173106) and the Key Program of Hunan Provincial Natural Science Foundation of China (Grant No. 10JJ2046).

7 References

[1] Israel Cohen. "Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator"; IEEE Signal Processing Letters, Vol. 9, No. 4, 113-116, 2002.

[2] Song Joo Lee. "Method for estimation priori SAP based on statistical model"; Electronics and Telecommunications Research Institute, Daejeon (KR), 2008.

[3] I. Y. Soon, S. N. Koh, and C. K. Yeo. "Improved noise suppression filter using self-adaptive estimator of probability of speech absence"; Signal Processing, Vol. 75, 151-159, 1999.

[4] Y. Ephraim and D. Malah. "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator"; IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-32, No. 6, 1109-1121, 1984.

[5] N. S. Kim and J.-H. Chang. "Spectral enhancement based on global soft decision"; IEEE Signal Processing Letters, Vol. 7, No. 5, 108-110, 2000.

[6] D. Malah, R. V. Cox, and A. J. Accardi. "Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments"; in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 789-792, 1999.

[7] C. Min-Seok and K. Hong-Goo. "An improved estimation of a priori speech absence probability for speech enhancement: in perspective of speech perception"; IEEE ICASSP, 2005.

[8] C. Jae-Hun, C. Joon-Hyuk, J. Yu-Gwang and K. Nam-Soo. "Speech enhancement based on improved speech presence uncertainty tracking technique"; Inter-noise, 2011.

[9] S. Young-ho and L. Sang-min. "Improved speech absence probability estimation based on environmental noise classification"; J. Cent. South Univ., Vol. 19, 2548-2553, 2012.

[10] I. Cohen and B. Berdugo. "Speech enhancement for nonstationary noise environments"; Signal Processing, Vol. 81, No. 11, 2403-2418, Oct. 2001.

[11] Y. Ephraim and D. Malah. "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator"; IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-33, 443-445, Apr. 1985.

[12] R. J. McAulay and M. L. Malpass. "Speech enhancement using a soft decision noise suppression filter"; IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-28, 137-145, Apr. 1980.

[13] Y. Hu and P. Loizou. "Subjective evaluation and comparison of speech enhancement algorithms"; Speech Communication, Vol. 49, 588-601, 2007.

[14] H. Hirsch and D. Pearce. "The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions"; ISCA ITRW ASR, Paris, France, 18-20, Sept. 2000.

[15] ITU-T Rec. P.862. "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs", 2000.
