An Improved Method for Estimating the a Priori Probability of Speech Absence for Enhancement of Speech
Aboubakar Nasser Samatin Njikam and Huan Zhao
School of Information Science and Engineering, Hunan University, Changsha, Hunan, China
Abstract - The efficiency of many noise suppression filters relies on an accurate estimation of their parameters, such as the power spectral density (PSD) of the noise and the a priori speech absence probability (SAP). This work addresses the problem of estimating the a priori SAP. The proposed method relies on comparing the conditional probabilities of the noisy speech magnitude, assuming that speech is absent or present, and on dynamically computing the a priori SAP using two factors: a smoothing-update factor and a factor related to the kth spectral component. The smoothing-update factor, which is based on a decision made in each frequency band on whether speech is present or absent, is computed by recursively averaging past spectral values of the a priori SAP. The factor related to the kth spectral component is computed using the a priori signal-to-noise ratio (SNR). The efficiency of the proposed algorithm relative to competing ones, both in terms of background noise suppression and speech distortion, is assessed using an objective measure, namely the average segmental signal-to-noise ratio (segSNR), together with a study of speech spectrograms.
Keywords: Signal and speech processing, Enhancement techniques, Speech Absence Probability
1
Introduction
The optimally modified log-spectral amplitude (OM-LSA) estimator [1] was proposed to minimize the mean-square error of the log-spectra for speech signals under signal presence uncertainty. The algorithm is very efficient for noise suppression and for improving speech quality if its required parameters, such as the a priori signal-to-noise ratio (SNR) and the a priori speech absence probability (SAP), are correctly estimated. This paper focuses on the estimation of the a priori SAP. The a priori SAP refers to the probability that speech is not present in a given frame and frequency bin of the input signal [2].
Several algorithms for estimating and updating the a priori SAP have been proposed. In [3], two methods for estimating the a priori SAP were proposed. The first method, which is based on comparing the conditional probabilities of the noisy speech magnitude, assuming that speech is present or absent, can be considered a hard-decision approach. The second method proposed in [3] is known as a soft-decision approach. Unlike the hard-decision method, which classifies the input signal into speech presence or speech absence, the soft-decision method is directly designed to represent the probability that the input signal comes from a speech absence state. These two methods were incorporated into an MMSE estimator and have been shown to yield about 2-3 dB SNR improvement over the estimator that used a fixed a priori SAP [4, 5, 6]. In [6], Malah et al. proposed a different method based on the a posteriori SNR. The decision on speech absence was based on a comparison of the estimated a posteriori SNR against a threshold. Its main drawback is that misclassification of voice activity causes unwanted artifacts [7, 8, 9]. Cohen [1, 10] presented a technique for computing the a priori SAP estimator that is essentially based on three factors. In his work, a soft decision made on each frame on whether speech is present or not, together with local and global factors computed from the a priori SNR, are used as parameters. This technique has been shown to yield better performance in terms of the objective segmental SNR measure than the methods cited above. However, as in [6], undesired artifacts are introduced due to the misclassification of speech activity.
In this paper an effective estimator for the a priori SAP is proposed. After a brief description of the OM-LSA estimator, a detailed description of the proposed estimator is presented. The technique is evaluated using an objective measure, namely the average segmental signal-to-noise ratio (segSNR), together with a study of speech spectrograms.
2
Description of the Optimally Modified Log-Spectral Amplitude
We assume a clean signal x corrupted by an uncorrelated additive background noise signal n. In the short-time Fourier transform (STFT) domain,

$$Y_k(l) = X_k(l) + N_k(l)$$

where l stands for the frame index and k for the frequency band index.

Under the following two hypotheses:

$H_k^0(l)$: Speech absent
$H_k^1(l)$: Speech present

and based on a complex Gaussian assumption [5, 7], the conditional probabilities of the noisy speech magnitude are computed as below:

$$p(Y_k(l) \mid H_k^0(l)) = \frac{1}{\pi \lambda_k^n(l)} \exp\left\{-\frac{|Y_k(l)|^2}{\lambda_k^n(l)}\right\}, \qquad p(Y_k(l) \mid H_k^1(l)) = \frac{1}{\pi (\lambda_k^x(l) + \lambda_k^n(l))} \exp\left\{-\frac{|Y_k(l)|^2}{\lambda_k^x(l) + \lambda_k^n(l)}\right\} \tag{1}$$

where $\lambda_k^x(l) = E[|X_k(l)|^2 \mid H_k^1(l)]$ and $\lambda_k^n(l) = E[|N_k(l)|^2]$ indicate the variance of the clean speech and the variance of the noise, respectively.

Therefore, the probability that speech is present, denoted by $p_k(l) = p(H_k^1(l) \mid Y_k(l))$, is derived from Bayes' formula [5]:

$$p_k(l) = \left\{1 + \frac{P(H_k^0(l))}{1 - P(H_k^0(l))}\,(1 + \xi_k(l)) \exp(-v_k(l))\right\}^{-1} \tag{2}$$

in which $\xi_k(l) = \lambda_k^x(l) / \lambda_k^n(l)$ and $\gamma_k(l) = |Y_k(l)|^2 / \lambda_k^n(l)$ represent the a priori SNR and the a posteriori SNR, respectively, and $v_k(l) = \gamma_k(l)\,\xi_k(l) / (1 + \xi_k(l))$. Here $P(H_k^0(l))$ is the a priori SAP.

Let $L = |X|$ be the spectral speech amplitude. Its optimal estimate $L'$ [7], assuming statistically independent spectral components [6], is given by:

$$L'_k(l) = \exp\{E[\log L_k(l) \mid Y_k(l)]\} = G_k(l)\,|Y_k(l)| \tag{3}$$

Assuming a Gaussian statistical model, we get

$$E[\log L_k(l) \mid Y_k(l)] = E[\log L_k(l) \mid Y_k(l), H_k^1(l)]\,p_k(l) + E[\log L_k(l) \mid Y_k(l), H_k^0(l)]\,(1 - p_k(l)) \tag{4}$$

Hence the following assumption can be made. In the absence of speech, a minimum gain $G_{\min}$, determined by a subjective criterion, is used:

$$\exp\{E[\log L_k(l) \mid Y_k(l), H_k^0(l)]\} = G_{\min}\,|Y_k(l)| \tag{5}$$

In the presence of speech, the conditional gain function given by

$$\exp\{E[\log L_k(l) \mid Y_k(l), H_k^1(l)]\} = G_k^{H_1}(l)\,|Y_k(l)| \tag{6}$$

is derived in [6] to be

$$G_k^{H_1}(l) = \frac{\xi_k(l)}{1 + \xi_k(l)} \exp\left(\frac{1}{2} \int_{v_k(l)}^{\infty} \frac{e^{-t}}{t}\,dt\right) \tag{7}$$

Therefore, substituting (4), (5) and (6) into (3), the gain function for the OM-LSA estimator is obtained as

$$G_k(l) = \{G_k^{H_1}(l)\}^{p_k(l)}\,G_{\min}^{\,1 - p_k(l)} \tag{8}$$

In the above scheme, the OM-LSA estimator accounts for the uncertainty of speech presence in real environments, which requires the computation of the speech absence probability (SAP). In the next section, we present an improved method for tracking the a priori SAP.
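To make the gain computation of this section concrete, the following is a minimal Python sketch of the speech presence probability (2), the conditional gain (7) and the OM-LSA gain (8) for a single time-frequency bin. The function names, the numerical evaluation of the exponential integral (power series for small arguments, continued fraction otherwise) and the default value g_min=0.1 are illustrative assumptions and are not taken from the paper.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(v):
    """E1(v) = integral from v to infinity of exp(-t)/t dt, for v > 0."""
    if v <= 0.0:
        raise ValueError("v must be positive")
    if v <= 1.0:
        # Power series: E1(v) = -gamma - ln(v) - sum_{n>=1} (-v)^n / (n * n!)
        total, term = 0.0, 1.0
        for n in range(1, 60):
            term *= -v / n          # term now equals (-v)^n / n!
            total -= term / n
        return -EULER_GAMMA - math.log(v) + total
    # Continued fraction, evaluated with the modified Lentz method
    b = v + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(-v)


def speech_presence_probability(xi, gamma, q):
    """Speech presence probability p_k(l) of eq. (2); q is the a priori SAP."""
    if q >= 1.0:
        return 0.0  # speech is certainly absent
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * math.exp(-v))


def gain_speech_present(xi, gamma):
    """Conditional gain under speech presence, eq. (7)."""
    v = max(gamma * xi / (1.0 + xi), 1e-10)  # keep the E1 argument positive
    return xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(v))


def omlsa_gain(xi, gamma, q, g_min=0.1):
    """OM-LSA gain of eq. (8); g_min=0.1 is an assumed example value."""
    p = speech_presence_probability(xi, gamma, q)
    return gain_speech_present(xi, gamma) ** p * g_min ** (1.0 - p)


if __name__ == "__main__":
    # A priori SNR 1, a posteriori SNR 2, a priori SAP 0.5
    print(omlsa_gain(xi=1.0, gamma=2.0, q=0.5))
```

Because (8) is a geometric weighting, the resulting gain always lies between g_min and the conditional gain of (7), and it moves toward g_min as the a priori SAP q approaches 1. This is why an accurate estimate of q matters for both noise suppression and speech distortion.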
3
The Proposed A Priori SAP Estimation Method
Two factors are involved in the computation of the a priori SAP estimator: a dynamic smoothing update factor, which describes the speech absence probability of the current frame, and a factor $c_k$ related to the kth spectral component.
3.1
Computation of a Dynamic Smoothing Update Factor
The process of computing the dynamic smoothing update factor is described below. Assuming the two following hypotheses

$H_k^0(l)$: Speech absent
$H_k^1(l)$: Speech present

and based on a Gaussian statistical model assumption [5, 7], the conditional probabilities of the observed signal are as follows:

$$p(Y_k \mid H_k^0) = \frac{2 Y_k}{\lambda_k^n} \exp(-\gamma_k) \tag{9}$$

$$p(Y_k \mid H_k^1) = \frac{2 Y_k}{\lambda_k^n} \exp\left(-\frac{Y_k^2 + X_k^2}{\lambda_k^n}\right) I_0\left(\frac{2 X_k Y_k}{\lambda_k^n}\right) \tag{10}$$

where $\gamma_k = Y_k^2 / \lambda_k^n$ is the a posteriori SNR and $I_0(\cdot)$ represents the zero-order modified Bessel function.

Using the conditional probabilities from (9) and (10), a binary decision in frequency band k is given by

$$b_k = \begin{cases} 0 \ (\text{speech presence}), & \text{if } p(Y_k \mid H_k^1) > p(Y_k \mid H_k^0) \\ 1 \ (\text{speech absence}), & \text{otherwise} \end{cases} \tag{11}$$

However, since $X_k$ is unknown, the approximation $\xi_k \approx X_k^2 / \lambda_k^n$ is used instead. Hence,

$$p(Y_k \mid H_k^1) = \frac{2 Y_k}{\lambda_k^n} \exp(-\gamma_k - \xi_k)\, I_0\left(2\sqrt{\xi_k \gamma_k}\right) \tag{12}$$

Therefore, the preceding condition (11) can be simplified and expressed in terms of $\xi_k$ and $\gamma_k$ as follows:

$$b_k = \begin{cases} 0 \ (\text{speech presence}), & \text{if } \exp(-\xi_k)\, I_0\left(2\sqrt{\xi_k \gamma_k}\right) > 1 \\ 1 \ (\text{speech absence}), & \text{otherwise} \end{cases} \tag{13}$$

From the above scheme, a dynamic smoothing update factor for frame l is derived by

(l ) sin 1 q (l 1) b   k k k   (14)

where the accompanying parameter is a constant. The sine function is used in order to track the a priori SAP faster and more accurately. In the conventional method, the value of the smoothing factor is fixed, so the tracking speed of the a priori SAP is constant.

3.2
Computation of the kth Spectral Component Factor
The kth spectral component factor is computed as follows

c sin( k ) k   (15)

where the accompanying parameter is a constant, and $\xi_k$ is the a priori SNR. The sine function here is used to prevent clipping of speech onsets or weak components, while the a priori SNR is very helpful for eliminating musical noise.

3.3
Update of the a Priori SAP Estimate
After computing the dynamic smoothing update factor and the kth spectral component factor, the a priori SAP estimate, denoted as $q_k(l)$, is updated as follows

q (l ) (1 (l )) c q (l 1) (l )   k k k k k   (16)
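As a minimal sketch of the binary decision in (13), the following Python code evaluates the test exp(-xi) * I0(2 * sqrt(xi * gamma)) > 1 for one frequency bin and for a whole frame. The helper names (bessel_i0, binary_decision, frame_decisions) and the series-based evaluation of the Bessel function are illustrative assumptions, not part of the original description.

```python
import math


def bessel_i0(x, tol=1e-12, max_terms=200):
    """Zero-order modified Bessel function I0(x) from its power series.

    I0(x) = sum_{k>=0} (x^2 / 4)^k / (k!)^2. Intended for moderate |x|
    (up to a few tens); very large arguments would need a scaled form.
    """
    quarter_sq = x * x / 4.0
    term = 1.0
    total = 1.0
    for k in range(1, max_terms):
        term *= quarter_sq / (k * k)
        total += term
        if term < tol * total:
            break
    return total


def binary_decision(xi, gamma):
    """Return b_k of eq. (13): 0 for speech presence, 1 for speech absence.

    xi is the a priori SNR and gamma the a posteriori SNR of the bin.
    """
    ratio = math.exp(-xi) * bessel_i0(2.0 * math.sqrt(xi * gamma))
    return 0 if ratio > 1.0 else 1


def frame_decisions(xi_frame, gamma_frame):
    """Apply the decision to every frequency bin of one frame."""
    return [binary_decision(xi, g) for xi, g in zip(xi_frame, gamma_frame)]


if __name__ == "__main__":
    # One bin with a strong observation, one with a weak observation
    print(frame_decisions([1.0, 1.0], [4.0, 0.1]))
```

The quantity compared against 1 is the ratio of (12) to (9), so the common factor 2 Y_k / lambda_k^n and the term exp(-gamma_k) cancel and only the two SNR values are needed.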
The overall algorithm for estimating the a priori SAP can therefore be summarized as follows:
1. Using the conditional probabilities from equations (9) and (10), make a binary decision for each frequency bin according to (13).
2. Compute the dynamic smoothing update factor using (14).
3. Compute the kth spectral component factor using (15).
4. Update the a priori SAP estimate using (16).

Table 1: segSNR scores (in dB) for various estimators at input SNRs of 0, 5, 10 and 15 dB

Noise    Method             0 dB     5 dB     10 dB    15 dB
White    Hard               0.956    3.242    5.709    8.043
White    Soft               0.863    3.160    5.635    7.998
White    Malah et al.       0.867    3.188    5.595    7.996
White    Cohen              1.065    3.399    5.909    8.162
White    Proposed Method    1.234    3.465    5.913    8.204
Street   Hard              -0.73     1.255    4.110    6.373
Street   Soft              -0.74     1.319    4.131    6.403
Street   Malah et al.      -0.82     1.141    4.062    6.372
Street   Cohen             -0.62     1.242    4.102    6.424
Street   Proposed Method   -0.59     1.420    4.231    6.453
Babble   Hard              -1.43     0.812    3.591    6.153
Babble   Soft              -1.43     0.781    3.637    6.181
Babble   Malah et al.      -1.59     0.643    3.534    6.054
Babble   Cohen             -1.40     0.839    3.686    6.163
Babble   Proposed Method   -1.37     0.853    3.689    6.195
Car      Hard              -0.12     2.127    4.661    7.291
Car      Soft              -0.19     2.066    4.586    7.266
Car      Malah et al.      -0.35     2.013    4.560    7.245
Car      Cohen              0.014    2.280    4.715    7.400
Car      Proposed Method    0.158    2.368    4.845    7.453

4
Performance Evaluation
For evaluation purposes, we select state-of-the-art approaches for comparison, including the hard-decision [3] and soft-decision [3] methods, Malah et al. [6] and Cohen [1, 10]. Furthermore, we integrate the different a priori SAP methods into the OM-LSA estimator [1]. Speech data, segmented into 20-ms frames using a Hamming window with 75% overlap, is used for analysis. A total of 15 speech utterances corrupted by babble, car, street and white noise at SNR levels ranging from 0 dB to 15 dB are taken from the NOIZEUS database [13], which contains IEEE sentences corrupted by real-world noise from the AURORA database [14]. An objective measure, namely the segmental SNR, is chosen for evaluation. The segmental SNR is known as the best measure in terms of background noise distortion and takes residual noise into account [15]. Higher segSNR values indicate that the enhanced speech is more similar to the clean speech. Moreover, a visual study of speech spectrograms is carried out to complement the evaluation. Analyzing the segSNR scores obtained by the various a priori SAP estimators reported in Table 1, the first conclusion we can draw is that the proposed estimator achieves the highest segSNR values among all estimators for every type of noise and at every input SNR. In other words, the proposed method is more efficient in residual noise suppression and speech distortion minimization than the conventional estimators. The advantage of the proposed estimator is most pronounced for car noise and, for all types of noise, at low input SNR levels. This is partially due to the reliable decision made by the proposed estimator when classifying a bin as speech present or speech absent, but also to its capacity to effectively track the SAP in each frequency band even at low input SNRs. The estimator proposed by Cohen also achieves good results compared to the other ones. This leads us to choose that estimator for our study of speech spectrograms. Figure 1 shows the enhanced speech obtained using the Cohen estimator (panel c) and the proposed estimator (panel d). Spectrograms of the clean speech and of the noisy speech (corrupted by street noise at 5 dB input SNR) are given in panel (a) and panel (b), respectively. Visual inspection of the spectrograms shows that while the proposed estimator and the estimator proposed by Cohen perform about the same in terms of background noise suppression, the proposed algorithm performs better in minimizing the speech distortions introduced during the enhancement process (see arrows). This confirms the results obtained in Table 1.
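For reference, a segmental SNR of the kind reported in Table 1 can be sketched in Python as below. The paper does not spell out its exact formula, so this sketch assumes the commonly used definition: the frame-wise SNR between the clean and enhanced signals, in dB, limited to a fixed range and averaged over frames. The frame length of 160 samples and hop of 40 samples correspond to 20-ms frames with 75% overlap at an assumed 8 kHz sampling rate. The clamping range of -10 dB to 35 dB and the function name segmental_snr are likewise assumptions.

```python
import math


def segmental_snr(clean, enhanced, frame_len=160, hop=40, lo=-10.0, hi=35.0):
    """Average segmental SNR in dB between a clean and an enhanced signal.

    Assumed definition: mean over frames of
    10 * log10(sum(x^2) / sum((x - x_hat)^2)), clamped to [lo, hi].
    """
    n = min(len(clean), len(enhanced))
    scores = []
    for start in range(0, n - frame_len + 1, hop):
        signal_energy = 0.0
        error_energy = 0.0
        for c, e in zip(clean[start:start + frame_len],
                        enhanced[start:start + frame_len]):
            diff = c - e
            signal_energy += c * c
            error_energy += diff * diff
        if signal_energy == 0.0:
            continue  # skip frames with no clean-speech energy
        if error_energy == 0.0:
            snr_db = hi  # perfect reconstruction in this frame
        else:
            snr_db = 10.0 * math.log10(signal_energy / error_energy)
        scores.append(min(max(snr_db, lo), hi))
    if not scores:
        raise ValueError("no usable frames")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    clean = [math.sin(0.1 * i) for i in range(800)]
    enhanced = [1.1 * c for c in clean]  # 10% amplitude error
    print(segmental_snr(clean, enhanced))
```

Clamping each frame keeps silent or nearly perfect frames from dominating the average, which is one reason the segmental measure reflects residual noise better than a single global SNR.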
Figure 1: Speech spectrograms (frequency in Hz versus time in s): (a) Clean speech, (b) Speech corrupted by street noise at 5 dB, (c) Enhanced speech: Cohen estimator, (d) Enhanced speech: Proposed estimator. Arrows indicate speech components that are preserved during enhancement with the proposed method but distorted during enhancement with the Cohen estimator.
5
Conclusions
This work proposed an improved method for estimating the a priori speech absence probability for speech enhancement by dynamically computing its parameters. The proposed estimator is based on comparing the conditional probabilities of the noisy speech magnitude, assuming that speech is absent or present, and on dynamically computing the a priori SAP using two factors: a smoothing-update factor and a factor related to the kth spectral component. The smoothing-update factor, which is based on a decision made in each frequency band on whether speech is present or absent, is computed by recursively averaging past spectral values of the a priori SAP. The kth spectral component factor is computed using the a priori signal-to-noise ratio (SNR). In this paper, we showed that by using the proposed estimator, the performance of the optimally modified log-spectral amplitude (OM-LSA) estimator can be consistently improved in terms of segSNR. The proposed method can also be directly integrated into any filter that requires such an estimate.
6
Acknowledgment
This work was supported by the National Science Foundation of China (Grant No. 61173106) and the Key Program of Hunan Provincial Natural Science Foundation of China (Grant No. 10JJ2046).
7
References
[1] Israel Cohen. “Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator”; IEEE Signal Processing Letters, Vol. 9 No. 4, 113-116, 2002.
[2] Song Joo Lee. “Method for estimation priori SAP based on statistical model”; Electronics and Telecommunications Research Institute, Daejeon (KR), 2008.
[3] I. Y. Soon, S. N. Koh, and C. K. Yeo. “Improved noise suppression filter using self-adaptive estimator of probability of speech absence”; Signal Processing, Vol. 75, 151-159, 1999.
[4] Y. Ephraim and D. Malah. “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator”; IEEE Trans. Acoustic. Speech Signal Processing, ASSP Vol. 32 No. 6, 1109-1121, 1984.
[5] N. S. Kim and J.-H. Chang. “Spectral enhancement based on global soft decision”; IEEE Signal Processing Letters, Vol. 7 No. 5, 108-110, 2000.
[6] D. Malah, R. V. Cox, and A. J. Accardi. “Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments”; in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 789-792, 1999.
[7] C. Min-Seok and K. Hong-Goo. “An improved estimation of a priori speech absence probability for speech enhancement: in perspective of speech perception”; IEEE ICASSP, 2005.
[8] C. Jae-Hun, C. Joon-Hyuk, J. Yu-Gwang and K. Nam-Soo. “Speech enhancement based on improved speech presence uncertainty tracking technique”; Inter-noise, 2011.
[9] S. Young-ho and L. Sang-min. “Improved speech absence probability estimation based on environmental noise classification”; J. Cent. South Univ., Vol. 19, 2548-2553, 2012.
[10] I. Cohen and B. Berdugo. “Speech enhancement for nonstationary noise environments”; Signal Processing, Vol. 81, No. 11, 2403-2418, Oct 2001.
[11] Y. Ephraim and D. Malah. “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator”; IEEE Trans. Acoustic. Speech Signal Processing, ASSP Vol. 33, 443-445, Apr. 1985.
[12] R. J. McAulay and M. L. Malpass. “Speech enhancement using a soft decision noise suppression filter”; IEEE Trans. Acoustic. Speech Signal Processing, ASSP Vol. 28, 137-145, Apr. 1980.
[13] Y. Hu and P. Loizou. “Subjective evaluation and comparison of speech enhancement algorithms”; Speech Communication, Vol. 49, 588-601, 2007.
[14] H. Hirsch and D. Pearce. “The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions”; ISCA ITRW ASR, Paris, France, 18-20, Sept 2000.
[15] ITU-T Rec. P.862. “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs”, 2000.