Local partial least square regression for spectral mapping in voice conversion

Xiaohai Tian∗, Zhizheng Wu†, Eng Siong Chng∗†
∗ Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly, Nanyang Technological University, Singapore
† School of Computer Engineering, Nanyang Technological University, Singapore
[email protected], [email protected], [email protected]

Abstract—The joint density Gaussian mixture model (JD-GMM) based method has been widely used in voice conversion due to its flexible implementation. However, the statistical averaging effect during model parameter estimation over-smooths the target spectral trajectories. Motivated by the local linear transformation method, which uses neighboring data rather than all the training data to estimate the transformation function for each feature vector, we propose a local partial least square method that avoids both the over-smoothing problem of JD-GMM and the over-fitting problem of local linear transformation when training data are limited. We conducted experiments on the VOICES database and measured both the spectral distortion and the correlation coefficient of the spectral parameter trajectories. The experimental results show that our proposed method obtains better performance than the baseline methods.

I. Introduction

Voice conversion is a process that modifies a speech signal uttered by one speaker (source) to sound like a desired target speaker without changing the linguistic information. The conversion process, which transforms the source feature vectors into the target feature space, includes two phases: an off-line training phase and a real-time conversion phase. During the off-line training phase, a transformation function is estimated from parallel source and target feature vector sequences. In the real-time conversion phase, the transformation function is applied to an input test utterance to generate the converted speech signal. The features can be any parameters that represent the speaker identity, such as the spectral envelope [1], [2], prosody [3], [4], [5], and duration [3], [6]. As spectral envelopes carry much of the speaker-characteristic information, spectral mapping is one of the most important techniques in voice conversion.

To implement a robust spectral conversion function, a number of techniques have been proposed. Linear conversion functions have been implemented with the joint density Gaussian mixture model (JD-GMM) under both minimum mean square error and maximum likelihood criteria [1], [2], partial least square regression [7], mixtures of factor analyzers [8], local linear transformation [9], and so on. In addition, assuming that the source and target speech features have a non-linear relationship, methods such as artificial neural networks [10], kernel partial least squares [11], and conditional restricted Boltzmann machines [12] have also been proposed. Due to its probabilistic treatment and flexible

implementation, the JD-GMM based method [1], [2] has become the mainstream approach. However, over-smoothing and over-fitting problems of the JD-GMM method have been reported in many studies [13], [7], [9], [11]. To address the over-fitting problem caused by full covariance estimation, in [7] partial least square regression is combined with a Gaussian mixture model to replace the transformation matrix estimated from the full covariance matrix, while keeping the mean vectors of the original JD-GMM in the conversion function. This combination, however, cannot avoid the over-smoothing caused by the statistical averaging involved in estimating the mean vectors of the JD-GMM. One successful method for reducing the over-smoothing problem is the local linear transformation method [9]. In this method, each frame has its own transformation matrix, estimated from its K nearest frame pairs in the training data in terms of Euclidean distance. One problem of this method is that when the number of nearest neighbors is small, over-fitting is still observed; when K is too large, over-smoothing occurs [9]. Therefore, choosing the optimal K is an important issue in the local linear transformation method.

In this study, motivated by the local linear transformation method, we propose a local partial least square (LPLS) regression method to avoid the over-fitting and over-smoothing problems in voice conversion. Similar to local linear transformation, each test frame has an individual linear transformation matrix estimated from the test frame's K nearest neighbors in the training data. Our strategy is then to use partial least square regression to project both the source and target speech vectors to a low-dimensional space. As such, we are able to estimate a robust transformation function from a limited number of parallel frames.

II. Baseline conversion methods

In this section, we briefly introduce the two baseline conversion methods used in this study: the joint density Gaussian mixture model (JD-GMM) and local linear transformation methods.

A. Joint density Gaussian mixture model method

The first baseline is the JD-GMM method [1], [2], [14], which employs a Gaussian mixture model to model the joint probability distribution of source and target feature vectors. This method was originally proposed by Kain et al. [14] and has become a mainstream approach [2].

Given training data of a source speaker X and a target speaker Y, dynamic time warping (DTW) is used to align the source spectral vectors X = [x_1, x_2, ..., x_n, ..., x_N] and the target spectral vectors Y = [y_1, y_2, ..., y_m, ..., y_M] into a parallel training corpus Z = [z_1, z_2, ..., z_t, ..., z_T]. Here, x_n \in \mathbb{R}^d, y_m \in \mathbb{R}^d, and z_t = [x_n^\top, y_m^\top]^\top \in \mathbb{R}^{2d}. With the paired spectral vectors Z, a Gaussian mixture model (GMM) is adopted to represent the joint probability density of X and Y, written as:

    P(Z) = P(X, Y) = \sum_{l=1}^{L} w_l^{(z)} \, \mathcal{N}(z \mid \mu_l^{(z)}, \Sigma_l^{(z)}),    (1)

where

    \mu_l^{(z)} = \begin{bmatrix} \mu_l^{(x)} \\ \mu_l^{(y)} \end{bmatrix}, \quad
    \Sigma_l^{(z)} = \begin{bmatrix} \Sigma_l^{(xx)} & \Sigma_l^{(xy)} \\ \Sigma_l^{(yx)} & \Sigma_l^{(yy)} \end{bmatrix}

are the mean vector and the covariance matrix of the l-th Gaussian component \mathcal{N}(z \mid \mu_l^{(z)}, \Sigma_l^{(z)}), respectively. w_l^{(z)} is the prior probability of the l-th Gaussian component, with the constraint \sum_{l=1}^{L} w_l^{(z)} = 1, and L is the total number of Gaussian components.

In the off-line training process, the expectation-maximization (EM) algorithm is employed to estimate the parameters of the joint density Gaussian mixture model, \lambda^{(z)} = \{w_l^{(z)}, \mu_l^{(z)}, \Sigma_l^{(z)} \mid l = 1, 2, ..., L\}, in the maximum-likelihood sense. In the conversion phase, the joint probability density GMM is used to formulate a conversion function. For each source feature vector x, the conversion function F(x), which predicts the target speaker's feature vector \hat{y} in the minimum mean square error sense, can be expressed as:

    F(x) = \sum_{l=1}^{L} p_l(x) \left( \mu_l^{(y)} + \Sigma_l^{(yx)} (\Sigma_l^{(xx)})^{-1} (x - \mu_l^{(x)}) \right),    (2)

where p_l(x) = \frac{w_l \, \mathcal{N}(x \mid \mu_l^{(x)}, \Sigma_l^{(xx)})}{\sum_{k=1}^{L} w_k \, \mathcal{N}(x \mid \mu_k^{(x)}, \Sigma_k^{(xx)})} is the posterior probability of the source vector x for the l-th Gaussian component.

During the JD-GMM parameter estimation process, the mean vector and the covariance matrix of the l-th Gaussian component are calculated as:

    \mu_l^{(z)} = \frac{\sum_{t=1}^{T} p_l(z_t, \lambda^{(z)}) \, z_t}{\sum_{t=1}^{T} p_l(z_t, \lambda^{(z)})},    (3)

    \Sigma_l^{(z)} = \frac{\sum_{t=1}^{T} p_l(z_t, \lambda^{(z)}) (z_t - \mu_l^{(z)})(z_t - \mu_l^{(z)})^\top}{\sum_{t=1}^{T} p_l(z_t, \lambda^{(z)})}.    (4)

From (3) and (4), we notice that all the training samples are used for the mean and covariance calculation, the so-called statistical average, which results in over-smoothing of the converted speech.

B. Local linear transformation method

Equation (2) can also be presented as a linear regression model:

    y_i = B x_i + \varepsilon,    (5)

where x_i and y_i denote the source and target observations for the i-th frame, B is the regression (transformation) matrix, and \varepsilon is the regression residual. Different from equation (2), which uses all the training samples to estimate the transformation matrix, the local linear transformation (LLT) method [9] uses neighboring data to estimate an individual transformation matrix for each feature vector. We use this method as our second baseline; a brief introduction follows.

Firstly, for the feature vector of each test frame x_i^{test}, the K nearest neighbors (KNN) are selected from X in terms of Euclidean distance:

    d(x_i^{test}, X) = \| x_i^{test} - X \|.    (6)

The K paired vectors in Y are selected simultaneously. X_i^{KNN} and Y_i^{KNN} are:

    X_i^{KNN} = [x_{i,1}, x_{i,2}, ..., x_{i,K}], \quad Y_i^{KNN} = [y_{i,1}, y_{i,2}, ..., y_{i,K}],    (7)

where x_{i,k} denotes the k-th nearest vector in X to x_i^{test}, X_i^{KNN} \in \mathbb{R}^{d \times K} and Y_i^{KNN} \in \mathbb{R}^{d \times K}.

Then, for each test feature vector x_i^{test}, the linear transformation B_i \in \mathbb{R}^{d \times d} is calculated from the selected neighborhood, using the linear regression model:

    Y_i^{KNN} = B_i X_i^{KNN}.    (8)

Under the least squares criterion, B_i is:

    B_i = Y_i^{KNN} (X_i^{KNN})^\top \left( X_i^{KNN} (X_i^{KNN})^\top \right)^{-1}.    (9)

Finally, the converted speech vector \hat{y}_i for the source vector x_i^{test} is given by:

    \hat{y}_i = B_i x_i^{test}.    (10)

Obviously, the quality of the conversion is influenced by the neighborhood selection. The converted results can be used to select new K nearest neighbors to improve the performance: we combine the test source x_i^{test} and the converted result \hat{y}_i into a new joint vector and then choose the KNN again from the whole parallel training set Z:

    d\!\left( \begin{bmatrix} x_i^{test} \\ \hat{y}_i \end{bmatrix}, Z \right) = \left\| \begin{bmatrix} x_i^{test} \\ \hat{y}_i \end{bmatrix} - Z \right\|.    (11)

This reselection process is iterated until the neighborhoods determined in consecutive steps become virtually identical or sufficiently similar.
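The per-frame LLT mapping of (6)-(10) can be sketched as follows. This is a minimal NumPy illustration written by us, not the authors' implementation; the array shapes, the single-step solve, and the use of a pseudoinverse for numerical safety are our assumptions:

```python
import numpy as np

def llt_convert(x_test, X, Y, K=200):
    """Convert one source frame with local linear transformation (LLT).

    x_test : (d,)   source test frame
    X, Y   : (d, T) time-aligned source/target training vectors
    K      : number of nearest neighbours
    """
    # Eq. (6): Euclidean distance from the test frame to every training frame
    dist = np.linalg.norm(X - x_test[:, None], axis=0)
    idx = np.argsort(dist)[:K]           # indices of the K nearest neighbours

    # Eq. (7): paired source/target neighbourhoods, each of shape (d, K)
    X_knn, Y_knn = X[:, idx], Y[:, idx]

    # Eq. (9): least-squares solution of Y_knn = B X_knn
    B = Y_knn @ X_knn.T @ np.linalg.pinv(X_knn @ X_knn.T)

    # Eq. (10): converted frame
    return B @ x_test
```

Note that when K is small relative to the feature dimension d, the normal-equation matrix becomes poorly conditioned, which is exactly the over-fitting regime discussed above; `pinv` merely keeps the sketch numerically safe and does not remove that problem.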

III. Proposed local partial least square regression method

From section II, we note that the local linear transformation method uses adjacent training data, rather than the whole training set, to estimate the transformation matrix and thus avoids the over-smoothing problem. The size of the neighborhood, however, affects the performance. In order to obtain a robust transformation matrix, partial least square regression is employed to relax the KNN selection constraint.

A. Basic model of partial least square

Compared with the linear regression in (5), partial least square (PLS) [15] is a regression method that finds the relationship between source and target feature vectors in a new low-dimensional space. The PLS model is written as follows:

    X = R W^\top + E,    (12)
    Y = Q U^\top + F,    (13)

where R and Q are the factor loading matrices of the source and target vectors; W^\top and U^\top are the low-dimensional representation matrices for the source X and target Y, respectively; and E and F are residual components. Here, R \in \mathbb{R}^{d \times h}, Q \in \mathbb{R}^{d \times h}, W^\top \in \mathbb{R}^{h \times T}, U^\top \in \mathbb{R}^{h \times T}, X \in \mathbb{R}^{d \times T} and Y \in \mathbb{R}^{d \times T}, where d is the dimension of the feature vector and h is the number of PLS components. If h = d, i.e. the number of latent variables equals the number of predictors, PLS becomes equivalent to the standard linear multivariate regression in (5) and (9). Usually, the number of PLS components h is lower than the feature dimension d, so PLS is able to produce a robust transformation from a small amount of training data. In this regard, PLS regression is suitable for estimating a transformation matrix from limited data.

B. Combining PLS with LLT

As described in section III-A, PLS is able to achieve good performance given limited training observations. In order to prevent the over-fitting problem of the LLT method, we propose to combine PLS with LLT. Specifically, PLS substitutes for the least squares solution used in the LLT method to obtain the transformation matrix for each test frame. In this case, the K nearest neighbors X_i^{KNN} and Y_i^{KNN}, obtained in the first phase of LLT described in section II-B, are used as input to the PLS method. Different from [9], where the KNN selection is based on perceptual evaluation, we use objective evaluation measures to choose the neighborhood.

C. SIMPLS Algorithm

Several variant methods could be used to solve the PLS regression problem. In this paper, we use the SIMPLS (simple partial least squares) algorithm proposed by de Jong [16]. This algorithm provides faster computation, as the weight factors can be obtained without matrix inverses. A brief description of the algorithm follows.

function SIMPLS(X^{KNN}, Y^{KNN}, h)
1:  X = (X^{KNN})^\top
2:  Y = (Y^{KNN})^\top
3:  Y_0 = Y - MEAN(Y)
4:  S = X^\top \times Y_0
5:  for i = 1, ..., h do
6:    q = dominant eigenvector of S^\top \times S
7:    r = S \times q
8:    w = X \times r
9:    w = w - MEAN(w)
10:   \|w\| = SQRT(w^\top \times w)
11:   w = w / \|w\|
12:   r = r / \|w\|
13:   p = X^\top \times w
14:   q = Y_0^\top \times w
15:   u = Y_0 \times q
16:   v = p
17:   if i > 1 then
18:     v = v - V \times (V^\top \times p)
19:     u = u - W \times (W^\top \times u)
20:   end if
21:   v = v / SQRT(v^\top \times v)
22:   S = S - v \times (v^\top \times S)
23:   store r, w, p, q, u and v as the i-th columns of matrices R, W, P, Q, U and V, respectively
24: end for

Then, the regression matrix can be obtained by B = Q \times R^\top.

IV. Experiments

To evaluate the performance of the proposed method, several experiments were conducted, with the baseline methods for comparison.

A. Acoustic Data

The VOICES database [17], with the speech signals downsampled from 22.5 kHz to 16 kHz, is used in our experiments. Four speaker pairs, male-to-male, male-to-female, female-to-male and female-to-female, are selected from the database. The STRAIGHT system [18] is used to extract the spectral envelope, which is represented by 24th-order mel-cepstral coefficients (MCCs). A parallel set of 20 sentences aligned by dynamic time warping is used as training data, and another 20 sentences are used for objective evaluation. The results are averaged over all conversion pairs.

B. Model Settings

In order to evaluate the performance of the proposed method, the results of the JD-GMM, PLS and LLT methods are also reported as references.
• JD-GMM: the mainstream method described in section II-A. The number of Gaussian components is set to 64.
• LLT: as the conversion quality varies with the neighborhood size, we evaluate the system with K from 50 to 300 in steps of 10. Another factor influencing performance is the number of iterations used to reselect the K nearest neighbors in (11). Our results show that one iteration is enough to find nearest neighbors that give the lowest distortion, so we report results for one iteration only.
• PLS: the whole training set is used to learn the transformation matrix from source features to target features. By varying the number of components, we find the optimal number of latent PLS components that yields the lowest distortion.
• LPLS (proposed): local data and partial least square regression are used to estimate the transformation for each frame, as described in section III.

The spectral envelope is converted using the above conversion methods, while the fundamental frequency is converted by equalizing the means and variances of the source and target speakers in the log scale.
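The log-scale mean-variance equalization of F0 mentioned above can be sketched as follows. This is our minimal illustration, not the authors' code; the function name and the convention that unvoiced frames are marked by zeros are assumptions:

```python
import numpy as np

def convert_f0(f0_src, src_f0_train, tgt_f0_train):
    """Mean-variance equalization of fundamental frequency in the log domain.

    f0_src                       : F0 contour (Hz), 0 for unvoiced frames
    src_f0_train, tgt_f0_train   : voiced F0 values (Hz) from the training data
    """
    # log-F0 statistics of the two speakers, estimated on voiced training frames
    mu_x, sigma_x = np.mean(np.log(src_f0_train)), np.std(np.log(src_f0_train))
    mu_y, sigma_y = np.mean(np.log(tgt_f0_train)), np.std(np.log(tgt_f0_train))

    f0_conv = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0                    # keep unvoiced frames at 0
    log_f0 = np.log(f0_src[voiced])
    # shift and scale so the converted log-F0 matches the target statistics
    f0_conv[voiced] = np.exp(mu_y + (sigma_y / sigma_x) * (log_f0 - mu_x))
    return f0_conv
```

By construction, applying this mapping to the source training F0 values reproduces the target speaker's log-F0 mean and standard deviation exactly.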

C. Evaluation Methods

The mel-cepstral distortion (MCD) between the target and converted mel-cepstra is used as the objective evaluation measure. The MCD for the n-th frame is:

    MCD[dB] = \frac{10}{\ln 10} \sqrt{ 2 \sum_{i=1}^{24} \left( c_{n,i} - c_{n,i}^{conv} \right)^2 },    (14)

where c_{n,i} and c_{n,i}^{conv} are the i-th dimension of the target and converted MCCs in frame n, respectively. A lower MCD value indicates smaller distortion.

To further evaluate the conversion performance, the correlation coefficient between the target and converted MCC parameters is also calculated. Different from the MCD, which is calculated per frame, the correlation coefficient is calculated for each dimension, defined as follows:

    \gamma_i = \frac{ \sum_{n=1}^{N} (c_{n,i} - \bar{c}_i)(c_{n,i}^{conv} - \bar{c}_i^{conv}) }{ \sqrt{ \sum_{n=1}^{N} (c_{n,i} - \bar{c}_i)^2 } \, \sqrt{ \sum_{n=1}^{N} (c_{n,i}^{conv} - \bar{c}_i^{conv})^2 } },    (15)

where c_{n,i} and c_{n,i}^{conv} are the i-th dimension of the target and converted MCCs in frame n, and \bar{c}_i and \bar{c}_i^{conv} denote the mean values of the target and converted MCCs in the i-th dimension, respectively. A higher correlation coefficient means higher similarity between the target and converted MCCs. We report the distortion and correlation coefficient results averaged over all conversion pairs.

D. Objective Results

1) The effect of the number of PLS components: We first evaluate the effect of the number of PLS components in the LPLS and PLS methods. The number of KNN was fixed at 200, and the number of PLS components varies from 1 to 24 (since the MCC features are of order 24). The results of the JD-GMM and LLT methods are also shown as references.

Fig. 1. Spectral distortion as a function of the number of latent components (GMM, LLT, PLS, LPLS).

Fig. 1 depicts the spectral distortion results. As the number of latent components increases, the MCD of the PLS method, which uses the whole training set, decreases continuously, while the MCD of the LPLS method first decreases and then increases because of over-fitting. The LPLS method yields a lower error than the reference methods when the number of latent components is smaller than 11; the optimal number of latent components for LPLS is 3, with the lowest distortion of 5.10 dB, which is much lower than the results of the two baselines, the JD-GMM method (5.31 dB) and the LLT method (5.58 dB). Using the whole training set, the best result of the PLS method is 5.42 dB, obtained with around 18 latent components; this is still 0.32 dB higher than the result of our proposed method. As the number of latent components increases from 3 to 24, the distortion of LPLS keeps growing, because the training data used to estimate each transformation matrix are limited. When the number of latent components equals the feature dimension (24), the distortion is almost the same as with the LLT method.

Fig. 2. The correlation coefficient as a function of the number of latent components (GMM, LLT, PLS, LPLS).

Correlation coefficient results are presented in Fig. 2. Consistent with the MCD results, with 3 latent components the LPLS method achieves the highest correlation coefficient (0.522), which is 0.039 and 0.062 higher than the results of the JD-GMM and LLT methods, respectively, and 0.042 higher than the best result of PLS. The performance degrades when the number of latent components increases above 4.

2) The effect of the number of KNN: We then evaluate the effect of the number of KNN vectors in the LPLS and LLT methods. The spectral distortion and correlation coefficient were calculated by varying the number of KNN from 50 to 300 in steps of 10. Based on the previous experiments, the numbers of PLS components for the LPLS and PLS methods were fixed at 3 and 18, respectively. The results were averaged over the four conversion pairs and are presented together with the results of JD-GMM and of the original (unconverted) data.

Fig. 3. Spectral distortion as a function of the number of nearest neighbors (Original, GMM, LLT, PLS, LPLS).

Fig. 3 shows how the number of KNN affects the spectral distortion of the different methods. For the LLT method, when the number of KNN is very small, e.g. 50, the spectral distortion is almost the same as that calculated directly between the source and target features, which means the method does not work. In contrast, the LPLS method performs well with limited training observations: with 50 KNN, its result is already better than those of the JD-GMM and PLS methods.

Fig. 4. The correlation coefficient as a function of the number of nearest neighbors (Original, GMM, LLT, PLS, LPLS).

The correlation coefficient results for different numbers of KNN are shown in Fig. 4 and follow the same trend as the MCDs. For both the LLT and LPLS methods, the correlation coefficient increases as the number of KNN grows from 50 to 200; the result of the LPLS method becomes stable after 200, whereas that of LLT continues to increase.

V. Conclusions

In this paper, we have proposed a local partial least square (LPLS) method for voice conversion. The use of KNN data avoids the over-smoothing problem described in section II. In addition, partial least square regression produces a more robust transformation function with limited training data. Therefore, the proposed LPLS method is capable of balancing the over-smoothing and over-fitting problems of conventional methods. The experimental results indicate that our proposed method outperforms the three baseline methods in terms of objective evaluation. In the future, we plan to use multiple source features to predict the target, and to evaluate with both objective measures and subjective listening tests.

VI. Acknowledgement

This research is supported in part by the Interactive and Digital Media Programme Office (IDMPO), National Research Foundation (NRF) hosted at the Media Development Authority (MDA) of Singapore (Grant No.: MDA/IDM/2012/8/8-2 VOL 01).

References

[1] Yannis Stylianou, Olivier Cappé, and Eric Moulines, "Continuous probabilistic transform for voice conversion," Speech and Audio Processing, IEEE Transactions on, vol. 6, no. 2, pp. 131–142, 1998.
[2] Tomoki Toda, Alan W Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 8, pp. 2222–2235, 2007.
[3] Chung-Hsien Wu, Chi-Chun Hsia, Te-Hsien Liu, and Jhing-Fa Wang, "Voice conversion using duration-embedded Bi-HMMs for expressive speech synthesis," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14, no. 4, pp. 1109–1116, 2006.
[4] Elina E Helander and Jani Nurminen, "A novel method for prosody prediction in voice conversion," in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, 2007, vol. 4, pp. IV–509.
[5] Zhi-Zheng Wu, Tomi Kinnunen, Eng Siong Chng, and Haizhou Li, "Text-independent F0 transformation with non-parallel data for voice conversion," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[6] Damien Lolive, Nelly Barbot, and Olivier Boeffard, "Pitch and duration transformation with non-parallel data," Speech Prosody 2008, pp. 111–114, 2008.
[7] Elina Helander, Tuomas Virtanen, Jani Nurminen, and Moncef Gabbouj, "Voice conversion using partial least squares regression," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 18, no. 5, pp. 912–921, 2010.
[8] Zhizheng Wu, Tomi Kinnunen, Eng Siong Chng, and Haizhou Li, "Mixture of factor analyzers using priors from non-parallel speech for voice conversion," IEEE Signal Processing Letters, vol. 19, no. 12, 2012.
[9] Victor Popa, Hanna Silen, Jani Nurminen, and Moncef Gabbouj, "Local linear transformation for voice conversion," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4517–4520.
[10] Srinivas Desai, E Veera Raghavendra, B Yegnanarayana, Alan W Black, and Kishore Prahallad, "Voice conversion using artificial neural networks," in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on. IEEE, 2009, pp. 3893–3896.
[11] Elina Helander, Hanna Silén, Tuomas Virtanen, and Moncef Gabbouj, "Voice conversion using dynamic kernel partial least squares regression," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 3, pp. 806–817, 2012.
[12] Zhizheng Wu, Eng Siong Chng, and Haizhou Li, "Conditional restricted Boltzmann machine for voice conversion," in the First IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP). IEEE, 2013.
[13] Yining Chen, Min Chu, Eric Chang, Jia Liu, and Runsheng Liu, "Voice conversion with smoothed GMM and MAP adaptation," in Eurospeech 2003, 2003, pp. 2413–2416.
[14] Alexander Kain and Michael W Macon, "Spectral voice conversion for text-to-speech synthesis," in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. IEEE, 1998, vol. 1, pp. 285–288.
[15] Roman Rosipal and Nicole Krämer, "Overview and recent advances in partial least squares," in Subspace, Latent Structure and Feature Selection, pp. 34–51. Springer, 2006.
[16] Sijmen de Jong, "SIMPLS: an alternative approach to partial least squares regression," Chemometrics and Intelligent Laboratory Systems, vol. 18, no. 3, pp. 251–263, 1993.
[17] Alexander Blouke Kain, High Resolution Voice Transformation, Ph.D. thesis, Rockford College, 2001.
[18] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain de Cheveigné, "Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
