Voice Processing Technique for Human Cerebral Activity Measurement

Kakuichi Shiomi
Electronic Navigation Research Institute
7-42-23 Jindaiji-Higashimachi, Chofu, Tokyo, Japan
[email protected] or [email protected]

Abstract—Variations in the chaotic characteristics of the human voice are strongly related to the speaker’s level of fatigue. By defining a time-local exponent that evaluates the time-local characteristics of the uttered voice time domain signal, and calculating it periodically or continually, it is possible to observe changes in human cerebral activity. This time-local exponent, called the Cerebral Exponent, is defined in a way similar to the first Lyapunov exponent. If the Cerebral Exponent of uttered voice signals can be calculated in real time, the proposed chaotic voice analysis technology can be regarded as a way of fitting a tachometer to the speaker’s brain.

Keywords—voice analysis, fatigue, human cerebral activity

I. INTRODUCTION

In 1998, the author and Mr. Shozo Hirose found that the time-averaged value of the first Lyapunov exponent calculated from a sampled human voice time domain signal changed according to the speaker’s condition [1]. However, the first Lyapunov exponent calculated from uttered voice was not adequate for evaluating the time-local characteristics of the human voice, so in 2002 the author defined a new exponent named the Cerebral Exponent (CE). The CE is defined to evaluate the time-local characteristics of a time domain signal that, like the human voice, has some chaotic characteristics and a peak in its spectrum. Furthermore, the author created an algorithm named SiCECA (Shiomi’s Cerebral Exponent Calculation Algorithm) to calculate the CE. The first implementation of SiCECA on a personal computer (Motorola PowerPC, 500 MHz clock) took about 5 minutes to calculate the CE from 10 seconds of uttered voice signal. Since 2002, the author and his staff have carried out many experiments to confirm the function of the voice analysis software in which SiCECA is implemented, and a typical result is presented in this paper.

II. HUMAN VOICE ANALYSIS

A. Historical Discovery
Figures 1 and 2 each show a mathematical figure called a “strange attractor” generated from a vocalized “o” sound. The generation technique for the strange attractor is described later. Each strange attractor is generated from the last 80 ms of the vocalized “o” sound of a call made by railway drivers immediately before departure, “Shu-ppatsu Shinko!” (which means “start and go ahead”). The strange attractor shown in Fig. 1 was generated from a voice recording taken some tens of minutes after the start of a driving exercise in a railroad vehicle driving simulator, while Fig. 2 was generated from a voice recording taken after about five hours of simulated driving. Five hours of simulated railroad driving generally causes fatigue, and the author hypothesized that the fluctuations in the uttered voice observed in Fig. 1 decreased in Fig. 2 due to this fatigue. Moreover, it was hypothesized that if the level of fluctuation in the uttered voice could be evaluated, it would be possible to quantify the speaker’s level of fatigue.

1-4244-2384-2/08/$20.00 ©2008 IEEE

Fig.1 Strange attractor of voiced “o” sound: the speaker is not fatigued.

Fig.2 Strange attractor: the speaker is sleepy with exhaustion.

B. Conventional Spectrum Analysis of Human Voice
In conventional frequency/spectrum analysis, the signal to be processed is extracted from a portion of the time domain voice signal during which the speaker is uttering, and its power spectrum is observed. Figure 3 shows typical waveforms of the Japanese vowel sounds “a”, “e”, “i”, “o” and “u” together with their power spectra. The phonemes differ in the number and position of the peaks in their power spectra, which correspond to the pitch and formant frequencies. The height of the peak at the pitch frequency, and of the peaks at the formant frequencies that are overtones of the pitch, changes according to the duration of utterance of the phoneme and the size of the data sample used to calculate the power spectrum.

SMC 2008

If the speaker becomes tense, the pitch frequency rises, and the frequencies of all peaks in the spectrum consequently rise as well. However, until now it has not appeared possible to evaluate the speaker’s level of fatigue from changes in the pitch frequency and the frequencies of the peaks.

Figure 4. Time domain waveform and its strange attractor

Figure 3. Waveforms and power spectra of Japanese vowel sounds

C. Chaotic Analysis of Human Voice
In the chaotic analysis of a time domain signal, we analyze the strange attractor shown in Fig. 4, which is reconstructed in phase space according to Takens’ embedding theorem [2]. The process is quite similar for sampled audio signals. Although the strange attractor in the example in Fig. 4 is reconstructed in two-dimensional phase space, it is usually reconstructed in a phase space whose dimension is equal to or higher than the fractal dimension of the time series embedded in it. Since the fractal dimension of uttered voice signals is thought to be between four and six, audio signals are usually embedded in a phase space of four or more dimensions. Generally, there is no upper bound on the embedding dimension, and it is typically recommended that the time domain signal be embedded in a phase space of as high a dimension as possible without making the computation time excessive. However, if the local time stability of the system that generates the uttered voice time domain signal is not sufficient, embedding in an extremely high dimension will give an incorrect evaluation result.

Figure 5. Strange attractors of Japanese vowel sound


The parameter τ used in the construction of the strange attractor in Fig. 4 is called the embedding delay time. In general, the time at which the autocorrelation coefficient of the time domain signal first becomes zero is used as the embedding delay time. However, we have confirmed experimentally that determining the embedding delay time by this method does not give good results when attempting to quantitatively evaluate a speaker’s level of alertness from the degree of fluctuation in the uttered voice. In the case of sampled voice time domain signals, the author believes that it is suitable to choose the embedding delay time so that it gives a strange attractor whose structure is easy to grasp visually. Strange attractors with different shapes can be constructed for each vowel, as shown in Fig. 5, when a single vowel sound is uttered continuously.

In the current state of the art in chaotic analysis, the first Lyapunov exponent is usually calculated as an index of the chaotic characteristics of a time domain signal. Calculating the first Lyapunov exponent is sufficiently effective to quantify the level of fluctuation in an uttered voice signal.

D. Chaotic Analysis of Human Voice
The purpose of the author’s analysis is to quantify the amount of fluctuation in uttered voice signals, and to confirm whether it is possible to evaluate the activity of the speaker’s cerebrum, or more precisely the part of the cerebrum concerned with speech, by quantifying the level of fluctuation in the uttered voice or by observing variations in that level. To date, experiments have largely confirmed that the level of fluctuation in an uttered voice signal correlates with the alertness of the speaker. The author also aims to realize an algorithm that calculates the time-local level of fluctuation in uttered voice signals more quickly and precisely.
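The delay-coordinate reconstruction of Section II-C and the general autocorrelation rule for the embedding delay time can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the author's implementation: the function names `first_nonpositive_autocorr_lag` and `delay_embed` and the sinusoidal test signal are hypothetical. As noted above, the author reports that the autocorrelation rule does not give good results for voice and prefers a delay chosen so that the attractor is easy to read visually.

```python
import numpy as np

def first_nonpositive_autocorr_lag(s):
    """Smallest positive lag (in samples) at which the autocorrelation
    of the mean-removed signal is zero or negative; None if it never is."""
    x = np.asarray(s, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0 .. N-1
    idx = np.nonzero(ac[1:] <= 0.0)[0]
    return int(idx[0]) + 1 if idx.size else None

def delay_embed(s, dim, tau):
    """Delay embedding: row i is (s[i], s[i+tau], ..., s[i+(dim-1)*tau])."""
    x = np.asarray(s, dtype=float)
    n = len(x) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("signal too short for this dim and tau")
    return np.stack([x[k * tau : k * tau + n] for k in range(dim)], axis=1)

# Illustrative input (an assumption): a sinusoid with a 100-sample period.
signal = np.sin(2 * np.pi * np.arange(1000) / 100)
tau = first_nonpositive_autocorr_lag(signal)  # close to a quarter period
points = delay_embed(signal, dim=4, tau=tau)  # one row per embedded point
print(tau, points.shape)
```

Plotting any two columns of `points` against each other gives a two-dimensional view of the reconstructed attractor, comparable in kind to Fig. 4.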
In the calculation of the first Lyapunov exponent, the system that generates the time domain signal must be sufficiently stable that a strange attractor can be constructed which can be analyzed by current state-of-the-art chaos theory. It is not possible to calculate a proper first Lyapunov exponent when the analyzed strange attractor is constructed from a time domain signal containing two or more vowels. It is possible to apply Sano and Sawada’s algorithm, or another algorithm for the approximate calculation of the first Lyapunov exponent, when a single vowel phoneme is uttered for one second or more [3]. However, in normal daily speech, which contains more than six phonemes per second, the duration of utterance of a single phoneme is only about 100 ms, which is too short to calculate the first Lyapunov exponent by Sano and Sawada’s algorithm or any other algorithm.

By conventional spectrum analysis, however, it is possible to derive the pitch frequency and each formant frequency even if the processed data sample contains two or more vowel phonemes. The spectrum calculated from a data sample containing multiple phonemes simply consists of all the peaks contributed by each phoneme. Moreover, the peaks corresponding to pitch frequencies have a width that is interpreted as the level of fluctuation of the pitch frequency, provided that the sample is long enough for the fluctuations in pitch frequency to be captured. Figure 6 shows a pair of speech waveforms recorded when reading a pair of texts, 1 and 2, and the corresponding power spectra. The voice data to be processed was cut arbitrarily from the continuous text reading recording, and the power spectra were calculated from the data. In the calculation of the power spectra, the FFT data size is 16,384 and the Hanning window function is applied.
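A power spectrum of the kind described here, with an FFT data size of 16,384 and a Hanning window, can be sketched as follows. This is a hedged illustration only: the function name `power_spectrum`, the normalization, and the 200 Hz test tone are assumptions, and the 48 kHz sampling frequency is borrowed from the recording setup in Section III, since the text does not state the sampling frequency of the readings in Fig. 6.

```python
import numpy as np

def power_spectrum(x, fs, n_fft=16_384):
    """One-sided power spectrum of the first n_fft samples of x,
    after applying a Hanning window. Returns (frequencies in Hz, power)."""
    seg = np.asarray(x, dtype=float)[:n_fft]
    if len(seg) < n_fft:
        raise ValueError("need at least n_fft samples")
    win = np.hanning(n_fft)
    spec = np.fft.rfft(seg * win)
    power = np.abs(spec) ** 2 / np.sum(win ** 2)  # normalization is a choice
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs, power

fs = 48_000                      # assumed sampling frequency (see lead-in)
n_fft = 16_384
t = np.arange(n_fft) / fs
tone = np.sin(2 * np.pi * 200.0 * t)   # illustrative stand-in for a pitch
freqs, power = power_spectrum(tone, fs, n_fft)
peak_hz = freqs[np.argmax(power)]
bin_width_hz = fs / n_fft        # frequency spacing of the spectrum
print(peak_hz, bin_width_hz)
```

Under the 48 kHz assumption, 16,384 samples span about 0.34 s and the bin spacing is about 2.9 Hz, so at more than six phonemes per second a single analysis frame would cover more than one phoneme.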

Figure 6. Two waveforms and power spectra of readings 1 and 2

By comparing the two power spectra in Fig. 6, it can be seen that the fluctuation level of the pitch frequency of reading 1(-1) is smaller than that of reading 2(-1). Indeed, the chaotic analysis method confirms that the fluctuation level of the waveform of reading 1(-1) is smaller than that of reading 2(-1). Even if the fluctuation level of the pitch frequency does not correspond exactly to that of the waveform, if the fluctuation level of the uttered voice shown in Fig. 5 could be observed and quantified from the power spectrum, it would not be necessary to derive it by the chaotic method. This would be a great advantage, since the chaotic method is complex and time consuming (the calculation of the first Lyapunov exponent takes three or more orders of magnitude longer than calculating the power spectrum). Unfortunately, more detailed analysis shows that power spectrum analysis does not in fact appear to quantify the fluctuation level of the uttered voice, as explained below.

Figure 7. Two waveforms and power spectra of the same speech, cut out a hundred and several tens of milliseconds later

Figure 7 shows waveforms and power spectra generated from different voice signal samples taken from the same recording as used in Fig. 6. The only difference between the utterance signals used in Figs. 6 and 7 is the cutout time of the data sample. Comparing the two power spectra in Fig. 7, it does not appear that the peak width of the pitch frequency quantifies the fluctuation level. If the FFT data size is quadrupled to 65,536, as shown in Fig. 8, what appears to be one fluctuating pitch frequency peak in reading 2(-1) in Fig. 6 is shown to be divided into two peaks.

Figure 8. Power spectrum of reading voice 2 (FFT data size: 65,536)

In the case of reading 1, details of the power spectrum can also be observed when the FFT data size is increased by a factor of two (32,768) or four (65,536), as shown in Figs. 9 and 10. However, this does not make it possible to quantify the fluctuation level of the uttered voice. If it were possible to quantify the fluctuation level of the uttered voice by spectrum analysis, there would be no problem if the voice signal contained several vowel sounds, whereas this presents a serious problem in chaotic analysis. However, from the above result, it appears that it is not possible to quantify the fluctuation level of the uttered voice by spectrum analysis.

Figure 9. Power spectrum of reading voice 1 (FFT data size: 32,768)

Figure 10. Power spectrum of reading voice 1 (FFT data size: 65,536)

E. Shiomi’s Cerebral Exponent Calculation Algorithm (SiCECA)
Since general speech contains many phonemes per second, it is not possible to calculate the first Lyapunov exponent directly from a general speech time domain signal. Current chaos theory does not define the first Lyapunov exponent for such a time series. To evaluate the fluctuation level of general speech from its chaotic characteristics, it is necessary to define a new index or exponent that can be regarded as a time-local first Lyapunov exponent. The author therefore defined the following indices of time-local chaotic characteristics, termed the “Cerebral Exponent micro (CEm)” and the “Cerebral Exponent Macro (CEM),” since it is hypothesized that CEM values indicate human cerebral activity.

Here, a time series of a signal s = s(t), such as a continuous speech voice to be processed by SiCECA, is defined as in Eq. 1:

$s = s(t_i), \quad i = 1, 2, 3, \ldots$  (Eq. 1)

When the embedding dimension is 4, an embedded point P0 in phase space at time t0 is described as

$P_0 = \left(s(t_0),\ s(t_0 + \tau_d),\ s(t_0 + 2\tau_d),\ s(t_0 + 3\tau_d)\right)$  (Eq. 2)

where τd in Eq. 2 is the embedding delay time.

The locus of P0, P1, P2, P3, P4, ... as it changes with time is called Takens’ plot, and constitutes the strange attractor of the time series s(t). The shape of the strange attractor for each vowel is as shown in Fig. 5, and each pseudo-orbit appears to fluctuate. In the conventional algorithm for calculating the first Lyapunov exponent, it is necessary to set the neighborhood point condition beforehand, and “the neighborhood point set of point P0” is created by finding a predetermined number of neighborhood points from the entire unit of data being processed. In SiCECA, on the other hand, the neighborhood point set is created according to the following procedure: the first neighborhood point of P0 is the point PN1 that, after turning once round a pseudo-orbit from P0, comes closest to P0; the second neighborhood point PN2 is the point that comes closest after turning twice round a pseudo-orbit; and so on. Once the neighborhood point set has been created, the CEm value at time t0 can be calculated by setting an appropriate evolution time τe in SiCECA, in a way similar to the calculation of the first Lyapunov exponent by Sano and Sawada’s algorithm.

SiCECA is carried out in two steps. The first step is to make a list that associates the CEm value with εs (SiCECA epsilon) for all the embedded points constituting the strange attractor, where εs is the diameter of the hypersphere enclosing the neighborhood point set. The second step is statistical processing to calculate CEM from the list generated in the first step. For example, a CEM value for the entire processed voice signal can be obtained by averaging the CEm values for which εs is less than or equal to 3.0% of the diameter of the hypersphere enclosing the strange attractor.

III. AN EXPERIMENT AND TYPICAL RESULTS

A. Method
A subject reads aloud from one reading card under two conditions. The reading card is an excerpt from an old Japanese tale that can be read in about ten seconds or less. The subject first reads the card normally. The subject then reads the same card again, but with his/her fingers in his/her ears to induce mental stress. The reading voice was recorded using a microphone (AKG C391B) and a PCM recorder (Marantz PMD670) at a sampling frequency of 48.0 kHz with 16 bits/sample. A total of 41 subjects participated in the experiment: 36 males and five females.

B. Results
Figure 11 shows the CEM values calculated from the uttered voice recordings of the card readings for different values of the embedding dimension. For each subject, the CEM values are plotted as a point on a graph with the normal reading CEM value along the abscissa and the mental stress CEM value along the ordinate. For instance, the point labeled subject-A in Fig. 11 (top) shows the case of 4-dimensional embedding, with a normal reading CEM value of about 460 and a stressed reading CEM value of about 580.
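How a CEM value might be obtained from a recording, following the two-step description of SiCECA in Section II-E, is sketched below. This is a simplified stand-in and not the author's SiCECA: the neighbor search assumes a known pseudo-orbit period in samples (the hypothetical parameter `period`), the CEm-like value is a plain log-divergence rate rather than Sano and Sawada's method, εs is approximated as twice the largest distance from P0 to its neighbors, and the attractor diameter is approximated as twice the largest distance from the centroid. All function names and the synthetic test signal are assumptions, so the numbers it produces are not comparable with those in Fig. 11.

```python
import numpy as np

def delay_embed(s, dim, tau_d):
    """Row i is the embedded point (s[i], s[i+tau_d], ..., s[i+(dim-1)*tau_d])."""
    x = np.asarray(s, dtype=float)
    n = len(x) - (dim - 1) * tau_d
    return np.stack([x[k * tau_d : k * tau_d + n] for k in range(dim)], axis=1)

def ce_micro_list(points, period, n_neighbors=3, tau_e=5):
    """Step 1 (simplified): list of (CEm-like value, eps_s) per embedded point.
    Assumption: the k-th neighbor is the closest point within +/- period/4
    of index i + k*period, i.e. roughly k turns round the pseudo-orbit."""
    half = max(1, period // 4)
    last = len(points) - tau_e          # keep indices valid after evolution
    pairs = []
    for i in range(last):
        nbrs = []
        for k in range(1, n_neighbors + 1):
            lo = i + k * period - half
            hi = min(i + k * period + half + 1, last)
            if lo >= hi:
                break
            d = np.linalg.norm(points[lo:hi] - points[i], axis=1)
            nbrs.append(lo + int(np.argmin(d)))
        if len(nbrs) < n_neighbors:
            continue
        nbrs = np.array(nbrs)
        d0 = np.linalg.norm(points[nbrs] - points[i], axis=1)
        d1 = np.linalg.norm(points[nbrs + tau_e] - points[i + tau_e], axis=1)
        if d0.min() <= 0.0 or d1.min() <= 0.0:
            continue                    # avoid log(0)
        eps_s = 2.0 * d0.max()          # crude stand-in for the sphere diameter
        ce_m = float(np.mean(np.log(d1 / d0))) / tau_e
        pairs.append((ce_m, eps_s))
    return pairs

def attractor_diameter(points):
    """Approximate diameter: twice the largest distance from the centroid."""
    c = points.mean(axis=0)
    return 2.0 * float(np.linalg.norm(points - c, axis=1).max())

def ce_macro(pairs, diameter, eps_fraction=0.03):
    """Step 2: mean of the CEm values whose eps_s <= eps_fraction * diameter."""
    kept = [ce for ce, eps in pairs if eps <= eps_fraction * diameter]
    return float(np.mean(kept)) if kept else float("nan")

# Illustrative synthetic signal (an assumption), 0.1 s at 48 kHz.
fs = 48_000
t = np.arange(int(0.1 * fs)) / fs
s = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 150 * np.sqrt(2) * t)
pts = delay_embed(s, dim=4, tau_d=40)
pairs = ce_micro_list(pts, period=fs // 150)
print(len(pairs), ce_macro(pairs, attractor_diameter(pts), eps_fraction=0.03))
```

The 3.0% threshold follows the example given in Section II-E; if no point satisfies it for a given signal, `ce_macro` returns NaN rather than guessing a value.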

Figure 11. Distribution of CEM values

Points along the diagonal line (x=y) have no difference between the CEM values in the two experimental conditions (normal / stressed), while points above the line have higher CEM values in the stressed condition than in the normal condition. It is considered that subjects become more conscious of the sound of their voice when closing their ears, and the stress this causes increases the CEM value. As Fig. 11 shows, the majority of points do indeed lie above the x=y line for all values of the embedding dimension. For the other points, it may be that some subjects were initially stressed for some reason, causing their normal reading CEM values to be comparatively high and resulting in a smaller difference between their “normal” and “stressed” reading states. Also, the average female CEM value appears to be smaller than that of males.

IV. SHORT CONCLUSION

The author believes that this and other experiments have confirmed that the quantitative evaluation of human cerebral activity has become possible by processing uttered voice signals with the SiCECA algorithm. Since the area dealing with language occupies a large part of the human brain, the author thinks that it will be possible to evaluate whole cerebral activity if uttered voice samples can be obtained under appropriately controlled conditions. Imagine if the status of the human brain could be measured as simply as blood pressure: for example, that the brain has ample margin; that the brain has been used considerably; that the brain is wide awake; or that the brain is half asleep or in a haze from lack of stimulation. By measuring the status of the brain on a daily basis, it may be possible to assess chronic fatigue or overwork, extreme tension, or even the degree of cerebral dysfunction. In 2008, the latest implementation of SiCECA on a PC (Intel Core2 processor, 2.0 GHz clock) takes five seconds or less to calculate CEM from a five-second sample of uttered voice.

ACKNOWLEDGEMENT

The system of uttered voice chaotic analysis used in this experiment has been granted the following patents in the U.S.A.: US 6,876,964 (Oct. 19, 2005), US 7,321,842 (Jan. 22, 2008) and US 7,363,226 (Apr. 22, 2008).

REFERENCES

[1] K. Shiomi and S. Hirose, “Fatigue and Drowsiness Predictor for Pilots and Air Traffic Controllers,” Proc. of 45th Annual ATCA Conference, Oct. 2000.
[2] F. Takens, “Detecting strange attractors in turbulence,” Lecture Notes in Mathematics, vol. 898, pp. 366–381, 1981.
[3] M. Sano and Y. Sawada, “Measurement of the Lyapunov Spectrum from a Chaotic Time Series,” Physical Review Letters, 1985, pp. 1082–1085.
[4] T. Ikeguchi and K. Aihara, “Lyapunov Spectral Analysis on Random Data,” International Journal of Bifurcation and Chaos, 1997, pp. 1267–1282.
[5] http://www.siceca.org
