Acoust. Sci. & Tech. 26, 1 (2005)

TECHNICAL REPORT

The present status, progress, and usage of speech databases in Japan

Hisao Kuwabara(1), Shuichi Itahashi(2), Mikio Yamamoto(2), Satoshi Nakamura(3), Toshiyuki Takezawa(3) and Kazuya Takeda(4)

(1) Teikyo University of Science and Technology, Uenohara, Kitatsuru-gun, Yamanashi, 409–0193 Japan
(2) University of Tsukuba, Tennodai, Tsukuba, Ibaraki, 305–8573 Japan
(3) ATR Spoken Language Translation Research Laboratories, Hikaridai, Soraku-gun, Kyoto, 619–0288 Japan
(4) Center for Integrated Acoustic Information Research, Nagoya University, Furomachi, Chigusa-ku, Nagoya, 464–8603 Japan

(Received 26 February 2004, Accepted for publication 22 June 2004)

Abstract: The present status, progress, and usage of Japanese speech databases are described. The database project in Japan started in the early 1980s. The first was by the Japan Electronic Industry Development Association (JEIDA), which aimed at creating a speech database to evaluate the performance of the speech input/output machines and systems existing at the time. Several database projects have been undertaken since then, including the one initiated by the Advanced Telecommunication Research Institute (ATR), and we have now reached a point where an enormous amount of spontaneous speech data is available. A survey was conducted recently on the usage of the presently existing speech databases among industry and university institutions in Japan where speech research is actively going on. It revealed that the ATR continuous speech database is the most frequently used, followed by the equivalent version from the Acoustical Society of Japan.

Keywords: Speech database, Continuous speech corpus, Spontaneous speech, Speech synthesis, Speech recognition

PACS number: 43.10.Ln  [DOI: 10.1250/ast.26.62]

1. INTRODUCTION

As speech processing technologies such as speech recognition and synthesis have developed along with the digital computer's progress in speed, memory, and usability, a growing demand for large-scale speech databases has become evident since the 1980s. Especially in the late 1980s and early 1990s, database projects were stimulated by a series of database workshops in many industrially advanced countries and by the foundation of societies and consortia related to speech and language databases, such as the LDC, ELRA, and GSK. In speech research, especially in the field of speech processing technologies, progress depends largely on commonly used large speech databases.

The speech database project in Japan started in the early 1980s, initiated by JEIDA (Japan Electronic Industry Development Association) [1]. It was a time when several high-tech firms began delivering speech I/O machines to the market, each claiming the best performance, so there was a need to standardize the evaluation of these machines using a common speech database. Soon after this initial project finished in 1985, ATR (Advanced Telecommunications Research Institute International) began database construction from the very beginning of its foundation [2]. The work aimed chiefly at creating a large-scale corpus for use in continuous speech recognition research; the mid-1980s was a time when researchers in the field of speech technology began in earnest to develop continuous speech recognition algorithms, having seen the limitations of isolated word recognition. In 1990, the JEIDA project was re-formed as the ASJ (Acoustical Society of Japan) database project, adding more members to the committee and creating public-domain speech corpora such as continuous speech and Japanese newspaper article read speech (JNAS) [3]. Recently, various speech corpora of spontaneous speech and free conversational speech, rather than traditional read speech, are being created [4–6], reflecting the development of speech understanding systems. Speech corpora of elderly people, reflecting the rapidly aging society, and corpora collected in real noisy environments, in accordance with the wide use of cellular phones, are also being built today [7].

Recently, speech researchers have begun to disclose the databases they have used. It is still not well known, however, what kinds of speech corpora are used and for what purposes. The ASJ Speech Database Committee therefore recently conducted a survey on speech database use, to monitor the present status of database use and to provide guidance for future database activities. Questionnaire sheets were sent to many industry and university institutions where speech research is being actively conducted. This paper reports an analysis of the survey, after briefly reviewing the speech databases that have already been built and the database-related research projects now underway [8].

2. THE PRESENT STATUS OF SPEECH DATABASES

The Japanese speech database project started in 1982, through the formation of JEIDA, with the construction of an isolated-word speech database. The committee was later re-formed as a new organization attached to the ASJ and began to collect a large amount of speech data. At the same time, the Advanced Telecommunications Research Laboratories International (ATR) was founded and started to build a large-scale speech database, originally for exclusive ATR use but later made commonly available to speech researchers.

2.1. Existing Speech Databases
Table 1 shows a list of the main existing databases; data collection has already finished for these corpora.

Table 1 Existing main speech databases.

  Speech database                                        Year
  JEIDA speech database                                  1982–1985
  ATR speech database                                    1986–
  Ministry of Education ''Japanese Speech'' database     1986–1991
  ASJ speech database                                    1990–1997

The Ministry of Education's ''Japanese Speech'' database is somewhat different from the rest of the corpora listed in the table, since it is the only one collected from all over Japan; its purpose was to serve as a tool for linguistic (dialect) research and language education. The most frequently used corpora at present are those of ATR and the ASJ. Table 2 describes the contents of the ASJ database. The first corpus built in this database project was the continuous speech database, which includes the ATR's 503 phonetically balanced sentences. To cope with the progress in speech understanding research, some guidance task sentences and a few simulated dialogues were also added to this database.
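As background, a ''phonetically balanced'' set is one chosen so that a small number of sentences covers the phoneme (or phoneme-pair) inventory of the language as completely as possible. The minimal Python sketch below shows one common greedy way such sets can be selected; the corpus and the phonemize() function are hypothetical stand-ins, not the actual ATR selection procedure.

```python
# Greedy selection of a "phonetically balanced" subset: repeatedly pick
# the sentence that covers the most phoneme bigrams not yet covered.
# Illustrative sketch only, not the actual ATR method.

def phonemize(sentence: str) -> list[str]:
    """Hypothetical stand-in; a real system would use a Japanese
    grapheme-to-phoneme converter here."""
    return list(sentence)

def bigrams(phonemes: list[str]) -> set[tuple[str, str]]:
    return set(zip(phonemes, phonemes[1:]))

def select_balanced(corpus: list[str], target_size: int) -> list[str]:
    covered: set[tuple[str, str]] = set()
    selected: list[str] = []
    candidates = list(corpus)
    while candidates and len(selected) < target_size:
        # Pick the sentence that adds the most new phoneme bigrams.
        best = max(candidates,
                   key=lambda s: len(bigrams(phonemize(s)) - covered))
        if not bigrams(phonemize(best)) - covered:
            break  # nothing new left to cover
        covered |= bigrams(phonemize(best))
        selected.append(best)
        candidates.remove(best)
    return selected
```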

Table 2 Contents of the ASJ speech database.

  Corpus                                               Speakers
  1. Continuous speech for research
     - ATR's phonetically balanced sentences (503)     30 males, 34 females
     - Guidance task sentences (1,027)                 18 males, 18 females
     - Simulated dialog (37)                           29 males, 8 females
  2. Japanese newspaper article sentences
     - Mainichi Shinbun articles (15,500)              153 males, 153 females
     - Subset of ATR's phonetically balanced           153 males, 153 females
       sentences (50)

Table 3 Contents of the ATR speech database.

  Corpus                                               Speakers
  1. Isolated word database: Set A                     20
     - Important words (5,240)
     - Phonetically balanced words (216)
     - Numerals, alphabets (110)
     - Simulated dialog (115)
  2. Continuous speech database: Set B                 10
     - Phonetically balanced sentences (503)
  3. Database for large number of speakers: Set C      250
     - Important words (750, subset from Set A)
     - Phonetically balanced sentences (150, subset from Set B)

Table 3 depicts the contents of the ATR speech database. The isolated word database (Set A) consists of the 5,240 most common words selected from Sanseido's Shinmeikai Japanese Dictionary, 216 phonetically balanced words, 110 numerals and alphabet letters, and 115 simulated dialogues. The continuous speech database (Set B) consists of 503 'phonetically balanced' sentences selected from newspapers and magazines. Set C contains 750 important words, a subset of Set A, and 150 phonetically balanced sentences, which are also part of Set B. The speakers for the Set A and Set B databases are all professional announcers.

The speech data in the above-mentioned databases were all collected in a quiet studio, and the data are laboratory-regulated. Recently, as will be mentioned later, some databases have been collected in real living environments from the beginning, but most databases that have already been built instead add some kind of artificial noise to simulate real environments. The JEIDA and RWCP databases fall into this category.
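As an illustration of the artificial-noise approach just described, recorded noise can be mixed into clean studio speech at a chosen signal-to-noise ratio. The sketch below is a generic recipe, not the specific JEIDA or RWCP procedure.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into clean speech at the requested SNR in dB."""
    noise = np.resize(noise, clean.shape)        # loop or trim noise to length
    p_clean = float(np.mean(clean.astype(np.float64) ** 2))
    p_noise = float(np.mean(noise.astype(np.float64) ** 2))
    # Scale noise so that 10*log10(p_clean / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean.astype(np.float64) + scale * noise.astype(np.float64)
```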

In the RWCP project, four kinds of speech corpora were created in order to simulate real living environments: 1) a sound scene database of real acoustic environments, 2) a dialogue speech dataset, 3) a broadcast news speech dataset, and 4) a round-table meeting speech dataset. Table 4 describes the contents of the sound scene database.

Table 4 RWCP sound-scene database.

  Sound category                #samples
  1. Collision sounds
     - Wood                     1,187
     - Metal                    1,000
     - Plastic                  550
     - Ceramics                 800
  2. Action sounds
     - Dropping articles        200
     - Jetting gas              200
     - Rubbing things           500
     - Breaking things          200
     - Clapping                 829
  3. Characteristic sounds
     - Small metal articles     1,072
     - Paper rubbing            400
     - Musical instruments      1,079
     - Electronic sounds        705
     - Machine noise            1,000

Table 5 Recent speech research projects.

  1. Spontaneous Speech Engineering
  2. Integrated Acoustic Information Research
  3. Advanced Spoken Language Processing
  4. The Expressive Speech Processing Project

2.2. Database-Related Speech Research Projects
In this section, we review recent database-related speech research projects. Table 5 presents a list of four projects; a specific database construction plan underlies each of them. The first three projects finished their terms at the end of fiscal year 2004.

In the ''Spontaneous Speech Engineering'' project, a large corpus of monologue speech was built: the CSJ, or ''Corpus of Spontaneous Japanese.'' The project, started in 1999, had three main goals: 1) building a large-scale spontaneous speech corpus, 2) developing acoustic and linguistic models for spontaneous speech recognition, understanding and summarization, and 3) constructing a prototype system for spontaneous speech summarization. It focused on monologue rather than dialogue data and had collected about 670 hours of data as of January 2000. Monologue data were collected in two ways: 1) academic presentation speech and 2) simulated public speech. The goal of this corpus project is to collect some 7 million words by its end. The annotations cover 1) orthographic and phonetic labels, 2) utterance boundary labels, and 3) fillers, disfluency information, and noise.
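To make these annotation layers concrete, the record below sketches what one annotated utterance might carry. The structure and field names are invented for illustration; the actual CSJ uses its own transcription format.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    start: float                                    # utterance boundary (s)
    end: float
    orthography: str                                # orthographic transcription
    phonetic: list[str]                             # phonetic label sequence
    fillers: list[tuple[float, str]] = field(default_factory=list)
    disfluency: bool = False                        # restart/fragment present
    noise: bool = False                             # overlapping non-speech noise

# Hypothetical example record:
utt = Utterance(start=12.48, end=15.02,
                orthography="えーと、それで実験を始めました",
                phonetic=["e", ":", "t", "o", "s", "o", "r", "e", "d", "e"],
                fillers=[(12.48, "えーと")])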


In the ''Integrated Acoustic Information Research'' project, not only speech technology research but also integrated aspects of the human-sound relationship were considered. Data are being collected for 1) spatial sound localization by humans, 2) analysis and synthesis of sounds, 3) recognition of sounds including speech, 4) machine understanding of spoken language, and 5) human perception of sounds. This project plans to create large-scale speech corpora and acoustic sound databases to be used both within the project and in the wider speech research community. So far, it has created a corpus of speech spoken by elderly people.

The third project, Advanced Spoken Language Processing (in full, ''Realization of Advanced Spoken Language Processing from Prosodic Features''), aims at building a Japanese version of ELRA's MULTEXT prosodic corpus. The text was translated from five European languages and adapted into Japanese. It comprises 40 different passages with 6,523 morae, including 1,085 accent kernels. Each speaker was recorded in two ways: 1) reading the text naturally, and 2) reading it with an instructed emotional or paralinguistic attitude; the data are being given prosody tags. Signals from EGG electrodes were also recorded simultaneously. A total of 3 hours and 37 minutes of speech has been recorded.

The fourth project, the Expressive Speech Processing project, is creating a very large spontaneous speech database, including emotional speech, from a small number of speakers. It records spontaneous conversational speech and daily spoken interactions with naturally occurring emotions and attitudes, without recourse to acting, avoiding laboratory-regulated speech samples. The project, funded by the Agency of Science and Technology, is run jointly by six groups: 1) NAIST (Nara Institute of Science and Technology), 2) Kobe University, 3) Keio University, 4) Chiba University, 5) ICP Grenoble, and 6) ATR. So far, 1,000 hours of natural daily conversational speech data have been collected.

3. UTILIZATION OF SPEECH DATABASES

As mentioned above, a growing number of researchers disclose details of the speech databases they have used when they present results at scientific meetings. In order to accumulate precise statistics on corpus use, a survey on speech database usage was conducted.

3.1. Questionnaire Survey
We mailed questionnaire sheets to approximately 200 industry and university laboratories. At first, we considered conducting the survey by e-mail, but we expected that, given the recent flood of e-mail, the questionnaire would be ignored and people would not respond. Sixty-one responses (approximately 30% of the total) had been returned as of the end of April 2004; regrettably, this fell far short of our expectations. We also tried to increase the number of returns by sending e-mail requests during the collection period, but this had little effect.

The questions included:
1) On the use of speech databases:
   (1) Which databases have you used over the past 5 years, and for what purposes?
   (2) Which databases are you now using, and for what purposes?
   (3) Which databases will you use in the near future, and for what purposes?
   (4) Which databases do you have at present but have not yet used?
2) On present data collection or planning:
   (5) Please give us information about the databases you are now building or planning to build in the near future, if any.
3) On the databases needed for future speech research:
   (6) What kinds of databases will be necessary for future speech research and technology development?

The results for question (1) were almost the same as those for question (2). Respondents were allowed to answer anonymously so that, especially for people in industry, they could respond more freely. The response data were analyzed separately for industry, government institutions, universities, and an unknown (anonymous) group.

3.2. The Survey Results
Table 6 shows the number of returned responses, approximately 30% of the total, which was far less than anticipated. Table 7 shows the results for question (2), listing the eight most frequently used databases in order of frequency. As can be seen from the table, the most frequently used database is that of ATR, followed by the ASJ database. Both consist of speech read in a quiet studio or laboratory environment and include many identical sentences.

Table 6 Response to the questionnaire.

  Number of mails delivered               200
  Total responses returned                 61
    - University                           33
    - Industry                             16
    - Government/public res. institute      3
    - Unknown                               9

Table 7 Currently used speech databases.

  Database                           Number
  ATR speech database                  31
  ASJ speech database                  26
  JEIDA word speech database           14
  Other speech database                14
  Self-made speech database            13
  Corpus of Spontaneous Japanese        9
  RWCP                                  8
  TIMIT+TIDIGIT                         6

Table 8 Currently used speech databases (detail).

  Database                         Univ.  Ind.  Gov.  Unkn.
  ATR speech database                14     8     3     5
  ASJ speech database                16     1     1     4
  JEIDA word speech database          8     3     0     1
  Other speech database              11     0     1     2
  Self-made speech database           6     4     0     2
  Corpus of Spontaneous Japanese      6     2     0     0
  RWCP                                5     0     1     1
  TIMIT+TIDIGIT                       2     2     0     2

Table 9 The purposes of presently used databases.

  Database                          (a)   (b)   (c)   (d)   (e)
  ATR speech database                23     8     2     0     0
  ASJ speech database                20     2     1     1     0
  JEIDA speech database              12     0     1     0     0
  Other speech database              10     1     2     0     1
  Self-made speech database           5     2     4     2     0
  Corpus of Spontaneous Japanese      6     0     1     1     0
  RWCP                                5     0     0     2     0
  TIMIT+TIDIGIT                       5     0     0     0     1
  Total                              86    13    11     6     2

''Other speech database'' refers to databases not listed in the table. The number of self-made speech databases is unexpectedly large; many institutions still use their own databases, especially in the fields of physiology and linguistics. Some institutions had started to use the CSJ database before its distribution. Table 8 shows where the databases are used. In this table, ''Univ.'' stands for universities, ''Ind.'' for industry research institutions, ''Gov.'' for government research institutions, and ''Unkn.'' for the unknown (anonymous) group. Table 9 shows the purposes of the presently used databases, summarized into the following five categories:
(a) Speech recognition: recognition algorithms, evaluation, sentence comprehension, sentence summarization, speaker recognition
(b) Speech synthesis: synthesis methods, prosody analysis
(c) Acoustic analysis: acoustic/phonetic feature analysis, speech coding
(d) Language analysis: syntactic/semantic analysis, grammatical analysis, morphological analysis
(e) Speech/language education: Japanese language education

It is quite clear from this table that the majority of respondents use their databases for speech recognition and related research, which is natural considering the purpose of major database construction in Japan. However, usage for speech synthesis is unexpectedly small, even though it is the second largest category. Finally, we examined the responses concerning the speech databases considered necessary for future speech research. Table 10 shows the top ten responses. The database considered most important for future research is a dialogue/continuous speech corpus recorded in real living environments rather than in the laboratory, followed by a para-linguistic, emotional speech corpus.
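The kind of cross-tabulation behind Tables 8 and 9 is straightforward to reproduce from individual questionnaire records. The sketch below counts answers by database and by group or purpose category; the response records shown are invented for illustration, not the actual survey data.

```python
from collections import Counter

# Hypothetical response records; the real data came from the mailed sheets.
responses = [
    {"db": "ATR speech database", "group": "Univ.", "purpose": "(a)"},
    {"db": "ATR speech database", "group": "Ind.",  "purpose": "(b)"},
    {"db": "ASJ speech database", "group": "Univ.", "purpose": "(a)"},
]

by_group = Counter((r["db"], r["group"]) for r in responses)      # cf. Table 8
by_purpose = Counter((r["db"], r["purpose"]) for r in responses)  # cf. Table 9
print(by_group[("ATR speech database", "Univ.")])  # -> 1
```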

4. CONCLUSIONS

The results of a survey on the present status of Japanese speech database use have been described. The database projects in Japan started in the early 1980s. Several database projects have been undertaken since then, including the one initiated by the Advanced Telecommunication Research Institute (ATR), and we have now reached a point where an enormous amount of spontaneous speech data is available. A survey was conducted recently on the usage of the presently existing speech databases among industry and university institutions in Japan where speech research is actively going on. It revealed that the ATR continuous speech database is the most frequently used, followed by the equivalent version from the Acoustical Society of Japan, and it yielded a list of the speech corpora that respondents anticipate needing for future speech research.

Table 10 Speech databases needed for future research.

  1. Dialogue/continuous speech corpus recorded in real living environments
  2. Para-linguistic, emotional speech corpus
  3. Corpus for different speaking styles
  4. Bi-modal, multi-modal speech corpus
  5. Read speech corpus with prosody tags
  6. Corpus of broadcast news speech
  7. Corpus for speaker recognition, including data from the same speakers recorded at different times and ages
  8. English speech corpus pronounced by native Japanese speakers
  9. Mobile phone speech data with a large number of speakers
 10. Speech corpus with various kinds of information about production, including motion pictures

REFERENCES
[1] S. Itahashi, ''Speech database of discrete words,'' J. Acoust. Soc. Jpn. (J), 41, 723–726 (1985).
[2] K. Takeda, Y. Sagisaka, S. Katagiri and H. Kuwabara, ''A Japanese speech database for various kinds of research purposes,'' J. Acoust. Soc. Jpn. (J), 44, 747–754 (1988).
[3] T. Kobayashi, S. Itahashi, S. Hayamizu and T. Takezawa, ''ASJ continuous speech corpus for research,'' J. Acoust. Soc. Jpn. (J), 48, 888–893 (1992).
[4] S. Furui, K. Maekawa and H. Isahara, ''Intermediate results and perspectives of the project Spontaneous Speech: Corpus and Processing Technology,'' Proc. 2nd Spontaneous Speech Science and Technology Workshop, pp. 1–6 (2002).
[5] K. Maekawa, H. Koiso, S. Furui and H. Isahara, ''Spontaneous speech corpus of Japanese,'' Proc. LREC 2000, pp. 947–952 (2000).
[6] P. Mokhtari and N. Campbell, ''Automatic detection of acoustic centers of reliability for tagging paralinguistic information in expressive speech,'' Proc. LREC 2002, pp. 2015–2018 (2002).
[7] N. Kawaguchi, S. Matsubara, K. Takeda and F. Itakura, ''Multi-dimensional data acquisition for integrated acoustic information research,'' Proc. LREC 2002, pp. 2043–2046 (2002).
[8] H. Kuwabara, S. Itahashi, M. Yamamoto, T. Takezawa, S. Nakamura and K. Takeda, ''The present status of speech database in Japan: Development, management, and application to speech research,'' Proc. LREC 2002, pp. 10–15 (2002).
