Calibration of Microphone Arrays for Improved Speech Recognition

Michael L. Seltzer¹ and Bhiksha Raj²
1. Department of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15217 USA
2. Mitsubishi Electric Research Lab, Cambridge, MA 02139 USA

4 September 2001

Introduction

• Current speech recognition technology is capable of good performance in quiet conditions with close-talking microphones.

• In many applications, the environment is noisy and the use of a close-talking microphone is impossible or inconvenient.

• As the distance between the user and the microphone grows, the signal is increasingly susceptible to distortions from the environment.

• Using an array of microphones, rather than a single microphone, has been proposed as a solution to this problem.

CMU Robust Speech Group


Microphone Array Processing

• Combine multiple signals captured by the array to obtain a higher quality output signal, as judged (typically) by a human listener.

[Diagram: microphones MIC1…MICN capture the source s[n]; the Array Processor combines them into an enhanced output ŝ[n].]

• Many array processing methods exist: fixed/adaptive schemes, de-reverberation techniques, blind source separation.

• The objective of these methods is speech enhancement, a signal processing problem.


Automatic Speech Recognition (ASR)

• Parameterize the speech signal and compare the parameter sequence to statistical models of speech sound units to hypothesize what a user said.

• The speech signal is interpreted by a machine.

[Diagram: s[n] → Feature Extraction → observations {O_1, …, O_N} → decoder with acoustic model (AM) P(O|W) and language model (LM) P(W) → hypotheses {Ŵ_1, …, Ŵ_M}, scored by]

    P(W | O) = P(O | W) P(W) / P(O)

• The objective is accurate recognition, a statistical pattern classification problem.
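The decision rule above can be illustrated with a toy NumPy sketch; the word list, likelihood values, and priors are all invented for the example, and P(O) is dropped because it is constant over W.

```python
import numpy as np

# Toy Bayes decision rule: pick the W that maximizes P(O|W) P(W).
# All numbers below are made up purely for illustration.
words = ["yes", "no", "maybe"]
likelihood = np.array([0.02, 0.05, 0.01])  # acoustic model scores P(O|W)
prior = np.array([0.5, 0.3, 0.2])          # language model scores P(W)

score = likelihood * prior                 # proportional to P(W|O)
best = words[int(np.argmax(score))]
```

In a real recognizer W ranges over word sequences and the scores come from HMM likelihoods and an n-gram language model, but the argmax has the same shape.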


ASR with Microphone Arrays

• Recognition with microphone arrays has been performed by "gluing" the two systems together.
• We believe this is not the ideal approach.
  – The systems have different objectives.
  – Neither system exploits information present in the other.

[Diagram: MIC1–MIC4 → Array Proc → Feature Extraction → ASR, a one-way pipeline.]

A new approach

• Consider the array processor and speech recognizer to be components of a single interconnected system which allows information to pass in both directions.
• Develop an array processing scheme specifically targeted at improved speech recognition performance, without regard to conventional array processing objective criteria.

[Diagram: MIC1–MIC4 → Array Proc ↔ Feature Extraction ↔ ASR, with information flowing in both directions.]

ASR-based Array Processing

• The simplest beamforming technique (delay and sum) simply averages the delayed signals:

    y[n] = (1/N) Σ_{i=1..N} x_i[n − τ_i]

• Others weight or filter the signals before combining:

    y[n] = Σ_i α_i x_i[n − τ_i]        y[n] = Σ_i h_i[n] ⊗ x_i[n − τ_i]

• How do we choose the weights or filter coefficients to improve speech recognition performance?
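The combination schemes above can be sketched in NumPy. Integer sample delays implemented with a circular shift are a simplifying assumption for illustration; real arrays need fractional delays.

```python
import numpy as np

def delay_and_sum(x, delays):
    """y[n] = (1/N) sum_i x_i[n - tau_i]: average the delayed channels.

    x: (N, L) array of channel signals; delays: integer sample delays.
    np.roll stands in for a true delay, a simplifying assumption.
    """
    y = np.zeros(x.shape[1])
    for xi, tau in zip(x, delays):
        y += np.roll(xi, tau)
    return y / x.shape[0]

def filter_and_sum(x, delays, filters):
    """y[n] = sum_i h_i[n] (*) x_i[n - tau_i]: per-channel FIR, then sum."""
    L = x.shape[1]
    y = np.zeros(L)
    for xi, tau, hi in zip(x, delays, filters):
        y += np.convolve(hi, np.roll(xi, tau))[:L]
    return y
```

With identical channels, zero delays, and unit-impulse filters, delay-and-sum reproduces the input and filter-and-sum returns it scaled by the channel count, which is a quick sanity check on either routine.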


What criterion do we want?

• We want an objective function that uses parameters directly related to recognition.

[Diagram: each microphone signal x_1…x_M is delayed (τ_1…τ_M), filtered (h_1…h_M), and summed to give y; Feature Extraction (FE) yields M_y, which is compared against the clean speech features M_s; the error ε between them is minimized.]

An Objective Function for ASR

• Define Q as the sum of squared errors between the log Mel spectra of the clean speech s and the noisy speech y:

    Q = Σ_f Σ_l (M_y[f, l] − M_s[f, l])²

  where y is the output of a filter-and-sum microphone array and M[f, l] is the lth log Mel spectral value in frame f.

• M_y[f, l] is a function of the signals captured by the array and the filter parameters associated with each microphone.
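A minimal NumPy sketch of Q, assuming a Mel filterbank matrix `fbank` (Mel bands × FFT bins) is supplied; windowing, pre-emphasis, and the frame sizes are generic simplifications, not the authors' settings.

```python
import numpy as np

def log_mel_spectra(sig, fbank, n_fft=256, hop=128):
    """Frame the signal and return log Mel spectra, one row per frame f."""
    frames = [sig[i:i + n_fft] for i in range(0, len(sig) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrum per frame
    return np.log(power @ fbank.T + 1e-10)             # small floor avoids log(0)

def objective_Q(y, s, fbank):
    """Q = sum_f sum_l (M_y[f, l] - M_s[f, l])^2."""
    My = log_mel_spectra(y, fbank)
    Ms = log_mel_spectra(s, fbank)
    return float(np.sum((My - Ms) ** 2))
```

Q is zero when the array output matches the clean reference and grows as the log Mel spectra diverge, which is exactly the behavior the calibration exploits.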


Calibration of Microphone Arrays for ASR

• Calibration of a filter-and-sum microphone array:
  – Have the user speak an utterance with a known transcription (with or without a close-talking microphone).
  – Derive the optimal set of filters:
    • Minimize the objective function with respect to the filter coefficients.
    • Since the objective function is non-linear, use iterative gradient-based methods.
  – Apply the filters to all future speech.
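The iterative minimization step can be sketched as plain gradient descent. The central finite-difference gradient below is a generic stand-in for the analytic gradients the method would use in practice, and the step size and iteration count are arbitrary choices for the sketch.

```python
import numpy as np

def calibrate_filters(h0, Q, lr=0.1, iters=200, eps=1e-5):
    """Minimize a scalar objective Q(h) over stacked filter coefficients h.

    Central finite differences approximate the gradient, standing in for
    the analytic gradient of the log Mel SSE objective.
    """
    h = h0.astype(float).copy()
    for _ in range(iters):
        g = np.zeros_like(h)
        for j in range(h.size):
            d = np.zeros_like(h)
            d.flat[j] = eps
            g.flat[j] = (Q(h + d) - Q(h - d)) / (2 * eps)
        h -= lr * g
    return h
```

On a toy quadratic objective the routine converges to the minimizer; on the real non-linear objective one would add a line search or a quasi-Newton update rather than a fixed step.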


Calibration Using Close-talking Recording

• Given the close-talking mic recording for the calibration utterance, derive an "optimal" filter for each channel to improve recognition.

[Diagram: the close-talking recording s[n] passes through Feature Extraction (FE) to give the target M_s; each array channel MIC1…MIC_M is delayed (τ), filtered (h_1(n)…h_M(n)), and summed; FE on the sum gives M_y; the optimizer (OPT) adjusts the filters so M_y matches M_s, and the result feeds the ASR.]

Multi-microphone data sets

• TMS — recorded in the CMU Speech Lab
  – Room approx. 5m x 5m x 3m; noise from computer fans, blowers, etc.
  – Isolated letters and digits, keywords
  – 10 speakers * 14 utterances = 140 utterances
  – Each utterance has a close-talking mic control waveform

[Diagram: array geometry, with dimensions marked 7cm and 1m.]

Multi-microphone data sets (2)

• WSJ + off-axis noise source
  – Room simulation created using the image method
    • 5m x 4m x 3m
    • 200ms reverberation time
    • WGN source @ 5dB SNR
  – WSJ test set
    • 5K word vocabulary
    • 10 speakers * 65 utterances = 650 utterances
  – Original recordings used as close-talking control waveforms

[Diagram: simulated room geometry, with dimensions marked 25cm, 2m, 15cm, and 1m.]

Results

• TMS data set; WSJ0 + WGN point source simulation
  – Constructed 50-point filters from a single calibration utterance
  – Applied the filters to all test utterances

[Bar charts: WER (%) on TMS (left) and WSJ (right) for four conditions: CLSTK (close-talking), 1 MIC, D&S (delay and sum), and MEL OPT (proposed calibration).]

Calibration without Close-talking Microphone

• Obtain an initial waveform estimate using a conventional array processing technique (e.g. delay and sum).
• Use the transcription and the recognizer to estimate the sequence of target clean log Mel spectra.
• Optimize the filter parameters as before.

Calibration w/o Close-talking Microphone (2)

• Force-align the delay-and-sum waveform to the known transcription to generate an estimated HMM state sequence.

[Diagram: the array channels are delayed (τ_1…τ_M) and summed into a waveform; Feature Extraction (FE) and forced alignment (FALIGN) against the HMM and the known transcription yield the estimated state sequence {q̂_1, q̂_2, …, q̂_N}.]

Calibration w/o Close-talking Microphone (3)

• Extract the means from the single-Gaussian HMMs of the estimated state sequence.
  – Since the models have been trained on clean speech, use these means as the target clean speech feature vectors.

[Diagram: state sequence {q̂_1, q̂_2, …, q̂_N} → HMM means {μ_1, μ_2, …, μ_N} → IDCT → estimated clean log Mel spectra M̂_s.]
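The IDCT step can be sketched as follows, assuming the state means are cepstra produced by an unnormalized DCT-II over n_mel log Mel bands; the basis convention and dimensions are assumptions here, since MFCC front ends differ in normalization.

```python
import numpy as np

def cepstra_to_log_mel(means, n_mel=20):
    """Map cepstral state means back to log Mel spectra (M̂_s) via inverse DCT.

    means: (N, n_cep) array of HMM state mean vectors.
    The DCT-II basis C[k, l] = cos(pi * k * (l + 0.5) / n_mel) is assumed.
    """
    n_cep = means.shape[1]
    k = np.arange(n_cep)[:, None]
    l = np.arange(n_mel)[None, :]
    C = np.cos(np.pi * k * (l + 0.5) / n_mel)
    # The pseudo-inverse gives a least-squares reconstruction, which also
    # covers truncated cepstra (n_cep < n_mel).
    return means @ np.linalg.pinv(C).T
```

When the cepstra keep all n_mel coefficients the mapping is exact; truncated cepstra recover only a smoothed log Mel envelope.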


Calibration w/o Close-talking Microphone (4)

• Use the estimated clean speech feature vectors to optimize the filters as before.

[Diagram: as in the close-talking case, but the optimizer (OPT) now targets the estimated M̂_s rather than features of a close-talking recording.]

Results

• TMS data set; WSJ0 + WGN point source simulation
  – Constructed 50-point filters from the calibration utterance
  – Applied the filters to all utterances

[Bar charts: WER (%) on TMS (left) and WSJ (right) for five conditions: CLSTK, 1 MIC, D&S, MEL OPT W/ CLSTK, and MEL OPT NO CLSTK.]

Results (2)

• WER vs. SNR for WSJ + WGN
  – Constructed 50-point filters from the calibration utterance using the transcription only
  – Applied the filters to all utterances

[Plot: WER (%) vs. SNR (dB) over 0–25 dB for four conditions: Closetalk, Optim-Calib, Delay-Sum, and 1 Mic.]

Is Joint Filter Estimation Necessary?

• We compared 4 cases:
  – Delay and Sum
  – Optimize 1 filter for the Delay and Sum output
  – Optimize the microphone array filters independently
  – Optimize the microphone array filters jointly

[Bar chart: WER (%) on WSJ + WGN at 10dB for the four cases: Delay Sum, Delay Sum + 1 Filter, Indep Optim, and Joint Optim.]

Summary and Future Work

• We have presented a new microphone array calibration scheme specifically designed for speech recognition.

• We have achieved improvements in WER of up to 37% over conventional delay-and-sum processing using this method.

• We have successfully fed information from the recognizer all the way back to the waveform level.

• We plan to investigate the following extensions to the algorithm: reverberation compensation, unsupervised optimization, and filter adaptation.

