Calibration of Microphone Arrays for Improved Speech Recognition
Michael L. Seltzer (1) and Bhiksha Raj (2)
1. Department of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
2. Mitsubishi Electric Research Lab, Cambridge, MA 02139 USA
4 September 2001
Introduction
• Current speech recognition technology is capable of good performance in quiet conditions with close-talking microphones.
• In many applications, however, the environment is noisy and the use of a close-talking microphone is impossible or inconvenient.
• As the distance between the user and the microphone grows, the signal becomes increasingly susceptible to distortions from the environment.
• Using an array of microphones, rather than a single microphone, has been proposed as a solution to this problem.
CMU Robust Speech Group
Microphone Array Processing
• Combine the multiple signals captured by the array to obtain a higher quality output signal, as judged (typically) by a human listener.
[Diagram: microphones MIC1 … MICN capture s[n]; an Array Processor combines them into the output ŝ[n]]
• Many array processing methods exist:
– Fixed/adaptive schemes, de-reverberation techniques, blind source separation.
• The objective of these methods is speech enhancement, a signal processing problem.
Automatic Speech Recognition (ASR)
• Parameterize the speech signal and compare the parameter sequence to statistical models of speech sound units to hypothesize what a user said.
• The speech signal is interpreted by a machine.
[Diagram: s[n] → Feature Extraction → observations {O_1, …, O_N} → decoder with acoustic model (AM) P(O|W) and language model (LM) P(W) → hypotheses {Ŵ_1, …, Ŵ_M}]

P(W | O) = P(O | W) P(W) / P(O)

• The objective is accurate recognition, a statistical pattern classification problem.
ASR with Microphone Arrays
• Recognition with microphone arrays has typically been performed by "gluing" the two systems together.
• We believe this is not the ideal approach.
– The two systems have different objectives.
– Neither system exploits information present in the other.
[Diagram: MIC1 … MIC4 → Array Proc → Feature Extraction → ASR]
A New Approach
• Consider the array processor and the speech recognizer to be components of a single interconnected system which allows information to pass in both directions.
• Develop an array processing scheme specifically targeted at improved speech recognition performance, without regard to conventional array processing objective criteria.
[Diagram: MIC1 … MIC4 → Array Proc ↔ Feature Extraction ↔ ASR, with information flowing in both directions]
ASR-based Array Processing
• The simplest beamforming technique, delay-and-sum, simply averages the time-aligned signals:

y[n] = (1/N) Σ_{i=1}^{N} x_i[n − τ_i]

• Other methods weight or filter the signals before combining:

y[n] = Σ_i α_i x_i[n − τ_i]        y[n] = Σ_i h_i[n] ⊗ x_i[n − τ_i]

• How do we choose the weights or filter coefficients to improve speech recognition performance?
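A minimal NumPy sketch of the two combination schemes above, using integer sample shifts for the delays; the function names and the truncation of all channels to a common length are illustrative choices, not part of the original formulation.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Average the microphone signals after removing each channel's
    alignment shift (in samples): y[n] = (1/N) sum_i x_i[n - tau_i]."""
    n = min(len(x) - d for x, d in zip(signals, delays))
    return sum(x[d:d + n] for x, d in zip(signals, delays)) / len(signals)

def filter_and_sum(signals, delays, filters):
    """Convolve each time-aligned channel with its FIR filter h_i,
    then sum: y[n] = sum_i h_i[n] (convolved with) x_i[n - tau_i]."""
    n = min(len(x) - d for x, d in zip(signals, delays))
    return sum(np.convolve(h, x[d:d + n])[:n]
               for x, d, h in zip(signals, delays, filters))
```

Delay-and-sum is the special case of filter-and-sum where every filter is a single tap of height 1/N.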
What Criterion Do We Want?
• We want an objective function that uses parameters directly related to recognition.
[Diagram: signals x_1 … x_M from MIC1 … MICM are delayed by τ_1 … τ_M, filtered by h_1 … h_M, and summed to y; feature extraction (FE) produces M_y, which is compared against the clean speech features M_s; the error ε between them is minimized]
An Objective Function for ASR
• Define Q as the sum of squared errors (SSE) of the log Mel spectra of the clean speech s and the noisy speech y:

Q = Σ_f Σ_l ( M_y[f, l] − M_s[f, l] )²

where y is the output of a filter-and-sum microphone array and M[f, l] is the l-th log Mel spectral value in frame f.
• M_y[f, l] is a function of the signals captured by the array and the filter parameters associated with each microphone.
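Once the two log Mel spectrograms are in hand, Q is a direct computation; this sketch assumes a (frames, mel_bands) matrix layout, and the function name is illustrative.

```python
import numpy as np

def objective_Q(M_y, M_s):
    """Q = sum_f sum_l (M_y[f, l] - M_s[f, l])^2 for log Mel spectra
    stored as (frames, mel_bands) matrices."""
    return float(np.sum((M_y - M_s) ** 2))
```

The dependence of M_y on the filter parameters is what makes minimizing Q a filter design problem rather than a simple distance computation.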
Calibration of Microphone Arrays for ASR
• Calibration of a filter-and-sum microphone array:
– Have a user speak an utterance with known transcription.
• With or without a close-talking microphone.
– Derive the optimal set of filters.
• Minimize the objective function with respect to the filter coefficients.
• Since the objective function is non-linear, use iterative gradient-based methods.
– Apply the filters to all future speech.
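A toy sketch of the joint filter optimization step. To stay self-contained it measures the error directly on the output waveform rather than on log Mel spectra, and SciPy's L-BFGS-B optimizer stands in for whatever gradient routine the authors used; all names and sizes are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def calibrate_filters(signals, delays, target, n_taps=4):
    """Jointly fit one FIR filter per microphone so the filter-and-sum
    output matches a target signal in the least-squares sense."""
    n_mics = len(signals)
    n = min(len(x) - d for x, d in zip(signals, delays))

    def output(h_flat):
        hs = h_flat.reshape(n_mics, n_taps)
        return sum(np.convolve(h, x[d:d + n])[:n]
                   for h, x, d in zip(hs, signals, delays))

    def Q(h_flat):
        return np.sum((output(h_flat) - target[:n]) ** 2)

    h0 = np.zeros(n_mics * n_taps)
    h0[::n_taps] = 1.0 / n_mics  # initialize at the delay-and-sum solution
    res = minimize(Q, h0, method="L-BFGS-B")
    return res.x.reshape(n_mics, n_taps)
```

With the true log Mel objective, Q is non-linear in the filter taps (because of the log), which is why the slides call for iterative gradient-based methods rather than a closed-form least-squares solve.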
Calibration Using a Close-talking Recording
• Given the close-talking microphone recording of the calibration utterance, derive an "optimal" filter for each channel to improve recognition.
[Diagram: signals from MIC1 … MICM are delayed by τ_1 … τ_M, filtered by h_1(n) … h_M(n), and summed; feature extraction (FE) yields M_y, which the optimizer (OPT) matches against the clean features M_s extracted from the close-talking recording s[n], before recognition by the ASR]
Multi-microphone Data Sets
• TMS
– Recorded in the CMU Speech Lab.
• Approx. 5m x 5m x 3m.
• Noise from computer fans, blowers, etc.
– Isolated letters and digits, keywords.
– 10 speakers * 14 utterances = 140 utterances.
– Each utterance has a close-talking mic control waveform.
[Diagram: recording setup; labeled dimensions 7 cm and 1 m]
Multi-microphone Data Sets (2)
• WSJ + off-axis noise source
– Room simulation created using the image method.
• 5m x 4m x 3m.
• 200 ms reverberation time.
• WGN source @ 5 dB SNR.
– WSJ test set.
• 5K word vocabulary.
• 10 speakers * 65 utterances = 650 utterances.
– Original recordings used as close-talking control waveforms.
[Diagram: simulated room layout; labeled dimensions 25 cm, 15 cm, 2 m, and 1 m]
Results
• TMS data set, WSJ0 + WGN point-source simulation.
– Constructed 50-point filters from a single calibration utterance.
– Applied the filters to all test utterances.
[Bar charts: WER (%) on TMS and WSJ for the conditions CLSTK, 1 MIC, D&S, and MEL OPT; y-axis 0–100%]
Calibration without a Close-talking Microphone
• Obtain an initial waveform estimate using a conventional array processing technique (e.g. delay-and-sum).
• Use the transcription and the recognizer to estimate the sequence of target clean log Mel spectra.
• Optimize the filter parameters as before.
Calibration w/o Close-talking Microphone (2)
• Force-align the delay-and-sum waveform to the known transcription to generate an estimated HMM state sequence.
[Diagram: signals from MIC1 … MICM are delayed by τ_1 … τ_M and summed; feature extraction (FE) and forced alignment (FALIGN) against the HMM and the known transcription produce the state sequence {q̂_1, q̂_2, …, q̂_N}]
Calibration w/o Close-talking Microphone (3)
• Extract the means from the single-Gaussian HMMs of the estimated state sequence.
– Since the models have been trained on clean speech, use these means as the target clean speech feature vectors.
[Diagram: state sequence {q̂_1, q̂_2, …, q̂_N} → HMM state means {μ_1, μ_2, …, μ_N} → IDCT → estimated clean log Mel spectra M̂_s]
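A sketch of the IDCT step above: the cepstral state means are mapped back to log Mel spectra by (pseudo-)inverting the truncated DCT used in MFCC extraction. The orthonormal DCT-II convention and the dimensions are assumptions; the slides do not specify them.

```python
import numpy as np

def dct_matrix(n_cep, n_mel):
    """Orthonormal DCT-II basis (n_cep x n_mel), as used to derive
    cepstra from log Mel spectra in MFCC extraction."""
    k = np.arange(n_cep)[:, None]
    n = np.arange(n_mel)[None, :]
    C = np.sqrt(2.0 / n_mel) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_mel))
    C[0] /= np.sqrt(2.0)
    return C

def cepstra_to_log_mel(mean_cepstra, n_mel=40):
    """Map HMM state-mean cepstral vectors (frames x n_cep) back to
    estimated log Mel spectra via the pseudo-inverse of the DCT."""
    C = dct_matrix(mean_cepstra.shape[1], n_mel)
    return mean_cepstra @ np.linalg.pinv(C).T
```

When fewer cepstra are kept than Mel bands (the usual case), the pseudo-inverse gives the minimum-norm log Mel estimate consistent with the cepstral means.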
Calibration w/o Close-talking Microphone (4)
• Use the estimated clean speech feature vectors to optimize the filters as before.
[Diagram: as in the close-talking case, but the optimizer (OPT) now matches M_y against the estimated targets M̂_s]
Results
• TMS data set, WSJ0 + WGN point-source simulation.
– Constructed 50-point filters from the calibration utterance.
– Applied the filters to all utterances.
[Bar charts: WER (%) on TMS and WSJ for CLSTK, 1 MIC, D&S, MEL OPT W/ CLSTK, and MEL OPT NO CLSTK; y-axis 0–100%]
Results (2)
• WER vs. SNR for WSJ + WGN.
– Constructed 50-point filters from the calibration utterance using the transcription only.
– Applied the filters to all utterances.
[Line plot: WER (%) vs. SNR (0–25 dB) for Closetalk, Optim-Calib, Delay-Sum, and 1 Mic]
Is Joint Filter Estimation Necessary?
• We compared 4 cases:
– Delay and Sum
– Optimize 1 filter for the Delay and Sum output
– Optimize the microphone array filters independently
– Optimize the microphone array filters jointly
[Bar chart: WER (%) on WSJ + WGN at 10 dB for Delay Sum, Delay Sum + 1 Filter, Indep Optim, and Joint Optim; y-axis 0–50%]
Summary and Future Work
• We have presented a new microphone array calibration scheme specifically designed for speech recognition.
• Using this method, we have achieved improvements in WER of up to 37% over conventional delay-and-sum processing.
• We have successfully fed information from the recognizer all the way back to the waveform level.
• We plan to investigate the following extensions to the algorithm: reverberation compensation, unsupervised optimization, and filter adaptation.