Algorithms for Speech Recognition and Language Processing

Mehryar Mohri, AT&T Laboratories ([email protected])
Michael Riley, AT&T Laboratories ([email protected])
Richard Sproat, Bell Laboratories ([email protected])

Joint work with Emerald Chung, Donald Hindle, Andrej Ljolje, Fernando Pereira

Tutorial presented at COLING'96, August 3rd, 1996.

cmp-lg/9608018 v2 17 Sep 1996
Introduction (1)

Text and speech processing: hard problems.

Theory of automata:
– Appropriate level of abstraction
– Well-defined algorithmic problems
Introduction (2)

Three sections:
– Algorithms for text and speech processing (2h)
– Speech recognition (2h)
– Finite-state methods for language processing (2h)
PART I: Algorithms for Text and Speech Processing

Mehryar Mohri, AT&T Laboratories
[email protected]
August 3rd, 1996
Definitions: finite automata (1)

A = (Σ, Q, δ, I, F)

– Alphabet Σ
– Finite set of states Q
– Transition function δ: Q × Σ → 2^Q
– I ⊆ Q, set of initial states
– F ⊆ Q, set of final states

A recognizes L(A) = {w ∈ Σ* : δ(I, w) ∩ F ≠ ∅} (Hopcroft and Ullman, 1979; Perrin, 1990).

Theorem 1 (Kleene, 1965). A set is regular (or rational) iff it can be recognized by a finite automaton.
Definitions: finite automata (2)

Figure 1: Automata recognizing L(A) = Σ* aba.
Definitions: weighted automata (1)

A = (Σ, Q, δ, λ, σ, ρ, I, F)

– (Σ, Q, δ, I, F) is an automaton
– Initial output function λ
– Output function σ: Q × Σ × Q → K
– Final output function ρ

Function f: Σ* → (K, +, ·) associated with A:

∀u ∈ Dom(f), f(u) = Σ_{i ∈ I, q ∈ δ(i,u) ∩ F} λ(i) · σ(i, u, q) · ρ(q)
Definitions: weighted automata (2)

Figure 2: Index of t = aba (weighted automaton).
Definitions: rational power series

Power series: functions mapping Σ* to a semiring (K, +, ×)

– Notation: S = Σ_{w ∈ Σ*} (S, w) w, with coefficients (S, w)
– Support: supp(S) = {w ∈ Σ* : (S, w) ≠ 0}
– Sum: (S + T, w) = (S, w) + (T, w)
– Product: (S T, w) = Σ_{uv = w} (S, u)(T, v)
– Star: S* = Σ_{n ≥ 0} S^n

Rational power series: closure under the rational operations of polynomials (polynomial power series) (Salomaa and Soittola, 1978; Berstel and Reutenauer, 1988).

Theorem 2 (Schützenberger, 1961). A power series is rational iff it can be represented by a weighted finite automaton.
Definitions: transducers (1)

T = (Σ, ∆, Q, δ, σ, I, F)

– Finite alphabets Σ and ∆
– Finite set of states Q
– Transition function δ: Q × Σ → 2^Q
– Output function σ: Q × Σ × Q → ∆*
– I ⊆ Q, set of initial states
– F ⊆ Q, set of final states

T defines a relation:

R(T) = {(u, v) ∈ Σ* × ∆* : v ∈ ∪_{q ∈ δ(I,u) ∩ F} σ(I, u, q)}
Definitions: transducers (2)

Figure 3: Fibonacci normalizer (abb → baa; baa → abb).
Definitions: weighted transducers

Figure 4: Example — the weighted transducer maps aaba to bbcb along two paths, with weight sequences (0 0 1 0) and (0 1 1 0):

(min, +): aaba → min{1, 2} = 1
(+, ×): aaba → 0 + 0 = 0
Composition: Motivation (1)

– Construction of complex sets or functions from more elementary ones
– Modularity (modules, distinct linguistic descriptions)
– On-the-fly expansion
Composition: Motivation (2)

source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program

Figure 5: Phases of a compiler (Aho et al., 1986).
Composition: Motivation (3)

Source text → Spellchecker → Inflected forms → Index → Set of positions

Figure 6: Complex indexation.
Composition: Example (1)

Figure 7: Composition of transducers.
Composition: Example (2)

Figure 8: Composition of weighted transducers (+, ×).
Composition: Algorithm (1)

– Construction of pairs of states:
  – Match: q1 →[a:b/w1] q1′ and q2 →[b:c/w2] q2′
  – Result: (q1, q2) →[a:c/(w1 ⊗ w2)] (q1′, q2′)
– Elimination of the redundancy of ε-paths: filter
– Complexity: quadratic
– On-the-fly implementation
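The pair construction is easy to sketch in code. Below is a minimal Python sketch for the ε-free case over the (min, +) semiring (so ⊗ is +); the dict-based transducer format is an assumption made for illustration, not the representation used in the tutorial's implementation.

```python
from collections import deque

def compose(t1, t2):
    """Compose two epsilon-free weighted transducers over (min, +).
    A transducer is {'start': s, 'finals': set_of_states,
    'arcs': {state: [(ilabel, olabel, weight, nextstate)]}}.
    A result arc is created when an output label of t1 matches an
    input label of t2; weights combine by + (the semiring's ⊗)."""
    start = (t1['start'], t2['start'])
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        q = queue.popleft()
        q1, q2 = q
        if q1 in t1['finals'] and q2 in t2['finals']:
            finals.add(q)
        out = []
        for i1, o1, w1, n1 in t1['arcs'].get(q1, []):
            for i2, o2, w2, n2 in t2['arcs'].get(q2, []):
                if o1 == i2:                        # matching labels
                    n = (n1, n2)
                    out.append((i1, o2, w1 + w2, n))
                    if n not in seen:
                        seen.add(n)
                        queue.append(n)
        arcs[q] = out
    return {'start': start, 'finals': finals, 'arcs': arcs}
```

Only pairs reachable from the start pair are ever built, which is also what makes the on-demand variant in Part II natural.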
Composition: Algorithm (2)

Figure 9: Composition of weighted transducers with ε-transitions (the ε-arcs of A and B are distinguished as ε1 and ε2 in A′ and B′).
Composition: Algorithm (3)

Figure 10: Redundancy of ε-paths.
Composition: Algorithm (4)

Figure 11: Filter for efficient composition.
Composition: Theory

– Transductions (Elgot and Mezei, 1965; Eilenberg, 1974–1976; Berstel, 1979)

Theorem 3. Let τ1 and τ2 be two (weighted) automata or transducers; then τ1 ∘ τ2 is a (weighted) automaton or transducer.

– Efficient composition of weighted transducers (Mohri, Pereira, and Riley, 1996)
– Works with any semiring
– Intersection: composition of (weighted) automata
Intersection: Example

Figure 12: Intersection of automata.
Union: Example

Figure 13: Union of weighted automata (min, +).
Determinization: Motivation (1)

– Efficiency of use (time)
– Elimination of redundancy
– No loss of information (≠ pruning)
Determinization: Motivation (2)

Figure 14: Toy language model (16 states, 53 transitions, 162 paths).
Determinization: Motivation (3)

Figure 15: Determinized language model (9 states, 11 transitions, 4 paths).
Determinization: Example (1)

Figure 16: Determinization of automata (subset states {0}, {1,2}, {3}).
Determinization: Example (2)

Figure 17: Determinization of weighted automata (min, +).
Determinization: Example (3)

Figure 18: Determinization of transducers.
Determinization: Example (4)

Figure 19: Determinization of weighted transducers (min, +).
Determinization: Algorithm (1)

– Generalization of the classical algorithm for automata:
  – Powerset construction
  – Subsets made of (state, weight) or (state, string, weight) pairs
– Applies to subsequentiable weighted automata and transducers
– Time and space complexity: exponential (polynomial w.r.t. the size of the result)
– On-the-fly implementation
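The weighted powerset construction can be sketched as follows for the (min, +) semiring, with subsets of (state, residual weight) pairs as above. The dict-based automaton format is an illustrative assumption, and the input is assumed to satisfy the twin property of the next slide.

```python
from collections import deque

def determinize(wfa):
    """Determinization of a weighted automaton over (min, +).
    `wfa`: {'start': s, 'finals': {state: final_weight},
            'arcs': {state: [(label, weight, nextstate)]}}.
    Result states are frozensets of (state, residual) pairs."""
    start = frozenset([(wfa['start'], 0)])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        subset = queue.popleft()
        fw = [r + wfa['finals'][q] for q, r in subset if q in wfa['finals']]
        if fw:
            finals[subset] = min(fw)              # semiring sum = min
        by_label = {}
        for q, r in subset:
            for label, w, nxt in wfa['arcs'].get(q, []):
                by_label.setdefault(label, []).append((r + w, nxt))
        out = []
        for label, pairs in by_label.items():
            w_min = min(w for w, _ in pairs)      # weight emitted now
            dest = {}
            for w, nxt in pairs:                  # leftover residuals
                dest[nxt] = min(dest.get(nxt, float('inf')), w - w_min)
            nsub = frozenset(dest.items())
            out.append((label, w_min, nsub))
            if nsub not in seen:
                seen.add(nsub)
                queue.append(nsub)
        arcs[subset] = out
    return {'start': start, 'finals': finals, 'arcs': arcs}
```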
Determinization: Algorithm (2)

Conditions of application. Twin states: q and q′ are twin states iff:
– If: they can be reached from the initial states by the same input string u,
– Then: cycles at q and q′ with the same input string v have the same output value.

Theorem 4 (Choffrut, 1978; Mohri, 1996a). Let τ be an unambiguous weighted automaton (transducer, weighted transducer); then τ can be determinized iff it has the twin property.

Theorem 5 (Mohri, 1996a). The twin property can be tested in polynomial time.
Determinization: Theory

– Determinization of automata:
  – General case (Aho, Sethi, and Ullman, 1986)
  – Specific case of Σ*: failure functions (Mohri, 1995)
– Determinization of transducers, weighted automata, and weighted transducers:
  – General description, theory and analysis (Mohri, 1996a; Mohri, 1996b)
  – Conditions of application and test algorithm
  – Acyclic transducers and weighted transducers admit determinization
– Can be used with other semirings (e.g., (R, +, ×))
Local determinization: Motivation

– Time efficiency
– Reduction of redundancy
– Control of the resulting size (flexibility)
– Equivalent function (or equal set)
– No loss of information
Local determinization: Example

Figure 20: Local determinization of weighted transducers (min, +).
Local determinization: Algorithm

– Predicate, e.g.: P(q) ≡ (out_degree(q) > k), with threshold parameter k
– Local: Dom(det) = {q : P(q)}
– Determinization only for q ∈ Dom(det)
– On-the-fly implementation
– Complexity: O(|Dom(det)| · max_{q ∈ Q} out_degree(q))
Local determinization: Theory

– Various choices of predicate (constraint: local)
– Definition of parameters
– Applies to all automata, weighted automata, transducers, and weighted transducers
– Can be used with other semirings (e.g., (R, +, ×))
Minimization: Motivation

– Space efficiency
– Equivalent function (or equal set)
– No loss of information (≠ pruning)
Minimization: Motivation (2)

Figure 21: Determinized language model.
Minimization: Motivation (3)

Figure 22: Minimized language model.
Minimization: Example (1)

Figure 23: Minimization of automata.
Minimization: Example (2)

Figure 24: Minimization of weighted automata (min, +).
Minimization: Example (3)

Figure 25: Minimization of transducers.
Minimization: Example (4)

Figure 26: Minimization of weighted transducers (min, +).
Minimization: Algorithm (1)

– Two steps:
  – Pushing: extraction of strings or weights towards the initial state
  – Classical minimization of automata, with each (input, output) pair considered as a single label
– Algorithm for the first step:
  – Transducers: specific algorithm
  – Weighted automata: shortest-paths algorithms
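The pushing step for weighted automata reduces to a shortest-distance computation, as noted above. A sketch over (min, +), assuming the dict format of the earlier sketches and that every state can reach a final state:

```python
import heapq

def push_weights(wfa):
    """Weight pushing over (min, +): d[q] is the shortest distance
    from q to acceptance (Dijkstra on the reversed automaton, seeded
    with the final weights); each arc weight w becomes
    w + d[next] - d[q], and d[start] moves to the initial weight."""
    rev = {}
    for q, arcs in wfa['arcs'].items():
        for label, w, nxt in arcs:
            rev.setdefault(nxt, []).append((w, q))
    d = dict(wfa['finals'])                  # seed with final weights
    heap = [(w, q) for q, w in d.items()]
    heapq.heapify(heap)
    while heap:
        dist, q = heapq.heappop(heap)
        if dist > d.get(q, float('inf')):
            continue                         # stale queue entry
        for w, p in rev.get(q, []):
            nd = dist + w
            if nd < d.get(p, float('inf')):
                d[p] = nd
                heapq.heappush(heap, (nd, p))
    pushed = {q: [(label, w + d[nxt] - d[q], nxt)  # d defined for all
                  for label, w, nxt in arcs]       # co-accessible states
              for q, arcs in wfa['arcs'].items()}
    return {'start': wfa['start'], 'initial_weight': d[wfa['start']],
            'finals': {q: 0 for q in wfa['finals']}, 'arcs': pushed}
```

After pushing, equivalent states carry identical outgoing (label, weight) sets, so classical minimization applies directly.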
Minimization: Algorithm (2)

Complexity, where E is the set of transitions, S the sum of the lengths of the output strings, and Pmax the longest of the longest common prefixes of the output paths leaving each state:

Type                General                           Acyclic
Automata            O(|E| log |Q|)                    O(|Q| + |E|)
Weighted automata   O(|E| log |Q|)                    O(|Q| + |E|)
Transducers         O(|Q| + |E| (log |Q| + |Pmax|))   O(S + |E| + |Q| + (|E| − (|Q| − |F|)) |Pmax|)
Minimization: Theory

– Minimization of automata (Aho, Hopcroft, and Ullman, 1974; Revuz, 1991)
– Minimization of transducers (Mohri, 1994)
– Minimization of weighted automata (Mohri, 1996a):
  – Minimal number of transitions
  – Test of equivalence
– Standardization of power series (Schützenberger, 1961):
  – Works only with fields
  – Creates too many transitions
Conclusion (1)

– Theory:
  – Rational power series
  – Weighted automata and transducers
– Algorithms:
  – General (various semirings)
  – Efficient (used in practice, at large sizes)
Conclusion (2)

Applications:
– Text processing (spelling checkers, pattern-matching, indexation, OCR)
– Language processing (morphology, phonology, syntax, language modeling)
– Speech processing (speech recognition, text-to-speech synthesis)
– Computational biology (matching with errors)
– Many other applications
PART II: Speech Recognition

Michael Riley, AT&T Laboratories
[email protected]
August 3rd, 1996
Overview

– The speech recognition problem
– Acoustic, lexical and grammatical models
– Finite-state automata in speech recognition
– Search in finite-state automata
Speech Recognition

Given an utterance, find its most likely written transcription. Fundamental ideas:

– Utterances are built from sequences of units
– Acoustic correlates of a unit are affected by surrounding units
– Units combine into units at a higher level: phones → syllables → words
– Relationships between levels can be modeled by weighted graphs; we use weighted finite-state transducers
– Recognition: find the best path in a suitable product graph
Levels of Speech Representation
Maximum A Posteriori Decoding

Overall analysis [4, 57]:

– Acoustic observations: parameter vectors derived by local spectral analysis of the speech waveform at regular (e.g., 10 msec) intervals
– Observation sequence o
– Transcriptions w
– Probability P(o|w) of observing o when w is uttered

Maximum a posteriori decoding:

ŵ = argmax_w P(w|o) = argmax_w P(o|w) P(w) / P(o) = argmax_w P(o|w) P(w)

where P(o|w) is the generative model and P(w) the language model.
Generative Models of Speech

Typical decomposition of P(o|w) into conditionally independent mappings between levels:

– Acoustic model P(o|p): phone sequences → observation sequences. Detailed model:
  – P(o|d): distributions → observation vectors (symbolic → quantitative)
  – P(d|m): context-dependent phone models → distribution sequences
  – P(m|p): phone sequences → model sequences
– Pronunciation model P(p|w): word sequences → phone sequences
– Language model P(w): word sequences
Recognition Cascades: General Form

Multistage cascade:

o = s_k → [stage k] → s_{k−1} → ⋯ → s_1 → [stage 1] → w = s_0

Find s_0 maximizing

P(s_0, s_k) = P(s_k | s_0) P(s_0) = P(s_0) Σ_{s_1,…,s_{k−1}} Π_{1 ≤ j ≤ k} P(s_j | s_{j−1})

"Viterbi" approximation:

Cost(s_0, s_k) = Cost(s_k | s_0) + Cost(s_0)
Cost(s_k | s_0) ≈ min_{s_1,…,s_{k−1}} Σ_{1 ≤ j ≤ k} Cost(s_j | s_{j−1})

where Cost(⋯) = −log P(⋯).
Speech Recognition Problems

– Modeling: how to describe accurately the relations between levels ⇒ modeling errors
– Search: how to find the best interpretation of the observations according to the given models ⇒ search errors
Acoustic Modeling – Feature Selection I

Short-time spectral analysis:

log | ∫ g(τ) x(t + τ) e^{−i2πfτ} dτ |

– Short-time (25 msec Hamming window) spectrum of /ae/ (Hz vs. dB)
– Scale selection:
  – Cepstral smoothing
  – Parameter sampling (13 parameters)
Acoustic Modeling – Feature Selection II [40, 38]

Refinements:
– Time derivatives: 1st and 2nd order
– Non-Fourier analysis (e.g., Mel scale)
– Speaker/channel adaptation:
  – Mean cepstral subtraction
  – Vocal tract normalization
  – Linear transformations

Result: a 39-dimensional feature vector (13 cepstra, 13 delta cepstra, 13 delta-delta cepstra) every 10 milliseconds.
Acoustic Modeling – Stochastic Distributions [4, 61, 39, 5]

– Vector quantization: find a codebook of prototypes
– Full-covariance multivariate Gaussians:

P[y] = (2π)^{−N/2} |S|^{−1/2} exp(−(1/2) (y − μ)ᵀ S^{−1} (y − μ))

– Diagonal-covariance Gaussian mixtures
– Semi-continuous, tied mixtures
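As a concrete instance of the densities above, here is a small sketch evaluating the log-likelihood of a feature vector under a diagonal-covariance Gaussian mixture (the component parameters are made up):

```python
import math

def log_gaussian_mixture(y, mixture):
    """Log-likelihood of vector y under a diagonal-covariance
    Gaussian mixture; `mixture` = [(weight, means, variances), ...]."""
    log_terms = []
    for w, mu, var in mixture:
        ll = math.log(w)
        for yi, mi, vi in zip(y, mu, var):
            # per-dimension Gaussian log-density
            ll -= 0.5 * (math.log(2 * math.pi * vi) + (yi - mi) ** 2 / vi)
        log_terms.append(ll)
    m = max(log_terms)          # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

mix = [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [3.0, 3.0], [0.5, 0.5])]
print(log_gaussian_mixture([0.1, -0.2], mix))
```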
Acoustic Modeling – Units and Training [61, 36]

– Units:
  – Phonetic (sub-word) units, e.g., cat → /k ae t/
  – Context-dependent units: ae_{k,t}
  – Multiple distributions (states) per phone: left, middle, right
– Training:
  – Given a segmentation, training is straightforward
  – Obtain the segmentation by transcription
  – Iterate until convergence
Generating Lexicons – Two Steps

Orthography → Phonemes: "had" → /hh ae d/; "your" → /y uw r/
– Complex, context-independent mapping
– Usually a small number of alternatives
– Determined by spelling constraints; lexical "facts"
– Large online dictionaries available

Phonemes → Phones:
/hh ae d y uw r/ → [hh ae dcl jh axr]   (60% prob)
/hh ae d y uw r/ → [hh ae dcl d y axr]  (40% prob)
– Complex, context-dependent mapping
– Many possible alternatives
– Determined by phonological and phonetic constraints
Decision Trees: Overview [9]

Description/Use: Simple structure: a binary tree of decisions, where terminal nodes determine the prediction (cf. "Game of Twenty Questions"). If the dependent variable is categorical (e.g., red, yellow, green), it is called a "classification tree"; if continuous, a "regression tree".

Creation/Estimation: Creating a binary decision tree for classification or regression involves three steps (Breiman et al.):
1. Splitting rules: Which split to take at a node?
2. Stopping rules: When to declare a node terminal?
3. Node assignment: Which class/value to assign to a terminal node?
1. Decision Tree Splitting Rules

Which split to take at a node? Candidate splits considered:
– Binary cuts: for continuous −∞ < x < ∞, consider splits of the form x ≤ k vs. x > k, ∀k.
– Binary partitions: for categorical x ∈ {1, 2, …, n} = X, consider splits of the form x ∈ A vs. x ∈ X − A, ∀A ⊂ X.
1. Decision Tree Splitting Rules – Continued

Choosing the best candidate split:
– Method 1: Choose k (continuous) or A (categorical) that minimizes the estimated classification (regression) error after the split.
– Method 2 (for classification): Choose k or A that minimizes the estimated entropy after the split.

Example (both candidate splits leave the same 400/400 parent and the same classification error, 200, but SPLIT #2 yields a pure node and lower entropy):

SPLIT #1: (No. 1: 400, No. 2: 400) → (No. 1: 300, No. 2: 100) and (No. 1: 100, No. 2: 300)
SPLIT #2: (No. 1: 400, No. 2: 400) → (No. 1: 200, No. 2: 400) and (No. 1: 200, No. 2: 0)
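A short sketch of Method 2, computing the average entropy after each of the two candidate splits above (the counts are taken from the slide):

```python
from math import log2

def entropy(counts):
    """Entropy (bits) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_entropy(children):
    """Size-weighted average entropy of the child nodes of a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

split1 = [[300, 100], [100, 300]]   # SPLIT #1 children
split2 = [[200, 400], [200, 0]]     # SPLIT #2 children
print(split_entropy(split1))        # ~0.811 bits
print(split_entropy(split2))        # ~0.689 bits: preferred by Method 2
```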
2. Decision Tree Stopping Rules

When to declare a node terminal? Strategy (cost-complexity pruning):
1. Grow an over-large tree.
2. Form a sequence of subtrees T_0, …, T_n, ranging from the full tree to just the root node.
3. Estimate an "honest" error rate for each subtree.
4. Choose the tree size with minimum "honest" error rate.

To form the sequence of subtrees, vary α from 0 (full tree) to ∞ (root only) in:

min_T R(T) + α|T|

To estimate the "honest" error rate, test on data different from the training data, e.g., grow the tree on 9/10 of the available data and test on 1/10, repeating 10 times and averaging (cross-validation).
End of Declarative Sentence Prediction: Pruning Sequence

[Plot: error rate (0.0 to 0.025) vs. number of terminal nodes (0 to 100); + = raw (training) error, o = cross-validated error.]
3. Decision Tree Node Assignment

Which class/value to assign to a terminal node? Plurality vote: choose the most frequent class at that node for classification; choose the mean value for regression.
End-of-Declarative-Sentence Prediction: Features [65]

– Prob[word with "." occurs at end of sentence]
– Prob[word after "." occurs at beginning of sentence]
– Length of word with "."
– Length of word after "."
– Case of word with ".": Upper, Lower, Cap, Numbers
– Case of word after ".": Upper, Lower, Cap, Numbers
– Punctuation after "." (if any)
– Abbreviation class of word with ".": e.g., month name, unit-of-measure, title, address name, etc.
End of Declarative Sentence?

[Decision tree diagram: the root (yes, 48294/52895) splits on bprob, then on eprob, the case of the next word, and the abbreviation type, reaching yes/no leaves with counts such as 42755/42875 and 3289/3547.]
Phoneme-to-Phone Alignment

WORD:     purpose            and      respect
PHONEME:  p er p ax s        ae n d   r ih s p eh k t
PHONE:    p er pcl p ix s    ax n –   r ix s pcl p eh kcl t
Phoneme-to-Phone Realization: Features [66, 10, 62]

– Phonemic context:
  – Phoneme to predict
  – Three phonemes to left
  – Three phonemes to right
– Stress (0, 1, 2)
– Lexical position:
  – Phoneme count from start of word
  – Phoneme count from end of word
Phoneme-to-Phone Realization: Prediction Example

Tree splits for /t/ in "your pretty red":

PHONE   COUNT    SPLIT
ix      182499   cm0: vstp,ustp,vfri,ufri,vaff,uaff,nas
n       87283    cm0: vstp,ustp,vaff,uaff
kcl+k   38942    cp0: alv,pal
tcl+t   21852    cm0: ustp
tcl+t   11928    vm1: mono,rvow,wdi,ydi
tcl+t   5918     cm-1: ustp,rho,n/a
dx      3639     rstr: n/a,no
dx      2454
Phoneme-to-Phone Realization: Network Example

Phonetic network for "Don had your pretty...": each phoneme of /d aa n hh ae d y uw r p r ih t iy/ maps to weighted phone alternatives, e.g., hh → hh (0.74) or hv (0.15); ae → ae (0.73) or eh (0.19); d (in "had your") → dcl jh (0.51), dcl d (0.37) or tcl t (0.11), depending on context; r → axr (0.48) or er (0.29); t (in "pretty") → dx (0.73); iy → iy (0.90).
Acoustic Model Context Selection [92, 39]

– Statistical regression trees used to predict contexts based on distribution variance
– One tree per context-independent phone and state (left, middle, right)
– Trees grown until the data criterion of 500 frames per distribution was met
– Trees pruned using cost-complexity pruning and cross-validation to select the best contexts
– About 44,000 context-dependent phone models
– About 16,000 distributions
N-Grams: Basics

'Chain rule' and joint/conditional probabilities:

P[x_1 x_2 … x_N] = P[x_N | x_1 … x_{N−1}] P[x_{N−1} | x_1 … x_{N−2}] ⋯ P[x_2 | x_1] P[x_1]

where, e.g., P[x_N | x_1 … x_{N−1}] = P[x_1 … x_N] / P[x_1 … x_{N−1}].

(First-order) Markov assumption:

P[x_k | x_1 … x_{k−1}] = P[x_k | x_{k−1}] = P[x_{k−1} x_k] / P[x_{k−1}]

nth-order Markov assumption:

P[x_k | x_1 … x_{k−1}] = P[x_k | x_{k−n} … x_{k−1}] = P[x_{k−n} … x_k] / P[x_{k−n} … x_{k−1}]
N-Grams: Maximum Likelihood Estimation

Let N be the total number of n-grams observed in a corpus and c(x_1 … x_n) the number of times the n-gram x_1 … x_n occurred. Then

P[x_1 … x_n] = c(x_1 … x_n) / N

is the maximum likelihood estimate of that n-gram probability. For conditional probabilities,

P[x_n | x_1 … x_{n−1}] = c(x_1 … x_n) / c(x_1 … x_{n−1})

is the maximum likelihood estimate. With this method, an n-gram that does not occur in the corpus is assigned zero probability.
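A minimal sketch of the MLE conditional estimates for bigrams (the toy corpus is made up):

```python
from collections import Counter

def mle_bigram_model(tokens):
    """P[x2 | x1] = c(x1 x2) / c(x1), exactly as on the slide.
    Any bigram absent from the corpus implicitly has probability 0."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(x1, x2): c / unigrams[x1] for (x1, x2), c in bigrams.items()}

probs = mle_bigram_model("the cat sat on the mat".split())
print(probs[("the", "cat")])   # 0.5: "the" occurs twice, once before "cat"
```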
N-Grams: Good-Turing-Katz Estimation [29, 16]

Let n_r be the number of n-grams that occurred r times. Then

P[x_1 … x_n] = c*(x_1 … x_n) / N

is the Good-Turing estimate of that n-gram probability, where

c*(x) = (c(x) + 1) n_{c(x)+1} / n_{c(x)}.

For conditional probabilities,

P[x_n | x_1 … x_{n−1}] = c*(x_1 … x_n) / c(x_1 … x_{n−1}),   for c(x_1 … x_n) > 0,

is Katz's extension of the Good-Turing estimate. With this method, an n-gram that does not occur in the corpus is assigned the backoff probability

P[x_n | x_1 … x_{n−1}] = α P[x_n | x_2 … x_{n−1}],

where α is a normalizing constant.
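The discounted counts c* follow directly from the definition. A raw sketch; practical estimators smooth the n_r values first, which this version deliberately does not:

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Good-Turing discounted counts:
    c*(x) = (c(x) + 1) * n_{c(x)+1} / n_{c(x)},
    where n_r = number of distinct n-grams observed r times."""
    n_r = Counter(ngram_counts.values())
    discounted = {}
    for ngram, c in ngram_counts.items():
        if n_r.get(c + 1):
            discounted[ngram] = (c + 1) * n_r[c + 1] / n_r[c]
        else:
            discounted[ngram] = None   # undefined unless n_r is smoothed
    return discounted
```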
Finite-State Modeling [57]

Our view of recognition cascades: represent the mappings between levels, the observation sequences and the language uniformly with weighted finite-state machines:

– Probabilistic mapping P(x|y): weighted finite-state transducer. Example, a word pronunciation transducer for "data": d:ε/1 · (ey:ε/.4 | ae:ε/.6) · (dx:ε/.8 | t:ε/.2) · ax:"data"/1
– Language model P(w): weighted finite-state acceptor
Example of Recognition Cascade

observations → O → A → (phones) → D → (words) → M

Recognition from observations o by composition:
– Observations: O(s, s) = 1 if s = o, 0 otherwise
– Acoustic-phone transducer: A(a, p) = P(a|p)
– Pronunciation dictionary: D(p, w) = P(p|w)
– Language model: M(w, w) = P(w)

Recognition: ŵ = argmax_w (O ∘ A ∘ D ∘ M)(o, w)
Speech Models as Weighted Automata

– Quantized observations: o_1 o_2 … o_n, at times t_0 t_1 … t_n
– Phone model A_π: observations → phones. A left-to-right transducer with self-loops o_i:ε/p_00(i), o_i:ε/p_11(i), o_i:ε/p_22(i), forward arcs o_i:ε/p_01(i), o_i:ε/p_12(i), and exit arc ε:π/p_2f
– Acoustic transducer: A = (Σ_π A_π)*
– Word pronunciation D_data: phones → words, e.g., d:ε/1 · (ey:ε/.4 | ae:ε/.6) · (dx:ε/.8 | t:ε/.2) · ax:"data"/1
– Dictionary: D = Σ_w D_w
Example: Phone Lattice O ∘ A

Lattices: weighted acyclic graphs representing possible interpretations of an utterance as sequences of units at a given level of representation (phones, syllables, words, …).

[Figure: phone lattice obtained by composing the observation sequence for "hostile battle" with the acoustic model, with negative-log-cost phone alternatives such as hh, aa/ao, s/f, t, el, b, ae.]
Sample Pronunciation Dictionary D

[Figure: dictionary with "hostile", "battle" and "bottle" as a weighted transducer.]
Sample Language Model M

[Figure: simplified language model as a weighted acceptor over "hostile", "battle", "bottle".]
Recognition by Composition

– From phones to words: compose the dictionary with the phone lattice to yield a word lattice with combined acoustic and pronunciation costs: hostile/-32.900 followed by battle/-26.825
– Applying the language model: compose the word lattice with the language model to obtain a word lattice with combined acoustic, pronunciation and language model costs: paths such as hostile/-21.781 or hostile/-19.407 followed by battle/-17.916 or battle/-15.250
Context-Dependency Examples

– Context-dependent phone models: maps from CI units to CD units. Example: ae/b_d → ae_{b,d}
– Context-dependent allophonic rules: maps from baseforms to detailed phones. Example: t/V′_V → dx
– Difficulty: cross-word contexts. Where several words enter and leave a state in the grammar, substitution does not apply.
Context-Dependency Transducers

Example: triphonic context transducer for two symbols x and y.

[Figure: four states x.x, x.y, y.y, y.x with arcs x/x_x:x, x/x_y:x, y/x_y:y, x/y_y:x, x/y_x:x, y/y_y:y, y/x_x:y, y/y_x:y.]
Generalized State Machines

All of the above networks have bounded context and thus can be represented as generalized state machines. A generalized state machine M:

– Supports these operations:
  – M.start: returns the start state
  – M.final(state): returns 1 if final, 0 if non-final
  – M.arcs(state): returns the transitions (a_1, …, a_N) leaving state, where a_i = (ilabel, olabel, weight, nextstate)
– Does not necessarily support:
  – Providing the number of states
  – Expanding states that have not already been discovered
On-Demand Composition [69, 53]

Create a generalized state machine C for the composition A ∘ B:

C.start := (A.start, B.start)
C.final((s1, s2)) := A.final(s1) ∧ B.final(s2)
C.arcs((s1, s2)) := Merge(A.arcs(s1), B.arcs(s2))

Merged arcs are defined by:

(l1, l3, x + y, (ns1, ns2)) ∈ Merge(A.arcs(s1), B.arcs(s2))
iff (l1, l2, x, ns1) ∈ A.arcs(s1) and (l2, l3, y, ns2) ∈ B.arcs(s2)
State Caching

Create a generalized state machine B for an input machine A:

B.start := A.start
B.final(state) := A.final(state)
B.arcs(state) := A.arcs(state)

Cache disciplines:
– Expand each state of A exactly once, i.e., always save in the cache (memoize)
– Cache, but forget 'old' states using a least-recently-used criterion
– Use instructions (ref counts) from the user (decoder) to save and forget
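The Merge rule and the memoize cache discipline combine naturally into one lazy machine. A Python sketch, assuming machines expose .start, .final(state) and .arcs(state) as defined above, with weights combined by +:

```python
class LazyCompose:
    """On-demand composition A ∘ B of two generalized state machines.
    States are pairs, expanded only when arcs() is first called and
    then memoized (the 'expand exactly once' cache discipline)."""
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.start = (a.start, b.start)
        self._cache = {}

    def final(self, state):
        s1, s2 = state
        return self.a.final(s1) and self.b.final(s2)

    def arcs(self, state):
        if state not in self._cache:
            s1, s2 = state
            self._cache[state] = [
                (l1, l3, x + y, (ns1, ns2))
                for (l1, l2, x, ns1) in self.a.arcs(s1)
                for (m2, l3, y, ns2) in self.b.arcs(s2)
                if l2 == m2            # A's output matches B's input
            ]
        return self._cache[state]
```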
On-Demand Composition – Results

ATIS task: class-based trigram grammar, full cross-word triphonic context-dependency.

                 states      arcs
context          762         40386
lexicon          3150        4816
grammar          48758       359532
full expansion   1.6 × 10^6  5.1 × 10^6

For the same recognition accuracy as with a static, fully expanded network, on-demand composition expands just 1.6% of the total number of arcs.
Determinization in Large Vocabulary Recognition

– For large vocabularies, 'string' lexicons are very non-deterministic
– Determinizing the lexicon solves this problem, but can introduce non-coaccessible states during its composition with the grammar
– Alternate solutions:
  – Off-line compose, determinize, and minimize: Lexicon ∘ Grammar
  – Pre-tabulate the non-coaccessible states in the composition Det(Lexicon) ∘ Grammar
Search in Recognition Cascades

Reminder: Cost = −log probability

Example recognition problem: ŵ = argmax_w (O ∘ A ∘ D ∘ M)(o, w)

Viterbi search: approximate ŵ by the output word sequence of the lowest-cost path from the start state to a final state in O ∘ A ∘ D ∘ M; this ignores summing over multiple paths with the same output.

Composition preserves acyclicity and O is acyclic ⇒ the search graph is acyclic.
Single-Source Shortest-Path Algorithms [83]

Meta-algorithm:

Q ← {s0}; ∀s, Cost(s) ← ∞; Cost(s0) ← 0
while Q not empty:
  s ← Dequeue(Q)
  for each s′ ∈ Adj[s] such that Cost(s′) > Cost(s) + cost(s, s′):
    Cost(s′) ← Cost(s) + cost(s, s′)
    Enqueue(Q, s′)

Specific algorithms:

Name          Queue type   Cycles  Neg. weights  Complexity
acyclic       topological  no      yes           O(|V| + |E|)
Dijkstra      best-first   yes     no            O(|E| log |V|)
Bellman-Ford  FIFO         yes     yes           O(|V| · |E|)
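A sketch of the best-first (Dijkstra) instance, specialized to return the output labels of the lowest-cost accepting path; the arc format is an assumption carried over from the earlier sketches:

```python
import heapq
import math

def dijkstra_best_path(start, finals, arcs):
    """Lowest-cost path from `start` to any state in `finals`, for
    non-negative arc costs.  `arcs` maps
    state -> [(output_label, cost, nextstate)]; returns the cost and
    label sequence (the Viterbi transcription of a lattice)."""
    heap = [(0.0, 0, start, [])]   # tiebreak counter: states and label
    counter = 0                    # lists need not be comparable
    done = set()
    while heap:
        cost, _, s, labels = heapq.heappop(heap)
        if s in done:
            continue
        done.add(s)
        if s in finals:
            return cost, labels
        for label, w, nxt in arcs.get(s, []):
            if nxt not in done:
                counter += 1
                heapq.heappush(heap, (cost + w, counter, nxt, labels + [label]))
    return math.inf, None
```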
The Search Problem

– Obvious first approach: use an appropriate single-source shortest-path algorithm
– Problem: impractical to visit all states; can we do better?
  – Admissible methods: guarantee finding the best path, but reorder the search to avoid exploring provably bad regions
  – Non-admissible methods: may fail to find the best path, but may need to explore much less of the graph
– Current practical approaches:
  – Heuristic cost functions
  – Beam search
  – Multipass search
  – Rescoring
Heuristic Cost Function – A* Search [4, 56, 17]

– States in the search are ordered by cost-so-far(s) + lower-bound-to-complete(s)
– With a tight bound, states not on good paths are not explored
– With a loose lower bound, no better than Dijkstra's algorithm
– Where to find a tight bound?
  – Full search of a composition of smaller automata (homomorphic automata with lower-bounding costs?)
  – Non-admissible A* variants: use an averaged estimate of the cost-to-complete, not a lower bound
Beam Search [35]

– Only explore states with costs within a beam (threshold) of the cost of the best comparable state
– Non-admissible
– Comparable states: states corresponding to (approximately) the same observations
– Synchronous (Viterbi) search: explore composition states in chronological observation order
– Problem with synchronous beam search: too local; some observation subsequences are unreliable and may locally put the best overall path outside the beam
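A sketch of time-synchronous Viterbi beam pruning. Here `step` stands for a hypothetical decoder-supplied expansion function; states are compared frame by frame, and anything worse than the frame's best cost plus the beam is dropped:

```python
import math

def beam_viterbi(frames, start, step, beam):
    """`frames`: observation sequence; `step(state, frame)` returns
    [(nextstate, cost)] expansions.  Returns the best surviving
    (state, cost) after the last frame."""
    active = {start: 0.0}
    for frame in frames:
        expanded = {}
        for state, cost in active.items():
            for nxt, w in step(state, frame):
                c = cost + w
                if c < expanded.get(nxt, math.inf):
                    expanded[nxt] = c          # best path per state
        if not expanded:
            raise ValueError("all hypotheses pruned")
        best = min(expanded.values())
        # beam pruning relative to the best comparable state
        active = {s: c for s, c in expanded.items() if c <= best + beam}
    return min(active.items(), key=lambda kv: kv[1])
```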
Beam-Search Tradeoffs [68]

Word lattice: the result of composing the observation sequence, the level transducers and the language model.

Beam  Word lattice error rate  Median number of edges
4     7.3%                     86.5
6     5.4%                     244.5
8     4.4%                     827
10    4.1%                     3520
12    4.0%                     13813.5
Multipass Search [52, 3, 68]

– Use a succession of binary compositions instead of a single n-way composition; combinable with other methods
– Prune: use a two-pass variant of composition to remove states not in any path close enough to the best
– Pruned intermediate lattices are smaller, lowering the number of state pairings considered
– Approximate: use simpler models (context-independent phone models, low-order language models)
– Rescore: …
Rescoring

Most successful approach in practice:

o → [cheap approximate models] → n-best w_1 … w_n → [detailed models] → w_i

– Small pruned result built by composing approximate models
– Composition with the full models and the observations
– Find the lowest-cost path
PART III: Finite-State Methods in Language Processing

Richard Sproat, Speech Synthesis Research Department
Bell Laboratories, Lucent Technologies
[email protected]
Overview

– Text analysis for text-to-speech (TTS) synthesis:
  – A rich domain with lots of linguistic problems
  – Probably the least familiar application of NLP technologies
– Syntactic analysis
– Some thoughts on text indexation
The Nature of the TTS Problem

This is some text: "It was a dark and stormy night. Four score and seven years ago. Now is the time for all good men. Let them eat cake. Quoth the raven nevermore."

text → [Linguistic Analysis] → phonemes, durations and pitch contours → [Speech Synthesis] → speech waveforms
From Text to Linguistic Representation

老鼠吃油 'The rat is eating the oil'

[Figure: analysis of the sentence into words lao3shu3 (N), chi1 (V), you2 (N) with segmental transcriptions, and a prosodic structure of moras (µ), syllables (σ), prosodic words (ω), and tones (LH L H LH), grouped into a phrase (Φ).]
Russian Percentages: The Problem

How do you say '%' in Russian?

Adjectival forms when modifying nouns:
– 20% skidka '20% discount' → dvadcati-procentnaja skidka
– s 20% rastvorom 'with 20% solution' → s dvadcati-procentnym rastvorom

Nominal forms otherwise:
– 21% → dvadcat' odin procent
– 23% → dvadcat' tri procenta
– 20% → dvadcat' procentov
– s 20% 'with 20%' → s dvadcat'ju procentami
Text Analysis Problems

– Segment text into words
– Segment text into sentences, checking for and expanding abbreviations: "St. Louis is in Missouri."
– Expand numbers
– Lexical and morphological analysis
– Word pronunciation:
  – Homograph disambiguation
– Phrasing
– Accentuation
Desiderata for a Model of Text Analysis for TTS

– Delay decisions until there is enough information to make them
– Possibly weight the various alternatives
– Weighted finite-state transducers offer an attractive computational model
Overall Architectural Matters

Example: word pronunciation in Russian

– Text form: kostra (bonfire + genitive.singular)
– Morphological analysis: kost{E}r{noun}{masc}{inan}+'a{sg}{gen}
– Pronunciation: /kastr'a/
– Minimal Morphologically-Motivated Annotation (MMA): kostr'a (Sproat, 1996)
Overall Architectural Matters

[Figure: overall architecture. WFSTs relate the surface orthographic form (KOSTRA), the morphological analysis (#KOST"{E}R{noun}{masc}{inan}+"A{sg}{gen}#), the MMA (#KOSTR"A#), and the pronunciation (#kastr"a#); the lexical analysis WFST is built by composing these transducers, constrained by a language-model FST.]
Orthography → Lexical Representation: A Closer Look

(Words : Lex. Annot.) ∪ (Special Symbols : Expansions) ∪ (Punc. : Interp.) ∪ (SPACE : Interp.) ∪ (Numerals : Expansions)

SPACE: white space in German, Spanish, Russian, …; ε in Japanese, Chinese, …
Chinese Word Segmentation

[Figure: fragment of a weighted dictionary transducer mapping hanzi to analyses, each with a part of speech and a weight: le0 (asp, perfective), liao3jie3 (vb, 'understand'), da4 (adv, 'big'), da4jie1 (nc, 'avenue'), bu4 ('not'), zai4 (vb, 'at'), wang4 (vb, 'forget'), wang4+bu4liao3 (vb+npot, 'unable to forget'), wo3 (np, 'I'), fang4 (vb, 'place'), fang4da4 (vb, 'enlarge'), na3li3 (nc, 'where'), jie1 (nc, 'avenue'), jie3fang4 (nc, 'liberation'), xie4 fang4da4 (urnp, a personal name).]
Algorithms for Speech Recognition and Language Processing
PART III
112
Chinese Word Segmentation

Space = ε : #

L = Space (Dictionary (Space ∪ Punc))+

BestPath(sentence ∘ L) = wo3 (pro, 4.88) # wang4+bu4liao3 (vb+npot, 12.23) # jie3fang4 (nc, 10.92) # da4jie1 (nc, 11.45) # …
'I couldn't forget where Liberation Avenue is.'
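BestPath over a weighted dictionary can be sketched as dynamic programming over string positions (a special case of the shortest-path search of Part II). The toy lexicon and its costs below are invented for illustration:

```python
import math

def segment(text, lexicon):
    """Lowest-cost segmentation of `text` into dictionary words;
    `lexicon` maps word -> cost (e.g., a negative log probability).
    Assumes the text can be fully segmented."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i]: cost of segmenting text[:i]
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == math.inf:
            continue
        for j in range(i + 1, n + 1):
            w = text[i:j]
            if w in lexicon and best[i] + lexicon[w] < best[j]:
                best[j] = best[i] + lexicon[w]
                back[j] = i
    words, i = [], n
    while i > 0:                  # recover the best word sequence
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

lex = {"解": 6.0, "放": 6.0, "解放": 4.0, "大": 5.0, "街": 5.0, "大街": 4.5}
print(segment("解放大街", lex))   # ['解放', '大街']
```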
Numeral Expansion

234
→ (Factorization) 2·10² + 3·10¹ + 4
→ (DecadeFlop) 2·10² + 4 + 3·10¹
→ (NumberLexicon) zwei+hundert+vier+und+dreißig
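A sketch of the Factorization step, with a toy stand-in for DecadeFlop and the number lexicon; only the lexicon entries shown on these slides are real, the rest is an illustrative assumption:

```python
def factorize(n):
    """Factorization: 234 -> [(2, 2), (3, 1), (4, 0)], i.e.
    2*10^2 + 3*10^1 + 4*10^0 (zero digits dropped)."""
    s = str(n)
    return [(int(d), len(s) - i - 1) for i, d in enumerate(s) if d != '0']

def german_number(n):
    """Toy DecadeFlop + NumberLexicon for small numbers: units are
    flopped before tens ('vier+und+dreißig')."""
    lex = {1: "eins", 2: "zwei", 3: "drei", 4: "vier"}
    tens = {2: "zwanzig", 3: "dreißig"}
    parts = {p: d for d, p in factorize(n)}
    out = []
    if 2 in parts:
        out += [lex[parts[2]], "hundert"]
    if 0 in parts and 1 in parts:   # DecadeFlop: 3*10^1 + 4 -> 4 und 30
        out += [lex[parts[0]], "und", tens[parts[1]]]
    elif 1 in parts:
        out.append(tens[parts[1]])
    elif 0 in parts:
        out.append(lex[parts[0]])
    return "+".join(out)

print(german_number(234))   # zwei+hundert+vier+und+dreißig
```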
Numeral Expansion

[Figure: the Factorization transducer over digit strings, copying digits (0:0, 1:1, …, 9:9) and inserting powers of ten (ε:10^2, ε:10^1) in the appropriate positions.]
German Numeral Lexicon

/{1}: ('eins{num}({masc}|{neut}){sg}{##})/
/{2}: (zw'ei{num}{##})/
/{3}: (dr'ei{num}{##})/
…
/({0}{+++}{1}{10^1}): (z'ehn{num}{##})/
/({1}{+++}{1}{10^1}): ('elf{num}{##})/
/({2}{+++}{1}{10^1}): (zw'ölf{num}{##})/
/({3}{+++}{1}{10^1}): (dr'ei{++}zehn{num}{##})/
…
/({2}{10^1}): (zw'an{++}zig{num}{##})/
/({3}{10^1}): (dr'ei{++}ßig{num}{##})/
…
/({10^2}): (h'undert{num}{##})/
/({10^3}): (t'ausend{num}{neut}{##})/
Morphology: Paradigmatic Specifications

Paradigm {A1}          # starke Flektion (z.B. nach unbestimmtem Artikel)
Suffix  {++}er   {sg}{masc}{nom}
Suffix  {++}en   {sg}{masc}({gen}|{dat}|{acc})
Suffix  {++}e    {sg}{femi}({nom}|{acc})
Suffix  {++}en   {sg}({femi}|{neut})({gen}|{dat})
Suffix  {++}es   {sg}{neut}({nom}|{acc})
Suffix  {++}e    {pl}({nom}|{acc})
Suffix  {++}er   {pl}{gen}
Suffix  {++}en   {pl}{dat}
Morphology: Paradigmatic Specifications

Paradigm {A6}          ##### Possessiva ("mein, euer")
Suffix  {++}{Eps}  {sg}({masc}|{neut}){nom}
Suffix  {++}e      {sg}{femi}{nom}
Suffix  {++}es     {sg}({masc}|{neut}){gen}
Suffix  {++}er     {sg}{femi}({gen}|{dat})
Suffix  {++}em     {sg}({masc}|{neut}){dat}
Suffix  {++}en     {sg}{masc}{acc}
Suffix  {++}{Eps}  {sg}{neut}{acc}
Suffix  {++}e      {pl}({nom}|{acc})
Suffix  {++}er     {pl}{gen}
Suffix  {++}en     {pl}{dat}
Morphology: Paradigmatic Specifications

/{A1}: ('aal{++}glatt{adj})/
/{A1}: ('ab{++}änder{++}lich{adj}{umlt})/
/{A1}: ('ab{++}artig{adj})/
/{A1}: ('ab{++}bau{++}würdig{adj}{umlt})/
…
/{A6}: (d'ein{adj})/
/{A6}: ('euer{adj})/
/{A6}: ('ihr{adj})/
/{A6}: ('Ihr{adj})/
/{A6}: (m'ein{adj})/
/{A6}: (s'ein{adj})/
/{A6}: ('unser{adj})/
Morphology: Paradigmatic Specifications

Project(({A6} Endings) ∘ (({A6}:Stems) Id(Σ*)))

[Figure: the resulting automaton for m'ein{adj}{++}, with suffix paths covering the feature combinations {sg}/{pl}, {masc}/{femi}/{neut}, and {nom}/{gen}/{dat}/{acc}.]
Morphology: Finite-State Grammar

START  → PREFIX  {Eps}
PREFIX → STEM    {Eps}
PREFIX → STEM    t"ele{++}
…
STEM   → SUFFIX  'abend
STEM   → SUFFIX  'abenteuer
…
SUFFIX → PREFIX  {++}
SUFFIX → FUGE    {Eps}
SUFFIX → WORD    {Eps}
…
Morphology: Finite-State Grammar

FUGE   → SECOND  {++}
FUGE   → SECOND  {++}s{++}
…
SECOND → PREFIX  {Eps}
SECOND → STEM    {Eps}
SECOND → WORD    {Eps}
…
WORD
Morphology: Finite-State Grammar

Unanständigkeitsunterstellung 'allegation of indecency'
→ "un{++}"an{++}st'änd{++}ig{++}keit{++}s{++}unter{++}st'ell{++}ung
Rewrite Rule Compilation

Context-dependent rewrite rules. General form:

φ → ψ / λ _ ρ

where φ, ψ, λ, ρ are regular expressions. Constraint: the output of a rewrite cannot itself be rewritten, but can be used as a context.

Example: a → b / c _ b

(Johnson, 1972; Kaplan & Kay, 1994; Karttunen, 1995; Mohri & Sproat, 1996)
Algorithms for Speech Recognition and Language Processing
PART III
124
Example

a → b / c _ b,  applied to w = cab
Algorithms for Speech Recognition and Language Processing
PART III
125
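Before turning to the transducer construction, the rule's effect on w = cab can be checked with a regular-expression stand-in; this only mimics the rule's semantics on a single string, whereas the point of this section is compiling the rule into a transducer that applies to whole automata:

```python
import re

# a -> b / c _ b: rewrite "a" when preceded by "c" and followed by "b".
# Lookbehind/lookahead leave the contexts unconsumed.
rule = re.compile(r"(?<=c)a(?=b)")
print(rule.sub("b", "cab"))   # -> cbb
```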
Example

Input: c a b

After r (marker ">" inserted before each occurrence of the right context b): c a > b

After f (markers "<1", "<2" inserted before each occurrence of φ = a followed by ">"): c <1 a > b and c <2 a > b

The replace step then rewrites a as b between "<1" and ">", and the context filters check λ = c and ρ = b before the markers are removed.
The Replace Transducer

[Figure: the replace transducer, with Σ:Σ self-loops around the rewriting of the marked region.]