Algorithms for Speech Recognition and Language Processing

Mehryar Mohri, AT&T Laboratories ([email protected])
Michael Riley, AT&T Laboratories ([email protected])
Richard Sproat, Bell Laboratories ([email protected])

Joint work with Emerald Chung, Donald Hindle, Andrej Ljolje, Fernando Pereira

Tutorial presented at COLING'96, August 3rd, 1996.

cmp-lg/9608018 v2 17 Sep 1996
Introduction (1)

Text and speech processing: hard problems.

Theory of automata:
– Appropriate level of abstraction
– Well-defined algorithmic problems
Introduction (2)

Three sections:
– Algorithms for text and speech processing (2h)
– Speech recognition (2h)
– Finite-state methods for language processing (2h)
PART I: Algorithms for Text and Speech Processing

Mehryar Mohri, AT&T Laboratories
[email protected]
August 3rd, 1996
Definitions: finite automata (1)

A = (Σ, Q, δ, I, F)

– Alphabet Σ
– Finite set of states Q
– Transition function δ: Q × Σ → 2^Q
– I ⊆ Q, set of initial states
– F ⊆ Q, set of final states

A recognizes L(A) = {w ∈ Σ* : δ(I, w) ∩ F ≠ ∅} (Hopcroft and Ullman, 1979; Perrin, 1990).

Theorem 1 (Kleene, 1965). A set is regular (or rational) iff it can be recognized by a finite automaton.
Definitions: finite automata (2)

Figure 1: Automata recognizing L(A) = Σ* aba.
Definitions: weighted automata (1)

A = (Σ, Q, δ, λ, σ, ρ, I, F)

– (Σ, Q, δ, I, F) is an automaton
– Initial output function λ
– Output function σ: Q × Σ × Q → K
– Final output function ρ

Function f: Σ* → (K, +, ·) associated with A:

∀u ∈ Dom(f), f(u) = Σ_{i ∈ I, q ∈ δ(i,u) ∩ F} λ(i) · σ(i, u, q) · ρ(q)
Definitions: weighted automata (2)

Figure 2: Index of t = aba (weighted automaton).
Definitions: rational power series

Power series: functions mapping Σ* to a semiring (K, +, ×)

– Notation: S = Σ_{w ∈ Σ*} (S, w) w, with coefficients (S, w)
– Support: supp(S) = {w ∈ Σ* : (S, w) ≠ 0}
– Sum: (S + T, w) = (S, w) + (T, w)
– Product: (S T, w) = Σ_{uv = w} (S, u)(T, v)
– Star: S* = Σ_{n ≥ 0} S^n

Rational power series: closure under the rational operations of polynomials (polynomial power series) (Salomaa and Soittola, 1978; Berstel and Reutenauer, 1988).

Theorem 2 (Schützenberger, 1961). A power series is rational iff it can be represented by a weighted finite automaton.
Definitions: transducers (1)

T = (Σ, ∆, Q, δ, σ, I, F)

– Finite alphabets Σ and ∆
– Finite set of states Q
– Transition function δ: Q × Σ → 2^Q
– Output function σ: Q × Σ × Q → ∆*
– I ⊆ Q, set of initial states
– F ⊆ Q, set of final states

T defines a relation:

R(T) = {(u, v) ∈ Σ* × ∆* : v ∈ ∪_{q ∈ δ(I,u) ∩ F} σ(I, u, q)}
Definitions: transducers (2)

Figure 3: Fibonacci normalizer (abb → baa; baa → abb).
Definitions: weighted transducers

Figure 4: Example — the weighted transducer maps aaba to bbcb along two paths, with weight sequences (0 0 1 0) and (0 1 1 0):

(min, +): aaba → min{1, 2} = 1
(+, ×): aaba → 0 + 0 = 0
Composition: Motivation (1)

– Construction of complex sets or functions from more elementary ones
– Modularity (modules, distinct linguistic descriptions)
– On-the-fly expansion
Composition: Motivation (2)

source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program

Figure 5: Phases of a compiler (Aho et al., 1986).
Composition: Motivation (3)

Source text → Spellchecker → Inflected forms → Index → Set of positions

Figure 6: Complex indexation.
Composition: Example (1)

Figure 7: Composition of transducers.
Composition: Example (2)

Figure 8: Composition of weighted transducers (+, ×).
Composition: Algorithm (1)

– Construction of pairs of states:
  – Match: q1 →[a:b/w1] q1′ and q2 →[b:c/w2] q2′
  – Result: (q1, q2) →[a:c/(w1 ⊗ w2)] (q1′, q2′)
– Elimination of the redundancy of ε-paths: filter
– Complexity: quadratic
– On-the-fly implementation
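The pair construction is easy to sketch in code. Below is a minimal Python sketch for the ε-free case over the (min, +) semiring (so ⊗ is +); the dict-based transducer format is an assumption made for illustration, not the representation used in the tutorial's implementation.

```python
from collections import deque

def compose(t1, t2):
    """Compose two epsilon-free weighted transducers over (min, +).
    A transducer is {'start': s, 'finals': set_of_states,
    'arcs': {state: [(ilabel, olabel, weight, nextstate)]}}.
    A result arc is created when an output label of t1 matches an
    input label of t2; weights combine by + (the semiring's ⊗)."""
    start = (t1['start'], t2['start'])
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        q = queue.popleft()
        q1, q2 = q
        if q1 in t1['finals'] and q2 in t2['finals']:
            finals.add(q)
        out = []
        for i1, o1, w1, n1 in t1['arcs'].get(q1, []):
            for i2, o2, w2, n2 in t2['arcs'].get(q2, []):
                if o1 == i2:                        # matching labels
                    n = (n1, n2)
                    out.append((i1, o2, w1 + w2, n))
                    if n not in seen:
                        seen.add(n)
                        queue.append(n)
        arcs[q] = out
    return {'start': start, 'finals': finals, 'arcs': arcs}
```

Only pairs reachable from the start pair are ever built, which is also what makes the on-demand variant in Part II natural.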
Composition: Algorithm (2)

Figure 9: Composition of weighted transducers with ε-transitions (the ε-arcs of A and B are distinguished as ε1 and ε2 in A′ and B′).
Composition: Algorithm (3)

Figure 10: Redundancy of ε-paths.
Composition: Algorithm (4)

Figure 11: Filter for efficient composition.
Composition: Theory

– Transductions (Elgot and Mezei, 1965; Eilenberg, 1974–1976; Berstel, 1979)

Theorem 3. Let τ1 and τ2 be two (weighted) automata or transducers; then τ1 ∘ τ2 is a (weighted) automaton or transducer.

– Efficient composition of weighted transducers (Mohri, Pereira, and Riley, 1996)
– Works with any semiring
– Intersection: composition of (weighted) automata
Intersection: Example

Figure 12: Intersection of automata.
Union: Example

Figure 13: Union of weighted automata (min, +).
Determinization: Motivation (1)

– Efficiency of use (time)
– Elimination of redundancy
– No loss of information (≠ pruning)
Determinization: Motivation (2)

Figure 14: Toy language model (16 states, 53 transitions, 162 paths).
Determinization: Motivation (3)

Figure 15: Determinized language model (9 states, 11 transitions, 4 paths).
Determinization: Example (1)

Figure 16: Determinization of automata (subset states {0}, {1,2}, {3}).
Determinization: Example (2)

Figure 17: Determinization of weighted automata (min, +).
Determinization: Example (3)

Figure 18: Determinization of transducers.
Determinization: Example (4)

Figure 19: Determinization of weighted transducers (min, +).
Determinization: Algorithm (1)

– Generalization of the classical algorithm for automata:
  – Powerset construction
  – Subsets made of (state, weight) or (state, string, weight) pairs
– Applies to subsequentiable weighted automata and transducers
– Time and space complexity: exponential (polynomial w.r.t. the size of the result)
– On-the-fly implementation
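The weighted powerset construction can be sketched as follows for the (min, +) semiring, with subsets of (state, residual weight) pairs as above. The dict-based automaton format is an illustrative assumption, and the input is assumed to satisfy the twin property of the next slide.

```python
from collections import deque

def determinize(wfa):
    """Determinization of a weighted automaton over (min, +).
    `wfa`: {'start': s, 'finals': {state: final_weight},
            'arcs': {state: [(label, weight, nextstate)]}}.
    Result states are frozensets of (state, residual) pairs."""
    start = frozenset([(wfa['start'], 0)])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        subset = queue.popleft()
        fw = [r + wfa['finals'][q] for q, r in subset if q in wfa['finals']]
        if fw:
            finals[subset] = min(fw)              # semiring sum = min
        by_label = {}
        for q, r in subset:
            for label, w, nxt in wfa['arcs'].get(q, []):
                by_label.setdefault(label, []).append((r + w, nxt))
        out = []
        for label, pairs in by_label.items():
            w_min = min(w for w, _ in pairs)      # weight emitted now
            dest = {}
            for w, nxt in pairs:                  # leftover residuals
                dest[nxt] = min(dest.get(nxt, float('inf')), w - w_min)
            nsub = frozenset(dest.items())
            out.append((label, w_min, nsub))
            if nsub not in seen:
                seen.add(nsub)
                queue.append(nsub)
        arcs[subset] = out
    return {'start': start, 'finals': finals, 'arcs': arcs}
```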
Determinization: Algorithm (2)

Conditions of application. Twin states: q and q′ are twin states iff:
– If: they can be reached from the initial states by the same input string u,
– Then: cycles at q and q′ with the same input string v have the same output value.

Theorem 4 (Choffrut, 1978; Mohri, 1996a). Let τ be an unambiguous weighted automaton (transducer, weighted transducer); then τ can be determinized iff it has the twin property.

Theorem 5 (Mohri, 1996a). The twin property can be tested in polynomial time.
Determinization: Theory

– Determinization of automata:
  – General case (Aho, Sethi, and Ullman, 1986)
  – Specific case of Σ*: failure functions (Mohri, 1995)
– Determinization of transducers, weighted automata, and weighted transducers:
  – General description, theory and analysis (Mohri, 1996a; Mohri, 1996b)
  – Conditions of application and test algorithm
  – Acyclic transducers and weighted transducers admit determinization
– Can be used with other semirings (e.g., (R, +, ×))
Local determinization: Motivation

– Time efficiency
– Reduction of redundancy
– Control of the resulting size (flexibility)
– Equivalent function (or equal set)
– No loss of information
Local determinization: Example

Figure 20: Local determinization of weighted transducers (min, +).
Local determinization: Algorithm

– Predicate, e.g.: P(q) ≡ (out_degree(q) > k), with threshold parameter k
– Local: Dom(det) = {q : P(q)}
– Determinization only for q ∈ Dom(det)
– On-the-fly implementation
– Complexity: O(|Dom(det)| · max_{q ∈ Q} out_degree(q))
Local determinization: Theory

– Various choices of predicate (constraint: local)
– Definition of parameters
– Applies to all automata, weighted automata, transducers, and weighted transducers
– Can be used with other semirings (e.g., (R, +, ×))
Minimization: Motivation

– Space efficiency
– Equivalent function (or equal set)
– No loss of information (≠ pruning)
Minimization: Motivation (2)

Figure 21: Determinized language model.
Minimization: Motivation (3)

Figure 22: Minimized language model.
Minimization: Example (1)

Figure 23: Minimization of automata.
Minimization: Example (2)

Figure 24: Minimization of weighted automata (min, +).
Minimization: Example (3)

Figure 25: Minimization of transducers.
Minimization: Example (4)

Figure 26: Minimization of weighted transducers (min, +).
Minimization: Algorithm (1)

– Two steps:
  – Pushing: extraction of strings or weights towards the initial state
  – Classical minimization of automata, with each (input, output) pair considered as a single label
– Algorithm for the first step:
  – Transducers: specific algorithm
  – Weighted automata: shortest-paths algorithms
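The pushing step for weighted automata reduces to a shortest-distance computation, as noted above. A sketch over (min, +), assuming the dict format of the earlier sketches and that every state can reach a final state:

```python
import heapq

def push_weights(wfa):
    """Weight pushing over (min, +): d[q] is the shortest distance
    from q to acceptance (Dijkstra on the reversed automaton, seeded
    with the final weights); each arc weight w becomes
    w + d[next] - d[q], and d[start] moves to the initial weight."""
    rev = {}
    for q, arcs in wfa['arcs'].items():
        for label, w, nxt in arcs:
            rev.setdefault(nxt, []).append((w, q))
    d = dict(wfa['finals'])                  # seed with final weights
    heap = [(w, q) for q, w in d.items()]
    heapq.heapify(heap)
    while heap:
        dist, q = heapq.heappop(heap)
        if dist > d.get(q, float('inf')):
            continue                         # stale queue entry
        for w, p in rev.get(q, []):
            nd = dist + w
            if nd < d.get(p, float('inf')):
                d[p] = nd
                heapq.heappush(heap, (nd, p))
    pushed = {q: [(label, w + d[nxt] - d[q], nxt)  # d defined for all
                  for label, w, nxt in arcs]       # co-accessible states
              for q, arcs in wfa['arcs'].items()}
    return {'start': wfa['start'], 'initial_weight': d[wfa['start']],
            'finals': {q: 0 for q in wfa['finals']}, 'arcs': pushed}
```

After pushing, equivalent states carry identical outgoing (label, weight) sets, so classical minimization applies directly.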
Minimization: Algorithm (2)

Complexity, where E is the set of transitions, S the sum of the lengths of the output strings, and Pmax the longest of the longest common prefixes of the output paths leaving each state:

Type                General                           Acyclic
Automata            O(|E| log |Q|)                    O(|Q| + |E|)
Weighted automata   O(|E| log |Q|)                    O(|Q| + |E|)
Transducers         O(|Q| + |E| (log |Q| + |Pmax|))   O(S + |E| + |Q| + (|E| − (|Q| − |F|)) |Pmax|)
Minimization: Theory

– Minimization of automata (Aho, Hopcroft, and Ullman, 1974; Revuz, 1991)
– Minimization of transducers (Mohri, 1994)
– Minimization of weighted automata (Mohri, 1996a):
  – Minimal number of transitions
  – Test of equivalence
– Standardization of power series (Schützenberger, 1961):
  – Works only with fields
  – Creates too many transitions
Conclusion (1)

– Theory:
  – Rational power series
  – Weighted automata and transducers
– Algorithms:
  – General (various semirings)
  – Efficient (used in practice, at large sizes)
Conclusion (2)

Applications:
– Text processing (spelling checkers, pattern-matching, indexation, OCR)
– Language processing (morphology, phonology, syntax, language modeling)
– Speech processing (speech recognition, text-to-speech synthesis)
– Computational biology (matching with errors)
– Many other applications
PART II: Speech Recognition

Michael Riley, AT&T Laboratories
[email protected]
August 3rd, 1996
Overview

– The speech recognition problem
– Acoustic, lexical and grammatical models
– Finite-state automata in speech recognition
– Search in finite-state automata
Speech Recognition

Given an utterance, find its most likely written transcription. Fundamental ideas:

– Utterances are built from sequences of units
– Acoustic correlates of a unit are affected by surrounding units
– Units combine into units at a higher level: phones → syllables → words
– Relationships between levels can be modeled by weighted graphs; we use weighted finite-state transducers
– Recognition: find the best path in a suitable product graph
Levels of Speech Representation
Maximum A Posteriori Decoding

Overall analysis [4, 57]:

– Acoustic observations: parameter vectors derived by local spectral analysis of the speech waveform at regular (e.g., 10 msec) intervals
– Observation sequence o
– Transcriptions w
– Probability P(o|w) of observing o when w is uttered

Maximum a posteriori decoding:

ŵ = argmax_w P(w|o) = argmax_w P(o|w) P(w) / P(o) = argmax_w P(o|w) P(w)

where P(o|w) is the generative model and P(w) the language model.
Generative Models of Speech

Typical decomposition of P(o|w) into conditionally independent mappings between levels:

– Acoustic model P(o|p): phone sequences → observation sequences. Detailed model:
  – P(o|d): distributions → observation vectors (symbolic → quantitative)
  – P(d|m): context-dependent phone models → distribution sequences
  – P(m|p): phone sequences → model sequences
– Pronunciation model P(p|w): word sequences → phone sequences
– Language model P(w): word sequences
Recognition Cascades: General Form

Multistage cascade:

o = s_k → [stage k] → s_{k−1} → ⋯ → s_1 → [stage 1] → w = s_0

Find s_0 maximizing

P(s_0, s_k) = P(s_k | s_0) P(s_0) = P(s_0) Σ_{s_1,…,s_{k−1}} Π_{1 ≤ j ≤ k} P(s_j | s_{j−1})

"Viterbi" approximation:

Cost(s_0, s_k) = Cost(s_k | s_0) + Cost(s_0)
Cost(s_k | s_0) ≈ min_{s_1,…,s_{k−1}} Σ_{1 ≤ j ≤ k} Cost(s_j | s_{j−1})

where Cost(⋯) = −log P(⋯).
Speech Recognition Problems

– Modeling: how to describe accurately the relations between levels ⇒ modeling errors
– Search: how to find the best interpretation of the observations according to the given models ⇒ search errors
Acoustic Modeling – Feature Selection I

Short-time spectral analysis:

log | ∫ g(τ) x(t + τ) e^{−i2πfτ} dτ |

– Short-time (25 msec Hamming window) spectrum of /ae/ (Hz vs. dB)
– Scale selection:
  – Cepstral smoothing
  – Parameter sampling (13 parameters)
Acoustic Modeling – Feature Selection II [40, 38]

Refinements:
– Time derivatives: 1st and 2nd order
– Non-Fourier analysis (e.g., Mel scale)
– Speaker/channel adaptation:
  – Mean cepstral subtraction
  – Vocal tract normalization
  – Linear transformations

Result: a 39-dimensional feature vector (13 cepstra, 13 delta cepstra, 13 delta-delta cepstra) every 10 milliseconds.
Acoustic Modeling – Stochastic Distributions [4, 61, 39, 5]

– Vector quantization: find a codebook of prototypes
– Full-covariance multivariate Gaussians:

P[y] = (2π)^{−N/2} |S|^{−1/2} exp(−(1/2) (y − μ)ᵀ S^{−1} (y − μ))

– Diagonal-covariance Gaussian mixtures
– Semi-continuous, tied mixtures
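As a concrete instance of the densities above, here is a small sketch evaluating the log-likelihood of a feature vector under a diagonal-covariance Gaussian mixture (the component parameters are made up):

```python
import math

def log_gaussian_mixture(y, mixture):
    """Log-likelihood of vector y under a diagonal-covariance
    Gaussian mixture; `mixture` = [(weight, means, variances), ...]."""
    log_terms = []
    for w, mu, var in mixture:
        ll = math.log(w)
        for yi, mi, vi in zip(y, mu, var):
            # per-dimension Gaussian log-density
            ll -= 0.5 * (math.log(2 * math.pi * vi) + (yi - mi) ** 2 / vi)
        log_terms.append(ll)
    m = max(log_terms)          # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

mix = [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [3.0, 3.0], [0.5, 0.5])]
print(log_gaussian_mixture([0.1, -0.2], mix))
```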
Acoustic Modeling – Units and Training [61, 36]

– Units:
  – Phonetic (sub-word) units, e.g., cat → /k ae t/
  – Context-dependent units: ae_{k,t}
  – Multiple distributions (states) per phone: left, middle, right
– Training:
  – Given a segmentation, training is straightforward
  – Obtain the segmentation by transcription
  – Iterate until convergence
Generating Lexicons – Two Steps

Orthography → Phonemes: "had" → /hh ae d/; "your" → /y uw r/
– Complex, context-independent mapping
– Usually a small number of alternatives
– Determined by spelling constraints; lexical "facts"
– Large online dictionaries available

Phonemes → Phones:
/hh ae d y uw r/ → [hh ae dcl jh axr]   (60% prob)
/hh ae d y uw r/ → [hh ae dcl d y axr]  (40% prob)
– Complex, context-dependent mapping
– Many possible alternatives
– Determined by phonological and phonetic constraints
Decision Trees: Overview [9]

Description/Use: Simple structure: a binary tree of decisions, where terminal nodes determine the prediction (cf. "Game of Twenty Questions"). If the dependent variable is categorical (e.g., red, yellow, green), it is called a "classification tree"; if continuous, a "regression tree".

Creation/Estimation: Creating a binary decision tree for classification or regression involves three steps (Breiman et al.):
1. Splitting rules: Which split to take at a node?
2. Stopping rules: When to declare a node terminal?
3. Node assignment: Which class/value to assign to a terminal node?
1. Decision Tree Splitting Rules

Which split to take at a node? Candidate splits considered:
– Binary cuts: for continuous −∞ < x < ∞, consider splits of the form x ≤ k vs. x > k, ∀k.
– Binary partitions: for categorical x ∈ {1, 2, …, n} = X, consider splits of the form x ∈ A vs. x ∈ X − A, ∀A ⊂ X.
1. Decision Tree Splitting Rules – Continued

Choosing the best candidate split:
– Method 1: Choose k (continuous) or A (categorical) that minimizes the estimated classification (regression) error after the split.
– Method 2 (for classification): Choose k or A that minimizes the estimated entropy after the split.

Example (both candidate splits leave the same 400/400 parent and the same classification error, 200, but SPLIT #2 yields a pure node and lower entropy):

SPLIT #1: (No. 1: 400, No. 2: 400) → (No. 1: 300, No. 2: 100) and (No. 1: 100, No. 2: 300)
SPLIT #2: (No. 1: 400, No. 2: 400) → (No. 1: 200, No. 2: 400) and (No. 1: 200, No. 2: 0)
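A short sketch of Method 2, computing the average entropy after each of the two candidate splits above (the counts are taken from the slide):

```python
from math import log2

def entropy(counts):
    """Entropy (bits) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_entropy(children):
    """Size-weighted average entropy of the child nodes of a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

split1 = [[300, 100], [100, 300]]   # SPLIT #1 children
split2 = [[200, 400], [200, 0]]     # SPLIT #2 children
print(split_entropy(split1))        # ~0.811 bits
print(split_entropy(split2))        # ~0.689 bits: preferred by Method 2
```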
2. Decision Tree Stopping Rules

When to declare a node terminal? Strategy (cost-complexity pruning):
1. Grow an over-large tree.
2. Form a sequence of subtrees T_0, …, T_n, ranging from the full tree to just the root node.
3. Estimate an "honest" error rate for each subtree.
4. Choose the tree size with minimum "honest" error rate.

To form the sequence of subtrees, vary α from 0 (full tree) to ∞ (root only) in:

min_T R(T) + α|T|

To estimate the "honest" error rate, test on data different from the training data, e.g., grow the tree on 9/10 of the available data and test on 1/10, repeating 10 times and averaging (cross-validation).
End of Declarative Sentence Prediction: Pruning Sequence

[Plot: error rate (0.0 to 0.025) vs. number of terminal nodes (0 to 100); + = raw (training) error, o = cross-validated error.]
3. Decision Tree Node Assignment

Which class/value to assign to a terminal node? Plurality vote: choose the most frequent class at that node for classification; choose the mean value for regression.
End-of-Declarative-Sentence Prediction: Features [65]

– Prob[word with "." occurs at end of sentence]
– Prob[word after "." occurs at beginning of sentence]
– Length of word with "."
– Length of word after "."
– Case of word with ".": Upper, Lower, Cap, Numbers
– Case of word after ".": Upper, Lower, Cap, Numbers
– Punctuation after "." (if any)
– Abbreviation class of word with ".": e.g., month name, unit-of-measure, title, address name, etc.
End of Declarative Sentence?

[Decision tree diagram: the root (yes, 48294/52895) splits on bprob, then on eprob, the case of the next word, and the abbreviation type, reaching yes/no leaves with counts such as 42755/42875 and 3289/3547.]
Phoneme-to-Phone Alignment

WORD:     purpose            and      respect
PHONEME:  p er p ax s        ae n d   r ih s p eh k t
PHONE:    p er pcl p ix s    ax n –   r ix s pcl p eh kcl t
Phoneme-to-Phone Realization: Features [66, 10, 62]

– Phonemic context:
  – Phoneme to predict
  – Three phonemes to left
  – Three phonemes to right
– Stress (0, 1, 2)
– Lexical position:
  – Phoneme count from start of word
  – Phoneme count from end of word
Phoneme-to-Phone Realization: Prediction Example

Tree splits for /t/ in "your pretty red":

PHONE   COUNT    SPLIT
ix      182499   cm0: vstp,ustp,vfri,ufri,vaff,uaff,nas
n       87283    cm0: vstp,ustp,vaff,uaff
kcl+k   38942    cp0: alv,pal
tcl+t   21852    cm0: ustp
tcl+t   11928    vm1: mono,rvow,wdi,ydi
tcl+t   5918     cm-1: ustp,rho,n/a
dx      3639     rstr: n/a,no
dx      2454
Phoneme-to-Phone Realization: Network Example

Phonetic network for "Don had your pretty...": each phoneme of /d aa n hh ae d y uw r p r ih t iy/ maps to weighted phone alternatives, e.g., hh → hh (0.74) or hv (0.15); ae → ae (0.73) or eh (0.19); d (in "had your") → dcl jh (0.51), dcl d (0.37) or tcl t (0.11), depending on context; r → axr (0.48) or er (0.29); t (in "pretty") → dx (0.73); iy → iy (0.90).
Acoustic Model Context Selection [92, 39]

– Statistical regression trees used to predict contexts based on distribution variance
– One tree per context-independent phone and state (left, middle, right)
– Trees grown until the data criterion of 500 frames per distribution was met
– Trees pruned using cost-complexity pruning and cross-validation to select the best contexts
– About 44,000 context-dependent phone models
– About 16,000 distributions
N-Grams: Basics

'Chain rule' and joint/conditional probabilities:

P[x_1 x_2 … x_N] = P[x_N | x_1 … x_{N−1}] P[x_{N−1} | x_1 … x_{N−2}] ⋯ P[x_2 | x_1] P[x_1]

where, e.g., P[x_N | x_1 … x_{N−1}] = P[x_1 … x_N] / P[x_1 … x_{N−1}].

(First-order) Markov assumption:

P[x_k | x_1 … x_{k−1}] = P[x_k | x_{k−1}] = P[x_{k−1} x_k] / P[x_{k−1}]

nth-order Markov assumption:

P[x_k | x_1 … x_{k−1}] = P[x_k | x_{k−n} … x_{k−1}] = P[x_{k−n} … x_k] / P[x_{k−n} … x_{k−1}]
N-Grams: Maximum Likelihood Estimation

Let N be the total number of n-grams observed in a corpus and c(x_1 … x_n) the number of times the n-gram x_1 … x_n occurred. Then

P[x_1 … x_n] = c(x_1 … x_n) / N

is the maximum likelihood estimate of that n-gram probability. For conditional probabilities,

P[x_n | x_1 … x_{n−1}] = c(x_1 … x_n) / c(x_1 … x_{n−1})

is the maximum likelihood estimate. With this method, an n-gram that does not occur in the corpus is assigned zero probability.
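A minimal sketch of the MLE conditional estimates for bigrams (the toy corpus is made up):

```python
from collections import Counter

def mle_bigram_model(tokens):
    """P[x2 | x1] = c(x1 x2) / c(x1), exactly as on the slide.
    Any bigram absent from the corpus implicitly has probability 0."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(x1, x2): c / unigrams[x1] for (x1, x2), c in bigrams.items()}

probs = mle_bigram_model("the cat sat on the mat".split())
print(probs[("the", "cat")])   # 0.5: "the" occurs twice, once before "cat"
```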
N-Grams: Good-Turing-Katz Estimation [29, 16]

Let n_r be the number of n-grams that occurred r times. Then

P[x_1 … x_n] = c*(x_1 … x_n) / N

is the Good-Turing estimate of that n-gram probability, where

c*(x) = (c(x) + 1) n_{c(x)+1} / n_{c(x)}.

For conditional probabilities,

P[x_n | x_1 … x_{n−1}] = c*(x_1 … x_n) / c(x_1 … x_{n−1}),   for c(x_1 … x_n) > 0,

is Katz's extension of the Good-Turing estimate. With this method, an n-gram that does not occur in the corpus is assigned the backoff probability

P[x_n | x_1 … x_{n−1}] = α P[x_n | x_2 … x_{n−1}],

where α is a normalizing constant.
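The discounted counts c* follow directly from the definition. A raw sketch; practical estimators smooth the n_r values first, which this version deliberately does not:

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Good-Turing discounted counts:
    c*(x) = (c(x) + 1) * n_{c(x)+1} / n_{c(x)},
    where n_r = number of distinct n-grams observed r times."""
    n_r = Counter(ngram_counts.values())
    discounted = {}
    for ngram, c in ngram_counts.items():
        if n_r.get(c + 1):
            discounted[ngram] = (c + 1) * n_r[c + 1] / n_r[c]
        else:
            discounted[ngram] = None   # undefined unless n_r is smoothed
    return discounted
```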
Finite-State Modeling [57]

Our view of recognition cascades: represent the mappings between levels, the observation sequences and the language uniformly with weighted finite-state machines:

– Probabilistic mapping P(x|y): weighted finite-state transducer. Example, a word pronunciation transducer for "data": d:ε/1 · (ey:ε/.4 | ae:ε/.6) · (dx:ε/.8 | t:ε/.2) · ax:"data"/1
– Language model P(w): weighted finite-state acceptor
Example of Recognition Cascade

observations → O → A → (phones) → D → (words) → M

Recognition from observations o by composition:
– Observations: O(s, s) = 1 if s = o, 0 otherwise
– Acoustic-phone transducer: A(a, p) = P(a|p)
– Pronunciation dictionary: D(p, w) = P(p|w)
– Language model: M(w, w) = P(w)

Recognition: ŵ = argmax_w (O ∘ A ∘ D ∘ M)(o, w)
Speech Models as Weighted Automata

– Quantized observations: o_1 o_2 … o_n, at times t_0 t_1 … t_n
– Phone model A_π: observations → phones. A left-to-right transducer with self-loops o_i:ε/p_00(i), o_i:ε/p_11(i), o_i:ε/p_22(i), forward arcs o_i:ε/p_01(i), o_i:ε/p_12(i), and exit arc ε:π/p_2f
– Acoustic transducer: A = (Σ_π A_π)*
– Word pronunciation D_data: phones → words, e.g., d:ε/1 · (ey:ε/.4 | ae:ε/.6) · (dx:ε/.8 | t:ε/.2) · ax:"data"/1
– Dictionary: D = Σ_w D_w
Example: Phone Lattice O ∘ A

Lattices: weighted acyclic graphs representing possible interpretations of an utterance as sequences of units at a given level of representation (phones, syllables, words, …).

[Figure: phone lattice obtained by composing the observation sequence for "hostile battle" with the acoustic model, with negative-log-cost phone alternatives such as hh, aa/ao, s/f, t, el, b, ae.]
Sample Pronunciation Dictionary D

[Figure: dictionary with "hostile", "battle" and "bottle" as a weighted transducer.]
Sample Language Model M

[Figure: simplified language model as a weighted acceptor over "hostile", "battle", "bottle".]
Recognition by Composition

– From phones to words: compose the dictionary with the phone lattice to yield a word lattice with combined acoustic and pronunciation costs: hostile/-32.900 followed by battle/-26.825
– Applying the language model: compose the word lattice with the language model to obtain a word lattice with combined acoustic, pronunciation and language model costs: paths such as hostile/-21.781 or hostile/-19.407 followed by battle/-17.916 or battle/-15.250
Context-Dependency Examples

– Context-dependent phone models: maps from CI units to CD units. Example: ae/b_d → ae_{b,d}
– Context-dependent allophonic rules: maps from baseforms to detailed phones. Example: t/V′_V → dx
– Difficulty: cross-word contexts. Where several words enter and leave a state in the grammar, substitution does not apply.
Context-Dependency Transducers

Example: triphonic context transducer for two symbols x and y.

[Figure: four states x.x, x.y, y.y, y.x with arcs x/x_x:x, x/x_y:x, y/x_y:y, x/y_y:x, x/y_x:x, y/y_y:y, y/x_x:y, y/y_x:y.]
Generalized State Machines

All of the above networks have bounded context and thus can be represented as generalized state machines. A generalized state machine M:

– Supports these operations:
  – M.start: returns the start state
  – M.final(state): returns 1 if final, 0 if non-final
  – M.arcs(state): returns the transitions (a_1, …, a_N) leaving state, where a_i = (ilabel, olabel, weight, nextstate)
– Does not necessarily support:
  – Providing the number of states
  – Expanding states that have not already been discovered
On-Demand Composition [69, 53]

Create a generalized state machine C for the composition A ∘ B:

C.start := (A.start, B.start)
C.final((s1, s2)) := A.final(s1) ∧ B.final(s2)
C.arcs((s1, s2)) := Merge(A.arcs(s1), B.arcs(s2))

Merged arcs are defined by:

(l1, l3, x + y, (ns1, ns2)) ∈ Merge(A.arcs(s1), B.arcs(s2))
iff (l1, l2, x, ns1) ∈ A.arcs(s1) and (l2, l3, y, ns2) ∈ B.arcs(s2)
State Caching

Create a generalized state machine B for an input machine A:

B.start := A.start
B.final(state) := A.final(state)
B.arcs(state) := A.arcs(state)

Cache disciplines:
– Expand each state of A exactly once, i.e., always save in the cache (memoize)
– Cache, but forget 'old' states using a least-recently-used criterion
– Use instructions (ref counts) from the user (decoder) to save and forget
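The Merge rule and the memoize cache discipline combine naturally into one lazy machine. A Python sketch, assuming machines expose .start, .final(state) and .arcs(state) as defined above, with weights combined by +:

```python
class LazyCompose:
    """On-demand composition A ∘ B of two generalized state machines.
    States are pairs, expanded only when arcs() is first called and
    then memoized (the 'expand exactly once' cache discipline)."""
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.start = (a.start, b.start)
        self._cache = {}

    def final(self, state):
        s1, s2 = state
        return self.a.final(s1) and self.b.final(s2)

    def arcs(self, state):
        if state not in self._cache:
            s1, s2 = state
            self._cache[state] = [
                (l1, l3, x + y, (ns1, ns2))
                for (l1, l2, x, ns1) in self.a.arcs(s1)
                for (m2, l3, y, ns2) in self.b.arcs(s2)
                if l2 == m2            # A's output matches B's input
            ]
        return self._cache[state]
```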
On-Demand Composition – Results

ATIS task: class-based trigram grammar, full cross-word triphonic context-dependency.

                 states      arcs
context          762         40386
lexicon          3150        4816
grammar          48758       359532
full expansion   1.6 × 10^6  5.1 × 10^6

For the same recognition accuracy as with a static, fully expanded network, on-demand composition expands just 1.6% of the total number of arcs.
Determinization in Large Vocabulary Recognition

– For large vocabularies, 'string' lexicons are very non-deterministic
– Determinizing the lexicon solves this problem, but can introduce non-coaccessible states during its composition with the grammar
– Alternate solutions:
  – Off-line compose, determinize, and minimize: Lexicon ∘ Grammar
  – Pre-tabulate the non-coaccessible states in the composition Det(Lexicon) ∘ Grammar
Search in Recognition Cascades

Reminder: Cost = −log probability

Example recognition problem: ŵ = argmax_w (O ∘ A ∘ D ∘ M)(o, w)

Viterbi search: approximate ŵ by the output word sequence of the lowest-cost path from the start state to a final state in O ∘ A ∘ D ∘ M; this ignores summing over multiple paths with the same output.

Composition preserves acyclicity and O is acyclic ⇒ the search graph is acyclic.
Single-Source Shortest-Path Algorithms [83]

Meta-algorithm:

Q ← {s0}; ∀s, Cost(s) ← ∞; Cost(s0) ← 0
while Q not empty:
  s ← Dequeue(Q)
  for each s′ ∈ Adj[s] such that Cost(s′) > Cost(s) + cost(s, s′):
    Cost(s′) ← Cost(s) + cost(s, s′)
    Enqueue(Q, s′)

Specific algorithms:

Name          Queue type   Cycles  Neg. weights  Complexity
acyclic       topological  no      yes           O(|V| + |E|)
Dijkstra      best-first   yes     no            O(|E| log |V|)
Bellman-Ford  FIFO         yes     yes           O(|V| · |E|)
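A sketch of the best-first (Dijkstra) instance, specialized to return the output labels of the lowest-cost accepting path; the arc format is an assumption carried over from the earlier sketches:

```python
import heapq
import math

def dijkstra_best_path(start, finals, arcs):
    """Lowest-cost path from `start` to any state in `finals`, for
    non-negative arc costs.  `arcs` maps
    state -> [(output_label, cost, nextstate)]; returns the cost and
    label sequence (the Viterbi transcription of a lattice)."""
    heap = [(0.0, 0, start, [])]   # tiebreak counter: states and label
    counter = 0                    # lists need not be comparable
    done = set()
    while heap:
        cost, _, s, labels = heapq.heappop(heap)
        if s in done:
            continue
        done.add(s)
        if s in finals:
            return cost, labels
        for label, w, nxt in arcs.get(s, []):
            if nxt not in done:
                counter += 1
                heapq.heappush(heap, (cost + w, counter, nxt, labels + [label]))
    return math.inf, None
```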
The Search Problem

– Obvious first approach: use an appropriate single-source shortest-path algorithm
– Problem: impractical to visit all states; can we do better?
  – Admissible methods: guarantee finding the best path, but reorder the search to avoid exploring provably bad regions
  – Non-admissible methods: may fail to find the best path, but may need to explore much less of the graph
– Current practical approaches:
  – Heuristic cost functions
  – Beam search
  – Multipass search
  – Rescoring
Heuristic Cost Function – A* Search [4, 56, 17]

– States in the search are ordered by cost-so-far(s) + lower-bound-to-complete(s)
– With a tight bound, states not on good paths are not explored
– With a loose lower bound, no better than Dijkstra's algorithm
– Where to find a tight bound?
  – Full search of a composition of smaller automata (homomorphic automata with lower-bounding costs?)
  – Non-admissible A* variants: use an averaged estimate of the cost-to-complete, not a lower bound
Beam Search [35]

– Only explore states with costs within a beam (threshold) of the cost of the best comparable state
– Non-admissible
– Comparable states: states corresponding to (approximately) the same observations
– Synchronous (Viterbi) search: explore composition states in chronological observation order
– Problem with synchronous beam search: too local; some observation subsequences are unreliable and may locally put the best overall path outside the beam
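A sketch of time-synchronous Viterbi beam pruning. Here `step` stands for a hypothetical decoder-supplied expansion function; states are compared frame by frame, and anything worse than the frame's best cost plus the beam is dropped:

```python
import math

def beam_viterbi(frames, start, step, beam):
    """`frames`: observation sequence; `step(state, frame)` returns
    [(nextstate, cost)] expansions.  Returns the best surviving
    (state, cost) after the last frame."""
    active = {start: 0.0}
    for frame in frames:
        expanded = {}
        for state, cost in active.items():
            for nxt, w in step(state, frame):
                c = cost + w
                if c < expanded.get(nxt, math.inf):
                    expanded[nxt] = c          # best path per state
        if not expanded:
            raise ValueError("all hypotheses pruned")
        best = min(expanded.values())
        # beam pruning relative to the best comparable state
        active = {s: c for s, c in expanded.items() if c <= best + beam}
    return min(active.items(), key=lambda kv: kv[1])
```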
Beam-Search Tradeoffs [68]

Word lattice: the result of composing the observation sequence, the level transducers and the language model.

Beam  Word lattice error rate  Median number of edges
4     7.3%                     86.5
6     5.4%                     244.5
8     4.4%                     827
10    4.1%                     3520
12    4.0%                     13813.5
Multipass Search [52, 3, 68]

– Use a succession of binary compositions instead of a single n-way composition; combinable with other methods
– Prune: use a two-pass variant of composition to remove states not in any path close enough to the best
– Pruned intermediate lattices are smaller, lowering the number of state pairings considered
– Approximate: use simpler models (context-independent phone models, low-order language models)
– Rescore: …
Rescoring

Most successful approach in practice:

o → [cheap approximate models] → n-best w_1 … w_n → [detailed models] → w_i

– Small pruned result built by composing approximate models
– Composition with the full models and the observations
– Find the lowest-cost path
PART III: Finite-State Methods in Language Processing

Richard Sproat, Speech Synthesis Research Department
Bell Laboratories, Lucent Technologies
[email protected]
Overview

– Text analysis for text-to-speech (TTS) synthesis:
  – A rich domain with lots of linguistic problems
  – Probably the least familiar application of NLP technologies
– Syntactic analysis
– Some thoughts on text indexation
The Nature of the TTS Problem

This is some text: "It was a dark and stormy night. Four score and seven years ago. Now is the time for all good men. Let them eat cake. Quoth the raven nevermore."

text → [Linguistic Analysis] → phonemes, durations and pitch contours → [Speech Synthesis] → speech waveforms
From Text to Linguistic Representation

老鼠吃油 'The rat is eating the oil'

[Figure: analysis of the sentence into words lao3shu3 (N), chi1 (V), you2 (N) with segmental transcriptions, and a prosodic structure of moras (µ), syllables (σ), prosodic words (ω), and tones (LH L H LH), grouped into a phrase (Φ).]
Russian Percentages: The Problem

How do you say '%' in Russian?

Adjectival forms when modifying nouns:
– 20% skidka '20% discount' → dvadcati-procentnaja skidka
– s 20% rastvorom 'with 20% solution' → s dvadcati-procentnym rastvorom

Nominal forms otherwise:
– 21% → dvadcat' odin procent
– 23% → dvadcat' tri procenta
– 20% → dvadcat' procentov
– s 20% 'with 20%' → s dvadcat'ju procentami
Text Analysis Problems

– Segment text into words
– Segment text into sentences, checking for and expanding abbreviations: "St. Louis is in Missouri."
– Expand numbers
– Lexical and morphological analysis
– Word pronunciation:
  – Homograph disambiguation
– Phrasing
– Accentuation
Desiderata for a Model of Text Analysis for TTS

– Delay decisions until there is enough information to make them
– Possibly weight the various alternatives
– Weighted finite-state transducers offer an attractive computational model
Overall Architectural Matters

Example: word pronunciation in Russian

– Text form: kostra (bonfire + genitive.singular)
– Morphological analysis: kost{E}r{noun}{masc}{inan}+'a{sg}{gen}
– Pronunciation: /kastr'a/
– Minimal Morphologically-Motivated Annotation (MMA): kostr'a (Sproat, 1996)
Overall Architectural Matters

[Figure: overall architecture. WFSTs relate the surface orthographic form (KOSTRA), the morphological analysis (#KOST"{E}R{noun}{masc}{inan}+"A{sg}{gen}#), the MMA (#KOSTR"A#), and the pronunciation (#kastr"a#); the lexical analysis WFST is built by composing these transducers, constrained by a language-model FST.]
Orthography → Lexical Representation: A Closer Look

(Words : Lex. Annot.) ∪ (Special Symbols : Expansions) ∪ (Punc. : Interp.) ∪ (SPACE : Interp.) ∪ (Numerals : Expansions)

SPACE: white space in German, Spanish, Russian, …; ε in Japanese, Chinese, …
Chinese Word Segmentation

[Figure: fragment of a weighted dictionary transducer mapping hanzi to analyses, each with a part of speech and a weight: le0 (asp, perfective), liao3jie3 (vb, 'understand'), da4 (adv, 'big'), da4jie1 (nc, 'avenue'), bu4 ('not'), zai4 (vb, 'at'), wang4 (vb, 'forget'), wang4+bu4liao3 (vb+npot, 'unable to forget'), wo3 (np, 'I'), fang4 (vb, 'place'), fang4da4 (vb, 'enlarge'), na3li3 (nc, 'where'), jie1 (nc, 'avenue'), jie3fang4 (nc, 'liberation'), xie4 fang4da4 (urnp, a personal name).]
Algorithms for Speech Recognition and Language Processing
PART III
112
Chinese Word Segmentation

Space = ε : #

L = Space (Dictionary (Space ∪ Punc))+

BestPath(sentence ∘ L) = wo3 (pro, 4.88) # wang4+bu4liao3 (vb+npot, 12.23) # jie3fang4 (nc, 10.92) # da4jie1 (nc, 11.45) # …
'I couldn't forget where Liberation Avenue is.'
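BestPath over a weighted dictionary can be sketched as dynamic programming over string positions (a special case of the shortest-path search of Part II). The toy lexicon and its costs below are invented for illustration:

```python
import math

def segment(text, lexicon):
    """Lowest-cost segmentation of `text` into dictionary words;
    `lexicon` maps word -> cost (e.g., a negative log probability).
    Assumes the text can be fully segmented."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i]: cost of segmenting text[:i]
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == math.inf:
            continue
        for j in range(i + 1, n + 1):
            w = text[i:j]
            if w in lexicon and best[i] + lexicon[w] < best[j]:
                best[j] = best[i] + lexicon[w]
                back[j] = i
    words, i = [], n
    while i > 0:                  # recover the best word sequence
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

lex = {"解": 6.0, "放": 6.0, "解放": 4.0, "大": 5.0, "街": 5.0, "大街": 4.5}
print(segment("解放大街", lex))   # ['解放', '大街']
```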
Numeral Expansion

234
→ (Factorization) 2·10² + 3·10¹ + 4
→ (DecadeFlop) 2·10² + 4 + 3·10¹
→ (NumberLexicon) zwei+hundert+vier+und+dreißig
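A sketch of the Factorization step, with a toy stand-in for DecadeFlop and the number lexicon; only the lexicon entries shown on these slides are real, the rest is an illustrative assumption:

```python
def factorize(n):
    """Factorization: 234 -> [(2, 2), (3, 1), (4, 0)], i.e.
    2*10^2 + 3*10^1 + 4*10^0 (zero digits dropped)."""
    s = str(n)
    return [(int(d), len(s) - i - 1) for i, d in enumerate(s) if d != '0']

def german_number(n):
    """Toy DecadeFlop + NumberLexicon for small numbers: units are
    flopped before tens ('vier+und+dreißig')."""
    lex = {1: "eins", 2: "zwei", 3: "drei", 4: "vier"}
    tens = {2: "zwanzig", 3: "dreißig"}
    parts = {p: d for d, p in factorize(n)}
    out = []
    if 2 in parts:
        out += [lex[parts[2]], "hundert"]
    if 0 in parts and 1 in parts:   # DecadeFlop: 3*10^1 + 4 -> 4 und 30
        out += [lex[parts[0]], "und", tens[parts[1]]]
    elif 1 in parts:
        out.append(tens[parts[1]])
    elif 0 in parts:
        out.append(lex[parts[0]])
    return "+".join(out)

print(german_number(234))   # zwei+hundert+vier+und+dreißig
```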
Numeral Expansion

[Figure: the Factorization transducer over digit strings, copying digits (0:0, 1:1, …, 9:9) and inserting powers of ten (ε:10^2, ε:10^1) in the appropriate positions.]
German Numeral Lexicon

/{1}: ('eins{num}({masc}|{neut}){sg}{##})/
/{2}: (zw'ei{num}{##})/
/{3}: (dr'ei{num}{##})/
…
/({0}{+++}{1}{10^1}): (z'ehn{num}{##})/
/({1}{+++}{1}{10^1}): ('elf{num}{##})/
/({2}{+++}{1}{10^1}): (zw'ölf{num}{##})/
/({3}{+++}{1}{10^1}): (dr'ei{++}zehn{num}{##})/
…
/({2}{10^1}): (zw'an{++}zig{num}{##})/
/({3}{10^1}): (dr'ei{++}ßig{num}{##})/
…
/({10^2}): (h'undert{num}{##})/
/({10^3}): (t'ausend{num}{neut}{##})/
Morphology: Paradigmatic Specifications

Paradigm {A1}          # starke Flektion (z.B. nach unbestimmtem Artikel)
Suffix  {++}er   {sg}{masc}{nom}
Suffix  {++}en   {sg}{masc}({gen}|{dat}|{acc})
Suffix  {++}e    {sg}{femi}({nom}|{acc})
Suffix  {++}en   {sg}({femi}|{neut})({gen}|{dat})
Suffix  {++}es   {sg}{neut}({nom}|{acc})
Suffix  {++}e    {pl}({nom}|{acc})
Suffix  {++}er   {pl}{gen}
Suffix  {++}en   {pl}{dat}
Morphology: Paradigmatic Specifications

Paradigm {A6}          ##### Possessiva ("mein, euer")
Suffix  {++}{Eps}  {sg}({masc}|{neut}){nom}
Suffix  {++}e      {sg}{femi}{nom}
Suffix  {++}es     {sg}({masc}|{neut}){gen}
Suffix  {++}er     {sg}{femi}({gen}|{dat})
Suffix  {++}em     {sg}({masc}|{neut}){dat}
Suffix  {++}en     {sg}{masc}{acc}
Suffix  {++}{Eps}  {sg}{neut}{acc}
Suffix  {++}e      {pl}({nom}|{acc})
Suffix  {++}er     {pl}{gen}
Suffix  {++}en     {pl}{dat}
Morphology: Paradigmatic Specifications

/{A1}: ('aal{++}glatt{adj})/
/{A1}: ('ab{++}änder{++}lich{adj}{umlt})/
/{A1}: ('ab{++}artig{adj})/
/{A1}: ('ab{++}bau{++}würdig{adj}{umlt})/
…
/{A6}: (d'ein{adj})/
/{A6}: ('euer{adj})/
/{A6}: ('ihr{adj})/
/{A6}: ('Ihr{adj})/
/{A6}: (m'ein{adj})/
/{A6}: (s'ein{adj})/
/{A6}: ('unser{adj})/
Morphology: Paradigmatic Specifications

Project(({A6} Endings) ∘ (({A6}:Stems) Id(Σ*)))

[Figure: the resulting automaton for m'ein{adj}{++}, with suffix paths covering the feature combinations {sg}/{pl}, {masc}/{femi}/{neut}, and {nom}/{gen}/{dat}/{acc}.]
Morphology: Finite-State Grammar

START  → PREFIX  {Eps}
PREFIX → STEM    {Eps}
PREFIX → STEM    t"ele{++}
…
STEM   → SUFFIX  'abend
STEM   → SUFFIX  'abenteuer
…
SUFFIX → PREFIX  {++}
SUFFIX → FUGE    {Eps}
SUFFIX → WORD    {Eps}
…
Morphology: Finite-State Grammar

FUGE   → SECOND  {++}
FUGE   → SECOND  {++}s{++}
…
SECOND → PREFIX  {Eps}
SECOND → STEM    {Eps}
SECOND → WORD    {Eps}
…
WORD
Morphology: Finite-State Grammar

Unanständigkeitsunterstellung 'allegation of indecency'
→ "un{++}"an{++}st'änd{++}ig{++}keit{++}s{++}unter{++}st'ell{++}ung
Rewrite Rule Compilation

Context-dependent rewrite rules. General form:

φ → ψ / λ _ ρ

where φ, ψ, λ, ρ are regular expressions. Constraint: the output of a rewrite cannot itself be rewritten, but can be used as a context.

Example: a → b / c _ b

(Johnson, 1972; Kaplan & Kay, 1994; Karttunen, 1995; Mohri & Sproat, 1996)
Algorithms for Speech Recognition and Language Processing
PART III
124
Example

a → b / c _ b,  applied to w = cab
Algorithms for Speech Recognition and Language Processing
PART III
125
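Before turning to the transducer construction, the rule's effect on w = cab can be checked with a regular-expression stand-in; this only mimics the rule's semantics on a single string, whereas the point of this section is compiling the rule into a transducer that applies to whole automata:

```python
import re

# a -> b / c _ b: rewrite "a" when preceded by "c" and followed by "b".
# Lookbehind/lookahead leave the contexts unconsumed.
rule = re.compile(r"(?<=c)a(?=b)")
print(rule.sub("b", "cab"))   # -> cbb
```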
Example

Input: c a b

After r (marker ">" inserted before each occurrence of the right context b): c a > b

After f (markers "<1", "<2" inserted before each occurrence of φ = a followed by ">"): c <1 a > b and c <2 a > b

The replace step then rewrites a as b between "<1" and ">", and the context filters check λ = c and ρ = b before the markers are removed.
The Replace Transducer

[Figure: the replace transducer, with Σ:Σ self-loops around the rewriting of the marked region.]