TDT4171 Artificial Intelligence Methods Lecture 5 – Probabilistic Reasoning over Time (cont’d)

Norwegian University of Science and Technology

Helge Langseth IT-VEST 310 [email protected]


Outline

1 Probabilistic Reasoning over Time: Set-up; Inference: Filtering, prediction, smoothing

2 Speech recognition: Speech as probabilistic inference; Speech sounds; Word sequences

3 Other dynamic models: Kalman Filters; Dynamic Bayesian networks

4 Summary


Probabilistic Reasoning over Time

Set-up

Time and uncertainty
Motivation: the world changes, so we need to track and predict it
Static (vehicle diagnosis) vs. dynamic (diabetes management)
Basic idea: copy state and evidence variables for each time step
Rain_t = does it rain at time t?
This assumes discrete time; the step size depends on the problem
Here: a time step is one day (I guess . . . )


Markov processes as Bayesian networks
If we want to construct a Bayes net from these variables, what are the parents?
Markov assumption: X_t depends on a bounded subset of X_{0:t−1}
First-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−1})
Second-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−2}, X_{t−1})

[Figure: first-order Markov chain . . . → X_{t−2} → X_{t−1} → X_t → X_{t+1} → X_{t+2} → . . . ; second-order chain with additional edges from X_{t−2} to X_t, from X_{t−1} to X_{t+1}, etc.]


Hidden Markov models
Some variables are not observable themselves.
A variable X_t is partially disclosed by the sound signal in frame t (or our representation of that). We call the observation E_t.
Reasonable assumptions to make:
Stationary process: the transition model P(X_t | pa(X_t)) is fixed for all t
k-th-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−k:t−1})
Sensor Markov assumption: P(E_t | X_{1:t}, E_{1:t−1}) = P(E_t | X_t)


Hidden Markov models as Bayesian networks

[Figure: HMM as a dynamic Bayes net — chain X_0 → X_1 → X_2 → X_3 → X_4, with each observation E_t a child of X_t]

The HMM model as a (dynamic) Bayesian net:
The variables X_t are discrete and one-dimensional
The variables E_t are vectors of variables


Probabilistic Reasoning over Time

Inference: Filtering, prediction, smoothing

Inference tasks
Filtering: P(X_t | e_{1:t}). This is the belief state – the input to the decision process of a rational agent. As a by-product of the calculation scheme, we can also get the probability needed for speech recognition if we are interested.
Prediction: P(X_{t+k} | e_{1:t}) for k > 0. Evaluation of possible action sequences; like filtering without the evidence.
Smoothing: P(X_k | e_{1:t}) for 0 ≤ k < t. Better estimate of past states – essential for learning.
Most likely explanation: arg max_{x_{1:t}} P(x_{1:t} | e_{1:t}). Speech recognition, decoding with a noisy channel.


Quiz: What kind of inference?
In the TV show Dexter, the main character's job is to analyse blood spatter to deduce how a murder went down.
A submarine captain follows the “blips” of a ship on his sonar to understand where the ship is. He fires a torpedo to take the ship out. He keeps watching, and after two minutes he says: “We missed. At the time I planned the torpedo to hit, the ship was there, not where I aimed.”

For each scenario, you are asked to decide whether this is Filtering, Prediction, Smoothing or Most Likely Explanation. Discuss with your neighbour for a couple of minutes.


Filtering
Aim: devise a recursive state estimation algorithm:

P(X_{t+1} | e_{1:t+1}) = Some-Func(P(X_t | e_{1:t}), e_{t+1})

P(X_{t+1} | e_{1:t+1}) = P(X_{t+1}, e_{1:t}, e_{t+1}) / P(e_{1:t+1})
                       = P(e_{t+1} | X_{t+1}, e_{1:t}) · P(X_{t+1} | e_{1:t}) · P(e_{1:t}) / P(e_{1:t+1})
                       = α · P(e_{t+1} | X_{t+1}) · P(X_{t+1} | e_{1:t})
                       = α · [evidence] · [prediction]

So, filtering is a prediction updated by evidence.


Filtering (cont'd)
Aim: devise a recursive state estimation algorithm:

P(X_{t+1} | e_{1:t+1}) = Some-Func(P(X_t | e_{1:t}), e_{t+1})

Prediction by summing out X_t:

P(X_{t+1} | e_{1:t+1}) = α · P(e_{t+1} | X_{t+1}) · P(X_{t+1} | e_{1:t})
                       = α · P(e_{t+1} | X_{t+1}) · Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) · P(x_t | e_{1:t})
                       = α · P(e_{t+1} | X_{t+1}) · Σ_{x_t} P(X_{t+1} | x_t) · P(x_t | e_{1:t})

where the sum is exactly P(X_{t+1} | e_{1:t}), using what we already have.
All relevant information is contained in f_{1:t} = P(X_t | e_{1:t}); belief revision is f_{1:t+1} = Forward(f_{1:t}, e_{t+1}).
Note! The time and space requirements for calculating f_{1:t+1} are constant (independent of t).
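Below is a minimal sketch of this forward recursion in Python, assuming the two-state umbrella model used in the worked example a few slides further on; the matrix names and normalisation step are illustrative, not taken from any course code.

```python
import numpy as np

T = np.array([[0.7, 0.3],    # P(X_{t+1} | X_t = rain)
              [0.3, 0.7]])   # P(X_{t+1} | X_t = ¬rain)
O = {True:  np.array([0.9, 0.2]),   # P(umbrella | X = rain/¬rain)
     False: np.array([0.1, 0.8])}   # P(¬umbrella | X = rain/¬rain)

def forward_step(f, evidence):
    """f_{1:t+1} = α · P(e_{t+1} | X_{t+1}) · Σ_{x_t} P(X_{t+1} | x_t) · f_{1:t}(x_t)."""
    prediction = T.T @ f                  # sum out x_t
    unnormalized = O[evidence] * prediction
    return unnormalized / unnormalized.sum()

f = np.array([0.5, 0.5])                  # prior P(X_0)
for e in [True, True]:                    # umbrella seen on days 1 and 2
    f = forward_step(f, e)
print(f)                                  # ≈ [0.883, 0.117]
```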


Example of Hidden Markov Model from the book

[Figure: a guard sits in an underground bunker, wondering about the weather]

Problem: Our guy sits in a bunker underground, wondering what the weather is like each day: rain or shine?
Sensors: His boss walks by carrying an umbrella with p = 0.9 if it is raining and p = 0.2 if it is sunny.
Dynamics: The weather is the same as yesterday with p = 0.7.


Example of Hidden Markov Model from the book

[Figure: the umbrella DBN — Rain_{t−1} → Rain_t → Rain_{t+1}, with Umbrella_t a child of Rain_t at each time step]

Transition model P(R_t | R_{t−1}): 0.7 if R_{t−1} = true, 0.3 if R_{t−1} = false
Sensor model P(U_t | R_t): 0.9 if R_t = true, 0.2 if R_t = false


Filtering example

[Figure: unrolled network Rain_0 → Rain_1 → Rain_2 with observations umbrella_1 and umbrella_2]

P(X_0) = ⟨0.5, 0.5⟩


Filtering example: P(X_1) = ?

P(X_1) = Σ_{x_0} P(X_1 | x_0) · P(x_0)
       = ⟨0.7, 0.3⟩ · 0.5 + ⟨0.3, 0.7⟩ · 0.5 = ⟨0.5, 0.5⟩


Filtering example: P(X_1 | e_1) = ?

P(X_1 | e_1) = α · P(e_1 | X_1) · P(X_1)
             = α · ⟨0.9, 0.2⟩ · ⟨0.5, 0.5⟩ = α · ⟨0.9 · 0.5, 0.2 · 0.5⟩
             = α · ⟨0.45, 0.1⟩
             = ⟨0.818, 0.182⟩


Filtering example: P(X_2 | e_1) = ?

P(X_2 | e_1) = Σ_{x_1} P(X_2 | x_1) · P(x_1 | e_1)
             = ⟨0.7, 0.3⟩ · 0.818 + ⟨0.3, 0.7⟩ · 0.182 = ⟨0.627, 0.373⟩


Filtering example: P(X_2 | e_{1:2}) = ?

P(X_2 | e_{1:2}) = α · P(e_2 | X_2) · P(X_2 | e_1)
                 = α · ⟨0.9, 0.2⟩ · ⟨0.627, 0.373⟩ = α · ⟨0.565, 0.075⟩
                 = ⟨0.883, 0.117⟩


Filtering example — summary
P(X_0) = ⟨0.5, 0.5⟩, P(X_1) = ⟨0.5, 0.5⟩, P(X_1 | e_1) = ⟨0.818, 0.182⟩,
P(X_2 | e_1) = ⟨0.627, 0.373⟩, P(X_2 | e_{1:2}) = ⟨0.883, 0.117⟩


Prediction
P(X_{t+k+1} | e_{1:t}) = Σ_{x_{t+k}} P(X_{t+k+1} | x_{t+k}) · P(x_{t+k} | e_{1:t})
Again we have a recursive formulation – this time over k.
Notice that it is just like filtering, but without adjusting for evidence.
As k → ∞, P(X_{t+k} | e_{1:t}) tends to the stationary distribution of the Markov chain. This means that the effect of e_{1:t} vanishes as k increases, and predictions become more and more dubious.
The mixing time depends on how stochastic the chain is (“how persistent X is”).
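A small sketch of k-step prediction under the umbrella transition model, reproducing the numbers in the example on the next slides (the variable names are illustrative):

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # P(X_{t+1} | X_t)

p = np.array([0.883, 0.117])             # start from the filtered P(X_2 | e_{1:2})
for k in range(1, 9):
    p = T.T @ p                          # one prediction step; no evidence to fold in
    print(k, np.round(p, 3))
# k = 1 gives ⟨0.653, 0.347⟩, k = 2 gives ⟨0.561, 0.439⟩, and by k = 8 (i.e. X_10)
# the prediction has mixed to ⟨0.500, 0.500⟩: the effect of e_{1:2} has washed out.
```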


Prediction – Example: P(X_3 | e_{1:2}) = ?

P(X_3 | e_{1:2}) = Σ_{x_2} P(X_3 | x_2) · P(x_2 | e_{1:2})
                 = ⟨0.7, 0.3⟩ · 0.883 + ⟨0.3, 0.7⟩ · 0.117 = ⟨0.653, 0.347⟩


Prediction – Example: P(X_4 | e_{1:2}) = ?

P(X_4 | e_{1:2}) = Σ_{x_3} P(X_4 | x_3) · P(x_3 | e_{1:2})
                 = ⟨0.7, 0.3⟩ · 0.653 + ⟨0.3, 0.7⟩ · 0.347 = ⟨0.561, 0.439⟩


Prediction – Example: P(X_{10} | e_{1:2}) = ?

P(X_{10} | e_{1:2}) = Σ_{x_9} P(X_{10} | x_9) · P(x_9 | e_{1:2})
                    = ⟨0.7, 0.3⟩ · 0.501 + ⟨0.3, 0.7⟩ · 0.499 = ⟨0.500, 0.500⟩

lim_{k→∞} P(X_{t+k} | e_{1:t}) = ⟨1/2, 1/2⟩ for this transition model.


Example: Automatic recognition of hand-written digits
We have a system that can “recognise” hand-written digits:

[Figure: a handwritten digit image fed into a box computing P(image | Digit), shown as a bar chart over the digits]

Takes a binary image of a handwritten digit as input
Returns P(image | Digit) — from which we can calculate P(Digit | image)
(The system we will consider is not very good)


Internals of recogniser

An image is a 16 × 16 matrix of binary variables Image_{i,j}: Image_{i,j} = true if pixel (i, j) is white, false otherwise.
We need a model for P(image | Digit). Note that an image is 256-dimensional, so we must combine single-pixel information to find the digit.
How should we proceed? Discuss with your neighbour for a couple of minutes.


Internals of recogniser (cont'd)
An image is a 16 × 16 matrix of binary variables Image_{i,j}: Image_{i,j} = true if pixel (i, j) is white, false otherwise.
There are a number of possible solutions; I've assumed a Naïve Bayes model – but is that reasonable?

P(image | Digit) = Π_i Π_j P(image_{i,j} | Digit)
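A toy sketch of this Naïve Bayes recogniser, assuming we already have per-pixel estimates theta[d][i][j] = P(Image_{i,j} = true | Digit = d); both the parameter array and the uniform prior are made-up placeholders.

```python
import numpy as np

theta = np.random.rand(10, 16, 16)       # stand-in for learned pixel probabilities
prior = np.full(10, 0.1)                 # P(Digit), here uniform

def log_likelihood(image, d):
    """log P(image | Digit = d) = Σ_{i,j} log P(image_{i,j} | Digit = d)."""
    p = np.where(image, theta[d], 1.0 - theta[d])
    return np.log(p).sum()

def classify(image):
    """arg max_d P(Digit = d | image) = arg max_d [log P(image | d) + log P(d)]."""
    scores = [log_likelihood(image, d) + np.log(prior[d]) for d in range(10)]
    return int(np.argmax(scores))

image = np.random.rand(16, 16) > 0.5     # a dummy binary image
print(classify(image))
```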


Scaling up: ZIP-codes
We want to build a system that can decode hand-written ZIP-codes for letters to Norway.

[Figure: four separate digit variables Digit_1, Digit_2, Digit_3, Digit_4, one per image in the ZIP-code]


Scaling up: ZIP-codes (cont'd)
We want to build a system that can decode hand-written ZIP-codes for letters to Norway. There is structure in this:
ZIP-codes always have 4 digits
Some ZIP-codes are more frequent than others (e.g., 0xxx – 13xx for Oslo, 50xx for Bergen, 70xx for Trondheim)
Some ZIP-codes are not used, e.g. 5022 does not exist
. . . but some illegal numbers are often used, e.g. 7000 meaning “wherever in Trondheim”

Can we utilise the internal structure to improve the digit recogniser? Model structure? Assumptions?
Discuss with your neighbour for a couple of minutes.


How to model the internal structure of ZIP-codes
Take 1: Full model

[Figure: fully connected model — each Digit_i depends on all earlier digits, each with its own image]


How to model the internal structure of ZIP-codes — Take 1: Full model (cont'd)

The full model includes all relations between digits: 7465 is commonly used, 7365 is not.

The problem is the size of the CPTs: How many numbers are needed to represent P(Digit_4 | Pa(Digit_4))? What if we want to use this system to recognise KID numbers (> 10 digits)?


How to model the internal structure of ZIP-codes
Take 2: Markov model

[Figure: Markov chain Digit_1 → Digit_2 → Digit_3 → Digit_4, each digit with its own image]


How to model the internal structure of ZIP-codes — Take 2: Markov model (cont'd)

The reduced model includes only some relations between digits:
It can represent “If it starts with 7 and digit number three is 6, then the second one is probably 4”
It cannot represent “If it starts with 9, then digit number four is probably not 7”

What about making the model stationary? That does not seem appropriate here. It might be necessary and/or reasonable for KID recognition, though.


Inference (filtering)
Step 1: The first digit is classified as a 4! (Not good! I told you!)

[Figure: the posterior P(Digit_1 | image_1) peaks at 4, with 7 as runner-up; Digit_2, Digit_3 and Digit_4 are still unclassified]


Inference (filtering) — Step 1 (cont'd)
So what happened? The Naïve Bayes model supplies P(image_1 | Digit_1). Using the calculation rule, the system finds

P(Digit_1 | image_1) = α · P(image_1 | Digit_1) · P(Digit_1)


Inference (filtering)
Step 2: The second digit is classified as a 4.

[Figure: P(Digit_1 | image_1) peaks at 4 over 7; P(Digit_2 | image_1, image_2) peaks at 4 over 2; Digit_3 and Digit_4 are still unclassified]


Inference (filtering) — Step 2 (cont'd)
So what happened? The Naïve Bayes model supplies P(image_2 | Digit_2). Using the calculation rule, the system finds

P(Digit_2 | image_1, image_2) = α · P(image_2 | Digit_2) · Σ_{digit_1} P(Digit_2 | digit_1) · P(digit_1 | image_1)

To do the classification, the system used the information that
the image is a very typical “4”,
7 → 4 is probable, and
4 → 4 is not very probable, but possible.

Can this structural information also be used “backwards”? If the 2nd digit is a 4, then the 1st digit is probably a 7, not a 4. This is called smoothing.
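A hedged sketch of this filtering update for the digit chain: each belief is the image likelihood times the prediction from the previous digit. The transition matrix and the likelihood vectors below are uniform placeholders, not real ZIP-code statistics.

```python
import numpy as np

D = 10
transition = np.full((D, D), 1.0 / D)    # stand-in for P(Digit_k | Digit_{k-1})
prior = np.full(D, 1.0 / D)              # P(Digit_1)

def filter_digits(likelihoods):
    """likelihoods[k][d] = P(image_k | Digit_k = d); returns the filtered beliefs."""
    f = prior * likelihoods[0]
    f /= f.sum()
    beliefs = [f]
    for lik in likelihoods[1:]:
        f = lik * (transition.T @ f)     # predict along the digit chain, then weigh in the image
        f /= f.sum()
        beliefs.append(f)
    return beliefs

beliefs = filter_digits([np.random.rand(D) for _ in range(4)])
print(np.round(beliefs[1], 3))           # P(Digit_2 | image_1, image_2) under the toy model
```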


Smoothing

[Figure: HMM unrolled from X_0 to X_t with evidence E_1, . . . , E_k, . . . , E_t; we ask about X_k somewhere in the middle]

Calculate P(X_k | e_{1:t}) by dividing the evidence e_{1:t} into e_{1:k} and e_{k+1:t}:

P(X_k | e_{1:t}) = P(X_k | e_{1:k}, e_{k+1:t})
                 = P(X_k, e_{1:k}, e_{k+1:t}) / P(e_{1:k}, e_{k+1:t})
                 = P(e_{k+1:t} | X_k, e_{1:k}) · P(X_k | e_{1:k}) · P(e_{1:k}) / P(e_{1:k}, e_{k+1:t})
                 = α · P(X_k | e_{1:k}) · P(e_{k+1:t} | X_k)
                 = α · f_{1:k} · b_{k+1:t}

where b_{k+1:t} = P(e_{k+1:t} | X_k).


Smoothing (cont'd)
The backward message is computed by a backwards recursion:

P(e_{k+1:t} | X_k) = Σ_{x_{k+1}} P(e_{k+1:t} | X_k, x_{k+1}) · P(x_{k+1} | X_k)
                   = Σ_{x_{k+1}} P(e_{k+1:t} | x_{k+1}) · P(x_{k+1} | X_k)
                   = Σ_{x_{k+1}} P(e_{k+1} | x_{k+1}) · P(e_{k+2:t} | x_{k+1}) · P(x_{k+1} | X_k)

So:

b_{k+1:t} = P(e_{k+1:t} | X_k) = Σ_{x_{k+1}} P(e_{k+1} | x_{k+1}) · b_{k+2:t}(x_{k+1}) · P(x_{k+1} | X_k)
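A minimal forward–backward sketch for the umbrella model (the same illustrative T and O as in the filtering sketch earlier); it reproduces the smoothed values in the worked example that follows.

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])            # P(X_{k+1} | X_k)
def O(e):                                         # P(e | X) as a vector over the states
    return np.array([0.9, 0.2]) if e else np.array([0.1, 0.8])

def forward_backward(evidence, prior):
    f = [prior]                                   # f[k] = P(X_k | e_{1:k})
    for e in evidence:
        unnorm = O(e) * (T.T @ f[-1])
        f.append(unnorm / unnorm.sum())
    b = np.ones(2)                                # the void message b_{t+1:t} = ⟨1, 1⟩
    smoothed = [None] * len(evidence)
    for k in range(len(evidence), 0, -1):
        s = f[k] * b
        smoothed[k - 1] = s / s.sum()
        b = T @ (O(evidence[k - 1]) * b)          # b_{k:t} from b_{k+1:t}
    return smoothed

print(forward_backward([True, True], np.array([0.5, 0.5])))
# ≈ [⟨0.883, 0.117⟩, ⟨0.883, 0.117⟩], with b_{2:2} = ⟨0.69, 0.41⟩ along the way
```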


Smoothing example
Known from the filtering example: f_0 = ⟨0.5, 0.5⟩, f_{1:1} = ⟨0.818, 0.182⟩, f_{1:2} = ⟨0.883, 0.117⟩

b_{3:2} = P(e_{3:2} | X_2) = ⟨1, 1⟩ (the void message)


Smoothing example: P(X_2 | e_{1:2}) = ?

P(X_2 | e_{1:2}) = α · f_{1:2} · b_{3:2}
                 = α · ⟨0.883, 0.117⟩ · ⟨1, 1⟩ = ⟨0.883, 0.117⟩


Smoothing example: b_{2:2} = ?

b_{2:2} = P(e_{2:2} | X_1) = Σ_{x_2} P(e_2 | x_2) · b_{3:2}(x_2) · P(x_2 | X_1)
        = (0.9 · 1 · ⟨0.7, 0.3⟩) + (0.2 · 1 · ⟨0.3, 0.7⟩) = ⟨0.690, 0.410⟩


Smoothing example: P(X_1 | e_{1:2}) = ?

P(X_1 | e_{1:2}) = α · f_{1:1} · b_{2:2}
                 = α · ⟨0.818, 0.182⟩ · ⟨0.690, 0.410⟩ = ⟨0.883, 0.117⟩


Smoothing example — summary
f_0 = ⟨0.5, 0.5⟩, f_{1:1} = ⟨0.818, 0.182⟩, f_{1:2} = ⟨0.883, 0.117⟩
b_{3:2} = ⟨1, 1⟩, b_{2:2} = ⟨0.690, 0.410⟩
P(X_1 | e_{1:2}) = ⟨0.883, 0.117⟩, P(X_2 | e_{1:2}) = ⟨0.883, 0.117⟩


Smoothing example — conclusion

[Figure: the umbrella network annotated with the forward messages (⟨0.500, 0.500⟩, ⟨0.818, 0.182⟩, ⟨0.883, 0.117⟩), the backward messages (⟨0.690, 0.410⟩, ⟨1.000, 1.000⟩) and the smoothed estimates (⟨0.883, 0.117⟩, ⟨0.883, 0.117⟩)]

Forward–backward algorithm: store the f_t messages as we move forward. Time is linear in t (polytree inference), space is O(t · |f|).


Simplifications for Hidden Markov models
X_t is a single, discrete variable (as is E_t, usually)
The domain of X_t is {1, . . . , S}
Transition matrix T_{ij} = P(X_t = j | X_{t−1} = i), e.g. T = ( 0.7 0.3 ; 0.3 0.7 )
Sensor matrix O_t for each t, with diagonal elements P(e_t | X_t = i). For instance, with Umbrella_1 = true we get O_1 = diag(P(u_1 | x_1), P(u_1 | ¬x_1)) = diag(0.9, 0.2)
Forward and backward messages as column vectors:

f_{1:t+1} = α · O_{t+1} T^⊤ f_{1:t}
b_{k+1:t} = T O_{k+1} b_{k+2:t}

The forward–backward algorithm needs time O(S² · t) and space O(S · t)
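The matrix formulation translates almost directly into code; a sketch assuming numpy and the umbrella numbers (illustrative only):

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])
def O(e):                                 # diagonal sensor matrix for evidence e
    return np.diag([0.9, 0.2]) if e else np.diag([0.1, 0.8])

def forward(f, e):
    """f_{1:t+1} = α · O_{t+1} · T^⊤ · f_{1:t}"""
    f = O(e) @ T.T @ f
    return f / f.sum()

def backward(b, e):
    """b_{k+1:t} = T · O_{k+1} · b_{k+2:t} (unnormalised)"""
    return T @ O(e) @ b

f = forward(forward(np.array([0.5, 0.5]), True), True)    # ≈ ⟨0.883, 0.117⟩
b = backward(np.ones(2), True)                             # ≈ ⟨0.69, 0.41⟩
print(f, b)
```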


How to classify ZIP-codes?

[Figure: the digit chain Digit_1 → Digit_2 → Digit_3 → Digit_4, one image per digit]

Can we take the most probable digit per image and use that for classification?
NO! The most likely sequence IS NOT the sequence of most likely states!


Most likely explanation
Most likely sequence ≠ sequence of most likely states!
The most likely path to each x_{t+1} is the most likely path to some x_t plus one more step:

max_{x_1 . . . x_t} P(x_1, . . . , x_t, X_{t+1} | e_{1:t+1})
  = max_{x_1 . . . x_t} P(x_1, . . . , x_t, X_{t+1}, e_{1:t+1}) / P(e_{1:t+1})
  = max_{x_1 . . . x_t} α · P(e_{t+1} | x_1, . . . , x_t, X_{t+1}, e_{1:t}) · P(X_{t+1} | x_1, . . . , x_t, e_{1:t}) · P(x_1, . . . , x_t | e_{1:t})
  = max_{x_1 . . . x_t} α · P(e_{t+1} | X_{t+1}) · P(X_{t+1} | x_t) · P(x_1, . . . , x_t | e_{1:t})
  = α · P(e_{t+1} | X_{t+1}) · max_{x_t} [ P(X_{t+1} | x_t) · max_{x_1 . . . x_{t−1}} P(x_1, . . . , x_{t−1}, x_t | e_{1:t}) ]


Most likely explanation (cont'd)
Most likely sequence ≠ sequence of most likely states!

max_{x_1 . . . x_t} P(x_1, . . . , x_t, X_{t+1} | e_{1:t+1})
  = α · P(e_{t+1} | X_{t+1}) · max_{x_t} [ P(X_{t+1} | x_t) · max_{x_1 . . . x_{t−1}} P(x_1, . . . , x_t | e_{1:t}) ]

Identical to filtering, except that f_{1:t} is replaced by

m_{1:t} = max_{x_1 . . . x_{t−1}} P(x_1, . . . , x_{t−1}, X_t | e_{1:t}),

i.e., m_{1:t}(i) gives the probability of the most likely path to state i. The update has the sum replaced by a max, giving the Viterbi algorithm:

m_{1:t+1} = P(e_{t+1} | X_{t+1}) · max_{x_t} ( P(X_{t+1} | x_t) · m_{1:t} )


Viterbi example

[Figure: the umbrella state-space trellis for Rain_1 . . . Rain_5 with evidence (true, true, false, true, true); bold arcs mark the most likely path into each state. The messages are
m_{1:1} = ⟨.8182, .1818⟩, m_{1:2} = ⟨.5155, .0491⟩, m_{1:3} = ⟨.0361, .1237⟩, m_{1:4} = ⟨.0334, .0173⟩, m_{1:5} = ⟨.0210, .0024⟩]

m_{1:t+1} = P(e_{t+1} | X_{t+1}) · max_{x_t} ( P(X_{t+1} | x_t) · m_{1:t} )
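A hedged Viterbi sketch for the umbrella model that reproduces the m messages in the trellis above (evidence true, true, false, true, true); the function and variable names are illustrative.

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])             # P(X_{t+1} | X_t)
def O(e):
    return np.array([0.9, 0.2]) if e else np.array([0.1, 0.8])

def viterbi(evidence, prior):
    m = O(evidence[0]) * (T.T @ prior)
    m = m / m.sum()                                 # m_{1:1} is the (normalised) filtered estimate
    messages, back = [m], []
    for e in evidence[1:]:
        scores = T * m[:, None]                     # scores[i, j] = P(X_{t+1} = j | x_t = i) · m(i)
        back.append(scores.argmax(axis=0))          # best predecessor for each state
        m = O(e) * scores.max(axis=0)               # the max replaces the sum of filtering
        messages.append(m)
    path = [int(messages[-1].argmax())]             # backtrack the most likely state sequence
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return messages, path[::-1]

msgs, path = viterbi([True, True, False, True, True], np.array([0.5, 0.5]))
print([np.round(m, 4) for m in msgs])   # ⟨.8182,.1818⟩, ⟨.5155,.0491⟩, ⟨.0361,.1237⟩, ⟨.0334,.0173⟩, ⟨.0210,.0024⟩
print(path)                             # [0, 0, 1, 0, 0], i.e. rain, rain, ¬rain, rain, rain
```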


Speech recognition

Speech as probabilistic inference

Speech as probabilistic inference
Let us return to the question of how to recognise speech.
Speech signals are noisy, variable, ambiguous.
Classify to those words that maximise P(Words | signal)?
Use Bayes' rule: P(Words | signal) = α · P(signal | Words) · P(Words)
I.e., it decomposes into an acoustic model and a language model.
The Words are the hidden state sequence; the signal is the observation sequence.
We use Hidden Markov Models to model this.


Speech recognition

Speech sounds

The sound signal – Robustness

[Figure: six panels, “Signal from file – LEFT(1)” through “LEFT(6)”, showing the raw waveform (amplitude vs. time in seconds) of six different utterances of the word “Left” by the same speaker. Notice the variations!]


Speech sounds
The raw signal is the microphone displacement as a function of time; it is processed into overlapping 30 ms frames, each described by features.

[Figure: an analog acoustic signal, the sampled and quantized digital signal, and the resulting frames with feature vectors]

Frame features are typically formants (peaks in the power spectrum).


What do we have so far?
Speech is dynamic, and we must take that into account.
Speech is noisy – literally! We must use a robust representation.
“Robustification” using windowing.
Possible short-term (inside-window) representation: use the frequency and amplitude of the most important frequencies. Note that the most important frequencies are not necessarily those that come out first after sorting – we should rather look for a representation of the peaks (e.g., the single most important entity from each peak).


Piecing it all together – The spectrogram
A spectrogram is a plot of the function s(t, ω) = |Y(ω, t)|/(2π), where Y(ω, t) is the discrete-time Fourier transform in the window containing t, and |·| is the absolute value (of a potentially complex number). Sometimes s(t, ω) = |Y(ω, t)|² is used, but for us this difference is unimportant.
Typically, the 3D plot is shown in 2D using colour codes for the function value.


The spectrogram – Robustness

[Figure: six spectrograms (time vs. frequency, 0–4000 Hz) of six different utterances of the word “Left” by the same speaker.]


The spectrogram – Discriminative ability

[Figure: spectrogram (time vs. frequency, 0–4000 Hz) of an utterance of the four words “Start”, “Stop”, “Left”, and “Right”.]


Phone models
Frame features in P(features | phone) are summarized by an integer in [0 . . . 255] (using vector quantization), or by the parameters of a mixture of Gaussians.

Three-state phones: each phone has three phases (Onset, Mid, End).
E.g., [t] has a silent Onset, an explosive Mid, and a hissing End ⇒ P(features | phone, phase)

Triphone context: each phone becomes n² distinct phones, depending on the phones to its left and right.
E.g., [t] in “star” is written [t(s,aa)] (different from “tar”!)
Triphones are useful for handling coarticulation effects: the articulators have inertia and cannot switch instantaneously between positions.
E.g., [t] in “eighth” has the tongue against the front teeth.


Phone model example
Phone HMM for [m]:

[Figure: three-state HMM Onset → Mid → End → FINAL with self-loops; transition probabilities Onset→Onset 0.3, Onset→Mid 0.7, Mid→Mid 0.9, Mid→End 0.1, End→End 0.4, End→FINAL 0.6]

Output probabilities for the phone HMM:
Onset: C1: 0.5, C2: 0.2, C3: 0.3
Mid:   C3: 0.2, C4: 0.7, C5: 0.1
End:   C4: 0.1, C6: 0.5, C7: 0.4


Word pronunciation models
Each word is described as a distribution over phone sequences.
The distribution is represented as an HMM transition model.

[Figure: pronunciation HMM for “tomato” — [t], then [ow] with probability 0.2 or [ah] with probability 0.8, then [m], then [ey] with probability 0.5 or [aa] with probability 0.5, then [t], then [ow]]

P([towmeytow] | “tomato”) = P([towmaatow] | “tomato”) = 0.1
P([tahmeytow] | “tomato”) = P([tahmaatow] | “tomato”) = 0.4

The structure is created manually; the transition probabilities are learned from data.


Isolated words
Phone models + word models fix the likelihood P(e_{1:t} | word) for an isolated word:
P(word | e_{1:t}) = α · P(e_{1:t} | word) · P(word)
The prior probability P(word) is obtained by counting word frequencies.
P(e_{1:t} | word) can be computed recursively: define ℓ_{1:t} = P(X_t, e_{1:t}) and use the recursive update ℓ_{1:t+1} = Forward(ℓ_{1:t}, e_{t+1}); then P(e_{1:t} | word) = Σ_{x_t} ℓ_{1:t}(x_t).
Isolated-word dictation systems with training reach 95% – 99% accuracy.


Speech recognition

Word sequences

Continuous speech
Not just a sequence of isolated-word recognition problems!
Adjacent words are highly correlated.
The sequence of most likely words is not equal to the most likely sequence of words.
Segmentation: there are few gaps in speech.
Cross-word coarticulation, e.g., “next thing” ≈ “nexing” in daily speech.
Continuous speech recognition is hard; currently the best systems manage 60% – 80% accuracy.


Language model
The prior probability of a word sequence is given by the chain rule:

P(w_1 · · · w_n) = Π_{i=1}^{n} P(w_i | w_1 · · · w_{i−1})

Simplify using a bigram model: P(w_i | w_1 · · · w_{i−1}) ≈ P(w_i | w_{i−1})
Train by counting all word pairs in a large text corpus.
More sophisticated models (trigrams, grammars, etc.) help, but only a little bit.
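A toy sketch of training a bigram model by counting word pairs; the tiny corpus and the unsmoothed maximum-likelihood estimate are illustrative only.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

pair_counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    pair_counts[prev][cur] += 1

def bigram(cur, prev):
    """Maximum-likelihood estimate of P(w_i = cur | w_{i-1} = prev)."""
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][cur] / total if total else 0.0

print(bigram("cat", "the"))   # 2/3: "the" is followed by "cat" twice and by "mat" once
```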


Summary – Speech
Since the mid-1970s, speech recognition has been formulated as probabilistic inference.
Evidence = speech signal, hidden variables = word and phone sequences.
“Context” effects (coarticulation etc.) are handled by augmenting the state.
Variability in human speech (speed, timbre, etc.) and background noise make continuous speech recognition in real settings an open problem.


Other dynamic models

Kalman Filters

Kalman filters
Modelling systems described by a set of continuous variables, e.g., tracking a bird flying — X_t = (X, Y, Z, Ẋ, Ẏ, Ż).
Also: airplanes, robots, ecosystems, economies, chemical plants, planets, . . .
“Noisy” observations, continuous variables, dynamic model.

[Figure: DBN for a Kalman filter — a chain of continuous states X_t → X_{t+1}, each with an observation Z_t]

Gaussian prior, linear Gaussian transition model and sensor model.


Continuous variables
We need a way to define a conditional density function for a child variable given continuous parents.
Most common is the linear Gaussian model, e.g.:

P(X_t = x_t | X_{t−1} = x_{t−1}) = N(a · x_{t−1} + b, σ²)(x_t)
                                 = (1 / (σ √(2π))) · exp( −(1/2) · ((x_t − (a · x_{t−1} + b)) / σ)² )

The mean of X_t varies linearly with x_{t−1}; the variance is fixed.
Linear variation and fixed variance may be unreasonable over the full range, but may work OK if the likely range of X_t is narrow.


Continuous variables (cont'd)

[Figure: 3D plot of the linear Gaussian density P(x_t | x_{t−1}) as a function of x_t and x_{t−1}]

An all-continuous network with linear Gaussian distributions ⇒ the full joint distribution is a multivariate Gaussian.


Updating Gaussian distributions
Prediction step: if P(X_t | e_{1:t}) is Gaussian, then the prediction

P(X_{t+1} | e_{1:t}) = ∫ P(X_{t+1} | x_t) · P(x_t | e_{1:t}) dx_t

is Gaussian. If P(X_{t+1} | e_{1:t}) is Gaussian, then the updated distribution

P(X_{t+1} | e_{1:t+1}) = α · P(e_{t+1} | X_{t+1}) · P(X_{t+1} | e_{1:t})

is Gaussian. Hence P(X_t | e_{1:t}) is a multivariate Gaussian N(µ_t, Σ_t) for all t.
For a general (nonlinear, non-Gaussian) process, the description of the posterior grows unboundedly as t → ∞.


Simple 1-D example
Task: measure the Norwegian population's level of job satisfaction on a monthly basis.
Monthly target: a real number (a value from −5 to +5, for instance).
Indirect measurement: ask a random subset of N people.
Modelling assumptions:
The true value cannot be measured (N < 4.8 · 10⁶), but the measurements Z_t are correlated with the true value X_t: P(z_t | X_t = x_t) ∼ N(x_t, σ_z²)
The true level at time t is related to the level at time t − 1: P(x_t | X_{t−1} = x_{t−1}) ∼ N(x_{t−1}, σ_x²)
That is, we have a Gaussian random walk.


Simple 1-D example (cont'd)
Gaussian random walk on the X-axis, transition s.d. σ_x, sensor s.d. σ_z:

µ_{t+1} = ((σ_t² + σ_x²) · z_{t+1} + σ_z² · µ_t) / (σ_t² + σ_x² + σ_z²)

σ_{t+1}² = ((σ_t² + σ_x²) · σ_z²) / (σ_t² + σ_x² + σ_z²)

[Figure: the prior density f(x_0), the predicted density f(x_1), and the updated density f(x_1 | z_1 = 2.5) after observing z_1 = 2.5]
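A minimal sketch of the 1-D update above; sigma_x and sigma_z are the transition and sensor standard deviations, and the numbers are illustrative.

```python
def kalman_1d(mu, var, z, sigma_x, sigma_z):
    """One Gaussian-random-walk step followed by an update on the observation z."""
    predicted_var = var + sigma_x ** 2
    mu_new = (predicted_var * z + sigma_z ** 2 * mu) / (predicted_var + sigma_z ** 2)
    var_new = (predicted_var * sigma_z ** 2) / (predicted_var + sigma_z ** 2)
    return mu_new, var_new

mu, var = 0.0, 1.0                      # prior N(0, 1) for X_0
mu, var = kalman_1d(mu, var, z=2.5, sigma_x=2.0 ** 0.5, sigma_z=1.0)
print(mu, var)                          # 1.875, 0.75 — the mean is pulled most of the way to z
```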


General Kalman update
Transition and sensor models:

P(x_{t+1} | x_t) = N(F x_t, Σ_x)(x_{t+1})
P(z_t | x_t) = N(H x_t, Σ_z)(z_t)

F is the matrix for the transition; Σ_x is the transition noise covariance.
H is the matrix for the sensors; Σ_z is the sensor noise covariance.

The filter computes the following update:

µ_{t+1} = F µ_t + K_{t+1} (z_{t+1} − H F µ_t)
Σ_{t+1} = (I − K_{t+1} H)(F Σ_t F^⊤ + Σ_x)

where K_{t+1} = (F Σ_t F^⊤ + Σ_x) H^⊤ (H (F Σ_t F^⊤ + Σ_x) H^⊤ + Σ_z)^{−1} is the Kalman gain matrix.

Note! Σ_t and K_t are independent of the observation sequence, so they can be computed offline.
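The general update is a handful of matrix operations; a hedged sketch (the 1-D sanity check at the bottom just reproduces the numbers from the simple example, and all matrices are illustrative):

```python
import numpy as np

def kalman_step(mu, Sigma, z, F, Sigma_x, H, Sigma_z):
    """One predict + update step; returns (µ_{t+1}, Σ_{t+1})."""
    mu_pred = F @ mu
    S_pred = F @ Sigma @ F.T + Sigma_x                               # predicted covariance
    K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + Sigma_z)     # Kalman gain
    mu_new = mu_pred + K @ (z - H @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ H) @ S_pred
    return mu_new, Sigma_new

# 1-D sanity check: with F = H = [1] this reduces to the 1-D random-walk update
mu, Sigma = np.array([0.0]), np.array([[1.0]])
mu, Sigma = kalman_step(mu, Sigma, np.array([2.5]),
                        F=np.eye(1), Sigma_x=np.array([[2.0]]),
                        H=np.eye(1), Sigma_z=np.array([[1.0]]))
print(mu, Sigma)   # ≈ [1.875], [[0.75]]
```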


2-D tracking example: filtering

[Figure: “2D filtering” — the true trajectory, the noisy observations and the filtered estimates in the X–Y plane]


2-D tracking example: smoothing

[Figure: “2D smoothing” — the true trajectory, the noisy observations and the smoothed estimates in the X–Y plane]


Where Kalman filtering falls apart
Kalman filters cannot be applied if the transition model is nonlinear.
The Extended Kalman Filter models the transition as locally linear around x_t = µ_t. It fails if the system is locally unsmooth.
A Switching Kalman Filter can be used to handle discontinuities.


Other dynamic models

Dynamic Bayesian networks

Dynamic Bayesian networks
X_t, E_t contain arbitrarily many variables in a replicated Bayes net.

[Figure: two example DBNs — the umbrella network with prior P(R_0) = 0.7, transition P(R_1 | R_0) = 0.7/0.3 and sensor P(U_1 | R_1) = 0.9/0.2; and a robot-monitoring network with Battery_t, position X_t, meter reading BMeter_1 and observation Z_1]


DBNs vs. HMMs
Every HMM is a single-variable DBN; every discrete DBN is an HMM.

[Figure: a DBN with three state variables X_t, Y_t, Z_t and sparse edges between consecutive time slices]

Sparse dependencies ⇒ exponentially fewer parameters.
E.g., with 20 state variables and three parents each, the DBN has 20 × 2³ = 160 parameters, while the HMM has 2²⁰ × 2²⁰ ≈ 10¹².


DBNs vs. Kalman filters
Every Kalman filter model is a DBN, but few DBNs are KFs, as the real world requires non-Gaussian posteriors:
Where are my keys? What's the battery charge? Does this system work?

[Figure: a battery-monitoring DBN with BMBroken_0 → BMBroken_1, Battery_0 → Battery_1, meter BMeter_1, positions X_0 → X_1 and observation Z_1; plots of E(Battery | . . . 5555005555 . . .), E(Battery | . . . 5555000000 . . .) and P(BMBroken | . . .) against the time step]


Summary

Summary
Temporal models use state and sensor variables replicated over time.
Markov assumptions and the stationarity assumption mean we need
a transition model P(X_t | X_{t−1}) and
a sensor model P(E_t | X_t).
Tasks are filtering, prediction, smoothing, and most likely sequence; all are done recursively with constant cost per time step.
Hidden Markov models have a single discrete state variable; used for speech recognition.
Kalman filters allow n state variables with linear Gaussian models, O(n³) update.
Dynamic Bayes nets subsume HMMs and Kalman filters; exact update is intractable, but approximations exist.
