Parameter Estimation for Linear Dynamical Systems

Zoubin Ghahramani
Geoffrey E. Hinton


Department of Computer Science
University of Toronto
6 King's College Road
Toronto, Canada M5S 1A4
Email: [email protected]

Technical Report CRG-TR-96-2
February 22, 1996

Abstract

Linear systems have been used extensively in engineering to model and control the behavior of dynamical systems. In this note, we present the Expectation Maximization (EM) algorithm for estimating the parameters of linear systems (Shumway and Stoffer, 1982). We also point out the relationship between linear dynamical systems, factor analysis, and hidden Markov models.

Introduction

The goal of this note is to introduce the EM algorithm for estimating the parameters of linear dynamical systems (LDS). Such linear systems can be used both for supervised and unsupervised modeling of time series. We first describe the model and then briefly point out its relation to factor analysis and other data modeling techniques.

The Model

Linear time-invariant dynamical systems, also known as linear Gaussian state-space models, can be described by the following two equations:

    x_{t+1} = A x_t + w_t                                                            (1)
    y_t = C x_t + v_t .                                                              (2)

Time is indexed by the discrete index t. The output y_t is a linear function of the state, x_t, and the state at one time step depends linearly on the previous state. Both the state and output noise, w_t and v_t, are zero-mean normally distributed random variables with covariance matrices Q and R, respectively. Only the output of the system is observed; the state and all the noise variables are hidden. Rather than regarding the state as a deterministic value corrupted by random noise, we combine the state variable and the state noise variable into a single Gaussian random variable; we form a similar combination for the output. Based on (1) and (2) we can write the conditional densities for the state and output,

    P(y_t \mid x_t) = \exp\left\{ -\tfrac{1}{2} [y_t - C x_t]' R^{-1} [y_t - C x_t] \right\} (2\pi)^{-p/2} |R|^{-1/2}              (3)

    P(x_t \mid x_{t-1}) = \exp\left\{ -\tfrac{1}{2} [x_t - A x_{t-1}]' Q^{-1} [x_t - A x_{t-1}] \right\} (2\pi)^{-k/2} |Q|^{-1/2} .   (4)
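As a concrete illustration of the generative model in (1) and (2), the following sketch draws a state and output sequence with NumPy. The dimensions (k = 2 states, p = 1 output) and the particular A, C, Q, R values are arbitrary choices for illustration, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameters: k = 2 state dimensions, p = 1 output.
A = np.array([[0.9, 0.1],
              [0.0, 0.9]])          # state dynamics matrix
C = np.array([[1.0, 0.0]])          # output matrix
Q = 0.01 * np.eye(2)                # state noise covariance
R = np.array([[0.1]])               # output noise covariance

def simulate_lds(T, x1):
    """Sample a trajectory from x_{t+1} = A x_t + w_t, y_t = C x_t + v_t."""
    xs, ys = [], []
    x = x1
    for _ in range(T):
        ys.append(C @ x + rng.multivariate_normal(np.zeros(1), R))
        xs.append(x)
        x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
    return np.array(xs), np.array(ys)

states, outputs = simulate_lds(100, np.array([1.0, 0.0]))
```

Only the `outputs` array would be visible to the learning algorithm described below; the `states` are hidden.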

A sequence of T output vectors (y_1, y_2, \ldots, y_T) is denoted by \{y\}; a subsequence (y_{t_0}, y_{t_0+1}, \ldots, y_{t_1}) by \{y\}_{t_0}^{t_1}; similarly for the states. By the Markov property implicit in this model,

    P(\{x\}, \{y\}) = P(x_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}) \prod_{t=1}^{T} P(y_t \mid x_t) .    (5)

Assuming a Gaussian initial state density,

    P(x_1) = \exp\left\{ -\tfrac{1}{2} [x_1 - \pi_1]' V_1^{-1} [x_1 - \pi_1] \right\} (2\pi)^{-k/2} |V_1|^{-1/2} .    (6)

Therefore, the joint log probability is a sum of quadratic terms,

    \log P(\{x\}, \{y\}) = - \sum_{t=1}^{T} \tfrac{1}{2} [y_t - C x_t]' R^{-1} [y_t - C x_t] - \tfrac{T}{2} \log |R|
        - \sum_{t=2}^{T} \tfrac{1}{2} [x_t - A x_{t-1}]' Q^{-1} [x_t - A x_{t-1}] - \tfrac{T-1}{2} \log |Q|
        - \tfrac{1}{2} [x_1 - \pi_1]' V_1^{-1} [x_1 - \pi_1] - \tfrac{1}{2} \log |V_1| - \tfrac{T(p+k)}{2} \log 2\pi .    (7)

Often the inputs to the system can also be observed. In this case, the goal is to model the input-output response of a system. Denoting the inputs by u_t, the state equation is

    x_{t+1} = A x_t + B u_t + w_t ,    (8)

where B is the input matrix relating inputs linearly to states. We will present the learning algorithm for the output-only case, although the extensions to the input-output case are straightforward. If only the outputs of the system can be observed, the problem can be seen as an unsupervised one: the goal is to model the unconditional density of the observations. If both inputs and outputs are observed, the problem becomes supervised, modeling the conditional density of the output given the input.
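The joint log probability (7) is straightforward to evaluate for given parameters: it is simply the sum of the Gaussian log-densities (3), (4), and (6) along the sequence. A minimal NumPy sketch (the helper names are ours):

```python
import numpy as np

def gauss_logpdf(z, mean, cov):
    """Multivariate Gaussian log-density, as in (3), (4) and (6)."""
    d = z - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(z) * np.log(2 * np.pi))

def joint_log_prob(xs, ys, A, C, Q, R, pi1, V1):
    """log P({x},{y}): the sum of quadratic terms in equation (7)."""
    lp = gauss_logpdf(xs[0], pi1, V1)                # initial state term
    for t in range(1, len(xs)):
        lp += gauss_logpdf(xs[t], A @ xs[t - 1], Q)  # state transition terms
    for t in range(len(ys)):
        lp += gauss_logpdf(ys[t], C @ xs[t], R)      # output terms
    return lp
```

Summing per-step log-densities rather than multiplying densities avoids numerical underflow for long sequences.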

Related Methods

In its unsupervised incarnation, this model is an extension of maximum likelihood factor analysis (Everitt, 1984). The factor, x_t, evolves over time according to linear dynamics. In factor analysis, a further assumption is made that the output noise along each dimension is uncorrelated, i.e. that R is diagonal. The goal of factor analysis is therefore to compress the correlational structure of the data into the values of the lower dimensional factors, while allowing independent noise terms to model the uncorrelated noise. The assumption of a diagonal R matrix can also be easily incorporated into the estimation procedure for the parameters of a linear dynamical system.

The linear dynamical system can also be seen as a continuous-state analogue of the hidden Markov model (HMM; see Rabiner and Juang, 1986, for a review). The forward part of the forward-backward algorithm from HMMs is computed by the well-known Kalman filter in LDSs; similarly, the backward part is computed using Rauch's recursion (Rauch, 1963). Together, these two recursions can be used to solve the problem of inferring the probabilities of the states given the observation sequence (known in engineering as the smoothing problem). These posterior probabilities form the basis of the E step of the EM algorithm.

Finally, linear dynamical systems can also be represented as graphical probabilistic models (sometimes referred to as belief networks). The Kalman-Rauch recursions are special cases of the probability propagation algorithms that have been developed for graphical models (Lauritzen and Spiegelhalter, 1988; Pearl, 1988).

The EM Algorithm

Shumway and Stoffer (1982) presented an EM algorithm for linear dynamical systems where the observation matrix, C, is known. Since then, many authors have presented closely related models and extensions, also fit with the EM algorithm (Shumway and Stoffer, 1991; Kim, 1994; Athaide, 1995). Here we present a basic form of the EM algorithm with C unknown, an obvious modification of Shumway and Stoffer's original work. This note is meant as a succinct review of this literature for those wishing to implement learning in linear dynamical systems. The E step of EM requires computing the expected log likelihood,

    Q = E[\log P(\{x\}, \{y\}) \mid \{y\}] .    (9)

This quantity depends on three expectations, E[x_t \mid \{y\}], E[x_t x_t' \mid \{y\}], and E[x_t x_{t-1}' \mid \{y\}], which we will denote by the symbols:

    \hat{x}_t \equiv E[x_t \mid \{y\}]    (10)
    P_t \equiv E[x_t x_t' \mid \{y\}]    (11)
    P_{t,t-1} \equiv E[x_t x_{t-1}' \mid \{y\}] .    (12)

Note that the state estimate, \hat{x}_t, differs from the one computed in a Kalman filter in that it depends on past and future observations; the Kalman filter estimates E[x_t \mid \{y\}_1^t] (Anderson and Moore, 1979). We first describe the M step of the parameter estimation algorithm before showing how the above expectations are computed in the E step.

The M step

The parameters of this system are A, C, R, Q, \pi_1, and V_1. Each of these is re-estimated by taking the corresponding partial derivative of the expected log likelihood, setting it to zero, and solving. This results in the following:

* Output matrix:

      \frac{\partial Q}{\partial C} = - \sum_{t=1}^{T} R^{-1} y_t \hat{x}_t' + \sum_{t=1}^{T} R^{-1} C P_t = 0    (13)

      C^{new} = \left( \sum_{t=1}^{T} y_t \hat{x}_t' \right) \left( \sum_{t=1}^{T} P_t \right)^{-1}    (14)

* Output noise covariance:

      \frac{\partial Q}{\partial R^{-1}} = \frac{T}{2} R - \sum_{t=1}^{T} \left( \tfrac{1}{2} y_t y_t' - C^{new} \hat{x}_t y_t' + \tfrac{1}{2} C^{new} P_t C^{new\prime} \right) = 0    (15)

      R^{new} = \frac{1}{T} \sum_{t=1}^{T} \left( y_t y_t' - C^{new} \hat{x}_t y_t' \right)    (16)

* State dynamics matrix:

      \frac{\partial Q}{\partial A} = - \sum_{t=2}^{T} Q^{-1} P_{t,t-1} + \sum_{t=2}^{T} Q^{-1} A P_{t-1} = 0    (17)

      A^{new} = \left( \sum_{t=2}^{T} P_{t,t-1} \right) \left( \sum_{t=2}^{T} P_{t-1} \right)^{-1}    (18)

* State noise covariance:

      \frac{\partial Q}{\partial Q^{-1}} = \frac{T-1}{2} Q - \frac{1}{2} \sum_{t=2}^{T} \left( P_t - A^{new} P_{t-1,t} - P_{t,t-1} A^{new\prime} + A^{new} P_{t-1} A^{new\prime} \right) = 0    (19)

      Q^{new} = \frac{1}{T-1} \left( \sum_{t=2}^{T} P_t - A^{new} \sum_{t=2}^{T} P_{t-1,t} \right)    (20)

* Initial state mean:

      \frac{\partial Q}{\partial \pi_1} = (\hat{x}_1 - \pi_1)' V_1^{-1} = 0    (21)

      \pi_1^{new} = \hat{x}_1    (22)

* Initial state covariance:

      \frac{\partial Q}{\partial V_1^{-1}} = \frac{1}{2} V_1 - \frac{1}{2} \left( P_1 - \hat{x}_1 \pi_1' - \pi_1 \hat{x}_1' + \pi_1 \pi_1' \right)    (23)

      V_1^{new} = P_1 - \hat{x}_1 \hat{x}_1'    (24)
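Assuming the E step has already supplied \hat{x}_t, P_t and P_{t,t-1}, the M step updates (14), (16), (18), (20), (22) and (24) reduce to a few matrix sums. A minimal NumPy sketch for a single sequence (function and variable names are ours):

```python
import numpy as np

def m_step(ys, xhat, P, Pcross):
    """One M step: re-estimate C, R, A, Q, pi_1, V_1 from the smoothed
    expectations xhat[t] = E[x_t|{y}], P[t] = E[x_t x_t'|{y}] and
    Pcross[t] = E[x_t x_{t-1}'|{y}] (Pcross[0] is unused).
    Equation numbers refer to (14), (16), (18), (20), (22), (24)."""
    T = len(ys)
    Syx = np.einsum('tp,tk->pk', ys, xhat)                 # sum_t y_t xhat_t'
    C_new = Syx @ np.linalg.inv(P.sum(axis=0))                              # (14)
    Syy = np.einsum('tp,tq->pq', ys, ys)                   # sum_t y_t y_t'
    Sxy = np.einsum('tk,tp->kp', xhat, ys)                 # sum_t xhat_t y_t'
    R_new = (Syy - C_new @ Sxy) / T                                         # (16)
    A_new = Pcross[1:].sum(axis=0) @ np.linalg.inv(P[:-1].sum(axis=0))      # (18)
    # P_{t-1,t} = P_{t,t-1}', so its sum is the transpose of sum(Pcross).
    Q_new = (P[1:].sum(axis=0) - A_new @ Pcross[1:].sum(axis=0).T) / (T - 1)  # (20)
    pi1_new = xhat[0]                                                       # (22)
    V1_new = P[0] - np.outer(xhat[0], xhat[0])                              # (24)
    return C_new, R_new, A_new, Q_new, pi1_new, V1_new
```

Each update is a closed-form solution, so the M step requires no iterative optimization.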

The above equations can be readily generalized to multiple observation sequences, with one subtlety regarding the estimate of the initial state covariance. Assume N observation sequences of length T, let \hat{x}_t^{(i)} be the estimate of the state at time t given the i-th sequence, and

    \hat{x}_t = \frac{1}{N} \sum_{i=1}^{N} \hat{x}_t^{(i)} .

Then the initial state covariance is

    V_1^{new} = P_1 - \hat{x}_1 \hat{x}_1' + \frac{1}{N} \sum_{i=1}^{N} [\hat{x}_1^{(i)} - \hat{x}_1] [\hat{x}_1^{(i)} - \hat{x}_1]' .    (25)

The E step

Using x_t^\tau to denote E(x_t \mid \{y\}_1^\tau), and V_t^\tau to denote Var(x_t \mid \{y\}_1^\tau), we obtain the following Kalman filter forward recursions:

    x_t^{t-1} = A x_{t-1}^{t-1}    (26)
    V_t^{t-1} = A V_{t-1}^{t-1} A' + Q    (27)
    K_t = V_t^{t-1} C' (C V_t^{t-1} C' + R)^{-1}    (28)
    x_t^t = x_t^{t-1} + K_t (y_t - C x_t^{t-1})    (29)
    V_t^t = V_t^{t-1} - K_t C V_t^{t-1} ,    (30)

where x_1^0 = \pi_1 and V_1^0 = V_1. Following Shumway and Stoffer (1982), to compute \hat{x}_t \equiv x_t^T and P_t \equiv V_t^T + x_t^T x_t^{T\prime} one performs a set of backward recursions using

    J_{t-1} = V_{t-1}^{t-1} A' (V_t^{t-1})^{-1}    (31)
    x_{t-1}^T = x_{t-1}^{t-1} + J_{t-1} (x_t^T - A x_{t-1}^{t-1})    (32)
    V_{t-1}^T = V_{t-1}^{t-1} + J_{t-1} (V_t^T - V_t^{t-1}) J_{t-1}' .    (33)

We also require P_{t,t-1} \equiv V_{t,t-1}^T + x_t^T x_{t-1}^{T\prime}, which can be obtained through the backward recursions

    V_{t-1,t-2}^T = V_{t-1}^{t-1} J_{t-2}' + J_{t-1} (V_{t,t-1}^T - A V_{t-1}^{t-1}) J_{t-2}' ,    (34)

which is initialized V_{T,T-1}^T = (I - K_T C) A V_{T-1}^{T-1}.
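The forward recursions (26)-(30) and backward recursions (31)-(34) can be sketched as follows for a single sequence. This is an illustrative NumPy transcription with names of our choosing; it uses naive matrix inverses and omits the square-root and other numerical-stability refinements used in practical Kalman filter implementations:

```python
import numpy as np

def e_step(ys, A, C, Q, R, pi1, V1):
    """Kalman filter (26)-(30) and Rauch smoother (31)-(34) for one sequence.

    Returns xhat[t] = E[x_t|{y}], P[t] = E[x_t x_t'|{y}] and
    Pcross[t] = E[x_t x_{t-1}'|{y}] (Pcross[0] is unused). Assumes T >= 2.
    """
    T, k = len(ys), len(pi1)
    xp, Vp = np.zeros((T, k)), np.zeros((T, k, k))   # x_t^{t-1}, V_t^{t-1}
    xf, Vf = np.zeros((T, k)), np.zeros((T, k, k))   # x_t^t,     V_t^t
    for t in range(T):
        if t == 0:
            xp[0], Vp[0] = pi1, V1                   # x_1^0 = pi_1, V_1^0 = V_1
        else:
            xp[t] = A @ xf[t - 1]                                  # (26)
            Vp[t] = A @ Vf[t - 1] @ A.T + Q                        # (27)
        K = Vp[t] @ C.T @ np.linalg.inv(C @ Vp[t] @ C.T + R)       # (28)
        xf[t] = xp[t] + K @ (ys[t] - C @ xp[t])                    # (29)
        Vf[t] = Vp[t] - K @ C @ Vp[t]                              # (30)
    xs, Vs = xf.copy(), Vf.copy()                    # smoothed: x_t^T, V_t^T
    J = np.zeros((T, k, k))
    for t in range(T - 2, -1, -1):
        J[t] = Vf[t] @ A.T @ np.linalg.inv(Vp[t + 1])              # (31)
        xs[t] = xf[t] + J[t] @ (xs[t + 1] - A @ xf[t])             # (32)
        Vs[t] = Vf[t] + J[t] @ (Vs[t + 1] - Vp[t + 1]) @ J[t].T    # (33)
    # Cross-covariances V_{t,t-1}^T via (34); K is here the final gain K_T.
    Vc = np.zeros((T, k, k))
    Vc[T - 1] = (np.eye(k) - K @ C) @ A @ Vf[T - 2]
    for t in range(T - 1, 1, -1):
        Vc[t - 1] = Vf[t - 1] @ J[t - 2].T \
                  + J[t - 1] @ (Vc[t] - A @ Vf[t - 1]) @ J[t - 2].T
    xhat = xs
    P = Vs + np.einsum('ti,tj->tij', xs, xs)         # P_t = V_t^T + x_t^T x_t^T'
    Pcross = Vc.copy()
    Pcross[1:] += np.einsum('ti,tj->tij', xs[1:], xs[:-1])
    return xhat, P, Pcross
```

Alternating this E step with the M step above, and monitoring the likelihood for convergence, yields the full EM algorithm.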

References

Anderson, B. D. O. and Moore, J. B. (1979). Optimal Filtering. Prentice-Hall, Englewood Cliffs, NJ.

Athaide, C. R. (1995). Likelihood Evaluation and State Estimation for Nonlinear State Space Models. Ph.D. Thesis, Graduate Group in Managerial Science and Applied Economics, University of Pennsylvania, Philadelphia, PA.

Everitt, B. S. (1984). An Introduction to Latent Variable Models. Chapman and Hall, London.

Kim, C.-J. (1994). Dynamic linear models with Markov-switching. J. Econometrics, 60:1-22.

Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society B, pages 157-224.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.

Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing Magazine, 3:4-16.

Rauch, H. E. (1963). Solutions to the linear smoothing problem. IEEE Transactions on Automatic Control, 8:371-372.

Shumway, R. H. and Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3(4):253-264.

Shumway, R. H. and Stoffer, D. S. (1991). Dynamic linear models with switching. J. Amer. Stat. Assoc., 86:763-769.

