Dynamical Factor Graphs (DFG) for Time Series Modeling

Piotr Mirowski Yann LeCun Courant Institute of Mathematical Sciences, New York University {mirowski,yann}@cs.nyu.edu http://cs.nyu.edu/~mirowski

Motivation for DFG

[Diagram: state-space model with observed variables Y(t-1), Y(t), Y(t+1), latent variables Z(t-1), Z(t), Z(t+1), an observation model g mapping Z(t) to Y(t), and a dynamical model f linking consecutive latent states.]

• Human MoCap
  – Few visible markers
  – Many (hidden) joint angles
• Unknown latent states
  – Unobserved data, continuous latent states
  – Potentially high-dimensional
• Chaotic time series
  – Complex, deterministic dynamics
• Highly nonlinear dynamics or observation/control models (convolutional net)
• Handle long sequences in linear time

[Kschischang et al, 2001; Taylor et al, 2006; Lorenz, 1963]


Dynamical Factor Graph (DFG)

[Factor graph: n-dimensional observed variables Y(t-2), Y(t-1), Y(t), Y(t+1) generated from m-dimensional latent variables Z(t-2), Z(t-1), Z(t), Z(t+1) by the observation model g with Gaussian noise; consecutive latent variables linked by the dynamical model f (1st-order Markov) with Gaussian noise.]

[Kschischang et al, 2001]

Dynamical Factor Graph (DFG)

[Same factor graph: n-dimensional observed variables Y(t-2), ..., Y(t+1), m-dimensional latent variables Z(t-2), ..., Z(t+1), observation model g with Gaussian noise; the dynamical model now takes a time-embedded sequence of p latent variables as input (p-th order Markov), with Gaussian noise.]

[Kschischang et al, 2001]
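The generative model depicted on the two slides above can be written as follows (a reconstruction from the slide labels; the zero-mean Gaussian noise notation and variances are assumptions):

% Generative model implied by the DFG diagrams (reconstruction, not copied from the deck):
% g is the observation model, f the p-th order Markov dynamical model, both with additive Gaussian noise.
\begin{align*}
  Z(t) &= f\bigl(Z(t-1), Z(t-2), \ldots, Z(t-p)\bigr) + \epsilon(t), & \epsilon(t) &\sim \mathcal{N}(0, \sigma_d^2 I) \\
  Y(t) &= g\bigl(Z(t)\bigr) + \omega(t), & \omega(t) &\sim \mathcal{N}(0, \sigma_o^2 I)
\end{align*}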

Highly nonlinear factors: convolutional networks

• Higher-order nonlinearity than:
  – radial basis functions
  – single hidden-layer Perceptrons
• No closed-form parameter optimization → use gradient-based techniques

[Architecture diagram: n-dimensional input with time embedding p=11 (n×p) → 1×3 convolution across time, time-step of 2 → Layer 1: 12 filters of size n×5 → Layer 2: 12 filters of size 1×3 → n×12×3 convolution across time, filters and components, time-step of 1 → Layer 3: full connection (12×3) to an n-dimensional vector.]

[LeCun et al, 1998; Mirowski et al, 2007, 2008, 2009]
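A minimal PyTorch sketch of a temporal convolutional network in the spirit of this diagram (not the authors' implementation): treating the n input dimensions as channels, the placement of the strides, and the tanh nonlinearity are assumptions chosen so that the layer sizes compose, with layer 2 producing 3 time steps to match the 12×3 full connection.

import torch
import torch.nn as nn

class TemporalConvNet(nn.Module):
    """Sketch of a convolutional factor: maps a time-embedded window (n x p) to an n-dim prediction."""
    def __init__(self, n, p=11):
        super().__init__()
        self.layer1 = nn.Conv1d(n, 12, kernel_size=5, stride=1)   # 12 filters spanning all n inputs, width 5 in time
        self.layer2 = nn.Conv1d(12, 12, kernel_size=3, stride=2)  # 12 filters of width 3 over the 12 feature maps
        self.layer3 = nn.Linear(12 * 3, n)                         # full connection from 12 x 3 to an n-dim vector
        self.act = nn.Tanh()

    def forward(self, x):                 # x: (batch, n, p)
        h = self.act(self.layer1(x))      # -> (batch, 12, 7)
        h = self.act(self.layer2(h))      # -> (batch, 12, 3)
        return self.layer3(h.flatten(1))  # -> (batch, n)

# toy usage: 5-dimensional series, embedding of p=11 past samples
net = TemporalConvNet(n=5)
y = net(torch.randn(8, 5, 11))
print(y.shape)  # torch.Size([8, 5])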


Energy-based graph of a DFG

[Factor graph: observed variables Y(t-1), Y(t) attached to latent variables Z(t-1), Z(t) through observation energies Eo(t-1), Eo(t) with observation parameters Wo; consecutive latent variables attached through the dynamical energy Ed(t) with dynamical parameters Wd.]

Learning and inference: deterministic gradient-based EM [Ghahramani & Roweis, 1999]


Inference of latent variables

• Inference of the latent variables Z = minimization w.r.t. Z of the total energy (dynamical energy + observation energy)
• Total energy of the DFG given a sequence Y and model parameters Wo, Wd: sum of observation and dynamical energies over the sequence (see the reconstruction sketched below)

[Ghahramani & Roweis, 1999; Ranzato et al, 2007]
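One plausible form of these energies, given the Gaussian-noise (quadratic) factors of the previous slides; the relative weight α between observation and dynamical energies is an assumption:

% Hedged reconstruction of the DFG energies; quadratic forms follow from the Gaussian noise factors,
% and the weighting \alpha is an assumption rather than a value from the deck.
\begin{align*}
  E_o(t) &= \tfrac{1}{2}\,\bigl\| Y(t) - g\bigl(W_o, Z(t)\bigr) \bigr\|^2 \\
  E_d(t) &= \tfrac{1}{2}\,\bigl\| Z(t) - f\bigl(W_d, Z(t-1), \ldots, Z(t-p)\bigr) \bigr\|^2 \\
  E(Z, Y; W_o, W_d) &= \sum_{t}\bigl( E_o(t) + \alpha\, E_d(t) \bigr), \qquad
  Z^{*} = \arg\min_{Z}\, E(Z, Y; W_o, W_d)
\end{align*}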


Learning of model parameters

• Loss function to minimize: total energy plus regularization terms
  – e.g. L1 regularization of model parameters
  – sparsity or smoothness penalties during state inference
• Deterministic gradient-based version of Expectation-Maximization (see the sketch below)
  – E-step (latent variable inference): annealed gradient descent on a minibatch of Z until convergence
  – M-step (parameter learning): 1 step of stochastic gradient descent (diagonal Levenberg-Marquardt)

[LeCun et al, 1998b; Ghahramani & Roweis, 1999; Ranzato et al, 2007]
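A minimal NumPy sketch of this alternating procedure, under simplifying assumptions: linear observation and first-order dynamical models (so the energy gradients are analytic), plain gradient steps in place of annealing and diagonal Levenberg-Marquardt, no minibatching, and illustrative learning rates. It shows the structure of the E/M loop, not the authors' implementation.

import numpy as np

def energies(Z, Y, Wo, Wd):
    Eo = Y - Wo @ Z                        # observation residuals, shape (n, T)
    Ed = Z[:, 1:] - Wd @ Z[:, :-1]         # dynamical residuals, shape (m, T-1)
    return Eo, Ed

def e_step(Z, Y, Wo, Wd, alpha=0.5, lr=0.05, n_iter=200):
    """Inference: gradient descent on the latent sequence Z with parameters fixed."""
    for _ in range(n_iter):
        Eo, Ed = energies(Z, Y, Wo, Wd)
        gZ = -Wo.T @ Eo                    # gradient of 0.5*||Y - Wo Z||^2 w.r.t. Z
        gZ[:, 1:] += alpha * Ed            # gradient of dynamical energy w.r.t. Z(t)
        gZ[:, :-1] -= alpha * Wd.T @ Ed    # gradient of dynamical energy w.r.t. Z(t-1)
        Z = Z - lr * gZ
    return Z

def m_step(Z, Y, Wo, Wd, alpha=0.5, lr=0.01):
    """Learning: one gradient step on the parameters with the latent sequence fixed."""
    Eo, Ed = energies(Z, Y, Wo, Wd)
    Wo = Wo + lr * Eo @ Z.T
    Wd = Wd + lr * alpha * Ed @ Z[:, :-1].T
    return Wo, Wd

# toy usage: n=3 observed dimensions, m=5 latent dimensions, T=400 time steps
rng = np.random.default_rng(0)
n, m, T = 3, 5, 400
Y = rng.standard_normal((n, T))
Z = 0.1 * rng.standard_normal((m, T))
Wo, Wd = 0.1 * rng.standard_normal((n, m)), 0.9 * np.eye(m)
for epoch in range(10):
    Z = e_step(Z, Y, Wo, Wd)       # E-step: infer latent variables
    Wo, Wd = m_step(Z, Y, Wo, Wd)  # M-step: update model parameters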


Smoothness penalty on latent variables

• More latent variables than observations: m > n
• Underconstrained latent variable inference:
  → L1 regularization (enforce latent variable sparsity across time and dimensions)
  → Smoothness penalty (reduce high-frequency noise)
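One plausible form of these two penalties, added to the inference objective; the exact expressions and the weights λ1, λ2 are assumptions, not values from the deck:

% Hedged sketch of the inference regularizers named above:
% L1 sparsity across time and dimensions, plus a temporal smoothness term.
R(Z) \;=\; \lambda_1 \sum_{t} \sum_{i} \bigl| z_i(t) \bigr|
      \;+\; \lambda_2 \sum_{t} \bigl\| Z(t) - Z(t-1) \bigr\|_2^2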


Results 1: asynchronous sine waves

Data and problem
• Mixture of sources

[Plots: spectrum analysis of the 5 latent states inferred on 400 training points, and spectrum analysis of the 5 latent states inferred on 3600 testing points.]

Results
• Hidden variables Z(t), dimension m=5
• Dynamical model: 5 independent AR(25)
• Smoothness penalty: 0.01
• Sources separated
• Perfect reconstruction: observation SNR 64 dB, dynamical SNR 54 dB
• Outperforms Long Short-Term Memory in the prediction task [Wierstra et al, 2007]


Results 2: inferring the Lorenz chaotic attractor

Data
• Lorenz dynamical model (correlation dimension 2.06)

Problem
• Partial observation
• Learn the DFG on training data, with latent variables of dimension m=3

Results
• Latent-state attractor inferred on test data is similar to the Lorenz attractor (Lorenz attractor reconstructed on the latent variables, correlation dimension 1.88)
• 1-step prediction error of -46.2 dB, smaller than SVR (-41.6 dB)

[Lorenz, 1963; Mattera et al, 1999]
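For reference, a minimal NumPy sketch of generating a Lorenz-63 series for this kind of experiment, using the standard parameters from Lorenz (1963); the integration step, sequence length, and the choice of observed coordinate are illustrative assumptions, not values from the slide.

import numpy as np

def lorenz_series(T=4000, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the Lorenz-63 system with a simple Euler scheme (RK4 would be more accurate)."""
    x = np.array([1.0, 1.0, 1.0])
    out = np.empty((T, 3))
    for t in range(T):
        dx = np.array([sigma * (x[1] - x[0]),
                       x[0] * (rho - x[2]) - x[1],
                       x[0] * x[1] - beta * x[2]])
        x = x + dt * dx
        out[t] = x
    return out

states = lorenz_series()
observed = states[:, :1]   # partial observation: e.g. only the first coordinate is visible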


Results 3: CATS time series

Data and problem
• CATS time series prediction competition
• Noisy chaotic time series (5000 points) with missing data (100 points)

Results
• Predictions of the 5 segments of missing data beat the CATS benchmark

[Lendasse et al, 2004]


Results 4: missing MoCap markers

Data
• Observations Y: 49-dimensional Motion Capture markers

Problem
• Model missing data (e.g. occlusions)
  – Test sequence: 260 frames
  – 2 subsequences of 65 frames with missing data: left leg; entire upper body

Approach
• Infer latent variables (E-step) on the test sequence, without gradient from the missing Yi(t), then generate Y from Z (see the sketch below)

[Taylor et al, 2006]
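A small sketch of this missing-data handling, assuming the quadratic observation energy and the linear stand-in for g used in the earlier EM sketch; the mask variable and function name are hypothetical.

import numpy as np

def masked_observation_residual(Y, Z, Wo, mask):
    """mask has shape (n, T): 1 where Y_i(t) is observed, 0 where missing.
    Missing coordinates contribute neither energy nor gradient to the E-step."""
    return mask * (Y - Wo @ Z)

# After inference, the missing markers are read off the reconstruction Y_hat = Wo @ Z_inferred.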


Results 4: missing MoCap markers

[Video frames: original data; reconstruction of missing upper body; reconstruction of missing left leg.]

Results
• Lower NMSE than nearest neighbors; inferred smooth, realistic motion


Results 4: missing MoCap markers

Normalized joint angles over the full test sequence: original, DFG, nearest neighbors


Thank you

1. Barber, D.: Dynamic Bayesian networks with deterministic latent tables. In: Advances in Neural Information Processing Systems (2003)
2. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5 (1994)
3. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39 (1977)
4. Ghahramani, Z., Roweis, S.: Learning nonlinear dynamical systems using an EM algorithm. In: Advances in Neural Information Processing Systems (1999)
5. Kschischang, F., Frey, B., Loeliger, H.-A.: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47 (2001)
6. Ilin, A., Valpola, H., Oja, E.: Nonlinear dynamical factor analysis for state change detection. IEEE Transactions on Neural Networks 15(3) (2004)
7. Lang, K., Hinton, G.: The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University (1988)
8. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (1998a)
9. LeCun, Y., Bottou, L., Orr, G., Muller, K.: Efficient backprop. In: Orr, G.B., Muller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, Springer (1998b)
10. Lendasse, A., Oja, E., Simula, O.: Time series prediction competition: The CATS benchmark. In: Proceedings of the IEEE International Joint Conference on Neural Networks (2004)
11. Levin, E.: Hidden control neural architecture modeling of nonlinear time-varying systems and its applications. IEEE Transactions on Neural Networks 4 (1993)
12. Lorenz, E.: Deterministic nonperiodic flow. Journal of Atmospheric Sciences 20 (1963)
13. Mattera, D., Haykin, S.: Support vector machines for dynamic reconstruction of a chaotic system. In: Scholkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, MIT Press (1999)
14. Muller, K., Smola, A., Ratsch, G., Scholkopf, B., Kohlmorgen, J., Vapnik, V.: Using support vector machines for time-series prediction. In: Scholkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, MIT Press (1999)
15. Sarkka, S., Vehtari, A., Lampinen, J.: Time series prediction by Kalman smoother with cross-validated noise density. In: Proceedings of the IEEE International Joint Conference on Neural Networks (2004)
16. Takens, F.: Detecting strange attractors in turbulence. Lecture Notes in Mathematics, vol. 898 (1981)
17. Taylor, G., Hinton, G., Roweis, S.: Modeling human motion using binary latent variables. In: Advances in Neural Information Processing Systems (2006)
18. Wan, E.: Time series prediction by using a connectionist network with internal delay lines. In: Weigend, A.S., Gershenfeld, N.A. (eds.) Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley (1993)
19. Wan, E., Nelson, A.: Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing. In: Advances in Neural Information Processing Systems (1996)
20. Wang, J., Fleet, D., Hertzmann, A.: Gaussian process dynamical models. In: Advances in Neural Information Processing Systems (2006)
21. Wierstra, D., Gomez, F., Schmidhuber, J.: Modeling systems with internal state using Evolino. In: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (2005)
22. Williams, R., Zipser, D.: Gradient-based learning algorithms for recurrent networks and their computational complexity. In: Backpropagation: Theory, Architectures and Applications, Lawrence Erlbaum Associates (1995)