Dynamical Factor Graphs (DFG) for Time Series Modeling
Piotr Mirowski, Yann LeCun
Courant Institute of Mathematical Sciences, New York University
{mirowski,yann}@cs.nyu.edu
http://cs.nyu.edu/~mirowski
Motivation for DFG
• State-space model
  [Diagram: observed variables Y(t-1), Y(t), Y(t+1) generated from latent states Z(t-1), Z(t), Z(t+1) by the observation model g; latent states chained over time by the dynamical model f]
• Human MoCap
  – Few visible markers
  – Many (hidden) joint angles
• Unknown latent states
  – Unobserved, continuous latent states
  – Potentially high-dimensional
• Chaotic time series
  – Complex, deterministic dynamics
• Highly nonlinear observation/control or dynamics models (convolutional net)
• Handle long sequences in linear time
[Kschischang et al., 2001; Taylor et al., 2006; Lorenz, 1963]
Dynamical Factor Graph (DFG)
[Diagram: n-dimensional observed variables Y(t-2), ..., Y(t+1), each generated from the corresponding m-dimensional latent variable Z(t) by the observation model g plus Gaussian noise; consecutive latent variables linked by the dynamical model f (1st-order Markov) plus Gaussian noise]
[Kschischang et al., 2001]
Dynamical Factor Graph (DFG)
[Diagram: same graph, but the dynamical model f is p-th order Markov: it predicts Z(t) from a time-embedded sequence of p past latent variables, again with Gaussian noise on both the dynamical and observation factors]
[Kschischang et al., 2001]
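The two Gaussian-noise factors of this graph correspond to the following generative model (a compact restatement of the diagram above; the noise variances and identity covariances are written generically and are not stated on the slide):

```latex
% p-th order Markov dynamical factor and observation factor, both with Gaussian noise
\begin{aligned}
Z(t) &= f\!\left(W_d;\; Z(t-p), \dots, Z(t-1)\right) + \epsilon_d(t),
  & \epsilon_d(t) &\sim \mathcal{N}\!\left(0, \sigma_d^2 I_m\right)\\[2pt]
Y(t) &= g\!\left(W_o;\; Z(t)\right) + \epsilon_o(t),
  & \epsilon_o(t) &\sim \mathcal{N}\!\left(0, \sigma_o^2 I_n\right)
\end{aligned}
```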
Highly nonlinear factors: convolutional networks
• Higher-order nonlinearity than:
  – radial basis functions
  – single hidden-layer perceptrons
• No closed-form parameter optimization: use gradient-based techniques
Convolutional architecture of the factor (see the sketch below):
• Input: n-dimensional, with time embedding p=11 (an n×p window)
• Layer 1: 12 filters of size n×5; 1×3 convolution across time with a time-step of 2
• Layer 2: 12 filters of size 1×3; n×12×3 convolution across time, filters and components with a time-step of 1
• Layer 3: full connection (12×3) to an n-dimensional vector
[LeCun et al., 1998a; Mirowski et al., 2007, 2008, 2009]
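To make the architecture above concrete, here is a minimal PyTorch sketch of one plausible reading of it. The filter counts and sizes (12 filters of n×5, then 1×3, then a full connection to an n-dimensional vector) follow the slide; the tanh nonlinearity, the exact strides and the resulting feature-map sizes (computed below rather than hard-coded) are assumptions.

```python
# Sketch of the convolutional factor; sizes follow the slide, other details are assumptions.
import torch
import torch.nn as nn

class ConvFactor(nn.Module):
    def __init__(self, n: int, p: int = 11, n_filters: int = 12):
        super().__init__()
        # Layer 1: 12 filters spanning all n components and 5 time steps, time-step (stride) 2.
        self.conv1 = nn.Conv2d(1, n_filters, kernel_size=(n, 5), stride=(1, 2))
        # Layer 2: 12 filters convolving 3 consecutive time positions of the feature maps, stride 1.
        self.conv2 = nn.Conv2d(n_filters, n_filters, kernel_size=(1, 3), stride=(1, 1))
        # Layer 3: full connection to an n-dimensional output vector.
        t1 = (p - 5) // 2 + 1                      # temporal size after layer 1
        t2 = t1 - 3 + 1                            # temporal size after layer 2
        self.full = nn.Linear(n_filters * t2, n)
        self.act = nn.Tanh()

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        """window: (batch, n, p) time-embedded input -> (batch, n) prediction."""
        h = self.act(self.conv1(window.unsqueeze(1)))   # (batch, 12, 1, t1)
        h = self.act(self.conv2(h))                     # (batch, 12, 1, t2)
        return self.full(h.flatten(1))

f = ConvFactor(n=5)                                 # e.g. 5 components, p = 11
print(f(torch.randn(8, 5, 11)).shape)               # torch.Size([8, 5])
```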
Energy-based graph of a DFG
[Diagram: observations Y(t-1), Y(t) attached to latent variables Z(t-1), Z(t) through observation energies Eo(t-1), Eo(t), parameterized by Wo; consecutive latent variables attached through the dynamical energy Ed(t), parameterized by Wd]
Learning and inference: deterministic gradient-based EM
[Ghahramani & Roweis, 1999]
Inference of latent variables
Inference of the latent variables Z amounts to minimizing, with respect to Z, the total energy of the DFG given a sequence Y and the model parameters Wo, Wd, i.e. the sum over time of the dynamical and observation energies.
[Ghahramani & Roweis, 1999; Ranzato et al., 2007]
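With the Gaussian-noise factors above, the energies are quadratic: Eo(t) = ||Y(t) - g(Wo, Z(t))||² and Ed(t) = ||Z(t) - f(Wd, Z(t-p), ..., Z(t-1))||², and the total energy is their sum over t. Below is a minimal PyTorch sketch of this inference step; the model objects g_model and f_model, the relative weight alpha, the zero initialization of Z and the optimizer settings are assumptions.

```python
# Sketch of latent-variable inference: gradient descent on the total energy w.r.t. Z.
import torch

def total_energy(Y, Z, g_model, f_model, p, alpha=1.0):
    # Observation energy: sum_t ||Y(t) - g(Z(t))||^2
    e_obs = ((Y - g_model(Z)) ** 2).sum()
    # Dynamical energy: sum_t ||Z(t) - f(Z(t-p), ..., Z(t-1))||^2
    windows = torch.stack([Z[t - p:t].reshape(-1) for t in range(p, len(Z))])
    e_dyn = ((Z[p:] - f_model(windows)) ** 2).sum()
    return e_obs + alpha * e_dyn

def infer_latents(Y, g_model, f_model, p, m, n_steps=200, lr=1e-2):
    """Minimize the total energy w.r.t. the latent sequence Z, keeping Wo, Wd fixed."""
    Z = torch.zeros(len(Y), m, requires_grad=True)
    opt = torch.optim.SGD([Z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        total_energy(Y, Z, g_model, f_model, p).backward()
        opt.step()
    return Z.detach()
```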
Learning of model parameters
• Loss function to minimize: the total energy, plus e.g. L1 regularization of the model parameters
• Sparsity or smoothness penalties during state inference
Deterministic gradient-based version of Expectation-Maximization (see the sketch below):
• E-step (latent variable inference): annealed gradient descent on a minibatch of Z until convergence
• M-step (parameter learning): 1 step of stochastic gradient descent (diagonal Levenberg-Marquardt)
[LeCun et al., 1998b; Ghahramani & Roweis, 1999; Ranzato et al., 2007]
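A minimal sketch of this EM alternation, reusing total_energy and infer_latents from the previous sketch. For brevity it uses plain stochastic gradient descent without annealing or the diagonal Levenberg-Marquardt step; that simplification, the learning rate and the L1 weight are assumptions.

```python
# Sketch of gradient-based EM: alternate latent inference (E-step) and one parameter update (M-step).
def fit_dfg(minibatches, g_model, f_model, p, m, n_epochs=50, lr=1e-3, lambda_w=1e-4):
    params = list(g_model.parameters()) + list(f_model.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(n_epochs):
        for Y in minibatches:                       # Y: (T, n) observed minibatch
            # E-step: infer the latent sequence with the parameters fixed.
            Z = infer_latents(Y, g_model, f_model, p, m)
            # M-step: one stochastic gradient step on Wo, Wd with L1 regularization.
            opt.zero_grad()
            loss = total_energy(Y, Z, g_model, f_model, p)
            loss = loss + lambda_w * sum(w.abs().sum() for w in params)
            loss.backward()
            opt.step()
    return g_model, f_model
```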
Smoothness penalty on latent variables
• More latent variables than observations: m > n
• Underconstrained latent variable inference:
  → L1 regularization (enforce latent-variable sparsity across time and dimensions)
  → Smoothness penalty (reduce high-frequency noise)
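A short sketch of these two penalties as extra terms added to the energy minimized during inference; the penalty weights are assumptions (the sine-wave experiment on the next slide reports a smoothness penalty of 0.01).

```python
# Sketch of the latent-variable regularizers added to the inference objective.
def latent_penalty(Z, lambda_l1=0.01, lambda_smooth=0.01):
    sparsity = Z.abs().sum()                        # L1: sparsity across time and dimensions
    smoothness = ((Z[1:] - Z[:-1]) ** 2).sum()      # penalize high-frequency changes of Z(t)
    return lambda_l1 * sparsity + lambda_smooth * smoothness
```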
Results 1: asynchronous sine waves
Data and problem: a mixture of asynchronous sine-wave sources.
Model: hidden variables Z(t) of dimension m=5; dynamical model: 5 independent AR(25) models; smoothness penalty: 0.01.
Results: spectrum analysis of the 5 latent states inferred on the 400 training points and on the 3600 testing points shows that the sources are separated and almost perfectly reconstructed (observation SNR 64 dB, dynamical SNR 54 dB); outperforms Long Short-Term Memory on the prediction task [Wierstra et al., 2005].
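The dynamical model in this experiment is 5 independent AR(25) models, one per latent component. A minimal sketch of such a factor is below; the window layout, the zero initialization and the module interface are assumptions.

```python
# Sketch of a dynamical model made of m independent linear AR(order) filters.
import torch
import torch.nn as nn

class IndependentAR(nn.Module):
    def __init__(self, m: int = 5, order: int = 25):
        super().__init__()
        self.coeffs = nn.Parameter(torch.zeros(m, order))   # one AR filter per latent component

    def forward(self, z_window: torch.Tensor) -> torch.Tensor:
        """z_window: (batch, order, m) past latents -> (batch, m) one-step prediction."""
        return torch.einsum('btm,mt->bm', z_window, self.coeffs)
```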
Results 2: inferring the Lorenz chaotic attractor
Data: the Lorenz dynamical model (correlation dimension 2.06).
Problem: partial observation; learn the DFG on training data with latent variables of dimension m=3.
Results: the latent-state attractor inferred on test data is similar to the Lorenz attractor; the attractor reconstructed from the latent variables has correlation dimension 1.88; 1-step prediction error of -46.2 dB, lower than that of SVR (-41.6 dB).
[Lorenz, 1963; Mattera & Haykin, 1999]
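For reference, data of this kind can be generated from the Lorenz-63 system with its standard parameters (sigma=10, rho=28, beta=8/3); the step size, the simple Euler integration and the choice of which coordinate is kept as the partial observation are assumptions made for this illustration, not details given on the slide.

```python
# Sketch: generate a Lorenz-63 trajectory and keep a partial observation of it.
import numpy as np

def lorenz_series(T=5000, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x = np.empty((T, 3))
    x[0] = (1.0, 1.0, 1.0)                          # arbitrary initial condition
    for t in range(T - 1):
        X, Y, Z = x[t]
        dx = np.array([sigma * (Y - X), X * (rho - Z) - Y, X * Y - beta * Z])
        x[t + 1] = x[t] + dt * dx                   # simple Euler step
    return x

full_state = lorenz_series()                        # true 3-D state (m = 3 latent dimensions)
observed = full_state[:, :1]                        # partial observation given to the DFG
```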
Results 3: CATS time series
Data and problem: the CATS time series prediction competition, a noisy chaotic time series (5,000 points) with missing data (100 points).
Results: the predictions of the 5 segments of missing data beat the CATS benchmark.
[Lendasse et al., 2004]
Results4: missing MoCap markers Data
• Observations Y: 49-dimensional Motion Capture markers Problem
• Model missing data (e.g. occlusions…) – Test sequence: 260 frames – 2 subsequences of 65 frames with missing data: • Left leg • Entire upper body Approach
• Infer latent variables (E-step) on test sequence, (without gradient from missing Yi(t)), generate Y from Z [Taylor et al, 2006]
13
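A minimal sketch of the masking idea described above: missing components Yi(t) are simply excluded from the observation energy, so they contribute no gradient during inference; the mask-based implementation is an assumption.

```python
# Sketch: observation energy with missing markers excluded from the gradient.
def masked_observation_energy(Y, Z, g_model, mask):
    # mask has the same shape as Y: 0 where Y_i(t) is missing, 1 otherwise.
    return (mask * (Y - g_model(Z)) ** 2).sum()

# After inferring Z on the test sequence with this energy, the missing markers
# are filled in with the corresponding components of g_model(Z).
```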
Results 4: missing MoCap markers
[Animation: original data; reconstruction of the missing upper body; reconstruction of the missing left leg]
Results: lower NMSE than nearest neighbors; inferred smooth, realistic motion.
Results 4: missing MoCap markers
[Plot: normalized joint angles over the full test sequence: original, DFG, nearest neighbors]
Thank you

References
1. Barber, D.: Dynamic Bayesian networks with deterministic latent tables. In: Advances in Neural Information Processing Systems (2003)
2. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5 (1994)
3. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39 (1977)
4. Ghahramani, Z., Roweis, S.: Learning nonlinear dynamical systems using an EM algorithm. In: Advances in Neural Information Processing Systems (1999)
5. Kschischang, F., Frey, B., Loeliger, H.-A.: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47 (2001)
6. Ilin, A., Valpola, H., Oja, E.: Nonlinear dynamical factor analysis for state change detection. IEEE Transactions on Neural Networks 15(3) (2004)
7. Lang, K., Hinton, G.: The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie Mellon University (1988)
8. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (1998a)
9. LeCun, Y., Bottou, L., Orr, G., Muller, K.: Efficient backprop. In: Orr, G.B., Muller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, Springer (1998b)
10. Lendasse, A., Oja, E., Simula, O.: Time series prediction competition: the CATS benchmark. In: Proceedings of the IEEE International Joint Conference on Neural Networks (2004)
11. Levin, E.: Hidden control neural architecture modeling of nonlinear time-varying systems and its applications. IEEE Transactions on Neural Networks 4 (1993)
12. Lorenz, E.: Deterministic nonperiodic flow. Journal of the Atmospheric Sciences 20 (1963)
13. Mattera, D., Haykin, S.: Support vector machines for dynamic reconstruction of a chaotic system. In: Scholkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, MIT Press (1999)
14. Muller, K., Smola, A., Ratsch, G., Scholkopf, B., Kohlmorgen, J., Vapnik, V.: Using support vector machines for time-series prediction. In: Scholkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, MIT Press (1999)
15. Sarkka, S., Vehtari, A., Lampinen, J.: Time series prediction by Kalman smoother with cross-validated noise density. In: Proceedings of the IEEE International Joint Conference on Neural Networks (2004)
16. Takens, F.: Detecting strange attractors in turbulence. Lecture Notes in Mathematics, vol. 898 (1981)
17. Taylor, G., Hinton, G., Roweis, S.: Modeling human motion using binary latent variables. In: Advances in Neural Information Processing Systems (2006)
18. Wan, E.: Time series prediction by using a connectionist network with internal delay lines. In: Weigend, A.S., Gershenfeld, N.A. (eds.) Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley (1993)
19. Wan, E., Nelson, A.: Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing. In: Advances in Neural Information Processing Systems (1996)
20. Wang, J., Fleet, D., Hertzmann, A.: Gaussian process dynamical models. In: Advances in Neural Information Processing Systems (2006)
21. Wierstra, D., Gomez, F., Schmidhuber, J.: Modeling systems with internal state using Evolino. In: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (2005)
22. Williams, R., Zipser, D.: Gradient-based learning algorithms for recurrent networks and their computational complexity. In: Backpropagation: Theory, Architectures and Applications, Lawrence Erlbaum Associates (1995)