Dynamical Factor Graphs (DFG) for Time Series Modeling
Piotr Mirowski, Yann LeCun
Courant Institute of Mathematical Sciences, New York University
{mirowski,yann}@cs.nyu.edu
http://cs.nyu.edu/~mirowski
Motivation for DFG
• State-space model
  [Diagram: observed variables Y(t-1), Y(t), Y(t+1) generated from latent states Z(t-1), Z(t), Z(t+1) by the observation model g; latent states chained over time by the dynamical model f]
• Human MoCap
  – Few visible markers
  – Many (hidden) joint angles
• Unknown latent states
  – Unobserved, continuous latent states
  – Potentially high-dimensional
• Chaotic time series
  – Complex, deterministic dynamics
• Highly nonlinear observation/control or dynamics models (convolutional net)
• Handle long sequences in linear time
[Kschischang et al., 2001; Taylor et al., 2006; Lorenz, 1963]
Dynamical Factor Graph (DFG)
[Diagram: n-dimensional observed variables Y(t-2), ..., Y(t+1), each generated from the corresponding m-dimensional latent variable Z(t) by the observation model g plus Gaussian noise; consecutive latent variables linked by the dynamical model f (1st-order Markov) plus Gaussian noise]
[Kschischang et al., 2001]
Dynamical Factor Graph (DFG)
[Diagram: same graph, but the dynamical model f is p-th order Markov: it predicts Z(t) from a time-embedded sequence of p past latent variables, again with Gaussian noise on both the dynamical and observation factors]
[Kschischang et al., 2001]
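The two Gaussian-noise factors of this graph correspond to the following generative model (a compact restatement of the diagram above; the noise variances and identity covariances are written generically and are not stated on the slide):

```latex
% p-th order Markov dynamical factor and observation factor, both with Gaussian noise
\begin{aligned}
Z(t) &= f\!\left(W_d;\; Z(t-p), \dots, Z(t-1)\right) + \epsilon_d(t),
  & \epsilon_d(t) &\sim \mathcal{N}\!\left(0, \sigma_d^2 I_m\right)\\[2pt]
Y(t) &= g\!\left(W_o;\; Z(t)\right) + \epsilon_o(t),
  & \epsilon_o(t) &\sim \mathcal{N}\!\left(0, \sigma_o^2 I_n\right)
\end{aligned}
```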
Highly nonlinear factors: convolutional networks
• Higher-order nonlinearity than:
  – radial basis functions
  – single hidden-layer perceptrons
• No closed-form parameter optimization: use gradient-based techniques
Convolutional architecture of the factor (see the sketch below):
• Input: n-dimensional, with time embedding p=11 (an n×p window)
• Layer 1: 12 filters of size n×5; 1×3 convolution across time with a time-step of 2
• Layer 2: 12 filters of size 1×3; n×12×3 convolution across time, filters and components with a time-step of 1
• Layer 3: full connection (12×3) to an n-dimensional vector
[LeCun et al., 1998a; Mirowski et al., 2007, 2008, 2009]
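To make the architecture above concrete, here is a minimal PyTorch sketch of one plausible reading of it. The filter counts and sizes (12 filters of n×5, then 1×3, then a full connection to an n-dimensional vector) follow the slide; the tanh nonlinearity, the exact strides and the resulting feature-map sizes (computed below rather than hard-coded) are assumptions.

```python
# Sketch of the convolutional factor; sizes follow the slide, other details are assumptions.
import torch
import torch.nn as nn

class ConvFactor(nn.Module):
    def __init__(self, n: int, p: int = 11, n_filters: int = 12):
        super().__init__()
        # Layer 1: 12 filters spanning all n components and 5 time steps, time-step (stride) 2.
        self.conv1 = nn.Conv2d(1, n_filters, kernel_size=(n, 5), stride=(1, 2))
        # Layer 2: 12 filters convolving 3 consecutive time positions of the feature maps, stride 1.
        self.conv2 = nn.Conv2d(n_filters, n_filters, kernel_size=(1, 3), stride=(1, 1))
        # Layer 3: full connection to an n-dimensional output vector.
        t1 = (p - 5) // 2 + 1                      # temporal size after layer 1
        t2 = t1 - 3 + 1                            # temporal size after layer 2
        self.full = nn.Linear(n_filters * t2, n)
        self.act = nn.Tanh()

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        """window: (batch, n, p) time-embedded input -> (batch, n) prediction."""
        h = self.act(self.conv1(window.unsqueeze(1)))   # (batch, 12, 1, t1)
        h = self.act(self.conv2(h))                     # (batch, 12, 1, t2)
        return self.full(h.flatten(1))

f = ConvFactor(n=5)                                 # e.g. 5 components, p = 11
print(f(torch.randn(8, 5, 11)).shape)               # torch.Size([8, 5])
```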
Energy-based graph of a DFG
[Diagram: observations Y(t-1), Y(t) attached to latent variables Z(t-1), Z(t) through observation energies Eo(t-1), Eo(t), parameterized by Wo; consecutive latent variables attached through the dynamical energy Ed(t), parameterized by Wd]
Learning and inference: deterministic gradient-based EM
[Ghahramani & Roweis, 1999]
Inference of latent variables
Inference of the latent variables Z amounts to minimizing, with respect to Z, the total energy of the DFG given a sequence Y and the model parameters Wo, Wd, i.e. the sum over time of the dynamical and observation energies.
[Ghahramani & Roweis, 1999; Ranzato et al., 2007]
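With the Gaussian-noise factors above, the energies are quadratic: Eo(t) = ||Y(t) - g(Wo, Z(t))||² and Ed(t) = ||Z(t) - f(Wd, Z(t-p), ..., Z(t-1))||², and the total energy is their sum over t. Below is a minimal PyTorch sketch of this inference step; the model objects g_model and f_model, the relative weight alpha, the zero initialization of Z and the optimizer settings are assumptions.

```python
# Sketch of latent-variable inference: gradient descent on the total energy w.r.t. Z.
import torch

def total_energy(Y, Z, g_model, f_model, p, alpha=1.0):
    # Observation energy: sum_t ||Y(t) - g(Z(t))||^2
    e_obs = ((Y - g_model(Z)) ** 2).sum()
    # Dynamical energy: sum_t ||Z(t) - f(Z(t-p), ..., Z(t-1))||^2
    windows = torch.stack([Z[t - p:t].reshape(-1) for t in range(p, len(Z))])
    e_dyn = ((Z[p:] - f_model(windows)) ** 2).sum()
    return e_obs + alpha * e_dyn

def infer_latents(Y, g_model, f_model, p, m, n_steps=200, lr=1e-2):
    """Minimize the total energy w.r.t. the latent sequence Z, keeping Wo, Wd fixed."""
    Z = torch.zeros(len(Y), m, requires_grad=True)
    opt = torch.optim.SGD([Z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        total_energy(Y, Z, g_model, f_model, p).backward()
        opt.step()
    return Z.detach()
```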
Learning of model parameters
• Loss function to minimize: the total energy, plus e.g. L1 regularization of the model parameters
• Sparsity or smoothness penalties during state inference
Deterministic gradient-based version of Expectation-Maximization (see the sketch below):
• E-step (latent variable inference): annealed gradient descent on a minibatch of Z until convergence
• M-step (parameter learning): 1 step of stochastic gradient descent (diagonal Levenberg-Marquardt)
[LeCun et al., 1998b; Ghahramani & Roweis, 1999; Ranzato et al., 2007]
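A minimal sketch of this EM alternation, reusing total_energy and infer_latents from the previous sketch. For brevity it uses plain stochastic gradient descent without annealing or the diagonal Levenberg-Marquardt step; that simplification, the learning rate and the L1 weight are assumptions.

```python
# Sketch of gradient-based EM: alternate latent inference (E-step) and one parameter update (M-step).
def fit_dfg(minibatches, g_model, f_model, p, m, n_epochs=50, lr=1e-3, lambda_w=1e-4):
    params = list(g_model.parameters()) + list(f_model.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(n_epochs):
        for Y in minibatches:                       # Y: (T, n) observed minibatch
            # E-step: infer the latent sequence with the parameters fixed.
            Z = infer_latents(Y, g_model, f_model, p, m)
            # M-step: one stochastic gradient step on Wo, Wd with L1 regularization.
            opt.zero_grad()
            loss = total_energy(Y, Z, g_model, f_model, p)
            loss = loss + lambda_w * sum(w.abs().sum() for w in params)
            loss.backward()
            opt.step()
    return g_model, f_model
```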
Smoothness penalty on latent variables
• More latent variables than observations: m > n
• Underconstrained latent variable inference:
  → L1 regularization (enforce latent-variable sparsity across time and dimensions)
  → Smoothness penalty (reduce high-frequency noise)
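A short sketch of these two penalties as extra terms added to the energy minimized during inference; the penalty weights are assumptions (the sine-wave experiment on the next slide reports a smoothness penalty of 0.01).

```python
# Sketch of the latent-variable regularizers added to the inference objective.
def latent_penalty(Z, lambda_l1=0.01, lambda_smooth=0.01):
    sparsity = Z.abs().sum()                        # L1: sparsity across time and dimensions
    smoothness = ((Z[1:] - Z[:-1]) ** 2).sum()      # penalize high-frequency changes of Z(t)
    return lambda_l1 * sparsity + lambda_smooth * smoothness
```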
Results 1: asynchronous sine waves
Data and problem: a mixture of asynchronous sine-wave sources.
Model: hidden variables Z(t) of dimension m=5; dynamical model: 5 independent AR(25) models; smoothness penalty: 0.01.
Results: spectrum analysis of the 5 latent states inferred on the 400 training points and on the 3600 testing points shows that the sources are separated and almost perfectly reconstructed (observation SNR 64 dB, dynamical SNR 54 dB); outperforms Long Short-Term Memory on the prediction task [Wierstra et al., 2005].
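The dynamical model in this experiment is 5 independent AR(25) models, one per latent component. A minimal sketch of such a factor is below; the window layout, the zero initialization and the module interface are assumptions.

```python
# Sketch of a dynamical model made of m independent linear AR(order) filters.
import torch
import torch.nn as nn

class IndependentAR(nn.Module):
    def __init__(self, m: int = 5, order: int = 25):
        super().__init__()
        self.coeffs = nn.Parameter(torch.zeros(m, order))   # one AR filter per latent component

    def forward(self, z_window: torch.Tensor) -> torch.Tensor:
        """z_window: (batch, order, m) past latents -> (batch, m) one-step prediction."""
        return torch.einsum('btm,mt->bm', z_window, self.coeffs)
```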
Results 2: inferring the Lorenz chaotic attractor
Data: the Lorenz dynamical model (correlation dimension 2.06).
Problem: partial observation; learn the DFG on training data with latent variables of dimension m=3.
Results: the latent-state attractor inferred on test data is similar to the Lorenz attractor; the attractor reconstructed from the latent variables has correlation dimension 1.88; 1-step prediction error of -46.2 dB, lower than that of SVR (-41.6 dB).
[Lorenz, 1963; Mattera & Haykin, 1999]
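For reference, data of this kind can be generated from the Lorenz-63 system with its standard parameters (sigma=10, rho=28, beta=8/3); the step size, the simple Euler integration and the choice of which coordinate is kept as the partial observation are assumptions made for this illustration, not details given on the slide.

```python
# Sketch: generate a Lorenz-63 trajectory and keep a partial observation of it.
import numpy as np

def lorenz_series(T=5000, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x = np.empty((T, 3))
    x[0] = (1.0, 1.0, 1.0)                          # arbitrary initial condition
    for t in range(T - 1):
        X, Y, Z = x[t]
        dx = np.array([sigma * (Y - X), X * (rho - Z) - Y, X * Y - beta * Z])
        x[t + 1] = x[t] + dt * dx                   # simple Euler step
    return x

full_state = lorenz_series()                        # true 3-D state (m = 3 latent dimensions)
observed = full_state[:, :1]                        # partial observation given to the DFG
```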
Results 3: CATS time series
Data and problem: the CATS time series prediction competition, a noisy chaotic time series (5,000 points) with missing data (100 points).
Results: the predictions of the 5 segments of missing data beat the CATS benchmark.
[Lendasse et al., 2004]
Results4: missing MoCap markers Data
• Observations Y: 49-dimensional Motion Capture markers Problem
• Model missing data (e.g. occlusions…) – Test sequence: 260 frames – 2 subsequences of 65 frames with missing data: • Left leg • Entire upper body Approach
• Infer latent variables (E-step) on test sequence, (without gradient from missing Yi(t)), generate Y from Z [Taylor et al, 2006]
13
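A minimal sketch of the masking idea described above: missing components Yi(t) are simply excluded from the observation energy, so they contribute no gradient during inference; the mask-based implementation is an assumption.

```python
# Sketch: observation energy with missing markers excluded from the gradient.
def masked_observation_energy(Y, Z, g_model, mask):
    # mask has the same shape as Y: 0 where Y_i(t) is missing, 1 otherwise.
    return (mask * (Y - g_model(Z)) ** 2).sum()

# After inferring Z on the test sequence with this energy, the missing markers
# are filled in with the corresponding components of g_model(Z).
```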
Results 4: missing MoCap markers
[Animation: original data; reconstruction of the missing upper body; reconstruction of the missing left leg]
Results: lower NMSE than nearest neighbors; inferred smooth, realistic motion.
Results 4: missing MoCap markers
[Plot: normalized joint angles over the full test sequence: original, DFG, nearest neighbors]
Thank you

References
1. Barber, D.: Dynamic Bayesian networks with deterministic latent tables. In: Advances in Neural Information Processing Systems (2003)
2. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5 (1994)
3. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39 (1977)
4. Ghahramani, Z., Roweis, S.: Learning nonlinear dynamical systems using an EM algorithm. In: Advances in Neural Information Processing Systems (1999)
5. Kschischang, F., Frey, B., Loeliger, H.-A.: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47 (2001)
6. Ilin, A., Valpola, H., Oja, E.: Nonlinear dynamical factor analysis for state change detection. IEEE Transactions on Neural Networks 15(3) (2004)
7. Lang, K., Hinton, G.: The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie Mellon University (1988)
8. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (1998a)
9. LeCun, Y., Bottou, L., Orr, G., Muller, K.: Efficient backprop. In: Orr, G.B., Muller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, Springer (1998b)
10. Lendasse, A., Oja, E., Simula, O.: Time series prediction competition: the CATS benchmark. In: Proceedings of the IEEE International Joint Conference on Neural Networks (2004)
11. Levin, E.: Hidden control neural architecture modeling of nonlinear time-varying systems and its applications. IEEE Transactions on Neural Networks 4 (1993)
12. Lorenz, E.: Deterministic nonperiodic flow. Journal of the Atmospheric Sciences 20 (1963)
13. Mattera, D., Haykin, S.: Support vector machines for dynamic reconstruction of a chaotic system. In: Scholkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, MIT Press (1999)
14. Muller, K., Smola, A., Ratsch, G., Scholkopf, B., Kohlmorgen, J., Vapnik, V.: Using support vector machines for time-series prediction. In: Scholkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods: Support Vector Learning, MIT Press (1999)
15. Sarkka, S., Vehtari, A., Lampinen, J.: Time series prediction by Kalman smoother with cross-validated noise density. In: Proceedings of the IEEE International Joint Conference on Neural Networks (2004)
16. Takens, F.: Detecting strange attractors in turbulence. Lecture Notes in Mathematics, vol. 898 (1981)
17. Taylor, G., Hinton, G., Roweis, S.: Modeling human motion using binary latent variables. In: Advances in Neural Information Processing Systems (2006)
18. Wan, E.: Time series prediction by using a connectionist network with internal delay lines. In: Weigend, A.S., Gershenfeld, N.A. (eds.) Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley (1993)
19. Wan, E., Nelson, A.: Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing. In: Advances in Neural Information Processing Systems (1996)
20. Wang, J., Fleet, D., Hertzmann, A.: Gaussian process dynamical models. In: Advances in Neural Information Processing Systems (2006)
21. Wierstra, D., Gomez, F., Schmidhuber, J.: Modeling systems with internal state using Evolino. In: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (2005)
22. Williams, R., Zipser, D.: Gradient-based learning algorithms for recurrent networks and their computational complexity. In: Backpropagation: Theory, Architectures and Applications, Lawrence Erlbaum Associates (1995)