Dynamic Systems and Control Final Project

Due Date: September 1st, 2015

1 LQG

1. In a fully observable, deterministic, discrete-time linear dynamical system with dynamics $x_{t+1} = A x_t + B u_t$, suppose that the state cost is time-variant with linear and constant terms
$$c_t(x_t, u_t) = \tfrac{1}{2} x_t^\top Q_t x_t + \tfrac{1}{2} u_t^\top R u_t + q_t^\top x_t + c_t(0, 0).$$

(a) Write the Bellman equation for the cost-to-go of this system.

(b) Assume that the cost-to-go at time t + 1 has the form
$$V_{t+1}(x_{t+1}) = \tfrac{1}{2} x_{t+1}^\top S_{t+1} x_{t+1} + s_{t+1}^\top x_{t+1} + V_{t+1}(0).$$
Solve the Bellman equation and write the optimal control signal $u_t$ in the form $u_t(x_t) = L_t x_t + u_t(0)$.

(c) Substitute the optimal control into the Bellman equation to find an expression for $V_t$ of the same form as $V_{t+1}$. Write the recursive expressions for $S_t$, $s_t$ and $V_t(0)$.

(d) Implement a backward algorithm to compute the optimal value function, given by its parameters $S_t$, $s_t$ and $V_t(0)$. Assume that the cost-to-go at time T is 0. The function lqr will receive as a first argument an object with the following members (in Python: a dict with the following keys):
n - integer, dimension of the world state;
m - integer, dimension of the control signal;
A - n × n real matrix (in Python: NumPy array), uncontrolled state dynamics;
B - n × m real matrix, state control;
Q - T × n × n array of real numbers, such that Q[t] (in Matlab: Q(t+1, :, :)) is the state cost Hessian at time 0 ≤ t < T;
R - m × m real matrix, control cost;
q - T × n array of real numbers, such that q[t] (in Matlab: q(t+1, :)) is the linear state cost coefficient at time 0 ≤ t < T;
c0 - T-sized array of real numbers, such that c0[t] (in Matlab: c0(t+1)) is the constant state cost coefficient at time 0 ≤ t < T.
The function will receive as a second argument the horizon T, an integer. The function will return V, an object with the members (in Python: a dict with the keys) S, s and V0, which are respectively T × n × n, T × n and T-sized arrays of real numbers, such that S[t], s[t] and V0[t] (in Matlab: S(t+1, :, :) etc.) are $S_t$, $s_t$ and $V_t(0)$, for 0 ≤ t < T. Submit your code.

(e) Implement a forward algorithm to compute the optimal mean trajectory. The function trajectory will receive as a first argument an object with the same members as the first argument of lqr; as a second argument an n-dimensional vector x0, the initial state; as a third argument an object V with the same members as the output of lqr; and as a fourth argument the horizon T, an integer. The function will return x, a T × n array of real numbers, such that x[t] (in Matlab: x(t+1, :)) is the optimal mean state at time 0 ≤ t < T. The function will also return u, a T × m array of real numbers, such that u[t] (in Matlab: u(t+1, :)) is the optimal control at time 0 ≤ t < T. Submit your code. (A sketch of the overall structure of (d) and (e) appears after this problem.)
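Purely for orientation, here is a minimal Python sketch of the shape the backward pass of (d) can take. It implements only the standard quadratic Riccati recursion (i.e. with the linear cost terms $q_t$ ignored); the corresponding updates for $s_t$ and $V_t(0)$ are exactly what part (c) asks you to derive, so they are left zeroed here. The name lqr_sketch is ours, not the required interface.

```python
import numpy as np

def lqr_sketch(model, T):
    """Backward pass for the purely quadratic case (all q_t = 0).
    V_T = 0, and V_t(x) = (1/2) x^T S_t x + s_t^T x + V_t(0)."""
    n = model['n']
    A, B, R = model['A'], model['B'], model['R']
    S = np.zeros((T, n, n))
    s = np.zeros((T, n))        # placeholder: derive the s_t update in part (c)
    V0 = np.zeros(T)            # placeholder: derive the V_t(0) update in part (c)
    S_next = np.zeros((n, n))   # cost-to-go at time T is 0
    for t in range(T - 1, -1, -1):
        # optimal gain: L_t = -(R + B^T S_{t+1} B)^{-1} B^T S_{t+1} A
        L = -np.linalg.solve(R + B.T @ S_next @ B, B.T @ S_next @ A)
        # Riccati update: S_t = Q_t + A^T S_{t+1} (A + B L_t)
        S[t] = model['Q'][t] + A.T @ S_next @ (A + B @ L)
        S_next = S[t]
    return {'S': S, 's': s, 'V0': V0}
```

The forward pass of (e) then simply rolls out $x_{t+1} = A x_t + B u_t(x_t)$, recovering $L_t$ and $u_t(0)$ at each step from the stored parameters.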

2. (a) Consider an uncontrolled linear dynamical system
$$x_{t+1} = A x_t + \xi_t, \qquad \xi_t \sim \mathcal{N}(0, \Sigma^\xi_t),$$
$$y_t = C x_t + \psi_t, \qquad \psi_t \sim \mathcal{N}(0, \Sigma^\psi_t),$$
with independent Gaussian noise variables and state dimension n. Write the expression for the observability matrix $\mathcal{O}$ such that
$$\mathbb{E}\left[\begin{pmatrix} y_t \\ y_{t+1} \\ \vdots \\ y_{t+n-1} \end{pmatrix} \,\middle|\, x_t \right] = \mathcal{O} x_t.$$

(b) Suppose that $\Sigma^\xi_\tau = \Sigma^\psi_\tau = 0$ for $t - n \le \tau \le t - 1$, and that $\operatorname{rank} A = n$. Write a necessary and sufficient condition on $\mathcal{O}$, so that $x_t$ can be completely predicted from $y_{t-n:t-1}$. A system satisfying this condition is called observable. Hint: write a linear matrix equation involving $x_t$ and $y_{t-n:t-1}$. What is the condition for this equation to have a unique solution?

3. Implement the Kalman filter, a forward algorithm to compute the belief covariance $\Sigma_t$ in a linear dynamical system
$$x_{t+1} = A x_t + \xi_t, \qquad \xi_t \sim \mathcal{N}(0, \Sigma^\xi),$$
$$y_t = C x_t + \psi_t, \qquad \psi_t \sim \mathcal{N}(0, \Sigma^\psi),$$
with independent Gaussian noise variables. Assume that $x_0$ is known, i.e. $\Sigma_0 = \mathbb{C}(x_0 | y_0) = 0$. The function lqe will receive as a first argument an object with the following members (in Python: a dict with the following keys):
n - integer, dimension of the world state;
k - integer, dimension of the observation;
A - n × n real matrix (in Python: NumPy array), uncontrolled state dynamics;
C - k × n real matrix, state observation;
covxi - n × n real matrix, process noise covariance;
covpsi - k × k real matrix, observation noise covariance.
The function will receive as a second argument the horizon T, an integer. The function will return Sigma, a T × n × n array of real numbers, such that Sigma[t] (in Matlab: Sigma(t+1, :, :)) is the posterior covariance $\Sigma_t = \mathbb{C}(x_t | y_{0:t})$. Submit your code. (A sketch of the covariance recursion appears after this problem.)
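For orientation, a minimal sketch of how the covariance recursion can be organized, assuming the convention $\Sigma_t = \mathbb{C}(x_t \mid y_{0:t})$ stated above; note that the recursion never touches the actual measurements, so it runs entirely offline. Verify the predict/update order against your own derivation.

```python
import numpy as np

def lqe_sketch(model, T):
    """Covariance-only Kalman filter: Sigma[t] = Cov(x_t | y_{0:t})."""
    n = model['n']
    A, C = model['A'], model['C']
    cov_xi, cov_psi = model['covxi'], model['covpsi']
    Sigma = np.zeros((T, n, n))          # Sigma[0] = 0: x_0 is known exactly
    for t in range(T - 1):
        # predict: prior covariance of x_{t+1}, before seeing y_{t+1}
        P = A @ Sigma[t] @ A.T + cov_xi
        # update: condition on y_{t+1} via the Kalman gain
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + cov_psi)
        Sigma[t + 1] = P - K @ C @ P
    return Sigma
```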

4. The position of a drone at time t is given by a vector $z_t$ with 2 dimensions, one horizontal and one vertical (we ignore the other horizontal dimension to better visualize the trajectory). Its dynamics in discrete time are given by $z_{t+1} = z_t + v_t$, for a velocity $v_t$. The dynamics of the velocity are given by $v_{t+1} = v_t + F_t$, where $F_t$ is the sum of 3 forces operating on the drone at time t:
$$F_t = F^g_t + F^d_t + u_t.$$
The forces are:

Gravity - the constant vector
$$F^g_t = \begin{pmatrix} 0 \\ -10 \end{pmatrix}.$$

Drag - linear in the velocity of the drone relative to the air. If the wind has velocity $w_t$ at time t, then $F^d_t = -0.04 (v_t - w_t)$. (This is a good approximation of the real physics at slow speeds, except that in reality the constant is 1000 times smaller for a drone weighing 1 kg.) The dynamics of the wind velocity are given by
$$w_{t+1} = 0.9 w_t + \omega_t, \qquad \omega_t \sim \mathcal{N}(0, \Sigma^\omega),$$
with independent noise variables with covariance
$$\Sigma^\omega = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.02 \end{pmatrix}.$$

Lift - the control force $u_t$. The energy cost rate of exerting force $u_t$ is quadratic in its size: $c^e(u_t) = \tfrac{1}{2} \|u_t\|^2$. The cost rate also has a component which is quadratic in the distance from the position $z_t$ to a target position $\bar{z}_t$: $c^s_t(z_t) = \tfrac{1}{2} \|z_t - \bar{z}_t\|^2$.

The drone has accelerometers that measure the force $F_t$. They automatically cancel out gravity and the control, giving the measurement
$$\hat{F}_t = F^d_t + \phi_t, \qquad \phi_t \sim \mathcal{N}(0, \Sigma^\phi),$$
with independent noise variables with covariance
$$\Sigma^\phi = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}.$$

(a) What is the state $x_t$ of the system? Of the many ways to define it, choose one which includes $z_t$, as well as any other variables needed to compute the next state. Try to choose a natural parameterization of the problem, and to avoid redundancy (i.e. A should be full-rank).

(b) Write the dynamics of the state you defined. What are the matrices A, B, $\Sigma^\xi$ and R? What are the matrix $Q_t$, the vector $q_t$ and the constant $c_t(0, 0)$ of question 1, in terms of $\bar{z}_t$?

(c) Write the measurement $y_t = \hat{F}_t$ as a linear function of the state $x_t$, plus Gaussian noise. What are the matrices C and $\Sigma^\psi$?

(d) What is the column space of the controllability matrix, i.e. the controllable subspace? What is the row space of the observability matrix, i.e. the observable subspace? Is the system controllable? Is it observable?

(e) Plan a trajectory that follows a vertical circle of radius 1, with period T, i.e. with
$$\bar{z}_t = \begin{pmatrix} \cos(2\pi t / T) \\ \sin(2\pi t / T) \end{pmatrix}.$$
At time t = 0 the drone is at position $z_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, at rest ($v_0 = 0$), and there is no wind ($w_0 = 0$). For each of T = 4, 8, 16, 32 and 64, do the following.
i. Show in the same plot the target trajectory and the optimal mean trajectory for 3 periods.
ii. Scale the control cost, relative to the state cost, by $2^\alpha$, with $\alpha$ each of −3, −2, −1, 0, 1, 2, 3. Plot the total state cost and the total control cost as functions of $\alpha$.
For each T between 4 and 100, in strides of 4, compute the average (per time step) extra cost due to the partial observability of the process. Plot this average extra cost as a function of the period T. (A sketch of the target-trajectory setup appears below.)
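As a mechanical starting point for (e), a sketch of generating the circular target and sweeping the control-cost scale. Everything beyond circle_target is hypothetical glue: model stands for the drone LQG model you assemble in parts (a)-(c), R_base for the unscaled control cost, and lqr/trajectory are your solutions to question 1.

```python
import numpy as np

def circle_target(T_period, steps):
    """Target positions z-bar_t on a vertical unit circle with period T_period."""
    t = np.arange(steps)
    return np.stack([np.cos(2 * np.pi * t / T_period),
                     np.sin(2 * np.pi * t / T_period)], axis=1)

# Hypothetical usage for item ii., assuming the names described above:
#   steps = 3 * T_period                       # three periods, as in item i.
#   for alpha in range(-3, 4):
#       model['R'] = 2.0 ** alpha * R_base     # scale control cost by 2^alpha
#       V = lqr(model, steps)
#       x, u = trajectory(model, x0, V, steps)
```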

2 TD-learning

1. Implement a one-step simulation of an MDP. In the following questions, this will be the only allowed way to access the MDP. The function step will receive as a first argument an object with the following members (in Python: a dict with the following keys):
s0 - integer, the initial state of an episode;
p - sizeS × sizeA × sizeS array of real numbers (in Python: NumPy array), such that p[s, a, s_] = p(s_ | s, a) is the probability of reaching state s_ after taking action a in state s;
c - sizeS × sizeA array of real numbers, such that c[s, a] is the cost of taking action a in state s.
The function will receive as a second argument an integer s, the current state, and as a third argument an integer a, the action. The function will return an integer s_, a realization of the state following state s and action a. The action a = −1 is a special reset action, after which s_ = s0. The function will also return a real number c_, the cost of taking action a in state s; this is meaningless if a = −1. Submit your code. (A sketch appears below.)
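A minimal sketch of the simulator; the cost returned after a reset is arbitrary here, since the spec leaves it meaningless:

```python
import numpy as np

def step_sketch(mdp, s, a):
    """Sample the next state s_ ~ p(.|s, a) and return it with the step cost.
    a = -1 is the special reset action, returning the episode's initial state."""
    if a == -1:
        return mdp['s0'], 0.0            # cost is meaningless after a reset
    s_ = np.random.choice(mdp['p'].shape[2], p=mdp['p'][s, a])
    return int(s_), float(mdp['c'][s, a])
```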

2. Read about SARSA, n-step SARSA and SARSA(λ) (chapters 6 and 7 in Sutton & Barto).

3. Implement SARSA, on-policy TD learning with online updating. Whenever you choose an action, do so according to a softmax policy: in state s, choose a with probability proportional to $e^{\beta Q(s,a)}$. Let β (the "inverse temperature") grow as 0.1m, with m the number of episodes completed so far. The function sarsa will receive as a first argument an object as in the previous question. The function will receive as second and third arguments integers sizeS and sizeA, the sizes of the sets of states and of actions. The absorbing state will always be s = 0 (in Matlab: s = 1). The function will receive as a fourth argument a real number r, the update look-ahead. If r > 1, it must be an integer, and the function performs n-step SARSA with n = r. If 0 ≤ r ≤ 1, the function performs SARSA(λ) with λ = r. Hint: in the n-step case, don't forget to update the last n states when the episode ends. The function will receive as a fifth argument an integer M, the number of episodes to run. The function will return Q, an M × sizeS × sizeA array of real numbers, such that Q[m, s, a] is the action-value Q(s, a) at the end of episode m. Submit your code. (A sketch of the softmax action selection appears at the end of this section.)

4. Maze revisited: implement a model of the following maze, where each tile is a state and the tile marked sf is an absorbing state.
[Figure: maze grid of tiles, with initial tile s0 and absorbing goal tile sf.]
The actions Right, Up, Left, Down take you to the appropriate next state, with success probability 0.8 if there is no wall and 0 if there is a wall. The action fails, remaining in the same state, with probability 0.2 if there is no wall and 1 if there is a wall. Each step costs 1 if it succeeds and 6 if it fails. There is no discount. (A sketch of assembling p and c follows.)
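The maze layout itself lives in the figure, so the following is only a sketch of how the arrays p and c might be assembled for a generic grid; the grid shape and the set of blocked (state, action) pairs are placeholders to be read off the figure. Also note one assumption: the spec of question 1 makes c[s, a] a single number per state-action pair, so the outcome-dependent step cost is encoded here as its expectation; check whether that is the intended reading.

```python
import numpy as np

def maze_model(rows, cols, walls, s0, sf):
    """Grid-maze MDP. walls: set of blocked (state, action) pairs (placeholder).
    Actions: 0 = Right, 1 = Up, 2 = Left, 3 = Down; state s = row * cols + col."""
    sizeS, sizeA = rows * cols, 4
    moves = [(0, 1), (-1, 0), (0, -1), (1, 0)]       # (drow, dcol) per action
    p = np.zeros((sizeS, sizeA, sizeS))
    c = np.zeros((sizeS, sizeA))
    for s in range(sizeS):
        row, col = divmod(s, cols)
        for a, (dr, dc) in enumerate(moves):
            nr, nc = row + dr, col + dc
            blocked = (s, a) in walls or not (0 <= nr < rows and 0 <= nc < cols)
            if blocked:
                p[s, a, s] = 1.0                     # certain failure
                c[s, a] = 6.0
            else:
                p[s, a, nr * cols + nc] = 0.8        # success
                p[s, a, s] = 0.2                     # failure: stay in place
                c[s, a] = 0.8 * 1.0 + 0.2 * 6.0      # expected step cost
    p[sf, :, :] = 0.0                                # absorbing state: stay put,
    p[sf, :, sf] = 1.0                               # at no further cost
    c[sf, :] = 0.0
    return {'s0': s0, 'p': p, 'c': c}
```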


Run n-step SARSA for 50 episodes with n = 1, 2, 4 and 8. After each episode, find the optimal policy for the learned Q, and compute its value (you can reuse your solution or the skeleton of question 2 in exercise 4). Plot this value against the episode count m. Show the results for the 4 values of n on the same plot.

Run SARSA(λ) for 50 episodes with λ from 0 to 1 in strides of 0.01. Compute the value of the optimal policy for the final learned Q. Plot this value against λ. Submit your code.
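For the softmax policy of question 3, a minimal sketch of the action selection; the β = 0.1m schedule is from the problem statement, while subtracting the maximum before exponentiating is a numerical-stability choice of ours, not part of the spec:

```python
import numpy as np

def softmax_action(Q, s, beta):
    """Choose a with probability proportional to exp(beta * Q[s, a])."""
    z = beta * Q[s]
    z = z - z.max()        # stability shift; leaves the distribution unchanged
    probs = np.exp(z)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```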
