Max-plus Stochastic Control

Wendell H. Fleming
Brown University, Providence, RI 02912 USA

Abstract. Max-plus stochastic processes are counterparts of Markov diffusion processes governed by Ito-sense stochastic differential equations. In this framework, expectations are linear operations with respect to max-plus arithmetic. Max-plus stochastic control problems are considered, in which a minimizing control enters the state dynamics and running cost. The minimum max-plus expected cost is equal to the upper Elliott-Kalton value of an associated differential game.

Key words. Max-plus stochastic control, differential games.

AMS Subject Classification. 35F20, 60H10, 93E20.

1 Introduction.

The Maslov idempotent calculus provides a framework in which a variety of asymptotic problems, including large deviations for stochastic processes, can be considered [15]. The asymptotic limit is typically described through a deterministic optimization problem. However, the limit still retains a "stochastic" interpretation, if probabilities are assigned which are additive with respect to "max-plus" addition and expectations are defined so as to be linear with respect to max-plus addition and scalar multiplication. Instead of "idempotent probability" we will use the term "max-plus probability". There is an extensive literature on max-plus probability and max-plus stochastic processes. See [1][2][14][15][17] and references cited there. The max-plus framework is also important for certain problems in discrete mathematics and in computer science applications [2][14]. In [7] some elements of a max-plus stochastic calculus were considered, including max-plus martingales and stochastic differential equations. In the present paper, a brief introduction to max-plus stochastic control is given. As formulated in Section 2, the state x_s which is being controlled satisfies a differential equation (2). The state dynamics depend on a minimizing control u_s and also on a disturbance v_s. The max-plus probability law of disturbance trajectories is given by a likelihood function Q(v.) of the form (3). The objective is to choose a minimizing control, depending on past information, to minimize the max-plus expectation of a cost function Z of the form (5). Associated with a max-plus stochastic control problem is a two-controller, zero-sum differential game in which the disturbance v_s is the maximizing control. The upper value function V satisfies the upper Isaacs PDE (14) in the viscosity sense. At least formally, an optimal feedback control u^* is obtained via (17). Under additional (rather restrictive) assumptions this is made precise in Section 3.


In Section 4 strictly progressive strategies for the minimizing controller are defined. These are max-plus analogues of progressively measurable controls in the theory of controlled Markov diffusions. Theorem 4.1 states that V equals the max-plus optimal cost function W. The lower game value \underline{V} would be obtained if the minimizing controller were allowed to use strategies which are progressive rather than strictly progressive. We call the difference V − \underline{V} an instantaneous information gap. In Section 5 we recall a result according to which a max-plus stochastic control problem is the small-noise-intensity limit of risk sensitive stochastic control problems. This result is a control counterpart of results in the Freidlin-Wentzell theory of large deviations for small random perturbations of dynamical systems.

2 Max-plus control problem.

Let IR denote the real numbers and IR^- = IR ∪ {−∞}. Let Ω be some "sample space", and let Q be an IR^- -valued function on Ω with sup_{ω ∈ Ω} Q(ω) = 0. We call Q(ω) the likelihood of ω and Q the max-plus probability density function. The max-plus expectation of an IR^- -valued Z on Ω is

    E^+(Z) = sup_{ω ∈ Ω} [Z(ω) + Q(ω)],    (1)

provided the right side is not +∞. E^+ is a linear operation with respect to max-plus addition and scalar multiplication. The max-plus probability P^+(A) of a set A ⊂ Ω is obtained when Z(ω) = 0 for ω ∈ A and Z(ω) = −∞ for ω ∈ Ω \ A.

We consider max-plus stochastic control problems which are formulated as follows. Let t denote an initial time and T a final time, with 0 ≤ t < T. Let x_s denote the state and u_s the control at time s, with x_s ∈ IR^n, u_s ∈ U, t ≤ s ≤ T. Moreover, let v_s denote a "disturbance" at time s, with v_s ∈ V. The state dynamics are the differential equation

    dx_s/ds = F(x_s, u_s, v_s),   t ≤ s ≤ T,    (2)

with initial data x_t = x. The likelihood of disturbance v. is

    Q(v.) = − ∫_t^T q(v_s) ds.    (3)
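To make the definitions concrete, here is a minimal Python sketch (not from the paper) of the max-plus expectation (1) and the max-plus probability P^+ on a finite sample space. The sample points, likelihoods and payoffs are invented purely for illustration.

```python
# Illustrative sketch of (1) on a finite sample space; all data are made up.
NEG_INF = float("-inf")

# Max-plus density Q with sup_omega Q(omega) = 0.
Q = {"calm": 0.0, "gust": -1.5, "storm": -4.0}

def maxplus_expectation(Z, Q):
    """E^+(Z) = sup_omega [Z(omega) + Q(omega)]."""
    return max(Z[w] + Q[w] for w in Q)

def maxplus_probability(A, Q):
    """P^+(A): take Z = 0 on A and Z = -infinity off A."""
    Z = {w: (0.0 if w in A else NEG_INF) for w in Q}
    return maxplus_expectation(Z, Q)

Z = {"calm": 1.0, "gust": 2.0, "storm": 5.0}
print(maxplus_expectation(Z, Q))                  # 1.0 = max(1.0, 0.5, 1.0)
print(maxplus_probability({"gust", "storm"}, Q))  # -1.5
```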

We make the following assumptions:

(A1) U ⊂ IR^k, V ⊂ IR^m and U, V are compact.
(A2) F is of class C^1 on IR^n × U × V, and F together with its first order partial derivatives is bounded.
(A3) q is continuous on V and min_{v ∈ V} q(v) = 0.

Note that in this model, "max-plus randomness" enters via the disturbances v. The sample space is Ω = L^∞([t, T]; V) with likelihood of v. given by (3). In [7] the special case F(x, u, v) = f(x, u) + σ(x, u)v is considered. In that case, (2) can be written equivalently as

    dx_s = f(x_s, u_s) ds + σ(x_s, u_s) dw_s,   w_s = ∫_t^s v_r dr.    (4)

In the terminology of [17], w_s is a max-plus independent increment process. In Section 5 we will take V = IR^m, Ω = L^2([t, T]; IR^m) and −Q(v.) equal to one half the squared L^2 norm. In that case, (4) is the max-plus analogue of an Ito-sense stochastic differential equation in which w_s is replaced by a Brownian motion B_s.

Let l(x, u), g(x) denote running cost and terminal cost functions, such that:

(A4) l, g are of class C^1 and bounded together with their first order partial derivatives.

Let

    Z = ∫_t^T l(x_s, u_s) ds + g(x_T).    (5)

The control problem is to find a control u_s which minimizes the max-plus expectation E^+_{tx}(Z). (The subscripts t, x refer to the initial data x_t = x for (2).) The control u_s is chosen using "complete information" about the past up to time s, in a sense which must be made precise. In a way similar to the theory of controlled Markov diffusions [10], two kinds of control strategies are considered. In Section 3 feedback control policies based on current states are considered. Then in Section 4 strictly progressive control strategies are considered, which correspond to progressively measurable control processes in the theory of controlled Markov diffusions. By (1), (5)

    E^+_{tx} Z = sup_{v.} P(t, x; u., v.),    (6)

    P(t, x; u., v.) = ∫_t^T [l(x_s, u_s) − q(v_s)] ds + g(x_T).    (7)

This suggests that the max-plus stochastic control problem is closely related to a two-controller, zero-sum differential game. In this differential game the minimizing controller chooses u_s. The maximizing controller chooses a disturbance v_s. The game dynamics are (2) and the game payoff is (7). Under assumptions (A1)-(A4) this game has upper and lower Elliott-Kalton sense values, V(t, x) and \underline{V}(t, x). It will be seen that the infimum of E^+_{tx}(Z) among strictly progressive strategies for the minimizing controller equals the upper game value V(t, x). See Theorem 4.1.
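As a rough illustration of (2), (6) and (7), the following Python sketch discretizes the dynamics with an Euler scheme and approximates the max-plus expected cost of a fixed open-loop control by a brute-force sup over piecewise-constant disturbances. The functions F, l, q, g and the grids are toy stand-ins chosen only to be compatible with (A1)-(A4); they are not data from the paper.

```python
import itertools

def F(x, u, v):            # toy dynamics in the spirit of (A2)
    return -x + u + v

def l(x, u):               # toy running cost (A4)
    return x * x + 0.5 * u * u

def q(v):                  # toy disturbance density with min_v q(v) = 0 (A3)
    return 0.5 * v * v

def g(x):                  # toy terminal cost (A4)
    return abs(x)

def payoff(x, u_path, v_path, dt):
    """Euler approximation of P(t, x; u., v.) in (7) along the path of (2)."""
    total = 0.0
    for u, v in zip(u_path, v_path):
        total += (l(x, u) - q(v)) * dt
        x += F(x, u, v) * dt
    return total + g(x)

def maxplus_expected_cost(x, u_path, V_grid, dt):
    """E^+_{tx} Z of (6): sup over (piecewise-constant) disturbance paths."""
    N = len(u_path)
    return max(payoff(x, u_path, v_path, dt)
               for v_path in itertools.product(V_grid, repeat=N))

dt, N = 0.25, 4
u_path = [0.0] * N               # a fixed open-loop control
V_grid = [-1.0, 0.0, 1.0]        # coarse grid on a compact disturbance set V
print(maxplus_expected_cost(1.0, u_path, V_grid, dt))
```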

3 Feedback control policies.

In this section we consider controls which are feedback functions of time and state:

    u_s = u(s, x_s).    (8)

To avoid various technical difficulties, we consider only control policies which are Lipschitz continuous functions from [0, T] × IR^n into U. Let

    J(t, x; u) = E^+_{tx} Z,    (9)

and substitute (8) in the state dynamics (2) and in (5). By (6), J(t, x; u) is the optimal cost function for the following control problem: choose a disturbance control v. to maximize P(t, x; u, v.). Let

    H^u(x, p) = max_{v ∈ V} [F(x, u, v) · p + l(x, u) − q(v)].    (10)

Then J(t, x; u) is the unique bounded, uniformly continuous viscosity solution to the dynamic programming PDE

    0 = ∂J/∂t + H^u(x, D_x J),   0 ≤ t < T,    (11)

with the terminal data

    J(T, x; u) = g(x),   x ∈ IR^n.    (12)

See [10, Chapt. 2][3, Chap. 3]. In (11), D_x J denotes the gradient vector. Let

    \bar{H}(x, p) = min_{u ∈ U} H^u(x, p).    (13)

The upper value function V(t, x) is the unique bounded, uniformly continuous viscosity solution to

    0 = ∂V/∂t + \bar{H}(x, D_x V),    (14)

    V(T, x) = g(x).    (15)


See [6][3, Chap. 8]. Equation (14) is called the upper Isaacs PDE. Since \bar{H} ≤ H^u, a comparison principle for viscosity sub- and supersolutions to (14)-(15) [10, p. 86][3, p. 152] gives that

    V(t, x) ≤ J(t, x; u).    (16)

If there exists u^* such that equality holds in (16), then u^* is an optimal feedback control policy. The following is a sufficient condition that an optimal continuous u^* exists. Suppose that V is a classical solution to (14)-(15), such that V and D_x V are continuous and bounded. Moreover, suppose that u^* is Lipschitz continuous, where

    u^*(t, x) ∈ arg min_{u ∈ U} H^u(x, D_x V(t, x)).    (17)

When u = u^*, V is a classical solution to (11)-(12), hence also a viscosity solution. Since J(t, x; u^*) is the unique bounded, uniformly continuous viscosity solution to (11)-(12),

    V(t, x) = J(t, x; u^*).    (18)
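As a small numerical illustration of (10), (13) and the selection (17), the Python sketch below evaluates H^u and \bar{H} by brute force over finite grids and picks a minimizing u for a hypothetical smooth value function. The model functions, the grids and the quadratic stand-in for V are assumptions made only for this example, not quantities from the paper.

```python
import numpy as np

U_grid = np.linspace(-1.0, 1.0, 41)   # grid on a compact control set U
V_grid = np.linspace(-1.0, 1.0, 41)   # grid on a compact disturbance set V

def F(x, u, v): return -x + u + v     # toy dynamics (A2)
def l(x, u):    return x * x + 0.5 * u * u
def q(v):       return 0.5 * v * v

def H_u(x, p, u):
    """H^u(x, p) = max_v [F(x,u,v).p + l(x,u) - q(v)], as in (10)."""
    return max(F(x, u, v) * p + l(x, u) - q(v) for v in V_grid)

def H_bar(x, p):
    """Hbar(x, p) = min_u H^u(x, p), as in (13)."""
    return min(H_u(x, p, u) for u in U_grid)

def V_value(t, x):                    # placeholder for a classical solution of (14)-(15)
    return 0.5 * x * x * (1.0 + t)

def feedback(t, x, h=1e-5):
    """u*(t, x) from (17), with D_x V replaced by a central difference."""
    p = (V_value(t, x + h) - V_value(t, x - h)) / (2.0 * h)
    return min(U_grid, key=lambda u: H_u(x, p, u))

print(H_bar(0.5, 1.0), feedback(0.0, 0.8))
```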

These assumptions on V and u^* are quite restrictive. In many examples from control theory and differential games, value functions are not of class C^1 and optimal feedback control policies must be discontinuous.

4 Strictly progressive strategies.

Let us recall the Elliott-Kalton definition of the lower and upper game values \underline{V}(t, x), V(t, x). See [5]. A strategy α for the minimizing controller maps L^∞([t, T]; V) into L^∞([t, T]; U). Moreover, v_r = \tilde{v}_r for almost all r ∈ [t, s] implies α(v.)_r = α(\tilde{v}.)_r for almost all r ∈ [t, s]. We call such a strategy progressive. By definition

    \underline{V}(t, x) = inf_α sup_{v.} P(t, x; α(v.), v.),    (19)

where the inf is taken over progressive strategies α. By (6), this is the same as the infimum over progressive α of the max-plus expectation E^+_{tx} Z. Similarly, a strategy β for the maximizing controller is a progressive map from L^∞([t, T]; U) into L^∞([t, T]; V), and

    V(t, x) = sup_β inf_{u.} P(t, x; u., β(u.)).    (20)

A strategy α for the minimizing controller is called strictly progressive if: for every strategy β for the maximizing controller, the equations

    u. = α(v.),   v. = β(u.)    (21)


have a solution (\hat{u}., \hat{v}.). This implies that \hat{u}. is a fixed point of the composite map α ◦ β. At an intuitive level, a strictly progressive strategy α allows the minimizer to choose u_s knowing v_r for t ≤ r < s, while a progressive strategy allows knowledge of v_s also. Examples 4.1-4.3 provide some useful classes of strictly progressive strategies.

Example 4.1. Let t = t_0 < t_1 < ... < t_N = T be a partition of the interval [t, T]. Suppose that v_r = \tilde{v}_r for almost all r ∈ [t, t_j] implies α(v.)_r = α(\tilde{v}.)_r for almost all r ∈ [t, t_{j+1}], j = 0, 1, ..., N - 1. Then α is strictly progressive. For j = 0, u_r = \hat{u}_r is chosen open loop on the initial interval [t, t_1) and \hat{v}_r = β(\hat{u}.)_r on [t, t_1). Then \hat{u}_r and \hat{v}_r are chosen on successive intervals [t_j, t_{j+1}) such that (21) holds.

Example 4.2. Suppose that there is a "time delay" δ > 0 such that α(v.)_s depends on v_r only for r ≤ s - δ. We may assume that δ = N^{-1}(T - t) for some integer N and take t_j = t + jN^{-1}(T - t) in Example 4.1.

Example 4.3. Let u_j : IR^n → U and let

    α(v.)_s = u_j(x_j)   for t_j ≤ s < t_{j+1},   j = 0, 1, ..., N - 1,    (22)

where x_j = x_{t_j} and x_s satisfies (2) on [t, T] with x_t = x. Then α is strictly progressive. On [t, t_1), u_s = u_0(x), and u_s is defined on later intervals [t_j, t_{j+1}) by u_s = u_j(x_j). (A discretized version of this construction is sketched in the code below.)

Example 4.4. In this example, α is progressive but not strictly progressive. Let α(v.)_s = φ(v_s) where φ is Borel measurable and not constant. There is a partition U = U_1 ∪ U_2 with U_1 open, such that V_1 = φ^{-1}(U_1) and V_2 = φ^{-1}(U_2) are both nonempty. Choose v_i ∈ V_i for i = 1, 2 and ψ such that ψ(u) = v_2 if u ∈ U_1, ψ(u) = v_1 if u ∈ U_2. The composite φ ◦ ψ has no fixed point. Let β(u.)_s = ψ(u_s). Then (α ◦ β)(u.)_s = (φ ◦ ψ)(u_s). If (21) holds, then \hat{u}_s is a fixed point of φ ◦ ψ, which is a contradiction.
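The following Python sketch mimics Example 4.3 in a discretized setting: on each mesh interval the minimizer freezes u at u_j(x_j), an opposing strategy β is then allowed to react to the announced control, and the state is advanced by an Euler scheme, so a pair (u., v.) solving (21) is built interval by interval. The dynamics, the feedback maps u_j and the map β are toy choices for illustration, not objects from the paper.

```python
def F(x, u, v):                        # toy dynamics (A2)
    return -x + u + v

def play(x0, t, T, N, u_maps, beta, substeps=10):
    """Resolve (21) on the mesh t_j = t + j(T-t)/N for a strategy of the form (22)."""
    h = (T - t) / N
    dt = h / substeps
    x = x0
    u_path, v_path = [], []
    for j in range(N):
        u = u_maps[j](x)               # u_j(x_j): depends only on the past of the state
        v = beta(j, u)                 # the maximizer may react to the announced u
        for _ in range(substeps):      # Euler steps of (2) on [t_j, t_{j+1})
            x += F(x, u, v) * dt
        u_path.append(u)
        v_path.append(v)
    return u_path, v_path, x

N = 4
u_maps = [lambda x: max(-1.0, min(1.0, -x))] * N   # clipped linear feedback maps
beta = lambda j, u: 1.0 if u <= 0.0 else -1.0      # an adversarial disturbance map
u_hat, v_hat, x_T = play(1.0, 0.0, 1.0, N, u_maps, beta)
print(u_hat, v_hat, x_T)
```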

Let

    W(t, x) = inf_α E^+_{tx} Z,    (23)

where the inf is taken among all strictly progressive strategies α.

Theorem 4.1. W(t, x) = V(t, x).

Let us only sketch a proof of Theorem 4.1, and refer to [7] for details. To show that W ≥ V, given θ > 0 there exists a strategy β for the maximizing controller such that

    P(t, x; u., β(u.)) ≥ V(t, x) − θ    (24)

for all controls u. for the minimizer. Given any strictly progressive α, let \hat{u}., \hat{v}. be a solution of (21). Then

    E^+_{tx} Z = sup_{v.} P(t, x; α(v.), v.)
               ≥ P(t, x; α(\hat{v}.), \hat{v}.)
               = P(t, x; \hat{u}., β(\hat{u}.)) ≥ V(t, x) − θ.


Since θ and strictly progressive α are arbitrary, W ≥ V. The inequality W ≤ V is proved using time discretizations and viscosity solution arguments. Given a positive integer N, consider the mesh t_j = jN^{-1}T, j = 0, 1, ..., N. Let V^N(t_j, x) denote the sup in (20) when u_s is required to be constant on each interval [t_i, t_{i+1}) with i ≥ j. By a construction of Nisio,

    V^N(t_j, x) = F_N V^N(t_{j+1}, ·)(x),    (25)

    F_N ψ(x) = inf_{u ∈ U} sup_{v.} [ ∫_0^h [l(x_s, u) − q(v_s)] ds + ψ(x_h) ],

where h = h_N = N^{-1}T and x_s satisfies (2) with u_s = u and x_0 = x. By a viscosity solution technique due to Souganidis [18,19][11], V^N(t_j, x) tends to V(t, x) uniformly on compact sets as N → ∞, t_j → t. Using (25), given θ > 0 there is a strictly progressive α of the form in Example 4.3 such that

    sup_{v.} P(0, x; α(v.), v.) < V^N(0, x) + θ.    (26)

See [7, Section 7]. Let N → ∞, θ → 0 to obtain W(0, x) ≤ V(0, x). Since W and V depend on the time difference T − t, this implies W ≤ V.

Remark 4.1. We call the difference V(t, x) − \underline{V}(t, x) an instantaneous information gap. It equals the difference in the infimum over α of E^+_{tx}(Z) depending on whether progressive or strictly progressive strategies α for the minimizing controller are used. If in (10)-(13) min max = max min, then V = \underline{V} and the instantaneous information gap is 0. The equality min max = max min in (10)-(13) is the Isaacs minimax condition.
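For concreteness, here is a rough one-dimensional sketch of the time discretization behind (25): the operator F_N is approximated with constant u and v on each interval and a single Euler step, and V^N is computed by marching backwards from the terminal data on a state grid. All model data, the grids and the interpolation rule are assumptions made for this toy example; it is not the construction used in [7].

```python
import numpy as np

def F(x, u, v): return -x + u + v      # toy dynamics (A2)
def l(x, u):    return x * x + 0.5 * u * u
def q(v):       return 0.5 * v * v
def g(x):       return abs(x)

X = np.linspace(-2.0, 2.0, 81)          # state grid
U_grid = np.linspace(-1.0, 1.0, 9)
V_grid = np.linspace(-1.0, 1.0, 9)

def F_N(psi, h):
    """One step of (25): (F_N psi)(x) ~ inf_u sup_v [(l(x,u) - q(v)) h + psi(x_h)]."""
    def interp(y):                      # evaluate psi between grid points
        return np.interp(y, X, psi)
    out = np.empty_like(psi)
    for i, x in enumerate(X):
        out[i] = min(
            max((l(x, u) - q(v)) * h + interp(x + F(x, u, v) * h) for v in V_grid)
            for u in U_grid
        )
    return out

T, N = 1.0, 8
h = T / N
VN = np.array([g(x) for x in X])        # terminal data V^N(T, .) = g
for _ in range(N):                      # backward recursion from t_N = T to t_0 = 0
    VN = F_N(VN, h)
print(VN[len(X) // 2])                  # approximate value at (t, x) = (0, 0)
```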

5 Risk sensitive control limits.

Risk sensitive control theory provides a link between stochastic and deterministic (min-max) approaches to disturbances in control systems. Let us merely indicate this connection, in the context of controlled Markov diffusions subject to small noise intensity Brownian motion disturbances. For ε > 0, let X^ε_s be a controlled Markov diffusion satisfying the stochastic differential equation

    dX^ε_s = f(X^ε_s, U_s) ds + ε^{1/2} dB_s,   t ≤ s ≤ T,    (27)

with initial data X^ε_t = x. In (27) the control process U_s is progressively measurable with respect to a family of σ-algebras to which the Brownian motion B_s is adapted, and U_s ∈ U. Let

    Z^ε = ∫_t^T l(X^ε_s, U_s) ds + g(X^ε_T).    (28)


We again assume that U is compact. Moreover, f satisfies the same assumptions on IR^n × U as in (A2) and l, g satisfy (A4). The risk sensitive control problem is to choose U. to minimize

    J^ε(t, x; U.) = E_{tx}[exp(ε^{-1} Z^ε)],    (29)

where exp denotes the exponential function. The value function for this risk sensitive control problem is

    Φ^ε(t, x) = inf_{U.} J^ε(t, x; U.).    (30)

For the corresponding max-plus stochastic control problem we take V = IR^n, q(v) = (1/2)|v|^2 and F(x, u, v) = f(x, u) + v. This has the form (4), where for simplicity we now take σ to be the identity matrix. Since IR^n is not compact, as required of V in assumption (A1), a "cut-off" argument in which IR^n is replaced by {v : |v| ≤ M} is needed to obtain Theorem 4.1 in this case. See [7, Sec. 7]. The upper differential game value V(t, x) (which equals W(t, x) by Theorem 4.1) is obtained from the risk sensitive control value function in the following deterministic limit:

    V(t, x) = lim_{ε→0} ε log Φ^ε(t, x).    (31)

See [8][12] and also related results on finite and infinite time horizons in [4][9][13][16].
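A rough numerical illustration of the limit (31), under assumptions chosen for convenience rather than taken from the paper: an uncontrolled one-dimensional case with f = 0, l = 0 and linear terminal cost g(x) = c x, for which the deterministic game value is c x + c^2 (T − t)/2. In this Gaussian example ε log E[exp(ε^{-1} Z^ε)] equals that value for every ε, so the Monte Carlo estimates below should match it up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
x, T, c = 0.0, 1.0, 0.5
deterministic_value = c * x + 0.5 * c * c * T   # sup over disturbances of the game payoff

for eps in (1.0, 0.5, 0.25, 0.1):
    # With f = 0 the solution of (27) is X_T = x + sqrt(eps) B_T, so sample X_T directly.
    XT = x + np.sqrt(eps * T) * rng.standard_normal(200_000)
    risk_sensitive = eps * np.log(np.mean(np.exp(c * XT / eps)))
    print(eps, risk_sensitive, deterministic_value)
```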

References

1. M. Akian, J.-P. Quadrat and M. Viot, Bellman processes, in Lecture Notes in Control and Info. Sci. No. 199, eds. G. Cohen and J.-P. Quadrat, Springer-Verlag, 1994.
2. F. Baccelli, G. Cohen, G.J. Olsder and J.-P. Quadrat, Synchronization and Linearity: an Algebra for Discrete Event Systems, John Wiley and Sons, 1992.
3. M. Bardi and I. Capuzzo-Dolcetta, Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations, Birkhäuser, 1997.
4. A. Bensoussan and H. Nagai, Min-max characterization of a small noise limit of risk-sensitive control, SIAM J. Control Optim. 35 (1997) 1093-1115.
5. R.J. Elliott and N.J. Kalton, The existence of value in differential games, Mem. Amer. Math. Soc. 126 (1972).
6. L.C. Evans and P.E. Souganidis, Differential games and representation formulas for solutions of Hamilton-Jacobi-Isaacs equations, Indiana Univ. Math. J. 33 (1984) 773-797.
7. W.H. Fleming, Max-plus stochastic processes and control, preprint.
8. W.H. Fleming and W.M. McEneaney, Risk sensitive control and differential games, Lecture Notes in Control and Inform. Sci. 184, Springer-Verlag, New York, 1992, 185-197.


9. W.H. Fleming and W.M. McEneaney, Risk sensitive control on an infinite time horizon, SIAM J. Control Optim. 33 (1995) 1881-1915.
10. W.H. Fleming and H.M. Soner, Controlled Markov Processes and Viscosity Solutions, Springer-Verlag, 1993.
11. W.H. Fleming and P.E. Souganidis, On the existence of value functions of two-player, zero-sum stochastic differential games, Indiana Univ. Math. J. 38 (1989) 293-314.
12. M.R. James, Asymptotic analysis of nonlinear risk-sensitive control and differential games, Math. Control Signals Systems 5 (1992) 401-417.
13. H. Kaise and H. Nagai, Ergodic type Bellman equations of risk-sensitive control with large parameters and their singular limits, Asymptotic Analysis 20 (1999) 279-299.
14. G.L. Litvinov and V.P. Maslov, Correspondence Principle for Idempotent Calculus and Some Computer Applications, in Idempotency, J. Gunawardena ed., Publ. Newton Inst. 11, Cambridge Univ. Press, 1998, pp. 420-443.
15. V.P. Maslov and S.M. Samborskii, eds., Idempotent Analysis, Advances in Soviet Math. No. 13, Amer. Math. Soc., 1992.
16. H. Nagai, Bellman equations of risk-sensitive control, SIAM J. Control Optim. 34 (1996) 74-101.
17. J.-P. Quadrat, Min-plus probability calculus, Actes 26ème École de Printemps d'Informatique Théorique, Noirmoutier, 1998.
18. P.E. Souganidis, Approximation schemes for viscosity solutions of Hamilton-Jacobi equations, J. Differential Eqns. 59 (1985) 1-43.
19. P.E. Souganidis, Approximation schemes for viscosity solutions of Hamilton-Jacobi equations with applications to differential games, J. Nonlinear Anal. T.M.A. 9 (1985) 217-257.